Python Web Crawling Explained: How to Use Beautiful Soup

2022-01-12 00:39 · 小狐貍夢想去童話鎮 · Python

In short, Beautiful Soup is a Python library whose main job is extracting data from web pages. It provides simple, Pythonic functions for navigating, searching, and modifying the parse tree.

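I. Introduction to Beautiful Soup

Beautiful Soup is a powerful parsing tool that uses a page's structure and the attributes of its nodes to parse web pages.

It provides functions for navigating, searching, and modifying the parse tree. You usually do not need to worry about the document's encoding: Beautiful Soup converts incoming documents to Unicode and outgoing documents to UTF-8. Beautiful Soup does not parse markup by itself; it relies on a parser, and lxml is a common choice.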
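As a rough illustration of the two points above, the sketch below hands Beautiful Soup raw bytes, lets it work out the encoding, and names the parser explicitly. The markup is made up for this example, and Python's built-in "html.parser" is swapped in for lxml only so that the snippet runs without an extra install.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, encoded to bytes the way a file or HTTP body would arrive.
raw = (
    '<html><head><meta charset="utf-8">'
    "<title>百度一下</title></head></html>"
).encode("utf-8")

# The second argument names the parser; "lxml" is used the same way if installed.
soup = BeautifulSoup(raw, "html.parser")

title_text = soup.title.string
print(title_text)              # the text comes back as a Unicode string
print(soup.original_encoding)  # the encoding Beautiful Soup settled on
```

Because the bytes are decoded during parsing, everything read from the tree afterwards is ordinary Python text.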

II. Using Beautiful Soup

The test file test03.html:

<!DOCTYPE html>
<html>
<head>
    <meta content="text/html;charset=utf-8" http-equiv="content-type" />
    <meta content="IE=Edge" http-equiv="X-UA-Compatible" />
    <meta content="always" name="referrer" />
    <link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css" />
    <title>百度一下,你就知道 </title>
</head>
<body link="#0000cc">
  <div id="wrapper">
    <div id="head">
        <div class="head_wrapper">
          <div id="u1">
            <a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>
            <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>
            <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地圖 </a>
            <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">視頻 </a>
            <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">貼吧 </a>
            <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多產品 </a>
          </div>
        </div>
    </div>
  </div>
</body>
</html>

1. Node selectors

As covered earlier, a web page is made up of element nodes, and extracting the content of a particular node gives you the data shown on the page. Node selectors simplify this step: they let you pick out data precisely without resorting to regular expressions.

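from bs4 import BeautifulSoup

# Read the test page as bytes; Beautiful Soup works out the encoding itself.
with open("./test03.html", "rb") as file:
    html = file.read()

soup = BeautifulSoup(html, "lxml")
print(soup.head)        # the whole <head> node
print(soup.head.title)  # nested selection: <title> inside <head>
print(soup.a)           # only the first <a> node in the document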

[Output]

<head>
<meta content="text/html;charset=utf-8" http-equiv="content-type"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<meta content="always" name="referrer"/>
<link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/>
<title>百度一下,你就知道 </title>
</head>
<title>百度一下,你就知道 </title>
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>

Analysis:

The first print statement retrieves the page's head node.

The second retrieves the title node inside the head node. It uses a nested selection because title is nested inside head.

The third retrieves an a node. The source contains several a nodes, but only the first one is matched. When several nodes share a name, this style of selection returns only the first match and ignores the rest.
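The first-match behaviour described above can be checked with a small sketch. The markup is hypothetical, and "html.parser" stands in for lxml so that the snippet has no extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two <a> nodes and no <table>.
html = '<div><a href="/first">one</a><a href="/second">two</a></div>'
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a   # attribute-style access stops at the first match
missing = soup.table  # a tag that does not exist yields None, not an error

print(first_link)
print(missing)
```

Since a missing tag gives None, chained access such as soup.table.tr would raise AttributeError, so it is worth checking for None before going deeper.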

2. Extracting information

The data you need usually lives in a node's name, its attribute values, or its text. The following code shows how to read all three:

from bs4 import BeautifulSoup

with open("./test03.html", "rb") as file:
    html = file.read()

soup = BeautifulSoup(html, "lxml")
print(soup.body.name)              # node name
print(soup.body.a.attrs['class'])  # class attribute (a list of class names)
print(soup.body.a.attrs['href'])   # href attribute
print(soup.body.a.string)          # text of the node

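[Output]

body
['mnav']
http://news.baidu.com
新聞

Analysis:

The first line prints the name of the body node;

The second prints the class attribute of the a node, which comes back as a list because class can hold several values;

The third prints the href attribute of the a node;

The fourth prints the text of the a node.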
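A short sketch of the same three lookups, plus what happens when an attribute is absent. The markup is a made-up variant of one link from the test file, and "html.parser" is used instead of lxml to keep the snippet dependency-free.

```python
from bs4 import BeautifulSoup

html = '<a class="mnav bri" href="http://news.baidu.com" name="tj_trnews">新聞 </a>'
link = BeautifulSoup(html, "html.parser").a

tag_name = link.name          # the node name, "a"
name_attr = link["name"]      # the HTML attribute called name
classes = link["class"]       # multi-valued attribute, returned as a list
href = link.get("href")       # same value as link["href"]
target = link.get("target")   # missing attribute: get() returns None
fallback = link.get("target", "_self")  # or a default of your choosing

print(tag_name, name_attr, classes, href, target, fallback)
```

Indexing with link["target"] would raise KeyError here, which is why get() is often the safer choice when an attribute may be absent.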

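3. Relational selection

(1) Children and descendants

Direct children are available through the contents and children attributes, and all descendants through the descendants attribute. contents returns a list, children returns an iterator, and descendants returns a generator; all three can be walked with a for loop.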

from bs4 import BeautifulSoup

with open("./test03.html", "rb") as file:
    html = file.read()

soup = BeautifulSoup(html, "lxml")
# soup.body.contents is a list of the direct children of <body>,
# including the whitespace text nodes between tags.
for i, content in enumerate(soup.body.contents):
    print(i, content)

[Output]

0

1 <div id="wrapper">
<div id="head">
<div class="head_wrapper">
<div id="u1">
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>
<a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>
<a class="mnav" href="http://map.baidu.com" name="tj_trmap">地圖 </a>
<a class="mnav" href="http://v.baidu.com" name="tj_trvideo">視頻 </a>
<a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">貼吧 </a>
<a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多產品 </a>
</div>
</div>
</div>
</div>
2
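To make the difference between children and descendants concrete, here is a minimal sketch on hypothetical markup written without whitespace between tags, so that no text nodes appear among the direct children. "html.parser" is used in place of lxml.

```python
from bs4 import BeautifulSoup, Tag

html = '<div id="u1"><a>one</a><a>two <b>bold</b></a></div>'
div = BeautifulSoup(html, "html.parser").div

direct = div.contents                          # a plain list of direct children
child_names = [c.name for c in div.children]   # same nodes, via an iterator
# descendants also walks into nested nodes; keep only the tags, not the text.
descendant_tags = [d.name for d in div.descendants if isinstance(d, Tag)]

print(child_names)
print(descendant_tags)
```

The b tag shows up only among the descendants, because it is a grandchild of the div rather than a direct child.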

(2) Parent and ancestors

The parent attribute returns a node's parent. For example, to get the parent of the title node in the test file:

from bs4 import BeautifulSoup

with open("./test03.html", "rb") as file:
    html = file.read()

soup = BeautifulSoup(html, "lxml")
print(soup.title.parent)  # the <head> node that contains <title>

[Output]

<head>
<meta content="text/html;charset=utf-8" http-equiv="content-type"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<meta content="always" name="referrer"/>
<link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/>
<title>百度一下,你就知道 </title>
</head>

Similarly, to get a node's ancestors, use the parents attribute, which returns a generator that walks upward from the parent to the root of the document.
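The parents attribute can be sketched as follows, on hypothetical markup and with "html.parser" standing in for lxml. The last item yielded is the BeautifulSoup object itself, whose name is "[document]".

```python
from bs4 import BeautifulSoup

html = "<html><body><div><a>link</a></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Walk upward from the <a> node, collecting the name of each ancestor.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)
```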

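(3) Siblings

next_sibling returns the node's next sibling;

previous_sibling returns the node's previous sibling;

next_siblings returns a generator over all of the siblings that follow the node;

previous_siblings returns a generator over all of the siblings that precede the node.

Note that a sibling may be a text node, such as the whitespace between two tags, rather than an element.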
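A minimal sketch of the four sibling attributes, using hypothetical markup and "html.parser" in place of lxml. The second half shows the whitespace case: when tags are separated by a newline, next_sibling is that newline, and find_next_sibling() is one way to skip to the next tag.

```python
from bs4 import BeautifulSoup

html = '<p><a id="n1">A</a><a id="n2">B</a><a id="n3">C</a></p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.find("a", id="n1")
last = soup.find("a", id="n3")

next_id = first.next_sibling["id"]                     # the single next sibling
following = [s["id"] for s in first.next_siblings]     # every later sibling
preceding = [s["id"] for s in last.previous_siblings]  # every earlier one, nearest first
no_sibling = last.next_sibling                         # nothing after the last node

# With whitespace between tags, the nearest sibling is a text node.
spaced = BeautifulSoup("<p><a>A</a>\n<a>B</a></p>", "html.parser")
whitespace = spaced.a.next_sibling           # the "\n" text node
next_tag = spaced.a.find_next_sibling("a")   # skips text nodes

print(next_id, following, preceding, no_sibling, repr(whitespace), next_tag)
```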

4. Method selectors

find_all()

Finds every element that matches the given conditions. Its signature is:

find_all(name, attrs, recursive, string, limit, **kwargs)

(Before Beautiful Soup 4.4.0 the string argument was named text.)

(1) name

Selects elements by node name. For example, to find the a elements in the test file:

from bs4 import BeautifulSoup

with open("./test03.html", "rb") as file:
    html = file.read()

soup = BeautifulSoup(html, "lxml")
print(soup.find_all(name="a"))   # the whole result list at once
for a in soup.find_all(name="a"):
    print(a)                     # one matching node per line

[Output]

[<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>, <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>, <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地圖 </a>, <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">視頻 </a>, <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">貼吧 </a>, <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多產品 </a>]
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>
<a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>
<a class="mnav" href="http://map.baidu.com" name="tj_trmap">地圖 </a>
<a class="mnav" href="http://v.baidu.com" name="tj_trvideo">視頻 </a>
<a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">貼吧 </a>
<a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多產品 </a>

(2) attrs

A query can also filter on tag attributes. The attrs parameter takes a dictionary.

from bs4 import BeautifulSoup

with open("./test03.html", "rb") as file:
    html = file.read()

soup = BeautifulSoup(html, "lxml")
# Keep only the <a> nodes whose class attribute contains "bri".
print(soup.find_all(name="a", attrs={"class": "bri"}))

[Output]

[<a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多產品 </a>]

As you can see, once class="bri" is added, the query returns only one a element.
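Besides the attrs dictionary, Beautiful Soup also accepts attribute filters as keyword arguments, with class_ standing in for class because class is a reserved word in Python. The sketch below compares the two forms on hypothetical markup, again with "html.parser" in place of lxml, and also shows the limit argument.

```python
from bs4 import BeautifulSoup

html = (
    '<div id="u1"><a class="mnav" href="/news">news</a>'
    '<a class="mnav" href="/map">map</a>'
    '<a class="bri" href="/more">more</a></div>'
)
soup = BeautifulSoup(html, "html.parser")

by_dict = soup.find_all("a", attrs={"class": "bri"})  # dictionary form
by_keyword = soup.find_all("a", class_="bri")         # keyword form, same filter
by_href = soup.find_all("a", href="/map")             # any attribute can be a keyword
first_two = soup.find_all("a", limit=2)               # stop after two matches

print([a["href"] for a in by_dict])
print([a["href"] for a in by_keyword])
print([a["href"] for a in by_href])
print([a["href"] for a in first_two])
```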

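(3) text

The text parameter matches on a node's text. It accepts either a string or a compiled regular expression. Since Beautiful Soup 4.4.0 this parameter is named string; text is the older name.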

import re
from bs4 import BeautifulSoup

with open("./test03.html", "rb") as file:
    html = file.read()

soup = BeautifulSoup(html, "lxml")
# string is the current name of the text argument (Beautiful Soup 4.4.0+).
print(soup.find_all(name="a", string=re.compile('新聞')))

[Output]

[<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>]

The result contains only the a element whose text matches "新聞".
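One detail worth seeing in code: a plain string must equal the node's text exactly, while a regular expression only has to be found somewhere in it. The links in the test file have a trailing space in their text, which is one reason a regular expression is convenient there. The sketch below uses hypothetical English markup, "html.parser" instead of lxml, and the string keyword.

```python
import re
from bs4 import BeautifulSoup

html = '<a href="/news">News </a><a href="/map">Map</a>'
soup = BeautifulSoup(html, "html.parser")

# Exact comparison: "News" is not equal to "News " (note the trailing space).
exact = soup.find_all("a", string="News")
# Regular expression: matches because "News" occurs inside "News ".
pattern = soup.find_all("a", string=re.compile("News"))

print(exact)
print(pattern)
```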

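find()

find() is used the same way as find_all(). The only difference is that find() matches just the first element found and returns that single element (or None if nothing matches), whereas find_all() matches every qualifying element and returns a list.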
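The difference in return types is easy to check in a small sketch (hypothetical markup, "html.parser" in place of lxml):

```python
from bs4 import BeautifulSoup

html = '<a href="/news">news</a><a href="/map">map</a>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")           # a single Tag: the first match
everything = soup.find_all("a")  # a list-like result with every match
nothing = soup.find("table")     # no match: None
empty = soup.find_all("table")   # no match: an empty list

# Guard against None before reading attributes from a find() result.
first_href = first["href"] if first is not None else None
print(first_href, len(everything), nothing, len(empty))
```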

5. CSS selectors

To use a CSS selector, call the select() method and pass in the selector.

For example, to get the a elements in the test file with a CSS selector:

from bs4 import BeautifulSoup

with open("./test03.html", "rb") as file:
    html = file.read()

soup = BeautifulSoup(html, "lxml")
print(soup.select('a'))   # select() always returns a list
for a in soup.select('a'):
    print(a)

[Output]

[<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>, <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>, <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地圖 </a>, <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">視頻 </a>, <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">貼吧 </a>, <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多產品 </a>]
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>
<a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>
<a class="mnav" href="http://map.baidu.com" name="tj_trmap">地圖 </a>
<a class="mnav" href="http://v.baidu.com" name="tj_trvideo">視頻 </a>
<a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">貼吧 </a>
<a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多產品 </a>
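select() accepts more than bare tag names. The sketch below tries a few common selector forms (class, id with a descendant, and an attribute prefix) and the related select_one() method, which returns only the first match. The markup is a trimmed, hypothetical version of the test file, and "html.parser" is used instead of lxml.

```python
from bs4 import BeautifulSoup

html = (
    '<div id="u1"><a class="mnav" href="http://news.baidu.com">news</a>'
    '<a class="mnav" href="https://www.hao123.com">hao123</a>'
    '<a class="bri" href="//www.baidu.com/more/">more</a></div>'
)
soup = BeautifulSoup(html, "html.parser")

by_class = soup.select("a.mnav")            # tag plus class
by_id_descendant = soup.select("#u1 a")     # every <a> under the node with id="u1"
by_attr = soup.select('a[href^="https"]')   # href beginning with "https"
first_only = soup.select_one("a.bri")       # first match, or None

print(len(by_class), len(by_id_descendant))
print([a["href"] for a in by_attr])
print(first_only.get_text())
```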

Getting attributes

To get the href attribute of the a elements above:

from bs4 import BeautifulSoup

with open("./test03.html", "rb") as file:
    html = file.read()

soup = BeautifulSoup(html, "lxml")
for a in soup.select('a'):
    print(a['href'])  # index a tag like a dictionary to read an attribute

[Output]

http://news.baidu.com
https://www.hao123.com
http://map.baidu.com
http://v.baidu.com
http://tieba.baidu.com
//www.baidu.com/more/
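The last href in the output is protocol-relative (it starts with //), so it cannot be requested as-is. One possible way to normalise such links, sketched here with the standard library's urljoin, is to resolve each href against the address of the page it came from. The base_url value is an assumption made for this example, since the article does not say where test03.html was fetched from; "html.parser" replaces lxml.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed page address (hypothetical; not given in the article).
base_url = "https://www.baidu.com/"

html = (
    '<a href="http://news.baidu.com">news</a>'
    '<a href="//www.baidu.com/more/">more</a>'
    "<a>no href here</a>"
)
soup = BeautifulSoup(html, "html.parser")

# a[href] skips tags without an href, so a["href"] cannot raise KeyError.
links = [urljoin(base_url, a["href"]) for a in soup.select("a[href]")]
print(links)
```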

Getting text

To get the text of the a elements above, use the get_text() method or the string attribute:

from bs4 import BeautifulSoup

with open("./test03.html", "rb") as file:
    html = file.read()

soup = BeautifulSoup(html, "lxml")
for a in soup.select('a'):
    print(a.get_text())  # all text inside the node
    print(a.string)      # the node's single text child

[Output]

新聞
新聞
hao123
hao123
地圖
地圖
視頻
視頻
貼吧
貼吧
更多產品
更多產品
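The two approaches give the same result above because every a element in the test file contains a single piece of text. They differ once a node has nested children, as this sketch on hypothetical markup suggests ("html.parser" in place of lxml):

```python
from bs4 import BeautifulSoup

html = '<a href="/more">More <b>products</b> </a>'
link = BeautifulSoup(html, "html.parser").a

single = link.b.string                 # one text child, so string works
mixed = link.string                    # several children, so string is None
full = link.get_text()                 # concatenates all nested text
tidy = link.get_text(" ", strip=True)  # strip each piece, join with a space

print(repr(single), repr(mixed), repr(full), repr(tidy))
```

For scraping, get_text() is usually the more forgiving choice, while string is handy when a node is expected to hold exactly one text child.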

Original article: https://blog.csdn.net/gets_s/article/details/120372061
