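Preface
Splash is a JavaScript rendering service. It is a lightweight web browser with an HTTP API, implemented in Python 3 using Twisted and QT5. The QT reactor is used to make the service fully asynchronous, which lets it exploit WebKit concurrency through the QT main loop.
Some Splash features:
- process multiple web pages in parallel
- get HTML source or take screenshots
- turn off images or use Adblock Plus rules to make rendering faster
- execute custom JavaScript in the page context
- control the page rendering process with Lua scripts
- develop Splash Lua scripts in Splash-Jupyter notebooks
- get detailed rendering information in HAR format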
1. Installing Scrapy-Splash
Installing Scrapy-Splash involves two components. The first is the Splash service itself, installed with Docker; the running container exposes an HTTP interface through which JavaScript pages are loaded and rendered. The second is the Scrapy-Splash Python library, which lets Scrapy use the Splash service. The installation takes three steps:
(1) Install Docker
    # Install the required packages:
    yum install -y yum-utils device-mapper-persistent-data lvm2
    # Set up the stable repository:
    yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
    # Install Docker CE:
    yum install docker-ce
    # Start Docker:
    systemctl start docker
    # Verify the installation:
    docker run hello-world
(2) Install the Splash service
Pull the scrapinghub/splash image with Docker, then start a container to create the Splash service:
    docker pull scrapinghub/splash
    docker run -d -p 8050:8050 scrapinghub/splash
    # Open port 8050 in a browser to verify that the service is running
(3) Install the Scrapy-Splash Python package
    pip3 install scrapy-splash
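Once the package is installed, Scrapy still has to be told where the Splash service lives. The following is a minimal settings.py sketch based on the scrapy-splash README. The SPLASH_URL value assumes the container from step (2) runs on the local machine, and the middleware priorities are the values suggested there, so adjust both to your project.

```python
# settings.py (sketch): enable scrapy-splash in a Scrapy project.
# Assumption: Splash listens on localhost:8050, as started in step (2).
SPLASH_URL = "http://localhost:8050"

# Downloader middlewares that route requests through Splash.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Spider middleware that avoids storing duplicate Splash arguments.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Splash-aware duplicate filter and HTTP cache storage.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"
```

These are plain assignments, so the fragment can be pasted into an existing settings.py; merge the dictionaries with any middlewares the project already defines.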
2. Splash Lua scripts
With the Splash service running, open port 8050 in a browser, for example http://localhost:8050, to see its web UI.
The page has an input box whose default value is http://google.com. Replace it with the page you want to render, for example https://www.baidu.com, and click the Render me button. The result contains a screenshot of the rendered page, HAR loading statistics, and the page's HTML source.
The HAR data shows that Splash carried out the whole rendering process, including loading CSS and JavaScript. The three parts of the result correspond to the three values returned by the script shown below the input box, namely html, png, and har:
    function main(splash, args)
      assert(splash:go(args.url))
      assert(splash:wait(0.5))
      return {
        html = splash:html(),
        png = splash:png(),
        har = splash:har(),
      }
    end
The script is written in Lua. It first loads the page with go(), waits for it to finish loading with wait(), and then returns the page source, a screenshot, and the HAR information.
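The same script can also be submitted outside the web UI. As a rough sketch, the helper below builds (but does not send) a POST request for Splash's execute endpoint, which is covered in section 6. The function name build_execute_request and the localhost address are assumptions made for illustration. Every extra key in the JSON body, such as url, becomes a field of args inside main().

```python
import json
import urllib.request

# The default script from the Splash web UI.
LUA_SOURCE = """
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(0.5))
  return {
    html = splash:html(),
    png = splash:png(),
    har = splash:har(),
  }
end
"""


def build_execute_request(splash_url, lua_source, **script_args):
    """Build a JSON POST request for the /execute endpoint (hypothetical helper).

    Extra keyword arguments are sent alongside lua_source and show up
    in the Lua script as fields of `args`.
    """
    payload = dict(script_args, lua_source=lua_source)
    return urllib.request.Request(
        splash_url.rstrip("/") + "/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumption: Splash runs locally on port 8050.
request = build_execute_request(
    "http://localhost:8050", LUA_SOURCE, url="https://www.baidu.com"
)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen(request) would send it to a running Splash instance; nothing is sent by the sketch itself.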
Now modify the script so that it visits www.baidu.com and returns the page title, obtained by evaluating JavaScript, then run it:
    function main(splash, args)
      assert(splash:go("https://www.baidu.com"))
      assert(splash:wait(0.5))
      local title = splash:evaljs("document.title")
      return {
        title = title
      }
    end

    -- Result:
    -- Splash Response: Object
    -- title: "百度一下,你就知道"
This shows that Splash drives page rendering through this entry script, so we can modify the script to analyze the page and return whatever results we need. The function must be named main(). It can return a table (shown in the result as an object, much like a dictionary) or a plain string:
    function main(splash)
      return {hello = "world"}
    end

    -- Result:
    -- Splash Response: Object
    -- hello: "world"

    function main(splash)
      return "world"
    end

    -- Result:
    -- Splash Response: "world"
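On the client side, the two return styles need different handling. The sketch below assumes that a Lua table arrives JSON-encoded with an application/json content type and that a string arrives as plain text; decode_splash_result is a hypothetical helper, and the byte strings are hand-written stand-ins for real responses.

```python
import json


def decode_splash_result(content_type, body):
    """Decode an /execute response body (hypothetical helper).

    Assumption: a Lua table arrives JSON-encoded with an
    application/json content type; anything else is treated as text.
    """
    text = body.decode("utf-8")
    if content_type.split(";")[0].strip().lower() == "application/json":
        return json.loads(text)
    return text


# The two results shown above, as they might arrive over HTTP.
print(decode_splash_result("application/json", b'{"hello": "world"}'))
print(decode_splash_result("text/plain; charset=utf-8", b"world"))
```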
3. Properties and methods of the splash object
In the examples above, the first parameter of main() is splash. This object is similar to the WebDriver object in Selenium: its properties and methods control the loading process. Some commonly used properties:
splash.args: the arguments passed in with the request, such as the URL. For a GET request it holds the query parameters; for a POST request it holds the submitted data. splash.args is also available as the optional second parameter of main(), args:
    function main(splash, args)
      local url = args.url
    end

    -- The second parameter args is the same as splash.args,
    -- so the following is equivalent to the code above:
    function main(splash)
      local url = splash.args.url
    end
splash.js_enabled: enables or disables execution of JavaScript embedded in the page. The default is true (JavaScript runs).
splash.resource_timeout: sets the default timeout for network requests, in seconds. 0 or nil means no timeout: splash.resource_timeout = nil
splash.images_enabled: enables or disables image loading. Images are loaded by default: splash.images_enabled = true
splash.plugins_enabled: enables or disables browser plugins. Plugins are disabled by default: splash.plugins_enabled = false
splash.scroll_position: gets or sets the current scroll position of the main window: splash.scroll_position = {x=50, y=600}
    function main(splash, args)
      assert(splash:go('https://www.toutiao.com'))
      splash.scroll_position = {y = 400}
      return {png = splash:png()}
    end

    -- Scrolls down 400 pixels before taking the screenshot
splash.html5_media_enabled: enables or disables HTML5 media, including HTML5 video and audio (for example, playback of <video> elements)
Methods of the splash object:
splash:go(): requests a URL. It can issue GET and POST requests and accepts request headers, form data, and so on. Usage:
ok, reason = splash:go{url, baseurl=nil, headers=nil, http_method="GET", body=nil, formdata=nil}
Parameters: url is the URL to request. baseurl is optional and gives the base URL used to resolve relative resource paths. headers is optional and holds the request headers. http_method is a string naming the HTTP method, "GET" by default. body is the request body string sent with a POST, for example a JSON document sent with Content-Type application/json. formdata, nil by default, holds the form fields for a POST and is sent with Content-Type application/x-www-form-urlencoded.
The method returns the pair ok, reason. If ok is nil, the page failed to load and reason contains the error information.
    function main(splash, args)
      local ok, reason = splash:go{"http://httpbin.org/post", http_method="POST", body="name=Germey"}
      if ok then
        return splash:html()
      end
    end
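The difference between body and formdata comes down to how the POST payload is encoded. The standalone sketch below uses only the Python standard library to show the two encodings for the same field. It illustrates the wire formats and does not call Splash.

```python
import json
from urllib.parse import urlencode

fields = {"name": "Germey"}

# formdata-style payload: application/x-www-form-urlencoded
form_payload = urlencode(fields)

# body-style payload for a JSON API: application/json
json_payload = json.dumps(fields)

print(form_payload)  # name=Germey
print(json_payload)  # {"name": "Germey"}
```

The body string "name=Germey" in the Lua example is already in the urlencoded form, so it matches what a form submission with a single name field would carry.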
splash:wait(): controls how long to wait for the page
ok, reason = splash:wait{time, cancel_on_redirect=false, cancel_on_error=true}
time is the number of seconds to wait. cancel_on_redirect, false by default, stops the wait when a redirect happens and returns the redirect result. cancel_on_error, true by default, stops the wait when a loading error occurs.
The return value is again the pair ok, reason.
    function main(splash, args)
      splash:go("https://www.toutiao.com")
      local ok, reason = splash:wait(1)
      return {
        ok = ok,
        reason = reason
      }
    end

    -- ok = true means the wait completed successfully
splash:jsfunc()
lua_func = splash:jsfunc(func)
This method converts a JavaScript function into a Lua callable, so that JavaScript functions can be called directly from the Lua script. The JavaScript source must be wrapped in double square brackets ([[ ]]). Global JavaScript functions can be wrapped directly.
    function main(splash, args)
      local get_div_count = splash:jsfunc([[
        function () {
          var body = document.body;
          var divs = body.getElementsByTagName('div');
          return divs.length;
        }
      ]])
      splash:go("https://www.baidu.com")
      return ("There are %s DIVs"):format(get_div_count())
    end

    -- Splash Response: "There are 21 DIVs"
splash:evaljs(): executes a JavaScript snippet in the page context and returns the result of the last statement
    local title = splash:evaljs("document.title")
    -- Returns the page title
splash:runjs(): runs JavaScript code in the page context. It is similar to evaljs(), but better suited to performing actions or declaring functions
    function main(splash, args)
      splash:go("https://www.baidu.com")
      splash:runjs("foo = function() { return 'bar' }")
      local result = splash:evaljs("foo()")
      return result
    end
splash:autoload(): registers JavaScript to be loaded automatically on every page load
ok, reason = splash:autoload{source_or_url, source=nil, url=nil}
Parameters:
- source_or_url - a string containing JavaScript source code, or a URL to load the JavaScript code from;
- source - a string containing JavaScript source code;
- url - a URL to load the JavaScript source code from
This method only loads JavaScript code or libraries; it does not perform any action. To perform an action, call evaljs() or runjs().
    function main(splash, args)
      splash:autoload([[
        function get_document_title(){
          return document.title;
        }
      ]])
      splash:go("https://www.baidu.com")
      return splash:evaljs("get_document_title()")
    end

    -- Loading a JavaScript library file
    function main(splash, args)
      assert(splash:autoload("https://code.jquery.com/jquery-2.1.3.min.js"))
      assert(splash:go("https://www.taobao.com"))
      local version = splash:evaljs("$.fn.jquery")
      return 'JQuery version: ' .. version
    end
splash:call_later(): schedules a callback to run after a delay
timer = splash:call_later(callback, delay): callback is the function to run and delay is the delay in seconds
    function main(splash, args)
      local snapshots = {}
      local timer = splash:call_later(function()
        snapshots["a"] = splash:png()
        splash.scroll_position = {y = 500}
        splash:wait(1.0)
        snapshots["b"] = splash:png()
      end, 2)
      splash:go("https://www.toutiao.com")
      splash:wait(3.0)
      return snapshots
    end

    -- The callback fires after a 2-second delay and takes the first screenshot,
    -- then scrolls down, waits 1 second, and takes the second screenshot
splash:http_get(): sends an HTTP GET request and returns the response
response = splash:http_get{url, headers=nil, follow_redirects=true}: url is the URL to load, headers are HTTP headers to add, and follow_redirects controls whether redirects are followed automatically (true by default)
    local reply = splash:http_get("http://example.com")
    -- Returns a response object; the result is not loaded into the browser window
splash:http_post(): sends an HTTP POST request
response = splash:http_post{url, headers=nil, follow_redirects=true, body=nil}
body specifies the data to send, such as form data
    function main(splash, args)
      local treat = require("treat")
      local json = require("json")
      local response = splash:http_post{"http://httpbin.org/post",
        body = json.encode({name = "Germey"}),
        headers = {["content-type"] = "application/json"}
      }
      return {
        html = treat.as_string(response.body),
        url = response.url,
        status = response.status
      }
    end

    -- Result:
    -- html: {"args":{},"data":"{\"name\": \"Germey\"}","files":{},"form":{},"headers":{"Accept-Encoding":"gzip, deflate","Accept-Language":"en,*","Connection":"close","Content-Length":"18","Content-Type":"application/json","Host":"httpbin.org","User-Agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/602.1 (KHTML, like Gecko) splash Version/9.0 Safari/602.1"},"json":{"name":"Germey"},"origin":"221.218.181.223","url":"http://httpbin.org/post"}
    -- status: 200
    -- url: http://httpbin.org/post
splash:set_content(): sets the content of the current page
ok, reason = splash:set_content{data, mime_type="text/html; charset=utf-8", baseurl=""}
    function main(splash)
      assert(splash:set_content("<html><body><h1>hello</h1></body></html>"))
      return splash:png()
    end
splash:html(): returns the page's HTML source as a string
    function main(splash, args)
      splash:go("https://httpbin.org/get")
      return splash:html()
    end
splash:png(): returns a screenshot of the page in PNG format
splash:jpeg(): returns a screenshot of the page in JPEG format
splash:har(): returns a description of the page loading process in HAR format
splash:url(): returns the URL currently being visited
splash:get_cookies(): returns the cookies of the current page
splash:add_cookie(): adds a cookie to the current page
    function main(splash)
      splash:add_cookie{"sessionid", "237465ghgfsd", "/", domain="http://example.com"}
      splash:go("http://example.com/")
      return splash:get_cookies()
    end

    -- Splash Response: Array[1]
    -- 0: Object
    -- domain: "http://example.com"
    -- httpOnly: false
    -- name: "sessionid"
    -- path: "/"
    -- secure: false
    -- value: "237465ghgfsd"
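The list returned by get_cookies() is plain data, so it can be reused outside Splash. As an illustrative sketch, the hypothetical helper cookies_to_header below turns such a list, once decoded from JSON, into a Cookie header value for a regular HTTP client. It ignores domain, path, and expiry matching, which a real client should respect.

```python
def cookies_to_header(cookies):
    """Join Splash cookie records into a Cookie header value (hypothetical helper)."""
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)


# The record returned by splash:get_cookies() in the example above.
cookies = [
    {
        "domain": "http://example.com",
        "httpOnly": False,
        "name": "sessionid",
        "path": "/",
        "secure": False,
        "value": "237465ghgfsd",
    }
]
print(cookies_to_header(cookies))  # sessionid=237465ghgfsd
```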
splash:clear_cookies(): clears all cookies
splash:delete_cookies{name=nil, url=nil}: deletes the specified cookies
splash:get_viewport_size(): returns the size (width and height) of the current browser viewport
splash:set_viewport_size(width, height): sets the size (width and height) of the browser viewport
splash:set_viewport_full(): resizes the browser viewport to fit the whole page
splash:set_user_agent(): overrides the User-Agent request header
splash:set_custom_headers(headers): sets custom request headers
    function main(splash)
      splash:set_custom_headers({
        ["User-Agent"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36",
        ["Site"] = "httpbin.org",
      })
      splash:go("http://httpbin.org/get")
      return splash:html()
    end
splash:on_request(callback): registers a function to be called before each HTTP request
splash:get_version(): returns Splash version information
splash:mouse_press(): triggers a mouse press event
splash:mouse_release(): triggers a mouse release event
splash:send_keys(): sends keyboard events to the page context, for example the Enter key: splash:send_keys("key_Enter")
splash:send_text(): sends text to the page context
splash:select(): selects the first node that matches a CSS selector; if several nodes match, only the first one is returned
    function main(splash)
      splash:go("https://www.baidu.com/")
      local input = splash:select("#kw")
      input:send_text('Splash')
      splash:wait(3)
      return splash:png()
    end
splash:select_all(): selects all nodes that match a CSS selector
    function main(splash)
      local treat = require('treat')
      assert(splash:go("https://www.zhihu.com"))
      assert(splash:wait(1))
      local texts = splash:select_all('.ContentLayout-mainColumn .ContentItem-title')
      local results = {}
      for index, text in ipairs(texts) do
        results[index] = text.node.textContent
      end
      return treat.as_array(results)
    end

    -- Returns the text content of every matching node
splash:mouse_click(): triggers a mouse click event
    function main(splash)
      splash:go("https://www.baidu.com/")
      local input = splash:select("#kw")
      input:send_text('Splash')
      local submit = splash:select('#su')
      submit:mouse_click()
      splash:wait(3)
      return splash:png()
    end
For the other properties and methods available in Splash scripts, see the official documentation: http://splash.readthedocs.io/en/latest/scripting-ref.html#splash-args
4. Response object
A response object is returned by splash methods such as splash:http_get() and splash:http_post(), and is passed to the callbacks registered with splash:on_response and splash:on_response_headers. It carries the following response information:
response.url: the URL of the response
response.status: the HTTP status code of the response
response.ok: true for a successful response, false otherwise
response.headers: the HTTP headers of the response
response.info: a table with the response data in HAR response format
response.body: the raw response body as a binary object; use treat.as_string to convert it to a string
response.request: the request object that produced this response
response.abort: aborts the response
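5. Element object
An element object wraps a JavaScript DOM node. One is created whenever a method returns any kind of DOM node (Node, Element, HTMLElement, and so on); splash:select and splash:select_all return element objects.
element:mouse_click() triggers a mouse click event on the element
element:mouse_hover() triggers a mouse hover event on the element
element:styles() returns the computed styles of the element
element:bounds() returns the bounding client rectangle of the element
element:png() returns a screenshot of the element in PNG format
element:jpeg() returns a screenshot of the element in JPEG format
element:visible() checks whether the element is visible
element:focused() checks whether the element has focus
element:text() returns the text content of the element
element:info() returns detailed information about the element
element:field_value() returns the value of a field element such as input, select, textarea, or button
element:form_values(values='auto'/'list'/'first') if the element is a form, returns a table with the form's values; the values argument selects one of three return formats
element:fill(values) fills in the form with the provided values
element:send_keys(keys) sends keyboard events to the element, for example Enter with send_keys('key_Enter'); for other keys see http://doc.qt.io/qt-5/qt.html#
element:send_text() sends a string to the element
element:submit() submits the form element
element:exists() checks whether the element exists in the DOM
Element properties:
element.node exposes all DOM methods and properties of the element, but not the methods and properties defined by Splash
element.inner_id the ID of the element
Supported DOM properties inherited from other interfaces (some are read-only):
Properties inherited from HTMLElement: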
- accessKey
- accessKeyLabel (read-only)
- contentEditable
- isContentEditable (read-only)
- dataset (read-only)
- dir
- draggable
- hidden
- lang
- offsetHeight (read-only)
- offsetLeft (read-only)
- offsetParent (read-only)
- offsetTop (read-only)
- spellcheck
- style - a table with styles which can be modified
- tabIndex
- title
- translate
Properties inherited from Element:
- attributes (read-only) - a table with attributes of the element
- classList (read-only) - a table with class names of the element
- className
- clientHeight (read-only)
- clientLeft (read-only)
- clientTop (read-only)
- clientWidth (read-only)
- id
- innerHTML
- localeName (read-only)
- namespaceURI (read-only)
- nextElementSibling (read-only)
- outerHTML
- prefix (read-only)
- previousElementSibling (read-only)
- scrollHeight (read-only)
- scrollLeft
- scrollTop
- scrollWidth (read-only)
- tabStop
- tagName (read-only)
Properties inherited from Node:
- baseURI (read-only)
- childNodes (read-only)
- firstChild (read-only)
- lastChild (read-only)
- nextSibling (read-only)
- nodeName (read-only)
- nodeType (read-only)
- nodeValue
- ownerDocument (read-only)
- parentNode (read-only)
- parentElement (read-only)
- previousSibling (read-only)
- rootNode (read-only)
- textContent
6. Calling the Splash HTTP API
Splash is controlled through an HTTP API that accepts GET requests or POSTed data. It provides the endpoints below; pass the appropriate parameters with the request to get different kinds of output.
(1) render.html returns the HTML of the JavaScript-rendered page
Parameters:
url: the URL to render, a string
baseurl: the base URL used to render the page
timeout: the rendering timeout, 30 seconds by default
resource_timeout: the timeout for a single network request
wait: the time to wait for updates after the page has loaded, 0 by default
proxy: a proxy profile name or a proxy URL in the format [protocol://][user:password@]proxyhost[:port]
js: the JavaScript profile
js_source: JavaScript code to execute in the page
filters: a comma-separated list of request filter names
allowed_domains: the list of allowed domain names
images: 1 to download images, 0 to skip them; the default is 1
headers: HTTP headers to set, as a JSON array
body: the data to send with a POST request
http_method: the HTTP method, GET by default
html5_media: whether to enable HTML5 media; 1 enables it, 0 disables it, and the default is 0
    import requests

    url = 'http://172.16.32.136:8050/'
    response = requests.get(url + 'render.html?url=https://www.baidu.com&wait=3&images=0')
    print(response.text)  # the page's HTML source
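When the target URL carries its own query string, concatenating it into the Splash URL by hand becomes ambiguous, because an & inside the target would be read as a separator between Splash parameters. A minimal sketch of a safer approach, using a hypothetical helper build_render_url and assuming Splash on localhost:8050, is to percent-encode every parameter with urllib.parse.urlencode:

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_render_url(splash_url, endpoint, **params):
    """Return a Splash endpoint URL with percent-encoded parameters (hypothetical helper)."""
    return f"{splash_url.rstrip('/')}/{endpoint}?{urlencode(params)}"


# Assumption: Splash runs locally; the target URL has its own query string.
target = "https://search.jd.com/Search?keyword=python&page=1"
render_url = build_render_url(
    "http://localhost:8050", "render.html", url=target, wait=3, images=0
)
print(render_url)

# Decoding the query again shows the target URL survived intact.
decoded = parse_qs(urlsplit(render_url).query)
print(decoded["url"][0] == target)  # True
```

With the requests library the same effect comes from passing the parameters as requests.get(endpoint_url, params={...}), which encodes them in the same way.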
(2) render.png returns a screenshot of the page in PNG format
    import requests

    url = 'http://172.16.32.136:8050/'
    # Specify the image width and height
    response = requests.get(url + 'render.png?url=https://www.taobao.com&wait=5&width=1000&height=700&render_all=1')
    with open('taobao.png', 'wb') as f:
        f.write(response.content)
(3) render.jpeg returns a screenshot of the page in JPEG format
    import requests

    url = 'http://172.16.32.136:8050/'
    response = requests.get(url + 'render.jpeg?url=https://www.taobao.com&wait=5&width=1000&height=700&render_all=1')
    with open('taobao.jpeg', 'wb') as f:
        f.write(response.content)
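If rendering fails, the response body may hold an error message rather than image data, and writing it to disk would silently produce a broken file. As a small sketch (the helper name and the sample bytes are made up for illustration), the file signature can be checked before saving:

```python
# Leading bytes that identify PNG and JPEG files.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
JPEG_SIGNATURE = b"\xff\xd8\xff"


def detect_image_type(data):
    """Return "png", "jpeg", or None based on the leading bytes (hypothetical helper)."""
    if data.startswith(PNG_SIGNATURE):
        return "png"
    if data.startswith(JPEG_SIGNATURE):
        return "jpeg"
    return None


# Made-up byte strings standing in for response.content.
print(detect_image_type(PNG_SIGNATURE + b"rest of file"))  # png
print(detect_image_type(b'{"error": 400}'))                # None
```

In the snippets above, the check would sit just before open(): write response.content only when detect_image_type returns the expected type.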
(4) render.har returns the HAR data of the page load
    import requests

    url = 'http://172.16.32.136:8050/'
    response = requests.get(url + 'render.har?url=https://www.jd.com&wait=5')
    print(response.text)
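Raw HAR output is long, so it usually helps to summarize it. The sketch below counts requests per HTTP status code, relying on the standard HAR layout (log.entries, each with a response.status). The helper summarize_har and the three-entry sample are hand-written for illustration and are not real Splash output.

```python
import json
from collections import Counter


def summarize_har(har):
    """Count requests per HTTP status code in a HAR document (hypothetical helper)."""
    entries = har["log"]["entries"]
    return Counter(entry["response"]["status"] for entry in entries)


# A tiny hand-written HAR fragment standing in for response.text.
sample = json.loads("""
{"log": {"entries": [
  {"request": {"url": "https://www.jd.com/"}, "response": {"status": 200}},
  {"request": {"url": "https://www.jd.com/a.css"}, "response": {"status": 200}},
  {"request": {"url": "https://www.jd.com/missing.js"}, "response": {"status": 404}}
]}}
""")
print(dict(summarize_har(sample)))  # {200: 2, 404: 1}
```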
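(5) render.json combines the functionality of the endpoints above and returns the result as JSON
Parameters:
html: whether to include the HTML in the output; 1 includes it, 0 does not, and the default is 0
png: whether to include a PNG screenshot; 1 includes it, 0 does not, and the default is 0
jpeg: whether to include a JPEG screenshot; 1 includes it, 0 does not, and the default is 0
iframes: whether to include information about child frames in the output, 0 by default
script: whether to include the result of the executed JavaScript in the output
console: whether to include JavaScript console messages in the output
history: whether to include the history of requests and responses for the page's main frame
har: whether to include HAR data in the output
    import requests

    url = 'http://172.16.32.136:8050/'
    response = requests.get(url + 'render.json?url=https://httpbin.org&html=1&png=1&history=1&har=1')
    print(response.text)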
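Because JSON cannot carry raw bytes, the screenshot in a render.json result has to be decoded before it can be saved. The sketch below assumes the png field is base64-encoded, as the Splash documentation describes. The helper extract_png and the sample payload are hypothetical; the sample holds only a PNG signature, not a full image.

```python
import base64
import json


def extract_png(render_json_text):
    """Return the decoded PNG bytes from a render.json response, or None (hypothetical helper)."""
    result = json.loads(render_json_text)
    encoded = result.get("png")
    return base64.b64decode(encoded) if encoded else None


# Hand-made stand-in for response.text; real output holds a full image.
fake_png = b"\x89PNG\r\n\x1a\n"
sample_text = json.dumps({"png": base64.b64encode(fake_png).decode("ascii")})
print(extract_png(sample_text) == fake_png)  # True
```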
(6) execute runs a Lua script, which makes it possible to interact with the page
Parameters:
lua_source: the source code of the Lua script
timeout: the timeout
allowed_domains: the list of allowed domain names
proxy: the proxy to use
filters: the request filters to apply
    import requests
    from urllib.parse import quote

    lua = '''
    function main(splash)
        return 'hello'
    end
    '''
    url = 'http://172.16.32.136:8050/execute?lua_source=' + quote(lua)
    response = requests.get(url)
    print(response.text)
Getting a page's body, URL, and status code through a Lua script:
    import requests
    from urllib.parse import quote

    lua = '''
    function main(splash, args)
        local treat = require("treat")
        local response = splash:http_get("http://httpbin.org/get")
        return {
            html = treat.as_string(response.body),
            url = response.url,
            status = response.status
        }
    end
    '''
    url = 'http://172.16.32.136:8050/execute?lua_source=' + quote(lua)
    response = requests.get(url)
    print(response.text)

    # Output:
    # {"status": 200, "html": "{\"args\":{},\"headers\":{\"Accept-Encoding\":\"gzip, deflate\",\"Accept-Language\":\"en,*\",\"Connection\":\"close\",\"Host\":\"httpbin.org\",\"User-Agent\":\"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/602.1 (KHTML, like Gecko) splash Version/9.0 Safari/602.1\"},\"origin\":\"221.218.181.223\",\"url\":\"http://httpbin.org/get\"}\n", "url": "http://httpbin.org/get"}
7. Example
Scraping Python book data from JD:
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # @Time    : 2018/7/9 13:33
    # @Author  : Py.qi
    # @File    : JD.py
    # @Software: PyCharm

    import re
    import requests
    import pymongo
    from pyquery import PyQuery as pq

    client = pymongo.MongoClient('localhost', port=27017)
    db = client['JD']

    def page_parse(html):
        doc = pq(html, parser='html')
        items = doc('#J_goodsList .gl-item').items()
        for item in items:
            # Lazily loaded images keep their URL in data-lazy-img
            if item('.p-img img').attr('src'):
                image = item('.p-img img').attr('src')
            else:
                image = item('.p-img img').attr('data-lazy-img')
            texts = {
                'image': 'https:' + image,
                'price': item('.p-price').text()[:6],
                'title': re.sub('\n', '', item('.p-name').text()),
                'commit': item('.p-commit').text()[:-3],
            }
            yield texts

    def save_to_mongo(data):
        # insert_one replaces the deprecated Collection.insert
        if db['jd_collection'].insert_one(data).acknowledged:
            print('Saved to MongoDB', data)
        else:
            print('MongoDB storage error', data)

    def main(number):
        # Pass the target URL through params so that its own query string
        # (including page) is percent-encoded and reaches JD intact
        target = 'https://search.jd.com/Search?keyword=python&page={}'.format(number)
        response = requests.get('http://192.168.146.140:8050/render.html',
                                params={'url': target, 'wait': 1, 'images': 0})
        data = page_parse(response.text)
        for i in data:
            save_to_mongo(i)
            # print(i)

    if __name__ == '__main__':
        for number in range(1, 200, 2):
            print('Fetching page {}'.format(number))
            main(number)
For more information, see the official documentation: http://splash.readthedocs.io/en/stable/
Original article: https://www.cnblogs.com/zhangxinqi/p/9279014.html