Sina Weibo normally requires a login to scrape. Using the mobile site m.weibo.cn simplifies things: pages there can be fetched directly, and the weibo id can be read straight from the URL/response.
Inspecting how Weibo serves comments shows that they are loaded dynamically, so the script parses the returned JSON with the json module.
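As an illustration, the JSON returned by the mobile comments endpoint contains a `data` array whose items carry a `text` field (these field names are the ones the script below relies on; treat the exact schema as an assumption, since the API may change). A minimal parsing sketch:

```python
import json

# A simplified stand-in for one page of the comments API response
# (real responses carry many more fields per item).
sample = '{"data": [{"text": "不錯"}, {"text": "回復@abc:好文"}]}'

payload = json.loads(sample)
# Pull out just the comment text from each item in the "data" array.
texts = [item['text'] for item in payload['data']]
print(texts)
```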
A separate string-cleanup function deals with the noisy reply/forward markers that clutter Weibo comments.
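For example, this kind of noise filtering can be sketched with re.sub; the patterns mirror the ones used in the script below, and the sample comment string is made up for demonstration:

```python
import re

def clean(comment):
    # Drop "回復@user:" / "回覆@user:" reply prefixes and "//@user: ..." forward tails.
    for pattern in (r'回復@.*?:', r'回覆@.*?:', r'//@.*'):
        comment = re.sub(pattern, '', comment)
    return comment

print(clean('回復@小明:寫得好//@小紅:轉發微博'))
```

which prints only the actual comment text, 寫得好.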
Since a working crawler is the end goal here, the script is organized into functions, which makes later optimization and feature additions easier.
```python
# -*- coding:gbk -*-
import re
import requests
import json
from lxml import html

# test weibo id: 4054483400791767
comments = []


def get_page(weibo_id):
    """Return the number of comment pages for the given weibo post."""
    url = 'https://m.weibo.cn/status/{}'.format(weibo_id)
    page_html = requests.get(url).text
    regcount = r'"comments_count": (.*?),'
    comments_count = re.findall(regcount, page_html)[-1]
    comments_count_number = int(comments_count)
    page = int(comments_count_number / 10)  # roughly 10 comments per page
    return page - 1


def opt_comment(comment):
    """Strip HTML tags and reply/forward noise from a raw comment string."""
    tree = html.fromstring(comment)
    strcom = tree.xpath('string(.)')
    reg1 = r'回復@.*?:'
    reg2 = r'回覆@.*?:'
    reg3 = r'//@.*'
    newstr = ''
    comment1 = re.subn(reg1, newstr, strcom)[0]
    comment2 = re.subn(reg2, newstr, comment1)[0]
    comment3 = re.subn(reg3, newstr, comment2)[0]
    return comment3


def get_responses(weibo_id, page):
    """Fetch one page of comments from the mobile API."""
    url = "https://m.weibo.cn/api/comments/show?id={}&page={}".format(weibo_id, page)
    response = requests.get(url)
    return response


def get_weibo_comments(response):
    """Parse a comments API response and append cleaned comments to the list."""
    json_response = json.loads(response.text)
    for i in range(0, len(json_response['data'])):
        comment = opt_comment(json_response['data'][i]['text'])
        comments.append(comment)


weibo_id = input("Enter a weibo id; its comments will be fetched automatically: ")
weibo_id = int(weibo_id)
print('\n')
page = get_page(weibo_id)
for page in range(1, page + 1):
    response = get_responses(weibo_id, page)
    get_weibo_comments(response)

for com in comments:
    print(com)
print(len(comments))
```
That wraps up this walkthrough of scraping Sina Weibo comments with Python. I hope it is helpful; if you have any questions, leave me a message and I will reply as soon as I can. Many thanks as well for your support of this site!
Original link: https://blog.csdn.net/Joliph/article/details/77334354