This article walks through crawling Sina Weibo content with a Python crawler. It is shared for your reference; the details are as follows.
We write a crawler in Python to fetch the posts of a Weibo "big V" (a high-profile verified account). This article uses the author's favorite celebrity's Weibo as the example (crawling the Sina mobile site: https://m.weibo.cn/u/1259110474).
When crawling a website, the m (mobile) site is usually the first choice, followed by the wap site, with the PC site as a last resort. This is not absolute, of course: sometimes the PC site carries the most complete information, and if you happen to need all of it, the PC site becomes your first choice. Mobile sites generally put an m in front of the domain, so the address we work with in this article is m.weibo.cn.
Preparation
1. Proxy IP
There are many free proxy IPs online, for example the Xici free proxy list (http://www.xicidaili.com/); find a working one yourself and test it.
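Before plugging a proxy into the crawler, it is worth checking that it actually responds. Below is a minimal sketch of such a check, assuming the proxy address shown is only a placeholder that you replace with one you found yourself; it simply tries to open m.weibo.cn through the proxy and reports whether the request succeeds.

import urllib.request

proxy_addr = "122.241.72.191:808"  # placeholder; substitute a live proxy you found

def check_proxy(proxy_addr, test_url="https://m.weibo.cn"):
    # Route both http and https traffic through the candidate proxy
    proxy = urllib.request.ProxyHandler({'http': proxy_addr, 'https': proxy_addr})
    opener = urllib.request.build_opener(proxy)
    try:
        resp = opener.open(test_url, timeout=10)
        return resp.getcode() == 200
    except Exception as e:
        print("Proxy failed:", e)
        return False

print(check_proxy(proxy_addr))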
2. Packet capture analysis
The address of the Weibo content API is obtained by capturing the requests the page makes; this is not covered in detail here, and readers who are unsure can look up related material themselves. The complete code follows below.
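As a rough illustration of what the packet capture reveals: the m.weibo.cn page loads its data from a JSON endpoint of the form https://m.weibo.cn/api/container/getIndex?type=uid&value=<uid>. The sketch below (a direct request without a proxy, assuming the endpoint still responds this way) fetches that JSON and prints its top-level keys and tab list, which is enough to locate userInfo and the containerid used later.

import json
import urllib.request

uid = '1259110474'
api = 'https://m.weibo.cn/api/container/getIndex?type=uid&value=' + uid

# Use a browser-like User-Agent, since the mobile API tends to reject the default urllib one
req = urllib.request.Request(api, headers={'User-Agent': 'Mozilla/5.0'})
payload = json.loads(urllib.request.urlopen(req).read().decode('utf-8'))

data = payload.get('data', {})
print(list(data.keys()))  # expected to include userInfo, tabsInfo, ...
for tab in data.get('tabsInfo', {}).get('tabs', []):
    print(tab.get('tab_type'), tab.get('containerid'))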
Complete code:
# -*- coding: utf-8 -*-
import urllib.request
import json

# Weibo ID of the big-V account to crawl
id = '1259110474'

# Proxy IP
proxy_addr = "122.241.72.191:808"

# Open a page through the proxy and return the response body
def use_proxy(url, proxy_addr):
    req = urllib.request.Request(url)
    req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0")
    proxy = urllib.request.ProxyHandler({'http': proxy_addr})
    opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
    urllib.request.install_opener(opener)
    data = urllib.request.urlopen(req).read().decode('utf-8', 'ignore')
    return data

# Get the containerid of the account's Weibo tab; it is required when fetching posts
def get_containerid(url):
    data = use_proxy(url, proxy_addr)
    content = json.loads(data).get('data')
    for data in content.get('tabsInfo').get('tabs'):
        if data.get('tab_type') == 'weibo':
            containerid = data.get('containerid')
    return containerid

# Get the account's basic profile info: nickname, profile URL, avatar, follow count, follower count, gender, level, etc.
def get_userInfo(id):
    url = 'https://m.weibo.cn/api/container/getIndex?type=uid&value=' + id
    data = use_proxy(url, proxy_addr)
    content = json.loads(data).get('data')
    profile_image_url = content.get('userInfo').get('profile_image_url')
    description = content.get('userInfo').get('description')
    profile_url = content.get('userInfo').get('profile_url')
    verified = content.get('userInfo').get('verified')
    guanzhu = content.get('userInfo').get('follow_count')   # number of accounts followed
    name = content.get('userInfo').get('screen_name')
    fensi = content.get('userInfo').get('followers_count')  # number of followers
    gender = content.get('userInfo').get('gender')
    urank = content.get('userInfo').get('urank')
    print("Nickname: " + name + "\n"
          + "Profile URL: " + profile_url + "\n"
          + "Avatar URL: " + profile_image_url + "\n"
          + "Verified: " + str(verified) + "\n"
          + "Description: " + description + "\n"
          + "Following: " + str(guanzhu) + "\n"
          + "Followers: " + str(fensi) + "\n"
          + "Gender: " + gender + "\n"
          + "Level: " + str(urank) + "\n")

# Fetch the posts and save them to a text file: each post's text, detail-page URL, like count, comment count, repost count, etc.
def get_weibo(id, file):
    i = 1
    while True:
        url = 'https://m.weibo.cn/api/container/getIndex?type=uid&value=' + id
        weibo_url = ('https://m.weibo.cn/api/container/getIndex?type=uid&value=' + id
                     + '&containerid=' + get_containerid(url) + '&page=' + str(i))
        try:
            data = use_proxy(weibo_url, proxy_addr)
            content = json.loads(data).get('data')
            cards = content.get('cards')
            if len(cards) > 0:
                for j in range(len(cards)):
                    print("-----Crawling page " + str(i) + ", post " + str(j) + "------")
                    card_type = cards[j].get('card_type')
                    if card_type == 9:  # card_type 9 corresponds to a normal post
                        mblog = cards[j].get('mblog')
                        attitudes_count = mblog.get('attitudes_count')
                        comments_count = mblog.get('comments_count')
                        created_at = mblog.get('created_at')
                        reposts_count = mblog.get('reposts_count')
                        scheme = cards[j].get('scheme')
                        text = mblog.get('text')
                        with open(file, 'a', encoding='utf-8') as fh:
                            fh.write("----Page " + str(i) + ", post " + str(j) + "----" + "\n")
                            fh.write("Post URL: " + str(scheme) + "\n"
                                     + "Posted at: " + str(created_at) + "\n"
                                     + "Text: " + text + "\n"
                                     + "Likes: " + str(attitudes_count) + "\n"
                                     + "Comments: " + str(comments_count) + "\n"
                                     + "Reposts: " + str(reposts_count) + "\n")
                i += 1
            else:
                break
        except Exception as e:
            # On an error, print it and retry the same page
            print(e)
            pass

if __name__ == "__main__":
    file = id + ".txt"  # posts are appended to <uid>.txt
    get_userInfo(id)
    get_weibo(id, file)
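Running the script first prints the profile information to the console and then appends each crawled post to a text file named after the user ID (here 1259110474.txt), paging through the results until a page returns no more cards.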
Crawl results
We hope this article is helpful to readers working on Python programming.
Original article: https://blog.csdn.net/d1240673769/article/details/74278547