
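1. Requirements

I recently rebuilt my news App around Material Design, and finding a data source became a problem.

Others have already analyzed the APIs of Zhihu Daily, Phoenix News (ifeng) and similar services, where requesting the right URL returns the news as JSON. To practice my coding skills, I decided instead to crawl the news pages myself and build an API from the data I collect.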

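2. Screenshots

The figure below shows a page of the original website.

[Figure: Implementing a crawler in Java to supply data to an App (Jsoup web crawler)]

The crawler fetches the data, which is then displayed in the mobile App.

[Figure: crawled data displayed in the mobile App]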

3. Crawler approach

[Figure: crawler approach]

For how the App itself is implemented, see the related articles; this article focuses on how the data is crawled.

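About Jsoup

Jsoup is an open-source HTML parser for Java. It can parse the page at a URL or a string of HTML directly.

Its main features are:

- Parsing HTML from a URL, a file or a string;
- Finding and extracting data with DOM traversal or CSS selectors;
- Manipulating HTML elements, attributes and text;
- Cleaning untrusted HTML (to prevent XSS attacks).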

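4. Crawling process

Fetching the page's HTML with a GET request

The DOM tree of the news page's HTML looks like this:

[Figure: DOM tree of the news page]

The code below sends a GET request to the given URL and returns the HTML source of the response.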

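    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class HttpTool {

        // Sends a GET request to urlStr and returns the response body (the page's HTML).
        public static String doGet(String urlStr) throws CommonException {
            String html = "";
            try {
                URL url = new URL(urlStr);
                HttpURLConnection connection = (HttpURLConnection) url.openConnection();
                connection.setRequestMethod("GET");
                connection.setConnectTimeout(5000);
                connection.setDoInput(true);
                // setDoOutput(true) is only for requests that send a body, so it is not set for a GET
                if (connection.getResponseCode() == 200) {
                    InputStream in = connection.getInputStream();
                    html = StreamTool.inToStringByByte(in);
                } else {
                    throw new CommonException("The news server did not return status 200");
                }
            } catch (CommonException e) {
                // Keep the more specific message thrown above
                throw e;
            } catch (Exception e) {
                e.printStackTrace();
                throw new CommonException("GET request failed");
            }
            return html;
        }
    }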

Note the line InputStream in = connection.getInputStream(); above. Converting an input stream into a string is a common need, so we factor it out into a utility method.

    import java.io.ByteArrayOutputStream;
    import java.io.InputStream;

    public class StreamTool {

        // Reads the whole stream and decodes it as UTF-8.
        public static String inToStringByByte(InputStream in) throws Exception {
            ByteArrayOutputStream outStr = new ByteArrayOutputStream();
            byte[] buffer = new byte[1024];
            int len = 0;
            // Collect the raw bytes first; decoding each chunk separately could split
            // a multi-byte UTF-8 character at a chunk boundary
            while ((len = in.read(buffer)) != -1) {
                outStr.write(buffer, 0, len);
            }
            outStr.close();
            return outStr.toString("UTF-8");
        }
    }
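One detail of stream-to-string conversion is easy to miss: if each chunk read from the stream is decoded on its own, a multi-byte UTF-8 character that straddles two chunks is corrupted, which matters for Chinese news text. The sketch below is a hypothetical, self-contained demo (the class and method names are not part of the original project) that compares per-chunk decoding with collecting all bytes first. The tiny 4-byte buffer is an assumption chosen only to force a chunk boundary inside a character.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original project.
class ChunkDecodingDemo {

    // Decodes each chunk on its own: a character that straddles two chunks is corrupted.
    static String decodePerChunk(InputStream in, int bufferSize) {
        byte[] buffer = new byte[bufferSize];
        StringBuilder content = new StringBuilder();
        try {
            int len;
            while ((len = in.read(buffer)) != -1) {
                content.append(new String(buffer, 0, len, StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return content.toString();
    }

    // Collects every byte first and decodes once at the end.
    static String decodeAfterBuffering(InputStream in, int bufferSize) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[bufferSize];
        try {
            int len;
            while ((len = in.read(buffer)) != -1) {
                out.write(buffer, 0, len);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Two CJK characters written as Unicode escapes; each takes 3 bytes in UTF-8.
        String original = "\u65B0\u805E";
        byte[] bytes = original.getBytes(StandardCharsets.UTF_8);
        // A 4-byte buffer puts a chunk boundary inside the second character.
        String perChunk = decodePerChunk(new ByteArrayInputStream(bytes), 4);
        String buffered = decodeAfterBuffering(new ByteArrayInputStream(bytes), 4);
        System.out.println("per-chunk decoding intact: " + perChunk.equals(original));
        System.out.println("buffered decoding intact:  " + buffered.equals(original));
    }
}
```

With a 1024-byte buffer the problem appears only when a character happens to fall on a chunk boundary, so it can go unnoticed on short pages.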

5. Parsing the HTML to get the title

Using Chrome's Inspect Element tool, find the HTML that corresponds to the news title:

    <div id="article_title">
      <h1>
        <a href="http://see.xidian.edu.cn/html/news/7428.html">
          關于舉辦《經典音樂作品欣賞與人文審美》講座的通知
        </a>
      </h1>
    </div>

We need to locate the element with id="article_title" in the HTML above, which is what the getElementById(String id) method does.

    String htmlStr = HttpTool.doGet(urlStr);
    // Convert the fetched HTML source into a Document
    Document doc = Jsoup.parse(htmlStr);
    Element articleEle = doc.getElementById("article");
    // Title
    Element titleEle = articleEle.getElementById("article_title");
    String titleStr = titleEle.text();

6. Getting the publication date and source

Find the corresponding HTML in the same way:

    <html>
      <head></head>
      <body>
        <div id="article_detail">
          <span> 2015-05-28 </span>
          <span> 來源: </span>
          <span> 瀏覽次數: <script language="JavaScript" src="http://see.xidian.edu.cn/index.php/news/click/id/7428">
          </script> 477 </span>
        </div>
      </body>
    </html>

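The approach is similar to the one above: getElementById(String id) returns the element with id="article_detail" as an Element, and getElementsByTag then fetches the span elements inside it. Because there are three <span> ... </span> elements, the result is an Elements collection rather than a single Element.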

    // article_detail contains the date, the source label and the view count,
    // e.g. 2016-01-15 來源: 瀏覽次數:177
    Element detailEle = articleEle.getElementById("article_detail");
    Elements details = detailEle.getElementsByTag("span");
    // Publication date
    String dateStr = details.get(0).text();
    // News source
    String sourceStr = details.get(1).text();

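7. Parsing the view count

Printing details.get(2).text() from the code above yields only

瀏覽次數:

that is, the label "View count:" with no number after it. Why?

The view count is written into the page by JavaScript. Jsoup only parses the HTML it is given and does not execute scripts, so it cannot see data that is rendered dynamically.

There are two solutions:

- Embed a browser engine in the crawler, let it run the JavaScript and render the page, and then scrape the result. Tools for this include Selenium, HtmlUnit and PhantomJS.
- Analyze the JavaScript request and find the URL that actually serves the data.

Requesting the URL from the script tag above, http://see.xidian.edu.cn/index.php/news/click/id/7428, returns the following: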

    document.write(478)

This 478 is the view count we need. We send a GET request to the URL above and use a regular expression to pull the number out of the returned string.

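    // Visiting this news page increases the view count by 1; the count is rendered by JavaScript
    String jsStr = HttpTool.doGet(COUNT_BASE_URL + currentPage);
    // "\\D+" matches runs of non-digit characters; the backslash must be escaped in a Java string literal
    int readTimes = Integer.parseInt(jsStr.replaceAll("\\D+", ""));
    // Alternatively, use this equivalent regular expression
    // String readTimesStr = jsStr.replaceAll("[^0-9]", "");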
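Stripping every non-digit works for the response shown above, but it throws a NumberFormatException when the response contains no digits (for example an error page) and silently glues together digits from unrelated parts of the text. As a more defensive sketch, the hypothetical helper below (its name and the choice of OptionalInt are assumptions, not part of the original code) matches only the number inside document.write(...).

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original project.
class ViewCountParser {

    // Captures the digits inside document.write(...), e.g. "document.write(478)".
    private static final Pattern WRITE_CALL =
            Pattern.compile("document\\.write\\(\\s*(\\d+)\\s*\\)");

    // Returns the view count, or an empty OptionalInt if the response has another shape.
    static OptionalInt parseViewCount(String js) {
        if (js == null) {
            return OptionalInt.empty();
        }
        Matcher m = WRITE_CALL.matcher(js);
        if (!m.find()) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(m.group(1)));
        } catch (NumberFormatException e) {
            // The number has too many digits to fit in an int
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseViewCount("document.write(478)"));
        System.out.println(parseViewCount("<html>error</html>"));
    }
}
```

A caller can then decide what to store when the count is missing, for example parseViewCount(jsStr).orElse(0).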

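8. Parsing the news content

Originally I extracted the news content as plain text, but I later found that the Android side can also display CSS-styled content, so the content now keeps its HTML format.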

    Element contentEle = articleEle.getElementById("article_content");
    // Main body of the news
    String contentStr = contentEle.toString();
    // text() would drop the HTML tags of the news body;
    // toString() keeps them so that a WebView on Android can display the HTML
    // String contentStr = contentEle.text();

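9. Parsing image URLs

A web page contains many images of all sizes. To collect only those in the news body, it is best to first locate the Element that holds the news content and then use getElementsByTag("img") to pick out the images.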

    Element contentEle = articleEle.getElementById("article_content");
    // Main body of the news
    String contentStr = contentEle.toString();
    // text() would drop the HTML tags of the news body;
    // toString() keeps them so that a WebView on Android can display the HTML
    // String contentStr = contentEle.text();
    // Search for images only inside the news body
    Elements images = contentEle.getElementsByTag("img");
    String[] imageUrls = new String[images.size()];
    for (int i = 0; i < imageUrls.length; i++) {
        // The src attribute holds the image URL
        imageUrls[i] = images.get(i).attr("src");
    }
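The src attribute often holds a site-relative path such as /uploads/image/..., which an App cannot load without the host. The sketch below is one possible way to turn such values into absolute URLs using only java.net.URI; the class name, method name and the decision to skip blank or invalid values are assumptions for illustration. Jsoup can also resolve URLs itself through absUrl("src"), provided a base URI was supplied when the document was parsed.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original project.
class ImageUrlResolver {

    // Resolves each src value against the URL of the page it was found on.
    static List<String> toAbsolute(String pageUrl, String[] srcs) {
        URI base = URI.create(pageUrl);
        List<String> result = new ArrayList<>();
        for (String src : srcs) {
            if (src == null || src.trim().isEmpty()) {
                continue; // nothing to resolve
            }
            try {
                result.add(base.resolve(src.trim()).toString());
            } catch (IllegalArgumentException e) {
                // src is not a valid URI (for example it contains spaces); skip it
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String page = "http://see.xidian.edu.cn/html/news/7928.html";
        String[] srcs = { "/uploads/image/20160114/20160114225911_34428.png" };
        System.out.println(toAbsolute(page, srcs));
    }
}
```

Values that are already absolute pass through unchanged, so the helper can be applied to every image regardless of how the page wrote its src.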

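10. News entity class (JavaBean)

Having obtained the title, publication date, view count, content and so on, we naturally need a JavaBean that wraps these values in an entity class.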

    import java.util.Arrays;

    // Entity class holding everything extracted from one news page.
    public class ArticleItem {

        private int index;
        private String[] imageUrls;
        private String title;
        private String publishDate;
        private String source;
        private int readTimes;
        private String body;

        public ArticleItem(int index, String[] imageUrls, String title, String publishDate, String source,
                int readTimes, String body) {
            this.index = index;
            this.imageUrls = imageUrls;
            this.title = title;
            this.publishDate = publishDate;
            this.source = source;
            this.readTimes = readTimes;
            this.body = body;
        }

        @Override
        public String toString() {
            return "ArticleItem [index=" + index + ", imageUrls=" + Arrays.toString(imageUrls) + ", title=" + title
                    + ", publishDate=" + publishDate + ", source=" + source + ", readTimes=" + readTimes
                    + ", body=" + body + "]";
        }
    }

Test

    // Base URLs, inferred from the example URLs shown earlier in this article
    private static final String ARTICLE_BASE_URL = "http://see.xidian.edu.cn/html/news/";
    private static final String COUNT_BASE_URL = "http://see.xidian.edu.cn/index.php/news/click/id/";

    public static ArticleItem getNewsItem(int currentPage) throws CommonException {
        // Build the news URL from the numeric suffix
        String urlStr = ARTICLE_BASE_URL + currentPage + ".html";
        String htmlStr = HttpTool.doGet(urlStr);
        Document doc = Jsoup.parse(htmlStr);
        Element articleEle = doc.getElementById("article");

        // Title
        Element titleEle = articleEle.getElementById("article_title");
        String titleStr = titleEle.text();

        // article_detail contains the date, the source label and the view count,
        // e.g. 2016-01-15 來源: 瀏覽次數:177
        Element detailEle = articleEle.getElementById("article_detail");
        Elements details = detailEle.getElementsByTag("span");
        // Publication date
        String dateStr = details.get(0).text();
        // News source
        String sourceStr = details.get(1).text();

        // Visiting this news page increases the view count by 1; the count is rendered by JavaScript
        String jsStr = HttpTool.doGet(COUNT_BASE_URL + currentPage);
        // The backslash in "\\D+" must be escaped in a Java string literal
        int readTimes = Integer.parseInt(jsStr.replaceAll("\\D+", ""));
        // Alternatively, use this equivalent regular expression
        // String readTimesStr = jsStr.replaceAll("[^0-9]", "");

        Element contentEle = articleEle.getElementById("article_content");
        // Main body of the news
        String contentStr = contentEle.toString();
        // text() would drop the HTML tags of the news body;
        // toString() keeps them so that a WebView on Android can display the HTML
        // String contentStr = contentEle.text();

        Elements images = contentEle.getElementsByTag("img");
        String[] imageUrls = new String[images.size()];
        for (int i = 0; i < imageUrls.length; i++) {
            imageUrls[i] = images.get(i).attr("src");
        }
        return new ArticleItem(currentPage, imageUrls, titleStr, dateStr, sourceStr, readTimes, contentStr);
    }

    public static void main(String[] args) throws CommonException {
        System.out.println(getNewsItem(7928));
    }

Output

    ArticleItem [index=7928,
     imageUrls=[/uploads/image/20160114/20160114225911_34428.png],
     title=電院2014級開展“讓誠信之花開遍冬日校園”教育活動,
     publishDate=2016-01-14,
     source=來源: 電影新聞網,
     readTimes=200,
     body=<div id="article_content">
     <p style="text-indent:2em;" align="justify"> <strong><span style="font-size:16px;line-height:1.5;">西電新聞網訊</span></strong><span style="font-size:16px;line-height:1.5;"> （通訊員</span><strong><span style="font-size:16px;line-height:1.5;"> 丁彤 王朱丹</span></strong><span style="font-size:16px;line-height:1.5;">...）