Scraping Web Pages with HttpClient and Jsoup (Part 2)

1. Sample code for fetching a page with HttpClient

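import java.io.IOException;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

// The class and method names are illustrative wrappers around the original snippet.
public class PageFetcher {
    // Returns the page body as a string, or "" if the request fails,
    // the status is not 200, or the response has no entity.
    public static String fetch(String url) {
        HttpGet httpGet = new HttpGet(url);
        // try-with-resources closes both the response and the client.
        try (CloseableHttpClient httpClient = HttpClients.custom().build();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {
            if (response.getStatusLine().getStatusCode() == 200
                    && response.getEntity() != null) {
                return EntityUtils.toString(response.getEntity(), "utf-8");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return "";
    }
}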

2. Basic steps for parsing a page with Jsoup

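1. Get the string returned by HttpClient
2. Convert the string into a Document object, the wrapper class Jsoup provides
3. Locate the elements you need and parse them with the provided API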

3. Sample code for using Jsoup

When parsing a page with Jsoup, there are two ways to locate elements:
1. Use the provided DOM methods (less convenient)
2. Use selector syntax
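
To make the difference between the two approaches concrete, here is a minimal sketch that extracts the same values both ways. The class name, method names, and sample markup are made up for illustration and are not from the original article; it assumes a reasonably recent jsoup version on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class FindElementsDemo {
    // Hypothetical sample markup, standing in for a page fetched with HttpClient.
    static final String HTML =
        "<div id=\"list\">"
      + "<img data-sku=\"1001\" src=\"a.png\">"
      + "<img src=\"b.png\">"
      + "<img data-sku=\"1002\" src=\"c.png\">"
      + "</div>";

    // Approach 1: DOM methods. Walk the tree and filter by hand.
    static List<String> skusViaDom(Document doc) {
        List<String> skus = new ArrayList<>();
        Element list = doc.getElementById("list");
        for (Element img : list.getElementsByTag("img")) {
            if (img.hasAttr("data-sku")) {
                skus.add(img.attr("data-sku"));
            }
        }
        return skus;
    }

    // Approach 2: selector syntax. One CSS-style query does the filtering.
    static List<String> skusViaSelector(Document doc) {
        return doc.select("img[data-sku]").eachAttr("data-sku");
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        System.out.println(skusViaDom(doc));      // [1001, 1002]
        System.out.println(skusViaSelector(doc)); // [1001, 1002]
    }
}
```

Both methods return the same list; the selector version simply pushes the filtering into the query string, which is why it tends to be the more convenient choice.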

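import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// html: the string returned by HttpClient in section 1
Document document = Jsoup.parse(html);
Elements elements = document.select("img[data-sku]"); // <img> tags with a data-sku attribute
for (Element element : elements) {
    String sku = element.attr("data-sku"); // value of the data-sku attribute
    String text = element.text();          // text between the tags (empty for <img>)
}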

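attr() returns the value of an attribute; its argument is the attribute name.
text() returns the text inside an element, that is, the text between the opening tag <> and the closing tag </>.
Jsoup provides many more methods; look them up in its documentation as needed.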
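
As a small worked example of attr() and text(), the sketch below runs both on a single element. The markup and the class name are hypothetical, chosen only to show what each call returns; it assumes a jsoup version that includes selectFirst().

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class AttrTextDemo {
    // Hypothetical markup for illustration only.
    static final String HTML =
        "<a href=\"/item/1001\" data-sku=\"1001\">Blue <b>kettle</b></a>";

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        // First <a> element that carries a data-sku attribute.
        Element link = doc.selectFirst("a[data-sku]");

        System.out.println(link.attr("data-sku")); // 1001
        System.out.println(link.attr("href"));     // /item/1001
        // text() includes the text of nested tags such as <b>.
        System.out.println(link.text());           // Blue kettle
        // attr() returns an empty string when the attribute is absent.
        System.out.println(link.attr("title").isEmpty()); // true
    }
}
```

Note that text() flattens nested tags into plain text, and attr() returns an empty string rather than null for a missing attribute, so no null check is needed before using the result.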