Analysis of a Java Implementation for Crawling Baidu Images

This article walks through an example of crawling Baidu Images with Java. It is shared here for reference; the details are as follows:

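In the past, when parsing HTML documents or fragments in Java, we usually reached for the open-source library htmlparser ( http://htmlparser.sourceforge.net/ ). Now that jsoup is available, it is enough on its own for HTML processing: it is updated more frequently and has a more convenient API.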

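jsoup is a Java HTML parser that can parse a URL or a string of HTML directly. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods, so it can be thought of as a Java counterpart to jQuery.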

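The main features of jsoup are: parsing HTML from a URL, a file, or a string; finding and extracting data with DOM traversal or CSS selectors; and manipulating HTML elements, attributes, and text.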

jsoup is released under the MIT license, so it can safely be used in commercial projects. Official site: http://jsoup.org/

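The work breaks down into roughly three modules: first, fetch the web resource; second, parse the response and extract the image URLs we want; third, save the images to local files with Java IO.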
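To make the three-stage structure concrete, here is a minimal sketch that wires fetch, parse, and store together as plain functions. The class name `CrawlPipelineSketch` and the stubbed stages are hypothetical and stand in for the real network call and file writes; this is only an outline of the flow, not the article's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

class CrawlPipelineSketch {

    // Runs fetch -> parse -> store and returns how many URLs were handed to store.
    static int run(Supplier<String> fetch,
                   Function<String, List<String>> parse,
                   Consumer<String> store) {
        String body = fetch.get();             // 1. fetch the page content
        List<String> urls = parse.apply(body); // 2. extract the image URLs
        for (String url : urls) {
            store.accept(url);                 // 3. save each image
        }
        return urls.size();
    }

    public static void main(String[] args) {
        List<String> stored = new ArrayList<>();
        int count = run(
                () -> "a.jpg,b.png",                    // stub instead of a network call
                body -> Arrays.asList(body.split(",")), // stub instead of regex parsing
                stored::add);                           // stub instead of writing files
        System.out.println(count + " stored: " + stored);
    }
}
```

Keeping the stages separate like this makes it possible to test the parsing step without touching the network or the file system.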

The core of the fetch module is retrieving the page content with jsoup. The key code is as follows:

								
    // Imports needed by this method
    import java.util.ArrayList;
    import java.util.List;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    private static List<JsoupImageVO> findImageNoUrl(String hotelId, String url, int timeout) {
        List<JsoupImageVO> result = new ArrayList<JsoupImageVO>();
        Document document = null;
        try {
            document = Jsoup.connect(url).data("query", "Java")  // request parameter
                    .userAgent("Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)")  // set the User-Agent
                    .timeout(timeout)
                    .get();
            String xmlSource = document.toString();
            result = dealResult(xmlSource, hotelId);
        } catch (Exception e) {
            // Request failed: fall back to a single default image
            String defaultUrl = "http://qnimg.zowoyoo.com/img/15463/1509533934407.jpg";
            JsoupImageVO fallback = new JsoupImageVO();
            fallback.setUrl(defaultUrl);
            fallback.setName(hotelId);
            result.add(fallback);
        }
        return result;
    }

Here url is the Baidu image search address. The calling code is as follows:

    public static List<JsoupImageVO> findImage(String hotelName, String hotelId, int page) {
        int number = 5;
        String url = "http://image.baidu.com/search/avatarjson?tn=resultjsonavatarnew&ie=utf-8&word=" + hotelName
                + "&cg=star&pn=" + page * 30 + "&rn=" + number
                + "&itg=0&z=0&fr=&width=&height=&lm=-1&ic=0&s=0&st=-1&gsm=" + Integer.toHexString(page * 30);
        int timeout = 5000;
        return findImageNoUrl(hotelId, url, timeout);
    }

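Note the query parameters here: word is the keyword to search for, pn selects which page of results is returned (the code passes page * 30), and rn is the number of results per page.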
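As a rough illustration of how these parameters fit together, the sketch below builds the same query string with the keyword URL-encoded, which matters when the keyword contains spaces or non-ASCII characters. The class and method names (`BaiduImageUrlSketch`, `buildUrl`) are hypothetical, and the parameter list is copied from the code above rather than from any documented Baidu API.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class BaiduImageUrlSketch {

    // Hypothetical helper: mirrors the URL assembled in the calling code above.
    static String buildUrl(String keyword, int page, int perPage) {
        int offset = page * 30; // the same page * 30 value that is passed as pn
        String encoded;
        try {
            encoded = URLEncoder.encode(keyword, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
        return "http://image.baidu.com/search/avatarjson?tn=resultjsonavatarnew&ie=utf-8"
                + "&word=" + encoded
                + "&cg=star&pn=" + offset
                + "&rn=" + perPage
                + "&itg=0&z=0&fr=&width=&height=&lm=-1&ic=0&s=0&st=-1"
                + "&gsm=" + Integer.toHexString(offset);
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("hotel lobby", 2, 5));
    }
}
```

With page set to 2, pn becomes 60 and gsm becomes its hexadecimal form, 3c.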

Next, parse the response and wrap the results in objects. The core code is as follows:

								
    // Imports needed by this method; StringEscapeUtils comes from Apache Commons
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    private static List<JsoupImageVO> dealResult(String xmlSource, String hotelId) {
        List<JsoupImageVO> result = new ArrayList<JsoupImageVO>();
        // Turn HTML entities back into plain characters before matching
        xmlSource = StringEscapeUtils.unescapeHtml3(xmlSource);
        String reg = "objURL\":\"http://.+?\\.(gif|jpeg|png|jpg|bmp)";
        Pattern pattern = Pattern.compile(reg);
        Matcher m = pattern.matcher(xmlSource);
        while (m.find()) {
            JsoupImageVO jsoupImageVO = new JsoupImageVO();
            // Skip the 9-character prefix objURL":" to keep only the URL
            String imageUrl = m.group().substring(9);
            if (imageUrl == null || "".equals(imageUrl)) {
                String defaultUrl = "http://qnimg.zowoyoo.com/img/15463/1509533934407.jpg";
                jsoupImageVO.setUrl(defaultUrl);
            } else {
                jsoupImageVO.setUrl(imageUrl);
            }
            jsoupImageVO.setName(hotelId);
            result.add(jsoupImageVO);
        }
        return result;
    }

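The key piece here is the regular expression reg: it picks out the image URLs in the response that match the expected pattern, and each one is then wrapped in an object.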
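To show what the regular expression actually matches, here is a small standalone sketch that runs an equivalent pattern over a made-up input string. It assumes the key is spelled objURL in the response, and it uses a capture group instead of substring(9) to drop the prefix. The class name `ObjUrlRegexSketch` and the sample data are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ObjUrlRegexSketch {

    // Equivalent to the reg pattern above, with a capture group around the URL part.
    private static final Pattern OBJ_URL = Pattern.compile(
            "objURL\":\"(http://.+?\\.(?:gif|jpeg|png|jpg|bmp))");

    static List<String> extract(String source) {
        List<String> urls = new ArrayList<>();
        Matcher m = OBJ_URL.matcher(source);
        while (m.find()) {
            urls.add(m.group(1)); // group(1) is the URL without the objURL":" prefix
        }
        return urls;
    }

    public static void main(String[] args) {
        // Made-up sample input, not a real Baidu response.
        String sample = "{\"objURL\":\"http://example.com/a.jpg\",\"w\":1},"
                + "{\"objURL\":\"http://example.com/b.png\",\"w\":2}";
        System.out.println(extract(sample));
    }
}
```

Because `.+?` is lazy, each match stops at the first image extension it reaches, so the two URLs in the sample are returned separately. Note that the pattern only matches http:// addresses, so https:// URLs would be skipped.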

The last part fetches each image from its URL through Java IO streams and saves it locally. The core code is as follows:

								
    // Imports needed by this method
    import java.io.File;
    import java.io.FileNotFoundException;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // Download the image at the given network address to path/name.jpg
    public static void download(String url, String name, String path) {
        File dirFile = new File(path);
        if (!dirFile.exists()) {
            dirFile.mkdirs();  // create the target directory if it is missing
        }
        if (url == null || !url.startsWith("http")) {
            LogUtils.writeLog("Network image not found....");
            return;
        }
        File file = new File(dirFile, name + ".jpg");
        byte[] buffer = new byte[1024];
        int num;
        HttpURLConnection httpCon = null;
        try {
            httpCon = (HttpURLConnection) new URL(url).openConnection();
            // try-with-resources closes both streams even if copying fails
            try (InputStream in = httpCon.getInputStream();
                 FileOutputStream fos = new FileOutputStream(file)) {
                while ((num = in.read(buffer)) != -1) {
                    fos.write(buffer, 0, num);  // write only the bytes actually read
                }
            }
        } catch (FileNotFoundException notFoundE) {
            LogUtils.writeLog("Network image not found....");
        } catch (IOException ioe) {
            LogUtils.writeLog("An IO exception occurred.....");
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (httpCon != null) {
                httpCon.disconnect();
            }
        }
    }

These are all basic Java IO operations; if anything is unclear, review the material on Java's IO module.
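For comparison, the copy loop can also be expressed with the standard java.nio.file API. The sketch below is an alternative under stated assumptions, not the article's method: the class `ImageSaveSketch` and its `save` helper are hypothetical, and the input stream is an in-memory stand-in for the stream a real download would obtain from the image URL.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class ImageSaveSketch {

    // Hypothetical helper: writes everything readable from in to dir/fileName.
    static Path save(InputStream in, Path dir, String fileName) {
        try {
            Files.createDirectories(dir); // does nothing if the directory already exists
            Path target = dir.resolve(fileName);
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("images");
        // In-memory bytes stand in for the stream of a downloaded image.
        try (InputStream in = new ByteArrayInputStream(new byte[] {1, 2, 3})) {
            System.out.println(save(in, dir, "demo.jpg"));
        }
    }
}
```

Files.copy handles the buffering internally, so no manual read loop is needed; the caller remains responsible for closing the input stream.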

Because this is a Maven project, the jsoup dependency has to be added before development can start.

I hope this article is helpful for your Java programming.

Original link: https://blog.csdn.net/hj7jay/article/details/84335161

Updated: 2023-07-10