A very simple image scraper: it collects article links from a single index page, then visits each page to grab the full-size images. The approach is straightforward. Since the target is an overseas site, access can be slow, so using a proxy tool is recommended.
Target URL:
https://thedieline.com/blog/2020/5/19/the-worlds-best-packaging-dieline-awards-2020-winners-revealed
The site has no real anti-scraping measures, and the page structure is clear and simple; you only need to locate the right nodes.
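As noted above, a proxy tool helps with the slow overseas access. Here is a minimal sketch of routing requests through a local proxy; the 127.0.0.1:7890 endpoint is a placeholder assumption, not something from the original post:

import requests

# Hypothetical local proxy endpoint -- point this at whatever your proxy tool exposes.
PROXIES = {
    "http": "http://127.0.0.1:7890",
    "https": "http://127.0.0.1:7890",
}

resp = requests.get(
    "https://thedieline.com/blog/2020/5/19/the-worlds-best-packaging-dieline-awards-2020-winners-revealed",
    proxies=PROXIES,
    timeout=8,
)
print(resp.status_code)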
The links we want appear under two nodes.
Node one
XPath expression:
hrefs=req.xpath('//p[@class="data-import-preserve"]/a/@href')
Node two
XPath expression:
hrefs=req.xpath('//b[@class="data-import-preserve"]/a/@href')
Together, these two nodes should yield all the links, but be sure to filter out invalid ones; otherwise the program will raise errors or produce junk data. A sketch of one way to do this follows.
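A minimal sketch of combining the two queries and filtering, assuming req is the parsed index page as in the snippets above; the startswith check is my assumption of what counts as a valid link (the full source below instead removes three hard-coded bad URLs):

hrefs = req.xpath('//p[@class="data-import-preserve"]/a/@href') \
      + req.xpath('//b[@class="data-import-preserve"]/a/@href')
# Keep only absolute on-site links; relative or off-site hrefs are dropped.
hrefs = [h for h in hrefs if h.startswith("https://thedieline.com")]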
Timeout handling for image downloads
The image download includes simple timeout handling: a plain try/except retry, provided for reference only.
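Stripped of logging, the pattern in the full source boils down to roughly this sketch (the real code also inspects the exception text for a read-timeout marker and logs failures to a spider.txt file):

import time
import requests

def download(img_url, path, headers):
    # Try twice: on any failure, wait briefly and retry once before giving up.
    for attempt in range(2):
        try:
            r = requests.get(img_url, headers=headers, timeout=5)
            with open(path, 'wb') as f:
                f.write(r.content)
            return True
        except Exception as e:
            print(f'Download failed ({e}), retrying...' if attempt == 0 else f'Giving up: {e}')
            time.sleep(2)
    return False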
Results
(Screenshots of the scrape and download output are omitted here.)
Full source code:
# thedieline image scraper
# 20200520 by WeChat: huguo00289
# -*- coding: UTF-8 -*-
import requests, time, re
from fake_useragent import UserAgent
from lxml import etree
import os


def ua():
    """Return headers with a random User-Agent."""
    ua = UserAgent()
    headers = {"User-Agent": ua.random}
    return headers


def get_urllist():
    """Collect detail-page links from the award-winners index page."""
    url = "https://thedieline.com/blog/2020/5/19/the-worlds-best-packaging-dieline-awards-2020-winners-revealed"
    response = requests.get(url, headers=ua(), timeout=8).content.decode('utf-8')
    req = etree.HTML(response)
    hrefs = req.xpath('//b[@class="data-import-preserve"]/a/@href')
    print(len(hrefs))
    return hrefs


def get_imgs(url):
    """Fetch one detail page and download every large image on it."""
    response = requests.get(url, headers=ua(), timeout=8).content.decode('utf-8')
    time.sleep(1)
    req = etree.HTML(response)
    title = req.xpath('//title/text()')[0]
    title = re.sub(r'[\|\/\<\>\:\*\?\\\"]', "_", title)  # strip characters illegal in file names
    print(title)
    os.makedirs(f'{title}/', exist_ok=True)  # create the output directory
    imgs = req.xpath('//figure[@class="data-import-preserve"]/img/@src')
    print(len(imgs))
    i = 1
    for img in imgs:
        img_url = img
        img_name = f'{i}.jpeg'
        bctp(title, img_url, img_name)
        i = i + 1


# Download a single image, retrying once on a read timeout.
def bctp(lj, img_url, img_name):
    print("Starting image download!")
    try:
        r = requests.get(img_url, headers=ua(), timeout=5)
        with open(f'{lj}/{img_name}', 'wb') as f:
            f.write(r.content)
        print(f'Downloaded {img_name} successfully!')
        time.sleep(1)
    except Exception as e:
        if "port=443): Read timed out" in str(e):
            time.sleep(2)
            try:
                r = requests.get(img_url, headers=ua(), timeout=5)
                with open(f'{lj}/{img_name}', 'wb') as f:
                    f.write(r.content)
                print(f'Downloaded {img_name} successfully!')
            except Exception as e:
                print(f'Failed to download {img_name}!')
                print(f'Error: {e}')
                with open(f'{lj}/spider.txt', 'a+', encoding='utf-8') as f:
                    f.write(f'Error: {e} --- failed to download {img_url}\n')
        else:
            print(f'Failed to download {img_name}!')
            print(f'Error: {e}')
            with open(f'{lj}/spider.txt', 'a+', encoding='utf-8') as f:
                f.write(f'Error: {e} --- failed to download {img_url}\n')


def run():
    hrefs = get_urllist()
    # Filter out known-invalid links that would otherwise break the scrape.
    hrefs.remove("https://thedieline.com/blog/2020/5/6/-riceman")
    hrefs.remove("https://thedieline.com/blog/2020/3/6/srisangdao-rices-packaging-can-be-reused-as-tissue-box")
    hrefs.remove("https://thedieline.com/blog/2020/2/1/-revitalising-kelloggs")
    print(len(hrefs))
    for href in hrefs:
        if "https://thedieline.com" in href:
            print(f'>>> Crawling {href} ...')
            try:
                get_imgs(href)
            except Exception:
                pass
    print('>>> Scrape complete!')


if __name__ == '__main__':
    run()
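Note that the script depends on the third-party packages requests, fake-useragent, and lxml; install them with pip before running.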