


[Crawler] Using a Python crawler to collect the links of all articles on the 小麦苗 itpub blog and write them to Excel (2)

The first post ( http://blog.itpub.net/26736162/viewspace-2286553/ ) wrote the links to a txt file. This post writes the crawled results to an Excel spreadsheet instead.

Crawl result:

小麦苗itpub博客链接地址.xlsx

Python source code for the crawler:

import re

import requests
import xlwt

# URL template for the blog's list pages; %d is the page number.
url = 'http://blog.itpub.net/26736162/list/%d/'
# Each match yields a (link, title) tuple for one article.
pattern = re.compile(r'<a target=_blank href="(.*?)" class="w750"><p class="title">(.*?)</p></a>')

# Alternative pattern that captures only the link:
# pattern = re.compile(r'<a target=_blank href="(.*?)" class="w750"><p class="title">')
# ret = pattern.findall(data)
# print(''.join(ret))

# Text-file writer from the first post, kept for reference:
# def write2file(items):
#     with open('./download/lhrbest_itpub_link_title.txt', 'a', encoding='utf-8') as fp:
#         for item in items:
#             item = item[::-1]  # (link, title) -> (title, link)
#             s = ':'.join(item)
#             fp.write(s + '\n')
def set_style(name, height, colour_index, horz=xlwt.Alignment.HORZ_LEFT, bold=False):
    style = xlwt.XFStyle()  # create the cell style
    font = xlwt.Font()  # create the font for this style
    font.name = name
    font.bold = bold
    # Palette index: 0 = Black, 1 = White, 2 = Red, 3 = Green, 4 = Blue, 5 = Yellow, 6 = Magenta, 7 = Cyan
    font.colour_index = colour_index
    # Height is given in 1/20 of a point: 220 is 11 pt, and 0x190 (400 in decimal) is 20 pt
    font.height = height
    style.font = font
    # Cell alignment
    alignment = xlwt.Alignment()
    # Horizontal alignment. May be: HORZ_GENERAL, HORZ_LEFT, HORZ_CENTER, HORZ_RIGHT, HORZ_FILLED,
    # HORZ_JUSTIFIED, HORZ_CENTER_ACROSS_SEL, HORZ_DISTRIBUTED
    alignment.horz = horz
    # Vertical alignment is always centred. May be: VERT_TOP, VERT_CENTER, VERT_BOTTOM, VERT_JUSTIFIED, VERT_DISTRIBUTED
    alignment.vert = xlwt.Alignment.VERT_CENTER
    style.alignment = alignment  # apply the alignment to the style
    # Cell borders. Line style may be: NO_LINE, THIN, MEDIUM, DASHED, DOTTED, THICK, DOUBLE, HAIR,
    # MEDIUM_DASHED, THIN_DASH_DOTTED, MEDIUM_DASH_DOTTED, THIN_DASH_DOT_DOTTED, MEDIUM_DASH_DOT_DOTTED,
    # SLANTED_MEDIUM_DASH_DOTTED, or 0x00 through 0x0D.
    borders = xlwt.Borders()
    borders.left = xlwt.Borders.DASHED  # dashed left border
    borders.right = xlwt.Borders.THIN  # thin right border
    borders.top = xlwt.Borders.THIN  # thin top border
    borders.bottom = xlwt.Borders.THIN  # thin bottom border
    borders.left_colour = 0x10  # border colours are palette indices
    borders.right_colour = 0x20
    borders.top_colour = 0x30
    borders.bottom_colour = 0x40
    style.borders = borders  # apply the borders to the style
    return style
def init_excel():
    f = xlwt.Workbook(encoding='utf-8')  # create the workbook
    # Create the worksheet that holds the article links
    sheet1 = f.add_sheet(u'小麦苗itpub博客链接地址', cell_overwrite_ok=True)
    sheet1.col(0).width = 256 * 50  # column width is in 1/256 of a character, so about 50 characters
    sheet1.col(1).width = 256 * 50
    rowTitle = [u'Article title', u'Link']
    for i in range(0, len(rowTitle)):
        # Header row: red, bold, centred
        sheet1.write(0, i, rowTitle[i], set_style('Courier New', 220, 2, xlwt.Alignment.HORZ_CENTER, True))
    # xlwt writes the legacy .xls format, so the file is saved with an .xls extension.
    # The ./download directory must already exist.
    f.save('./download/excel_write_base.xls')
    return f, sheet1

# Write one page of results to the sheet
def write_excel(rowDatas, f, rowIndex):
    f_excel = f[0]
    f_sheet = f[1]
    # Row offset for this page; this assumes every list page holds exactly 20 articles
    rowIndex = rowIndex * 20
    link_style = set_style('Courier New', 180, 4)  # blue text for links
    text_style = set_style('Courier New', 180, 0)  # black text for titles
    for k in range(0, len(rowDatas)):  # outer loop: one article per row
        row = rowDatas[k][::-1]  # (link, title) -> (title, link)
        for j in range(0, len(row)):  # inner loop: one column per field
            if j == 1:
                # k + rowIndex + 1 skips the header row; column 1 holds a clickable link
                f_sheet.write(k + rowIndex + 1, j,
                              xlwt.Formula('HYPERLINK("%s","%s")' % (row[j], row[j])), link_style)
            else:
                f_sheet.write(k + rowIndex + 1, j, row[j], text_style)
    # Save once per page rather than after every cell
    f_excel.save('./download/excel_write_base.xls')
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36'}

def loadHtml(page):
    if page >= 1:
        f = init_excel()  # returns the (workbook, sheet) pair
        for p in range(1, page + 1):
            url_itpub = url % p
            print(url_itpub)
            # The timeout keeps the script from hanging on an unresponsive server
            response = requests.get(url=url_itpub, headers=headers, timeout=10)
            response.encoding = 'utf-8'
            content = response.text
            items = pattern.findall(content)  # list of (link, title) tuples
            write_excel(items, f, p - 1)
    else:
        print('Please enter a positive integer.')

if __name__ == '__main__':
    page = int(input('Number of pages to crawl: '))
    loadHtml(page)
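
The extraction step can be tried without any network access. The following is a minimal sketch that applies the same regular expression to a small HTML fragment. The fragment, its article paths, and the helper name `extract_links` are made up for illustration and are not taken from the real blog pages.

```python
import re

# Same pattern as in the listing above: group 1 is the link, group 2 the title.
LINK_PATTERN = re.compile(
    r'<a target=_blank href="(.*?)" class="w750"><p class="title">(.*?)</p></a>'
)

# Hypothetical markup, invented for this example only.
SAMPLE_HTML = (
    '<a target=_blank href="http://blog.itpub.net/26736162/viewspace-0000001/" '
    'class="w750"><p class="title">First post</p></a>\n'
    '<a target=_blank href="http://blog.itpub.net/26736162/viewspace-0000002/" '
    'class="w750"><p class="title">Second post</p></a>'
)


def extract_links(html):
    """Return (title, link) pairs, in the column order used by the sheet."""
    return [(title, href) for href, title in LINK_PATTERN.findall(html)]


rows = extract_links(SAMPLE_HTML)
for title, href in rows:
    print(title, href)
```

Because the pattern depends on the exact attribute order and class names, it will silently return an empty list if the site changes its markup, so checking for an empty result is a cheap safeguard.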

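write_excel places each page at a fixed offset of 20 rows, which only works if every list page contains exactly 20 articles. If a page holds fewer, the sheet gets blank rows; if it holds more, rows are overwritten. One possible alternative, sketched here with a hypothetical helper `number_rows`, is to keep a running row counter across pages.

```python
def number_rows(pages, first_row=1):
    """Yield (row_index, item) pairs, numbering rows consecutively across pages.

    first_row defaults to 1 so that row 0 stays free for the header.
    """
    row = first_row
    for items in pages:
        for item in items:
            yield row, item
            row += 1


# Three pages of unequal length, standing in for crawled results.
pages = [["a", "b", "c"], ["d"], ["e", "f"]]
numbered = list(number_rows(pages))
print(numbered)
```

Each yielded row index could then be passed straight to the sheet's write call in place of k + rowIndex + 1.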

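The link column is written as a HYPERLINK formula built with plain string formatting, which would produce a malformed formula if a link or label ever contained a double quote. The sketch below builds the formula text with quotes doubled, which is the Excel convention for string literals. The helper name `hyperlink_formula` is hypothetical, and whether xlwt's formula parser accepts doubled quotes is an assumption that should be checked before relying on it.

```python
def hyperlink_formula(url, label=None):
    """Build the text of an Excel HYPERLINK formula, doubling embedded quotes."""
    if label is None:
        label = url  # the listing above uses the link itself as the label

    def quote(text):
        # Excel string literals escape a double quote by doubling it.
        return '"' + text.replace('"', '""') + '"'

    return 'HYPERLINK(%s,%s)' % (quote(url), quote(label))


formula = hyperlink_formula('http://blog.itpub.net/26736162/', 'itpub blog')
print(formula)
```

The returned string is what would be passed to xlwt.Formula in write_excel.
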
About Me

........................................................................................................................

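● Author: 小麦苗. Parts of this post were compiled from the internet; if anything infringes your rights, please contact 小麦苗 to have it removed

● This post is also kept up to date on itpub ( http://blog.itpub.net/26736162 ), cnblogs ( http://www.cnblogs.com/lhrbest ), and the author's WeChat official account ( xiaomaimiaolhr )

● itpub address of this post: http://blog.itpub.net/26736162

● cnblogs address of this post: http://www.cnblogs.com/lhrbest

● PDF version of this post, author profile, and 小麦苗's cloud drive: http://blog.itpub.net/26736162/viewspace-1624453/

● Database written-test and interview question bank with answers: http://blog.itpub.net/26736162/viewspace-2134706/

● DBA宝典 Toutiao account: http://www.toutiao.com/c/user/6401772890/#mid=1564638659405826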

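● Written in Shanghai between 2018-12-01 06:00 and 2018-12-31 24:00

● Last modified: 2018-12-01 06:00 ~ 2018-12-31 24:00

● The content comes from 小麦苗's study notes, with parts compiled from the internet; apologies for any infringement or inaccuracies

● All rights reserved. Sharing is welcome; please credit the source when reposting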


From "ITPUB博客", link: http://blog.itpub.net/26736162/viewspace-2286652/. Please credit the source when reposting; otherwise legal liability will be pursued.
