Downloading Audiobooks with Multithreaded Python

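Seasoned veterans (the unmarried ones, at least) rent a place near the office, sparing themselves the commute and saving a great deal of time. Others, for one reason or another, have to shuttle between home and the office every day, leaving before dawn and returning after dark, and unfortunately I am one of them. Because my home is far from the office, the time I spend on the bus amounts to a quarter of my working hours. On top of that, Hangzhou has long been known as China's Las Vegas (a "jam city" rather than a "gambling city"), and whenever traffic locks up I can picture myself turning into a Transformer. I suspect no programmer can stand wasting that much time, but since I can't change my living situation in the short term, I might as well put the time to good use. So I bought a large-screen Note II for reading PDFs. My ears shouldn't sit idle either, though I listen to novels rather than English lessons. Back in school I loved listening to the radio, especially storytelling and crosstalk, so I need a large supply of audiobooks. There are plenty of these online nowadays, but downloading them is extremely tedious: to squeeze out more traffic and ad clicks, these sites make you open at least two pages before you reach the real download link. To cut the total download time, I wrote this small program so that I, and everyone else, can download audiobooks easily (and, of course, any other kind of resource).

First, a disclaimer: I'm not out to crawl large amounts of material or data; this is purely for fun and learning. So I won't aimlessly crawl every link on a site. Instead I start from one given novel. Say I want to download the novel 《童年》 (Childhood): I find its home page on 我听评书网 (5tps), then use the program to download all of its mp3 audio files. The details are in the code below; all of it lives in the module crawler5tps: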

1. Set the start URL and the directory where files are saved

# -*- coding: GBK -*-
import urllib, urllib2
import re, threading, os


baseurl = 'http://www.5tps.com'   # base url
down2path = 'E:/enovel/'          # saving path
save2path = ''                    # saving file name (full path)

2. Parse the download-page URLs out of the start URL

def parseUrl(starturl):
    '''
    parse out download page from start url.
    eg. we can get 'http://www.5tps.com/down/8297_52_1_1.html' from 'http://www.5tps.com/html/8297.html'
    '''
    global save2path
    rDownloadUrl = re.compile(r".*?<A href='(/down/\w+\.html)'.*")  # find the link of download page
    # rTitle = re.compile("<TITLE>.{4}\s{1}(.*)\s{1}.*</TITLE>")
    # <TITLE>有声小说 闷骚1 播音:刘涛 全集</TITLE>
    f = urllib2.urlopen(starturl)
    totalLine = f.readlines()
    f.close()

    # create the name of saving file; totalLine[3] is expected to be the
    # <TITLE> line, whose second space-separated field is the novel's title
    title = totalLine[3].split(" ")[1]
    if not os.path.exists(down2path + title):
        os.mkdir(down2path + title)
    # set the save path even when the directory already exists
    save2path = down2path + title + "/"

    downUrlLine = [line for line in totalLine if rDownloadUrl.match(line)]
    downLoadUrl = []
    for dl in downUrlLine:
        # a line may hold several links: take the first, blank it out, match again
        while True:
            m = rDownloadUrl.match(dl)
            if not m:
                break
            downUrl = m.group(1)
            downLoadUrl.append(downUrl.strip())
            dl = dl.replace(downUrl, '')
    return downLoadUrl
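The match-and-blank-out loop in step 2 can also be expressed with `findall`, which returns every match in one call. The following is only a minimal Python 3 sketch, not the code from crawler5tps: the name `extract_download_pages` and the sample markup are made up for illustration, and the pattern assumes the same `<A href='/down/....html'` link shape that the regular expression above targets.

```python
import re

# Same link shape the pattern in step 2 looks for: <A href='/down/8297_52_1_1.html'
DOWN_PAGE_RE = re.compile(r"<A href='(/down/\w+\.html)'")


def extract_download_pages(html):
    """Return every relative download-page URL in html, in document order."""
    return DOWN_PAGE_RE.findall(html)


# Hypothetical sample markup, for illustration only.
sample = (
    "<li><A href='/down/8297_52_1_1.html' target=_blank>001</A></li>"
    "<li><A href='/down/8297_52_1_2.html' target=_blank>002</A></li>"
)
pages = extract_download_pages(sample)
```

Because `findall` scans the whole string, this variant does not need to process the page line by line.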

3. Parse the real download links out of the download pages

def getDownlaodLink(starturl):
    '''
    find out the real download link from download page.
    eg. we can get the download link
    'http://180j-d.ysts8.com:8000/人物纪实/童年/001.mp3?1251746750178x1356330062x1251747362932-3492f04cf54428055a110a176297d95a'
    from 'http://www.5tps.com/down/8297_52_1_1.html'
    '''
    downUrl = []
    gbk_ClickWord = '点此下载'   # link text on the page ("click here to download")
    downloadUrl = parseUrl(starturl)
    rDownUrl = re.compile('<a href=\"(.*)\"><font color=\"blue\">' + gbk_ClickWord + '.*</a>')  # find the real download link
    for url in downloadUrl:
        realurl = baseurl + url
        print realurl
        for line in urllib2.urlopen(realurl).readlines():
            m = rDownUrl.match(line)
            if m:
                downUrl.append(m.group(1))

    return downUrl
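Step 3 compares raw page bytes against a GBK-encoded source string. In Python 3 the same idea would need an explicit decode, roughly as sketched below. This is an illustration under assumptions: the page is taken to be GBK-encoded (suggested by the source-file encoding and the `gbk_ClickWord` name), `find_real_link` is a hypothetical helper, the sample line and `example.invalid` URL are invented, and `search` replaces `match` so the link need not start the line.

```python
import re

CLICK_WORD = '点此下载'  # link text that marks the real download link
REAL_LINK_RE = re.compile(r'<a href="([^"]+)"><font color="blue">' + CLICK_WORD)


def find_real_link(raw_line, encoding='gbk'):
    """Decode one raw line of a download page and return the mp3 URL, or None."""
    text = raw_line.decode(encoding, errors='replace')
    m = REAL_LINK_RE.search(text)
    return m.group(1) if m else None


# Hypothetical line, encoded the way a GBK page would arrive over the wire.
raw = ('<td><a href="http://example.invalid/001.mp3?k=1">'
       '<font color="blue">点此下载</font></a></td>').encode('gbk')
link = find_real_link(raw)
```

Using `[^"]+` instead of `.*` keeps the captured URL from running past the closing quote when a line contains more than one attribute.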

4. Define the download function

def download(url, filename):
    ''' download mp3 file '''
    print url
    urllib.urlretrieve(url, filename)
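For readers on Python 3, where `urllib.urlretrieve` no longer exists under that name, a download function might look like the sketch below. It is not the article's implementation: it swaps in `urllib.request.urlopen` with a streamed copy, and the `timeout` parameter and the `.part` temporary suffix are assumptions added here so that an interrupted transfer is not mistaken for a finished file. The demonstration uses a local `file://` URL so that it runs without network access.

```python
import os
import shutil
import tempfile
import urllib.request
from pathlib import Path


def download(url, filename, timeout=30):
    """Stream url into filename, writing to a .part file until the copy completes."""
    part = filename + '.part'
    with urllib.request.urlopen(url, timeout=timeout) as resp, open(part, 'wb') as out:
        shutil.copyfileobj(resp, out)
    os.replace(part, filename)  # rename only after the whole body was written
    return filename


# Local demonstration with a file:// URL, so no network is needed.
work_dir = tempfile.mkdtemp()
source = Path(work_dir) / 'source.bin'
source.write_bytes(b'fake mp3 bytes')
dest = download(source.as_uri(), os.path.join(work_dir, '001.mp3'))
```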

5. Create the thread class used to download files

class DownloadThread(threading.Thread):
    ''' download thread class '''
    def __init__(self, url, savePath):
        threading.Thread.__init__(self)
        self.url = url              # real download link of one mp3
        self.savePath = savePath    # full path of the file to write

    def run(self):
        download(self.url, self.savePath)
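Subclassing `threading.Thread` is one way to run the download in the background; the standard library also accepts the function and its arguments directly through `target` and `args`. The Python 3 sketch below shows that alternative. `fake_download` is a stand-in invented here so the example runs without touching the network, and the URL is a placeholder.

```python
import threading

results = []


def fake_download(url, filename):
    """Stand-in for download(); records its arguments instead of fetching anything."""
    results.append((url, filename))


# Pass the callable and its arguments instead of writing a Thread subclass.
t = threading.Thread(target=fake_download,
                     args=('http://example.invalid/001.mp3', '1.mp3'))
t.start()
t.join()  # block until the worker has finished
```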

6. Start downloading

if __name__ == '__main__':
    starturl = 'http://www.5tps.com/html/8297.html'
    downUrl = getDownlaodLink(starturl)
    aliveThreadDict = {}         # alive thread
    downloadingUrlDict = {}      # downloading link

    i = 0
    while i < len(downUrl):
        # Note: 5tps allows only three threads to download the same novel at
        # once, but this is sometimes affected by network conditions; to make
        # sure real mp3 files are downloaded, the thread count is set to 2 here.
        while len(downloadingUrlDict) < 2 and i < len(downUrl):  # never index past the last link
            downloadingUrlDict[i] = i
            i += 1
        for urlIndex in downloadingUrlDict.values():
            if urlIndex not in aliveThreadDict.values():
                t = DownloadThread(downUrl[urlIndex], save2path + str(urlIndex + 1) + '.mp3')
                t.start()
                aliveThreadDict[t] = urlIndex
        for (th, urlIndex) in aliveThreadDict.items():
            th.join(0.1)  # wait briefly instead of spinning the CPU
            if not th.isAlive():
                del aliveThreadDict[th]             # delete the thread slot
                del downloadingUrlDict[urlIndex]    # delete the url from url list needed to download

    # wait for the downloads that are still in flight before reporting completion
    for th in aliveThreadDict.keys():
        th.join()
    print 'Completed Download Work'
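The two-slot scheduling in step 6 can also be delegated to a thread pool. The sketch below uses Python 3's `concurrent.futures.ThreadPoolExecutor` with `max_workers=2`, matching the limit of two simultaneous downloads chosen above. It is an alternative under assumptions, not the article's code: `fake_download` merely sleeps and counts overlapping calls, and the URLs are placeholders.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 2  # same limit of two simultaneous downloads

_lock = threading.Lock()
_active = 0
peak = 0  # highest number of downloads observed running at once


def fake_download(url, filename):
    """Stand-in for download(); sleeps briefly and tracks how many calls overlap."""
    global _active, peak
    with _lock:
        _active += 1
        peak = max(peak, _active)
    time.sleep(0.05)
    with _lock:
        _active -= 1
    return filename


urls = ['http://example.invalid/%03d.mp3' % n for n in range(1, 6)]  # hypothetical links
names = ['%d.mp3' % n for n in range(1, 6)]

with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    # map() yields results in input order; leaving the with-block waits for all tasks.
    saved = list(pool.map(fake_download, urls, names))
```

Because the executor owns the worker threads, the manual bookkeeping with two dictionaries is not needed in this variant.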

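That's all there is to it. Let it download to its heart's content; I still have other projects to code, sigh.

After work I copy the files to the Note, and then I can listen to novels while reading documents. Finally, the source code is attached.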

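Category: python

Tags: crawler, python multithreading

Author: Leo_wl

Source: http://www.cnblogs.com/Leo_wl/

The copyright of this article is shared by the author and 博客园 (cnblogs). Reposting is welcome, but unless the author agrees otherwise, this notice must be kept and a link to the original article must appear in a prominent place on the page; otherwise the right to pursue legal liability is reserved.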

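Disclaimer: this article comes from the internet and does not represent the views of 【好得很程序员自学网】. When reposting, please credit the source: http://haodehen.cn/did47408

Updated: 2022-09-24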