Using BeautifulSoup, Python's Powerful Scraping Tool, Explained Through a Video-Scraping Example

1. Installing BeautifulSoup4
Install with easy_install (easy_install itself must be installed first):

easy_install beautifulsoup4

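Install with pip (pip must also be installed first). Note that PyPI also hosts a package named BeautifulSoup; that is the Beautiful Soup 3 release, and installing it is not recommended here.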

pip install beautifulsoup4

Install on Debian or Ubuntu (for Python 3 the package is named python3-bs4):

apt-get install python-bs4

You can also install from source: download the BS4 source code, then run

python setup.py install
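
Whichever install method is used, a quick import check confirms that the package is visible to the interpreter. The snippet below is a minimal sketch, assuming Python 3 and the standard library's "html.parser"; the sample markup is made up for illustration.

```python
# Quick sanity check that Beautiful Soup 4 is importable and can parse markup.
import bs4
from bs4 import BeautifulSoup

# "html.parser" is the parser bundled with the Python standard library.
soup = BeautifulSoup("<p class='greeting'>hello <b>soup</b></p>", "html.parser")

print(bs4.__version__)      # version of the installed package
print(soup.p.get_text())    # text of the first <p>, including nested tags
print(soup.p["class"])      # class is a multi-valued attribute, so this is a list
```

If the import fails with ModuleNotFoundError, the package was most likely installed for a different Python interpreter than the one running the script.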

2. A quick first try

# coding=utf-8
'''
Download images from a Baidu Tieba thread with BeautifulSoup (Python 3)
'''
import urllib.request
from bs4 import BeautifulSoup
url = 'http://tieba.baidu.com/p/3537654215'

# Download the page
html = urllib.request.urlopen(url)
content = html.read()
html.close()

# Use BeautifulSoup to match the images; name the parser explicitly
html_soup = BeautifulSoup(content, 'html.parser')
# The image markup was already analyzed in [Python Crawler Basics 1: urllib]( http://blog.xiaolud.com/2015/01/22/spider-1st/ "Python爬虫基础1--urllib")
# Compared with matching by regular expression, BeautifulSoup offers a simpler, more flexible way
all_img_links = html_soup.find_all('img', class_='BDE_Image')

# Then the familiar part: download each image, numbering the files from 1
for img_counter, img_link in enumerate(all_img_links, 1):
  img_name = '%s.jpg' % img_counter
  urllib.request.urlretrieve(img_link['src'], img_name)

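The script is short, and its comments explain each step. Compared with regular expressions, BeautifulSoup gives a simpler and more flexible way to analyze a site's source and get to the image links quickly.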
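
The matching step can be tried without any network access. The following is a minimal sketch, assuming Python 3; the sample markup, the example.com URLs, and the helper name extract_image_urls are all hypothetical and are not taken from the real Tieba page. It also shows two details the script above leaves out: skipping <img> tags that have no src, and resolving relative links against the page URL.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical markup imitating a forum post; not copied from a real page.
SAMPLE_HTML = """
<div class="post">
  <img class="BDE_Image" src="/img/a.jpg">
  <img class="avatar" src="/img/face.png">
  <img class="BDE_Image big" src="http://img.example.com/b.png">
  <img class="BDE_Image">
</div>
"""

def extract_image_urls(html, base_url, css_class="BDE_Image"):
    """Return absolute URLs of <img> tags that carry css_class and a src."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    # class_ matches a tag if any one of its classes equals css_class
    for img in soup.find_all("img", class_=css_class):
        src = img.get("src")  # None when the attribute is missing
        if src:
            urls.append(urljoin(base_url, src))
    return urls

image_urls = extract_image_urls(SAMPLE_HTML, "http://example.com/p/1")
print(image_urls)
```

Using img.get("src") rather than img["src"] avoids a KeyError on tags without that attribute, which is a common cause of a download loop stopping halfway.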

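3. Scraping example
3.1 Basic scraping technique
When writing a scraper script, the first step is to inspect the target page by hand and work out how the data can be located.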
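
Manual inspection can be backed up with a small tally of which tag and class combinations repeat on a page, since list entries usually share one. This is only a sketch under stated assumptions: the markup below is a hypothetical listing page, and tally_tag_classes is an illustrative helper, not part of BeautifulSoup.

```python
from collections import Counter
from bs4 import BeautifulSoup

# Hypothetical listing page; the real page's markup is not reproduced here.
SAMPLE_HTML = """
<div id="content">
  <div class="item"><a href="/video/1">Talk one</a></div>
  <div class="item"><a href="/video/2">Talk two</a></div>
  <div class="item"><a href="/video/3">Talk three</a></div>
  <div class="footer"><a href="/about">About</a></div>
</div>
"""

def tally_tag_classes(html):
    """Count how often each (tag name, CSS class) pair occurs in the markup."""
    soup = BeautifulSoup(html, "html.parser")
    counts = Counter()
    for tag in soup.find_all(True):  # True matches every tag
        for css_class in tag.get("class", []):
            counts[(tag.name, css_class)] += 1
    return counts

counts = tally_tag_classes(SAMPLE_HTML)
for (name, css_class), n in counts.most_common():
    print(name, css_class, n)
```

A pair that occurs many times is a good candidate for the selector passed to find_all when extracting the list entries.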

First, look at the list of PyCon conference videos at http://pyvideo.org/category/50/pycon-us-2014. Inspecting the page's HTML source shows that the video list looks roughly like this:

...


Last updated: 2022-10-19