How do I write a Python program to scrape the Sina Military forum?

Replies:

context_re = r'(.*?)
'
The regular expression you prepared is truncated! It breaks off at this point, so it can only scrape the first paragraph.
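To illustrate the problem, here is a minimal sketch. The HTML fragment, the `<br />` separator, and the `cont f14` wrapper used as delimiters are assumptions for illustration, because the tags of the original pattern are not visible above. A non-greedy group stops at the first occurrence of whatever follows it, so a pattern that ends at a paragraph break captures only the first paragraph. Ending the pattern at the container's closing tag and passing `re.S` (so that `.` also matches newlines) captures the whole post.

```python
import re

# Hypothetical post markup; the real forum's HTML may differ.
html = ('<div class="cont f14">First paragraph.<br />\n'
        'Second paragraph.<br />\n'
        'Third paragraph.</div>')

# A pattern that ends at the first line break captures one paragraph only.
truncated_re = r'<div class="cont f14">(.*?)<br />'
first_only = re.search(truncated_re, html).group(1)

# A pattern that ends at the closing tag; re.S lets "." match newlines too.
full_re = r'<div class="cont f14">(.*?)</div>'
whole_post = re.search(full_re, html, re.S).group(1)

# Split the captured body on the <br /> separators to recover paragraphs.
paragraphs = [p.strip() for p in re.split(r'<br\s*/?>', whole_post) if p.strip()]

print(first_only)
print(paragraphs)
```

This only works while the post body contains no nested `<div>`; for real pages an HTML parser is usually the safer choice.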

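Scraping the Sina Military forum takes three steps:

1.

Go to Wang Hai's column on CSDN, http://blog.csdn.net/column/details/why-bug.html, and learn a thing or two.

2.

Press F12 and take a look at the front-end markup.

3.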

from bs4 import BeautifulSoup
import requests

# The hard-coded thread URL
response = requests.get("http://club.mil.news.sina.com.cn/thread-666013-1-1.html?retcode=0")
response.encoding = 'gb18030'  # Chinese encoding used by the page
soup = BeautifulSoup(response.text, 'html.parser')  # build the BeautifulSoup object
divs = soup('div', 'mainbox')  # one div per floor (post)
for div in divs:
    comments = div.find_all('div', 'cont f14')  # body text of each post
    # Append to the output file; UTF-8 keeps the Chinese text intact
    with open('Sina_Military_Club.txt', 'a', encoding='utf-8') as f:
        f.write('\n' + str(comments) + '\n')

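Just a few hours ago I happened to be writing a small program that scrapes member (company) profiles from a website.
I won't answer the specific programming question; it has nothing to do with which language you write the code in. The key is to analyze the HTML structure of the page and write suitable regular expressions to match it. To simplify things, you can match in several passes (for example, first extract an enclosing element, then take the first matching element inside it, whose content is the address of the original post, and process that further).
I can't help with the big-data analysis, though; pointers are welcome.
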
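The multi-pass matching described above can be sketched roughly as follows. The page fragment, the `threadlist` class, and the helper name `first_link_in_block` are all hypothetical, chosen only to show the idea: the first pass isolates one block of the page, and the second pass searches only inside that block.

```python
import re

# Hypothetical page fragment; tag and class names are illustrative only.
html = """
<div class="nav"><a href="/index.html">Home</a></div>
<div class="threadlist">
  <a href="/thread-1-1-1.html">First thread</a>
  <a href="/thread-2-1-1.html">Second thread</a>
</div>
"""

def first_link_in_block(page, block_class):
    """Pass 1: isolate the block. Pass 2: take the first href inside it."""
    block = re.search(
        r'<div class="%s">(.*?)</div>' % re.escape(block_class), page, re.S
    )
    if block is None:
        return None
    link = re.search(r'<a\s+href="([^"]+)"', block.group(1))
    return link.group(1) if link else None

print(first_link_in_block(html, "threadlist"))
```

Narrowing the search area first keeps each pattern short and avoids accidentally matching the navigation link that appears earlier on the page.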
import requests
from bs4 import BeautifulSoup

r = requests.get("http://club.mil.news.sina.com.cn/thread-666013-1-1.html")
r.encoding = r.apparent_encoding  # let requests guess the page encoding
soup = BeautifulSoup(r.text, "html.parser")  # name the parser explicitly
result = soup.find(attrs={"class": "cont f14"})  # first post body on the page
print(result.text)

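Use BeautifulSoup instead; that many regular expressions give me a headache just looking at them.

I started by scraping the data with BeautifulSoup: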
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

url = "http://club.mil.news.sina.com.cn/viewthread.php?tid=666013&extra=page%3D1&page=1"
req = requests.get(url)
req.encoding = req.apparent_encoding  # guess the page encoding
soup = BeautifulSoup(req.text, "html.parser")

# Write each post body under a numbered separator line ("评论" means "comment").
with open("sina_club.txt", "w", encoding="utf-8") as f:
    posts = soup.find_all("div", attrs={"class": "cont f14"})
    for x, tag in enumerate(posts, start=1):
        f.write("---------------评论" + str(x) + "---------------------\n")
        f.write(tag.get_text() + "\n")

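Sigh, scrape away if you must. If you publish a paper, could you tell me the journal issue and page numbers so I can read it? We haven't even done any big-data analysis ourselves...

I suggest trying a regex testing tool.

You need pyquery, which lets you use jQuery-like syntax. You deserve it.
https://pythonhosted.org/pyquery/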

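Notice: this article comes from the internet and does not represent the position of 好得很程序员自学网. When reposting, please credit the source: http://haodehen.cn/did83335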

Last updated: 2022-10-19