Using TaskManager to Crawl 20,000 Proxy IPs and Implement Automated Voting

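1. Can one person cast multiple votes? If not, what prevents it?

Answer: The voting site allows only one vote per IP address or per user, to prevent malicious vote stuffing.

2. If the limit is one vote per IP, does that mean multiple IPs can cast multiple votes?

Answer: Yes.

3. How can code change the IP address a request appears to come from?

Answer: Set a proxy IP when making the HTTP request.

4. Where do multiple proxy IPs come from, and once I have them, how do I automate the voting in code?

Answer: See the rest of this article.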

This article covers the implementation details of TaskManager's built-in proxy IP crawler task. Prerequisites: parsing HTML with HtmlAgilityPack, and Quartz.NET.

About proxy IPs

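Baidu Baike's definition: a proxy, also called a network proxy, is a special network service that allows one network endpoint (usually a client) to connect indirectly to another network endpoint (usually a server) through this service. Some network devices such as gateways and routers have proxy capabilities. Proxy services are generally considered to help protect the privacy or security of network endpoints and to guard against attacks.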

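Many vendors now offer proxy IPs online, but most provide only a few dozen for trial, and using more requires payment. I found a site that lists a large number of proxy IPs. You can search Baidu for [proxy IP] yourself (so nobody thinks I am advertising), or refer to the article introducing the open-source TaskManager.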

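With this many proxy IPs available online, question 4 from the start of the article is solved. One problem remains: the data lives on web pages, so how do I use it from code? This is where the HtmlAgilityPack package comes in. As the name suggests, it is used to parse HTML.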

Using HtmlAgilityPack

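HtmlAgilityPack is an open-source library for parsing HTML elements. Its main feature is that it can query HTML with XPath, so if you have worked with XML in C# before, HtmlAgilityPack will feel familiar.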

Parsing simple HTML

// Requires: using System; using HtmlAgilityPack;
string html = @"<html><head><title>Simple parsing test</title></head><body>
  <div id='div1' title='div1'>
    <table>
      <tr> <td>1</td> <td title='cn'>cn</td> </tr>
    </table>
  </div>
</body></html>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// Print the page title
Console.WriteLine("Page title: " + doc.DocumentNode.SelectSingleNode("/html/head/title").InnerText);

// Get the div1 node, approach 1: by id
HtmlNode divNode1 = doc.GetElementbyId("div1");
// Get the div1 node, approach 2: by XPath
HtmlNode divNode2 = doc.DocumentNode.SelectSingleNode("//div[@id='div1']");
// Check whether node 1 and node 2 are the same
Console.WriteLine("Are node 1 and node 2 the same: " + (divNode1 == divNode2));

// Get every table on the page
HtmlNodeCollection tableCollection = doc.DocumentNode.SelectNodes("//table");
Console.WriteLine("Number of tables on the page: " + tableCollection.Count);

// Get every td under the first table and print its details
HtmlNodeCollection tdCollection = tableCollection[0].SelectNodes("tr/td");
foreach (var td in tdCollection)
{
    HtmlAttribute atr = td.Attributes["title"];
    Console.WriteLine("td InnerText: " + td.InnerText + " | td title attribute: " + (atr == null ? "" : atr.Value));
}
Console.Read();

Implementing the proxy IP crawler

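With a few basic HtmlAgilityPack operations in hand, we can move on to the actual crawl. The site being crawled blocks IPs (if the request rate is too high within a period of time, it blocks the current IP), so I designed the crawler to switch to a different proxy IP automatically after every five fetches to get around the site's limit (I feel a bit naughty).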
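The "switch proxy every five fetches" idea can be sketched as a small counter-based helper. This is only an illustration under assumptions: `ProxyRotator`, its constructor parameters, and the sample addresses are hypothetical names and are not taken from the TaskManager source.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of TaskManager): hands out the same proxy
// for a fixed number of requests, then moves on to the next one in the list.
public class ProxyRotator
{
    private readonly IList<string> _proxies;
    private readonly int _requestsPerProxy;
    private int _index;          // position of the proxy currently in use
    private int _usedOnCurrent;  // requests already served by that proxy

    public ProxyRotator(IList<string> proxies, int requestsPerProxy)
    {
        if (proxies == null || proxies.Count == 0)
            throw new ArgumentException("At least one proxy is required.", "proxies");
        if (requestsPerProxy <= 0)
            throw new ArgumentOutOfRangeException("requestsPerProxy");
        _proxies = proxies;
        _requestsPerProxy = requestsPerProxy;
    }

    // Returns the proxy to use for the next request and counts that request.
    public string Next()
    {
        if (_usedOnCurrent == _requestsPerProxy)
        {
            _index = (_index + 1) % _proxies.Count; // wrap around at the end
            _usedOnCurrent = 0;
        }
        _usedOnCurrent++;
        return _proxies[_index];
    }
}

public static class RotatorDemo
{
    public static void Main()
    {
        // Placeholder addresses, for illustration only.
        var rotator = new ProxyRotator(new List<string> { "10.0.0.1:8080", "10.0.0.2:8080" }, 5);
        for (int i = 1; i <= 12; i++)
        {
            Console.WriteLine("request {0} -> {1}", i, rotator.Next());
        }
    }
}
```

The helper is not thread-safe; if several worker threads shared one rotator, calls to `Next()` would need a lock.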

Overall implementation logic

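In .NET, WebRequest can simulate HTTP GET and POST requests. Most importantly, it lets you set the proxy IP used for the request. Pay particular attention to the lines that create the WebProxy and assign it to the request (marked in red in the original post).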

/// <summary>
/// Proxy usage example
/// </summary>
/// <param name="url">Address of the page to request</param>
/// <param name="type">Name of the page encoding, passed to Encoding.GetEncoding</param>
/// <returns>The page HTML, or an empty string if the request fails</returns>
public static string GetUrlToHtml(string url, string type)
{
    try
    {
        System.Net.WebRequest wReq = System.Net.WebRequest.Create(url);
        WebProxy myProxy = new WebProxy("192.168.15.11", 8015);
        // Establish the connection (a user name and password are needed only if the proxy requires authentication)
        myProxy.Credentials = new NetworkCredential("admin", "123456");
        // Make the request go through the proxy
        wReq.Proxy = myProxy;
        // Get the response instance and dispose it once the body has been read.
        using (System.Net.WebResponse wResp = wReq.GetResponse())
        using (System.IO.Stream respStream = wResp.GetResponseStream())
        using (System.IO.StreamReader reader = new System.IO.StreamReader(respStream, Encoding.GetEncoding(type)))
        {
            return reader.ReadToEnd();
        }
    }
    catch (System.Exception ex)
    {
        //errorMsg = ex.Message;
    }
    return "";
}
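The example above hard-codes the proxy address. When proxies arrive as "ip:port" strings, they first have to be split and validated before a `WebProxy` can be built. The sketch below shows one way to do that; `ProxyParser` and `TryCreate` are hypothetical names introduced here, and the code assumes IPv4 addresses only.

```csharp
using System;
using System.Net;

// Hypothetical helper (not from the article's source): converts an "ip:port"
// string into a WebProxy, rejecting malformed input instead of throwing.
public static class ProxyParser
{
    public static bool TryCreate(string proxyIp, out WebProxy proxy)
    {
        proxy = null;
        if (string.IsNullOrWhiteSpace(proxyIp)) return false;

        // Expect exactly one colon, so IPv6 literals are rejected by this sketch.
        string[] parts = proxyIp.Trim().Split(':');
        if (parts.Length != 2) return false;

        IPAddress address;
        int port;
        if (!IPAddress.TryParse(parts[0], out address)) return false;
        if (!int.TryParse(parts[1], out port) || port < 1 || port > 65535) return false;

        proxy = new WebProxy(address.ToString(), port);
        return true;
    }
}

public static class ProxyParserDemo
{
    public static void Main()
    {
        WebProxy proxy;
        if (ProxyParser.TryCreate("192.168.15.11:8015", out proxy))
        {
            // The proxy could now be assigned to WebRequest.Proxy.
            Console.WriteLine("Proxy address: " + proxy.Address);
        }
        Console.WriteLine("Malformed input accepted: " + ProxyParser.TryCreate("not-a-proxy", out proxy));
    }
}
```

Validating up front keeps a single bad row in a scraped list from surfacing later as an exception in the middle of a request.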

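Now that we know how to use a proxy IP, we are one step closer to the goal. Next comes fetching the proxy IPs themselves. Since there is a fair amount of code, only the important parts are shown here; the ipproxyget.cs source can be downloaded at the end of the article.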

/// <summary>
/// Get the total number of pages
/// </summary>
/// <returns>Total number of pages</returns>
private static int GetTotalPage(string ipUrl, string proxyIp)
{
    var doc = new HtmlDocument();
    doc.LoadHtml(GetHtml(ipUrl, proxyIp));
    var res = doc.DocumentNode.SelectNodes(@"//div[@class='pagination']/a");
    if (res != null && res.Count > 2)
    {
        int page;
        // The second-to-last pagination link holds the last page number
        if (int.TryParse(res[res.Count - 2].InnerText, out page))
        {
            return page;
        }
    }
    // Fall back to a single page if the pagination block cannot be read
    return 1;
}
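To see why the method reads the second-to-last link, it helps to run the same XPath against a small pagination block. The markup below is made up for illustration and assumes the final link is a "next page" link, which the article does not show; `PaginationDemo` and `ParseTotalPage` are hypothetical names. The sketch needs the HtmlAgilityPack package.

```csharp
using System;
using HtmlAgilityPack;

public static class PaginationDemo
{
    // Returns the number found in the second-to-last pagination link,
    // or 1 when the pagination block is missing or not numeric.
    public static int ParseTotalPage(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        // SelectNodes returns null (not an empty collection) when nothing matches.
        var links = doc.DocumentNode.SelectNodes("//div[@class='pagination']/a");
        if (links != null && links.Count > 2)
        {
            int page;
            if (int.TryParse(links[links.Count - 2].InnerText.Trim(), out page))
            {
                return page;
            }
        }
        return 1;
    }

    public static void Main()
    {
        // Assumed layout: numbered links followed by a trailing "Next" link.
        string html = "<div class='pagination'>"
                    + "<a href='/1'>1</a><a href='/2'>2</a><a href='/730'>730</a><a href='/2'>Next</a>"
                    + "</div>";
        Console.WriteLine("Total pages: " + ParseTotalPage(html));
        Console.WriteLine("No pagination block: " + ParseTotalPage("<p>empty</p>"));
    }
}
```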

Parsing the HTML of each page

/// <summary>
/// Parse the data on each page
/// </summary>
/// <param name="param">Hashtable holding the page range, the result list, and the crawl settings</param>
// Note: the casing of the custom types and members (IpProxy, ProxyParam, ...) is reconstructed by assumption.
private static void DoWork(object param)
{
    // Restore the parameters
    Hashtable table = param as Hashtable;
    int start = Convert.ToInt32(table["start"]);
    int end = Convert.ToInt32(table["end"]);
    List<IpProxy> list = table["list"] as List<IpProxy>;
    // Named proxyParam so it does not clash with the method parameter
    ProxyParam proxyParam = table["param"] as ProxyParam;

    // Page address
    string url = string.Empty;
    string ip = string.Empty;
    IpProxy item = null;
    HtmlNodeCollection nodes = null;
    HtmlNode node = null;
    HtmlAttribute atr = null;

    for (int i = start; i <= end; i++)
    {
        LogHelper.WriteLog(string.Format("Start parsing, pages {0}~{1}, current page {2}", start, end, i));
        url = string.Format("{0}/{1}", proxyParam.IpUrl, i);
        var doc = new HtmlDocument();
        doc.LoadHtml(GetHtml(url, proxyParam.ProxyIp));

        // Get all data rows (tr)
        var trs = doc.DocumentNode.SelectNodes(@"//table[@id='ip_list']/tr");
        if (trs != null && trs.Count > 1)
        {
            LogHelper.WriteLog(string.Format("Current page {0}, request URL {1}, {2} rows", i, url, trs.Count));
            // Row 0 is the table header, so start at 1
            for (int j = 1; j < trs.Count; j++)
            {
                nodes = trs[j].SelectNodes("td");
                if (nodes != null && nodes.Count > 9)
                {
                    ip = nodes[2].InnerText.Trim();
                    if (proxyParam.IsPingIp && !Ping(ip))
                    {
                        continue;
                    }
                    // Only reachable IPs are added
                    item = new IpProxy();
                    node = nodes[1].FirstChild;
                    if (node != null)
                    {
                        atr = node.Attributes["alt"];
                        if (atr != null)
                        {
                            item.Country = atr.Value.Trim();
                        }
                    }
                    item.Ip = ip;
                    item.Port = nodes[3].InnerText.Trim();
                    item.ProxyIp = GetIp(item.Ip, item.Port);
                    item.Position = nodes[4].InnerText.Trim();
                    item.Anonymity = nodes[5].InnerText.Trim();
                    item.Type = nodes[6].InnerText.Trim();
                    node = nodes[7].SelectSingleNode("div[@class='bar']");
                    if (node != null)
                    {
                        atr = node.Attributes["title"];
                        if (atr != null)
                        {
                            item.Speed = atr.Value.Trim();
                        }
                    }
                    node = nodes[8].SelectSingleNode("div[@class='bar']");
                    if (node != null)
                    {
                        atr = node.Attributes["title"];
                        if (atr != null)
                        {
                            item.ConnectTime = atr.Value.Trim();
                        }
                    }
                    item.VerifyTime = nodes[9].InnerText.Trim();
                    // The list may be shared by several worker threads, and List<T> is not thread-safe
                    lock (list)
                    {
                        list.Add(item);
                    }
                }
            }
            LogHelper.WriteLog(string.Format("Current page {0}, {1} rows", i, trs.Count));
        }
        LogHelper.WriteLog(string.Format("Finished parsing, pages {0}~{1}, current page {2}", start, end, i));
    }
}
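DoWork receives a start page and an end page, which suggests the total page count is divided among several workers. The article does not show that step, so the following is only a sketch of one way the inclusive ranges could be computed; `PagePartitioner` and `Split` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: divides pages 1..totalPages into contiguous,
// inclusive ranges, one per worker, as evenly as possible.
public static class PagePartitioner
{
    public static List<Tuple<int, int>> Split(int totalPages, int workerCount)
    {
        var ranges = new List<Tuple<int, int>>();
        if (totalPages <= 0 || workerCount <= 0) return ranges;

        // Never create more workers than there are pages.
        int workers = Math.Min(workerCount, totalPages);
        int baseSize = totalPages / workers;
        int remainder = totalPages % workers;

        int start = 1;
        for (int w = 0; w < workers; w++)
        {
            // The first 'remainder' workers take one extra page each.
            int size = baseSize + (w < remainder ? 1 : 0);
            int end = start + size - 1;
            ranges.Add(Tuple.Create(start, end));
            start = end + 1;
        }
        return ranges;
    }
}

public static class PartitionDemo
{
    public static void Main()
    {
        foreach (var range in PagePartitioner.Split(10, 3))
        {
            // Each pair would become the "start" and "end" entries of one worker's Hashtable.
            Console.WriteLine("{0}~{1}", range.Item1, range.Item2);
        }
    }
}
```

With 10 pages and 3 workers this yields the ranges 1~4, 5~7, and 8~10, so every page is covered exactly once.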

In the end, more than 20,000 records are collected.

A simple automated-voting implementation

Here the .NET WebBrowser control is used to load the page. The button handlers of the form are as follows.

// Note: the casing of IEProxy and its members is reconstructed by assumption.
#region Set the proxy IP
private void button2_Click(object sender, EventArgs e)
{
    string proxy = this.textBox1.Text;
    RefreshIESettings(proxy);
    IEProxy ie = new IEProxy(proxy);
    ie.RefreshIESettings();
    //MessageBox.Show(ie.RefreshIESettings().ToString());
}
#endregion

#region Clear the proxy IP
private void button3_Click(object sender, EventArgs e)
{
    IEProxy ie = new IEProxy(null);
    ie.DisableIEProxy();
}
#endregion

#region Open the web page
private void button1_Click(object sender, EventArgs e)
{
    string url = txt_url.Text.Trim();
    if (string.IsNullOrEmpty(url))
    {
        MessageBox.Show("Please enter the URL to open");
        return;
    }
    this.webBrowser1.Navigate(url, null, null, null);
}
#endregion

Summary

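That is everything for this article. A few words on what I am hoping for: I would love for anyone interested to help improve TaskManager (it is fully open source), add many new tasks, and turn it into a tool that makes everyday life more convenient. For example: if rain or snow is forecast for the next day, send an email reminder to bring an umbrella. Now it is time to release the source code. Stay tuned for the next article!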


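Disclaimer: This article comes from the internet and does not represent the position of 【好得很程序员自学网】. Please credit the source when reposting: http://haodehen.cn/did57488

Updated: 2022-09-26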