Saving data from a PHP crawler (PHP crawler framework phpspider)

Many webmasters are unsure how to save the data a PHP crawler collects. This article gathers several answers on the topic; the details follow.

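Contents:
1. How to save scraped web page data to a database in PHP
2. How to get started with PHP crawlers
3. Asking PHP crawler experts for help
4. How to build a web crawler with PHP
5. Fetching HTML content on a timer in PHP, then saving and reading it

How to save scraped web page data to a database in PHP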

Here is an example.

// $cate_row, $cate, Build_dir(), GetFileExtName() and SaveRemoteImg() are defined elsewhere in the original project.
// The mysql_* functions were removed in PHP 7.0, so this code runs only on PHP 5.x as written.
if ($pro_list_contents = @file_get_contents(''))  // the list page URL is blank in the source
{
    include_once("inc/db_connections.php");  // open the database connection once

    // Split the list page into one HTML chunk per product.
    preg_match_all("/<td width=\"50%\" valign=\"top\">(.*)<td width=\"10\"><img src=\"images\/spacer.gif\"/isU", $pro_list_contents, $pro_list_contents_ary);

    for ($i = 0; $i < count($pro_list_contents_ary[1]); $i++)
    {
        // Detail page URL, image URL and price.
        preg_match_all("/<a href=\"(.*)\"><img src=\"(.*)\".*<span>(.*)<\/span>/isU", $pro_list_contents_ary[1][$i], $url_img_price);
        if (empty($url_img_price[1])) continue;  // chunk did not match; skip it
        $url = addslashes($url_img_price[1][0]);
        $img = str_replace(' ', '%20', trim('' . $url_img_price[2][0]));  // encode spaces in the image URL
        $price = (float)str_replace('$', '', $url_img_price[3][0]);

        // Product name.
        preg_match_all("/<a class=\"ml1\" href=\".*\">(.*)<\/a>/isU", $pro_list_contents_ary[1][$i], $proname_ary);
        $proname = addslashes(isset($proname_ary[1][0]) ? $proname_ary[1][0] : '');

        $rs = mysql_query("select * from pro where Url='$url' and CateId='{$cate_row['CateId']}'");  // already collected?
        if (mysql_num_rows($rs))
        {
            echo "Skipped: {$url}<br>";
            continue;
        }

        // Download the product image.
        $basedir = '/u_file/pro/img/' . date('H/');
        $save_dir = Build_dir($basedir);  // helper that creates the directory
        $ext_name = GetFileExtName($img);  // image file extension
        $SaveName = date('mdHis') . rand(10000, 99999) . '.' . $ext_name;
        if ($get_file = @file_get_contents($img))
        {
            $fp = @fopen($save_dir . $SaveName, 'w');
            @fwrite($fp, $get_file);
            @fclose($fp);
            @chmod($save_dir . $SaveName, 0777);
            @copy($save_dir . $SaveName, $save_dir . 'small_' . $SaveName);
            $imgpath = $basedir . 'small_' . $SaveName;
        }
        else
        {
            $imgpath = '';
        }

        // Fetch the detail page and extract the description.
        $p_contents = '';  // stays empty if the detail page cannot be fetched
        if ($pro_intro_contents = @file_get_contents($url))
        {
            preg_match_all("/<\/h1>(.*)<\/td><\/tr>/isU", $pro_intro_contents, $pro_intro_contents_ary);
            // Search and replacement are identical in the source, so this call is a no-op as written.
            $p_contents = addslashes(str_replace('src="', 'src="', $pro_intro_contents_ary[1][0]));
            $p_contents = SaveRemoteImg($p_contents, '/u_file/pro/intro/' . date('H/'));  // save images referenced in the remote HTML locally
        }

        $t = time();
        mysql_query("insert into pro(CateId, ProName, PicPath_0, S_PicPath_0, Price_0, Contents, AddTime, Url) values('{$cate_row['CateId']}', '$proname', '$imgpath', '$img', '$price', '$p_contents', '$t', '$url')");
        echo $url . $img . $cate . "<br>\r\n";
    }
}
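
Since the mysql_* functions used above no longer exist in PHP 7 and later, the "skip if already collected, otherwise insert" step might look roughly like the sketch below when written with PDO prepared statements, which also make addslashes unnecessary. This is only an illustration under assumptions: the function name saveProduct is hypothetical, the table is a reduced version of pro, and an in-memory SQLite database (requires the pdo_sqlite extension) stands in for the real MySQL server.

```php
<?php
// Sketch only: the table layout is a reduced, hypothetical version of `pro`,
// and an in-memory SQLite database stands in for the real MySQL server.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE pro (CateId INTEGER, ProName TEXT, Price_0 REAL, Url TEXT, AddTime INTEGER)');

/**
 * Insert a scraped product unless the same Url/CateId pair is already stored.
 * Returns true when a row was inserted, false when it was skipped.
 */
function saveProduct(PDO $pdo, int $cateId, string $name, float $price, string $url): bool
{
    $check = $pdo->prepare('SELECT COUNT(*) FROM pro WHERE Url = ? AND CateId = ?');
    $check->execute([$url, $cateId]);
    if ((int)$check->fetchColumn() > 0) {
        return false; // already collected
    }
    $insert = $pdo->prepare(
        'INSERT INTO pro (CateId, ProName, Price_0, Url, AddTime) VALUES (?, ?, ?, ?, ?)'
    );
    $insert->execute([$cateId, $name, $price, $url, time()]);
    return true;
}

// The quote in the product name needs no manual escaping.
$first  = saveProduct($pdo, 1, "Bob's widget", 9.5, 'http://example.com/p/1');
$second = saveProduct($pdo, 1, "Bob's widget", 9.5, 'http://example.com/p/1');
```

For a real MySQL server, only the DSN passed to the PDO constructor would need to change; the prepared statements stay the same.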

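How to get started with PHP crawlers

Looking at the basic requirements of a crawler:

Fetching: the most basic task is pulling pages down, so start there; problems worth optimizing will show up as you go.

Storage: fetched pages are usually saved according to some policy. A simple start is the file system, with files named by a fixed rule.

Analysis: run text analysis on the pages with whatever method you find fastest and most effective, for example regular expressions.

Presentation: if you do all this work and never display any output, there is no way to show its value.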
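
The four stages above could be sketched as follows. The helper names storePage and extractTitle are hypothetical, naming files by the SHA-1 of the URL is just one possible rule, and a literal string stands in for the fetched page so the sketch runs offline.

```php
<?php
/** Store: save raw HTML under a file name derived from the URL and return the path. */
function storePage(string $dir, string $url, string $html): string
{
    $path = rtrim($dir, '/') . '/' . sha1($url) . '.html';
    file_put_contents($path, $html);
    return $path;
}

/** Analyse: pull the <title> text out of a page with a regular expression. */
function extractTitle(string $html): ?string
{
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        return trim(html_entity_decode($m[1]));
    }
    return null;
}

// Fetch: in a real crawler this would be file_get_contents($url) or a cURL call.
$url  = 'http://example.com/';
$html = '<html><head><title> Example &amp; Co </title></head><body></body></html>';

$path  = storePage(sys_get_temp_dir(), $url, $html);  // store
$title = extractTitle(file_get_contents($path));       // analyse
echo $title, "\n";                                      // present
```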

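Asking PHP crawler experts for help

Collect the data yourself, preferably on a schedule: whenever something new is found, save it to your server. This reduces the load on the server.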
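
One way to "save only when something new is found" is to compare a hash of the freshly fetched content with the copy already on disk. The sketch below assumes that approach; the function name saveIfChanged is hypothetical, and the scheduling itself (for example a cron job that runs the script) is left out.

```php
<?php
/**
 * Write $content to $file only if it differs from what is already stored.
 * Returns true when the file was (re)written, false when nothing changed.
 */
function saveIfChanged(string $file, string $content): bool
{
    if (is_file($file) && sha1_file($file) === sha1($content)) {
        return false; // nothing new, skip the write
    }
    file_put_contents($file, $content, LOCK_EX);
    return true;
}

$file = sys_get_temp_dir() . '/crawler_demo_' . uniqid() . '.html';
$a = saveIfChanged($file, '<p>v1</p>'); // first run: written
$b = saveIfChanged($file, '<p>v1</p>'); // unchanged: skipped
$c = saveIfChanged($file, '<p>v2</p>'); // new content: written
unlink($file);
```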

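How to build a web crawler with PHP

Crawling with PHP is actually quite convenient: its regular expression support makes it easy to collect the links on a page, and fopen, file_get_contents and the libcurl functions make it easy to download page content.

The approach is to build a task queue, insert a few seed tasks, and start crawling. Crawling is a loop: take a URL from the queue, open it, extract the links it contains and insert them into the queue, and save whatever data you need. The queue can be implemented with an array.

Of course, PHP is single-threaded, so crawling slowly works fine; the real risk is a URL that fails to open and leaves the script hanging.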
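
The queue-driven loop described above might be sketched like this (PHP 7.4 or later for the arrow function). The function name crawl and the injected $fetch callable are assumptions made for illustration; passing the fetcher in lets an in-memory array of pages stand in for real HTTP requests. Only absolute http(s) links are followed.

```php
<?php
/**
 * Breadth-first crawl driven by an array used as a queue.
 * $fetch is any callable that returns the HTML for a URL, or null on failure.
 */
function crawl(array $seeds, callable $fetch, int $maxPages = 100): array
{
    $queue   = $seeds;                         // pending URLs
    $seen    = array_fill_keys($seeds, true);  // URLs that have ever been queued
    $visited = [];

    while ($queue && count($visited) < $maxPages) {
        $url  = array_shift($queue);           // take the next URL from the front
        $html = $fetch($url);
        if ($html === null) {
            continue;                          // unreachable page: skip it
        }
        $visited[] = $url;                     // "save" step; here we only record the URL

        // Collect absolute http(s) links; relative links are ignored in this sketch.
        preg_match_all('/<a\s[^>]*href="(https?:\/\/[^"#]+)"/i', $html, $m);
        foreach ($m[1] as $link) {
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue[] = $link;              // append to the back of the queue
            }
        }
    }
    return $visited;
}

// An in-memory "site" stands in for real HTTP requests.
$site = [
    'http://a.test/'  => '<a href="http://a.test/b">b</a> <a href="http://a.test/c">c</a>',
    'http://a.test/b' => '<a href="http://a.test/">home</a>',
    'http://a.test/c' => '<a href="http://a.test/missing">x</a>',
];
$visited = crawl(['http://a.test/'], fn($u) => $site[$u] ?? null);
print_r($visited);
```

The $seen map is what keeps the loop from revisiting pages that link back to each other, and $maxPages bounds the run on sites with very many links.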
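
A common way to keep a crawler from hanging on a dead URL is to set explicit timeouts on each request. The sketch below assumes the curl extension is installed; the function name fetchWithTimeout and the default limits are illustrative choices, not values from the original answer.

```php
<?php
/**
 * Fetch a URL with hard limits on connect time and total time.
 * Returns the response body, or null on any failure.
 */
function fetchWithTimeout(string $url, int $connectTimeout = 5, int $totalTimeout = 15): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,             // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,             // follow redirects
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_CONNECTTIMEOUT => $connectTimeout,  // seconds allowed to connect
        CURLOPT_TIMEOUT        => $totalTimeout,    // seconds allowed for the whole request
    ]);
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

// Port 1 on localhost is normally closed, so this call fails quickly instead of hanging.
$result = fetchWithTimeout('http://127.0.0.1:1/', 2, 3);
var_dump($result === null);
```

A failed fetch returns null, so the calling loop can simply move on to the next URL in the queue.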

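Fetching HTML content on a timer in PHP, then saving and reading it

<?php
// Periodically read an HTML file, extract part of it, and save that part to a .dat file.
@header('Content-type: text/html;charset=UTF-8');

$name    = "AA";
$seconds = 60;                      // polling interval in seconds
$url     = "./";
$html    = $url . $name . ".html";  // page to read: ./AA.html
$file    = $name . ".dat";          // output file: AA.dat

set_time_limit(0);                  // let the script run indefinitely

// The loop runs only while AA.dat exists, so deleting that file stops the script.
while (file_exists($file)) {
    $info = file_get_contents($html);
    // This converts UTF-8 to GBK although the header above declares UTF-8;
    // adjust or drop the call to match the encodings actually in use.
    $info = iconv("UTF-8", "GBK", $info);
    echo $info;

    // The start and end markers of this pattern were lost in the source.
    // As written, (?!) never matches, so $m[1] is always an empty string.
    if (preg_match("/((?:(?!)[\s\S])*)/", $info, $m)) {
        $fh = fopen($file, "w");
        fwrite($fh, $m[1]);
        fclose($fh);
    }
    sleep($seconds);
}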
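
Because the markers in the pattern above did not survive, the following is only a guess at the intent: capturing everything between a start marker and an end marker. The function name extractBetween and the body tags used as markers are hypothetical.

```php
<?php
/**
 * Return the text between $start and $end in $html, or null if not found.
 * Both markers are quoted so they are matched literally.
 */
function extractBetween(string $html, string $start, string $end): ?string
{
    $pattern = '/' . preg_quote($start, '/') . '([\s\S]*?)' . preg_quote($end, '/') . '/';
    return preg_match($pattern, $html, $m) ? $m[1] : null;
}

$html  = "<html><body>\n<p>hello</p>\n</body></html>";
$inner = extractBetween($html, '<body>', '</body>');
```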


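Disclaimer: this article comes from the internet and does not represent the views of 【好得很程序员自学网】. When reposting, please credit the source: http://haodehen.cn/did210528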

Last updated: 2023-05-03