PHP: Scrape a static page and download and save its CSS, images and JS
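This is a small tool that fetches a web page's HTML along with its CSS, JS, font and image resources. It is mainly meant for grabbing templates quickly: if you have no time to design a UI, or you come across a template you like, you can use it to fetch the page and extract its resource files. Extracted resources are saved under their relative paths, so the URLs that reference them in the page keep working.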
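To make the relative-path idea concrete, here is a minimal sketch of how a resource reference could be mapped to a local path. The function name `localPathFor` and the example URLs are hypothetical and not part of the original tool. The layout (relative files keep their directories, remote http files go into an `http` folder) follows the usage notes in this article. Note that `..` segments are not normalized here.

```php
<?php
// Hypothetical helper (not part of the original tool): map a resource
// reference found in a page to the local path it would be saved under.
function localPathFor(string $project, string $resource): string
{
    // Absolute http(s) URLs are collected flat in "<project>/http".
    if (preg_match('#^https?://#i', $resource)) {
        $path = (string) parse_url($resource, PHP_URL_PATH);
        return $project . '/http/' . basename($path);
    }
    // Drop any query string or fragment, then keep the relative layout.
    $clean = preg_replace('/[?#].*$/', '', $resource);
    return $project . '/' . ltrim($clean, '/');
}

echo localPathFor('demo', 'css/style.css?v=3'), "\n";
echo localPathFor('demo', 'http://cdn.example.com/lib/jquery.js'), "\n";
```

With these inputs the sketch prints `demo/css/style.css` and `demo/http/jquery.js`.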
The home page, index.php:
<!DOCTYPE html>
<html>
<head>
    <meta name="author" content="flute" />
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>Web Page Grabber</title>
    <link rel="stylesheet" href="main.css" media="all" />
    <script type="text/javascript" src="jquery.js"></script>
    <script type="text/javascript" src="main.js"></script>
</head>
<body>
    <h1>Web Grabber</h1>
    <hr />
    <div class="box">
        <h2>Url</h2>
        <div class="form">
            <input type="text" id="project" value="projectname" />
            <input type="text" id="url" value="http://" size="60" />
            <button class="submit" type="button">Get</button><span id="tip"></span>
        </div>
    </div>
    <div class="box">
        <span class="all" id="saveall">Save All</span>
        <h2>List</h2>
        <ul id="list">
        </ul>
    </div>
</body>
</html>

The page-grabbing script, grab.php:
<?php
/*
 * flute
 * 2014/03/31
 */

if (isset($_POST['url'])) {
    // Create the project directory on first use.
    // The project name is used as-is, with no validation.
    if (isset($_POST['project']) && !is_dir($_POST['project'])) {
        mkdir($_POST['project'], 0777);
    }
    echo json_encode(grab($_POST['url']));
}

// Download the page at $url and return the resource URLs it references.
function grab($url)
{
    //$url = 'http://ldixing-wordpress.stor.sinaapp.com/uploads/leaves/test.html';
    $data = array();
    // Keep only the file name after the last slash.
    $file = preg_replace('/^.*\//', '', $url);
    if (($content = file_get_contents($url)) !== false) {
        if (isset($_POST['project'])) {
            file_put_contents($_POST['project'] . '/' . $file, $content);
        }
        // Stylesheets: <link ... href="*.css">
        $pattern = '/<link.*?href=(\'|")(.*?\.css)\1.*?>/i';
        if (preg_match_all($pattern, $content, $matches)) {
            $data['css'] = $matches[2];
        }
        // Scripts: <script ... src="*.js">
        $pattern = '/<script.*?src=(\'|")(.*?\.js)\1.*?>/i';
        if (preg_match_all($pattern, $content, $matches)) {
            $data['js'] = $matches[2];
        }
        // Images: <img ... src="...">
        $pattern = '/<img.*?src=(\'|")(.*?)\1.*?>/i';
        if (preg_match_all($pattern, $content, $matches)) {
            $data['img'] = $matches[2];
        }
        // CSS url(...) references, quoted or unquoted.
        $pattern = '/url\(\s*(\'|"|)(.*?)\1\s*\)/i';
        if (preg_match_all($pattern, $content, $matches)) {
            $data['src'] = $matches[2];
        }
    }
    return $data;
}

// Debug helper: pretty-print a value.
function vardump($obj)
{
    echo '<pre>';
    print_r($obj);
    echo '</pre>';
}
?>

The script that saves the CSS, JS, image and other resources, save.php:
<?php
/*
 * flute
 * 2014/03/31
 */

if (isset($_POST['url']) && isset($_POST['project']) && isset($_POST['domain'])) {
    // Read the expected fields explicitly rather than extract()-ing all of $_POST.
    $url     = $_POST['url'];
    $project = $_POST['project'];
    $domain  = $_POST['domain'];

    $url = preg_replace('/\?.*$/', '', $url); // strip the query string
    $file = $url;
    $arr = explode('/', $file);
    $length = sizeof($arr);
    $filename = $arr[$length - 1];
    $root = $project;
    $dir = $root;
    if ($domain == 'http') {
        // Remote (absolute) files all go into <project>/http.
        $dir = $root . '/http';
        if (!is_dir($dir)) mkdir($dir, 0777);
    } else {
        // Relative files: $domain is presumably the page's base URL
        // (sent by main.js, which is not shown in this article).
        $file = $domain . '/' . $url;
        // Recreate the directory tree one level at a time.
        for ($i = 0; $i < $length - 1; $i++) {
            if (!empty($arr[$i])) {
                $dir .= '/' . $arr[$i];
                if (!is_dir($dir)) mkdir($dir, 0777);
            }
        }
    }
    // Download only if the file is missing or empty.
    if (!file_exists($dir . '/' . $filename) || filesize($dir . '/' . $filename) == 0) {
        $content = file_get_contents($file);
        if ($content !== false) {
            file_put_contents($dir . '/' . $filename, $content);
        }
    }
}
?>

Usage:
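1. Open the index page and enter a project name and the URL to grab. The URL must end with a file name, such as index.html;
2. Click the Get button to list all CSS, JS, image and other resources of the current page;
3. Click a CSS link to fetch the background images referenced in that CSS file; they are appended to the end of the list;
4. Click Save All to save every file in the list, recreating their relative paths;
5. Remote http files referenced by the page are saved directly into the http folder;
6. Get and Save sometimes fail. That is fine; just retry a few times.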
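Since step 6 relies on retrying by hand, the retry could also be done in code. The following is a minimal sketch under stated assumptions, not part of the original scripts: the function name `fetchWithRetry`, the three attempts, the 10-second timeout and the one-second pause are all illustrative choices.

```php
<?php
// Hypothetical helper (not in the original scripts): fetch a URL with a
// timeout and a few retries, since single requests sometimes fail.
function fetchWithRetry(string $url, int $attempts = 3, int $timeout = 10)
{
    // The "timeout" HTTP context option limits how long a read may take.
    $context = stream_context_create([
        'http' => ['timeout' => $timeout],
    ]);
    for ($i = 1; $i <= $attempts; $i++) {
        // @ suppresses the warning; failure is signalled by false.
        $content = @file_get_contents($url, false, $context);
        if ($content !== false) {
            return $content;
        }
        if ($i < $attempts) {
            sleep(1); // brief pause before the next attempt
        }
    }
    return false; // every attempt failed
}

// Example call (assumed URL):
// $content = fetchWithRetry('http://example.com/index.html');
```

In grab.php or save.php, such a helper could stand in for the direct `file_get_contents` calls, with the caller still checking for a `false` result.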