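The My-Autopost scraper used on this site does its data collection with the simple_html_dom parsing class, but that plugin only works with WordPress and is useless for any other CMS. So I decided to learn the library myself, with the goal of eventually exposing the scraper as an API that more CMSs and blog systems can use. I did consider writing it in C++, which I actually know better than PHP, but C++ is more hassle: it has to be compiled separately for Windows and Linux, and portability is somewhat worse. PHP avoids those problems, at the cost of some performance. Enough talk, let's get started.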
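- First, fetch the content.
- Compress the HTML (strip redundant spaces and line breaks so that tags are easier to match with patterns later; I am not sure yet whether simple_html_dom can match tags that are not separated by line breaks, so I will test that later).
- Select elements by a div's id or class.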
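The id/class step, and the open question about line breaks, can be tried out without touching the network. The sketch below swaps in PHP's built-in DOM extension (DOMDocument and DOMXPath) as a stand-in, so it only suggests what to expect from a DOM-style parser; it is not simple_html_dom's API. The helper name findTexts and the sample markup are made up for the illustration.

```php
<?php
// Hypothetical helper: return the trimmed text of every node matched by an XPath query.
function findTexts(string $html, string $xpath): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $texts = [];
    foreach ((new DOMXPath($doc))->query($xpath) as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

// Made-up sample markup, written across several lines.
$multiLine = <<<HTML
<div id="post-12">
  <h2>
    <a class="title hot" href="/a">First</a>
  </h2>
</div>
<div id="post-34">
  <h2><a class="title" href="/b">Second</a></h2>
</div>
HTML;

// The same markup with all whitespace between tags removed.
$singleLine = preg_replace('/>\s+</', '><', $multiLine);

// Class lookup: the class attribute contains the token "title".
$byClass = '//a[contains(concat(" ", normalize-space(@class), " "), " title ")]';
// Id lookup: the id attribute starts with "post-".
$byIdPrefix = '//div[starts-with(@id, "post-")]//h2/a';

print_r(findTexts($multiLine, $byClass));
print_r(findTexts($singleLine, $byIdPrefix));
```

Both calls list the same two link texts, whether or not the markup contains line breaks. That suggests the compression step matters mainly for literal string matching, not for selector-based lookups.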
<?php
require 'simple_html_dom.php';
$name = "leehom";
header ( 'Content-type:text/html;charset=UTF-8' );
// var_dump($GLOBALS);
// echo $GLOBALS['name'];
function test() {
    $str = <<<HTML
<div>
<div>
<div class="foo bar">ok</div>
</div>
</div>
HTML;
    echo str_get_html ( $str );
}
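function compress_html($string) {
    $string = str_replace ( "\r\n", '', $string ); // remove Windows line breaks
    $string = str_replace ( "\n", '', $string ); // remove Unix line breaks
    $string = str_replace ( "\t", '', $string ); // remove tabs
    // Patterns run in order; each one pairs with the same index in $replace
    $pattern = array (
        "/> *([^ ]*) *</", // trim spaces around space-free text between > and <
        "/[\s]+/", // collapse runs of whitespace into a single space
        "/<!--[^!]*-->/", // remove HTML comments that contain no "!"
        "/\" /", // drop the space after a double quote (this also joins adjacent attributes)
        "/ \"/", // drop the space before a double quote
        "'/\*[^*]*\*/'" // remove /* ... */ comments; the single quotes are the regex delimiters
    );
    $replace = array (
        ">\\1<",
        " ",
        "",
        "\"",
        "\"",
        ""
    );
    return preg_replace ( $pattern, $replace, $string );
}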
function printValue($value, $tag = '') {
    if (isset ( $value )) {
        echo $tag . '>' . $value;
    }
}
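function testByCssClass() {
    $html = file_get_html ( "http://www.it72.com/tech" );
    if (! $html) {
        return; // file_get_html returns false when the page cannot be loaded
    }
    // Each match is an <a> inside an <h2> under an element with class c-top
    foreach ( $html->find ( '.c-top h2 a' ) as $link ) {
        // var_dump($link);
        if (isset ( $link->plaintext )) {
            printValue ( $link->plaintext );
        }
        if (isset ( $link->href )) {
            printValue ( $link->href );
        }
        echo '<br/>';
    }
    $html->clear ();
}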
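function testByCssId() {
    $html = file_get_html ( "http://www.it72.com/tech" );
    if (! $html) {
        return; // file_get_html returns false when the page cannot be loaded
    }
    // The simple_html_dom manual documents no "(*)" wildcard for ids, so this uses
    // the attribute prefix selector instead: [id^=post-] matches ids starting with "post-"
    foreach ( $html->find ( '[id^=post-] h2 a' ) as $link ) {
        // var_dump($link);
        if (isset ( $link->plaintext )) {
            printValue ( $link->plaintext );
        }
        if (isset ( $link->href )) {
            printValue ( $link->href );
        }
        echo '<br/>';
    }
    $html->clear ();
}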
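function testByText($start_tag, $end_tag) {
    $html = file_get_html ( "http://www.it72.com/tech" );
    if (! $html) {
        return; // file_get_html returns false when the page cannot be loaded
    }
    // Strip redundant spaces and line breaks first so the literal tags can match
    $text = compress_html ( $html->innertext );
    $start = strpos ( $text, $start_tag );
    // strpos returns false when nothing is found, and 0 is a valid offset
    while ( $start !== false ) {
        // Use the $end_tag argument instead of a hard-coded '</article>'
        $end = strpos ( $text, $end_tag, $start );
        if ($end === false) {
            break;
        }
        print_r ( substr ( $text, $start, $end - $start + strlen ( $end_tag ) ) );
        $start = strpos ( $text, $start_tag, $end );
    }
    $html->clear ();
}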
testByText ( '<article class="post_box mb10">', '</article>' );
The above is the test code.
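One detail of the start-tag/end-tag extraction is worth isolating: strpos returns false when the needle is missing, while 0 is a legitimate offset, so a `> 0` style test would skip a block that sits at the very start of the text. Here is a minimal sketch of the same idea on a plain string, assuming nothing beyond core PHP; the function name extractBetween and the sample string are hypothetical, and no simple_html_dom or network access is involved.

```php
<?php
// Hypothetical helper: collect every substring that starts with $startTag
// and runs through the next occurrence of $endTag (both tags included).
function extractBetween(string $text, string $startTag, string $endTag): array
{
    $blocks = [];
    $start = strpos($text, $startTag);
    while ($start !== false) {
        // Look for the end tag only after the start tag itself.
        $end = strpos($text, $endTag, $start + strlen($startTag));
        if ($end === false) {
            break; // unterminated block: stop scanning
        }
        $stop = $end + strlen($endTag);
        $blocks[] = substr($text, $start, $stop - $start);
        // Continue scanning after the block just captured.
        $start = strpos($text, $startTag, $stop);
    }
    return $blocks;
}

// Made-up sample; note that the first block begins at offset 0.
$sample = '<article class="post_box mb10">A</article><p>skip</p>'
        . '<article class="post_box mb10">B</article>';

print_r(extractBetween($sample, '<article class="post_box mb10">', '</article>'));
```

Returning the blocks as an array instead of printing them inside the loop also makes the function easier to reuse from an API endpoint later.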
Permalink: https://it72.com:4443/9889.htm