Web crawler (Part 2)

Posted on October 29, 2013

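While debugging, I found that using HtmlUnitDriver through WebDriver did not give me the HTML DOM tree as it stands after JavaScript has run. It only returned the page's HTML source. (HtmlUnitDriver leaves JavaScript disabled unless it is explicitly enabled, which is a likely cause.)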

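Note: for now I will call these two things the "source page" and the "script-processed HTML DOM". The source page is the HTML code emitted by the server, whether by PHP, JSP, or a servlet, before the browser (the client) has processed it. What the user actually sees is the result after the browser has executed the scripts and applied the CSS.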

For example, I wrote a simple HTML page that contains a single JavaScript statement.

<html><body>welcome<div><script>document.write('<span>hello</span>');</script></div></body></html>

If we open the page and view its source, we see exactly the code above.

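With a browser that has a JavaScript engine, however, the resulting DOM tree actually looks like the following (note that the exact output varies slightly between browser engines):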

<html><body>welcome<div> <script>//<![CDATA[document.write('<span>hello</span>');//]]> </script> <span> hello</span> </div> </body></html>

Hopefully the difference is clear now.
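One subtlety in the example above: the text `<span>hello</span>` appears in both versions, because the source page carries it inside the script body. A plain substring check therefore cannot tell whether a fragment was produced by the script. The sketch below is one rough way to make that distinction, by dropping `<script>` elements before comparing. The class and method names are hypothetical and not part of HtmlUnit or Selenium, and regex-based stripping is only a crude approximation of real HTML parsing.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of HtmlUnit or Selenium.
class ScriptGeneratedCheck {
    // Matches a whole <script>...</script> element, case-insensitively, across lines.
    private static final Pattern SCRIPT_ELEMENT = Pattern.compile(
            "<script\\b[^>]*>.*?</script\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Removes script elements and all whitespace so that serializer
    // formatting differences do not affect the comparison (crude on purpose).
    static String normalize(String html) {
        String withoutScripts = SCRIPT_ELEMENT.matcher(html).replaceAll("");
        return WHITESPACE.matcher(withoutScripts).replaceAll("");
    }

    // True if the fragment is absent from the source page (outside scripts)
    // but present in the DOM obtained after script execution.
    static boolean isScriptGenerated(String source, String dom, String fragment) {
        String needle = WHITESPACE.matcher(fragment).replaceAll("");
        return !normalize(source).contains(needle) && normalize(dom).contains(needle);
    }

    public static void main(String[] args) {
        String source = "<html><body>welcome<div><script>"
                + "document.write('<span>hello</span>');</script></div></body></html>";
        String dom = "<html><body>welcome<div> <script>//<![CDATA["
                + "document.write('<span>hello</span>');//]]> </script>"
                + " <span> hello</span> </div> </body></html>";
        System.out.println(isScriptGenerated(source, dom, "<span>hello</span>")); // true
        System.out.println(isScriptGenerated(source, dom, "welcome"));            // false
    }
}
```

Here the span counts as script-generated because, once the script element is removed, it only exists in the processed DOM, while "welcome" is already present in the source page.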

So I changed the crawler's code for fetching the HTML. It originally looked like this:

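import org.openqa.selenium.WebDriver;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

WebDriver driver = new HtmlUnitDriver();
// Visit the test page
driver.get("http://blog.whoistester.com/test.html");
// Returned only the page source in my tests, without the script output
String pageSource = driver.getPageSource();
System.out.println(pageSource);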

I changed it to the following code, which uses HtmlUnit directly instead of going through Selenium's WebDriver.

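import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

HtmlPage page = null;
try {
    WebClient webclient = new WebClient();
    // Keep going on script errors and failing HTTP status codes
    webclient.setThrowExceptionOnScriptError(false);
    webclient.setThrowExceptionOnFailingStatusCode(false);
    webclient.setPopupBlockerEnabled(true);
    // javaScriptErrorListener is assumed to be defined elsewhere (not shown in this post)
    webclient.setJavaScriptErrorListener(javaScriptErrorListener);
    page = webclient.getPage("http://blog.whoistester.com/test.html");
} catch (Exception e) {
    e.printStackTrace();
}
if (page != null) {
    // asXml() serializes the current DOM, including nodes written by scripts
    System.out.println(page.asXml());
}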

With this change, we can scrape the content of some pages that use AJAX.
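AJAX content can arrive some time after the initial page load returns, so a crawler may need to wait before reading the DOM. The following is a minimal sketch of a generic polling helper, assuming that "the content has arrived" can be expressed as a boolean check. The class name, method name, and timings are hypothetical and not taken from the crawler above.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper for illustration; not part of HtmlUnit or Selenium.
class PollUntil {
    // Re-evaluates the condition every pollMillis until it holds or
    // timeoutMillis has elapsed. Returns true if the condition held in time.
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up waiting
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in for "the AJAX content has arrived": becomes true after about 200 ms
        boolean ready = waitUntil(() -> System.currentTimeMillis() - start >= 200, 2000, 50);
        System.out.println(ready);
    }
}
```

In the crawler, the condition could be something like checking whether `page.asXml()` contains an expected marker, given a final reference to the page. HtmlUnit's `WebClient` also provides `waitForBackgroundJavaScript(timeoutMillis)`, which may be sufficient on its own.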


Tags: crawler, htmlunit

About The Author

The Tester

Sharing technical knowledge, life, and learning
