HTML parser for JavaScript
by gmosx, at 27 Dec 2009
I needed an HTML parser for one of my projects. As I have been using JavaScript exclusively lately (and loving it), I searched for a JavaScript solution. Envjs looked interesting, but its messy source code (and global namespace pollution) was off-putting. In the end I decided to package the Java HTML5 parser used in Envjs for Narwhal.
The tricky part was making the parser compatible with Sizzle, but I am happy to report that the two now work great together. You can find the source code for the package here. And here is the mandatory example:
var HTMLParser = require("htmlparser").HTMLParser,
    sizzle = require("sizzle").sizzle;
var html = '<html><p id="header"><b>nice</b></p><div id="test" class="big">hello</div><div>second</div></html>',
    parser = new HTMLParser(),
    document = parser.parse(html),
    $ = sizzle(document);
// Print the contents of every div in the parsed document.
$("div").forEach(function(el) {
    print(el.innerHTML);
});
print(document.toHTML());
print(document);
Now it is easy to write your own server-side web scraper using familiar client-side tools!
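As a rough sketch of what such a scraper might look like, the parse-then-select steps from the example above can be wrapped in one small function. The name scrapeInnerHTML and its parse and select parameters are hypothetical and not part of the htmlparser or sizzle packages; the parser and selector engine are passed in so the function makes no assumptions beyond what the example shows.

```javascript
// Hypothetical helper, not part of the htmlparser or sizzle packages.
// parse:  function (html) -> document, e.g. a wrapper around parser.parse
// select: function (document) -> query function, e.g. sizzle
function scrapeInnerHTML(html, selector, parse, select) {
    var doc = parse(html);
    var $ = select(doc);
    var results = [];
    // Collect the markup inside every element that matches the selector.
    $(selector).forEach(function (el) {
        results.push(el.innerHTML);
    });
    return results;
}

// Under Narwhal, with the packages from the example above, a call might look like:
//
//   var HTMLParser = require("htmlparser").HTMLParser,
//       sizzle = require("sizzle").sizzle;
//   var parser = new HTMLParser();
//   scrapeInnerHTML(html, "div", function (h) { return parser.parse(h); }, sizzle);
```

Passing the parser and selector engine as arguments also makes it easy to try the function with simple stand-ins before wiring in the real packages.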


