gmosx

HTML parser for JavaScript

by gmosx, at 27 Dec 2009
I needed an HTML parser for one of my projects. As I am using JavaScript exclusively lately (and loving it), I searched for a JavaScript solution. Envjs looked interesting, but the messy source code (and global namespace pollution) was discomforting. Finally, I decided to package the Java HTML5 parser used in Envjs for Narwhal.

The tricky part was making the parser compatible with Sizzle, but I am happy to report that the two work great together now. You can find the source code for the package here. And here is the mandatory example:

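// Load the parser and the selector engine as CommonJS modules.
var HTMLParser = require("htmlparser").HTMLParser,
    sizzle = require("sizzle").sizzle;

// Parse the markup into a DOM and bind Sizzle to that document.
var html = '<html><p id="header"><b>nice</b></p><div id="test" class="big">hello</div><div>second</div></html>',
    parser = new HTMLParser(),
    document = parser.parse(html),
    $ = sizzle(document);

// Print the contents of every div in the document.
$("div").forEach(function(el) {
    print(el.innerHTML);
});

// Serialize the whole document back to HTML, then print the document object itself.
print(document.toHTML());
print(document);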

Now it is easy to write your own server-side web scraper using familiar client-side tools!
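
As a rough sketch of what such a scraper could look like, the steps from the example can be wrapped in a small function. Note that `scrape` is a hypothetical helper, not part of the htmlparser or sizzle packages. It assumes only what the example above shows: a parser object with a `parse(html)` method, and a `sizzle` function that takes a document and returns a selector function whose result supports `forEach`. How the HTML is fetched is left out on purpose.

```javascript
// Hypothetical helper: not part of the htmlparser or sizzle packages.
// `parser` is assumed to expose parse(html), and `sizzle` is assumed to
// turn a document into a selector function, as in the example above.
function scrape(html, selector, parser, sizzle) {
    var doc = parser.parse(html),
        $ = sizzle(doc),
        results = [];

    // Collect the inner markup of every element matching the selector.
    $(selector).forEach(function(el) {
        results.push(el.innerHTML);
    });

    return results;
}
```

With the modules loaded as in the example, this would be called as `scrape(html, "div", new HTMLParser(), sizzle)`. Passing the parser and selector engine in as arguments keeps the helper free of globals, which was my main complaint about Envjs.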

7 Comments

by Roberto Saccon, at 27 Dec 2009
Cool! George, your stuff rocks! Does your HTML5 parser / Sizzle package allow DOM-like queries on the server side, as in the browser? That would open the door to a PURE-like template language, where the template is pure HTML and runs on both server and browser.
by dionyziz, at 27 Dec 2009
If you're using JavaScript within a browser, there is an excellent HTML parsing library that you can use, which has been tested and works very well, and is actually the most used HTML parsing library on the web today: the browser itself.
by George Moschovitis, at 09 Jan 2010
Roberto, yeah it allows DOM-like queries. For example you can use jQuery or Sizzle to play with the DOM.
by George Moschovitis, at 09 Jan 2010
dionyziz, the parser is intended for the server side. It emulates the browser so you can play with the DOM or use Sizzle or something similar. It is intended for web scraping, etc.
by Roberto, at 20 Jan 2010
I am beta-testing your comment system. Maybe it would be easier to authenticate via OAuth with Twitter than to force users to decipher those images (well, right now Twitter is down).
by Roberto, at 20 Jan 2010
Testing it again for the browser caching issue (the last comment did not show up).