Parsing HTML on the server side with Node

Often, you need your server to fetch external HTML pages and do something with the HTML that comes back.

For example, here is a simple code block that grabs the title of the page (falling back to the original URL if no title exists), using REGEX:

 
var url = 'http://www.myawesomeblog.com';
var request = require('request');

request(url, function(err, res, html) {
    if (err) {
      console.log('ERROR');
    } else {
      //use the page title if there is one, otherwise fall back to the url
      var match = html.match(/<title>(.*)<\/title>/);
      var result = match ? match[1] : url;
      console.log(result);
    }
});

This is fine and will work for most pages, since you’re extracting something pretty simple. Most pages only have one <title> tag. But it can easily get out of hand if you have to do something slightly more complex.
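
To see where a pattern like this starts to strain, here is a small sketch that runs the same title REGEX over a few made-up HTML strings. The getTitle helper and the sample strings are hypothetical, invented for illustration; only the pattern itself comes from the code above.

```javascript
// Hypothetical helper: the same pattern as above, wrapped in a function.
function getTitle(html, fallback) {
  var match = html.match(/<title>(.*)<\/title>/);
  return match ? match[1] : fallback;
}

var url = 'http://www.myawesomeblog.com';

// Made-up inputs, for illustration only.
var simple = '<html><head><title>My Awesome Blog</title></head></html>';
var withAttribute = '<html><head><title id="t">My Awesome Blog</title></head></html>';
var multiLine = '<html><head><title>\nMy Awesome Blog\n</title></head></html>';

console.log(getTitle(simple, url));        // finds the title text
console.log(getTitle(withAttribute, url)); // falls back to the url: <title id="t"> is not matched
console.log(getTitle(multiLine, url));     // falls back to the url: . does not match newlines
```

Each gap can be patched with a longer pattern, which is exactly how a simple REGEX gets out of hand.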

Here, for example, is a code block that grabs the href attribute of all <a> elements (assuming you have a variable named "html" that’s a string of html data):

 
//match with REGEX where html is a string of html data
//(match returns null when nothing is found, so fall back to an empty array)
var linksStr = html.match(/<a\s+(?:[^>]*?\s+)?href="([^"]*)"/g) || [];

//go through array and grab just the URL from each match
var results = [];

linksStr.forEach(function (item) {
  var matched = item.match(/href="([^"]*)"/);
  var urlString = matched[1];
  results.push(urlString);
});

Still works, but already getting pretty ugly.  What the !*@# is this thing: /<a\s+(?:[^>]*?\s+)?href="([^"]*)"/g

I don’t know either.  I just googled something like "REGEX for a link".
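
For what it’s worth, here is one reading of that pattern, piece by piece, along with a few made-up inputs showing what it does and does not catch. The sample HTML is hypothetical and the comments are my own interpretation, so treat this as a sketch rather than a reference.

```javascript
// A reading of the pattern, piece by piece:
//   <a\s+            an opening <a followed by at least one whitespace character
//   (?:[^>]*?\s+)?   optionally, other attributes before href (anything except >), then whitespace
//   href="([^"]*)"   href=" and then the captured URL, up to the next double quote
//   g                find every match, not just the first
var linkPattern = /<a\s+(?:[^>]*?\s+)?href="([^"]*)"/g;

// Made-up HTML, for illustration only.
var html = [
  '<a href="/about">About</a>',
  '<a class="nav" href="http://example.com/">Example</a>',
  "<a href='/single-quoted'>Missed</a>",
  '<A HREF="/upper-case">Also missed</A>'
].join('\n');

// Same two-step approach as above: match each link, then pull out the URL.
var found = (html.match(linkPattern) || []).map(function (item) {
  return item.match(/href="([^"]*)"/)[1];
});

console.log(found); // only the first two links are found
```

The single-quoted and upper-case links are perfectly valid HTML, but the pattern skips both of them.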

Now, I don’t know how much you like REGEX, but I like it as much as taking a salty ice bath after getting 1000 paper cuts. Luckily, there’s a better way.

Wouldn’t it be nice if you could select elements in the HTML like you can with jQuery on the front end? Well, it turns out you can do exactly that with a nice little Node module called cheerio (link to cheerio).

Let’s try again to grab the href attribute of all <a> elements (assuming you have an html variable that’s a string of html data):

 
var cheerio = require('cheerio');
var $ = cheerio.load(html);

//select every <a> element, jQuery style
var linksArray = $('a');
var results = [];

linksArray.each(function(i, link) {
    var urlString = $(link).attr('href');
    results.push(urlString);
});

There!  All done and you didn’t have to deal with any REGEX.  Wasn’t that nice?  Cheers to cheerio!