Tuesday, October 16, 2012

Web Scraping - Open Source .Net Libraries

Nowadays, almost everyone needs some data from the internet at some point. One person may need a listing of the businesses she has to work with, another a list of books she will sell on her website, and another the entire content of a website. The list goes on; you never know who will need what.

This becomes a real problem, and companies and individuals look for solutions. Manual data entry is the obvious one, and it does work, but the time you spend on it grows with the size of the problem.
The alternative is web scraping, where automated software collects the data for you. I work with .Net and have tried several libraries: Watin, WebZinc, HtmlAgilityPack, and Fizzler, which builds on HtmlAgilityPack. I will try to explain their differences.

Let me start with WebZinc. I found it only when I needed it: a former client asked for it to be used as the main component of a scraper application. It is not totally open source, but you can use it for free or buy it for $99. The free version pops up an alert window that requires you to click "OK" each time the application runs. WebZinc supports visual browsing, which means it starts a browser instance so you can see what is going on. It also has non-visual methods, which are better suited to applications that run simultaneously or on a web server.

Watin is another option for web scraping, and the one I usually use. It requires you to choose either Firefox or IE as the visual browser. It manages that browser instance, so you can watch the web pages and see what is going on. I recommend IE, because Watin's Firefox support only goes up to Firefox 3.6.28, which is very old. Watin also has better documentation than WebZinc; it is very hard to find anything on Google about WebZinc. Watin's conventions are simple: it has an object for almost every Html tag, such as Table, TableRow, TableCell, Form, Div, Button, Image, Para (the p tag), List (a ul or ol element), ListItem (an li element), etc. These objects share almost the same actions; for example, you can simply call the .Click() method to click on an element.
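To give an idea of what this looks like, here is a minimal sketch, written from memory of the WatiN 2.x API. The URL, the field name "q", the button name "go" and the table id "results" are made up for illustration; replace them with whatever the target page really uses.

```csharp
using System;
using WatiN.Core;

class WatinSketch
{
    [STAThread] // WatiN drives IE through COM, which needs a single-threaded apartment
    static void Main()
    {
        // Hypothetical page and element names, for illustration only.
        using (var browser = new IE("http://example.com/search"))
        {
            // Fill in the search box and click the button, as a user would.
            browser.TextField(Find.ByName("q")).TypeText("books");
            browser.Button(Find.ByName("go")).Click();
            browser.WaitForComplete();

            // Walk the result table with the Table / TableRow / TableCell objects.
            Table results = browser.Table(Find.ById("results"));
            foreach (TableRow row in results.TableRows)
            {
                foreach (TableCell cell in row.TableCells)
                {
                    Console.Write(cell.Text + "\t");
                }
                Console.WriteLine();
            }
        }
    }
}
```

Because the browser is disposed at the end of the using block, the IE window closes when the scrape finishes.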

HtmlAgilityPack is the last option. It is not as functional as Watin: you have to write a lot of code just to mimic a button click. It is good for static Html, where you do not need to post or get anything via buttons or javascript. It works on plain HTTP requests (WebClient-style) rather than a browser, so you can use proxies easily and just parse the html you get back. It uses XPath to select Html elements. If you need more than that from HtmlAgilityPack, you have to use something on top of it or write your own wrapper library.
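As a small illustration of the XPath style, here is a sketch that pulls every link out of a piece of Html. The class and method names (HapSketch, ExtractLinks) and the sample markup are my own; in a real scraper the html string would come from the download step, for example WebClient.DownloadString.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class HapSketch
{
    // Returns the href of every <a> element that has one.
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var links = new List<string>();

        // SelectNodes returns null (not an empty list) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@href]");
        if (nodes == null)
        {
            return links;
        }

        foreach (HtmlNode anchor in nodes)
        {
            links.Add(anchor.GetAttributeValue("href", ""));
        }
        return links;
    }

    static void Main()
    {
        // Made-up sample markup; the last anchor has no href and is skipped.
        const string html =
            "<html><body><a href='/a'>A</a><a href='/b'>B</a><a>no link</a></body></html>";

        foreach (string href in ExtractLinks(html))
        {
            Console.WriteLine(href);
        }
    }
}
```

The null check is the part that is easy to forget: a page without any matching element would otherwise throw in the foreach.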

Fizzler uses HtmlAgilityPack underneath and adds CSS selector support on top of it, but it is not more functional than HtmlAgilityPack beyond that. If you already use CSS selectors in javascript, you will adapt easily.
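Here is a rough sketch of the CSS selector style, assuming the Fizzler.Systems.HtmlAgilityPack package, which adds QuerySelectorAll to HtmlAgilityPack nodes. The names FizzlerSketch and ExtractTitles, the selector and the sample markup are made up for illustration.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack;

static class FizzlerSketch
{
    // Returns the text of every li.title inside a ul.books list.
    public static string[] ExtractTitles(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // QuerySelectorAll is the extension method Fizzler adds to HtmlNode.
        return doc.DocumentNode
                  .QuerySelectorAll("ul.books li.title")
                  .Select(li => li.InnerText.Trim())
                  .ToArray();
    }

    static void Main()
    {
        // Made-up sample markup; the li without class 'title' is ignored.
        const string html =
            "<ul class='books'>" +
            "<li class='title'>Dune</li>" +
            "<li class='title'>Emma</li>" +
            "<li class='price'>9.99</li>" +
            "</ul>";

        foreach (string title in ExtractTitles(html))
        {
            Console.WriteLine(title);
        }
    }
}
```

The same selection in XPath would be noticeably longer, which is the main reason to put Fizzler on top of HtmlAgilityPack.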

In conclusion, I prefer Watin, but in some cases it is not enough or not suitable for web scraping. I will write more about Watin in later posts.

Let me know if anything is missing or if there is a mistake.
Mehmet.