If you have read my blog article about badly behaved bots, you will know that I recently had a developer from a job aggregator tell me that their robot couldn't parse a Robots.txt file. I suspect they just didn't want to parse it, as they would have found their user-agent mentioned in almost every one of my sites' files! However, even though I spend a lot of my time combating rogue crawler traffic, I do occasionally have to use my own crawler to extract data from other sites for various purposes.

As it's always best to follow a site owner's rules as specified in their Robots.txt file, to prevent yourself getting blocked from the site, I thought I would put up some code that shows how to use C# to parse a Robots.txt file. The code shown is pretty basic, but gives an example of the following:
- Setting the user-agent value for your robot.
- How to request a page through a proxy.
- Requesting a page, handling any HTTP errors (404, 500, 403, etc.) and returning the response.
- Parsing a site's Robots.txt file and storing the rules that apply to your user-agent.
- Checking a URL against those stored rules before requesting it, to see whether the Robots.txt file prohibits your bot from accessing it.
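The steps above can be sketched roughly as follows. To be clear, this is a simplified illustration rather than the article's actual code: the class name `RobotsParser`, the helper `DownloadRobotsTxt`, and the simple prefix matching are my own assumptions, and it only approximates the real grouping rules (e.g. several consecutive User-agent lines sharing one rule block).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net;

// Illustrative sketch only; names and the prefix-matching strategy
// are assumptions, not the article's actual implementation.
public class RobotsParser
{
    private readonly List<string> _disallowed = new List<string>();

    // Collects Disallow rules from the "*" section and from any
    // section naming our user-agent.
    public RobotsParser(string robotsTxt, string userAgent)
    {
        bool appliesToUs = false;
        foreach (string rawLine in robotsTxt.Split('\n'))
        {
            string line = rawLine.Split('#')[0].Trim(); // drop comments
            if (line.Length == 0) continue;

            int colon = line.IndexOf(':');
            if (colon < 0) continue;
            string field = line.Substring(0, colon).Trim().ToLowerInvariant();
            string value = line.Substring(colon + 1).Trim();

            if (field == "user-agent")
                appliesToUs = value == "*" ||
                    userAgent.IndexOf(value, StringComparison.OrdinalIgnoreCase) >= 0;
            else if (field == "disallow" && appliesToUs && value.Length > 0)
                _disallowed.Add(value);
        }
    }

    // A path is blocked when it starts with any stored Disallow prefix.
    public bool IsAllowed(string path) =>
        !_disallowed.Any(rule => path.StartsWith(rule, StringComparison.Ordinal));

    // Fetches robots.txt with the robot's user-agent set, optionally
    // through a proxy, turning HTTP errors into an empty rule set.
    public static string DownloadRobotsTxt(string url, string userAgent,
                                           IWebProxy proxy = null)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.UserAgent = userAgent;       // identify your robot
        if (proxy != null) request.Proxy = proxy;
        try
        {
            using (var response = (HttpWebResponse)request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream()))
                return reader.ReadToEnd();
        }
        catch (WebException ex)
        {
            // 404, 403, 500 etc. surface here rather than in the normal flow.
            var status = (ex.Response as HttpWebResponse)?.StatusCode;
            Console.WriteLine("robots.txt request failed: " + status);
            return ""; // no rules retrieved; decide your own policy here
        }
    }
}
```

Usage would look something like this (the proxy address is a placeholder): fetch the file with `RobotsParser.DownloadRobotsTxt("http://www.example.com/robots.txt", "MyBot/1.0", new WebProxy("http://127.0.0.1:8080"))`, construct a `RobotsParser` from the result, and call `IsAllowed("/some/page.html")` before each request.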
Please visit http://www.robotstxt.org/ if you require more details about the Robots.txt file and the various directives it supports.