Parsing the Robots.txt File with C-Sharp

If you have read my blog article about badly behaved bots you will know that I recently had a developer from a job aggregator tell me that their robot couldn't parse a Robots.txt file. I suspect they just didn't want to parse it, because if they had they would have found their user-agent mentioned in almost every one of my sites! However, even though I spend a lot of my time combating rogue crawler traffic, I do occasionally have to use my own crawler to extract data from other sites for various purposes. As it's always best to follow a site owner's rules as specified in their Robots.txt file, to prevent yourself getting blocked from the site, I thought I would put up some code that shows how to use C# to parse a Robots.txt file. The code shown is pretty basic but gives an example of the following:

  • Setting the user-agent value for your robot.
  • How to request a page through a proxy.
  • Requesting a page, handling any HTTP errors (404, 500, 403 etc.) and returning the response (a rough sketch covering these first three points follows this list).
  • Parsing a site's Robots.txt file and storing the rules that apply to your user-agent.
  • Checking a URL before accessing it to see if the Robots.txt file prohibits your bot from accessing it.
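
As a rough illustration of the first three points, here is a minimal sketch (not the code from the download) showing how a page can be requested in C# with a custom user-agent, an optional proxy and basic HTTP error handling. The user-agent string and proxy address below are just placeholder values.

/* A minimal sketch of requesting a page with a custom user-agent, an optional proxy */
/* and basic HTTP error handling. The user-agent string and proxy address are just   */
/* placeholders. Requires System, System.IO and System.Net.                          */
static string GetPage(string url)
{
	HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);

	/* identify your robot honestly rather than spoofing a browser */
	request.UserAgent = "Mozilla 5.0; MyTestBot 1.0; www.example.com;";

	/* uncomment to route the request through a proxy server if you need to */
	/* request.Proxy = new WebProxy("http://myproxy.example.com:8080"); */

	try
	{
		using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
		using (StreamReader reader = new StreamReader(response.GetResponseStream()))
		{
			return reader.ReadToEnd();
		}
	}
	catch (WebException ex)
	{
		/* 404, 403, 500 etc all surface here as WebExceptions */
		HttpWebResponse errorResponse = ex.Response as HttpWebResponse;
		if (errorResponse != null)
		{
			Console.WriteLine("HTTP error " + (int)errorResponse.StatusCode + " for " + url);
		}
		return null;
	}
}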

Please visit http://www.robotstxt.org/ if you require more details about the Robots.txt file and the various supported commands.

The Robots.txt File Parser Application

The code is contained within the following download, which consists of two files.

  • Robot.cs, which contains the main functionality including the Robots.txt parsing method.
  • Program.cs, a C# console application test harness that demonstrates the code in action.

The core code is contained within the Robot.cs class file, which contains two classes: one that parses each line of the Robots.txt file, and one that provides the following methods:

  • getURLContent: This method takes a URL as a parameter, accesses it (through a proxy server if required) and returns the HTML source if possible.
  • ParseRobotsTxtFile: This method takes a URL as a parameter, accesses that site's Robots.txt file and parses it looking for rules that apply to the user-agent the robot is using. It stores any URLs that the robot cannot access in an internal array, which is then checked when the URLIsAllowed method is called (a simplified sketch of this parsing logic follows the list).
  • URLIsAllowed: This method takes a URL as its parameter and returns a boolean value for whether the URL can be accessed or not.
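
The download contains the real implementation, but as a simplified sketch of the sort of logic involved (re-using the _BlockedUrls and _robotAgent names that appear in the URLIsAllowed method shown later, and assuming a List<string> from System.Collections.Generic), the Disallow rules for your agent could be collected along these lines:

/* A simplified sketch of the parsing logic only; see the download for the full version. */
/* _BlockedUrls and _robotAgent match the fields used by the URLIsAllowed method below.  */
static List<string> _BlockedUrls = new List<string>();
static string _robotAgent = "Mozilla 5.0; RobsRobot 1.2; www.strictly-software.com;";

static void ParseRobotsRules(string robotsTxtContent)
{
	/* tracks whether the current User-agent record applies to our robot */
	bool rulesApplyToUs = false;

	foreach (string rawLine in robotsTxtContent.Split('\n'))
	{
		/* strip comments and surrounding whitespace */
		string line = rawLine.Split('#')[0].Trim();
		if (line.Length == 0) continue;

		int colon = line.IndexOf(':');
		if (colon == -1) continue;

		string directive = line.Substring(0, colon).Trim().ToLower();
		string value = line.Substring(colon + 1).Trim();

		if (directive == "user-agent")
		{
			/* a * record, or one naming our agent, means the rules that follow apply to us */
			rulesApplyToUs = (value == "*" || _robotAgent.ToLower().Contains(value.ToLower()));
		}
		else if (directive == "disallow" && rulesApplyToUs)
		{
			/* a blank Disallow means nothing is blocked, so only store non-empty paths */
			if (value.Length > 0)
			{
				_BlockedUrls.Add(value.ToLower());
			}
		}
	}
}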

As you can see the code contains just the core methods required to make a basic robot. If you wanted to extend it into a crawler you could use these methods in a loop (as sketched below), adding your own code to parse the HTML content, extract links and other content, save that content to a file or database and then follow the relevant links. If you want to see the code in action, build the application and run the generated executable from the command prompt. The application will pipe debug output to the console so that you can see the Robots.txt file being parsed and confirmation of whether each URL can be accessed or not.
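
If you did want to take it further, here is a rough sketch of the kind of crawl loop you might build around those methods. It assumes the Robot methods described above are callable statically as in the test harness, uses Queue and HashSet from System.Collections.Generic, and relies on two hypothetical helpers, ExtractLinks and SaveContent, that you would write yourself.

/* A rough sketch only: ExtractLinks and SaveContent are hypothetical helpers you would */
/* write yourself, and the Robot methods are the ones described above from the download. */
Queue<string> urlsToVisit = new Queue<string>();
HashSet<string> visited = new HashSet<string>();
Random delay = new Random();

string startURL = "http://blog.strictly-software.com";
urlsToVisit.Enqueue(startURL);

/* parse the site's robots.txt once before we start crawling */
Robot.ParseRobotsTxtFile(startURL);

while (urlsToVisit.Count > 0)
{
	string url = urlsToVisit.Dequeue();
	if (visited.Contains(url)) continue;
	visited.Add(url);

	/* always respect the robots.txt rules before requesting anything */
	if (!Robot.URLIsAllowed(url)) continue;

	string html = Robot.getURLContent(url);
	if (html == null) continue;

	/* save the content to file or database */
	SaveContent(url, html);

	/* parse the HTML, pull out the links and queue them up */
	foreach (string link in ExtractLinks(html))
	{
		urlsToVisit.Enqueue(link);
	}

	/* wait a few random seconds between requests (see the snippet further down) */
	System.Threading.Thread.Sleep(delay.Next(3, 10) * 1000);
}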

Remember you should always use a valid user-agent when crawling rather than spoofing a browser's or another robot's agent. If you identify yourself properly you will have less chance of being banned. If you spoof another bot such as Googlebot you might fall foul of a white list that links known crawler agents with IP addresses. Also, try not to hammer sites as this is a sure-fire way of getting blocked. Use a randomly generated delay of a few seconds between each request, as this will help mask your visits in between those of other visitors e.g:


/* wait a random number of seconds in between requests to make us look like normal users */
/* also we don't want to hammer the site's server. Behave and they may not kick you off!! */
Random random = new Random();
int waitfor;

/* create a random integer between 10 and 34 (the upper bound of Next is exclusive) */
waitfor = random.Next(10, 35);

/* lets hang about a bit - Thread.Sleep takes milliseconds so convert the seconds */
System.Threading.Thread.Sleep(waitfor * 1000);

There is nothing like causing an unusually high load on a server to raise suspicions, so behave and you won't get blocked. Read my tips for crawlers if you want more details.

Robots.txt Parser Test Harness

The file Program.cs contains a simple console application to test the code.


class Program
{
	static void Main(string[] args)
	{
		Console.WriteLine("Start Bot Crawl Test");
					
		string SiteURL = "http://blog.strictly-software.com";            
		
		/* Pass in any URL from the site and this method will build the   */
		/* correct path to the site's Robots.txt file and then parse it.  */
		Robot.ParseRobotsTxtFile(SiteURL);


		/* This URL will be allowed by the robots.txt */
		string URL = "http://blog.strictly-software.com/unpacker.asp";

		/* Once parsed we can check whether URLs can be accessed following */
		/* the rules in the sites Robots.txt file. */
		if (Robot.URLIsAllowed(URL))
		{
			Console.WriteLine("This URL is allowed by the Robots.txt file");
		}
		else
		{
			Console.WriteLine("This URL is NOT allowed by the Robots.txt file");
		}

		/* This URL won't be allowed as a command in my robots.txt will block it */
		URL = "http://blog.strictly-software.com/search?q=robot";	   				   
		
		if (Robot.URLIsAllowed(URL))
		{
			Console.WriteLine("This URL is allowed by the Robots.txt file");
		}
		else
		{
			Console.WriteLine("This URL is NOT allowed by the Robots.txt file");
		}

		Console.WriteLine("End Bot Crawl Test");
	}
}

As you can see, I am accessing the Robots.txt file from my own blog, which contains the following rules:


User-agent: Mediapartners-Google
Disallow: 

User-agent: *
Disallow: /search

Sitemap: http://blog.strictly-software.com/feeds/posts/default?orderby=updated

It's a pretty basic Robots.txt file. It allows Google's AdSense bot (Mediapartners-Google) access to the whole of my site by giving it a blank Disallow: command; when the value is left blank nothing is disallowed, which effectively means the agent is allowed access to everything. The only restricted part of the site is the /search path, which is blocked for all agents under the User-agent: * record.

Although the only standard commands are Disallow and User-agent, some crawlers such as Googlebot also obey commands such as Allow and Sitemap.
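
For example, a Robots.txt such as the following (a made-up illustration rather than my own file) blocks a whole directory for every agent but uses Allow to open up a single file within it, and Sitemap to point crawlers at an XML sitemap:

User-agent: *
Disallow: /private/
Allow: /private/public-report.html

Sitemap: http://www.example.com/sitemap.xml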

I have set the user-agent that the Robot uses to my own string Mozilla 5.0; RobsRobot 1.2; www.strictly-software.com; which means that when I call the Robot.URLIsAllowed(URL) method it should report that I am allowed to access the URL http://blog.strictly-software.com/unpacker.asp. To test that the opposite is true, change the URL passed into the URLIsAllowed method to http://blog.strictly-software.com/search?q=robot, as this falls foul of the second rule in my Robots.txt file which blocks all agents from accessing the path /search e.g:


/* This URL will not be allowed due to a command in my robots.txt */
string URL = "http://blog.strictly-software.com/search?q=robot";

As you will see when running the application from the console, the code handles the fact that the rules in the robots.txt are relative paths while the URL passed to the function is absolute and carries a querystring (see the short example after the method). The code that handles this is the URLIsAllowed method:


public static bool URLIsAllowed(string URL)
{            
	/* If we have no URLS stored in our blocked array exit now */
	if (_BlockedUrls.Count == 0) return true;

	/* Convert our string into an Uri object so we can easily access the */
	/* relative path excluding the host and domain etc. */
	Uri checkURL = new Uri(URL);
	URL = checkURL.AbsolutePath.ToLower();

	Console.WriteLine("Is user-agent: " + _robotAgent + " allowed access to URL: " + URL);

	/* if URL is the /robots.txt then don't allow it as we should use the ParseRobotsTxtFile */
	/* method to parse that file */
	if (URL == "/robots.txt")
	{
		return false;
	}
	else
	{
		/* iterate through our array checking whether the URL starts with a blocked path */
		foreach (string blockedURL in _BlockedUrls)
		{
			if (URL.StartsWith(blockedURL, StringComparison.Ordinal))
			{
				Console.WriteLine("Blocked URL: " + blockedURL);

				/* found a DISALLOW rule that matches */
				return false;
			}
		}
	}

	return true;
}
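
A quick way to see why the querystring causes no problems is to check what Uri.AbsolutePath returns for one of the test URLs:

/* AbsolutePath drops the scheme, host and querystring, leaving just the path */
Uri test = new Uri("http://blog.strictly-software.com/search?q=robot");
Console.WriteLine(test.AbsolutePath);   /* prints /search */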

As you can see from this example, it's pretty easy to create code to accurately parse a Robots.txt file in .NET. The logic behind the parser is very simple and can easily be converted to any other language.

Remember, if you're going to be crawling other people's sites, make sure you are obeying each site's Terms and Conditions, especially if you are taking content to be displayed on another site. Most sites that are happy for certain parts of their content to be accessed and used by the public will provide an XML or RSS feed, so look for a feed first. As well as saving you the time of writing code to crawl and parse the site's content, you will be saving server bandwidth and reducing load, which is always a good thing. I have banned numerous bots for continuing to take content by scraping when they could easily have used the available feeds, for the simple reason that heavy loads affect all sites on my shared servers, and it's just bad manners and looks careless. I also ban bots that don't even bother looking at my robots.txt file, or that look at it and then don't follow the rules specified within it. Therefore it's in your own best interests to make sure any crawler you use has the capability to load, parse and follow the commands specified in a Robots.txt file.

Post Comments

As this script is not part of the blog, if you would like to post comments please click this link and then respond to the following article.