The file Program.cs contains a simple console application to test the code.
using System;

class Program
{
    static void Main(string[] args)
    {
        Console.WriteLine("Start Bot Crawl Test");

        string SiteURL = "http://blog.strictly-software.com";

        /* Pass in the root URL for the site, or any URL on the site, and this */
        /* method will build the correct path to the site's Robots.txt file */
        /* and then parse it. */
        Robot.ParseRobotsTxtFile(SiteURL);

        /* This URL will be allowed by the Robots.txt file */
        string URL = "http://blog.strictly-software.com/unpacker.asp";

        /* Once parsed we can check whether URLs can be accessed following */
        /* the rules in the site's Robots.txt file. */
        if (Robot.URLIsAllowed(URL))
        {
            Console.WriteLine("This URL is allowed by the Robots.txt file");
        }
        else
        {
            Console.WriteLine("This URL is NOT allowed by the Robots.txt file");
        }

        /* This URL won't be allowed as a command in my Robots.txt file will block it */
        URL = "http://blog.strictly-software.com/search?q=robot";

        if (Robot.URLIsAllowed(URL))
        {
            Console.WriteLine("This URL is allowed by the Robots.txt file");
        }
        else
        {
            Console.WriteLine("This URL is NOT allowed by the Robots.txt file");
        }

        Console.WriteLine("End Bot Crawl Test");
    }
}
As you can see, I am accessing the Robots.txt file from my own blog, which contains the following rules:
User-agent: Mediapartners-Google
Disallow:
User-agent: *
Disallow: /search
Sitemap: http://blog.strictly-software.com/feeds/posts/default?orderby=updated
It's a pretty basic Robots.txt file. The first group gives Google's AdSense bot (Mediapartners-Google) access to the whole of my site by using a blank Disallow: command. When the value is blank nothing is disallowed, so the command effectively says to allow access to everything. The only restricted part of the site is anything under the /search path, which is blocked for all other agents by the User-agent: * group.
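To make the blank Disallow: behaviour concrete, here is a minimal sketch of how a single Disallow value could be tested against a URL path. The class DisallowRuleSketch and the method IsBlocked are hypothetical names made up for this illustration and are not part of the Robot class; the sketch also assumes a plain prefix match with no wildcard support.

```csharp
using System;

public static class DisallowRuleSketch
{
    /* Hypothetical helper: returns true when one Disallow value blocks the path. */
    public static bool IsBlocked(string disallowValue, string path)
    {
        /* A blank Disallow: blocks nothing, so everything is allowed. */
        if (string.IsNullOrWhiteSpace(disallowValue))
        {
            return false;
        }

        /* Assumption: a rule is a simple prefix match on the path. */
        return path.StartsWith(disallowValue.Trim(), StringComparison.Ordinal);
    }

    public static void Main()
    {
        Console.WriteLine(IsBlocked("", "/unpacker.asp"));          /* False */
        Console.WriteLine(IsBlocked("/search", "/search?q=robot")); /* True */
        Console.WriteLine(IsBlocked("/search", "/unpacker.asp"));   /* False */
    }
}
```

Because this is a prefix match, a rule of /search would also block a path such as /searching, so it is slightly broader than a directory.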
Although the original Robots.txt standard only defines the User-agent and Disallow commands, some crawlers such as Googlebot also obey commands such as Allow and Sitemap.
I have set the user-agent that the Robot uses to my own string: Mozilla 5.0; RobsRobot 1.2; www.strictly-software.com;
As this does not match Mediapartners-Google, only the User-agent: * rules apply to it, which means that when I call the Robot.URLIsAllowed(URL) method it should report that I am allowed to access the URL
http://blog.strictly-software.com/unpacker.asp.
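As a rough illustration of why my user-agent ends up under the wildcard rules, the sketch below picks the User-agent group that applies to a crawler. AgentGroupSketch and SelectGroup are hypothetical names, and the case-insensitive substring match is an assumption made for this example; the Robot class may well choose its group differently.

```csharp
using System;
using System.Collections.Generic;

public static class AgentGroupSketch
{
    /* Hypothetical helper: returns the first group token found inside the */
    /* crawler's user-agent string, otherwise falls back to the * group. */
    /* Assumption: the Robots.txt file always contains a * group. */
    public static string SelectGroup(IEnumerable<string> groupTokens, string userAgent)
    {
        foreach (string token in groupTokens)
        {
            if (token != "*" &&
                userAgent.IndexOf(token, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                return token;
            }
        }

        return "*";
    }

    public static void Main()
    {
        string[] groups = { "Mediapartners-Google", "*" };
        string agent = "Mozilla 5.0; RobsRobot 1.2; www.strictly-software.com;";

        Console.WriteLine(SelectGroup(groups, agent)); /* prints * */
    }
}
```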
To test that the opposite is true, change the URL that is passed into the URLIsAllowed method to http://blog.strictly-software.com/search?q=robot. This will fall foul of the second rule in my Robots.txt file, which blocks all agents from accessing the path /search, e.g.:
/* This URL will not be allowed due to a command in my robots.txt */
string URL = "http://blog.strictly-software.com/search?q=robot";
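The URL above is absolute and has a querystring, whereas the rule it has to be compared with is a relative path. One way to bridge that gap, shown here only as a sketch, is to let System.Uri strip the scheme and host before comparing. RelativeRuleSketch and ToPathAndQuery are hypothetical names for this example, and this is not necessarily how URLIsAllowed does it.

```csharp
using System;

public static class RelativeRuleSketch
{
    /* Hypothetical helper: reduces an absolute URL to its path and querystring. */
    public static string ToPathAndQuery(string absoluteUrl)
    {
        Uri uri = new Uri(absoluteUrl);
        return uri.PathAndQuery;
    }

    public static void Main()
    {
        string relative = ToPathAndQuery("http://blog.strictly-software.com/search?q=robot");

        Console.WriteLine(relative); /* /search?q=robot */

        /* The relative form can now be compared with the Disallow: /search rule. */
        Console.WriteLine(relative.StartsWith("/search", StringComparison.Ordinal)); /* True */
    }
}
```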
As you will see when running the application from the console, the code handles the fact that the rule in the Robots.txt file is relative while the URL passed to the function is absolute and has a querystring. The code that handles this is the URLIsAllowed method: