I’ve been asked about this file on many projects I’ve worked on. It resides in the root of the website and nothing links to it, yet there are usually a lot of requests for it in the server logs (or “404 Not Found” errors if it doesn’t exist).
Additionally, automated security audit software will often flag this file as a possible security problem, because it can expose hidden areas of your website (more on this later).
Here’s what it’s all about….
ROBOTS.TXT is used by spiders and robots, primarily so that they can index your website for search engines (which is usually a good thing). However, there are times when you don’t want this to occur. Some spiders/robots can be too aggressive on your servers, consuming precious bandwidth and CPU time, or they can dig too deep into your content. As such, you might want to control their access.
The Robots Exclusion Protocol sets out several ways to accomplish this goal. Of course, this only works if the spider chooses to comply with the convention; nothing forces it to.
1. ROBOTS.TXT can be used to limit access:
#robots.txt - for info see http://www.robotstxt.org/wc/robots.html
2. A <meta> tag on each webpage indicating spider actions to take.
<meta name="robots" content="INDEX, FOLLOW, ALL" />
Values (there are a few other attributes, but these are the most common):
INDEX - index this page
NOINDEX - do not index this page
FOLLOW - follow links from this page
NOFOLLOW - do not follow links from this page
ALL - same as INDEX, FOLLOW
NONE - same as NOINDEX, NOFOLLOW
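To make the values above concrete, here is a minimal Python sketch of how a spider might turn the content of a robots <meta> tag into two yes/no decisions. The function name parse_robots_meta is hypothetical, and the "index and follow unless told otherwise" default is an assumption of this sketch rather than something every spider is guaranteed to do.

```python
def parse_robots_meta(content: str) -> dict:
    """Interpret the content attribute of a robots <meta> tag.

    Returns a dict with two booleans: 'index' and 'follow'.
    Assumption: a page is indexed and its links followed unless
    a value says otherwise.
    """
    index, follow = True, True
    for token in content.split(","):
        token = token.strip().upper()  # values are treated as case-insensitive here
        if token == "INDEX":
            index = True
        elif token == "NOINDEX":
            index = False
        elif token == "FOLLOW":
            follow = True
        elif token == "NOFOLLOW":
            follow = False
        elif token == "ALL":    # same as INDEX, FOLLOW
            index, follow = True, True
        elif token == "NONE":   # same as NOINDEX, NOFOLLOW
            index, follow = False, False
        # Unknown values are ignored in this sketch.
    return {"index": index, "follow": follow}


print(parse_robots_meta("INDEX, FOLLOW, ALL"))  # {'index': True, 'follow': True}
print(parse_robots_meta("NOINDEX, FOLLOW"))     # {'index': False, 'follow': True}
```

Values are applied left to right, so a later value overrides an earlier one. That is a simplification chosen for the sketch; a real spider may resolve conflicting values differently.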
In most cases, a spider/robot will first request the ROBOTS.TXT file and then start indexing the site. In that file you can exclude all spiders, or only specific ones, from individual files or directories.
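As a rough illustration of how a well-behaved spider checks those rules, the sketch below feeds a made-up ROBOTS.TXT to Python's standard urllib.robotparser module. The file contents, the robot names BadBot and GoodBot, and the paths are all hypothetical, chosen only to show one spider being shut out entirely while everyone else is kept out of a single directory and a single file.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: BadBot is excluded from the whole site,
# every other robot is excluded from one directory and one file.
SAMPLE_ROBOTS_TXT = """\
# robots.txt - hypothetical example
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /search/
Disallow: /print.html
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network request)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

print(parser.can_fetch("BadBot", "/index.html"))       # False
print(parser.can_fetch("GoodBot", "/index.html"))      # True
print(parser.can_fetch("GoodBot", "/search/results"))  # False
print(parser.can_fetch("GoodBot", "/print.html"))      # False
```

The check happens entirely on the spider's side, which is why a robot that ignores the convention is not stopped by anything in this file.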
As for the earlier point on security: since this file is available to anyone, you should never list sensitive areas of your website in it, as that would give an attacker an easy way to find them.