What is the Robots.txt file?
The robots.txt file is a plain text document placed in the root of your site that contains instructions to robots (search engines or any other type of crawlers) on which pages should or should not be accessible to them. By using this file properly, you can prevent search engines or other crawlers from accessing or indexing certain parts of your site.
A typical robots.txt file:
User-agent: *
Disallow: /temp/
Disallow: /private/
Disallow: /user/
Specific syntax must be used to make these exclusions. You can give individual instructions to each crawler separately, or give general instructions that all crawlers should follow. You can exclude individual pages or entire subdirectories, or use wildcards to do simple pattern matching for URL parameters or other partial matches.
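To illustrate how general and crawler-specific instructions interact, here is a minimal sketch using Python's standard urllib.robotparser module. The crawler names "ExampleBot" and "OtherBot" and the example.com URLs are invented for illustration, and the standard-library parser is only an approximation of how any particular search engine reads the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: "ExampleBot" (an invented crawler name) is excluded from
# the whole site, while every other crawler is only kept out of two folders.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /temp/
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(agent: str, url: str) -> bool:
    """Return True if the parsed rules let `agent` crawl `url`."""
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent in ("ExampleBot", "OtherBot"):
        for url in ("http://www.example.com/about.html",
                    "http://www.example.com/temp/file.html"):
            print(agent, url, is_allowed(agent, url))
```

A crawler that has its own User-agent group follows that group; any crawler without one falls back to the general "*" group.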
The file must be named "robots.txt" and placed directly in the root of the site. Because of this, each subdomain is treated separately and needs its own robots.txt file. For example, you will need both http://www.example.com/robots.txt and http://sub.example.com/robots.txt for both of these subdomains to be treated correctly.
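Because the file always sits at the root of its own host, the robots.txt location for any page can be derived from the page URL. The helper below is a small sketch of that rule; the function name robots_url is an assumption made for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the scheme and host of `page_url`."""
    parts = urlsplit(page_url)
    # Keep scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_url("http://www.example.com/products/item.html?id=3"))
    print(robots_url("http://sub.example.com/blog/"))
```

The two calls print different robots.txt URLs, one per subdomain, which is why each subdomain needs its own file.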
The robots.txt file is a guideline for crawlers, but the file itself does nothing to enforce these guidelines. Most major search engines and polite crawlers will cooperate by following the guidelines set out in the robots.txt file, but there is no guarantee that all crawlers will obey them.
What does it mean if a URL is blocked by robots.txt?
If a URL is blocked by the robots.txt file, it has been excluded from crawling by instructions in that file. Search engines that obey the file will not read the content of the page, so the page is unlikely to rank for that content and you should expect little or no direct organic traffic to this URL. Note that a blocked URL can still appear in a search engine's index, without its content, if other pages link to it.
What Should I Do With This Data?
Look through the URLs to see if there are any pages that you want accessible via organic search. If you find any pages that are unintentionally in this list, you will want to modify the robots.txt file so that they are not blocked.
You may also want to look through the list to see whether any URLs that you expected to be blocked by robots.txt are missing from it. If so, the robots.txt file may not be written correctly. (However, it's also possible that Dragon Metrics simply did not include these pages in the crawl.)
Some Guidelines on Writing a Good Robots.txt File
- Do not disallow any URL that you want ranked on search engines
- Use the robots.txt to exclude pages with duplicate content
- Learn to use wildcards to do pattern matching in URLs. This can be very effective for large sites, or sites with many URL parameters. Sites that use URL parameters often cause duplicate content issues.
- Remember that excluding a page does not guarantee it will not be crawled or indexed. Keep private data offline or behind a secure login.
- Remember each subdomain needs its own robots.txt file
- Test your robots.txt file before uploading - a mistake can be costly and undo months of SEO efforts
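The wildcard guideline above can be sketched in code. This example assumes the conventions supported by the major search engines, where "*" matches any run of characters and a trailing "$" anchors the end of the URL. The helper names and sample rules are hypothetical, and Allow rules and rule precedence are deliberately left out, so treat it as an illustration of the matching idea rather than a full parser.

```python
import re


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate one Disallow path pattern into a compiled regex."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape literal text and turn each '*' into "any run of characters".
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(path: str, disallow_rules: list[str]) -> bool:
    """Return True if `path` matches any non-empty Disallow pattern."""
    # re.match anchors at the start, so rules behave as prefix matches.
    return any(rule_to_regex(r).match(path) for r in disallow_rules if r)


if __name__ == "__main__":
    # Hypothetical rules: block session-id parameters and PDF files.
    rules = ["/*?sessionid=", "/*.pdf$"]
    for path in ("/products/shoes?sessionid=abc123",
                 "/products/shoes",
                 "/docs/guide.pdf"):
        print(path, is_blocked(path, rules))
```

A pattern such as "/*?sessionid=" blocks every URL carrying that parameter with a single rule, which is how wildcards help on sites where URL parameters create duplicate content.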
How Can I Test My Robots.txt File?
Google Webmaster Tools offers an excellent robots.txt debugger. Simply log into your Google Webmaster Tools account and go to Crawl > Blocked URLs. From there you can paste in your robots.txt file and the URLs you want to test, to verify whether the pages are being blocked as you intended.
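For a rough check before uploading, a draft file can also be tested locally. The sketch below uses Python's standard urllib.robotparser module; the function name blocked_urls and the example.com URLs are assumptions for illustration. The standard-library parser does plain prefix matching and does not interpret wildcards, so its results may differ from a search engine's own tester.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt: str, urls: list[str], agent: str = "*") -> list[str]:
    """Return the subset of `urls` that `robots_txt` disallows for `agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if not parser.can_fetch(agent, u)]


if __name__ == "__main__":
    draft = """\
User-agent: *
Disallow: /temp/
Disallow: /private/
Disallow: /user/
"""
    to_check = [
        "http://www.example.com/",
        "http://www.example.com/user/profile",
        "http://www.example.com/blog/post",
    ]
    for url in blocked_urls(draft, to_check):
        print("blocked:", url)
```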