The robots.txt file is a plain text document placed in the root of your site. It contains instructions for robots (search engine crawlers or any other type of crawler) about which pages they should or should not access. By using this file properly, you can ask search engines and other cooperative crawlers to stay out of certain parts of your site.
A typical robots.txt file:
Specific syntax must be used to make these exclusions. You can give individual instructions to each crawler separately, or give general instructions that all crawlers should follow. You can exclude individual pages or entire subdirectories, or use wildcards to do simple pattern matching for URL parameters or other partial matches.
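To make the syntax concrete, here is a minimal sketch in Python that embeds a small robots.txt file and asks the standard library's urllib.robotparser how it would be interpreted. The rules, the crawler names (ExampleBot, OtherBot) and the paths are hypothetical and chosen only for illustration. The file gives general instructions to all crawlers, then separate instructions to one specific crawler, and excludes both a whole subdirectory and a single page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; not taken from a real site.
SAMPLE_ROBOTS_TXT = """\
# General instructions that apply to every crawler
User-agent: *
Disallow: /admin/
Disallow: /drafts/private.html

# Separate instructions for one specific crawler
User-agent: ExampleBot
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)
base = "http://www.example.com"

# A crawler with no group of its own falls back to the "*" group.
print(parser.can_fetch("OtherBot", base + "/index.html"))           # True
print(parser.can_fetch("OtherBot", base + "/admin/settings"))       # False
print(parser.can_fetch("OtherBot", base + "/drafts/private.html"))  # False
print(parser.can_fetch("OtherBot", base + "/drafts/public.html"))   # True

# ExampleBot has its own group, which excludes the entire site.
print(parser.can_fetch("ExampleBot", base + "/index.html"))         # False
```

Note that this parser reflects how one library reads the file. Individual crawlers may interpret edge cases differently.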
The file must be named "robots.txt" and placed directly in the root of the site. Because of this, each subdomain is treated separately and needs its own robots.txt file. For example, you will need both http://www.example.com/robots.txt and http://sub.example.com/robots.txt for both of these subdomains to be treated correctly.
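The per-subdomain rule can be illustrated with a short Python sketch. The helper name robots_url is hypothetical; it simply keeps the scheme and host of any page URL and replaces everything else with /robots.txt, which is where a crawler would be expected to look.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt location that governs the given page URL."""
    parts = urlsplit(page_url)
    # Keep scheme and host; drop the path, query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.example.com/shop/item?id=7"))
# http://www.example.com/robots.txt
print(robots_url("http://sub.example.com/blog/post.html"))
# http://sub.example.com/robots.txt
```

Because the host is part of the result, two subdomains never share a file, even when they are served from the same machine.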
The robots.txt file is a guideline for crawlers, but the file itself does nothing to enforce these guidelines. Most major search engines and polite crawlers will cooperate by following the guidelines set out in the robots.txt file, but there is no guarantee that all crawlers will obey them.
Guidelines for Writing a Good Robots.txt File
- Do not disallow any URL that you want ranked on search engines
- Use robots.txt to exclude pages with duplicate content
- Learn to use wildcards to do pattern matching in URLs. This can be very effective for large sites, or sites with many URL parameters, because URL parameters often cause duplicate content issues.
- Remember that excluding a page does not guarantee it will not be crawled or indexed. Keep private data offline or behind a secure login.
- Remember each subdomain needs its own robots.txt file
- Test your robots.txt file before uploading it. A mistake can be costly and undo months of SEO effort.
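The wildcard and testing guidelines above can be combined into a small pre-upload check. The following Python sketch assumes the common convention in which * matches any run of characters and a trailing $ anchors the end of the URL. It uses its own simple matcher rather than a library, and the patterns, paths and function names are hypothetical. It only models Disallow rules, so treat it as a starting point and not as a full robots.txt implementation.

```python
import re


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a compiled regular expression."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal text and let each "*" match any run of characters.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += r"\Z"
    return re.compile(regex)


def is_blocked(path: str, disallow_patterns: list) -> bool:
    """True if any non-empty Disallow pattern matches from the start of the path."""
    return any(pattern_to_regex(p).match(path) for p in disallow_patterns if p)


def find_mismatches(disallow_patterns: list, expectations: dict) -> list:
    """Return the paths whose actual result differs from the expected one."""
    return [
        path
        for path, should_block in expectations.items()
        if is_blocked(path, disallow_patterns) != should_block
    ]


# Hypothetical rules: session IDs in parameters, PDF files, internal search pages.
DISALLOW = ["/*?sessionid=", "/*.pdf$", "/search/"]

# Paths we care about, mapped to whether they should be blocked.
EXPECTED = {
    "/products": False,
    "/products?sessionid=abc": True,
    "/files/report.pdf": True,
    "/files/report.pdf?download=1": False,
    "/search/results": True,
}

print(find_mismatches(DISALLOW, EXPECTED))  # [] means every expectation holds
```

Listing the pages you want ranked with an expected value of False is a cheap way to catch a rule that blocks more than intended.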