The robots.txt file is a plain text document placed at the root of your site that gives instructions to robots (search engine spiders and any other type of crawler) about which pages should or should not be accessible to them. Used properly, this file asks search engines and other crawlers to stay out of certain parts of your site.

A typical robots.txt file:

User-agent: *
Disallow: /temp/
Disallow: /private/
Disallow: /user/

These exclusions must follow a specific syntax. You can give individual instructions to each crawler separately, or general instructions that all crawlers should follow. You can exclude individual pages or entire subdirectories, or use wildcards for simple pattern matching against URL parameters or other partial matches, as in the sketch below.
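
For example, a file like the following (a rough sketch; the paths and the sessionid parameter are placeholders) gives Googlebot its own rules while every other crawler follows the general group, and uses the * wildcard to match any URL containing a sessionid parameter:

User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow: /temp/
Disallow: /*?sessionid=

Note that a crawler obeys only the most specific group that matches it (so Googlebot here would follow its own rules and ignore the general ones), and that wildcards inside paths are an extension honored by most major search engines rather than part of the original standard.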

The file must be named "robots.txt" and placed directly at the root of the site. Because of this, each subdomain is treated separately and needs its own robots.txt file (e.g. you will need both http://www.example.com/robots.txt and http://sub.example.com/robots.txt for both of these subdomains to be covered).

The robots.txt file is a guideline for crawlers, but the file itself does nothing to enforce these guidelines. Most major search engines and polite crawlers will cooperate by following the guidelines set out in the robots.txt file, but there is no guarantee that all crawlers will obey them.

Some Guidelines on Writing a Good Robots.txt File

  • Do not disallow any URL that you want to rank in search engines.

  • Learn to use wildcards to do pattern matching in URLs, as in the example above. This can be very effective for large sites or sites with many URL parameters, which often cause duplicate content issues.

  • Remember that excluding a page does not guarantee it will never be crawled, since not all crawlers obey the file. Keep private data offline or behind a secure login.

  • Keep in mind that excluding a page does not keep it out of search engine indexes. If other pages link to the URL, it can still be indexed without being crawled. If you want to block the URL from being indexed, use another method such as the noindex directive (see the sketch after this list).

  • Remember that each subdomain needs its own robots.txt file.

  • Test your robots.txt file before uploading it; a mistake can be costly and undo months of SEO effort. (A minimal testing sketch follows this list.)
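
To illustrate the noindex point above, here is a minimal sketch of the robots meta tag, placed in the HTML head of the page you want kept out of the index (the X-Robots-Tag HTTP response header is the equivalent for non-HTML files):

<meta name="robots" content="noindex">

Note that crawlers can only see this tag if they are allowed to fetch the page, so do not also disallow the same URL in robots.txt.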
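
For the testing point, one quick way to sanity-check a file is Python's built-in urllib.robotparser module. This is a rough sketch, and the example.com URLs are placeholders:

# Check which URLs a given robots.txt allows, using Python's
# standard-library parser. Replace the URL with your own site.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()  # fetch and parse the live file

# can_fetch(user_agent, url) is True if that crawler may fetch the URL
print(rp.can_fetch("*", "https://www.example.com/private/page.html"))  # expect False
print(rp.can_fetch("*", "https://www.example.com/index.html"))         # expect True

Keep in mind that this parser follows the original robots.txt specification and may not honor the wildcard extensions described above; for those, use the testing tools the search engines themselves provide.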
