Dragonbot is the name of the crawler Dragon Metrics uses to collect on-site data for features such as Site Explorer and Site Audit.

Crawl Frequency

If crawling is enabled for a campaign, Dragon Metrics will schedule Dragonbot to run a crawl of the website within a few hours. After the initial crawl, Dragonbot will return to your site approximately every 14 days. As soon as a crawl has completed, the data is made available in Site Explorer and Site Audit.

At this time, it is not possible to request additional crawls or to specify the time Dragonbot will visit your site. If you've recently made changes to your site, you will have to wait until the next crawl date to see them reflected in Dragon Metrics.

How to Identify Dragonbot

Dragonbot identifies itself as "dragonbot" in the user agent field of the HTTP request header when it crawls a site, so you can spot its requests in your server logs.
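
For example, if your server keeps access logs, you can find Dragonbot's requests by filtering log lines on the user agent string. Here's a minimal sketch in Python; the log path is only an example and should be adjusted for your server:

# Print every access-log line whose user agent contains "dragonbot".
# The log path is an example; adjust it for your server and log format.
log_path = "/var/log/nginx/access.log"

with open(log_path, encoding="utf-8", errors="replace") as log:
    for line in log:
        if "dragonbot" in line.lower():
            print(line.rstrip())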

Politeness

Dragonbot respects webmasters' wishes and server resources by adhering to crawler politeness rules, including but not limited to:

  • Respecting robots.txt directives
  • Limiting the rate and number of requests made to the server
  • Identifying itself in the user agent string of the HTTP request header

Crawl Scope

The website address you add when creating the campaign determines the scope of the crawl. Dragonbot will not crawl outside this scope. The table below shows some examples:

Website | Scope
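
As a rough illustration, scoping like this typically keys on the host and path prefix of the address you enter, so www.example.com/blog would cover www.example.com/blog/post-1 but not shop.example.com. The sketch below shows that general idea in Python; it is an assumption about how scope matching works, not Dragon Metrics' exact rules:

# A sketch of host-and-path-prefix scope matching. This illustrates
# the general concept only, not Dragon Metrics' exact matching rules.
from urllib.parse import urlparse

def in_scope(campaign_url, candidate_url):
    campaign = urlparse(campaign_url)
    candidate = urlparse(candidate_url)
    return (candidate.netloc == campaign.netloc
            and candidate.path.startswith(campaign.path))

print(in_scope("https://www.example.com/blog", "https://www.example.com/blog/post-1"))  # True
print(in_scope("https://www.example.com/blog", "https://shop.example.com/"))            # False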

Restricting Dragonbot

Dragonbot respects robots.txt directives. You can prevent Dragonbot from crawling specific areas of your site by writing directives for the user agent "dragonbot". In the future, we plan to introduce features that will make this easier to do in Dragon Metrics without having to alter the robots.txt file.

For example, to prevent Dragonbot from crawling any page on your site, add the following code to the robots.txt file on the root domain (or any other subdomain Dragon Metrics is set to crawl):

User-agent: Dragonbot
Disallow: /

To prevent Dragonbot from crawling a directory on your site called "private", use the following code:

User-agent: Dragonbot
Disallow: /private

Dragonbot also supports wildcard characters. For example, to prevent Dragonbot from crawling any page that contains the word "user" anywhere in the URL path, use the following code:

User-agent: Dragonbot
Disallow: /*user

To prevent Dragonbot from crawling any page that contains URL parameters, use the following code:

User-agent: Dragonbot
Disallow: /*?
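
If you'd like to sanity-check rules like these before deploying them, Python's built-in robots.txt parser can show which URLs a rule blocks for the dragonbot user agent. Note that this built-in parser does not understand wildcard patterns such as /*?, so use it only to test plain path rules:

# Test which URLs a robots.txt rule blocks for "dragonbot".
# Python's built-in parser does not support wildcard rules like
# "Disallow: /*?", so only plain path prefixes are tested here.
from urllib import robotparser

rules = """
User-agent: Dragonbot
Disallow: /private
"""

parser = robotparser.RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("dragonbot", "https://www.example.com/private/report.html"))  # False
print(parser.can_fetch("dragonbot", "https://www.example.com/about"))                # True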

Crawl Issues

There are several reasons why Dragonbot may not find all of the pages on your site:

No Links to the Page

Dragonbot works the same way other web crawlers do: it crawls your home page and visits every link it finds there, then crawls all the links on each of those pages, and so on. If there are no links to the pages you want crawled, Dragonbot will not be able to find them.
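
The sketch below illustrates this discovery process in general terms, as a breadth-first traversal of your site's links; it is not Dragonbot's actual implementation:

# A minimal illustration of how a crawler discovers pages by following
# links breadth-first from the home page. General sketch only; not
# Dragonbot's actual code (no politeness delays, scoping, or robots.txt).
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    """Collects href values from <a> tags in an HTML document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=50):
    seen = {start_url}
    queue = deque([start_url])      # FIFO queue -> breadth-first order
    crawled = []
    while queue and len(crawled) < max_pages:
        url = queue.popleft()
        try:
            with urlopen(url) as response:
                html = response.read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue                # skip pages that fail to load
        crawled.append(url)
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            absolute = urljoin(url, href)   # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return crawled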

Links in JavaScript, iFrames, Java, or Flash

Dragonbot only crawls content and links in plain text. If links are embedded in JavaScript, iFrames, Java, Flash, or any other format besides plain text, Dragonbot will not be able to discover or follow them.

Page Size Too Large

Dragonbot only downloads the first 500KB of each page. If you have a very large page with links near the bottom, those links are unlikely to be discovered or crawled by Dragonbot.
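
In practice a cap like this means the crawler stops reading a response after the first 500KB, so links serialized past that point are never seen. A minimal sketch of a size-capped fetch using Python's standard library:

# Read at most the first 500KB of a page, as a size-capped crawler
# might. Any links appearing after the cutoff are never parsed.
from urllib.request import urlopen

MAX_BYTES = 500 * 1024  # 500KB cap

with urlopen("https://www.example.com/") as response:
    truncated_html = response.read(MAX_BYTES)

print(len(truncated_html))  # at most 512,000 bytes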

Too Many Links on Page

Dragonbot follows a maximum of 500 links per page. Links above this limit will not be followed or crawled.
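
Conceptually, after parsing a page the crawler keeps only the first 500 links it found and ignores the rest, as in this illustration:

# Keep only the first 500 links discovered on a page; anything past
# the cap is never queued for crawling. Illustrative only.
MAX_LINKS_PER_PAGE = 500

links = [f"/page-{n}" for n in range(1200)]  # stand-in for parsed links
followed = links[:MAX_LINKS_PER_PAGE]

print(len(followed))  # 500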

Outside of Website Scope

The website address you add when creating the campaign determines the scope of the crawl, and Dragonbot will not crawl outside this scope. It's possible that the pages missing from your crawl fall outside the scope of the website address as it was entered. See the table above for examples.

Redirecting Based on IP Address

Some multinational websites detect a visitor's IP address and redirect them to a different domain based on their location. (For example, users from the UK may be redirected to www.example.co.uk when they try to visit www.example.com, while users from China may be redirected to www.example.cn.)

When this happens, Dragonbot will often be unable to crawl the site, depending on how the redirects are implemented.

Outside Crawl Limits

Dragonbot is limited by the number of crawl credits assigned to each campaign. For example, if the crawl limit is set to 10,000, Dragonbot will only crawl 10,000 URLs on your site. URLs blocked by the robots.txt file do not count toward this limit.

Some sites are larger than their crawl limits. For these sites, Dragonbot crawls URLs prioritized by depth (link distance from the home page), so pages that are several links away from the home page may not be crawled.
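
Because a breadth-first crawl visits pages closest to the home page first, the deepest pages are the ones dropped when credits run out. A hedged sketch of this budgeting idea, where get_links is a hypothetical stand-in for fetching and parsing a page:

# Depth-prioritized crawling under a credit budget: a FIFO queue
# visits shallow pages first, so the deepest pages are the ones
# left uncrawled when the budget is exhausted. Illustrative only;
# get_links is a hypothetical stand-in for fetch-and-parse.
from collections import deque

CRAWL_CREDITS = 10_000

def crawl_with_budget(start_url, get_links):
    seen = {start_url}
    queue = deque([(start_url, 0)])  # (url, link distance from home page)
    crawled = []
    while queue and len(crawled) < CRAWL_CREDITS:
        url, depth = queue.popleft()
        crawled.append((url, depth))
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return crawled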

Disallowed by Robots.txt

Dragonbot follows robots.txt directives. If Dragonbot is blocked in this file, it will not crawl the site or the parts of the site that are blocked.

Content Behind Login

Any content that requires users to log in will not be accessible to Dragonbot.

Cookies

Some sites use cookies to redirect users, which may confuse Dragonbot and prevent it from crawling the site effectively.

Server Issues

It's possible that on occasion a site is temporarily down when Dragonbot crawls it. In this case, you will usually see the affected pages listed with an HTTP 503, 500, or other server error.

If you're having an issue not listed above, please contact Dragon Metrics support.
