Crawl Algorithm and Priority
When crawling in "Spider mode", Dragonbot prioritizes URLs by page depth, so it crawls the URLs closest to the starting page first. Page depth is the smallest number of steps (links followed) it takes to get to a page from the homepage.
Starting page in this context refers to the first set of URLs crawled. This is the value for "Website" in General Campaign Settings or "Other scope" if selected in Crawler Settings, in addition to any "Alternate Seed URLs" entered.
The starting page(s) have a depth of 0, and all the pages linked to from them have a depth of 1. Pages that are linked to from depth = 1 pages have a depth of 2, and so on. Keep in mind that depth is measured along the shortest route to the page. So if page A is linked to from both the home page and a depth = 3 page, page A's depth will be 1 (and not 4, since that is not the shortest path from the home page).
So when Dragonbot begins its crawl, it begins with the starting page(s) (depth = 0). It crawls each starting page, indexes its content, and runs a number of calculations on its data.
Next, Dragonbot will crawl all of the URLs that were linked to on the home page. These pages will have a depth of 1. After it has crawled all of the depth = 1 pages and indexed their content, Dragonbot will do the same for all the links it found on these pages (depth = 2). It will continue in this way until all links on the site have been crawled, or the crawl limit is reached.
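The depth-first-by-level order described above is a breadth-first traversal. The following is a minimal sketch of that idea, not Dragonbot's actual implementation. The `links` dictionary is a hypothetical stand-in for fetching and parsing real pages, and `crawl_order` and `crawl_limit` are names chosen for illustration.

```python
from collections import deque

def crawl_order(seeds, links, crawl_limit=None):
    """Return (url, depth) pairs in breadth-first order.

    `links` is a hypothetical dict mapping each URL to the URLs it links to;
    a real crawler would fetch and parse each page instead.
    """
    depth = {url: 0 for url in seeds}   # starting pages are depth 0
    queue = deque(depth)                # de-duplicated seeds, in order
    order = []
    while queue:
        if crawl_limit is not None and len(order) >= crawl_limit:
            break                       # crawl limit reached
        url = queue.popleft()
        order.append((url, depth[url]))
        for target in links.get(url, []):
            if target not in depth:     # first discovery is the shortest path
                depth[target] = depth[url] + 1
                queue.append(target)
    return order

# Page "/a" is linked from the home page and from the depth-3 page "/d".
site = {"/": ["/a", "/b"], "/b": ["/c"], "/c": ["/d"], "/d": ["/a"]}
for url, d in crawl_order(["/"], site):
    print(d, url)
```

Because a page's depth is recorded the first time it is discovered, the later link from "/d" does not change the depth of "/a", which stays at 1.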
Not every link on the site will be crawled, however. Dragonbot ignores pages that are outside of the scope (and therefore treated as external links), pages blocked by the robots.txt file, and links to files that are not HTML documents.
Dragonbot will follow 3xx redirects and meta refresh redirects, and continue crawling the redirect target URL, as long as it's within the scope of the site. If multiple redirects are linked together in a chain, Dragonbot will continue to follow them up to a maximum of 10 redirects. At that point, Dragonbot will give up, assuming it is an infinite redirect loop.
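The redirect handling can be sketched as a bounded loop. This is an illustration under stated assumptions: the `redirects` dictionary is a hypothetical stand-in for real 3xx responses and meta refresh tags, `in_scope` is a hypothetical scope check, and the code interprets the limit as "follow at most 10 hops, then give up if another redirect is needed".

```python
MAX_REDIRECTS = 10  # limit stated in the text above

def resolve_redirects(url, redirects, in_scope=lambda u: True):
    """Follow a redirect chain; return the final URL, or None if abandoned.

    `redirects` is a hypothetical dict mapping a URL to its redirect target.
    """
    hops = 0
    while url in redirects:
        if hops == MAX_REDIRECTS:
            return None          # give up: treated as an infinite loop
        url = redirects[url]
        hops += 1
        if not in_scope(url):
            return None          # target is outside the crawl scope
    return url

chain = {"/old": "/newer", "/newer": "/newest"}
loop = {"/x": "/y", "/y": "/x"}
print(resolve_redirects("/old", chain))  # /newest
print(resolve_redirects("/x", loop))     # None
```

A fixed hop limit avoids having to detect cycles explicitly, at the cost of also abandoning very long but finite chains.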
Dragonbot respects webmasters and their server resources by adhering to crawler politeness rules, including but not limited to:
- Respecting and following robots.txt directives
- Limiting the interval and number of server requests per minute
- Identifying itself in the user agent string of the HTTP request header
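The three politeness rules above could be combined roughly as follows. This is a sketch using Python's standard `urllib.robotparser`, not Dragonbot's code. The user agent string, the `PoliteFetcher` class, and the one-second default interval are all assumptions made for illustration; the article does not state Dragonbot's real values.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleBot/1.0"  # hypothetical; not Dragonbot's real user agent

class PoliteFetcher:
    def __init__(self, robots_txt, min_interval=1.0):
        self.robots = RobotFileParser()
        self.robots.parse(robots_txt.splitlines())  # parse rules from text
        self.min_interval = min_interval            # seconds between requests
        self.last_request = None

    def allowed(self, url):
        """Check the robots.txt rules before requesting a URL."""
        return self.robots.can_fetch(USER_AGENT, url)

    def wait_turn(self):
        """Sleep so consecutive requests are at least min_interval apart."""
        if self.last_request is not None:
            remaining = self.min_interval - (time.monotonic() - self.last_request)
            if remaining > 0:
                time.sleep(remaining)
        self.last_request = time.monotonic()

    def headers(self):
        """Identify the crawler in the User-Agent request header."""
        return {"User-Agent": USER_AGENT}

fetcher = PoliteFetcher("User-agent: *\nDisallow: /private/")
print(fetcher.allowed("http://www.example.com/private/report.html"))  # False
print(fetcher.allowed("http://www.example.com/index.html"))           # True
```

A crawl loop would call `allowed()` first, then `wait_turn()`, then send the request with `headers()`.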
The website address you add when creating the campaign determines the scope of the crawl. Dragonbot will not crawl outside this scope. The table below shows some examples:
Dragonbot will normalize URLs before crawling to avoid unnecessarily crawling duplicate URLs. This process treats a group of URLs as a single URL when their syntax is very similar and web browsers and search engines usually treat them as the same URL.
For example, in the group below, Dragonbot will only crawl the first URL, since it ignores everything after the "#" sign. This is because the links simply point to different areas of the same page.
Dragonbot's URL normalization includes (but is not limited to) processes such as:
- Converting the scheme and host to lower case (HTTP://www.Example.com/ → http://www.example.com/)
- Capitalizing letters in escape sequences (http://www.example.com/a%c2%b1b → http://www.example.com/a%C2%B1b)
- Decoding percent-encoded octets of unreserved characters. (http://www.example.com/%7Eusername/ → http://www.example.com/~username/)
- Removing the default port. (http://www.example.com:80/bar.html → http://www.example.com/bar.html)
- Adding trailing / (http://www.example.com/alice → http://www.example.com/alice/)
- Removing the fragment (http://www.example.com/bar.html#section1 → http://www.example.com/bar.html)
- Sorting the query parameters (http://www.example.com/display?lang=en&article=fred → http://www.example.com/display?article=fred&lang=en)
- Removing the "?" when the query is empty (http://www.example.com/display? → http://www.example.com/display)
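Most of the normalization steps listed above can be sketched with Python's standard `urllib.parse`. This is an illustrative approximation rather than Dragonbot's actual routine. The `normalize` and `_fix_escapes` names are hypothetical, and the trailing-slash step is deliberately left out because the rule for telling a directory from a file is not specified here.

```python
import re
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}
UNRESERVED = ("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
              "0123456789-._~")

def _fix_escapes(text):
    """Decode escapes of unreserved characters; upper-case all others."""
    def repl(match):
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else match.group(0).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", repl, text)

def normalize(url):
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()   # note: drops any userinfo
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"       # keep only non-default ports
    path = _fix_escapes(parts.path)
    params = sorted(p for p in parts.query.split("&") if p)
    query = _fix_escapes("&".join(params))  # empty query drops the "?"
    return urlunsplit((scheme, host, path, query, ""))  # fragment removed

print(normalize("HTTP://www.Example.com:80/bar.html#section1"))
# http://www.example.com/bar.html
```

Two URLs can then be treated as duplicates when their normalized forms are equal, for example by storing normalized URLs in a set of already-seen pages.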
Like most crawlers, Dragonbot is bound by several limitations that can prevent it from crawling a site in its entirety, some of which are listed below. See "Why doesn't Dragonbot find all the pages in my site?" for more details.
- Dragonbot crawls a maximum of 500 links on each page
- Dragonbot downloads a maximum of 500 KB on each page
- Dragonbot cannot crawl orphan pages (pages with no inbound links)
- Dragonbot cannot access content behind a login or pages that require cookies
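The two per-page caps above could be applied as in the sketch below. It rests on assumptions the article does not confirm: that content past the size cap is truncated rather than the page being skipped, that the first 500 links in document order are the ones kept, and that 1 KB means 1024 bytes. The `apply_page_limits` name is hypothetical.

```python
MAX_LINKS_PER_PAGE = 500
MAX_BYTES_PER_PAGE = 500 * 1024  # assumes 1 KB = 1024 bytes

def apply_page_limits(body, links):
    """Truncate a page body and its link list to the per-page caps.

    Assumes truncation (not skipping) and document-order link selection.
    """
    return body[:MAX_BYTES_PER_PAGE], links[:MAX_LINKS_PER_PAGE]

body, links = apply_page_limits(b"x" * 600_000, [f"/p{i}" for i in range(600)])
print(len(body), len(links))  # 512000 500
```

Under these assumptions, a link that appears after the 500th link or beyond the size cap is never queued, which is one way a page can end up missing from a crawl.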