What is an alternate seed URL?

When crawling in Spider mode, our crawler starts at the website URL set for this campaign. It follows any redirects and recursively crawls the links found on each page.

There may be times when you would like the crawler to start from other URLs as well. You can do this by entering one or more Alternate Seed URLs, which are treated as additional starting points for the crawler. Each Alternate Seed URL is treated just like the website URL for this campaign and is assigned a crawl depth of 0.

When to use alternate seed URLs

Most sites will not need alternate seed URLs, but they can be very helpful in a variety of situations.

Deep crawling large sites

Our crawler (Dragonbot) is a breadth-first crawler, meaning it will try to crawl pages closest to the starting page first before crawling pages linked deeper in the site.

Because of this, on very large sites or those with a very deep internal linking architecture, the crawl may not reach every URL on the site.

To make sure URLs at all depths are crawled, submit URLs from each depth as alternate seed URLs. The crawl then starts from pages deep in the site as well as from pages close to the home page.
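The effect of extra seeds on a breadth-first crawl can be illustrated with a small sketch. This is not Dragonbot's actual implementation: the `SITE` link graph, the `breadth_first_crawl` function, and the `max_pages` budget are all hypothetical stand-ins used only to show why a deep page can be missed from a single starting point and reached when a deeper seed is added.

```python
from collections import deque

# Hypothetical link graph standing in for real page fetches: url -> linked urls.
HOME = "https://www.example.com/"
SITE = {
    HOME: [HOME + "a"],
    HOME + "a": [HOME + "b"],
    HOME + "b": [HOME + "c"],
    HOME + "c": [HOME + "d"],
    HOME + "d": [],
}

def breadth_first_crawl(seeds, links, max_pages):
    """Return {url: depth} for at most max_pages URLs, visited breadth-first.

    Every seed starts at depth 0. max_pages is an assumed crawl budget,
    not a documented crawler setting.
    """
    depths = {}
    queue = deque()
    for seed in seeds:
        if seed not in depths and len(depths) < max_pages:
            depths[seed] = 0
            queue.append(seed)
    while queue:
        url = queue.popleft()
        for target in links.get(url, []):
            if target in depths:
                continue  # already discovered at the same or a shallower depth
            if len(depths) >= max_pages:
                return depths  # budget exhausted before reaching deeper pages
            depths[target] = depths[url] + 1
            queue.append(target)
    return depths

home_only = breadth_first_crawl([HOME], SITE, max_pages=4)
with_seed = breadth_first_crawl([HOME, HOME + "c"], SITE, max_pages=4)
print(HOME + "d" in home_only)  # deepest page is not reached from the home page alone
print(with_seed[HOME + "d"])    # depth of the deepest page when "c" is also a seed
```

With the same budget, the second run reaches the deepest page because the extra seed places the crawler one link away from it.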

Disconnected areas of a site

It's possible that some areas of the site are not well connected by internal links and therefore may not be included in the crawl.

For example, suppose all subdomains are included in the crawl, but crawling begins from the www subdomain. If there are no links from the www subdomain to the "docs" subdomain, the docs subdomain will never be crawled.

Therefore, adding an alternate seed URL for the docs subdomain will ensure this area of the site is included in the crawl.

No content exists at campaign website URL

If the campaign website URL does not have any content, and does not properly redirect to a page that has links, the crawl will stop after the first page.

For example, suppose the campaign website is entered as "example.com" to include all subdomains in the crawl scope. The home page is https://www.example.com/, but no content exists on URLs without a subdomain (e.g. https://example.com/).

In this situation, our crawler will begin crawling from the campaign website "example.com", which will be interpreted as https://example.com/. This URL returns an HTTP 404 error, and since it's the first and only URL in the crawl so far, the crawl stops and fails after a single URL.

By setting an alternate seed URL as https://www.example.com/, we can keep the campaign URL as "example.com" and still crawl the site successfully.
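As a rough illustration of this scenario, the sketch below shows how a bare campaign website could be turned into a starting URL and how a seed on a subdomain stays within scope. The function names `campaign_start_url` and `in_scope` are hypothetical, and the exact matching rules the crawler applies are an assumption here; only the "example.com" to https://example.com/ interpretation and the inclusion of all subdomains come from the description above.

```python
from urllib.parse import urlsplit

def campaign_start_url(campaign_website):
    """Interpret a bare campaign website such as "example.com" as a start URL."""
    if "://" in campaign_website:
        return campaign_website
    return f"https://{campaign_website}/"

def in_scope(url, campaign_website):
    """Assumed scope rule: the host is the campaign domain or any subdomain of it."""
    host = (urlsplit(url).hostname or "").lower()
    domain = campaign_website.lower()
    return host == domain or host.endswith("." + domain)

print(campaign_start_url("example.com"))
print(in_scope("https://www.example.com/", "example.com"))
```

Under these assumptions, https://www.example.com/ is a valid seed for a campaign entered as "example.com", even though the bare-domain start URL itself has no content.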

How to add alternate seed URLs

Alternate seed URLs can be added or updated on the Crawler Settings page, under Campaign Settings in the left navigation.

Enter alternate seed URLs, one per line, in the text box at the bottom of the page. Note: A maximum of 30 alternate seed URLs can be entered.

Click Save changes when you are finished.

By default, the new settings will take effect with the next scheduled crawl. If you wish to crawl immediately, you can initiate a manual crawl.
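If you maintain your seed list outside the app, it may help to check it before pasting it into the text box. The sketch below is one possible pre-check, not part of the product: `parse_seed_urls` is a hypothetical helper, and the duplicate removal and the requirement for absolute http(s) URLs are assumptions of this sketch. Only the one-per-line format and the limit of 30 come from the steps above.

```python
from urllib.parse import urlsplit

MAX_SEED_URLS = 30  # maximum stated in this article

def parse_seed_urls(text):
    """Split pasted text into a clean list of seed URLs, one per line."""
    seeds = []
    for line in text.splitlines():
        url = line.strip()
        if not url or url in seeds:
            continue  # skip blank lines and exact duplicates (assumed behavior)
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not an absolute http(s) URL: {url!r}")
        seeds.append(url)
    if len(seeds) > MAX_SEED_URLS:
        raise ValueError(f"{len(seeds)} seed URLs given; the maximum is {MAX_SEED_URLS}")
    return seeds

pasted = "https://www.example.com/\n\nhttps://docs.example.com/\n"
print(parse_seed_urls(pasted))
```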
