This page explains in more detail how the crawler extracts content from your page every 24 hours, and how it ranks the results.
Each crawl starts at the start_urls value specified in your config. The crawler
reads those pages, then recursively extracts and follows every link in them
until it has browsed every compliant page.
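The traversal described above can be sketched as a breadth-first walk over links. This is only an illustration under stated assumptions, not the crawler's actual code: the `SITE` link graph, the `is_compliant` prefix rule, and the `crawl` function are all hypothetical stand-ins, and no network requests are made.

```python
from collections import deque

# Hypothetical in-memory site: each URL maps to the links found on that page.
SITE = {
    "https://example.com/docs/": [
        "https://example.com/docs/a",
        "https://example.com/docs/b",
    ],
    "https://example.com/docs/a": [
        "https://example.com/docs/b",
        "https://example.com/blog/",
    ],
    "https://example.com/docs/b": ["https://example.com/docs/"],
    "https://example.com/blog/": [],
}


def is_compliant(url, allowed_prefix):
    # Assumed rule for this sketch: a page is compliant if it lives under the prefix.
    return url.startswith(allowed_prefix)


def crawl(start_urls, allowed_prefix):
    """Return compliant URLs in the order a breadth-first crawl visits them."""
    seen = set(start_urls)      # URLs already queued, so each is read only once
    queue = deque(start_urls)
    visited = []
    while queue:
        url = queue.popleft()
        if not is_compliant(url, allowed_prefix):
            continue            # skip the page and do not follow its links
        visited.append(url)
        for link in SITE.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


pages = crawl(["https://example.com/docs/"], "https://example.com/docs/")
```

Here the blog page is discovered through a link but never read, because it fails the compliance check, while the cycle between the docs pages is handled by the `seen` set.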
If you have explicitly defined a sitemap.xml, our crawler will scrape every
provided and compliant page. We recommend using a sitemap since it
explicitly exposes the URLs to crawl and avoids missing pages that aren't linked from
Building records with the scraper is fairly intuitive. Based on your settings, we extract the payload of your web page and index it, preserving your data structure. The process works in three steps:
- We read your web page top down, following the HTML flow, and pick out the
matching elements according to their levels based on the
- We create a record for each paragraph along with its hierarchical path. The path is built from the order in which elements appear along the flow.
- We index these records with the appropriate global settings (e.g. metadata, tags).
Note: The process above runs sanity checks as it scrapes to detect errors. If it raises any serious warning, the crawl aborts and does not overwrite your current index. These checks ensure that your dedicated index isn't flushed.
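The record-building steps can be sketched as follows. This is a simplified illustration, not the scraper's implementation: the `build_records` function, the `(level, text)` input shape, and the record field names are assumptions made for the example, and real records carry more attributes.

```python
LEVELS = ["lvl0", "lvl1", "lvl2"]


def build_records(elements):
    """Build one record per paragraph from (level, text) pairs in document order.

    A level is either a heading level ("lvl0", "lvl1", "lvl2") or "text"
    for a paragraph.
    """
    current = {level: None for level in LEVELS}
    records = []
    for position, (level, text) in enumerate(elements):
        if level in current:
            # A new heading replaces its own level and clears the deeper ones.
            current[level] = text
            for deeper in LEVELS[LEVELS.index(level) + 1:]:
                current[deeper] = None
        else:
            # A paragraph becomes a record carrying the path seen so far.
            records.append(
                {"hierarchy": dict(current), "content": text, "position": position}
            )
    return records


elements = [
    ("lvl0", "Guides"),
    ("lvl1", "Install"),
    ("text", "Run the installer."),
    ("lvl1", "Usage"),
    ("text", "Call the API."),
]
records = build_records(elements)
```

Each paragraph inherits the headings that precede it in the flow, so the second record keeps `Guides` as its `lvl0` but picks up `Usage` as its `lvl1`.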
Algolia always returns the most relevant results first, using a tie-breaking
approach. DocSearch first searches for exact matches of your keywords,
and then falls back to partial matches. It then sorts those results by
page hierarchy, as extracted from the
The default strategy is to promote records having matching words in the highest
level first. Thus if two results have the same matching words, the one having
them in the highest level (lvl0) will be ranked higher. We also use the
position of the matching words. The sooner they appear within the HTML flow, the
higher the record will be ranked.
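As a rough illustration of this default strategy, the sketch below orders matching records first by the highest level that contains the search word, then by position in the HTML flow. It is a toy model under stated assumptions: the flat record shape, `match_level`, and `rank` are hypothetical, and Algolia's real ranking applies more criteria than these two.

```python
LEVELS = ["lvl0", "lvl1", "lvl2", "content"]


def match_level(record, word):
    """Return the index of the highest level containing word, or None."""
    word = word.lower()
    for index, level in enumerate(LEVELS):
        text = record.get(level) or ""
        if word in text.lower():
            return index
    return None


def rank(records, word):
    """Sort matching records by match level, then by position in the flow."""
    scored = [(match_level(record, word), record) for record in records]
    matching = [(level, record) for level, record in scored if level is not None]
    matching.sort(key=lambda pair: (pair[0], pair[1]["position"]))
    return [record for _, record in matching]


records = [
    {"lvl0": "Guides", "lvl1": "Setup", "content": "Install the search widget.", "position": 3},
    {"lvl0": "Search", "lvl1": "Overview", "content": "How it works.", "position": 7},
    {"lvl0": "Guides", "lvl1": "Search tips", "content": "Use filters.", "position": 5},
    {"lvl0": "Guides", "lvl1": "Setup", "content": "Nothing relevant.", "position": 1},
    {"lvl0": "Guides", "lvl1": "Search API", "content": "Endpoints.", "position": 2},
]
ordered_positions = [record["position"] for record in rank(records, "search")]
```

The record matching in lvl0 comes first even though it appears late in the flow. The two lvl1 matches are then ordered by position, and the content-only match comes last.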
We base relevancy on several factors and customize it according to the Algolia tie-breaking method.
You can boost pages depending on their URLs by using the page_rank attribute.
Its value is numeric and defaults to 0. The higher the value, the higher
results from the matching pages are ranked. For example, all pages with a
page_rank of 5 will be returned before pages with a page_rank of 1.
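A minimal sketch of the boosting effect, written as a Python dict rather than an actual config file. It assumes that page_rank is attached to entries of start_urls and that a page takes the rank of the longest matching URL prefix; both the config shape and the `page_rank_for` helper are assumptions made for illustration.

```python
# Assumed config shape: a start URL is either a plain string or a dict
# with a "url" and an optional "page_rank".
config = {
    "start_urls": [
        {"url": "https://example.com/docs/concepts/", "page_rank": 5},
        {"url": "https://example.com/docs/contributors/", "page_rank": 1},
        "https://example.com/docs/",
    ]
}


def page_rank_for(url, start_urls):
    """Return the page_rank of the longest start URL prefix matching url."""
    best_prefix, best_rank = "", 0
    for entry in start_urls:
        if isinstance(entry, str):
            prefix, rank = entry, 0  # page_rank defaults to 0
        else:
            prefix, rank = entry["url"], entry.get("page_rank", 0)
        if url.startswith(prefix) and len(prefix) > len(best_prefix):
            best_prefix, best_rank = prefix, rank
    return best_rank


hits = [
    "https://example.com/docs/faq",
    "https://example.com/docs/contributors/guide",
    "https://example.com/docs/concepts/index",
]
# Higher page_rank first; sorted() is stable, so ties keep their input order.
ordered = sorted(hits, key=lambda url: -page_rank_for(url, config["start_urls"]))
```

With these values the concepts page (page_rank 5) is returned before the contributors page (page_rank 1), and the FAQ page falls back to the default of 0.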
You can even change the relevancy strategy by overriding the default
customRanking used by the index through the
custom_settings option of