For each DocSearch request we receive, we create a custom JSON configuration file that defines how the crawler should behave. You can find all the configs in this repository.
A DocSearch configuration looks like this:
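Here is a minimal sketch (the index name, URLs, and selectors are placeholders):

```json
{
  "index_name": "example",
  "start_urls": ["https://www.example.com/docs/"],
  "selectors": {
    "lvl0": ".docs h1",
    "lvl1": ".docs h2",
    "lvl2": ".docs h3",
    "text": ".docs p, .docs li"
  }
}
```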
The index_name is the name of the Algolia index where your records will be pushed. The apiKey we will share with you is restricted to work with this index.
When using the free DocSearch crawler, the index_name will always be the name of the configuration file. If you're running DocSearch yourself, you can use any name you'd like.
When the DocSearch scraper runs, it builds a temporary index. Once scraping is complete, it moves that index to the name specified by index_name (overwriting the existing index).
By default, the name of the temporary index is the value of index_name suffixed with _tmp. To use a different name, set the INDEX_NAME_TMP environment variable to a different value. This variable can be set in the .env file alongside your configuration.
The start_urls array contains the list of URLs that will be used to start crawling your website. The crawler will recursively follow any links (<a> tags) from those pages. It will not follow links that are on another domain, and it will never follow links matched by stop_urls.
Using selectors_key to tailor your selectors
You can define finer sets of selectors depending on the URL. To do so, add a selectors_key attribute to your start_urls items and group the matching selectors under that same key.
To find the right subset to use based on the URL, the scraper iterates over the start_urls items. Only the first one to match is applied.
Considering the URL http://www.example.com/en/api/ with the following configuration:
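For illustration, the selector bodies below are placeholders:

```json
{
  "start_urls": [
    {
      "url": "http://www.example.com/en/",
      "selectors_key": "doc"
    },
    {
      "url": "http://www.example.com/en/api/",
      "selectors_key": "api"
    }
  ],
  "selectors": {
    "default": {
      "lvl0": "h1",
      "text": "p"
    },
    "doc": {
      "lvl0": ".doc h1",
      "text": ".doc p"
    },
    "api": {
      "lvl0": ".api h1",
      "text": ".api p"
    }
  }
}
```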
Only the set of selectors related to doc will be applied to the URL, because the more generic item matches first. The correct configuration should be built the other way around (most specific URLs first, most generic last). If the matching start_urls item has no selectors_key defined, the default set will be used. Do not forget to set this fallback set of selectors.
Using regular expressions
The start_urls and stop_urls options also enable you to use regular expressions to express more complex patterns. In that case, the item must be an object containing at least a url key targeting a reachable page.
You can also define a
variables key that will be injected into your specific
URL pattern. The following example makes this variable feature clearer:
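A sketch of such a configuration (URLs and values are illustrative):

```json
{
  "start_urls": [
    {
      "url": "http://www.example.com/docs/(?P<lang>.*?)/(?P<version>.*?)/",
      "variables": {
        "lang": ["en", "fr"],
        "version": ["latest", "3.3", "3.2"]
      }
    }
  ]
}
```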
The beneficial side effect of using this syntax is that every record extracted from pages matching http://www.example.com/docs/en/latest will have the attributes lang: en and version: latest. It enables you to filter on these attributes at search time.
The following example shows how the UI filters results matching a specific language and version.
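For instance, assuming a DocSearch.js v2 frontend integration, facet filters could be passed through its algoliaOptions (a sketch):

```json
{
  "algoliaOptions": {
    "facetFilters": ["lang:en", "version:latest"]
  }
}
```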
Using custom tags
You can also apply custom tags to some pages without the need to use regular
expressions. In that case, add the list of tags to the
tags key. Note that
those tags will be automatically added as facets in Algolia, allowing you to
filter based on their values as well.
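For example (the URL and tag values are illustrative):

```json
{
  "start_urls": [
    {
      "url": "http://www.example.com/docs/concepts/",
      "tags": ["concepts", "terminology"]
    }
  ]
}
```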
You can then filter on those tags from your frontend search:
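Again assuming the DocSearch.js v2 algoliaOptions format (a sketch):

```json
{
  "algoliaOptions": {
    "facetFilters": ["tags:concepts"]
  }
}
```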
Using Page Rank
To give more weight to some pages, you can set a page_rank attribute on the matching start_urls item. This parameter helps to boost records built from the page: pages with a higher page_rank will be returned before pages with a lower page_rank. Note that you can pass any numeric value, including negative values.
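For example (URLs and values are illustrative):

```json
{
  "start_urls": [
    {
      "url": "http://www.example.com/docs/concepts/",
      "page_rank": 5
    },
    {
      "url": "http://www.example.com/docs/contributors/",
      "page_rank": 1
    }
  ]
}
```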
In this example, records built from the Concepts page will be ranked higher than results extracted from the Contributors page.
Using custom selectors per page
If the markup of your website is so different from one page to another that you can't have generic selectors, you can namespace your selectors and specify which set of selectors should be applied to specific pages.
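For example (URLs and selectors are placeholders):

```json
{
  "start_urls": [
    {
      "url": "http://www.example.com/docs/concepts/",
      "selectors_key": "concepts"
    },
    {
      "url": "http://www.example.com/docs/contributors/",
      "selectors_key": "contributors"
    },
    "http://www.example.com/docs/"
  ],
  "selectors": {
    "default": {
      "lvl0": ".main h1",
      "lvl1": ".main h2",
      "text": ".main p"
    },
    "concepts": {
      "lvl0": ".concepts h1",
      "lvl1": ".concepts h2",
      "text": ".concepts p, .concepts li"
    },
    "contributors": {
      "lvl0": ".contributors h1",
      "text": ".contributors p"
    }
  }
}
```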
Here, all documentation pages will use the selectors defined in selectors.default, while the pages under ./concepts will use selectors.concepts and those under ./contributors will use selectors.contributors.
The selectors object contains all the CSS selectors that will be used to create the record hierarchy. It can contain up to 6 levels (lvl0 to lvl5) and text.
A typical configuration targets the page title as lvl0, section headings as lvl1 and lvl2, and paragraphs as text, but this is highly dependent on the markup of each website. The text key is mandatory, but we highly recommend also setting lvl0, lvl1, and lvl2 to have a decent depth of relevance.
Selectors can be passed as strings, or as objects containing at least a selector key. Other special keys can be set, as documented below.
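A typical sketch (the selectors are illustrative; lvl0 is shown in the object form with a selector key):

```json
{
  "selectors": {
    "lvl0": {
      "selector": ".docs h1"
    },
    "lvl1": ".docs h2",
    "lvl2": ".docs h3",
    "text": ".docs p, .docs li"
  }
}
```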
Using global selectors
The default way of extracting content through selectors is to read the HTML markup from top to bottom. This works well with semi-structured content, like a hierarchy of headers. It breaks when the relevant information is not part of the same flow, for example when the title sits in a header or a sidebar rather than in the main content.
For that reason, you can set a selector as global, meaning that it will match on the whole page and will be the same for all records extracted from this page.
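For example, to take the lvl0 value from the active sidebar entry on every record of the page (selectors are illustrative):

```json
{
  "selectors": {
    "lvl0": {
      "selector": ".sidebar .active",
      "global": true
    },
    "lvl1": ".content h1",
    "text": ".content p"
  }
}
```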
We do not recommend setting text selectors as global.
Setting a default value
If a selector doesn't match a valid element on the page, you can define a
default_value as a fallback.
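For example (the selector and the fallback value are illustrative):

```json
{
  "selectors": {
    "lvl0": {
      "selector": ".main h1",
      "default_value": "Documentation"
    },
    "text": ".main p"
  }
}
```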
Removing unnecessary characters
Some documentation adds special characters to headings. Those characters have a stylistic value but no meaning and shouldn't be indexed in the records.
You can define a list of characters you want to exclude from the final indexed value by setting the strip_chars key.
Note that you can also define
strip_chars directly at the root of the
configuration and it will be applied to all selectors.
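For example, stripping a few decorative characters from one heading selector and stripping punctuation globally (values are illustrative):

```json
{
  "selectors": {
    "lvl0": {
      "selector": ".main h1",
      "strip_chars": "#§›"
    },
    "text": ".main p"
  },
  "strip_chars": " .,;:"
}
```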
Targeting elements using XPath instead of CSS
CSS selectors are a clear and concise way to target elements of a page, but they have limitations. For example, you cannot go up the cascade with CSS.
If you need a more powerful selector mechanism, you can write your selectors using XPath by setting the selector type to xpath.
The following example will look for a li.chapter.active.done and then go up two levels in the DOM until it finds an a. The content of this a will then be used as the value of the lvl0 selector.
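A sketch of such a selector (the exact XPath expression is illustrative):

```json
{
  "selectors": {
    "lvl0": {
      "selector": "//li[@class='chapter active done']/../../a",
      "type": "xpath"
    },
    "text": ".content p"
  }
}
```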
XPath selectors can be hard to read. We highly encourage you to test them in your browser first, making sure they match what you're expecting.
The custom_settings key can be used to overwrite your Algolia index settings. We don't recommend changing it, as the default settings are meant to work for all websites.
One use case would be to configure the separatorsToIndex setting. By default, Algolia will consider all special characters as word separators. In some contexts, like for method names, you might want characters such as # to keep their meaning.
Check the Algolia documentation for more information about these settings.
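For example, to keep # searchable in method names (a sketch):

```json
{
  "custom_settings": {
    "separatorsToIndex": "#"
  }
}
```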
custom_settings can also include a synonyms key, which is an array of synonyms. Each element is an array of one-word synonyms; these words are interchangeable.
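For example (the synonym pairs are illustrative):

```json
{
  "custom_settings": {
    "synonyms": [
      ["js", "javascript"],
      ["es6", "ecmascript"]
    ]
  }
}
```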
Note that you can use advanced synonyms with Algolia, but our scraper only supports regular one-word synonyms.
By default, the crawler will extract content from the pages defined in start_urls. If you do not have any valuable content on your start_urls pages, or if they are duplicates of other pages, you should set this to false.
This expects an array of CSS selectors. Any element matching one of those selectors will be removed from the page before any data is extracted from it.
This can be used to remove a table of contents, a sidebar, or a footer, to make other selectors easier to write.
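In the legacy scraper configuration, this option is selectors_exclude; for example (the selectors are illustrative):

```json
{
  "selectors_exclude": [".table-of-contents", ".sidebar", "footer"]
}
```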
stop_urls is an array of strings or regular expressions. Whenever the crawler is about to visit a link, it first checks whether the link matches something in the array; if it does, it will not follow the link. This should be used to restrict the pages the crawler visits.
Note that this is often used to avoid duplicate content, for example by adding http://www.example.com/docs/index.html to stop_urls if you already have http://www.example.com/docs/ as a start_urls entry.
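For example (URLs are illustrative):

```json
{
  "start_urls": ["http://www.example.com/docs/"],
  "stop_urls": ["http://www.example.com/docs/index.html", "/license/"]
}
```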
The default value of min_indexed_level is 0. By increasing it, you can choose not to index some records if they don't have enough lvlX matching. For example, with min_indexed_level: 2, the scraper only indexes records having at least lvl0, lvl1, and lvl2 set. You can find out more details about this strategy in this section.
This is useful when your documentation has pages that share the same lvl0 and lvl1, for example. In that case, you don't want to index all the shared records, but you do want to keep the content different across pages.
If only_content_level is set to true, then the crawler won't create records for the lvlX selectors. If used, min_indexed_level is ignored.
nb_hits holds the number of records that were extracted and indexed by DocSearch. We check this key internally to keep track of any unintended spike or drop that could reveal a misconfiguration.
nb_hits is updated automatically each time you run DocSearch on your config.
If the terminal is a TTY (an interactive session), DocSearch will prompt you before updating the field. To avoid being prompted, set the UPDATE_NB_HITS environment variable to true (to enable the update) or false (to disable it). This variable can be set in the .env file alongside your configuration.
You don't have to edit this field. We're documenting it here in case you were wondering what it's all about.
If your website has a
sitemap.xml file, you can let DocSearch know and it will
use it to define which pages to crawl.
You can pass an array of URLs pointing to your sitemap file(s). If this value is set, DocSearch will try to read URLs from your sitemap(s) instead of following every link of your start_urls. You must explicitly define this parameter; our scraper doesn't discover sitemaps on its own.
Sitemaps can contain alternative links for URLs. Those are other versions of the same page, in a different language, or with a different URL. By default DocSearch will ignore those URLs.
Set sitemap_alternate_links to true if you want those other versions to be crawled as well.
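For example (the sitemap URL is illustrative):

```json
{
  "sitemap_urls": ["http://www.example.com/sitemap.xml"],
  "sitemap_alternate_links": true
}
```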
With the above configuration and a sitemap.xml that declares http://www.example.com/docs/de/ as an alternate of http://www.example.com/docs/, both URLs will be crawled.
By default DocSearch expects websites to have server-side rendering, meaning that HTML source is returned directly by the server. If your content is generated by the front-end, you have to tell DocSearch to emulate a browser through Selenium.
As client-side crawl is way slower than server-side crawl, we highly encourage you to update your website to enable server-side rendering.
Set js_render to true if your website requires client-side rendering. This will make DocSearch spawn a Selenium proxy to fetch all your web pages.
If your website is slow to load, you can use js_wait to tell DocSearch to wait a specific amount of time (in seconds) for the page to load before extracting its content.
Note that this option might have a large impact on the time required to crawl your website and we would encourage you to enable server-side rendering on your website instead.
This option has no impact if js_render is set to false.
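For example, enabling browser emulation and waiting two seconds per page (values are illustrative):

```json
{
  "js_render": true,
  "js_wait": 2
}
```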
Websites using client-side rendering often don't use full URLs, but instead take advantage of the URL hash (the part after the #). If your website is using such URLs, you should set use_anchors to true for DocSearch to index all your content.
You can override the user agent used to crawl your website with the user_agent option; by default, the scraper identifies itself with its own DocSearch user agent. However, if the crawl of your website requires browser emulation (i.e. js_render set to true), a browser-like user agent is used instead.
To override it, set the user_agent key in the configuration.
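For example (the value is illustrative):

```json
{
  "user_agent": "Mozilla/5.0 (compatible; CustomDocSearchBot)"
}
```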