Version: Stable (v4.x)

Migrating from the legacy scraper

Introduction

With the new version of the DocSearch UI, we wanted to go further: better tooling for creating and maintaining your config file, plus some extra Algolia features that you have been requesting for a long time!

What's new?

Scraper

The DocSearch infrastructure now leverages the Algolia Crawler. We've teamed up with our friends and created a new DocSearch helper that extracts records the same way our beloved DocSearch scraper did!

The best part is that you no longer need to install any tooling on your side if you want to maintain or update your index!

We now provide a web interface (legacy or new) that allows you to:

Start, schedule and monitor your crawls
Edit your config file from our live editor
Test your results directly with DocSearch v3 or DocSearch v4

Algolia application and credentials

We've received a lot of requests asking for:

A way to manage team members
A way to browse and see how Algolia records are indexed
A way to see and subscribe to other Algolia features

They are now all available, in your own Algolia application, for free :D

FAQ

You can find answers related to the DocSearch migration in our Crawler FAQ page.

Useful links

Config file key mapping

Below are the keys that can be found in the legacy DocSearch configs and their translation to an Algolia Crawler config. For more detailed information on the Algolia Crawler, see the official documentation.

Legacy key	Crawler key	Description
`start_urls`	`startUrls`	Now accepts URLs only; see `helpers.docsearch` to handle custom variables
`page_rank`	`pageRank`	Can be added to the `recordProps` in `helpers.docsearch`; must be passed as a string
`js_render`	`renderJavaScript`	Unchanged
`js_wait`	`renderJavaScript.waitTime`	See the documentation of `renderJavaScript`
`index_name`	removed, see `actions`	Handled directly in the `actions`
`sitemap_urls`	`sitemaps`	Unchanged
`stop_urls`	`exclusionPatterns`	Supports `micromatch`
`selectors_exclude`	removed	Should be handled in the `recordExtractor` and `helpers.docsearch`
`custom_settings`	`initialIndexSettings`	Unchanged
`scrape_start_urls`	removed	Can be handled with `exclusionPatterns`
`strip_chars`	removed	`#` is removed automatically from anchor links; edge cases should be handled in the `recordExtractor` and `helpers.docsearch`
`conversation_id`	removed	Not needed anymore
`nb_hits`	removed	Not needed anymore
`sitemap_alternate_links`	removed	Not needed anymore
`stop_content`	removed	Should be handled in the `recordExtractor` and `helpers.docsearch`
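
The mapping above can be sketched as a small translation function. This is only an illustration, not an official migration tool: `migrateLegacyConfig`, `LegacyConfig` and `CrawlerConfigSketch` are hypothetical names, and the output covers only the top-level keys from the table. It assumes that `initialIndexSettings` is keyed by index name and that each entry in `actions` carries an `indexName`. Keys that need manual work (`js_wait`, `page_rank`, and anything that belongs in the `recordExtractor`) are deliberately left out.

```typescript
// Hypothetical sketch: translate the directly mappable legacy keys and
// report which removed keys were present in the legacy config.

type LegacyStartUrl = string | { url: string };

interface LegacyConfig {
  index_name: string;
  start_urls: LegacyStartUrl[];
  sitemap_urls?: string[];
  stop_urls?: string[];
  js_render?: boolean;
  custom_settings?: Record<string, unknown>;
  [key: string]: unknown; // any other legacy key
}

interface CrawlerConfigSketch {
  startUrls: string[];
  sitemaps?: string[];
  exclusionPatterns?: string[];
  renderJavaScript?: boolean;
  // Assumption: settings are grouped by index name.
  initialIndexSettings?: Record<string, Record<string, unknown>>;
  // pathsToMatch and recordExtractor still have to be written by hand.
  actions: { indexName: string }[];
}

// Keys listed as "removed" in the table above.
const REMOVED_KEYS: string[] = [
  "selectors_exclude",
  "scrape_start_urls",
  "strip_chars",
  "conversation_id",
  "nb_hits",
  "sitemap_alternate_links",
  "stop_content",
];

function migrateLegacyConfig(legacy: LegacyConfig): {
  config: CrawlerConfigSketch;
  removed: string[];
} {
  const config: CrawlerConfigSketch = {
    // startUrls accepts URLs only, so object entries are reduced to their URL.
    startUrls: legacy.start_urls.map((entry) =>
      typeof entry === "string" ? entry : entry.url
    ),
    // index_name no longer exists at the top level; it moves into the actions.
    actions: [{ indexName: legacy.index_name }],
  };

  if (legacy.sitemap_urls) config.sitemaps = [...legacy.sitemap_urls];
  // Copied as-is; review each pattern, since exclusionPatterns uses micromatch.
  if (legacy.stop_urls) config.exclusionPatterns = [...legacy.stop_urls];
  if (legacy.js_render !== undefined) config.renderJavaScript = legacy.js_render;
  if (legacy.custom_settings) {
    config.initialIndexSettings = { [legacy.index_name]: legacy.custom_settings };
  }

  // Removed keys found in the legacy config, so they can be reviewed by hand.
  const removed = REMOVED_KEYS.filter((key) => key in legacy);
  return { config, removed };
}

// Example usage with made-up values.
const example = migrateLegacyConfig({
  index_name: "example",
  start_urls: ["https://example.com/docs/", { url: "https://example.com/blog/" }],
  stop_urls: ["https://example.com/docs/changelog/**"],
  js_render: true,
  nb_hits: 1234,
});
console.log(JSON.stringify(example, null, 2));
```

Returning the `removed` list alongside the config keeps the translation honest: nothing is silently discarded, and each reported key can be checked against the table to see whether it needs handling in the `recordExtractor` and `helpers.docsearch`.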
