Migrating from the legacy scraper
Introduction
With the new version of the DocSearch UI, we wanted to go further and provide better tooling for you to create and maintain your config file, along with some extra Algolia features that you have been requesting for a long time!
What's new?
Scraper
The DocSearch infrastructure now leverages the Algolia Crawler. We've teamed up with our friends and created a new DocSearch helper that extracts records the same way our beloved DocSearch scraper did!
The best part is that you no longer need to install any tooling on your side if you want to maintain or update your index!
We now provide a web interface (legacy or new) that will allow you to:
- Start, schedule and monitor your crawls
- Edit your config file from our live editor
- Test your results directly with DocSearch v3 or DocSearch v4
Algolia application and credentials
We've received a lot of requests asking for ways to:
- Manage team members
- Browse and see how Algolia records are indexed
- See and subscribe to other Algolia features
They are now all available, in your own Algolia application, for free :D
FAQ
You can find answers related to the DocSearch migration in our Crawler FAQ page.
Useful links
Config file key mapping
Below are the keys that can be found in the legacy DocSearch configs and their translation to an Algolia Crawler config. More detailed documentation of the Algolia Crawler can be found in the official documentation.
legacy | current | description |
---|---|---|
start_urls | startUrls | Now accepts URLs only; see helpers.docsearch to handle custom variables |
page_rank | pageRank | Can be added to the recordProps in helpers.docsearch; should be passed as a string |
js_render | renderJavaScript | Unchanged |
js_wait | renderJavaScript.waitTime | See documentation of renderJavaScript |
index_name | removed, see actions | Handled directly in the actions |
sitemap_urls | sitemaps | Unchanged |
stop_urls | exclusionPatterns | Supports micromatch |
selectors_exclude | removed | Should be handled in the recordExtractor and helpers.docsearch |
custom_settings | initialIndexSettings | Unchanged |
scrape_start_urls | removed | Can be handled with exclusionPatterns |
strip_chars | removed | # is removed automatically from anchor links; edge cases should be handled in the recordExtractor and helpers.docsearch |
conversation_id | removed | Not needed anymore |
nb_hits | removed | Not needed anymore |
sitemap_alternate_links | removed | Not needed anymore |
stop_content | removed | Should be handled in the recordExtractor and helpers.docsearch |
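
To make the mapping above more concrete, here is a minimal TypeScript sketch that turns a legacy config object into a draft of the matching Crawler keys. It is not an official migration tool: the function name `draftCrawlerConfig`, the `LegacyConfig` and `CrawlerConfigDraft` types, and the `pageRankByUrl` output are all hypothetical names introduced for illustration. It also assumes that `initialIndexSettings` is keyed by index name and that legacy `start_urls` entries are either plain strings or objects with `url` and `page_rank`; check both against the Algolia Crawler documentation before relying on them.

```typescript
// Hypothetical shapes for illustration: only keys named in the table are modelled.
interface LegacyConfig {
  index_name: string;
  start_urls: Array<string | { url: string; page_rank?: number }>;
  sitemap_urls?: string[];
  stop_urls?: string[];
  js_render?: boolean;
  custom_settings?: Record<string, unknown>;
  // Keys listed as "removed" (nb_hits, conversation_id, ...) are accepted and ignored.
  [removedKey: string]: unknown;
}

interface CrawlerConfigDraft {
  startUrls: string[];
  sitemaps: string[];
  exclusionPatterns: string[];
  renderJavaScript: boolean;
  initialIndexSettings: Record<string, Record<string, unknown>>;
  // A real action also needs pathsToMatch and a recordExtractor calling
  // helpers.docsearch; those have to be written by hand.
  actions: Array<{ indexName: string }>;
}

function draftCrawlerConfig(legacy: LegacyConfig): {
  config: CrawlerConfigDraft;
  pageRankByUrl: Record<string, string>;
} {
  const startUrls: string[] = [];
  const pageRankByUrl: Record<string, string> = {};

  for (const entry of legacy.start_urls) {
    // startUrls accepts URLs only, so object entries are reduced to their URL.
    const url = typeof entry === "string" ? entry : entry.url;
    startUrls.push(url);
    if (typeof entry !== "string" && entry.page_rank !== undefined) {
      // pageRank should be passed as a string in recordProps.
      pageRankByUrl[url] = String(entry.page_rank);
    }
  }

  return {
    config: {
      startUrls,
      sitemaps: legacy.sitemap_urls ?? [],
      // Copied verbatim; review each entry, since exclusionPatterns are
      // matched with micromatch and legacy patterns may need rewriting.
      exclusionPatterns: legacy.stop_urls ?? [],
      renderJavaScript: legacy.js_render ?? false,
      // Assumption: settings are grouped under the index name.
      initialIndexSettings: legacy.custom_settings
        ? { [legacy.index_name]: legacy.custom_settings }
        : {},
      // index_name moves into the actions.
      actions: [{ indexName: legacy.index_name }],
    },
    pageRankByUrl,
  };
}

// Usage with a made-up legacy config.
const draft = draftCrawlerConfig({
  index_name: "example",
  start_urls: [
    { url: "https://example.com/docs/", page_rank: 5 },
    "https://example.com/blog/",
  ],
  js_render: true,
  nb_hits: 1234, // removed key, ignored
});
console.log(JSON.stringify(draft, null, 2));
```

The draft only covers keys with a direct counterpart. Everything the table marks as handled in the recordExtractor and helpers.docsearch (selectors_exclude, stop_content, strip_chars edge cases) still has to be ported manually, and the values in `pageRankByUrl` are meant to be copied into the relevant `recordProps`.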