How to crawl documents?

https://to.tosdr.org/wiki-crawl

How to Crawl Dynamic and Static Documents


This guide is aimed at curators, since only curators can add and crawl documents.

When adding a service, you also need to add its documents.

With recent changes, the crawler now also supports dynamic documents, meaning pages that only render their HTML with JavaScript. The old crawler could not handle these at all.

How to Add a Document


  1. Go to a service of your choice. If the service does not have any documents yet, an Add Document link will appear.

    If the service already has documents, the link will read View Documents instead and take you to an overview of all documents. There, click on Add a document to continue.

  2. A form appears. Fill out the necessary information about the document:

    • Name: The name of the document — e.g. Privacy Policy
    • Url: The location of the document — e.g. https://example.com/privacy
    • XPath: The XPath (XML Path Language) expression that selects the document's text on the page. This is essential if the site renders dynamically; see the FAQ below.
  3. Click on Crawl Document and watch the magic unfold.
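
To make the XPath field more concrete, here is a minimal sketch of what an XPath expression selects from a page. It uses only Python's standard library, whose `xml.etree.ElementTree` module supports a limited subset of XPath and requires well-formed markup (hence the `.//` prefix, where a browser would accept `//`). The sample page, the `policy` id and the `extract_text` helper are made up for illustration; this is not how the crawler itself is implemented.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical, well-formed page used only for illustration.
PAGE = """
<html>
  <body>
    <nav>Home | About</nav>
    <div id="policy">
      <h1>Privacy Policy</h1>
      <p>We collect your email address.</p>
    </div>
    <footer>Copyright Example</footer>
  </body>
</html>
""".strip()


def extract_text(page: str, xpath: str) -> Optional[str]:
    """Return the whitespace-normalised text of the first element
    matching xpath, or None if nothing matches."""
    root = ET.fromstring(page)
    element = root.find(xpath)
    if element is None:
        return None
    # itertext() walks all nested text nodes; split/join collapses whitespace.
    return " ".join("".join(element.itertext()).split())


# Selects only the policy container, skipping the navigation and footer.
print(extract_text(PAGE, ".//div[@id='policy']"))
```

An XPath that points at the wrong element, or at nothing, is the usual reason the crawled text differs from what you expected.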

FAQ


I have added the document but the text is not the one I expected.

This happens if you either set the wrong XPath or set no XPath at all on a site that loads dynamically.

When an XPath is set, our crawler waits until that path has loaded before returning the document, which is necessary for dynamic documents.
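
The idea of waiting for an XPath can be sketched as a simple polling loop. The example below is only an illustration under stated assumptions: the `wait_for_xpath` and `make_fake_page` helpers, the snapshots and the timeout values are all hypothetical, the dynamic page is simulated with a list of markup snapshots instead of a real browser, and the actual crawler may work quite differently.

```python
import time
import xml.etree.ElementTree as ET
from typing import Callable, List, Optional


def wait_for_xpath(get_markup: Callable[[], str], xpath: str,
                   timeout: float = 5.0, interval: float = 0.1) -> Optional[str]:
    """Poll get_markup() until xpath matches, then return the matched text.
    Returns None if nothing matches before the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        root = ET.fromstring(get_markup())
        element = root.find(xpath)
        if element is not None:
            return " ".join("".join(element.itertext()).split())
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


def make_fake_page(snapshots: List[str]) -> Callable[[], str]:
    """Simulate a dynamic page: return each snapshot once, then keep
    returning the last one (the fully rendered state)."""
    remaining = list(snapshots)

    def get_markup() -> str:
        return remaining.pop(0) if len(remaining) > 1 else remaining[0]

    return get_markup


# Hypothetical render states: the policy text only appears in the third one.
SNAPSHOTS = [
    "<html><body><div id='app'/></body></html>",
    "<html><body><div id='app'>Loading</div></body></html>",
    "<html><body><div id='app'><div id='policy'>Our policy text</div></div></body></html>",
]

print(wait_for_xpath(make_fake_page(SNAPSHOTS), ".//div[@id='policy']",
                     timeout=2.0, interval=0.01))
```

Without an XPath to wait for, a crawler in this sketch would have nothing to tell the first snapshot (an empty `app` container) apart from the finished page, which is why dynamic documents come back with unexpected text.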

I have set everything correctly but the crawling fails.

This can have several causes. The most common one is that the XPath is simply invalid. A guide on how to retrieve the XPath can be found below.

Additionally, the website itself or Cloudflare may be blocking our crawler. If this happens, open a thread in the Forum or drop us an email so we can investigate it further.

How can I get the XPath?

Check out our Guide:
How to use XPath properly

Where can I test the XPath? / Phoenix Sandbox

You can either use browser extensions or, to see whether the crawler works as intended, use the sandbox. service-965 can be used as a sandbox, as it points to localhost. Test the documents there.
