ToS;DR Crawler

JustinBack
January 30
The ToS;DR Crawler is essential to the functionality of Phoenix.

By crawling a service we ensure that its documents are mirrored and cannot be altered until a subsequent crawl (integrity is verified using a CRC checksum).

We do not index websites on our own; every crawl is triggered manually by curators or staff on our site.

Identifying the ToS;DR Crawler

All ToS;DR Crawlers send an identifying user agent with every request.

Check for the following user agent:

ToSDRCrawler/1.0.0 (+https://to.tosdr.org/bot)
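If you want to handle crawler requests specially on your own site, the user agent above is all you need to match on. A minimal sketch (the function name and code are illustrative, not part of ToS;DR; only the user-agent string comes from this documentation):

```python
# Hypothetical helper for detecting the ToS;DR crawler from the
# User-Agent header of an incoming request. Matching on the
# "ToSDRCrawler/" prefix keeps the check stable across versions.
TOSDR_UA = "ToSDRCrawler/1.0.0 (+https://to.tosdr.org/bot)"

def is_tosdr_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent identifies the ToS;DR crawler."""
    return user_agent.strip().startswith("ToSDRCrawler/")

print(is_tosdr_crawler(TOSDR_UA))        # True
print(is_tosdr_crawler("Mozilla/5.0"))   # False
```

For stronger verification than a spoofable header, combine this with a reverse-DNS check against the crawler cluster listed below.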

robots.txt

If you want to forbid crawling for some reason, you can include the following directive in your robots.txt:

User-Agent: ToSDRCrawler
Disallow: YOUR_PATH
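You can sanity-check such a directive with Python's standard-library robots.txt parser before deploying it (user-agent matching is case-insensitive; `/legal` below is a hypothetical path, not something ToS;DR prescribes):

```python
# Sketch: confirm a robots.txt directive actually blocks the ToS;DR
# crawler, using the stdlib parser rather than guessing at syntax.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-Agent: ToSDRCrawler",
    "Disallow: /legal",   # hypothetical path you want to exclude
])

# The crawler's full UA is matched by its product token.
print(rp.can_fetch("ToSDRCrawler/1.0.0", "https://example.com/legal"))  # False
print(rp.can_fetch("ToSDRCrawler/1.0.0", "https://example.com/about"))  # True
```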


Crawler Clusters

| Crawler | Location | IP | rDNS | Useragent | DNS | Port | Notes | Internal IP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Atlas | :austria: - EU | 202.61.251.191 | :white_check_mark: (Partially) | :white_check_mark: | atlas.crawler.api.tosdr.org | 0.0.0.0:80:6874 | Ignores robots.txt | 10.0.0.7 |
| Arachne | :de: - EU | 45.136.28.177 | :white_check_mark: | :white_check_mark: | arachne.crawler.api.tosdr.org | 0.0.0.0:80:6874 | Ignores robots.txt | 10.0.0.6 |
| AvidReader | :de: - EU | 37.120.165.131 | :white_check_mark: | :white_check_mark: | avidreader.crawler.api.tosdr.org | 0.0.0.0:80:6874 | Ignores robots.txt | 10.0.0.1 |
| Floppy | :de: - EU | 37.120.177.70 | :white_check_mark: | :white_check_mark: | floppy.crawler.api.tosdr.org | 0.0.0.0:80:6874 | Ignores robots.txt | 10.0.0.2 |
| James | :de: - EU | 185.228.137.101 | :white_check_mark: | :white_check_mark: | james.crawler.api.tosdr.org | 0.0.0.0:80:6874 | Ignores robots.txt | 10.0.0.3 |
| NosyPeeper | :de: - EU | 188.68.49.4 | :white_check_mark: | :white_check_mark: | nosypeeper.crawler.api.tosdr.org | 0.0.0.0:80:6874 | Ignores robots.txt | 10.0.0.4 |
| Terra | :de: - EU | 87.78.131.160 | :no_entry_sign: | :white_check_mark: | terra.crawler.api.tosdr.org | 0.0.0.0:6874:6874 | Backup only | N/A |
| Whale | :virginia: - US | 157.245.142.64 | :no_entry_sign: | :white_check_mark: | whale.crawler.api.tosdr.org | 0.0.0.0:80:6874 | Ignores robots.txt | N/A |
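Since every cluster node resolves under crawler.api.tosdr.org, a site operator can verify a claimed crawler IP via reverse DNS. A hedged sketch (the function names are illustrative; a real reverse lookup with `socket.gethostbyaddr` needs network access, and a robust check would also forward-resolve the name back to the IP):

```python
# Sketch: check whether a reverse-DNS hostname belongs to the ToS;DR
# crawler cluster. Per the table above, all crawlers live under
# crawler.api.tosdr.org.
import socket

def is_cluster_hostname(hostname: str) -> bool:
    """True if hostname is a subdomain of crawler.api.tosdr.org."""
    return hostname.rstrip(".").endswith(".crawler.api.tosdr.org")

def verify_crawler_ip(ip: str) -> bool:
    """Reverse-resolve ip and check it is in the crawler cluster.
    (A complete check would also confirm the forward A record.)"""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # requires network/DNS
    except OSError:
        return False
    return is_cluster_hostname(hostname)

print(is_cluster_hostname("atlas.crawler.api.tosdr.org"))  # True
print(is_cluster_hostname("evil.example.com"))             # False
```

Note that Terra and Whale have no rDNS records per the table, so for those nodes you would fall back to matching the IP directly.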

Crawler problems

If you are the provider of a website, the most common crawling issue is that our crawler is being blocked, for example by a firewall or a bot-protection service such as Cloudflare. To fix this, add our servers or user agents to their respective whitelist.
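A whitelist check might look like the sketch below. This is an illustrative example only, not something ToS;DR ships; the IPs are copied from the cluster table above, so verify them against current documentation before relying on them:

```python
# Hypothetical allowlist check a site operator could run before a
# bot-blocking rule: let requests through if they come from a known
# ToS;DR crawler IP or carry the crawler's user agent.
TOSDR_CRAWLER_IPS = {
    "202.61.251.191",   # Atlas
    "45.136.28.177",    # Arachne
    "37.120.165.131",   # AvidReader
    "37.120.177.70",    # Floppy
    "185.228.137.101",  # James
    "188.68.49.4",      # NosyPeeper
    "87.78.131.160",    # Terra (backup only)
    "157.245.142.64",   # Whale
}

def should_bypass_bot_block(remote_ip: str, user_agent: str) -> bool:
    """Allow ToS;DR crawler traffic past a bot-protection rule."""
    return (remote_ip in TOSDR_CRAWLER_IPS
            or user_agent.startswith("ToSDRCrawler/"))

print(should_bypass_bot_block("45.136.28.177", "curl/8.0"))   # True
print(should_bypass_bot_block("203.0.113.9", "Mozilla/5.0"))  # False
```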

Error codes, what do they mean?


| Error | Explanation | Fix |
| --- | --- | --- |
| Reason: Error Stacktrace: write EPROTO 140022019606400:error:141A318A:SSL routines:tls_process_ske_dhe:dh key too small:…/deps/openssl/openssl/ssl/statem/statem_clnt.c:2157: | This SSL error means a secure connection could not be established because a handshake cipher is possibly too old. | Update the cipher suites in your web server's SSL configuration. |
| Reason: StatusCodeError Stacktrace: Expected status code 200:OK; got 403:Forbidden | The website blocks our crawler; most likely this is Cloudflare. | Whitelist our crawler cluster or user agent. |
| Please check that the XPath and URL are accurate. | The XPath you retrieved possibly points inside an IFrame, which we cannot crawl, or it is simply the wrong XPath. | Get the raw link from the iframe and use the XPath there. |
| MimeType {MIMETYPE} is not in our whitelist | The document you crawled has a MIME type that is not supported by our server. | Serve a supported MIME type, or suggest that the type be supported. |
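For tooling around these errors, the messages above can be matched by substring and mapped to their fixes. An illustrative sketch only (the substring keys come from the error table above; nothing here is part of the actual Phoenix codebase):

```python
# Hypothetical mapping from crawler error messages to fix hints,
# e.g. for surfacing advice in a curator-facing tool.
ERROR_HINTS = {
    "dh key too small": "Update the cipher suites in your web server's SSL configuration.",
    "got 403:Forbidden": "Whitelist the ToS;DR crawler cluster or its user agent.",
    "XPath and URL are accurate": "The XPath may point inside an iframe; crawl the iframe's raw URL instead.",
    "is not in our whitelist": "Serve a supported MIME type, or suggest that the type be supported.",
}

def hint_for(error_message: str) -> str:
    """Return the first matching fix hint, or a generic fallback."""
    for needle, hint in ERROR_HINTS.items():
        if needle in error_message:
            return hint
    return "No known fix; please report the error."

print(hint_for("Expected status code 200:OK; got 403:Forbidden"))
# Whitelist the ToS;DR crawler cluster or its user agent.
```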