Crawling a service mirrors its documents so that they cannot be altered until the next crawl (verified using CRC).
We do not index websites on our own; all websites are crawled manually by curators or staff on our site.
Identifying the ToS;DR Crawler
All ToS;DR crawlers send an identifying user agent with every request.
Check for the following user agent:
ToSDRCrawler/1.0.0 (+https://to.tosdr.org/bot)
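As a rough illustration, a site operator could recognise the crawler in server-side code by matching the product token of the user agent shown above. This is only a sketch: the function name `is_tosdr_user_agent` is hypothetical, and the loose version match assumes that future releases keep the `ToSDRCrawler/<version>` prefix, which the documentation does not promise. User agents can be spoofed, so this check alone does not prove a request came from ToS;DR.

```python
import re

# Product token taken from the user agent documented above. The version is
# matched loosely on the assumption that later releases keep the same prefix.
CRAWLER_UA = re.compile(r"^ToSDRCrawler/\d+(\.\d+)*\b", re.IGNORECASE)


def is_tosdr_user_agent(user_agent: str) -> bool:
    """Return True if the User-Agent header starts with the ToSDRCrawler token."""
    return bool(CRAWLER_UA.match(user_agent.strip()))


if __name__ == "__main__":
    print(is_tosdr_user_agent("ToSDRCrawler/1.0.0 (+https://to.tosdr.org/bot)"))
    print(is_tosdr_user_agent("Mozilla/5.0 (X11; Linux x86_64)"))
```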
robots.txt
If you want to forbid crawling for some reason, you can include the following directive in your robots.txt:
User-Agent: TosDRCrawler
Disallow: YOUR_PATH
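To check how a standards-following parser would read such a directive, the rule can be fed to Python's built-in `urllib.robotparser`. This is a sketch only: `/private/` is a placeholder standing in for `YOUR_PATH`, `example.com` is an illustrative host, and the helper name `crawler_may_fetch` is hypothetical. It shows how the rule is interpreted, not how the ToS;DR crawler itself evaluates robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt; "/private/" is a placeholder for YOUR_PATH.
ROBOTS_TXT = """\
User-Agent: TosDRCrawler
Disallow: /private/
"""

CRAWLER_USER_AGENT = "ToSDRCrawler/1.0.0 (+https://to.tosdr.org/bot)"


def crawler_may_fetch(robots_txt: str, url: str) -> bool:
    """Return True if the given robots.txt allows the crawler's user agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(CRAWLER_USER_AGENT, url)


if __name__ == "__main__":
    print(crawler_may_fetch(ROBOTS_TXT, "https://example.com/private/terms"))
    print(crawler_may_fetch(ROBOTS_TXT, "https://example.com/terms"))
```

The parser compares user agent names case-insensitively, so the `TosDRCrawler` spelling in the directive still matches the `ToSDRCrawler` token sent in requests.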
Crawler Clusters
Crawler | IP | DNS | Port | Notes | Internal IP |
---|---|---|---|---|---|
Atlas | 202.61.251.191 | atlas.crawler.api.tosdr.org | 0.0.0.0:80:6874 | Ignores Robots.txt | 10.0.0.7 |
Arachne | 45.136.28.177 | arachne.crawler.api.tosdr.org | 0.0.0.0:80:6874 | Ignores Robots.txt | 10.0.0.6 |
AvidReader | 37.120.165.131 | avidreader.crawler.api.tosdr.org | 0.0.0.0:80:6874 | Ignores Robots.txt | 10.0.0.1 |
Floppy | 37.120.177.70 | floppy.crawler.api.tosdr.org | 0.0.0.0:80:6874 | Ignores Robots.txt | 10.0.0.2 |
James | 185.228.137.101 | james.crawler.api.tosdr.org | 0.0.0.0:80:6874 | Ignores Robots.txt | 10.0.0.3 |
NosyPeeper | 188.68.49.4 | nosypeeper.crawler.api.tosdr.org | 0.0.0.0:80:6874 | Ignores Robots.txt | 10.0.0.4 |
Terra | 87.78.131.160 | terra.crawler.api.tosdr.org | 0.0.0.0:6874:6874 | Backup Only | N/A |
Whale | 157.245.142.64 | whale.crawler.api.tosdr.org | 0.0.0.0:80:6874 | Ignores Robots.txt | N/A |
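Because a user agent can be spoofed, an operator may also want to compare the source address of a request against the public IPs in the table above. The following is a minimal sketch of that check; the function name `is_crawler_ip` is hypothetical, and the list is copied from this page at the time of writing, so it may go stale if the cluster changes.

```python
import ipaddress

# Public IPs copied from the Crawler Clusters table; keep in sync with the page.
CRAWLER_IPS = frozenset(
    ipaddress.ip_address(ip)
    for ip in (
        "202.61.251.191",   # Atlas
        "45.136.28.177",    # Arachne
        "37.120.165.131",   # AvidReader
        "37.120.177.70",    # Floppy
        "185.228.137.101",  # James
        "188.68.49.4",      # NosyPeeper
        "87.78.131.160",    # Terra (backup only)
        "157.245.142.64",   # Whale
    )
)


def is_crawler_ip(remote_addr: str) -> bool:
    """Return True if remote_addr is one of the listed crawler IPs."""
    try:
        return ipaddress.ip_address(remote_addr) in CRAWLER_IPS
    except ValueError:
        # Not a syntactically valid IP address.
        return False


if __name__ == "__main__":
    print(is_crawler_ip("202.61.251.191"))
    print(is_crawler_ip("192.0.2.1"))
```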
Crawler problems
If you are the provider of the website, common causes of crawling issues are:
- Cloudflare
- robots.txt
- iptables-based restrictions (see Crawler Clusters)
- User-Agent based blocking
To fix this, add our servers or user agent to the respective whitelist.
Error codes, what do they mean?
Error | Explanation | Fix |
---|---|---|
Reason: Error Stacktrace: write EPROTO 140022019606400:error:141A318A:SSL routines:tls_process_ske_dhe:dh key too small:…/deps/openssl/openssl/ssl/statem/statem_clnt.c:2157: | This SSL error means a secure connection could not be established because the Diffie-Hellman key offered by the server during the handshake is too small, which usually points to an outdated cipher configuration. | Update the cipher suites in your web server's SSL configuration |
Reason: StatusCodeError Stacktrace: Expected status code 200:OK; got 403:Forbidden | The website blocks our crawler. Most likely this is Cloudflare. | Whitelist our crawler cluster or user agent |
Please check that the XPath and URL are accurate. | The XPath you retrieved possibly points to content inside an iframe, which we cannot crawl, or it is simply the wrong XPath. | Get the raw link from the iframe and use the XPath there. |
MimeType {MIMETYPE} is not in our whitelist | The MIME type of the document you crawled is not supported by our server. | Fix the MIME type or suggest that it be supported. |
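For the 403 case in the table above, a website provider can get a rough idea of whether user-agent based blocking is the cause by requesting the page with the crawler's user agent and looking at the status code. The sketch below uses only the Python standard library; the helper names `build_crawler_request` and `status_for` and the example URL are hypothetical. The request is sent from your own machine, so it cannot reproduce blocks that are based on the crawler's IP addresses.

```python
import urllib.error
import urllib.request

CRAWLER_USER_AGENT = "ToSDRCrawler/1.0.0 (+https://to.tosdr.org/bot)"


def build_crawler_request(url: str) -> urllib.request.Request:
    """Build a GET request that carries the crawler's user agent."""
    return urllib.request.Request(url, headers={"User-Agent": CRAWLER_USER_AGENT})


def status_for(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code the server answers with for this user agent."""
    try:
        with urllib.request.urlopen(build_crawler_request(url), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses; the code is what we want to see.
        return err.code


if __name__ == "__main__":
    # Replace with the document URL that failed to crawl.
    print(status_for("https://example.com/terms"))
```

A 403 here while a regular browser gets 200 suggests that the user agent is being filtered. A 200 here does not rule out an IP-based block.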