Chromebot
Revision as of 20:58, 24 April 2021
chromebot, a.k.a. crocoite, was an IRC bot, parallel to ArchiveBot, that used Google Chrome and was therefore able to archive JavaScript-heavy and bottomless (infinitely scrolling) websites. On 2021-04-21 the bot was shut down. WARCs were uploaded twice a day to the chromebot collection on archive.org. For a given item in the collection, you can see which URLs are saved in the WARC by looking at the associated jobs.json.gz file.
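As a sketch of how one might list the URLs recorded for an item, here is a small helper that reads such a file. It assumes jobs.json.gz is a gzip-compressed file containing one JSON object per line, each with a "url" field; the exact schema chromebot used is an assumption here, so adapt the key lookup to what the real file contains.

```python
import gzip
import json

def urls_from_jobs(path):
    """Extract URLs from a jobs.json.gz-style file.

    Assumes one JSON object per line with a "url" key; the real
    schema written by chromebot may differ.
    """
    urls = []
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            if "url" in record:
                urls.append(record["url"])
    return urls
```

Downloading the jobs.json.gz file from the item on archive.org and passing its local path to this function would then yield the list of saved URLs, under the schema assumption above.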
By default the bot grabs only a single URL. However, it also supports recursion, which is rather slow, since every single page needs to be loaded and rendered by a browser. A dashboard is available for watching the progress of such jobs.
Usage

See the crocoite usage documentation at https://6xq.net/crocoite/usage/.

Command | Description
---|---
chromebot: a <url> -r <policy> -j <concurrency> | Archive <url> with <concurrency> processes according to recursion <policy>.
chromebot: s <uuid> | Get job status for <uuid>.
chromebot: r <uuid> | Revoke or abort running job with <uuid>.
Please note that the commands are case-sensitive.
URL lists can be archived using recursion, for example:
chromebot: a https://transfer.notkiska.pw/inline/UpfR/HollyConrad-tweets -r 1 -j 4
chromebot will assume all lines starting with http(s):// are valid links. Note that the list itself must be returned by the server as an *inline* document, not as a download (attachment).
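The filtering rule described above can be sketched as follows. This is a minimal reimplementation of the stated behaviour, not chromebot's actual code:

```python
def extract_links(text):
    """Keep only lines that start with http:// or https://,
    per chromebot's stated rule for URL lists."""
    links = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("http://") or line.startswith("https://"):
            links.append(line)
    return links
```

Any line not beginning with an http(s) scheme, such as a comment or stray text in the list, is simply ignored.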
Restrictions
chromebot has been blacklisted by Instagram. When trying to archive any instagram.com URL, chromebot responds with the following error:
<Instagram.com URL> cannot be queued: Banned by Instagram
Cloudflare DDoS protection
chromebot should be able to circumvent Cloudflare's DDoS protection, but scrolling and other behaviour may be disabled after the page reload (issue #13 on GitHub).
People
PurpleSym maintains the software and scripts, pays the server bills, and has administrative access. katocala is a server administrator.