Difference between revisions of "Chromebot"

Revision as of 20:58, 24 April 2021

chromebot, also known as crocoite, was an IRC bot that ran parallel to ArchiveBot. It used Google Chrome and was therefore able to archive JavaScript-heavy and bottomless websites. The bot was shut down on 2021-04-21. WARCs were uploaded twice a day to the chromebot collection on archive.org. For a given item in the collection, you can see which URLs are saved in the WARC by looking at the associated jobs.json.gz file.
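For inspecting a downloaded jobs.json.gz file, a minimal Python sketch follows. The layout of the file is not described on this page, so the code makes no assumption about field names: it only decompresses the file and parses it, first as a single JSON document and, if that fails, as one JSON value per line. The function name `load_jobs` is hypothetical.

```python
import gzip
import json
from typing import Any


def load_jobs(path: str) -> Any:
    """Decompress a jobs.json.gz file and return the parsed JSON.

    Assumption: the file is either one JSON document or JSON Lines.
    The actual schema is not documented here, so nothing is extracted.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        text = handle.read()
    try:
        # Case 1: the whole file is a single JSON document.
        return json.loads(text)
    except json.JSONDecodeError:
        # Case 2: one JSON value per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
```

Printing the returned object is the simplest way to find out where the URLs are stored in a particular item.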

By default, the bot only grabs a single URL. However, it supports recursion, which is rather slow because every single page needs to be loaded and rendered by a browser. A dashboard is available for watching the progress of such jobs.

Usage

crocoite usage documentation

Command	Description
`chromebot: a <url> -r <policy> -j <concurrency>`	Archive <url> with <concurrency> processes according to recursion <policy>.
`chromebot: s <uuid>`	Get job status for <uuid>.
`chromebot: r <uuid>`	Revoke or abort running job with <uuid>.

Please note that the commands are case-sensitive.
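The archive command from the table can be assembled programmatically, for example when preparing many requests in advance. The sketch below only formats the string in the documented form; the function name `build_archive_command` is hypothetical, and the values accepted for the recursion policy are not validated because they are not listed on this page.

```python
def build_archive_command(url: str, policy: str, concurrency: int) -> str:
    """Format 'chromebot: a <url> -r <policy> -j <concurrency>'.

    Assumption: the caller supplies a policy the bot understands;
    this helper does not check it.
    """
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    # The command letter 'a' is lowercase; commands are case-sensitive.
    return f"chromebot: a {url} -r {policy} -j {concurrency}"


print(build_archive_command("https://example.org/", "1", 4))
```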

URL lists can be archived using recursion, for example:

chromebot: a https://transfer.notkiska.pw/inline/UpfR/HollyConrad-tweets -r 1 -j 4

chromebot will assume all lines starting with http(s):// are valid links. Note that the list itself must be returned by the server as an *inline* document, not as a download (attachment).
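Before submitting a URL list, it can help to check locally which lines would count as links and whether the hosting server returns the list inline. The following Python sketch illustrates both rules as described above; it is not chromebot's own code. The function names are hypothetical, and stripping surrounding whitespace from each line is an assumption.

```python
from typing import List, Optional


def extract_links(text: str) -> List[str]:
    """Return the lines that start with http:// or https://."""
    links = []
    for line in text.splitlines():
        candidate = line.strip()  # assumption: surrounding whitespace is ignored
        if candidate.startswith(("http://", "https://")):
            links.append(candidate)
    return links


def is_served_inline(content_disposition: Optional[str]) -> bool:
    """True unless the Content-Disposition header marks an attachment."""
    if content_disposition is None:
        return True  # without the header, browsers display the body inline
    disposition_type = content_disposition.split(";", 1)[0].strip().lower()
    return disposition_type != "attachment"
```

The header value passed to `is_served_inline` would come from the response to a request for the list URL, for example as reported by a browser's network inspector.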

Restrictions

Instagram

chromebot has been blacklisted by Instagram. When asked to archive any Instagram.com URL, chromebot responds with the following error:

<Instagram.com URL> cannot be queued: Banned by Instagram

Cloudflare DDoS protection

chromebot should be able to circumvent Cloudflare's DDoS protection, but scrolling and other behaviour may be disabled after the reload (issue #13 on GitHub).

People

PurpleSym maintains the software and scripts, pays the server bills, and has administrative access. katocala is a server administrator.

@@ Line 1: / Line 1: @@
-'''chromebot''' aka. '''crocoite''' is an [[IRC]] bot parallel to [[ArchiveBot]] that uses Google Chrome and thus is able to archive JavaScript-heavy and bottomless websites. [[WARC]]s are uploaded twice a day to the [https://archive.org/details/archiveteam_chromebot?sort=-publicdate chromebot collection] on archive.org and are later ingested into the Wayback Machine. For a given item in the collection, you can see what URLs are saved in the warc by looking at the associated jobs.json.gz file.
+'''chromebot''' aka. '''crocoite''' was an [[IRC]] bot parallel to [[ArchiveBot]] that uses Google Chrome and thus is able to archive JavaScript-heavy and bottomless websites. On 2021-04-21 the bot was shut down. [[WARC]]s were uploaded twice a day to the [https://archive.org/details/archiveteam_chromebot?sort=-publicdate chromebot collection] on archive.org. For a given item in the collection, you can see what URLs are saved in the warc by looking at the associated jobs.json.gz file.
 By default the bot only grabs a single URL. However it supports recursion, which is rather slow, since every single page needs to be loaded and rendered by a browser. A [http://chromebot.6xq.net/ dashboard] is available for watching the progress of such jobs.
 == Usage ==
-[https://github.com/PromyLOPh/crocoite/blob/184189f0a535996edca01a68182ed07d32e26e9c/README.rst#IRC-bot crocoite usage documentation on GitHub]
+[https://6xq.net/crocoite/usage/ crocoite usage documentation]
-You can call ''chromebot'' on the {{IRC|archivebot}} IRC channel, which chromebot shares with [[ArchiveBot]]. Both “<code>chromebot</code>” and “<code>chromebot:</code>” work, with or without the colon.
 {| class="wikitable"
@@ Line 27: / Line 25: @@
 <code>chromebot: a https://transfer.notkiska.pw/inline/UpfR/HollyConrad-tweets -r 1 -j 4</code>
-chromebot will assume all lines starting with http(s):// are valid links. Note that the list itself must be retured by the server as an *inline* document, not as a download (attachment).
+chromebot will assume all lines starting with http(s):// are valid links. Note that the list itself must be returned by the server as an *inline* document, not as a download (attachment).
 == Restrictions ==
