Difference between revisions of "Chromebot"

From Archiveteam
m (→‎People: add devops scripts)
(14 intermediate revisions by 3 users not shown)
chromebot is an [[IRC]] bot parallel to [[ArchiveBot]] that uses Google Chrome and thus is able to archive JavaScript-heavy websites. Both, [https://github.com/PromyLOPh/crocoite software] and bot, are maintained by [[User:PurpleSymphony]]. WARCs are uploaded daily to the [https://archive.org/details/archiveteam_chromebot?sort=-publicdate chromebot collection] on archive.org.
'''chromebot''' aka. '''crocoite''' is an [[IRC]] bot parallel to [[ArchiveBot]] that uses Google Chrome and thus is able to archive JavaScript-heavy and bottomless websites. [[WARC]]s are uploaded twice a day to the [https://archive.org/details/archiveteam_chromebot?sort=-publicdate chromebot collection] on archive.org.


By default the bot only grabs a single URL. However it supports recursion, which is rather slow, since every single page needs to be loaded and rendered by a browser. A [https://6xq.net/chromebot/ dashboard] is available for watching the progress of such jobs.
By default the bot only grabs a single URL. However it supports recursion, which is rather slow, since every single page needs to be loaded and rendered by a browser. A [http://chromebot.6xq.net/ dashboard] is available for watching the progress of such jobs.


== Usage<ref name=usage>[https://github.com/PromyLOPh/crocoite/blob/184189f0a535996edca01a68182ed07d32e26e9c/README.rst#IRC-bot ChromeBot usage documentation on GitHub]</ref> ==
== Usage ==
You can call ''chromebot'' on the {{IRC|archivebot}} IRC channel, which chromebot shares with its parent [[ArchiveBot]]. Both “<code>chromebot</code>” and “<code>chromebot:</code>” work, with or without the colon. The username can be autocompleted using the “<kbd>↹</kbd>Tab” key in the EFNet web chat interface or IRC client.
[https://github.com/PromyLOPh/crocoite/blob/184189f0a535996edca01a68182ed07d32e26e9c/README.rst#IRC-bot crocoite usage documentation on GitHub]
 
You can call ''chromebot'' on the {{IRC|archivebot}} IRC channel, which chromebot shares with [[ArchiveBot]]. Both “<code>chromebot</code>” and “<code>chromebot:</code>” work, with or without the colon.


{| class="wikitable"
{| class="wikitable"
|-
|-
! Command !! Description
! Command !! Description
|-
| white-space: nowrap |
<code>chromebot: a <url> -r <policy> -j <concurrency></code>
|| Archive ''<url>'' with ''<concurrency>'' processes according to recursion ''<policy>''.
|-
|-
| <code>chromebot: a <uuid></code><br /><code>chromebot a <uuid></code>  || Archive <url> with <concurrency> processes according to recursion <policy>.
| <code>chromebot: s <uuid></code> || Get job status for ''<uuid>''.
|-
| <code>chromebot: s <uuid></code><br /><code>chromebot s <uuid></code> ||     Get job status for <uuid>.
|-
|-
| <code>chromebot: r <uuid></code><br /><code>chromebot r <uuid></code> || Revoke or abort running job with <uuid>.
| <code>chromebot: r <uuid></code> || Revoke or abort running job with ''<uuid>''.
|}
|}


Please note that the commands are case-sensitive.
Please note that the commands are case-sensitive.
URL lists can be archived using recursion, for example:
<code>chromebot: a https://transfer.notkiska.pw/inline/UpfR/HollyConrad-tweets -r 1 -j 4</code>
chromebot will assume all lines starting with http(s):// are valid links. Note that the list itself must be returned by the server as an *inline* document, not as a download (attachment).


== Restrictions ==
== Restrictions ==
=== Instagram.com ===
=== Instagram ===
ChromeBot has been blacklisted by [[Instagram]], a website infamous for being an archival loophole.
chromebot has been blacklisted by [[Instagram]]. When trying to archive any Instagram.com website, chromebot responds with the following error:
''<Instagram.com URL> cannot be queued: Banned by Instagram''


When trying to archive any Instagram.com website, chromebot responds with the following error:
=== Cloudflare DDoS protection ===
''<Instagram.com URL> cannot be queued: Banned by Instagram''
chromebot should be able to circumvent Cloudflare's DDoS protection, but scrolling and other behaviour may be disabled after the reload ([https://github.com/PromyLOPh/crocoite/issues/13 issue #13 on GitHub]).


One way to bypass Instagram's restrictions partially is using [http://Insta-Stalker.com/ Insta-Stalker.com], which is just a third-party web viewer for Instagram, equipped with an AJAX-free user search feature and the ability to view profiles without Instagram's new Web-App-type website (similar to [https://mobile.twitter.com/ Twitter Lite]) that made Instagram inaccessible to the [[Wayback Machine]] and [[Archive.Today]]'s crawlers. The former gets stuck in an infinite refresh loop.
== People ==


'''URL format:'''
[[User:PurpleSymphony|PurpleSym]] maintains [https://github.com/PromyLOPh/crocoite software], [https://github.com/PromyLOPh/chromebot scripts], pays the server bills and has administrative access. katocala is a server administrator.
* Search URL: https://insta-stalker.com/search/?q=<code>Search+Term+here</code>
* User URL (example): https://insta-stalker.com/profile/SamsungMobile/


== References ==
[[Category:Bots]]
<references />

Revision as of 18:53, 30 December 2019

chromebot, also known as crocoite, is an IRC bot that runs alongside ArchiveBot. It uses Google Chrome and can therefore archive JavaScript-heavy and bottomless (endlessly scrolling) websites. WARCs are uploaded twice a day to the chromebot collection on archive.org.

By default, the bot grabs only a single URL. It also supports recursion, but this is rather slow because every page must be loaded and rendered by a browser. A dashboard is available for watching the progress of such jobs.

Usage

crocoite usage documentation on GitHub

You can call chromebot in the #archivebot IRC channel (on hackint), which it shares with ArchiveBot. Commands may be addressed to either “chromebot” or “chromebot:”; the trailing colon is optional.

Command                                            Description
chromebot: a <url> -r <policy> -j <concurrency>    Archive <url> with <concurrency> processes according to recursion <policy>.
chromebot: s <uuid>                                Get job status for <uuid>.
chromebot: r <uuid>                                Revoke or abort running job with <uuid>.
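
The command syntax above is regular enough to assemble programmatically. The following is a minimal Python sketch, not part of crocoite: the helper names `build_archive_command` and `build_job_command` are hypothetical, and the sketch assumes that `-r` and `-j` may be omitted, as the single-URL default suggests. The recursion policy is passed through unchecked, since the accepted values are listed in the crocoite documentation rather than here.

```python
def build_archive_command(url, policy=None, concurrency=None, nick="chromebot"):
    """Return an IRC message asking the bot to archive `url`.

    Hypothetical helper for illustration; not part of crocoite.
    The -r and -j options are appended only when a value is supplied
    (assumption: both options are optional).
    """
    parts = [f"{nick}:", "a", url]
    if policy is not None:
        parts += ["-r", str(policy)]
    if concurrency is not None:
        parts += ["-j", str(concurrency)]
    return " ".join(parts)


def build_job_command(action, uuid, nick="chromebot"):
    """Return a status ("s") or revoke ("r") message for job `uuid`.

    Commands are case-sensitive, so only the lowercase letters are accepted.
    """
    if action not in ("s", "r"):
        raise ValueError("action must be 's' (status) or 'r' (revoke)")
    return f"{nick}: {action} {uuid}"
```

For example, `build_archive_command("https://example.com/", policy=1, concurrency=4)` produces a message in the same shape as the first row of the table, and `build_job_command("s", "<uuid>")` matches the second.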

Please note that the commands are case-sensitive.

URL lists can be archived using recursion, for example:

chromebot: a https://transfer.notkiska.pw/inline/UpfR/HollyConrad-tweets -r 1 -j 4

chromebot assumes that all lines starting with http:// or https:// are valid links. Note that the server must return the list itself as an inline document, not as a download (attachment).
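
The line-filtering rule described above can be illustrated with a short Python sketch. This is not crocoite's actual implementation; the function name `extract_links` is hypothetical, and the sketch only mirrors the stated rule that lines beginning with http:// or https:// are treated as links. It can be used to preview which lines of a list would likely be picked up before submitting it.

```python
def extract_links(list_body):
    """Return the lines of `list_body` that start with http:// or https://.

    Hypothetical helper for illustration; not taken from crocoite.
    The prefix check is applied to the line as written, so lines with
    leading whitespace or any other scheme are skipped. Trailing
    whitespace is removed from the lines that are kept.
    """
    links = []
    for line in list_body.splitlines():
        if line.startswith(("http://", "https://")):
            links.append(line.rstrip())
    return links
```

Lines that do not match, such as comments or entries with other schemes, are ignored in this sketch.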

Restrictions

Instagram

chromebot has been blacklisted by Instagram. When asked to archive any Instagram.com URL, chromebot responds with the following error:

<Instagram.com URL> cannot be queued: Banned by Instagram

Cloudflare DDoS protection

chromebot should be able to circumvent Cloudflare's DDoS protection, but scrolling and other behaviour may be disabled after the reload (issue #13 on GitHub).

People

PurpleSym maintains the software and scripts, pays the server bills, and has administrative access. katocala is a server administrator.