Difference between revisions of "Dev/Infrastructure"

From Archiveteam

Revision as of 18:40, 28 June 2015

The Archive Team infrastructure is a distributed web processing system used for distributed preservation of service attacks.

Component Overview

[Figure: Archiveteam warrior infrastructure.png]

Figure  Description
1       Website in Danger
2       Warrior
3       Tracker
4       Staging Server
5       Internet Archive

Website in Danger

The website in danger is typically a website exhibiting some combination of the following:

  • acquihire
  • mass layoffs
  • neglect, decay, poor health, or owners missing in action
  • political and legal issues
  • a robots.txt exclusion file that forbids crawling by the Wayback Machine (whether intentionally or unintentionally)
  • cultural significance

Warrior

The Warrior is client code run by volunteers that grabs/scrapes the content of the website in danger.

Websites often implement throttling systems to protect themselves against problems such as spam or excessive server load. These systems typically ban by IP address, so many Warriors, running on many IP addresses, are needed.

Content is usually grabbed and saved in WARC files.

Tracker

The Tracker is server code run by "core" Archive Team volunteers. The Tracker assigns what each Warrior should download and provides a leaderboard.
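The assignment and leaderboard roles described above can be illustrated with a toy, in-memory sketch. This is not the actual Tracker code; the class name `Tracker`, its methods, and the item strings are all hypothetical and exist only to show the idea of handing out work and counting completions.

```python
from collections import Counter, deque


class Tracker:
    """Toy in-memory tracker: hands out work items and tallies completions.

    Hypothetical illustration only; not the real Archive Team Tracker.
    """

    def __init__(self, items):
        self.todo = deque(items)   # items not yet assigned
        self.claimed = {}          # item -> warrior currently working on it
        self.scores = Counter()    # warrior -> number of items finished

    def request_item(self, warrior):
        """Assign the next pending item to a warrior, or None if none remain."""
        if not self.todo:
            return None
        item = self.todo.popleft()
        self.claimed[item] = warrior
        return item

    def mark_done(self, warrior, item):
        """Record a finished item; reject reports that do not match a claim."""
        if self.claimed.get(item) != warrior:
            return False
        del self.claimed[item]
        self.scores[warrior] += 1
        return True

    def leaderboard(self):
        """Return (warrior, finished_count) pairs, highest count first."""
        return self.scores.most_common()
```

A Warrior in this sketch would call `request_item`, grab the content, then call `mark_done`. A real tracker would also need persistence and a way to reissue items that are claimed but never finished, which this sketch leaves out.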

Staging Server

Staging servers are typically hosts running rsync, often operated by "core" volunteers. Warriors upload WARC files to these hosts. The hosts queue the WARC files and package them into larger WARC files (Megawarcs). The Megawarcs are then uploaded to the Internet Archive under the Archive Team collection.
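The packaging step can be sketched as follows. This is not the actual Megawarc tooling; it is a minimal illustration that relies on one property of gzip: several gzip members concatenated back to back still form a valid gzip stream. The function name `pack_megawarc` and the index layout are assumptions made for the example.

```python
import gzip
import io


def pack_megawarc(parts):
    """Concatenate gzip-compressed blobs and record where each one landed.

    `parts` is a list of (name, gzip_bytes) pairs. Returns the combined
    bytes plus an index of (name, offset, length) entries, so that any
    original file can be cut back out of the combined blob.
    Hypothetical sketch; not the real Megawarc format.
    """
    out = io.BytesIO()
    index = []
    for name, blob in parts:
        index.append((name, out.tell(), len(blob)))
        out.write(blob)
    return out.getvalue(), index


# Usage: two small compressed payloads standing in for uploaded WARC files.
first = gzip.compress(b"record-a\r\n")
second = gzip.compress(b"record-b\r\n")
combined, index = pack_megawarc([("a.warc.gz", first), ("b.warc.gz", second)])

# The combined blob decompresses as one stream...
whole = gzip.decompress(combined)
# ...and the index lets a single original file be recovered on its own.
_, offset, length = index[1]
only_second = gzip.decompress(combined[offset:offset + length])
```

Keeping an offset index alongside the combined file is one way to make the packaging reversible; whether the real staging hosts do it this way is not stated here.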

Internet Archive

The Internet Archive is a digital library and archive. It differs from other hosting services because it is not a distribution platform. If there is a legal issue, items are "darked" instead of deleted.

An item is ingested by the Wayback Machine if it

  • has warc.gz files,
  • has a "web" media type,
  • and is under the Archive Team collection.
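The three conditions above can be expressed as a simple check. This is a sketch over a plain dictionary, not a call to any Internet Archive API; the keys `files`, `mediatype`, and `collections`, and the collection identifier `archiveteam`, are assumptions chosen for illustration.

```python
def is_wayback_eligible(item, collection="archiveteam"):
    """Return True if an item description meets all three conditions.

    `item` is a plain dict with hypothetical keys:
      files       - list of file names in the item
      mediatype   - the item's media type string
      collections - list of collection identifiers the item belongs to
    """
    has_warc = any(name.endswith(".warc.gz") for name in item.get("files", []))
    is_web = item.get("mediatype") == "web"
    in_collection = collection in item.get("collections", [])
    return has_warc and is_web and in_collection


# Usage with a made-up item description.
example = {
    "files": ["example-site-20150628.megawarc.warc.gz", "index.json"],
    "mediatype": "web",
    "collections": ["archiveteam"],
}
eligible = is_wayback_eligible(example)
```

All three conditions must hold at once: an item with WARC files but a different media type, or one outside the collection, would fail this check.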

