Internet Archive

Internet Archive main page on 2010-12-21
URL	http://www.archive.org
Status	Online!
Archiving status	In progress...
Archiving type	Unknown
Project source	IA.BAK
Project tracker	ia.bak
IRC channel	#internetarchive.bak (on hackint)

The Internet Archive is a non-profit digital library whose stated mission is "universal access to all knowledge". It stores over 400 billion web page captures from different dates and times for historical purposes, available through the Wayback Machine, arguably an archivist's wet dream. The archive.org website also archives books, music, videos, and software.

Mirrors

There are currently two mirrors of the Internet Archive collection: the official mirror available at archive.org, and a second mirror at Bibliotheca Alexandrina. Both seem to be up and stable.

Some manually selected collections are also mirrored as part of the INTERNETARCHIVE.BAK project. See that page and the section "Backing up the Internet Archive" below.

Raw Numbers

December 2010:

4 data centers, 1,300 nodes, 11,000 spinning disks
Wayback Machine: 2.4 PetaBytes
Books/Music/Video Collections: 1.7 PetaBytes
Total used storage: 5.8 PetaBytes

August 2014:

4 data centers, 550 nodes, 20,000 spinning disks
Wayback Machine: 9.6 PetaBytes
Books/Music/Video Collections: 9.8 PetaBytes
Unique data: 18.5 PetaBytes
Total used storage: 50 PetaBytes

Items added per year

Search run at 21:56, 17 January 2016 (EST). The counts come only from the (mutable) "addeddate" metadata field, so they might change, although they shouldn't.

Year	Items added
2001	63
2002	4,212
2003	18,259
2004	61,629
2005	61,004
2006	185,173
2007	334,015
2008	429,681
2009	807,371
2010	813,764
2011	1,113,083
2012	1,651,036
2013	3,164,482
2014	2,424,610
2015	3,113,601
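
The page does not record the exact search that produced these counts. As a rough sketch of how such a per-year count might be reproduced, the snippet below builds a range query on the "addeddate" field and, optionally, asks archive.org for the number of matches through the third-party internetarchive Python package. The query syntax, the function names, and the choice of date bounds are assumptions, not taken from the original search; whether the upper bound covers the whole of 31 December depends on the search backend.

```python
from datetime import date


def addeddate_query(year: int) -> str:
    """Build a range query on the (mutable) addeddate field for one year.

    Assumption: the search engine accepts Lucene-style ranges of the
    form field:[start TO end] with ISO dates.
    """
    start = date(year, 1, 1).isoformat()
    end = date(year, 12, 31).isoformat()
    return f"addeddate:[{start} TO {end}]"


def count_items_added(year: int) -> int:
    """Return how many items match the query for the given year.

    Needs network access and the third-party `internetarchive` package,
    which is imported lazily so the query builder works without it.
    """
    from internetarchive import search_items

    return search_items(addeddate_query(year)).num_found
```

Calling addeddate_query(2015) only builds the query string; count_items_added(2015) performs the actual search and may return a different number from the table, since the metadata can change.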

Uploading to archive.org

Upload any content you manage to preserve! Registering takes a minute.

Tools:

For quick one-shot webpage archiving, use the Wayback Machine's "Save Page Now" tool.
- There's also an awesome JavaScript Bookmarklet and Chrome extension made by @bitsgalore that provide a fast way to submit pages on the Internet Archive. You can get them here: http://www.bitsgalore.org/2014/08/02/How-to-save-a-web-page-to-the-Internet-Archive/
S3 interface (for direct usage with curl, or indirect with the tool of your choice.)
- internetarchive Python tool is one such tool.
Handy script for mass upload with automatic error checking and retry.
Torrent upload, useful if you need to resume (for huge files or because your bandwidth is insufficient to upload in one go):
- Just create the item, make a torrent with your files in it, name it like the item, and upload it to the item.
- archive.org will connect to you and other peers via a Transmission daemon and keep downloading all the contents till done;
- For a command line tool you can use e.g. mktorrent or buildtorrent, example: mktorrent -a udp://tracker.publicbt.com:80/announce -a udp://tracker.openbittorrent.com:80 -a udp://tracker.ccc.de:80 -a udp://tracker.istole.it:80 -a http://tracker.publicbt.com:80/announce -a http://tracker.openbittorrent.com/announce "DIRECTORYTOUPLOAD" ;
- You can then seed the torrent with one of the many graphical clients (e.g. Transmission) or on the command line (Transmission and rtorrent are the most popular; btdownloadcurses reportedly doesn't work with udp trackers.)
- archive.org will stop the download if the torrent stalls for some time and add a file to your item called "resume.tar.gz", which contains whatever data was downloaded. To resume, delete the empty file called IDENTIFIER_torrent.txt; then, resume the download by re-deriving the item (you can do that from the Item Manager.) Make sure that there are online peers with the data before re-deriving and don't delete the torrent file from the item.

Don't use FTP upload. Try to keep your items below 400 GiB in size, and add plenty of metadata.
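
As an illustration of the S3 interface and of attaching metadata at upload time, here is a minimal sketch using only the Python standard library. The endpoint host, the "LOW" authorization scheme, and the x-amz-auto-make-bucket and x-archive-meta-* headers follow the public IAS3 documentation as best understood here, not this page, so treat them as assumptions to verify. The function names, the item identifier, and the metadata values are hypothetical. The sketch reads the whole file into memory and passes metadata values through unchanged, so it only suits small files and plain ASCII metadata.

```python
import os
import urllib.parse
import urllib.request

# Assumption: S3-like endpoint as described in the public IAS3 docs.
S3_ENDPOINT = "https://s3.us.archive.org"


def s3_url(identifier: str, filename: str) -> str:
    """URL of one file inside an item (the item acts as the bucket)."""
    return f"{S3_ENDPOINT}/{identifier}/{urllib.parse.quote(filename)}"


def s3_headers(access_key: str, secret_key: str, metadata: dict) -> dict:
    """Build request headers: credentials plus one header per metadata field."""
    headers = {
        "authorization": f"LOW {access_key}:{secret_key}",
        # Ask the server to create the item if it does not exist yet.
        "x-amz-auto-make-bucket": "1",
    }
    for field, value in metadata.items():
        headers[f"x-archive-meta-{field}"] = value
    return headers


def upload_file(path: str, identifier: str, access_key: str,
                secret_key: str, metadata: dict) -> int:
    """PUT a single local file into the item and return the HTTP status."""
    with open(path, "rb") as handle:
        body = handle.read()  # whole file in memory: fine for a sketch only
    request = urllib.request.Request(
        s3_url(identifier, os.path.basename(path)),
        data=body,
        method="PUT",
        headers=s3_headers(access_key, secret_key, metadata),
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

For real uploads, the internetarchive Python tool mentioned above handles retries, large files, and metadata encoding, which this sketch deliberately leaves out.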

Formats: anything, but:

Sites should be uploaded in WARC format;
Audio, video, books and other prints are supported from a number of formats;
For .tar and .zip files archive.org offers an online browser to search and download the specific files one needs, so you probably want to use one of the two unless you have a good reason not to (e.g. if 7z or bzip2 would reduce the size tenfold).

This unofficial documentation page explains several of the special files found in every item.

Downloading from archive.org

Wayback Machine APIs
internetarchive Python tool
Manually, from an individual item: click "HTTPS"; or replace details with download in the URL and reload. This will take you to a page with a link to download a ZIP containing the original files and metadata.
In bulk: see http://blog.archive.org/2012/04/26/downloading-in-bulk-using-wget/
There's also an unofficial shell function that checks how many URLs the Wayback Machine lists for a domain name.
Individual files within .zip and .tar archives can be listed, and downloaded, by appending a slash after the /download/ URL. This will bring up a listing of the content, from a URL with zipviewer.php in it. For example: https://archive.org/download/CreativeComputing_v03n06_NovDec1977/Creative_Computing_v03n06_Nov_Dec_1977_jp2.zip/
To download a raw, unmodified page from the Wayback Machine, add "id_" to the end of the timestamp, e.g.

https://web.archive.org/web/20130806040521id_/http://faq.web.archive.org/page-without-wayback-code/

There are also some other codes that can be added to the end of the timestamp, as described here: http://archive-access.sourceforge.net/projects/wayback/administrator_manual.html
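
The URL patterns described in this section can be sketched as three small helpers. The function names are made up for illustration; the patterns themselves are the ones given above (replacing details with download, appending a slash to an archive's download URL, and adding "id_" after the timestamp).

```python
def download_url(details_url: str) -> str:
    """Turn an item's /details/ URL into the matching /download/ URL."""
    return details_url.replace("/details/", "/download/", 1)


def archive_listing_url(identifier: str, archive_name: str) -> str:
    """URL listing the files inside a .zip or .tar stored in an item.

    The trailing slash is what triggers the listing.
    """
    return f"https://archive.org/download/{identifier}/{archive_name}/"


def wayback_raw_url(timestamp: str, original_url: str) -> str:
    """URL of the raw, unmodified capture: "id_" goes right after the timestamp."""
    return f"https://web.archive.org/web/{timestamp}id_/{original_url}"
```

For example, wayback_raw_url("20130806040521", "http://faq.web.archive.org/page-without-wayback-code/") rebuilds the example URL shown above.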

Browsing

There are 6 top-level collections in the Archive, under which pretty much everything else falls. These are:

web -- Web Crawls
texts -- eBooks and Texts
movies -- Moving Image Archive
audio -- Audio Archive
software -- The Internet Archive Software Collection
image -- Images

This is an incomplete list of significant sub-collections within the top-level ones:

texts -- eBooks and Texts
- opensource -- Community Texts

movies -- Moving Image Archive
- opensource_movies -- Community Video
- television -- Television
  - adviews -- AdViews
  - tv -- TV News Search & Borrow
- tvarchive -- Television Archive (where the content in "TV News Search & Borrow" is located; not directly accessible)

audio -- Audio Archive
- opensource_audio -- Community Audio
- etree -- Live Music Archive
- librivoxaudio -- The LibriVox Free Audiobook Collection

software -- The Internet Archive Software Collection
- 301works -- 301Works.org
- consolelivingroom -- Console Living Room
- coverdiscs -- CD and DVD Coverdisc Collection
- softwarelibrary -- Software Library
- open_source_software -- Community Software

image -- Images
- flickrcommons -- Flickr Commons Archive
- maps_usgs -- USGS Maps
- nasa -- NASA Images
- coverartarchive -- Cover Art Archive
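
The identifiers in the lists above can be turned into URLs for browsing or searching a collection. The sketch below assumes that a collection's page lives at /details/ followed by its identifier and that the advancedsearch.php endpoint accepts the q, fl[], rows, and output parameters; neither detail is stated on this page, and the function names are hypothetical.

```python
from urllib.parse import urlencode

# The six top-level collections listed above.
TOP_LEVEL_COLLECTIONS = {
    "web": "Web Crawls",
    "texts": "eBooks and Texts",
    "movies": "Moving Image Archive",
    "audio": "Audio Archive",
    "software": "The Internet Archive Software Collection",
    "image": "Images",
}


def collection_details_url(collection_id: str) -> str:
    """Browsing page of a collection (assumed /details/<identifier> layout)."""
    return f"https://archive.org/details/{collection_id}"


def collection_search_url(collection_id: str, rows: int = 50) -> str:
    """Search URL returning, as JSON, identifiers of items in a collection."""
    params = [
        ("q", f"collection:{collection_id}"),
        ("fl[]", "identifier"),  # only return the identifier field
        ("rows", rows),
        ("output", "json"),
    ]
    return "https://archive.org/advancedsearch.php?" + urlencode(params)
```

For instance, collection_search_url("etree", rows=5) gives a URL that should list five items from the Live Music Archive.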

Backing up the Internet Archive

In April 2015, ArchiveTeam founder Jason Scott came up with the idea of a distributed backup of the Internet Archive. In the following months, the necessary tools were developed and volunteers with spare disk space came forward, and tens of terabytes of the Archive's rare and precious digital content have already been cloned in several copies around the world. The project is open to everyone who has at least a few hundred gigabytes of disk space to spare in the medium or long term. For details, see the INTERNETARCHIVE.BAK page.

Let us clarify once again: ArchiveTeam is not the Internet Archive. This "backing up the Internet Archive" project, like all the other website-rescuing ArchiveTeam projects, is not ordered, requested, organized or supported by the Internet Archive, nor are ArchiveTeam members employees of the Internet Archive (with a few exceptions). Besides accepting the content (and, in this case, providing it), the Internet Archive doesn't collaborate with ArchiveTeam.

External links

http://www.archive.org
Bibliotheca Alexandrina mirror
Petabox details
A python interface to archive.org
JSON API for archive.org services and metadata
Developer portal (beta)
Old developer portal
English Wikipedia page on Help:Using the Wayback Machine
English Wikipedia page: Lists of Internet Archive's collections