Internet Archive

From Archiveteam
Jump to: navigation, search
Internet Archive
Internet Archive logo
Internet Archive mainpage in 2010-12-21
Internet Archive mainpage in 2010-12-21
URL http://www.archive.org [IA] [WebCite] [archive.today]
Project status Online!
Archiving status In progress...
Project source IA.BAK
Project tracker ia.bak
IRC channel #internetarchive.bak

The Internet Archive is a non-profit digital library with the stated mission/motto: "universal access to all knowledge". The Internet Archive stores over 400 billion webpages from different dates and times for historical purposes that are available through the Wayback Machine, arguably an archivist's wet dream. The Archive.org website also archives books, music, videos, and software.

Mirrors

There are currently two mirrors of the Internet Archive collection - the official mirror available at archive.org, and a second mirror at Bibliotheca Alexandrina. Both seem to be up and stable.

Some manually-selected collections are also mirrored manually as part of the project INTERNETARCHIVE.BAK. See that page and the section #Backing up the Internet Archive.

Raw Numbers

December 2010:

  • 4 data centers, 1,300 nodes, 11,000 spinning disks
  • Wayback Machine: 2.4 PetaBytes
  • Books/Music/Video Collections: 1.7 PetaBytes
  • Total used storage: 5.8 PetaBytes

August 2014:

  • 4 data centers, 550 nodes, 20,000 spinning disks
  • Wayback Machine: 9.6 PetaBytes
  • Books/Music/Video Collections: 9.8 PetaBytes
  • Unique data: 18.5 PetaBytes
  • Total used storage: 50 PetaBytes

Items added per year

Search made 21:56, 17 January 2016 (EST) (this is just from the (mutable) "addeddate" metadata, so it might change, although it shouldn't)

Year Items added
2001 63
2002 4,212
2003 18,259
2004 61,629
2005 61,004
2006 185,173
2007 334,015
2008 429,681
2009 807,371
2010 813,764
2011 1,113,083
2012 1,651,036
2013 3,164,482
2014 2,424,610
2015 3,113,601

Uploading to archive.org

Upload any content you manage to preserve! Registering takes a minute.

Tools:

  • For quick one-shot webpage archiving, use the Wayback Machine's "Save Page Now" tool.
  • S3 interface (for direct usage with curl, or indirect with the tool of your choice.)
  • Handy script for mass upload with automatic error checking and retry.
  • Torrent upload, useful if you need resume (for huge files or because your bandwidth is insufficient for upload in one go):
    • Just create the item, make a torrent with your files in it, name it like the item, and upload it to the item.
    • archive.org will connect to you and other peers via a Transmission daemon and keep downloading all the contents till done;
    • For a command line tool you can use e.g. mktorrent or buildtorrent, example: mktorrent -a udp://tracker.publicbt.com:80/announce -a udp://tracker.openbittorrent.com:80 -a udp://tracker.ccc.de:80 -a udp://tracker.istole.it:80 -a http://tracker.publicbt.com:80/announce -a http://tracker.openbittorrent.com/announce "DIRECTORYTOUPLOAD" ;
    • You can then seed the torrent with one of the many graphical clients (e.g. Transmission) or on the command line (Transmission and rtorrent are the most popular; btdownloadcurses reportedly doesn't work with udp trackers.)
    • archive.org will stop the download if the torrent stalls for some time and add a file to your item called "resume.tar.gz", which contains whatever data was downloaded. To resume, delete the empty file called IDENTIFIER_torrent.txt; then, resume the download by re-deriving the item (you can do that from the Item Manager.) Make sure that there are online peers with the data before re-deriving and don't delete the torrent file from the item.

Don't use FTP upload, try to keep your items below 400 GiB size, add plenty of metadata.

Formats: anything, but:

  • Sites should be uploaded in WARC format;
  • Audio, video, books and other prints are supported from a number of formats;
  • For .tar and .zip files archive.org offers an online browser to search and download the specific files one needs, so you probably want to use either unless you have good reasons (e.g. if 7z or bzip2 reduce the size tenfold).

This unofficial documentation page explains various of the special files found in every item.

Downloading from archive.org

https://web.archive.org/web/20130806040521id_/http://faq.web.archive.org/page-without-wayback-code/

Browsing

There are 6 top-level collections in the Archive, which pretty-much everything else is under. These are:

  • web -- Web Crawls
  • texts -- eBooks and Texts
  • movies -- Moving Image Archive
  • audio -- Audio Archive
  • software -- The Internet Archive Software Collection
  • image -- Images

This is an incomplete list of significant sub-collections within the toplevel ones:

  • movies -- Moving Image Archive
    • opensource_movies -- Community Video
    • television -- Television
      • adviews -- AdViews
      • tv -- TV News Search & Borrow
    • tvarchive -- Television Archive (where the content in the "TV News Search & Borrow is located; not directly accessible)

Internet Archive/Collections is a list of all the collections that contain other collections.

Backing up the Internet Archive

The contents of the Wayback Machine as of 2002 (and again in 2006) have been duplicated in Alexandria, Egypt, available via http://www.bibalex.org/isis/frontend/archive/archive_web.aspx .

In April 2015, ArchiveTeam founder Jason Scott came up with an idea of a distributed backup of the Internet Archive. In the following months, the necessary tools got developed and volunteers with spare disk space appeared, and now tens of terabytes of rare and precious digital content of the Archive have already been cloned in several copies around the world. The project is open to everyone who has got at least a few hundred gigabytes of disk space that they can sacrifice on the medium or long term. For details, see the INTERNETARCHIVE.BAK page.

Let us clarify once again: ArchiveTeam is not the Internet Archive. This "backing up the Internet Archive" project, just like all the other website-rescuing ArchiveTeam projects are not ordered, asked for, organized or supported by the Internet Archive, nor are the ArchiveTeam members the employees of the Internet Archive (except a few ones). Besides accepting – and, in this case, providing – the content, the Internet Archive doesn't collaborate with the ArchiveTeam.

Most of the directly downloadable items at IA are also available as torrents -- at any given time some fraction of these have external seeders, although as of 01:46, 17 February 2016 (EST) there is a problem with IA's trackers where they refuse to track many of the torrents.

Technical notes

The history of tasks run on each item can be viewed (when logged in) by going to a URL of the form http:// archive.org/history/IDENTIFIER (where IDENTIFIER is the id of the item, e.g. the part after "/details/" in a typical IA url).

Some of the task commands include:

archive.php 
Initial uploading, adding of reviews, and other purposes (example)
bup.php 
Backing UP items from their primary to their secondary storage location after they are modified (always appears last in any group of tasks) (example)
derive.php 
Handles generating the derived data formats (e.g. converting audio files into mp3s, OCRing scanned texts, generating CDX indexes for WARCs) (example)
book_op.php 
 ? Includes virus scan, which usually takes a while. (example)
fixer.php 
 ? (example)
create.php 
 ? (example)
checkin.php 
 ? (example)
delete.php 
Used early on (i.e. ~2007) to delete a few items -- not used (except on some test files) since, apparently. (example)
make_dark.php 
Removes an item from public view; used for spam, malware, copyright issues, etc. (example)
modify_xml.php 
 ? (example)
make_undark.php 
Reverses the effect of make_dark.php (example)

See also

External links


[view]  [edit]                   Archive Team                  
Current events Alive... OR ARE THEY · Deathwatch · Projects
Archiveteam.jpg
Archiving projects APKMirror · Archive.is · BetaArchive · Gmane · Internet Archive · It Died · Megalodon.jp · OldApps.com · OldVersion.com · OSBetaArchive · TEXTFILES.COM · The Dead, the Dying & The Damned · The Mail Archive · UK Web Archive · WebCite · Vaporwave.me
Blogging Blog.pl · Blogger · Blogster · Blogter.hu · Freeblog.hu · Fuelmyblog · Jux · LiveJournal · My Opera · Nolblog.hu · Open Diary · ownlog.com · Posterous · Powerblogs · Proust · Roon · Splinder · Tumblr · Vox · Weblog.nl · Windows Live Spaces · Wordpress.com · Xanga · Yahoo! Blog · Zapd
Cloud hosting/file sharing aDrive · AnyHub · Box · Dropbox · Docstoc · Google Drive · Google Groups Files · iCloud · Fileplanet · LayerVault · MediaCrush · MediaFire · Mega · MegaUpload · MobileMe · OneDrive · Pomf.se · RapidShare · Ubuntu One · Yahoo! Briefcase
Corporations Apple · IBM · Google · Lycos Europe · Microsoft · Yahoo!
Events Arab Spring · Occupy movement · Spanish Revolution
Font Repos Google Web Fonts · GNU FreeFont · Fontspace
Forums/Message boards 4chan · Captain Luffy Forums · College Confidential · DSLReports · ESPN Forums · forums.starwars.com · HeavenGames · Invisionfree · The Classic Horror Film Board · Yahoo! Messages · Yahoo! Neighbors · Yuku.com
Gaming Atomicgamer · City of Heroes · Club Nintendo · CS:GO Lounge · Desura · Dota 2 Lounge · Emulation Zone · GameMaker Sandbox · GameTrailers · Halo · HLTV.org · Infinite Crisis · Minecraft.net · Player.me · Playfire · Steam · SteamDB · Warhammer · Xfire
Image hosting 500px · AOL Pictures · Blipfoto · Blingee · Canv.as · Camera+ · Cameroid · DailyBooth · Degree Confluence Project · deviantART · Demotivalo.net · Flickr · Fotoalbum.hu · Fotolog.com · Fotopedia · Frontback · Geograph Britain and Ireland · GTF Képhost · ImageShack · Imgur · Inkblazers · Instagr.am · Kepfeltoltes.hu · Kephost.com · Kephost.hu · Kepkezelo.com · Keptarad.hu · Madden GIFERATOR · MLKSHK · Microsoft Clip Art · Nokia Memories · noob.hu · Odysee · Panoramio · Photobucket · Picasa · Picplz · PSharing · Ptch · puu.sh · Rawporter · Relay.im · ScreenshotsDatabase.com · Snapjoy · Streetfiles · Tabblo · Trovebox · TwitPic · Wallbase · Wallhaven · Webshots · Wikimedia Commons
Knowledge/Wikis arXiv · Citizendium · Clipboard.com · Deletionpedia · EditThis · Encyclopedia Dramatica · Etherpad · Everything2 · infoAnarchy · GeoNames · GNUPedia · Google Books (Google Books Ngram) · Horror Movie Database · Insurgency Wiki · Knol · Library Genesis · Lost Media Wiki · Neoseeker.com · Notepad.cc · Nupedia · OpenCourseWare · OpenStreetMap · Orain · Pastebin · Patch.com · Project Gutenberg · Puella Magi · Referata · Resedagboken · SongMeanings · ShoutWiki · The Internet Movie Database · TropicalWikis · Uncyclopedia · Urban Dictionary · Webmonkey · Wikia · Wikidot · WikiHow · Wikkii · WikiLeaks · Wikipedia (Simple English Wikipedia) · Wikispaces · Wikispot · Wik.is · Wiki-Site · WikiTravel · Word Count Journal
Magazines/Blogs/News Cyberpunkreview.com · Game Developer Magazine · Gigaom · Helium · JPG Magazine · Polygamia.pl · San Fransisco Bay Guardian · Scoop · Regretsy · Yahoo! Voices
Microblogging Heello · Identi.ca · Jaiku · Mommo.hu · Plurk · Sina Weibo · Twitter · TwitLonger
Music/Audio AOL Music · Audimated.com · Cinch · digCCmixter · Dogmazic.net · Earbits · exfm · Free Music Archive · Gogoyoko · Indaba Music · Instacast · Jamendo · Last.fm · Music Unlimited · MOG · PureVolume · Reverbnation · ShareTheMusic · SoundCloud · Soundpedia · This Is My Jam · TuneWiki · Twaud.io · WinAmp
People Aaron Swartz · Michael S. Hart · Steve Jobs · Mark Pilgrim · Dennis Ritchie · Len Sassaman Project
Protocols/Infrastructure FTP · Gopher · IRC · Usenet · World Wide Web
Q&A Askville · Answerbag · Answers.com · Ask.com · Askalo · Baidu Knows · Blurtit · ChaCha · Experts Exchange · Formspring · GirlsAskGuys · Google Answers · Google Baraza · JustAnswer · MetaFilter · Quora · Retrospring · StackExchange · The AnswerBank · The Internet Oracle · Uclue · WikiAnswers · Yahoo! Answers
Recipes/Food Allrecipes · Epicurious · Food.com · Foodily · Food Network · Punchfork · ZipList
Social bookmarking Addinto · Backflip · Balatarin · BibSonomy · Bkmrx · Blinklist · BlogMarks · BookmarkSync · CiteULike · Connotea · Delicious · Designer News · Digg · Diigo · Dir.eccion.es · Evernote · Excite Bookmark · Faves · Favilous · folkd · Freelish · Getboo · GiveALink.org · Gnolia · Google Bookmarks · Hacker News · HeyStaks · IndianPad · Kippt · Knowledge Plaza · Licorize · Linkwad · Menéame · Microsoft Developer Network · myVIP · Mister Wong · My Web · Mylink Vault · Newsvine · Oneview · Pearltrees · Pinboard · Pocket · Propeller.com · Reddit · sabros.us · Scloog · Scuttle · Simpy · SiteBar · Slashdot · Squidoo · StumbleUpon · Twine · Vizited · Yummymarks · Xmarks · Yahoo! Buzz · Zootool · Zotero
Social networks Bebo · BlackPlanet · Classmates.com · Cyworld · Dogster · Dopplr · douban · Ello · Facebook · Flixster · FriendFeed · Friendster · Friends Reunited · Gaia Online · Google+ · Habbo · hi5 · Hyves · iWiW · LinkedIn · Miiverse · mixi · MyHeritage · MyLife · Myspace · myVIP · Netlog · Odnoklassniki · Orkut · Plaxo · Qzone · Renren · Skyrock · Sonico.com · Storylane · Tagged · tvtag · Upcoming · Viadeo · Vkontakte · WeeWorld · Weibo · Wretch · Yahoo! Groups · Yahoo! Stars India · Yahoo! Upcoming · more sites...
Shopping/Retail Alibaba · AliExpress · Amazon · Apple Store · eBay · Printfection · RadioShack · Sears · Target · The Book Depository · ThinkGeek · Walmart
Software/code hosting Android Development · Alioth · Assembla · BerliOS · Betavine · Bitbucket · BountySource · Codecademy · CodePlex · Freepository · Free Software Foundation · GNU Savannah · GitHost · GitHub · GitHub Downloads · Gitorious · Gna! · Google Code · ibiblio · java.net · JavaForge · KnowledgeForge · Launchpad · LuaForge · Maemo · mozdev · OSOR.eu · OW2 Consortium · Openmoko · OpenSolaris · Ourproject.org · Ovi Store · Project Kenai · RubyForge · SEUL.org · SourceForge · Stypi · TestFlight · tigris.org · Transifex · TuxFamily · Yahoo! Downloads
Torrenting/Piracy ExtraTorrent · EZTV · isoHunt · KickassTorrents · The Pirate Bay · Torrentz
Video hosting Academic Earth · Blip.tv · Epic · Google Video · Justin.tv · Niconico · Nokia Trailers · Qwiki · Skillfeed · Stickam · TED Talks · Ticker.tv · Twitch.tv · Ustream · Videoplayer.hu · Viddler · Viddy · Vimeo · Vstreamers · Yahoo! Video · YouTube · Famous Internet videos (Me at the zoo)
Web hosting Angelfire · Brace.io · BT Internet · CableAmerica Personal Web Space · Claranet Netherlands Personal Web Pages · Comcast Personal Web Pages · Extra.hu · FortuneCity · Free ProHosting · GeoCities (patch) · Google Business Sitebuilder · Google Sites · Internet Centrum · MBinternet · MSN TV · Nwnyet · Parodius Networking · Prodigy.net · Saunalahti Iso G · Swipnet · Telenor · Tripod · University of Michigan personal webpages · Verizon Mysite · Verizon Personal Web Space · Webzdarma · Virgin Media
Web applications Mailman · MediaWiki · phpBB · Simple Machines Forum · vBulletin
Other 800notes · AOL · Akoha · Ancestry.com · April Fools' Day · Amplicate · AutoAdmit · Bre.ad · Circavie · Cobook · Co.mments · Countdown · Distill · Dmoz · Easel · Eircode · Electronic Frontier Foundation · FanFiction.Net · Feedly · Ficlets · Forrst · FunnyExam.com · FurAffinity · Google Helpouts · Google Moderator · Google Reader · ICQmail · IFTTT · Jajah · JuniorNet · Lulu Poetry · Mobile Phone Applications · Mochi Media · Mozilla Firefox · MyBlogLog · NBII · Neopets · Quantcast · Quizilla · Salon Table Talk · Shutdownify · Slidecast · SOPA blackout pages · starwars.yahoo.com · TechNet · Toshiba Support · Volán · Widgetbox · Windows Technical Preview · Wunderlist · Zoocasa
Information A Million Ways to Die on the Web · Backup Tips · Cheap storage · Collecting items randomly · Data compression algorithms and tools · Dev · Discovery Data · DOS Floppies · Fortress of Solitude · Keywords · Naughty List · Nightmare Projects · Rescuing floppy disks · Rescuing optical media · Site exploration · The WARC Ecosystem · Working with ARCHIVE.ORG
Projects ArchiveCorps · Audit2014 · Emularity · Faceoff · FlickrFckr · Froogle · INTERNETARCHIVE.BAK (Internet Archive Census) · IRC Quotes · JSMESS · JSVLC · Just Solve the Problem · NewsGrabber · Project Newsletter · Valhalla · Web Roasting (ISP Hosting · University Web Hosting) · Woohoo
Tools ArchiveBot · ArchiveTeam Warrior (Tracker) · Google Takeout · HTTrack · Video downloaders · Wget (Lua · WARC)
Teams Bibliotheca Anonoma · LibreTeam · URLTeam · Yahoo Video Warroom · WikiTeam
About Archive Team Introduction · Philosophy · Who We Are · Our stance on robots.txt · Why Back Up? · Software · Formats · Storage Media · Recommended Reading · Films and documentaries about archiving · Talks · In The Media · FAQ