Internet Archive

From Archiveteam
URL: http://www.archive.org
Project status: Online!
Archiving status: Saved by itself
Project source: Unknown
Project tracker: Unknown
IRC channel: #archiveteam

The Internet Archive is a non-profit digital library with the stated mission/motto: "universal access to all knowledge". It stores over 400 billion webpages, captured at different dates and times for historical purposes, and makes them available through the Wayback Machine, arguably an archivist's wet dream. The Archive.org website also archives books, music, videos, and software.

Mirrors

There are currently two mirrors of the Internet Archive collection: the official mirror available at archive.org, and a second mirror at Bibliotheca Alexandrina. Both seem to be up and stable.

Raw Numbers

December 2010:

  • 4 data centers, 1,300 nodes, 11,000 spinning disks
  • Wayback Machine: 2.4 PetaBytes
  • Books/Music/Video Collections: 1.7 PetaBytes
  • Total used storage: 5.8 PetaBytes

August 2014:

  • 4 data centers, 550 nodes, 20,000 spinning disks
  • Wayback Machine: 9.6 PetaBytes
  • Books/Music/Video Collections: 9.8 PetaBytes
  • Unique data: 18.5 PetaBytes
  • Total used storage: 50 PetaBytes

Uploading to archive.org

Upload any content you manage to preserve! Registering takes a minute.

Tools:

  • For quick one-shot webpage archiving, use the Wayback Machine's "Save Page Now" tool.
  • S3 interface, for direct use with curl or indirect use through the tool of your choice; the internetarchive Python tool is one such tool.
  • Handy script for mass upload with automatic error checking and retry.
  • Torrent upload, useful if you need to resume (for huge files, or because your bandwidth is insufficient to upload in one go):
    • Just create the item, make a torrent with your files in it, name it like the item, and upload it to the item.
    • archive.org will connect to you and other peers via a Transmission daemon and keep downloading the contents until done.
    • To create the torrent from the command line you can use e.g. mktorrent or buildtorrent. Example: mktorrent -a udp://tracker.publicbt.com:80/announce -a udp://tracker.openbittorrent.com:80 -a udp://tracker.ccc.de:80 -a udp://tracker.istole.it:80 -a http://tracker.publicbt.com:80/announce -a http://tracker.openbittorrent.com/announce "DIRECTORYTOUPLOAD"
    • You can then seed the torrent with one of the many graphical clients (e.g. Transmission) or on the command line (Transmission and rtorrent are the most popular; btdownloadcurses reportedly doesn't work with udp trackers).
    • archive.org will stop the download if the torrent stalls for some time and add a file called "resume.tar.gz" to your item, containing whatever data was downloaded. To resume, delete the empty file called IDENTIFIER_torrent.txt, then re-derive the item (you can do that from the Item Manager). Before re-deriving, make sure that there are online peers with the data, and don't delete the torrent file from the item.
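
The S3 route listed above can be sketched in Python using only the standard library. This is a minimal sketch, not the official client: the endpoint s3.us.archive.org, the "LOW" authorization scheme, and the x-amz-auto-make-bucket and x-archive-meta-* headers are assumptions recalled from archive.org's S3-like API documentation and should be checked there before use. The function name build_upload_request and all example values are hypothetical. The request is only built here, not sent.

```python
import urllib.request
from urllib.parse import quote

# Assumption: archive.org's S3-like endpoint; verify against the official docs.
S3_ENDPOINT = "https://s3.us.archive.org"


def build_upload_request(identifier, filename, data, access_key, secret_key, metadata):
    """Build (but do not send) a PUT request that uploads one file to an item.

    metadata is a dict of plain ASCII strings, e.g. {"title": "..."}.
    """
    url = f"{S3_ENDPOINT}/{quote(identifier)}/{quote(filename)}"
    headers = {
        # Assumption: "LOW accesskey:secret" is the expected authorization form.
        "Authorization": f"LOW {access_key}:{secret_key}",
        # Assumption: asks the server to create the item if it does not exist.
        "x-amz-auto-make-bucket": "1",
    }
    # Assumption: each metadata field travels as an x-archive-meta-<name> header.
    for key, value in metadata.items():
        headers[f"x-archive-meta-{key}"] = value
    return urllib.request.Request(url, data=data, headers=headers, method="PUT")
```

Passing the returned object to urllib.request.urlopen would perform the upload; for large files or many files, the internetarchive Python tool mentioned above is likely the more practical choice.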

Don't use FTP upload, try to keep each item below 400 GiB in size, and add plenty of metadata.
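
The size guideline can be checked before uploading. The following is a minimal sketch that sums the file sizes under a directory and compares the total with 400 GiB; the names directory_size and fits_in_one_item are hypothetical, and the check only approximates what archive.org will count for the item.

```python
import os

MAX_ITEM_BYTES = 400 * 1024**3  # 400 GiB, the guideline stated above


def directory_size(path):
    """Return the total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):  # skip symlinks to avoid double counting
                total += os.path.getsize(full)
    return total


def fits_in_one_item(path, limit=MAX_ITEM_BYTES):
    """Return True if the directory stays below the size limit."""
    return directory_size(path) < limit
```

If the check fails, splitting the content across several items is one way to stay under the guideline.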

Formats: anything, but:

  • Sites should be uploaded in WARC format;
  • Audio, video, books and other prints are supported in a number of formats;
  • For .tar and .zip files archive.org offers an online browser to search and download the specific files one needs, so you probably want to use one of the two unless you have good reasons not to (e.g. if 7z or bzip2 reduce the size tenfold).
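
As an illustration of the last point, a directory can be packed into a plain, uncompressed .tar with Python's standard tarfile module. This is a minimal sketch; the function name pack_directory and the paths are hypothetical, and the output file should be written outside the directory being packed.

```python
import os
import tarfile


def pack_directory(src_dir, tar_path):
    """Pack src_dir into an uncompressed tar archive at tar_path."""
    # Store entries under the directory's own name instead of its full path.
    base = os.path.basename(os.path.normpath(src_dir))
    with tarfile.open(tar_path, "w") as tar:  # mode "w" means no compression
        tar.add(src_dir, arcname=base)
    return tar_path
```

Switching the mode to "w:bz2" would produce a bzip2-compressed archive instead, which, per the note above, is only worth doing when the size savings are large.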
