Restoring

From Archiveteam

So, you have a website that's gone for whatever reason. There's a copy in the Wayback Machine, and now you want to get it all back, preferably without clicking through every single page.

If you're lucky, the grab was done by Archive Team and the WARC file is available in the Archive Team collection, where you can download the whole thing and then extract the contents with one of the WARC tools, such as warctozip.

If it wasn't an Archive Team grab and the pages were just scooped up as part of normal Wayback Machine operation, things are a bit more difficult. The WARC files for these crawls cannot be downloaded directly from archive.org, and the data you want would be split across many grabs anyway, since the crawler tends to grab part of a site on one occasion, a different part later, and so on.

The Wayback Machine doesn't intentionally try to prevent you from downloading multiple pages, but the usual method of recursively downloading a directory with a tool like wget and its -np (no-parent) option does not work. The Wayback Machine date-codes URLs based on the time of crawl, so pages that appear to be in the same directory are not. For example:

http://web.archive.org/web/20140208214426/http://archiveteam.org/index.php?title=Main_Page

links to

http://web.archive.org/web/20140215063724/http://archiveteam.org/index.php?title=Who_We_Are

which has a different date code, so as far as wget can tell they are in different directories and the crawl stops.
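To make the problem concrete, here is a minimal Python sketch that splits a Wayback Machine URL into its date code and the original URL. The function names and the regular expression are illustrative assumptions based only on the URL shape shown above, not an official API.

```python
import re
from urllib.parse import urlsplit

# Assumed URL shape: http://web.archive.org/web/<datecode>[<flag>_]/<original URL>
WAYBACK_RE = re.compile(r"^https?://web\.archive\.org/web/(\d{1,14})([a-z]{2}_)?/(.+)$")


def split_wayback_url(url):
    """Return (datecode, original_url) for a Wayback Machine page URL."""
    match = WAYBACK_RE.match(url)
    if match is None:
        raise ValueError("not a Wayback Machine page URL: " + url)
    return match.group(1), match.group(3)


def original_directory(url):
    """Return host plus directory of the original (archived) URL."""
    _, original = split_wayback_url(url)
    parts = urlsplit(original)
    directory = parts.path.rsplit("/", 1)[0] + "/"
    return parts.netloc + directory


main_page = "http://web.archive.org/web/20140208214426/http://archiveteam.org/index.php?title=Main_Page"
who_we_are = "http://web.archive.org/web/20140215063724/http://archiveteam.org/index.php?title=Who_We_Are"

# The original pages share a directory, but the date codes differ,
# so the Wayback paths wget compares are not the same "directory".
same_original_dir = original_directory(main_page) == original_directory(who_we_are)
same_datecode = split_wayback_url(main_page)[0] == split_wayback_url(who_we_are)[0]
```

A restore tool therefore has to compare the original URLs, not the Wayback paths, when deciding whether a link stays inside the site.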

Tools to use

  • Wayback Machine Downloader, a small Ruby tool that downloads any website from the Wayback Machine. Free and open-source.
  • Warrick - Main site was at [1] but seems to be down. Downloads are available at [2]. Your mileage may vary: it's quite slow to run, and the feature of grabbing from Google/Yahoo/Bing caches doesn't seem to work. A Linux/Cygwin/other *nix environment is also required. Even so, it is currently probably the best option.
  • Wayback downloader, a service that will download your site from the Wayback Machine and even add a plugin for WordPress. 1 site is $15, 2 to 4 sites $12.50, and 5 or more $7.50. A cheap way to get data back without setting up your own environment. Effectiveness unknown.
  • Wayback Machine Downloader Service, another service that recovers websites from archive.org. It has a free demo and offers unlimited downloads for $79. Not related to the aforementioned Ruby tool with the same name.
  • Waybackr, a new service that downloads, packs, and emails you a copy of any website stored in the Wayback Machine. This service was free, but it seems to have stopped working.

Tricks

Unmodified pages

This is undocumented, but if you retrieve a page with id_ after the datecode, you will get the unmodified original document without all the Wayback scripts, header stuff, and link rewriting. This is useful when restoring one page at a time or when writing a tool to retrieve a site:

http://web.archive.org/web/20051001001126id_/http://www.archive.org/
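As a rough sketch of how a retrieval tool might use this, the following Python builds the id_ form of a URL from a date code and an original URL. The helper names raw_wayback_url and fetch_raw are hypothetical, and since the id_ behaviour is undocumented it may change without notice.

```python
from urllib.request import urlopen

# Assumed prefix, taken from the example URL above.
WAYBACK_PREFIX = "http://web.archive.org/web/"


def raw_wayback_url(datecode, original_url):
    """Build the URL of the unmodified capture by appending id_ to the date code."""
    return WAYBACK_PREFIX + datecode + "id_/" + original_url


def fetch_raw(datecode, original_url, timeout=30):
    """Download the unmodified document as bytes (performs a network request)."""
    with urlopen(raw_wayback_url(datecode, original_url), timeout=timeout) as response:
        return response.read()


example = raw_wayback_url("20051001001126", "http://www.archive.org/")
```

Calling fetch_raw("20051001001126", "http://www.archive.org/") would then return the page bytes, assuming the capture is still reachable.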

Wildcard search

You can do a wildcard search for all URLs Wayback has retrieved for a given domain like so:

http://web.archive.org/web/*/http://archiveteam.org/*

Or for a subdirectory:

http://web.archive.org/web/*/http://archiveteam.org/images/*

The "filter results" textbox in the upper right allows you to type e.g. ".jpg" to show only matching files.

This data is also available in a machine-readable format:

http://web.archive.org/cdx/search/cdx?url=archiveteam.org/images/*
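A minimal Python sketch of working with this output follows. It assumes the plain-text response has one capture per line with seven space-separated fields in the order urlkey, timestamp, original, mimetype, statuscode, digest, length; check this against a real response before relying on it. The function names and the sample line are made up for illustration.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"

# Assumed default field order of the plain-text CDX output.
CDX_FIELDS = ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"]


def cdx_query_url(url_pattern):
    """Build a CDX search URL for a pattern such as archiveteam.org/images/*."""
    return CDX_ENDPOINT + "?" + urlencode({"url": url_pattern}, safe="/*")


def parse_cdx_line(line):
    """Turn one line of CDX output into a dict keyed by field name."""
    values = line.strip().split(" ")
    if len(values) != len(CDX_FIELDS):
        raise ValueError("unexpected CDX line: " + line)
    return dict(zip(CDX_FIELDS, values))


def capture_url(record):
    """Build the id_ (unmodified) Wayback URL for a parsed CDX record."""
    return "http://web.archive.org/web/" + record["timestamp"] + "id_/" + record["original"]


query = cdx_query_url("archiveteam.org/images/*")

# Made-up sample line, for illustration only; not a real capture.
sample = "org,archiveteam)/images/logo.png 20140208214426 http://archiveteam.org/images/logo.png image/png 200 ABCDEFGHIJKLMNOPQRSTUVWXYZ234567 1234"
record = parse_cdx_line(sample)
download = capture_url(record)
```

Looping over every line of a real response and fetching each capture_url is one way to retrieve a site without following date-coded links.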

