Site exploration


This page contains some tips and tricks for exploring soon-to-be-dead websites, to find URLs to feed into the Archive Team crawlers.

Open Directory Project data

The Open Directory Project offers machine-readable downloads of its data. You want the "content.rdf.u8.gz" from there.

wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
gunzip content.rdf.u8.gz

Quick-and-dirty shell parsing for the not-too-fussy:

grep '<link r:resource=.*dyingsite\.com' content.rdf.u8 | sed 's/.*<link r\:resource="\([^"]*\).*".*/\1/' | sort | uniq > odp-sitelist.txt
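
For anything fussier than the one-liner above, the same extraction can be sketched in Python. This is only an illustration: the function name extract_odp_links is made up here, and it assumes each link appears as <link r:resource="..."> on a single line, which is what the grep/sed version relies on too. Unlike the grep, it checks the host part of the URL, so links that merely mention the domain in their path or query string are dropped.

```python
import re

# Matches the URL inside <link r:resource="..."> (the attribute the sed call keys on).
LINK_RE = re.compile(r'<link r:resource="([^"]*)"')
# Pulls the host out of scheme://host/... (hypothetical helper pattern).
HOST_RE = re.compile(r'^[a-z][a-z0-9+.-]*://([^/:?#]+)', re.IGNORECASE)


def extract_odp_links(lines, domain):
    """Return sorted, de-duplicated link URLs hosted on `domain` or a subdomain."""
    wanted = domain.lower()
    found = set()
    for line in lines:
        for url in LINK_RE.findall(line):
            match = HOST_RE.match(url)
            if not match:
                continue
            host = match.group(1).lower()
            # Accept dyingsite.com and foo.dyingsite.com, but not notdyingsite.com.
            if host == wanted or host.endswith("." + wanted):
                found.add(url)
    return sorted(found)
```

Passing it a file opened with gzip.open("content.rdf.u8.gz", "rt", encoding="utf-8", errors="replace") lets it read the compressed dump line by line, without the gunzip step.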

MediaWiki wikis

MediaWiki wikis, especially the very large ones operated by the Wikimedia Foundation, often link to a large number of important sites hosted with a given service.

mwlinkscrape.py is a tool by an Archive Team patriot which extracts a machine-readable list from a number of wikis (it actually uses the text of this page to get a list of wikis to scrape).

./mwlinkscrape.py "*.dyingsite.com" > mw-sitelist.txt

Search Engines

There are tools such as GoogleScraper that scrape various search engines using a web browser instead of an API.

Some specific tips are listed below.

Google

Google doesn't let automated tools scrape its search results. You have to do it manually, but some tools and tips still make for quick and thorough discovery.

To find results under a domain, let your search term be site:dyingsite.com.

If you want more than 10 results per page, add the num parameter to the URL like this:

https://www.google.com/search?q=site:dyingsite.com&num=100

To go to the next page of results, don't use the "Next" link at the bottom; that would give you only the next ten results. Instead, use the start parameter in the URL:

https://www.google.com/search?q=site:dyingsite.com&num=100&start=100

You can go up to start=900, which gives you results 901–1000. Google doesn't let you browse more than the first 1000 results. However, there is some good news:

  • The estimated number of results shown is usually ten or a hundred times higher than the number of results you'll actually be presented with. So don't panic.
  • Should the number of results indeed be more than 1000, the easiest workaround is to click "Search tools", then "Any time", and select "Custom range". By setting a specific range, you reduce the number of results per search; going, say, year by year, you'll be presented with all the results. (Hopefully.)
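
Since the searching itself has to be done by hand, it can help to generate the list of result-page URLs up front and open them one by one. A minimal sketch, using only the num and start parameters described above; the function name google_page_urls is hypothetical:

```python
from urllib.parse import urlencode


def google_page_urls(domain, per_page=100, max_results=1000):
    """Build the result-page URLs for a site: search, one URL per page."""
    urls = []
    # start runs 0, 100, ..., 900 with the defaults (the first 1000 results).
    for start in range(0, max_results, per_page):
        params = {"q": "site:" + domain, "num": per_page}
        if start:
            params["start"] = start
        # safe=":" keeps the colon in "site:" readable instead of %3A.
        urls.append("https://www.google.com/search?" + urlencode(params, safe=":"))
    return urls


if __name__ == "__main__":
    for url in google_page_urls("dyingsite.com"):
        print(url)
```

This only prints URLs to paste into the browser; it does not fetch anything from Google.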

There are probably several tools around for exporting the results efficiently. One of them is SEOquake, a Firefox extension (exporting search results is just one of its features). After you install the extension and restart the browser, Google search results pages get buttons to export (save or append) the results in CSV format. It is recommended to disable, in the SEOquake options, all the analysis data that appears in the search results and on the toolbar: it slows things down and takes up a lot of space in the CSV.

After some repetitive but easy work, you'll have the URL list in your CSV(s). With the SEOquake analysis features turned off, the CSV contains just the URLs enclosed in quotation marks. Remove the quotation marks in a text editor or, for Linux terminal geeks, cut -d'"' -f 2 is the way.
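
The same clean-up can be sketched in Python. The exact layout of the SEOquake export is an assumption here (one quoted URL at the start of each line); the function name urls_from_quoted_lines is made up for the example. It takes the text between the first pair of quotation marks, as cut -d'"' -f 2 does, but skips lines that have no quoted field.

```python
def urls_from_quoted_lines(lines):
    """Return the first double-quoted field of each line, skipping lines without one."""
    urls = []
    for line in lines:
        parts = line.split('"')
        # parts[1] is the text between the first and second quotation mark.
        if len(parts) >= 3 and parts[1]:
            urls.append(parts[1])
    return urls


if __name__ == "__main__":
    import sys
    # Usage: python unquote.py < export.csv > google-sitelist.txt
    for url in urls_from_quoted_lines(sys.stdin):
        print(url)
```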

Should Google stand in your way with a CAPTCHA, solve it and you can proceed. (Cookies must be enabled for this to work.)

Bing API

Microsoft, bless their Redmondish hearts, has an API for fetching Bing search engine results, with a free tier of 5000 queries per month (enough for about 250 sets of 1000 results, that is, 20 queries per set). However, it returns only the first 1000 results for any query, so you can't just search "site:dyingsite.com" and get everything on a site. You'll need to get a bit creative with the search terms.

Grab this Python script (look for "BING_API_KEY" and replace it with your "Primary Account Key"), and then:

python bingscrape.py "site:dyingsite.com" >> bing-sitelist.txt
python bingscrape.py "about me site:dyingsite.com" >> bing-sitelist.txt
python bingscrape.py "gallery site:dyingsite.com" >> bing-sitelist.txt
python bingscrape.py "in memoriam site:dyingsite.com" >> bing-sitelist.txt
python bingscrape.py "diary site:dyingsite.com" >> bing-sitelist.txt
python bingscrape.py "bob site:dyingsite.com" >> bing-sitelist.txt

And so on.
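
Overlapping queries return many of the same URLs, and >> appends every one of them, so the list needs de-duplicating afterwards. A small sketch of both halves, building the query strings from a keyword list and merging the output; both function names are hypothetical, and it assumes bingscrape.py prints one URL per line:

```python
def build_queries(domain, keywords):
    """Return the plain site: query followed by one query per keyword."""
    queries = ["site:" + domain]
    for keyword in keywords:
        keyword = keyword.strip()
        if keyword:
            queries.append(keyword + " site:" + domain)
    return queries


def merge_unique(lines):
    """Drop blank lines and repeats while keeping first-seen order."""
    seen = set()
    unique = []
    for line in lines:
        url = line.strip()
        if url and url not in seen:
            seen.add(url)
            unique.append(url)
    return unique
```

Each string from build_queries can be passed as the argument to bingscrape.py, and merge_unique can then be run over the lines of bing-sitelist.txt.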

Common Crawl Index

The Common Crawl index is a very big (21 gigabytes compressed) list of URLs in the Common Crawl corpus. Grepping this list may well reveal plenty of URLs to archive. The list is in an odd format, along the lines of com.deadsite.www/subdirectory/subsubdirectory:http, so you'll need to do some filtering of the results. The results can sometimes be ambiguous.

grep '^com\.dyingsite[/\.]' zfqwbPRW.txt > commoncrawl-sitelist.txt

Our Ivan wrote a Python script (Mirror) which will take your list of URLs on standard input and print out a list of normally-formed URLs on standard output.
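
To show what that conversion involves, here is a rough sketch. It is not Ivan's script, and it assumes every entry looks like the example above: the host name with its labels reversed, then the path, then a trailing :http or :https. The function name index_key_to_url is made up for the illustration.

```python
KNOWN_SCHEMES = ("http", "https")


def index_key_to_url(key):
    """Turn 'com.deadsite.www/dir/page:http' into 'http://www.deadsite.com/dir/page'."""
    key = key.strip()
    rest, sep, scheme = key.rpartition(":")
    if not sep or scheme not in KNOWN_SCHEMES:
        # No recognisable scheme suffix: treat the whole key as host + path
        # and fall back to http (an assumption, one source of ambiguity).
        rest, scheme = key, "http"
    reversed_host, _, path = rest.partition("/")
    # com.deadsite.www -> www.deadsite.com
    host = ".".join(reversed(reversed_host.split(".")))
    return scheme + "://" + host + "/" + path


if __name__ == "__main__":
    import sys
    # Usage: python unreverse.py < commoncrawl-sitelist.txt
    for line in sys.stdin:
        if line.strip():
            print(index_key_to_url(line))
```

Entries that do not follow this layout come out wrong, so spot-check the output before feeding it to a crawler.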

You can also use the Common Crawl URL search and get the results as a JSON file. Quick-and-dirty grep/sed parsing:

grep -F '"url":' locations.json | sed 's/.*url": "\([^"]*\).*/\1/' | sort | uniq > commoncrawl-sitelist.txt
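
A slightly less dirty alternative is to let a JSON parser do the work, which also decodes escaped characters that the sed expression would pass through untouched. The structure of locations.json is not documented here, so this sketch makes no assumption beyond "url" keys with string values somewhere in the data, and it falls back to parsing line by line in case the file holds one JSON object per line. The function names are hypothetical.

```python
import json


def _collect(node, out):
    """Recursively gather every string stored under a "url" key."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "url" and isinstance(value, str):
                out.add(value)
            else:
                _collect(value, out)
    elif isinstance(node, list):
        for item in node:
            _collect(item, out)


def urls_from_json(text):
    """Return the sorted, de-duplicated "url" values found in `text`."""
    out = set()
    try:
        _collect(json.loads(text), out)
    except json.JSONDecodeError:
        # Not a single JSON document: try one document per line instead.
        for line in text.splitlines():
            line = line.strip()
            if not line:
                continue
            try:
                _collect(json.loads(line), out)
            except json.JSONDecodeError:
                continue
    return sorted(out)
```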

Twitter

  • Twitter's search API doesn't offer historical results. However, their web search now uses a complete index[1], which includes searching expanded URLs.
  • A tool like Litterapi will scrape their web search and build a fake API.
  • t by sferik is a command-line interface for Twitter that uses the API via an application you create on your account. Not only does it allow easy CSV/JSON export of your own data, it also lets you scrape others' tweets. API limits apply, but this tool is very powerful.
  • Topsy offers a competing search service with an API of all Tweets. However, it is not free (but perhaps you can borrow their API key) and does not search expanded URLs.

References

  1. https://blog.twitter.com/2014/building-a-complete-tweet-index
