Site exploration
This page contains some tips and tricks for exploring soon-to-be-dead websites, to find URLs to feed into the Archive Team crawlers.
Open Directory Project data
The Open Directory Project offers machine-readable downloads of its data; you want the content.rdf.u8.gz file from its RDF dump site (rdf.dmoz.org).
wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
gunzip content.rdf.u8.gz
Quick-and-dirty shell parsing for the not-too-fussy:
grep '<link r:resource=.*dyingsite\.com' content.rdf.u8 | sed 's/.*<link r\:resource="\([^"]*\).*".*/\1/' | sort | uniq > odp-sitelist.txt
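If you would rather not trust a regex on a multi-gigabyte RDF file, the same extraction can be done with a streaming XML parse. Here is a minimal Python sketch, assuming the dump parses cleanly as XML (historically it has not always) and that external links live in <link r:resource="..."> attributes, as the grep pattern above implies:

import xml.etree.ElementTree as ET

seen = set()
for event, elem in ET.iterparse('content.rdf.u8'):
    # The links we care about look like <link r:resource="http://..."/>;
    # match on the attribute's local name so we don't have to guess the
    # exact namespace URI the dump declares for the "r:" prefix.
    if elem.tag.endswith('link'):
        for name, value in elem.attrib.items():
            if name.endswith('resource') and 'dyingsite.com' in value:
                seen.add(value)
    elem.clear()  # keep memory from ballooning on a multi-gigabyte file

for url in sorted(seen):
    print(url)

Redirect the output to odp-sitelist.txt just as with the grep version.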
MediaWiki wikis
MediaWiki wikis, especially the very large ones operated by the Wikimedia Foundation, often contain external links to a large number of important sites hosted with a given service.
mwlinkscrape.py is a tool by an Archive Team patriot which extracts a machine-readable list of matching external links from a number of wikis (it actually uses the text of this page to get its list of wikis to scrape).
./mwlinkscrape.py "*.dyingsite.com" > mw-sitelist.txt
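If you want to roll your own against a single wiki, the underlying mechanism is MediaWiki's list=exturlusage API module (the same data behind Special:LinkSearch). The following is a rough Python sketch, not mwlinkscrape.py itself; the api.php URL and the wildcard pattern are placeholders:

import requests

def external_links(api_url, pattern):
    # Page through list=exturlusage, yielding every external URL that
    # matches the pattern (e.g. "*.dyingsite.com").
    params = {
        'action': 'query',
        'list': 'exturlusage',
        'euquery': pattern,
        'euprop': 'url',
        'eulimit': 'max',
        'format': 'json',
        'continue': '',
    }
    while True:
        data = requests.get(api_url, params=params).json()
        for item in data.get('query', {}).get('exturlusage', []):
            yield item['url']
        if 'continue' not in data:
            break
        params.update(data['continue'])  # carry the continuation token forward

for url in external_links('https://en.wikipedia.org/w/api.php', '*.dyingsite.com'):
    print(url)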
Bing API
Microsoft, bless their Redmondish hearts, have an API for fetching Bing search engine results, which has a free tier of 5000 queries per month (this will cover you for about 250 sets of 1000 results). However, it only returns the first 1000 results for any query, so you can't just search "site:dyingsite.com" and get all the things on a site. You'll need to get a bit creative with the search terms.
Grab this Python script (look for "BING_API_KEY" and replace it with your "Primary Account Key"), and then:
python bingscrape.py "site:dyingsite.com" >> bing-sitelist.txt
python bingscrape.py "about me site:dyingsite.com" >> bing-sitelist.txt
python bingscrape.py "gallery site:dyingsite.com" >> bing-sitelist.txt
python bingscrape.py "in memoriam site:dyingsite.com" >> bing-sitelist.txt
python bingscrape.py "diary site:dyingsite.com" >> bing-sitelist.txt
python bingscrape.py "bob site:dyingsite.com" >> bing-sitelist.txt
And so on.
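If typing those out by hand gets old, a few lines of Python will drive bingscrape.py through a whole list of seed terms. A sketch, assuming the script sits in the current directory and prints one URL per line as above; the extra seed words are just examples:

import subprocess

seeds = ['', 'about me', 'gallery', 'in memoriam', 'diary', 'bob', 'blog', 'guestbook']

with open('bing-sitelist.txt', 'a') as out:
    for seed in seeds:
        query = ('%s site:dyingsite.com' % seed).strip()
        # Append each run's results, mirroring the ">>" redirections above.
        subprocess.call(['python', 'bingscrape.py', query], stdout=out)

Deduplicate afterwards with sort | uniq, since popular pages will turn up for several seed terms.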
Common Crawl Index
The Common Crawl index is a very big (21 gigabytes compressed) list of URLs in the Common Crawl corpus. Grepping this list may well reveal plenty of URLs to archive. The list is in an odd format, along the lines of
com.deadsite.www/subdirectory/subsubdirectory:http
so you'll need to do some filtering of the results. The results can sometimes be ambiguous.
grep '^com\.dyingsite[/\.]' zfqwbPRW.txt > commoncrawl-sitelist.txt
Our Ivan wrote a Python script which will take your list of URLs on standard input and print out a list of normally-formed URLs on standard output.
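That script isn't reproduced here, but the conversion is mechanical: chop the protocol off the end, un-reverse the hostname, and glue the pieces back together. A quick sketch based on the example format above (not Ivan's actual code):

import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    # "com.deadsite.www/subdirectory/subsubdirectory:http"
    entry, _, proto = line.rpartition(':')
    if not entry:
        continue  # no protocol suffix; skip malformed lines
    hostpart, slash, path = entry.partition('/')
    host = '.'.join(reversed(hostpart.split('.')))
    print('%s://%s%s%s' % (proto, host, slash, path))

Save it under any name you like and pipe the grep output through it to get ordinary URLs.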
You can also use the Common Crawl URL search and get the results as a JSON file. Quick-and-dirty grep/sed parsing:
grep -F '"url":' locations.json | sed 's/.*url": "\([^"]*\).*/\1/' | sort | uniq > commoncrawl-sitelist.txt