From Archiveteam
Revision as of 16:52, 10 November 2014

Some brainstorming from procrastination:

First grab all the sitemap indexes: curl | grep -Eo 'http.*gz' > sitemap-index-urls

Use that to grab all the sitemaps: wget -i sitemap-index-urls


Extract the urls: zgrep -hEo 'http:.*xml' sitemap-index-*.xml.gz > sitemap-urls

And grab them all: wget --force-directories -i sitemap-urls
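The extraction step can be tightened a little. The following is only a sketch, not what was actually run: `extract_sitemap_urls` is a made-up helper name, and it assumes each sitemap URL sits inside a `<loc>` element and ends in `.xml`. It uses `gzip -dc | grep` in place of `zgrep -h`, and a non-greedy-style pattern so that two URLs on the same line are not merged into one match.

```shell
# Hypothetical helper: print every sitemap URL found in the given
# gzipped sitemap index files, one per line, de-duplicated.
extract_sitemap_urls() {
  # gzip -dc decompresses all files to stdout (like zgrep -h, no filenames).
  # [^<[:space:]]+ stops at the closing </loc> tag or at whitespace, so a
  # greedy 'http:.*xml' cannot swallow several URLs on one line.
  gzip -dc "$@" | grep -Eo 'https?://[^<[:space:]]+\.xml' | sort -u
}
```

Usage would then be `extract_sitemap_urls sitemap-index-*.xml.gz > sitemap-urls`, followed by the same `wget --force-directories -i sitemap-urls` as above.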

TODO: Find a smart way to grab everything from that.

You will want --no-cookies and reject*. Some images are hosted on and will require some smart hackery.

You can also extract the "realms" and username combinations from the sitemap-indexes: zgrep -hEo 'http:.*xml' ori/sitemap-index-*.xml.gz | sed 's#' | sed 's#/sitemap.xml##' | sed 's#/#\t#'

Warning: Some usernames have no "realm" prefix at all, for example the randomly picked jeshare, seacrozzer or hjones669.
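Because of those prefix-less usernames, a plain `sed 's#/#\t#'` leaves some lines with one column and others with two. A rough sketch of a split that always emits two columns is below. `split_realm_user` is a made-up name, and it assumes the input has already had the host prefix and the trailing `/sitemap.xml` stripped, so each line is either `realm/username` or just `username`.

```shell
# Hypothetical helper: read "realm/username" or bare "username" lines on
# stdin and print "realm<TAB>username", with an empty realm column when
# the line has no realm prefix.
split_realm_user() {
  awk -F/ -v OFS='\t' '
    NF >= 2             { print $1, $2; next }  # realm/username
    NF == 1 && $1 != "" { print "", $1 }        # bare username, no realm
  '
}
```

For instance, `printf 'somerealm/alice\njeshare\n' | split_realm_user` gives one line with `somerealm` and `alice`, and one line with an empty first column and `jeshare` (`somerealm` and `alice` are placeholder values).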