Talk:Angelfire

From Archiveteam
Revision as of 16:52, 10 November 2014 by Schbirid (talk | contribs) (brainstorming)

Some brainstorming from procrastination:

First grab all the sitemap indexes: curl http://www.angelfire.com/robots.txt | grep -Eo 'http.*gz' > sitemap-index-urls

http://www.angelfire.com/sitemap-index-00.xml.gz
http://www.angelfire.com/sitemap-index-01.xml.gz
http://www.angelfire.com/sitemap-index-02.xml.gz
...


Use that to download the sitemap index files, which list the per-user sitemaps: wget -i sitemap-index-urls

<sitemap><loc>http://www.angelfire.com/punk4/jori_loves_jackass/sitemap.xml</loc><lastmod>2012-04-10</lastmod></sitemap>
<sitemap><loc>http://www.angelfire.com/vevayaqo/sitemap.xml</loc><lastmod>2012-04-10</lastmod></sitemap>
<sitemap><loc>http://www.angelfire.com/planet/dumbass123/sitemap.xml</loc><lastmod>2012-04-10</lastmod></sitemap>
...


Extract the URLs: zgrep -hEo 'http:.*xml' sitemap-index-*.xml.gz > sitemap-urls

http://www.angelfire.com/punk4/jori_loves_jackass/sitemap.xml
http://www.angelfire.com/vevayaqo/sitemap.xml
http://www.angelfire.com/planet/dumbass123/sitemap.xml
...
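
The greedy pattern 'http:.*xml' only works while every entry sits on its own line; if an index file ever holds several entries on one line, a single match would span all of them. A minimal sketch of a stricter extraction, assuming the URLs are always wrapped in <loc> elements as in the sample above (the function name extract_locs is made up for illustration):

```shell
# Print the contents of every <loc> element read from stdin, one per line.
# [^<]* cannot cross a tag boundary, so several entries on one line are
# still split into separate matches.
extract_locs() {
    grep -Eo '<loc>[^<]*</loc>' | sed -e 's#^<loc>##' -e 's#</loc>$##'
}

# Possible usage with the file names from the steps above:
#   zcat sitemap-index-*.xml.gz | extract_locs > sitemap-urls
```

This should produce the same sitemap-urls list as the zgrep command for line-per-entry input.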


And grab them all: wget --force-directories -i sitemap-urls


TODO: Find a smart way to grab everything from that.
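
One possible starting point for the TODO, as a rough sketch only: collect every page URL from the downloaded per-user sitemaps into a single list. This assumes the per-user files follow the usual sitemap layout with <loc> elements (not verified here) and that wget --force-directories stored them under a www.angelfire.com/ directory; collect_page_urls and page-urls are made-up names.

```shell
# Print every URL found in <loc> elements of the sitemap.xml files below
# directory $1, sorted and de-duplicated.
collect_page_urls() {
    find "$1" -type f -name 'sitemap.xml' -exec cat {} + \
        | grep -Eo '<loc>[^<]*</loc>' \
        | sed -e 's#^<loc>##' -e 's#</loc>$##' \
        | sort -u
}

# Possible usage:
#   collect_page_urls www.angelfire.com > page-urls
```

The resulting list could then be fed to wget -i like the earlier lists. It only covers pages that the sitemaps actually mention.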

You will want --no-cookies, and you will want to reject URLs matching http://www.angelfire.lycos.com/doc/images/track/ot_noscript.gif.* as well. Some images are hosted on http://www.angelfire.lycos.com and will require some smart hackery.
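
A hedged sketch of how those two options could look on a wget command line. --no-cookies and --reject-regex are real wget options (the latter needs a reasonably recent wget); --page-requisites is an assumption added here so that wget follows image links at all, and the names TRACK_RE and fetch_pages are made up. The images hosted on www.angelfire.lycos.com are not solved by this.

```shell
# Regex for the tracking image named above; dots are escaped so that they
# match literally, and the trailing .* covers any query string.
TRACK_RE='http://www\.angelfire\.lycos\.com/doc/images/track/ot_noscript\.gif.*'

# Fetch every URL listed in file $1 without cookies, skipping the tracking image.
fetch_pages() {
    wget --no-cookies --force-directories --page-requisites \
         --reject-regex "$TRACK_RE" -i "$1"
}

# Possible usage, with a URL list such as sitemap-urls:
#   fetch_pages sitemap-urls
```

The pattern can be checked offline with grep -E before starting a long download.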



You can also extract the "realm" and username combinations from the sitemap indexes (kept in a directory ori/ in this example): zgrep -hEo 'http:.*xml' ori/sitemap-index-*.xml.gz | sed 's#http://www.angelfire.com/##' | sed 's#/sitemap.xml##' | sed 's#/#\t#'


Warning: There are usernames without a "realm" prefix! Random examples are jeshare, seacrozzer or hjones669. For those, the command above prints only the username, with no tab.
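
To keep the output at two columns even for usernames without a realm, the sed pipeline could be followed by a small awk step. This is only a sketch; the function name split_realm_user and the choice of an empty first column for missing realms are assumptions.

```shell
# Read sitemap URLs on stdin and print "realm<TAB>username".
# URLs that contain only a username get an empty realm column.
split_realm_user() {
    sed -e 's#^http://www\.angelfire\.com/##' -e 's#/sitemap\.xml$##' \
        | awk -F/ '
            NF == 1 { printf "\t%s\n", $1; next }
            { printf "%s\t%s\n", $1, substr($0, length($1) + 2) }
        '
}

# Possible usage:
#   split_realm_user < sitemap-urls > realms-and-users
```

With the sample URLs from above, punk4/jori_loves_jackass becomes realm punk4 and user jori_loves_jackass, while vevayaqo ends up as a user with an empty realm.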