Difference between revisions of "Talk:Angelfire"

From Archiveteam
 
Some brainstorming from procrastination:
First grab all the sitemap indexes:
curl http://www.angelfire.com/robots.txt | grep -Eo 'http.*gz' > sitemap-index-urls
<pre>
http://www.angelfire.com/sitemap-index-00.xml.gz
http://www.angelfire.com/sitemap-index-01.xml.gz
http://www.angelfire.com/sitemap-index-02.xml.gz
...
</pre>
Use that to grab all the sitemaps:
wget -i sitemap-index-urls
<pre>
<sitemap><loc>http://www.angelfire.com/punk4/jori_loves_jackass/sitemap.xml</loc><lastmod>2012-04-10</lastmod></sitemap>
<sitemap><loc>http://www.angelfire.com/vevayaqo/sitemap.xml</loc><lastmod>2012-04-10</lastmod></sitemap>
<sitemap><loc>http://www.angelfire.com/planet/dumbass123/sitemap.xml</loc><lastmod>2012-04-10</lastmod></sitemap>
...
</pre>
Extract the urls:
zgrep -hEo 'http:.*xml' sitemap-index-*.xml.gz > sitemap-urls
<pre>
http://www.angelfire.com/punk4/jori_loves_jackass/sitemap.xml
http://www.angelfire.com/vevayaqo/sitemap.xml
http://www.angelfire.com/planet/dumbass123/sitemap.xml
...
</pre>
And grab them all:
wget --force-directories -i sitemap-urls
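TODO: Find a smart way to grab every page listed in those sitemaps.
When crawling, you will want --no-cookies, and you should reject http://www.angelfire.lycos.com/doc/images/track/ot_noscript.gif.* (apparently a tracking image).
Some images are hosted on http://www.angelfire.lycos.com, so a crawl restricted to www.angelfire.com will miss them; this will require some smart hackery.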
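One possible approach to the TODO, as a rough sketch rather than a tested recipe: pull the page URLs out of the downloaded per-user sitemaps, then hand the list to wget. The function names extract_page_urls and grab_pages are made up for illustration. It assumes the per-user sitemaps use standard <loc> elements and that wget is new enough to support --reject-regex; the host list given to --domains is a guess based on the notes above.

```shell
# Print every <loc> URL found in the per-user sitemaps under a directory.
# Assumes the layout produced by `wget --force-directories`
# (host/path/sitemap.xml) and standard sitemap <loc> elements.
extract_page_urls() {
    find "$1" -type f -name 'sitemap.xml' -exec cat {} + \
        | grep -Eo '<loc>[^<]+</loc>' \
        | sed -e 's#^<loc>##' -e 's#</loc>$##' \
        | sort -u
}

# Fetch the listed pages plus their images and stylesheets, without
# cookies, allowing the lycos.com image host and skipping the tracking GIF.
# The --domains list is an assumption; adjust it if other hosts turn up.
grab_pages() {
    wget --no-cookies \
         --page-requisites --span-hosts \
         --domains=angelfire.com,angelfire.lycos.com \
         --reject-regex '.*angelfire\.lycos\.com/doc/images/track/ot_noscript\.gif.*' \
         --force-directories \
         -i "$1"
}
```

Usage would be something like: extract_page_urls www.angelfire.com > page-urls, followed by grab_pages page-urls. This only fetches pages that are actually listed in a sitemap; anything unlisted would still need a recursive crawl.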
-----






-----
Guestbooks were shut down in 2012, e.g. http://htmlgear.lycos.com/guest/control.guest?u=gosanson&i=2&a=view
-----
Some users have blogs, which show up in the sitemap like this: http://filesha.angelfire.com/blog/index.blog

Latest revision as of 07:38, 9 May 2015


You can also extract the "realm" and username combinations from the sitemap indexes (stored under ori/ in this example):
 zgrep -hEo 'http:.*xml' ori/sitemap-index-*.xml.gz | sed 's#http://www.angelfire.com/##' | sed 's#/sitemap.xml##' | sed 's#/#\t#'


Warning: There are usernames without a "realm" prefix, for example jeshare, seacrozzer or hjones669. For those, the command above prints only the username, with no tab.
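If a consistent two-column output is wanted, something along these lines might help. It is only a sketch: split_realm_user is a made-up name, and it assumes every sitemap URL has the form http://www.angelfire.com/[realm/]username/sitemap.xml. Usernames without a realm get an empty first column instead of shifting into it.

```shell
# Read sitemap URLs on stdin and print "realm<TAB>username" per line.
# Users without a realm get an empty realm field, so the username
# always lands in the second column.
split_realm_user() {
    sed -e 's#^http://www\.angelfire\.com/##' -e 's#/sitemap\.xml$##' \
        | awk -F/ -v OFS='\t' '{ if (NF >= 2) print $1, $2; else print "", $1 }'
}
```

For example: zgrep -hEo 'http:.*xml' sitemap-index-*.xml.gz | split_realm_user > realm-users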