Difference between revisions of "Talk:Angelfire"

From Archiveteam

Revision as of 18:09, 8 May 2015

Some brainstorming from procrastination:

First grab all the sitemap indexes: curl http://www.angelfire.com/robots.txt | grep -Eo 'http.*gz' > sitemap-index-urls

http://www.angelfire.com/sitemap-index-00.xml.gz
http://www.angelfire.com/sitemap-index-01.xml.gz
http://www.angelfire.com/sitemap-index-02.xml.gz
...
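
A minimal sketch of the robots.txt step as a reusable function, assuming the index URLs all end in .gz as in the listing above. The function name extract_index_urls is made up for illustration.

```shell
# Hypothetical helper: read a robots.txt on stdin and print each
# sitemap index URL (from "http" up to ".gz") on its own line.
# [^[:space:]]* instead of .* keeps a match from running past a space.
extract_index_urls() {
    grep -Eo 'https?://[^[:space:]]*\.gz' | sort -u
}

# Usage (needs network access, so it is left commented out):
#   curl -s http://www.angelfire.com/robots.txt | extract_index_urls > sitemap-index-urls
```

The sort -u only guards against the same index being listed twice; it can be dropped if the original order matters.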


Use that to download the sitemap index files, each of which lists the per-site sitemaps: wget -i sitemap-index-urls

<sitemap><loc>http://www.angelfire.com/punk4/jori_loves_jackass/sitemap.xml</loc><lastmod>2012-04-10</lastmod></sitemap>
<sitemap><loc>http://www.angelfire.com/vevayaqo/sitemap.xml</loc><lastmod>2012-04-10</lastmod></sitemap>
<sitemap><loc>http://www.angelfire.com/planet/dumbass123/sitemap.xml</loc><lastmod>2012-04-10</lastmod></sitemap>
...


Extract the URLs (the [^<]* keeps each match inside its own <loc>, in case several entries share a line): zgrep -hEo 'http:[^<]*xml' sitemap-index-*.xml.gz > sitemap-urls

http://www.angelfire.com/punk4/jori_loves_jackass/sitemap.xml
http://www.angelfire.com/vevayaqo/sitemap.xml
http://www.angelfire.com/planet/dumbass123/sitemap.xml
...
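
The extraction step could also be wrapped in a function that reads the decompressed XML on stdin, which makes it easy to try on a sample line first. This is only a sketch: the name extract_sitemap_urls is hypothetical, and it assumes every entry ends in /sitemap.xml as in the samples above.

```shell
# Hypothetical helper: read sitemap-index XML on stdin and print each
# per-site sitemap URL on its own line.
# [^<]* cannot cross the closing </loc>, so several <sitemap> entries on
# one physical line still give one URL each.
extract_sitemap_urls() {
    grep -Eo 'http://[^<]*/sitemap\.xml'
}

# Usage:
#   gzip -dc sitemap-index-*.xml.gz | extract_sitemap_urls > sitemap-urls
```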


And grab them all: wget --force-directories -i sitemap-urls


TODO: Find a smart way to grab everything from that.

You will want --no-cookies, and you should reject anything matching http://www.angelfire.lycos.com/doc/images/track/ot_noscript.gif.* (a tracking image). Some images are hosted on http://www.angelfire.lycos.com rather than www.angelfire.com, so grabbing them will require some smart hackery.
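
One possible starting point for the TODO, untested against the live site: pull every <loc> out of the downloaded per-site sitemaps and feed the list to wget. It assumes the per-site sitemaps follow the usual sitemap format with <loc> elements, and that wget is 1.14 or newer (for --reject-regex). The function name and the page-urls filename are made up.

```shell
# Hypothetical helper: read sitemap XML on stdin and print every <loc> URL.
extract_locs() {
    grep -Eo '<loc>[^<]+</loc>' | sed -e 's#^<loc>##' -e 's#</loc>$##'
}

# Usage, assuming wget --force-directories saved the sitemaps under
# ./www.angelfire.com/ (left commented out because it hits the network):
#   find www.angelfire.com -name sitemap.xml -exec cat {} + | extract_locs > page-urls
#   wget --no-cookies --page-requisites --span-hosts \
#        --domains=angelfire.com,angelfire.lycos.com \
#        --reject-regex 'ot_noscript\.gif' \
#        -i page-urls
```

The --span-hosts and --domains pair is a guess at how to pick up the images on www.angelfire.lycos.com without wandering off to other hosts.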



You can also extract the "realm" and username combinations from the sitemap indexes: zgrep -hEo 'http:[^<]*xml' ori/sitemap-index-*.xml.gz | sed 's#http://www.angelfire.com/##' | sed 's#/sitemap.xml##' | sed 's#/#\t#'


Warning: Some usernames have no "realm" prefix, for example jeshare, seacrozzer or hjones669.
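
A sketch of the split that keeps realm-less usernames in the right column, so the output always has two fields. The function name split_realm_user and the "-" placeholder for a missing realm are arbitrary choices for illustration.

```shell
# Hypothetical helper: read per-site sitemap URLs on stdin and print
# "realm<TAB>username". A path with no realm gets "-" as its realm.
# Paths deeper than realm/username keep only their first two parts.
split_realm_user() {
    sed -e 's#^http://www\.angelfire\.com/##' -e 's#/sitemap\.xml$##' |
    awk -F/ -v OFS='\t' 'NF == 1 { print "-", $1; next } { print $1, $2 }'
}

# Usage:
#   split_realm_user < sitemap-urls > realm-users.tsv
```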


Guestbooks were killed in 2012, e.g. http://htmlgear.lycos.com/guest/control.guest?u=gosanson&i=2&a=view