THE GEOCITIES GRAB FAQ
This FAQ has been written because an awful lot of people got wind of the newest project on the Archive Team plate, which is to attempt to mirror as much of the doomed website GeoCities as quickly as possible. Press coverage means that a lot of people are coming around and asking how they can help or what is up. This is a current listing of Jason's general answers to these questions. You can write to him at firstname.lastname@example.org, post to his page on this site, or go to EFNet and join channel #archiveteam.
These questions will likely change, as will the answers.
GENERAL QUESTIONS ABOUT THE PROJECT
What are you trying to do?
Simply put, we're trying to capture a copy of the family of websites known as "GeoCities" before its parent company, Yahoo!, takes the site down completely.
Why are you doing this?
We happen to believe that GeoCities represents a rather important point in the growth of the World Wide Web (and the Internet in general): many thousands of people came online, were given the ability to create their own web pages to be seen by a potential worldwide audience, and set out to do just that. Some sites were terrible, and some were brilliant, but they are all of their time - some of these sites have been maintained to the present day, but thousands were left alone for the last decade and represent a time capsule of the mid-1990s Internet. Deleting these pages, some of which were curated by people no longer with us, or by people who have completely forgotten the work they did, seems a shame.
When does Yahoo! plan to take down GeoCities?
Currently, the main GeoCities site claims it will go down later in the summer, but gives no firm date. Jason Scott's opinion is that Yahoo! will take down GeoCities to coincide with quarterly financial reports/earnings, meaning before June 30th. Traditionally, this is when most companies vainly attempt to show "progress" or "measures", and cutting off GeoCities would be a prime example. This is, however, just a guess - in the case of Yahoo! Pets, the site was taken down with absolutely no warning, and Yahoo! Briefcase, a site that is the dictionary definition of "tiny", was taken down with 30 days of semi-warning after ten full years of being up. So let's assume it will be sooner rather than later.
Are you Archive.org?
No, we're not.
Archive.org has the Wayback Machine - your job is done!
No, not true. The Wayback Machine (also known as the Internet Archive by some) is a wonderful tool, a historical marvel and a precious resource, but it does not crawl every single last page of a website. Many sites on GeoCities are not on the Wayback Machine (although many are), so our work is only slightly redundant, and ideally our copy will be easier to analyze and provide.
How about Google Cache - your job is done!
The Google Cache, contrary to some opinion, is not a long-term storage solution. Google removes caches anywhere from a few weeks to a few months after the site disappears - once GeoCities is down, the cache will go away. Admittedly, there's always the chance that somewhere deep in the bowels of Google are copies of GeoCities, but we're not going to bank on that.
Where will the mirrored websites be accessible?
It is already partly accessible at http://web.archive.org/web/*/http://put/your/url/here. Some websites may not be visible there yet, since archive.org waits 6 months before making mirrored data available.
Data fetched independently by Archive Team will probably be mirrored at http://geociti.es. However, the scheme for the mirrored URLs under http://geociti.es is still being decided (as of October 26, 2009).
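As a small illustration of the Wayback Machine address pattern given above, here is a minimal sketch that builds the listing URL for a given page. The function name and the example GeoCities address are hypothetical and only for illustration; the only thing taken from this FAQ is the http://web.archive.org/web/*/ prefix.

```python
# Prefix quoted in this FAQ; the "*" asks for a listing of all captures.
WAYBACK_PREFIX = "http://web.archive.org/web/*/"


def wayback_listing_url(page_url):
    """Return the Wayback Machine listing URL for page_url.

    The original URL is simply appended to the prefix, as in the
    pattern shown in this FAQ.
    """
    return WAYBACK_PREFIX + page_url.strip()


if __name__ == "__main__":
    # Hypothetical example address, not a real site we know to be archived.
    print(wayback_listing_url("http://www.geocities.com/example/index.html"))
```

Whether a given page actually shows up there depends on whether it was crawled and whether the 6-month delay has passed.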
TECHNICAL QUESTIONS ABOUT THE PROJECT
How big a site are we talking about here?
Nobody connected with Archive Team has any hard data. Yahoo's Site Explorer claims 23 million pages, and based on our downloads so far we're inclined to say something in the range of 10 terabytes, but we could be wildly off in either direction. The problem is compounded because you can't simply go to a GeoCities site and get everything in the account - some people have stuff that isn't linked from anywhere else, and we're not going to find it under any of our methodologies. So in terms of front-end access, we're just going to download until we run out of stuff to download, and then we'll be able to tell you where we are.
As of April 29th, we had something in the range of 350 gigabytes of GeoCities sites, well into the mid six-figures.
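To put those rough figures side by side, here is a back-of-the-envelope sketch. It assumes decimal units (1 terabyte = 10^12 bytes) and treats the 23 million page count and the 10 terabyte guess as if they described the same set of files, which they may well not; the function names are made up for illustration.

```python
# Figures quoted in this FAQ; all of them are rough guesses.
ESTIMATED_TOTAL_BYTES = 10 * 10**12   # "10 Terabytes", decimal units assumed
ESTIMATED_PAGES = 23_000_000          # Yahoo's Site Explorer claim
CAPTURED_BYTES = 350 * 10**9          # "350 gigabytes" as of April 29th


def average_page_bytes(total_bytes, pages):
    """Average size per page implied by the two estimates."""
    return total_bytes / pages


def fraction_captured(captured_bytes, total_bytes):
    """Share of the estimated total that has been downloaded."""
    return captured_bytes / total_bytes


if __name__ == "__main__":
    avg = average_page_bytes(ESTIMATED_TOTAL_BYTES, ESTIMATED_PAGES)
    share = fraction_captured(CAPTURED_BYTES, ESTIMATED_TOTAL_BYTES)
    print(f"average per page: about {avg / 1000:.0f} KB")
    print(f"captured so far: about {share:.1%} of the estimate")
```

Under those assumptions the estimates work out to roughly 435 KB per page, with about 3.5 percent of the guessed total captured. Since the total itself could be wildly off, so could both of these numbers.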
How are you acquiring this data?
The short form is that about 12-24 people are running GNU Wget against the site to within an inch of its life. Wget is a resilient utility that is very flexible in capturing data and keeping it in good form, and the captured data can then be analyzed to find more connections.
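For readers who have not used Wget, here is a minimal sketch of what a polite recursive grab of a single site might look like, wrapped in Python so the command can be built and inspected before running it. This is not the exact invocation anyone on the team is using; the choice of options, the one-second wait, and the example address are all assumptions for illustration. The options themselves (--mirror, --page-requisites, --no-parent, --wait) are standard GNU Wget options.

```python
import shlex
from typing import List


def build_wget_command(start_url: str, wait_seconds: int = 1) -> List[str]:
    """Build (but do not run) a GNU Wget command for mirroring one site."""
    return [
        "wget",
        "--mirror",                # recursion plus time-stamping, unlimited depth
        "--page-requisites",       # also fetch images/CSS needed to display pages
        "--no-parent",             # never climb above the starting directory
        f"--wait={wait_seconds}",  # pause between requests (assumed value)
        start_url,
    ]


if __name__ == "__main__":
    # Hypothetical example address.
    command = build_wget_command("http://www.geocities.com/example/")
    print(shlex.join(command))
```

The sketch only prints the command; to actually run it you would pass the list to subprocess.run on a machine with Wget installed.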
Where are you keeping this data?
People who have the minimum amount of disk space needed (less than a terabyte at the moment, but soon to be two terabytes) are rsyncing the data among themselves as they go.
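As a rough idea of what that rsync step can look like, here is a minimal sketch that builds an rsync command for pushing a local directory to another volunteer's machine. The directory name and remote host are hypothetical, and the option set is an assumption rather than what anyone is actually running; --archive, --partial and --compress are standard rsync options.

```python
from typing import List


def build_rsync_command(local_dir: str, remote_target: str) -> List[str]:
    """Build (but do not run) an rsync command that copies local_dir's contents."""
    # A trailing slash on the source tells rsync to copy the directory's
    # contents rather than the directory itself.
    source = local_dir.rstrip("/") + "/"
    return [
        "rsync",
        "--archive",    # recurse and preserve times, permissions, symlinks
        "--partial",    # keep partly transferred files so transfers can resume
        "--compress",   # compress data in transit
        source,
        remote_target,
    ]


if __name__ == "__main__":
    # Hypothetical paths and host, for illustration only.
    print(build_rsync_command("geocities-grab", "volunteer@example.org:/data/geocities/"))
```

Keeping partial files matters when the copies are hundreds of gigabytes, since an interrupted transfer does not have to start over.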
HOW YOU CAN HELP
Right now, things are generally under control, but you should feel free to visit #archiveteam on the EFNet IRC network to chat (it does get loud in there), or join this wiki.
Jason is accepting donations to buy hard drives; his PayPal is email@example.com. This is not a tax-deductible donation - you're basically just giving him money, which he uses to buy drives. So don't do it if you don't like that.