Talk:Windows Live Spaces

From Archiveteam
Revision as of 05:31, 22 March 2011 by Swicher (talk | contribs) (HTTrack (graphic version): Add a couple more data about the project configuration.)

Current Status

D-Day has come and gone, and as of March 20, 2011, Windows Live Spaces is still running. Millions of Spaces have still not been downloaded or migrated. Microsoft could shut down WLS at any moment and delete all the data, so we need as many people as possible to help download them. As of March 2, 2011, only 1,000,000 Spaces had been migrated to WordPress,[1] so we have a lot of catching up to do.

Swicher is currently downloading several thousand Spaces using HTTrack. The same Spaces are also included in the first few hotlists, to make sure we get them.

Phase 1: CID Scraping

NovaKing is currently scraping Bing for more profiles. At the rate he's been going, he should have tens of thousands ready soon, which will be split up into hotlists and allocated to volunteers for downloading.

Phase 2: Downloading Hotlists

This is a list of available hotlists and their status. They are generally split into chunks of 1,000 Spaces.
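
As a rough illustration of how a scraped list could be cut into hotlists of 1,000 Spaces, here is a minimal Python sketch. It is not the tool actually used for this project. The function name is hypothetical, and the filename pattern is an assumption modelled on the names in the table below.

```python
def split_into_hotlists(space_ids, chunk_size=1000):
    """Split a list of Space identifiers into numbered hotlists.

    Returns a list of (filename, ids) pairs. The filename pattern is an
    assumption that imitates the names used in the table on this page.
    """
    hotlists = []
    for start in range(0, len(space_ids), chunk_size):
        chunk = space_ids[start:start + chunk_size]
        first = start + 1            # 1-based position of the first entry
        last = start + len(chunk)    # 1-based position of the last entry
        filename = f"wls {first:04d}-{last:04d}.txt"
        hotlists.append((filename, chunk))
    return hotlists
```

The last hotlist simply holds whatever is left over, so a list of 2,202 entries would give two full hotlists and one of 202.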

If you would like to take ownership of one, speak to Auguste on IRC. Volunteers, please update this table as soon as you are finished, or let Auguste know if you are unable to complete it.

Filename | Owner | Size (compressed size) | Status | Status notes
wls 0001-1000.txt | ersi | ~13.7 GB (1.1 GB bzip2) | Complete | Uploaded, awaiting verification
wls 1001-2000.txt | ersi | 20 GB (1.6 GB bzip2) | Complete | Uploaded, awaiting verification
wls 2001-2202.txt | Dr-Spangle | | Complete | Awaiting upload
wls 2203-3000.txt | ersi | 13 GB (945.3 MB bzip2) | Complete | Uploaded, awaiting verification
wls 3001-4000.txt | Jeroenz0r, joeyh | (1.36 GB gz) | Complete | Uploaded, awaiting verification
wls 4001-5000.txt | amnesia | 8.0 GB unzipped (1.7 GB zipped) | Complete | Uploaded, awaiting verification
wls 5001-6000.txt | Underscor | | In progress |
wls 6001-7000.txt | amnesia | 336 MB unzipped (74 MB zipped) | Complete | Uploaded, awaiting verification

Instructions

You have three options:

  • SpaceInvader2.pl
    • Downloads the list of Spaces, one Space at a time, using Wget. You probably want this one.
    • Usage: SpaceInvader2.pl "HOTLIST"
  • SpaceInvaderTurbo.pl
    • Spawns multiple instances of Wget to download everything at once. If you have a hotlist of 1,000 Spaces, this means 1,000 instances of Wget, all downloading simultaneously. This may be unfriendly to both your CPU and Microsoft's systems, but it will cut a 7-day job down to a few hours. Use it at your own risk.
    • Usage: SpaceInvaderTurbo.pl "HOTLIST"
  • spaceinvader.sh
    • Same idea, different implementation. Will run up to 50 wget instances and won't be that hard on your machine.
    • Actually does save some images.
    • Usage: spaceinvader.sh "HOTLIST"
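
To illustrate the difference between launching everything at once and capping the number of simultaneous downloads (the approach described for spaceinvader.sh), here is a minimal Python sketch. It is not a port of any of the scripts above. It assumes the hotlist contains one URL per line, which may not match the real hotlist format, and all function names are hypothetical. The Wget options used are standard ones, but they are not necessarily the ones the real scripts pass.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def read_hotlist(path):
    """Return the non-empty lines of a hotlist (assumed: one URL per line)."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def fetch_with_wget(url):
    """Mirror one Space with Wget and return Wget's exit code."""
    # --mirror recurses through the site, --page-requisites also fetches
    # images and stylesheets, --no-parent stays below the start URL.
    command = ["wget", "--mirror", "--page-requisites", "--no-parent", url]
    return subprocess.run(command).returncode


def download_hotlist(urls, fetch=fetch_with_wget, max_workers=50):
    """Download every URL, running at most max_workers fetches at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        exit_codes = list(pool.map(fetch, urls))
    # Pair each URL with its exit code so failures can be retried later.
    return dict(zip(urls, exit_codes))
```

Something like `download_hotlist(read_hotlist("HOTLIST"))` would then process a whole hotlist. Lowering `max_workers` trades speed for a lighter load on both ends.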

Due to insufficient time and planning, these scripts don't download any off-site dependencies - most of what you download will be HTML/text. The upside is that it compresses nicely.

These scripts will just spit out files in the working directory, so you probably want to place them in ~/wls or something before executing them.

Once you have finished downloading a hotlist, please update your details in the above table and compress all the Spaces into a single archive, along with a copy of your hotlist. 7-Zip on maximum compression should be able to get them down to ~10% of their original size.
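
For those who would rather script the packaging step, here is a minimal Python sketch that puts the downloaded Spaces and the hotlist into one compressed archive. It uses bzip2 through Python's standard tarfile module instead of 7-Zip, simply because that needs no extra software (several volunteers in the table above also used bzip2). The function name and paths are hypothetical, and the compression ratio will differ from 7-Zip's.

```python
import os
import tarfile


def archive_hotlist(spaces_dir, hotlist_path, archive_path):
    """Pack the downloaded Spaces and their hotlist into one .tar.bz2 file."""
    with tarfile.open(archive_path, "w:bz2") as archive:
        # Store the hotlist at the top level of the archive.
        archive.add(hotlist_path, arcname=os.path.basename(hotlist_path))
        # Add the download directory recursively under its own name.
        top_level = os.path.basename(os.path.normpath(spaces_dir))
        archive.add(spaces_dir, arcname=top_level)
    return archive_path
```

Keep the archive itself outside the directory being packed, otherwise it would try to include itself.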

After compressing your hotlist, you can upload it to Underscor's FTP for temporary storage. Get the FTP details from him or Auguste. We still need to find some permanent storage to move everything to.

Phase 3: Storage

TBA.

Other Tools

Though Perl/Wget is the recommended method for archiving Spaces, there are a couple of other tools available.

HTTrack (graphic version)

This section explains how to download one or more Spaces using the graphical version of HTTrack (called WinHTTrack on Windows and WebHTTrack on Linux).

I assume the reader is already familiar with WinHTTrack (or WebHTTrack), so I will only explain what needs to be configured (in the program's Option Panel) to download a Space from Windows Live Spaces. If you do not know how to use the program, you can check this tutorial (in English) or this one (in Spanish).

Add the following lines in the "Scan Rules" section:

+*.css +*.js -ad.doubleclick.net/* -mime:application/foobar
+*.7z
+*.pdf +*.doc +*.mid +*.3gp +*.djvu +*.amr +*.mp4 +*.ogg +*.ogv +*.ogm
+*.mov +*.mpg +*.mpeg +*.avi +*.asf +*.mp3 +*.mp2 +*.rm +*.wav +*.vob +*.qt +*.vid +*.ac3 +*.wma +*.wmv
+*.zip +*.tar +*.tgz +*.gz +*.rar +*.z
+*.arj +*.dar +*.lzh +*.lz +*.lza +*.arc
+*.gif +*.jpg +*.png +*.tif +*.bmp
-*.entry#comment
+*.profile.live.com/Lists/*
+*.byfiles.storage.live.com/*
+*.photos.live.com
+*.spaces.live.com
  • Lines 1 to 7 specify which file types are downloaded from a Space when the program finds them; these lines can be modified to suit the user.
  • Line 8 is needed because the program otherwise tries to capture the comment links of every blog post on Windows Live Spaces, which generates errors and wastes time while exploring a site.
  • Lines 9 and 12 capture the Spaces in the "friends" list of the user whose Space is being captured (these lines are optional).
  • Lines 10 and 11 capture the files and photos that the user may have uploaded there.

I'm not sure the data in *.photos.live.com will continue to exist after Windows Live Spaces is shut down, so I took the opportunity to save any photos stored there anyway. If you don't want to save photos, that line is optional.

Then, in the "Browser ID" section, enter the following User Agent in the Browser "Identity" field:

Googlebot/2.1 (+ http://www.googlebot.com/bot.html)

Finally, in the "Spider" section, select the option "no robots.txt rules".

Note that if you download thousands of Spaces in a single project, you should also disable the option "Create Log files" in the "Log files, Index, Cache" section; otherwise the logs can take up tens of GB of disk space.
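
For anyone who prefers the command-line `httrack` binary over the graphical version, the scan rules, user agent and robots.txt setting above can probably be expressed as command-line arguments. The Python sketch below only builds the argument list and does not run anything. It assumes that `-O` sets the output directory, `-F` sets the user agent and `-s0` disables robots.txt handling, which is how I understand the httrack command line; check `httrack --help` before relying on it. The function name and the example URL are hypothetical, and the log file setting is left out.

```python
def build_httrack_command(start_urls, output_dir, scan_rules, user_agent):
    """Build an httrack argument list mirroring the GUI settings above."""
    command = ["httrack", *start_urls]
    command += ["-O", output_dir]    # where the mirror is written
    command += ["-F", user_agent]    # Browser "Identity" field
    command.append("-s0")            # "no robots.txt rules"
    command += scan_rules            # one argument per +/- scan rule
    return command


example = build_httrack_command(
    ["http://example.spaces.live.com/"],     # hypothetical Space URL
    "wls-mirror",
    ["+*.css", "+*.js", "-*.entry#comment", "+*.spaces.live.com"],
    "Googlebot/2.1 (+ http://www.googlebot.com/bot.html)",
)
```

The resulting list could be passed to `subprocess.run(example)`. Passing it as a list avoids shell quoting problems with characters such as `*` and `#`.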

LSSaver

Some of the descriptions in this section were taken from http://www.softsea.com/review/LSSaver.html

LSSaver is a freeware Windows program that saves a Windows Live Spaces blog to your local disk. It saves useful information such as the blog title, content, and comments. It can also save the pictures included in the blog to local disk.

LSSaver is very simple to use. It works as follows:

  • First, enter a Microsoft Live Space username.
  • Then click the "Get" button to retrieve all blog entries. This may take several minutes, depending on the number of entries in the blog and on your connection. As each blog entry is retrieved, its title appears in the tree in the left part of the window. Wait until all titles have been retrieved. You can then browse the titles by folding and unfolding the tree and check those you want to save. Once a blog entry is checked, its content appears in the right part of the window. Check all the entries you want to save and wait until all of them appear.
  • To save the selected entries, click the Save button. A file selection window opens; choose where the file will be saved, give it a name, and click the Save button in that window. After a while, all the selected entries are saved. The saved file is an HTML file, which you can open with a browser.

The program works as it should, but some details set it apart from a general web site downloader:

  • As explained above, when the program saves a blog, all the articles (and comments) are crammed into a single HTML file, which could become a problem if the blog has a lot of content.
  • Images are saved under the names 000001, 000002, etc., which makes it impossible to find the originals on the Internet (this refers to images from external sites linked in a blog) or to recognize the file format.
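
The file format problem can be worked around after the fact, because common image formats start with a recognizable signature. Here is a minimal Python sketch that inspects the first bytes of each file and appends a matching extension. It assumes the images are saved with no extension at all, which I have not verified for every LSSaver version. The function names are hypothetical, and only PNG, JPEG, GIF and BMP are covered.

```python
import os

# Leading bytes ("magic numbers") of a few common image formats.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", ".png"),
    (b"\xff\xd8\xff", ".jpg"),
    (b"GIF87a", ".gif"),
    (b"GIF89a", ".gif"),
    (b"BM", ".bmp"),
]


def guess_extension(header):
    """Return an extension for the given leading bytes, or None if unknown."""
    for signature, extension in SIGNATURES:
        if header.startswith(signature):
            return extension
    return None


def add_extensions(directory):
    """Rename extensionless files in directory according to their content."""
    renamed = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # Skip subdirectories and files that already have an extension.
        if not os.path.isfile(path) or os.path.splitext(name)[1]:
            continue
        with open(path, "rb") as handle:
            header = handle.read(8)
        extension = guess_extension(header)
        if extension:
            os.rename(path, path + extension)
            renamed[name] = name + extension
    return renamed
```

Files with an unrecognized signature are left untouched. Note that the links inside the saved HTML file would still point at the old names, so it is safer to run this on a copy.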

Useful links

References