Difference between revisions of "Splinder"

From Archiveteam

Revision as of 10:39, 17 November 2011

Splinder
Splinder homepage.png
URL http://www.splinder.com/

http://www.us.splinder.com/

Status Closing
Archiving status In progress...
Archiving type Unknown
IRC channel #archiveteam-bs (on hackint)

Splinder.com has been the main blog hosting company in Italy for a while (see Wikipedia:it:Splinder). It was founded in 2001 and hosts about half a million blogs and over 55 million pages. Since 8 November 2011, a warning on the home page has said that no new PRO accounts have been created since 1 June. The company has confirmed that the website will close on the 24th.[1]

How to help archiving

There is a distributed download script that gets usernames from a tracker and downloads the data.

Make sure you are on Linux and have curl, git, and a recent version of Bash. Your system must also be able to compile wget.
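A quick way to check the requirements above before starting is sketched below. It only tests whether each command is on the PATH; it does not check how recent Bash is. The helper name missing_tools is made up for this example, and gcc and make are an assumed stand-in for "able to compile wget" (the page does not name a toolchain).

```shell
#!/bin/sh
# Print the name of every listed command that is not found on PATH.
# (missing_tools is a hypothetical helper, not part of splinder-grab.)
missing_tools() (
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
)

# curl, git and bash come from the requirements above; gcc and make are
# an assumption about what compiling wget needs.
missing=$(missing_tools curl git bash gcc make)
if [ -n "$missing" ]; then
  # $missing is left unquoted on purpose so the names print on one line.
  echo "Missing tools:" $missing
else
  echo "All listed tools were found."
fi
```

If anything is reported missing, install it with your package manager before running the steps below.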

  1. Get the code: git clone https://github.com/ArchiveTeam/splinder-grab
  2. Get and compile the latest version of wget-warc: ./get-wget-warc.sh
  3. Think of a nickname for yourself (preferably use your IRC name).
  4. Run the download script with ./dld-client.sh "<YOURNICK>"
  5. To stop the script gracefully, run touch STOP in the script's working directory. It will finish the current task and stop.
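Put together, the steps above look roughly like the following session. This is a sketch, not an official script: the run wrapper and the DRY_RUN and NICK variables are made up for this example, and by default the commands are only printed so that you can review them first. It assumes git clone creates a splinder-grab directory and that both scripts sit in its top level.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) only prints each command; DRY_RUN=0 runs it.
DRY_RUN=${DRY_RUN:-1}
NICK=${NICK:-YOURNICK}   # replace with your own nickname (step 3)

# Hypothetical wrapper: print the command in dry-run mode, else execute it.
run() {
  if [ "$DRY_RUN" = 1 ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

run git clone https://github.com/ArchiveTeam/splinder-grab   # step 1
run cd splinder-grab
run ./get-wget-warc.sh                                        # step 2
run ./dld-client.sh "$NICK"                                   # step 4, keeps running

# Step 5: from another terminal, in the same directory, run
#   touch STOP
# and the client stops after finishing its current task.
```

Run it once as is to see the commands, then with DRY_RUN=0 NICK=yourname to execute them.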

Notes

  • Compiling wget-warc will require dev packages for the various libraries that it needs. Most questions have been about gnutls; install the gnutls-devel or gnutls-dev package with your favorite package manager.
  • Downloading one user's data can take between 10 seconds and a few hours.
  • The amount of data per user varies just as widely, from a few kB to several GB.
  • The downloaded data will be saved in the ./data/ subdirectory.
  • Download speeds from splinder.com are fairly low (the servers may be particularly overloaded during the European daytime because of additional traffic from people exporting their blogs). You can run multiple clients to speed things up.
  • There are some problems with subdomains containing dashes[2]: if they fail on your machine (reported with wget compiled with +nls), stop and restart the script for now; someone else will do those users (although they seem to fail in part anyway).
    Some such users: macrisa, -Maryanne-, it:SalixArdens, it:MCris, it:7lilla, it:thepinkpenguin, it:bimbambolina, it:lazzaretta, it:Hedwige, it:N4m3L3Ss, it:Barbabietole_Azzurre, it:celebrolesa2212, it:buongiono.mattina, it:DarkExtra

Uploading your data

To upload the data you've downloaded, first contact SketchCow on IRC for an rsync slot. Once you have that you can run the ./upload-finished.sh script to upload your data. For example, run this in your script directory: ./upload-finished.sh batcave.textfiles.com::YOURNICK/splinder/

Status

There is a real-time dashboard where you can check the progress.

External links

Site structure

The users are identified by their usernames. Fortunately, the site provides a list of all users. Usernames are not case-sensitive, but there is a case preference.

Example URLs

User profile: http://www.splinder.com/profile/<<username>>

Example profile:
http://www.splinder.com/profile/difficilifoglie

View count on profile page:
http://www.splinder.com/ajax.php?type=counter&op=profile&profile=Romanticdreamer

Example of friends list paging: (160 per page, starting at 0)
http://www.splinder.com/profile/difficilifoglie/friends
http://www.splinder.com/profile/difficilifoglie/friends/160
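The paging scheme above (160 per page, offset appended to the path) can be turned into a list of URLs like this. It is a sketch that assumes you already know how many friends the user has; the page does not say where that number comes from, so an alternative would be to keep fetching until a page comes back empty. The function name friends_page_urls is made up for this example.

```shell
#!/bin/sh
# Print every friends-list page URL for a user with a known friend count.
# Page size (160) and the offset-in-path form follow the examples above.
friends_page_urls() (
  user=$1 total=$2 size=160
  base="http://www.splinder.com/profile/$user/friends"
  echo "$base"                      # first page, offset 0
  offset=$size
  while [ "$offset" -lt "$total" ]; do
    echo "$base/$offset"            # later pages: /160, /320, ...
    offset=$((offset + size))
  done
)

# Example: a user with 400 friends needs three pages (offsets 0, 160, 320).
friends_page_urls difficilifoglie 400
```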

Inverse friends (probably also paged):
http://www.splinder.com/profile/difficilifoglie/friendof

Link to blog: (note: not always the same as the username)
http://difficilifoglie.splinder.com/
http://learnonline.splinder.com/

Photo:
http://www.splinder.com/profile/difficilifoglie/photo
http://www.splinder.com/mediablog/wondermum/media/24544805

Video:
http://www.splinder.com/profile/wondermum/video
http://www.splinder.com/mediablog/wondermum/media/25737390

Audio:
Not a separate user feed, but only accessible via mediablog
http://www.splinder.com/mediablog/learnonline/media/25727030

Mediablog: combination of the audio + video + photo lists
http://www.splinder.com/mediablog/learnonline
(16 per page, starting at 0)
http://www.splinder.com/mediablog/learnonline/16

Mediablog has PowerPoint, Word files:
http://www.splinder.com/mediablog/learnonline/media/25641346
http://www.splinder.com/mediablog/learnonline/media/25546305
http://www.splinder.com/mediablog/learnonline/media/21901634
http://www.splinder.com/mediablog/learnonline/media/24875290

User avatar: grab url from profile page

Photo file: grab url from photo page and remove _medium to get original picture
http://files.splinder.com/d5e492233631af39212268593afca02d_square.jpg
http://files.splinder.com/d5e492233631af39212268593afca02d_medium.jpg
http://files.splinder.com/d5e492233631af39212268593afca02d.jpg
older photos do not have this structure, different ids for each size:
http://www.splinder.com/mediablog/babboramo/media/17359043
http://files.splinder.com/13b615ccbd75354ee4e0d973da66c2b2.jpeg
http://files.splinder.com/770d7b9ecac27083d9204af327ebe743.jpeg
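The suffix rule for newer photos can be applied mechanically, as in the sketch below. The function name original_photo_url is made up for this example. It strips _square as well as _medium, on the assumption (based on the three example URLs sharing one id) that both lead to the same original. URLs without such a suffix, like the older photos, are printed unchanged.

```shell
#!/bin/sh
# Remove a _square or _medium size suffix that sits right before the
# file extension; any other URL is passed through untouched.
original_photo_url() {
  printf '%s\n' "$1" | sed -E 's/_(square|medium)(\.[A-Za-z0-9]+)$/\2/'
}

original_photo_url "http://files.splinder.com/d5e492233631af39212268593afca02d_medium.jpg"
original_photo_url "http://files.splinder.com/13b615ccbd75354ee4e0d973da66c2b2.jpeg"
```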

PowerPoint, Word files: grab url from media page
http://files.splinder.com/46dbf3d5a0b12e490f81ddb8444b4fad.ppt
http://files.splinder.com/ab3ce16c850ac530351d9df0937152c7.pdf

Video items: grab url from media page
http://files.splinder.com/8f5caff20685648bacd4ce1acf90e645_square.jpg
http://files.splinder.com/8f5caff20685648bacd4ce1acf90e645_thumbnail.jpg
http://files.splinder.com/8f5caff20685648bacd4ce1acf90e645_small.flv
note: square, thumbnail and small are not always available; check flashvars for vidpath, imgpath
http://www.splinder.com/mediablog/babboramo/media/13131052
http://files.splinder.com/e067653e1532e55ee208605fcb84361a.flv
http://files.splinder.com/f56060b7fef139f03b72e06ca9fcba55.jpeg

Audio items: grab url from media page, flashvars
sometimes there is a _thumbnail; remove it to get better quality
http://files.splinder.com/a5043c34a12ee66f5ad995ffd14493ef_thumbnail.mp3
http://files.splinder.com/a5043c34a12ee66f5ad995ffd14493ef.mp3

Comments on blog posts:
http://www.splinder.com/myblog/comment/list/25742358
on some, but not on all blogs, those comments are also included in the blog page
http://dal15al25.splinder.com/post/25740180
http://soluzioni.splinder.com/post/2802227/blog-pager-su-piu-righe
http://soluzioni.splinder.com/post/25737683/avviso-per-gli-utenti-ce-da-preoccuparsi/
http://civati.splinder.com/post/25742977
pagination: see media comments

Comments on media items:
http://www.splinder.com/media/comment/list/21254470
http://www.splinder.com/media/comment/list/21254470?from=50
(50 per page, starting at 0)
number of comments is on the media page
http://www.splinder.com/mediablog/danspo/media/21254470
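Since the comment count is shown on the media page, the comment-list URLs can be generated from it. The sketch below assumes the count has already been read from that page; the function name comment_page_urls is made up for this example. Unlike the friends list, the offset goes in a from query parameter.

```shell
#!/bin/sh
# Print the comment-list URLs for a media item with a known comment count.
# 50 comments per page, offset passed as ?from=N (omitted for the first page).
comment_page_urls() (
  id=$1 total=$2 size=50 from=0
  base="http://www.splinder.com/media/comment/list/$id"
  while :; do
    if [ "$from" -eq 0 ]; then
      echo "$base"
    else
      echo "$base?from=$from"
    fi
    from=$((from + size))
    [ "$from" -lt "$total" ] || break
  done
)

# Example: 120 comments give three pages (from=0, 50, 100).
comment_page_urls 21254470 120
```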


Blog urls:
the blogs have content from their own subdomain, but also from
files.splinder.com
www.splinder.com/misc/ (topbar css, gif)
www.splinder.com/includes/ (js)
www.splinder.com/modules/service_links/ (images)
syndication.splinder.com

links to www.splinder.com that should NOT be followed:
 /myblog/
 /users/
 /media/
 /node/
 /profile/
 /mediablog/
 /community/
 /user/
 /night/
 /home/
 /mysearch/
 /online/
 /trackback/

wget-warc --mirror --page-requisites --span-hosts --domains=learnonline.splinder.com,files.splinder.com,www.splinder.com,syndication.splinder.com --exclude-directories="/users,/media,/node,/profile,/mediablog,/community,/user,/night,/home,/mysearch,/online,/trackback,/myblog/post,/myblog/posts,/myblog/tags,/myblog/tag,/myblog/view,/myblog/latest,/myblog/subscribe" -nv -o wget.log "http://learnonline.splinder.com/"
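
To reuse the command above for another blog, only the blog's own hostname changes. The sketch below prints the command for a given hostname instead of running it, so it can be inspected first. The function name blog_wget_command is made up for this example, and the domain and exclusion lists are copied from the command above.

```shell
#!/bin/sh
# Print the wget-warc command line for one blog hostname.
blog_wget_command() (
  blog_host=$1
  domains="$blog_host,files.splinder.com,www.splinder.com,syndication.splinder.com"
  exclude="/users,/media,/node,/profile,/mediablog,/community,/user,/night,/home,/mysearch,/online,/trackback,/myblog/post,/myblog/posts,/myblog/tags,/myblog/tag,/myblog/view,/myblog/latest,/myblog/subscribe"
  printf 'wget-warc --mirror --page-requisites --span-hosts --domains=%s --exclude-directories="%s" -nv -o wget.log "http://%s/"\n' \
    "$domains" "$exclude" "$blog_host"
)

blog_wget_command difficilifoglie.splinder.com
```

Calling it with learnonline.splinder.com reproduces the command shown above.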