MyVIP/Bash script

From Archiveteam

The notes below, which precede the actual script, have been copied from the myVIP page. Most of them should be considered deprecated, along with the script that follows; they are kept for historical purposes only.


Some notes about the site, the script and the archiving process:

  • The site relies heavily on JavaScript but can still be saved perfectly.
  • Whoever runs the scrape needs to register an account and feed the cookies file (exported from the browser) to the script. Note: visiting myvip.com in a new browser session instantly invalidates the old cookie!
  • The total size of the content is probably around 2 to 3 terabytes. Among the first (oldest) 100,000 user IDs, 47% of the possible profiles exist; those average ~790 kilobytes each, and the largest are a few tens of megabytes (under 100 MB), WARC compressed. Note that this is a rough estimate from a small sample. (Extrapolated to all user IDs, the profiles alone would come to about 2 TB, not counting the profile picture thumbnails and the clubs, which are probably not significant in size.)
  • The bash script is amateur work and may be bash-specific in places (i.e. not usable with other shells).
  • However, the script has been tested and should do its job reliably.
  • The script currently supports only profiles. Club pages should also be saved later; having this algorithm makes it simpler to write the one needed for clubs.
  • The script accepts user IDs, which run sequentially from 1 up to about 4,600,000 (?), but not all profiles exist.
  • The script currently saves each user's content into a separate WARC file. This should be changed: a single user's WARC can be very small, so the result is lots of little files.
  • The script saves the following for each user: profile page, list of clubs the user is a member of (if more than one page), acquaintances (or "friends") list, photo albums, photos, comments on photos. That is everything that should be saved, where it exists.
  • The script supports creating a "directory" of users: it extracts some identifying information and stores it in a one-line CSV file per user. (This should be adjusted just like the WARC output; the files can later be concatenated to form a database.)
  • The script also creates lists of the profile picture and club avatar thumbnails that appear in lists on the site. They could be saved along with every user, but then a profile picture would be requested once for each acquaintance the user has. Collecting the URLs and then downloading all those tiny pictures only once is the feasible solution.
  • The script currently has (almost) separate discovery and grab phases. This means that many pages are requested twice: once during discovery and once while WARCing. This could probably be optimized.
  • A user's acquaintances list is a problematic point. When the list is first visited (by clicking "Ismerősök"), an alphanumeric pager ID is generated. Requests for the other pages of the list need this pager ID. However, a new request for the initial page generates a new pager ID and invalidates the earlier one! The pager ID also expires after 20 minutes, so all pages of a user's acquaintances list must be saved within 20 minutes. (This is why the script currently does this strictly in one separate phase at the end, and why the initial page is grabbed separately, to find out the current pager ID.)
  • The site should be saved in Hungarian. There is an English language option, but how it works has not been tested. (Is the site automatically set to English when visited from outside Hungary? Does it remember the setting? Is the setting sent in a cookie or in the URL? etc.)
  • The script uses wget for discovery. It is much faster, but it does not retry on DNS resolution errors, which is why fetching with wget is wrapped in a separate bash function.
  • The script uses wpull for grabbing (WARCing), because it's much more intelligent than wget. (wget-lua could probably be used as well, but that needs some coding.)
  • The script often checks whether we are still logged in. If not, the item pauses (sleeps) or fails, depending on the current phase.
  • In its current state, the bash script does not support running multiple instances. (However, there is probably no server-side obstacle to doing so with a proper script. There is a small glitch with colliding pagers that results in some 302 redirections, but this doesn't seem to affect either the user experience or the archived content; just let it redirect. A concurrency of 1 is recommended, but not absolutely necessary.)
  • A list of static files (that need to be downloaded only once) is here.

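The size estimate in the notes can be reproduced with a few lines of shell arithmetic. This is only a sketch: the function name `estimate_total_gb` is made up for illustration and is not part of the script, it uses decimal units (1 GB = 1,000,000 KB), and it assumes that the 47% and ~790 KB figures from the 100,000-ID sample hold for all ~4,600,000 user IDs.

```bash
#!/bin/bash
# estimate_total_gb MAX_ID EXIST_PERCENT AVG_KB
# Hypothetical helper: extrapolates the sample figures to the whole ID range.
estimate_total_gb() {
    local max_id=$1 exist_pct=$2 avg_kb=$3
    local profiles=$(( max_id * exist_pct / 100 ))    # number of profiles expected to exist
    echo $(( profiles * avg_kb / 1000 / 1000 ))       # KB -> GB, decimal units (assumption)
}

# Figures taken from the notes: ~4,600,000 IDs, 47% exist, ~790 KB each.
estimate_total_gb 4600000 47 790    # prints 1707, i.e. roughly 1.7 TB for the profiles alone
```
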
For more info, see the code. Further questions should be addressed to user:bzc6p, either on this page's talk page, or on his talk page.

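The pager ID handling described in the notes comes down to rewriting a saved URL list once the current pager ID is known. The sketch below shows one way this could be done; `swap_pager_id` is a hypothetical helper, not part of the script, and the IDs `abc123` and `xyz789` are invented. Only the URL shape is taken from the script.

```bash
#!/bin/bash
# swap_pager_id LIST_FILE NEW_ID
# Hypothetical helper: reads the pager ID from the first URL in LIST_FILE and
# replaces it with NEW_ID on every line. Returns 1 if no pager ID is found.
swap_pager_id() {
    local list_file=$1 new_id=$2 old_id
    old_id=$(head -n 1 "$list_file" | grep -o 'pager=[0-9a-z]*' | cut -d'=' -f 2)
    [ -n "$old_id" ] || return 1
    # Write to a temporary file and move it back, which avoids relying on "sed -i".
    sed "s/pager=$old_id/pager=$new_id/g" "$list_file" > "$list_file.new" &&
        mv "$list_file.new" "$list_file"
}

# Demonstration on a fabricated two-page list.
demo_list=$(mktemp)
printf '%s\n' \
    'http://myvip.com/browse.php?act=browse&pager=abc123&p=0' \
    'http://myvip.com/browse.php?act=browse&pager=abc123&p=1' > "$demo_list"
swap_pager_id "$demo_list" "xyz789"
cat "$demo_list"
rm -f "$demo_list"
```

Because a fresh request invalidates the previous pager ID and the ID expires after 20 minutes, the rewritten list would have to be grabbed right after the new ID has been obtained.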

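The login and "profile does not exist" checks mentioned in the notes are done in the script with `grep ... | wc -l | cut ...` pipelines. A more compact form using `grep -q` might look like the sketch below. The two marker strings are copied from the script; the function name `page_state` and the sample page are invented for illustration.

```bash
#!/bin/bash
# Marker strings copied from the script.
LOGIN_MARKER='<span class="btn-text">Bejelentkezés</span>'    # login button: we are logged out
MISSING_MARKER='Törölt, vagy nem létező felhasználó!'         # deleted or nonexistent user

# page_state FILE
# Hypothetical helper: prints "logged_out", "missing" or "ok" for a fetched page.
page_state() {
    if grep -qF "$LOGIN_MARKER" "$1"; then
        echo "logged_out"
    elif grep -qF "$MISSING_MARKER" "$1"; then
        echo "missing"
    else
        echo "ok"
    fi
}

# Try it on a fabricated page (not real myVIP output).
sample=$(mktemp)
printf '%s\n' '<a href="#"><span class="btn-text">Bejelentkezés</span></a>' > "$sample"
page_state "$sample"
rm -f "$sample"
```
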
MyVIP archiving bash script, written by user:bzc6p. It needs to be rewritten to conform to the ArchiveTeam framework and standards.

#!/bin/bash
# Discovers and downloads user content belonging to given user ID
# Accepts one or two parameters: a single ID, or two IDs, in which case the whole range is processed.
# Creates a WARC file with the profile content and a csv file with one line containing some identifying information about the user.
# Avatar pictures' links are collected for future downloading.

abort_wpull ()	# if wpull is redirected, content is wrong and therefore we shouldn't go on
{
  echo "> Wpull grabbed the wrong pages last time; you have probably lost authentication, or something else weird happened. Check the logs before going on."
  echo "> Aborted."
  rm temp1 temp2 temp3 temp4 temp41 temp42 temp5 acq_list list db myvip_script_lock 2>/dev/null
  mkdir ERROR 2>/dev/null
  mv $2.warc.gz ERROR
  mv $2.csv ERROR
  mv log.txt ERROR/log_$1.txt
  rm avatars/*av_$1
  unset MYVIP_NAME MYVIP_NICKNAME MYVIP_BIRTHDATE MYVIP_PERM_ADDRESS MYVIP_TEMP_ADDRESS MYVIP_URL MAXPAGE PAGER_PREFIX NUMALBUMS ALBUMID NUMIMAGES MYVIP_TEMP NEWPAGERID OLDPAGERID NUMCLUBS
  unset MYVIP_A MYVIP_B WPULL_OPTS WGET_OPTS
}

fetch ()     # In case wget has a DNS error (doesn't retry) or we've lost authentication.
{
    while true
    do
        wget $WGET_OPTS -O "$2" "$1" || { echo "> Probably an error in the connection. Sleeping 1 minute..."; sleep 60; continue; }
        if grep -q "<span class=\"btn-text\">Bejelentkezés</span>" "$2"; then    # login button present: we are logged out
            echo "> You have lost your authentication! Log in and export your cookies file again!"
            echo "> Sleeping 1 minute..."
            sleep 60
            echo "> Retrying..."
            continue
        fi
        break
    done
}

echo "*** myVIP user backup script ***"
[[ $1 =~ ^[0-9]+$ ]] || { echo "> First parameter wrong!"; echo "> Aborted."; exit 1; }
[[ -z $2 ]] || [[ $2 =~ ^[0-9]+$ ]] || { echo "> Second parameter wrong!"; echo "> Aborted."; exit 1; }
[[ -z $2 ]] || [[ $1 -le $2 ]] || { echo "> Parameters wrong: the first ID must not be greater than the second!"; echo "> Aborted."; exit 1; }
echo "> Looking for wpull..."
wpull --version > wpull_ver 2>/dev/null || { echo "> You don't have wpull installed! wpull is necessary for the script to run!"; echo "> Aborted"; rm wpull_ver; exit 1; }
[ `cat wpull_ver | cut -d"." -f 1` -lt 1 ] && { echo "> Your wpull version is too old (`cat wpull_ver`). The script needs at least wpull version 1.2 to run."; echo "> Aborted."; rm wpull_ver; exit 1; }
[ `cat wpull_ver | cut -d"." -f 1` -eq 1 -a `cat wpull_ver | cut -d"." -f 2` -lt 2 ] && { echo "> Your wpull version is too old (`cat wpull_ver`). The script needs at least wpull version 1.2 to run."; echo "> Aborted."; rm wpull_ver; exit 1; }
rm wpull_ver
echo "> Checking authentication..."
if [ `wget --load-cookies cookies.txt -q -O - http://myvip.com/profile.php | grep "Adatlap" | wc -l | cut -d" " -f 1` -lt 1 ]; then
    echo "> Authentication failed. Check your cookies file or your internet connection."; echo "> Aborted."; exit 1
fi
[ -e myvip_script_lock ] && { echo "> Another myVIP backup script seems to be running! Multiple instances of the script MUST NOT be run at the same time!"; echo "> It is possible, though, that the last run was interrupted. If you are sure no other myVIP backup script is running, issue 'rm myvip_script_lock' and retry."; echo "> Aborted."; exit 1; }
touch myvip_script_lock
mkdir avatars warcs logs index 2>/dev/null
MYVIP_A=$1
if [[ -z $2 ]]; then
    MYVIP_B=$1
    echo "> Backing up myVIP user profile $MYVIP_A"
else
    MYVIP_B=$2
    echo "> Backing up myVIP user profiles ${MYVIP_A}–${MYVIP_B}"
fi
WPULL_OPTS="--exclude-domains static.myvip.com,avatar.myvip.com --reject-regex infobar_frame|banner_bottombanner_frame -a log.txt --retry-connrefused --retry-dns-error --tries inf --waitretry 10 --timeout 30 --no-robots --progress none --load-cookies cookies.txt -p -H -Dmyvip.com --no-warc-keep-log --delete-after --database db --warc-append"      # options for wpull
WGET_OPTS="-q -a log.txt --retry-connrefused -e robots=off --tries 0 --waitretry 10 --timeout 30 --load-cookies cookies.txt"
for (( n = $MYVIP_A; n <= $MYVIP_B; n++ ))
do
    WARC_NAME=myvip_com_user_$n
    rm list acq_list 2>/dev/null
    echo "-------------------------------------------------------------------------------"
    unset MYVIP_NAME MYVIP_NICKNAME MYVIP_BIRTHDATE MYVIP_PERM_ADDRESS MYVIP_TEMP_ADDRESS MYVIP_URL MAXPAGE PAGER_PREFIX NUMALBUMS ALBUMID NUMIMAGES MYVIP_TEMP NEWPAGERID OLDPAGERID NUMCLUBS
    echo "> Fetching user page $n..."
    fetch `echo "http://myvip.com/profile.php?uid=$n"` "temp1"  # initial grab of user page  
    if [ `grep "Törölt, vagy nem létező felhasználó!" temp1 | wc -l | cut -d" " -f 1` -ne 0 ]; then       # if profile doesn't exist
	echo "> User profile doesn't exist, saving empty page..."
	echo ";;;;;http://myvip.com/profile.php?uid=$n" > $WARC_NAME.csv
	wpull $WPULL_OPTS --warc-file $WARC_NAME "http://myvip.com/profile.php?uid=$n"     # actual content grab
	if [ `grep "index\.php" log.txt | wc -l | cut -d" " -f 1` -ne 0 -o `grep "homeent\.php" log.txt | wc -l | cut -d" " -f 1` -ne 0 ]; then
	  abort_wpull $n $WARC_NAME
	  exit 1
	fi
	echo "> Empty profile page $n archived."
    else        # if user page exists
        echo "http://myvip.com/profile.php?uid=$n" >> list       # it will be grabbed
        # In the following lines, we parse the profile page for some identifying information. Everyone's details go into an index, so that people looking for their profile can find it easily. Multiple fields are necessary because several people may have the same name, and not everyone fills in all the fields. The index can be hidden or truncated later; the script should build it anyway.
        # We'll use semicolon as field separator, so we replace the possible semicolons with commas
        MYVIP_NAME=`grep -o "<span style='width:[0-9]*px;' class='pairs-key'>név:</span><span style='margin-left:[0-9]*px;' class='pairs-value'>[^<]*</span>" temp1 | sed "s/<span style='width:[0-9]*px;' class='pairs-key'>név:<\/span><span style='margin-left:[0-9]*px;' class='pairs-value'>//g" | sed "s/<\/span>//g" | sed "s/;/,/g"`
        MYVIP_NICKNAME=`grep -o "<span style='width:[0-9]*px;' class='pairs-key'>becenév:</span><span style='margin-left:[0-9]*px;' class='pairs-value'>[^<]*</span>" temp1 | sed "s/<span style='width:[0-9]*px;' class='pairs-key'>becenév:<\/span><span style='margin-left:[0-9]*px;' class='pairs-value'>//g" | sed "s/<\/span>//g" | sed "s/;/,/g"`
        MYVIP_BIRTHDATE=`grep -o "<span style='width:[0-9]*px;' class='pairs-key'>születési idő:</span><span style='margin-left:[0-9]*px;' class='pairs-value'>[^<]*</span>" temp1 | sed "s/<span style='width:[0-9]*px;' class='pairs-key'>születési idő:<\/span><span style='margin-left:[0-9]*px;' class='pairs-value'>//g" | sed "s/<\/span>//g" | cut -d" " -f1-3`
        MYVIP_PERM_ADDRESS=`grep -o "<span style='width:[0-9]*px;' class='pairs-key'>lakhely:</span><span style='margin-left:[0-9]*px;' class='pairs-value'>[^<]*</span>" temp1 | sed "s/<span style='width:[0-9]*px;' class='pairs-key'>lakhely:<\/span><span style='margin-left:[0-9]*px;' class='pairs-value'>//g" | sed "s/<\/span>//g" | sed "s/&gt;/>/g" | cut -d">" -f 3 | cut -d" " -f 2- | sed "s/;/,/g"`
        MYVIP_TEMP_ADDRESS=`grep -o "<span style='width:[0-9]*px;' class='pairs-key'>tartózkodási hely:</span><span style='margin-left:[0-9]*px;' class='pairs-value'>[^<]*</span>" temp1 | sed "s/<span style='width:[0-9]*px;' class='pairs-key'>tartózkodási hely:<\/span><span style='margin-left:[0-9]*px;' class='pairs-value'>//g" | sed "s/<\/span>//g" | sed "s/&gt;/>/g" | cut -d">" -f 3 | cut -d" " -f 2- | sed "s/;/,/g"`
        MYVIP_URL="http://myvip.com/profile.php?uid=$n"
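        echo "$MYVIP_NAME;$MYVIP_NICKNAME;$MYVIP_BIRTHDATE;$MYVIP_PERM_ADDRESS;$MYVIP_TEMP_ADDRESS;$MYVIP_URL" | sed "s/&quot,/\"/g" | sed "s/&lt,/</g" | sed "s/&gt,/>/g" | sed "s/&amp,/\&/g" > $WARC_NAME.csv       # decoding HTML entities (their semicolons were turned into commas above; &amp goes last to avoid double decoding, and the replacement & is escaped because a bare & means "the matched text" in sed); the result goes to a semicolon-separated file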
        echo "> Profile for user '$MYVIP_NAME' indexed."
        grep "loaded-image-userprofile_avatar" temp1 | grep -o "http[0-9a-zA-Z/\.?:_]*" | uniq | sed "s/\\\//g" >> list     # avatar pic
        if [ `grep -o "onclick='profile_gotopage(\"\",[0-9],[0-9]*); return false' class='rangepager-jump rangepager-jump-last'>" temp1 | wc -l | cut -d" " -f 1` -gt 0 ]; then
            NUMCLUBS=`grep -o "onclick='profile_gotopage(\"\",[0-9],[0-9]*); return false' class='rangepager-jump rangepager-jump-last'>" temp1 | cut -d"," -f 2`       # counting clublist pages
            if [[ ! $NUMCLUBS = "" ]]; then
		echo "> Parsing for club avatars..."
                for (( i = 0; i <= $NUMCLUBS; i++))
                do
		    echo -n $(($NUMCLUBS-$i))...	# print progress
                    echo "http://myvip.com/profile.php?act=getclubs&page=$i&uid=$n" >> list		# adding them to list
                    fetch `echo "http://myvip.com/profile.php?act=getclubs&page=$i&uid=$n"` "temp2"	# fetching to discover clubavatars
                    grep -o "img src=\"http://avatar\.myvip\.com/avatars/clubs[^\"]*\"" temp2 | cut -d'"' -f 2 >> avatars/clubav_$n
                done
                echo
            else
		grep -o "img src=\"http://avatar\.myvip\.com/avatars/clubs[^\"]*\"" temp1 | cut -d'"' -f 2 >> avatars/clubav_$n
            fi
        else
	    grep -o "img src=\"http://avatar\.myvip\.com/avatars/clubs[^\"]*\"" temp1 | cut -d'"' -f 2 >> avatars/clubav_$n
        fi
        grep -o "images.php?uid=[0-9]\+&imageid=[0-9]\+#imageview_container" temp1 | cut -d "'" -f 2 | sed "s/images\.php/http:\/\/myvip\.com\/images\.php/g" >> list   # links to pictures on profile page
        if [ `grep "dousercontacts" temp1 | wc -l | cut -d" " -f 1` -eq 0 ]; then     # does the user have acquaintances?
            echo "> User has no acquaintances."
            MAXPAGE=-1
        else
            echo "> Discovering acquaintances..."
            fetch `echo "http://myvip.com/search.php?act=dousercontacts&uid=$n"` "temp1"      # grabbing acq. list for discovering number of acq. pages  
            if [ `grep "rangepager-jump rangepager-jump-last rangepager-jump-disabled" temp1 | wc -l | cut -d" " -f 1` -eq 0 ]; then      # does the acq. list have more than one page?
                MAXPAGE=`grep "rangepager-jump rangepager-jump-last" temp1 | uniq | rev | cut -d"&" -f 1 | rev | cut -d"=" -f 2 | cut -d"'" -f 1`        # number of acq. pages
                PAGER_PREFIX=`grep "rangepager-jump rangepager-jump-last" temp1 | uniq | rev | cut -d"'" -f 4 | rev | cut -d "&" -f 1-2`     # url prefix for acq. pages, including a unique pager id
                for (( i = 0; i <= $MAXPAGE; i++ ))
                do
		    echo -n $(($MAXPAGE-$i))...	# print progress
                    echo "http://myvip.com/$PAGER_PREFIX&p=$i" >> acq_list       # urls for acquaintances pages. WE'LL MODIFY AND GRAB LATER!
                    fetch `echo "http://myvip.com/$PAGER_PREFIX&p=$i"` "temp2" 	# discovering profile avatars
                    grep -o "img src=\"http://avatar\.myvip\.com/avatars/users[^\"]*\"" temp2 | cut -d'"' -f 2 >> avatars/profav_$n
                done
                echo
            else
		grep -o "img src=\"http://avatar\.myvip\.com/avatars/users[^\"]*\"" temp1 | cut -d'"' -f 2 >> avatars/profav_$n
                MAXPAGE=0
                echo "http://myvip.com/browse.php?act=browse&pager=phant0mpag3r1d3nt1f13r&p=0" >> acq_list
            fi
            #echo "> Found $(( $MAXPAGE + 1 )) pages of acquaintances."	# We've already printed progress, deprecated
        fi
        echo "> Discovering images..."
        echo "http://myvip.com/images.php?uid=$n" >> list
        fetch `echo "http://myvip.com/images.php?uid=$n"` "temp1"    # fetching images page for discovery
        if [ `grep "A felhasználónak nincs nyilvános albuma!" temp1 | wc -l | cut -d" " -f 1` -ne 0 ]; then       # does the user have images?
            echo "> User has no public images."
        else
            grep "images.php?albumid" temp1 | cut -d'"' -f 2 | cut -d"/" -f 2 | uniq > temp2      # collecting direct album links' postfixes
            # cut -d"=" -f 2 temp2 | cut -d"&" -f 1 > albumids_$n      # collecting albumids (probably not necessary)
            echo "> User has `wc -l temp2 | cut -d" " -f 1` public albums."
            sed "s/images\.php/http:\/\/myvip\.com\/images\.php/g" temp2 >> list      # add myvip.com prefix
            grep -o "/images.php?uid=[0-9]\+&albumid=[0-9]\+&imageid=[0-9]\+&getcontent=album&isajax=1" temp1 | sed "s/\/images\.php/http:\/\/myvip\.com\/images\.php/g" >> list      # collecting browser thumbnail album links
            grep -o "'/images.php?uid=[0-9]\+&albumid=[0-9]\+'" temp1 | cut -d "'" -f 2 | sed "s/\/images\.php/http:\/\/myvip\.com\/images\.php/g" | uniq >> list     # get other kind of direct links to albums
            grep -o "/images.php?uid=[0-9]\+&albumid=[0-9]\+&getcontent=album&isajax=1" temp1 | sed "s/\/images\.php/http:\/\/myvip\.com\/images\.php/g" | uniq > temp3      # collecting browser album links
            cat temp3 >> list     # we'll grab them too
            NUMALBUMS=`wc -l temp3 | cut -d" " -f 1`
            for (( h = 1; h <= $NUMALBUMS; h++))
            do
                echo "> Discovering content of album $h/$NUMALBUMS..."
                fetch `head -$h temp3 | tail -1` "temp4"      # fetch albums' embedded pages 
                grep -o "/images.php?uid=[0-9]\+&imageid=[0-9]\+&getcontent=img&isajax=1" temp4 | sed "s/\/images\.php/http:\/\/myvip\.com\/images\.php/g" > temp5      # collect image page postfixes & add myvip.com prefix
                echo "> User has `wc -l temp5 | cut -d" " -f 1` images in this album."
                cat temp5 >> list      # add them to list
                grep -o "<div class=\"thumbnail-commentcnt\">[^<]*</div>" temp4 | cut -d">" -f 2 | cut -d"<" -f 1 > temp41   # list of number of comments
                grep -o "/images.php?uid=[0-9]\+&imageid=[0-9]\+&getcontent=img&isajax=1" temp4 > temp42    # list of image pages, in the same order
                ALBUMID=`head -$h temp3 | tail -1 | grep -o "albumid=[0-9]\+" | cut -d"=" -f 2`
                NUMIMAGES=`wc -l temp41 | cut -d" " -f 1`
                for (( i = 1; i <= $NUMIMAGES; i++))
                do
                    if [ `head -$i temp41 | tail -1` -gt 20 ]; then
                        echo "http://myvip.com/images.php?imageid=`head -$i temp42 | tail -1 | cut -d'&' -f 2 | cut -d'=' -f 2`&albumid=$ALBUMID&uid=$n&isajax=1&getcontent=comments" >> list       # get comments
                    fi
                done
            done
        fi
        echo "> Downloading discovered content..."
	wpull $WPULL_OPTS --warc-file $WARC_NAME -i list     # actual content grab
	if [ `grep "index\.php" log.txt | wc -l | cut -d" " -f 1` -ne 0 -o `grep "homeent\.php" log.txt | wc -l | cut -d" " -f 1` -ne 0 ]; then
	  abort_wpull $n $WARC_NAME
	  exit 1
	fi
	if [ $MAXPAGE -ne -1 ]; then
	  echo "> Downloading acquaintances pages"
	  OLDPAGERID=`head -1 acq_list | grep -o "pager=[0-9a-z]*" | cut -d"=" -f 2`
	  echo "http://myvip.com/search.php?act=dousercontacts&uid=$n" > list	# one URL to find out current pager ID
	  wpull $WPULL_OPTS --warc-file $WARC_NAME -i list
	  if [ `grep "index\.php" log.txt | wc -l | cut -d" " -f 1` -ne 0 -o `grep "homeent\.php" log.txt | wc -l | cut -d" " -f 1` -ne 0 ]; then
	    abort_wpull $n $WARC_NAME
	    exit 1
	  fi
	  if [ `grep "Fetching ‘http://myvip.com/browse.php?pager=[0-9a-z]*&p=0’ encountered an error" log.txt | wc -l | cut -d" " -f 1` -ne 0 ]; then
	    echo "> A rare problem occurred. The grab of this user profile must be restarted."
	    rm acq_list temp1 temp2 temp3 temp4 temp41 temp42 temp5 list db $WARC_NAME.warc.gz $WARC_NAME.csv log.txt avatars/profav_$n avatars/clubav_$n 2>/dev/null
            ((n--))
	    cat STOP 2>/dev/null && ((n=$MYVIP_B))
	    continue
	  fi
	  NEWPAGERID=`grep "pager" log.txt | tail -1 | cut -d"=" -f 2 | cut -d"&" -f 1`
	  sed -i -e "s/$OLDPAGERID/$NEWPAGERID/g" acq_list
	  mv acq_list list
	  wpull $WPULL_OPTS --warc-file $WARC_NAME -i list	# needed so that wpull surely uses the old database
	  if [ `grep "index\.php" log.txt | wc -l | cut -d" " -f 1` -ne 0 -o `grep "homeent\.php" log.txt | wc -l | cut -d" " -f 1` -ne 0 ]; then
	    abort_wpull $n $WARC_NAME
	    exit 1
	  fi
	fi
	echo "> myVIP profile of user '$MYVIP_NAME' (id $n) has been successfully archived!"
    fi
    rm temp1 temp2 temp3 temp4 temp41 temp42 temp5 list db 2>/dev/null
    mv $WARC_NAME.warc.gz warcs
    mv $WARC_NAME.csv index
    mv log.txt logs/log_$n.txt
    cat STOP 2>/dev/null && ((n=$MYVIP_B))	# if STOP file is present, we stop the loop
done
unset MYVIP_NAME MYVIP_NICKNAME MYVIP_BIRTHDATE MYVIP_PERM_ADDRESS MYVIP_TEMP_ADDRESS MYVIP_URL MAXPAGE PAGER_PREFIX NUMALBUMS ALBUMID NUMIMAGES MYVIP_TEMP NEWPAGERID OLDPAGERID NUMCLUBS
unset MYVIP_A MYVIP_B WPULL_OPTS WGET_OPTS
rm myvip_script_lock
exit 0
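
The wpull version check near the top of the script reads a temporary file several times with `cut`. A more compact way to express the same "at least 1.2" test might look like the sketch below. `version_at_least` is a hypothetical helper, not part of the script, and the version strings in the example are literals, so wpull does not need to be installed to try it.

```bash
#!/bin/bash
# version_at_least VERSION MAJOR MINOR
# Hypothetical helper: succeeds if VERSION (e.g. "1.2.3") is at least MAJOR.MINOR.
version_at_least() {
    local major minor rest
    IFS=. read -r major minor rest <<< "$1"
    # Keep only the leading digits, so a suffix such as "2b1" does not break the arithmetic.
    major=${major%%[!0-9]*}
    minor=${minor%%[!0-9]*}
    # 10# forces base 10, so a component like "08" is not read as octal.
    (( 10#${major:-0} > $2 || ( 10#${major:-0} == $2 && 10#${minor:-0} >= $3 ) ))
}

# Example with literal version strings.
for v in 0.13 1.1.9 1.2 2.0; do
    if version_at_least "$v" 1 2; then
        echo "$v: new enough"
    else
        echo "$v: too old"
    fi
done
```

In the script itself this could be called as `version_at_least "$(wpull --version)" 1 2`, assuming, as the script already does, that `wpull --version` prints a bare version number.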