User:Nemo bis/Tumblr

From Archiveteam

A simple setup for running the Tumblr grab scripts under tmux on 8-core virtual servers with a 1 TB disk (e.g. from a GCE trial). See the concurrency suggested by trvz.

Partition and format the disk:

 sudo fdisk /dev/sdb ; sudo mkfs.ext4 /dev/sdb1
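Since mkfs.ext4 destroys whatever is on the target partition, it may be worth checking that the device is not already mounted before formatting it. The following is only a sketch: the helper name is_mounted is hypothetical, and it assumes the mount table is readable at /proc/mounts (a different file can be passed as second argument).

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the original setup.
# Succeeds (exit 0) if device $1 appears in the mount table $2,
# which defaults to /proc/mounts.
is_mounted() {
  awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Example use (commented out because mkfs is destructive):
# if is_mounted /dev/sdb1; then
#   echo "/dev/sdb1 is mounted, refusing to format" >&2
# else
#   sudo mkfs.ext4 /dev/sdb1
# fi
```

The check only compares the device name literally, so a partition mounted through another name (for instance a /dev/disk/by-uuid path) would not be detected.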

Install all the things:

 sudo apt install -y atop tmux iftop git-core libgnutls28-dev lua5.1 liblua5.1-0 liblua5.1-0-dev screen python-dev python-pip bzip2 zlib1g-dev flex autoconf
 sudo pip install seesaw
 # mount first, then chown: ownership set on the empty mount point is hidden once the new filesystem is mounted over it
 sudo mkdir -p /mnt/at ; sudo mount /dev/sdb1 /mnt/at ; sudo chown nemobis /mnt/at
 cd /mnt/at ; git clone https://github.com/ArchiveTeam/tumblr-grab.git ; cd tumblr-grab ; ./get-wget-lua.sh

Launch the scripts in tmux windows in a single session (10 directories at a concurrency of 20 each, 200 in total):

 tmux new-session -d atop ; for i in {1..10}; do tmux new-window -n t$i -d "cd /mnt/at ; git clone https://github.com/ArchiveTeam/tumblr-grab.git tumblr$i ; cd tumblr$i ; run-pipeline --concurrent 20 --disable-web-server --auto-update pipeline.py Nemo" ; done
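To try other combinations of window count and per-window concurrency, a dry-run helper along these lines can print the tmux commands instead of running them. It is a sketch only: the function names and the WINDOWS, PER_WINDOW, NICK and BASE variables are made up for illustration, and it assumes the tumblr1..tumblrN directories have already been cloned.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: prints commands, runs nothing.
WINDOWS=${WINDOWS:-10}        # number of tmux windows / directories
PER_WINDOW=${PER_WINDOW:-20}  # value passed to --concurrent in each window
NICK=${NICK:-Nemo}            # nickname passed to pipeline.py
BASE=${BASE:-/mnt/at}         # parent of the tumblrN directories

# Total concurrency is simply windows times per-window concurrency.
total_concurrency() {
  echo $(( $1 * $2 ))
}

# Print one "tmux new-window" command per directory.
print_launch_commands() {
  local windows=$1 per_window=$2 nick=$3 base=$4 i
  for (( i = 1; i <= windows; i++ )); do
    printf 'tmux new-window -n t%d -d "cd %s/tumblr%d ; run-pipeline --concurrent %d --disable-web-server --auto-update pipeline.py %s"\n' \
      "$i" "$base" "$i" "$per_window" "$nick"
  done
}

echo "total concurrency: $(total_concurrency "$WINDOWS" "$PER_WINDOW")"
print_launch_commands "$WINDOWS" "$PER_WINDOW" "$NICK" "$BASE"
```

Once the printed commands look right, the output can be piped to sh inside an existing tmux session.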

Attach to the session (see other tmux basics and how to relaunch en masse):

 tmux a
 # press "p" in atop to see how much CPU wget-lua is consuming overall etc.

The main limit to speed is often the number of IP addresses and I/O wait time, rather than CPU and concurrency: see diggan's scripts to spawn warriors on multiple DigitalOcean instances.
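For a rough idea of whether I/O wait is the bottleneck, the share of CPU time spent in iowait can be read from /proc/stat. This is a sketch under assumptions: the helper name iowait_percent is hypothetical, it relies on the Linux /proc/stat layout (user, nice, system, idle, iowait, ... on the aggregate "cpu" line), and the figure is an average since boot rather than a current reading, so atop remains the better tool for live monitoring.

```shell
#!/usr/bin/env bash
# Hypothetical helper: percentage of CPU time spent in iowait since boot.
# $1 = stat file to read (defaults to /proc/stat).
iowait_percent() {
  awk '$1 == "cpu" {
         total = 0
         # fields 2..9 are user, nice, system, idle, iowait, irq, softirq, steal
         for (i = 2; i <= 9 && i <= NF; i++) total += $i
         if (total > 0) printf "%.1f\n", 100 * $6 / total
       }' "${1:-/proc/stat}"
}

# Only call it where /proc/stat exists (i.e. on Linux).
if [ -r /proc/stat ]; then
  echo "iowait since boot: $(iowait_percent)%"
fi
```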