INTERNETARCHIVE.BAK/admin

From Archiveteam

Server

The http://iabak.archiveteam.org server is provided by Kenshin. Closure and db48x are root. db48x set up graphite and some of the web page, and closure set up most of the rest.

Configuration management

Since we want to be able to scale to multiple server instances when the time comes, most (but possibly not all) of the server's configuration is handled by a configuration management system. Closure set this up using propellor (https://propellor.branchable.com), which he also wrote.

To set up your laptop to be able to configure the server with propellor, first install propellor on your laptop ("sudo apt-get install propellor"). Then "git clone https://github.com/ArchiveTeam/IA.BAK.git ~/.propellor". See the config.hs and IABak.hs files for the configuration.

To run propellor to update the server, you need to be able to ssh into the server as root. You also need a gpg key, which will be used to access propellor's encrypted data. Generate the gpg key on your laptop by running "gpg --gen-key". Someone whose gpg key has already been added to propellor then needs to add yours by running "propellor --add-key yourkeyid".

After making changes to the propellor configuration on your laptop, you can deploy them to the server by running "propellor --spin iabak.archiveteam.org"

Server scripts

The iabak git repository (https://github.com/ArchiveTeam/IA.BAK) has a "server" branch which contains scripts used on the server. This repository is checked out on the server in /usr/local/IA.BAK/

This is where the code lives for things like updating the stats on the web page, handling new user registration, and sending shard expiry warning emails.

Creating new shards

This assumes you have an account on the server. We're looking for SHARDMASTERS, so step right up..!

Creating a new shard is a five-step process:

  • Pick collections to include in the shard.
  • Collect metadata about all the files in the items in those collections.
  • Create a shard git repository from the metadata.
  • Install the shard git repository on the server.
  • Update the repolist to include the repository, so clients will begin using it.

Pick collections

Pick some collections that need to be backed up, and have not been backed up before.

We don't have any kind of a list or index of already included collections -- yet -- so make sure existing shards don't already include your collections. On the iabak server, /srv/shard/ contains clones of all the shards git repos, so you can look around in there to check.

For scalability reasons, a shard should not contain more than around 100,000 files. It also tends to work best when the total size of the files in a shard is in the 1-4 TB range. It takes some guesswork to pick a combination of collections that meets these targets, so expect it to take a few tries.

Closure generated some candidate shards, which all have a reasonable number of files in them, but the collections are machine-selected and the disk size of these may be too large or small. They are in /var/www/html/candidateshards/ on the server. (If you use one of these lists to create a shard, delete it afterwards to avoid dups.)

Collect shard metadata

For this, you will need a clone of the iabak git repository, with the server branch checked out. You can do this on any machine you like; it doesn't have to be the server.

Formerly we used metadata collected from a complete census of the Internet Archive. This is the 21 GB file named md5_collection_url.txt.pick1.sorted.uniq on the server in /home/joey/IA.BAK.

   git clone git@github.com:ArchiveTeam/IA.BAK.git
   cd IA.BAK
   ln -s /home/joey/IA.BAK/md5_collection_url.txt.pick1.sorted.uniq

As the complete census is no longer maintained (new tools have replaced it), we keep it around mostly as a historical curiosity and a fall-back option.

You will instead want to collect up-to-date metadata by directly querying IA using their command-line tools. iamine is the simplest, and ia is newer and more complete. Examine the split-collection script for an example of how they are used, and for how we use jq to process the JSON metadata into the simpler TSV format we need.

   split-collection archivebot

As you build shards you will likely create your own scripts for manipulating them into the metadata files needed for the next step; check these into the repository as well so that we can all use them.

The output of this step is always a tab-separated file with four columns, which the next step consumes. It lists every file that will be put into the git annex repository. The columns are the MD5 hash of the file, the file size in bytes, the primary collection containing the item the file belongs to, and the URL of the file:

   e29cee71f76f62a9c64a98e1b7ad7a7a        3750726 archiveteam-fire        https://archive.org/download/2010-reddit-research/affinities.dump.bz2
   b7a682256a43d7b2b727b2477963fce5        6870    archiveteam-fire        https://archive.org/download/2010-reddit-research/mr_tools.py
   28794925f7e3a15f21b677e4a63df273        1526898 archiveteam-fire        https://archive.org/download/2010-reddit-research/affinities-matrix.tar.bz2
   06b927bcce0572e294d4fc67b0368bbf        1554    archiveteam-fire        https://archive.org/download/2010-reddit-research/srrecs.pig
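
Before building a shard from such a file, it can help to check it against the targets from "Pick collections" (around 100,000 files at most, 1-4 TB in total). The following is a minimal sketch, not part of the IA.BAK repository: the function names shard_summary and shard_within_targets are made up for illustration, and it assumes that the second column is a size in bytes and that 1 TB means 10^12 bytes.

```shell
# Hypothetical helpers, not part of IA.BAK.
# Expected columns (tab-separated): md5, size in bytes, collection, url.

# Print the number of files, the total size, and the number of rows
# that do not have exactly four tab-separated columns.
shard_summary() {
    awk -F '\t' '
        NF != 4 { bad++; next }      # count and skip malformed rows
        { files++; bytes += $2 }
        END { printf "files=%d bytes=%.0f malformed=%d\n", files, bytes, bad }
    ' "$1"
}

# Succeed only if the file is within the rough targets:
# at most 100,000 files and between 1 TB and 4 TB in total (assumed 10^12 bytes per TB).
shard_within_targets() {
    awk -F '\t' '
        NF == 4 { files++; bytes += $2 }
        END {
            ok = (files <= 100000 && bytes >= 1000000000000 && bytes <= 4000000000000)
            exit !ok
        }
    ' "$1"
}
```

For example, "shard_summary mylist.tsv" prints one summary line, and "shard_within_targets mylist.tsv && echo ok" prints ok only when both targets are met (mylist.tsv is a placeholder name). mkSHARD still reports the authoritative numbers through git annex info once the shard is built.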


Create shard git repository

Now you can run mkSHARD to create a shard. It takes two parameters: the first is a TSV file as described above, and the second is the shard number.

So, for example:

   ./mkSHARD /var/www/html/candidateshards/smallestfirst83.lst 12

It will take a while! Eventually, you'll get a SHARDn.git repository created in the current directory. The `git annex info` of the repository will also be displayed. Pay attention to the total size of the shard, and the number of files in it. If it's too big/too small, you can rm -rf the SHARDn.git and try again with a different set of collections.

Install shard git repository

Still in the same directory, run setupshardrepo (https://github.com/ArchiveTeam/IA.BAK/blob/server/setupshardrepo) to install the shard. Its first parameter is the full name of the shard (e.g. "SHARD12"), and the second parameter is the full path to the shard's git repository.

For example:

   sudo ./setupshardrepo SHARD12 `pwd`/SHARD12.git

TODO: Obviously this needs sudo access, so something needs to be done to give SHARDMASTERs sudo access, or make it not need that.

Note also that the path to the git repository must be an absolute path (we ought to make the script smarter rather than relying on ourselves to remember that, but it's not actually been done yet).

This creates a user named e.g. SHARD12, and installs the repo in their home directory. It also sets up the ssh configuration for this new user to allow downloaders to clone the repository.
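
Until the script handles relative paths itself, a small wrapper can guard against forgetting the absolute path. This is only a sketch: abs_path is a hypothetical helper, not something in the IA.BAK repository, and it simply prefixes the current directory the same way the `pwd` example above does.

```shell
# Hypothetical helper, not part of IA.BAK.
# Print the argument unchanged if it is already absolute,
# otherwise prefix it with the current working directory.
abs_path() {
    case "$1" in
        /*) printf '%s\n' "$1" ;;
        *)  printf '%s/%s\n' "$(pwd)" "$1" ;;
    esac
}
```

With that defined, the install step could be written as: sudo ./setupshardrepo SHARD12 "$(abs_path SHARD12.git)"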

Update repolist

Finally, time to tell clients to use the new shard. In the iabak git repo, check out the master branch, and edit the repolist file. See repolist.README for details about this file. You will probably want to add the new shard in reserve state to start with.

For example:

   shard12 SHARD12@iabak.archiveteam.org:shard12 reserve

Commit and push, and as clients update they will become aware of the shard.

Also, the iabak website will add the shard to the display on its next update.
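
To avoid typos in the three places the shard number appears, the line can be generated. The sketch below assumes the three-field layout seen in the example line above; repolist.README remains the authority on the format, and repolist_entry is a made-up name.

```shell
# Hypothetical helper, not part of IA.BAK.
# Print a repolist line for shard number $1 in state $2 (default: reserve).
# The layout is inferred from the example entry, not from repolist.README.
repolist_entry() {
    printf 'shard%s SHARD%s@iabak.archiveteam.org:shard%s %s\n' \
        "$1" "$1" "$1" "${2:-reserve}"
}
```

Running "repolist_entry 12 >> repolist" in a checkout of the master branch would append the entry shown in the example, ready to be reviewed, committed, and pushed.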

Adjusting repolist states

From time to time, shards get sufficiently backed up that they no longer need to be marked as active in the repolist, and can be set to maint. Or, a shard in maint may lose redundancy, and need to go back to active to get some more clients to use it. We generally want 2-3 shards in active mode at a time, and the rest in maint, with a few new shards in reserve. See the stats on the website to know when changes need to be made. Then just edit the iabak repository's repolist file, and commit it.
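
A state change like this can also be scripted. The following is a rough sketch, assuming the state is the third whitespace-separated field of each repolist line (as in the SHARD12 example); set_shard_state is a hypothetical name. It writes the modified list to standard output so the result can be reviewed before it replaces the file.

```shell
# Hypothetical helper, not part of IA.BAK.
# $1 = repolist file, $2 = shard name (e.g. shard12), $3 = new state.
# Lines for other shards are printed unchanged.
set_shard_state() {
    awk -v name="$2" -v state="$3" '
        $1 == name { $3 = state }   # assumes the state is field 3
        { print }
    ' "$1"
}
```

For instance, "set_shard_state repolist shard12 maint > repolist.new" followed by a diff against repolist shows exactly what would change before committing.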

Trimming unavailable files

Sometimes a shard will get almost all files backed up to enough clients, but a few files will not get backed up at all. This can happen if the IA darks an item or deletes a file after the survey, so that it's no longer available to download. This makes the stats look bad and wastes client time retrying the download again and again.

One way to deal with this is to go into the git repository for the shard and delete the files that are not available from the IA. Commit and push it back, and done.

Thing is, the IABak system does not allow such changes to shard git repos to be pushed, because we don't want users messing with the shards. So, this has to be done on the server. There are checkouts of all the shard git repos under /srv/shard/, and changes can be made in there. IIRC, I temporarily moved /home/SHARDn/shardn.git/hooks/update out of the way to allow the git push of that change to work, putting it back afterwards of course. There is probably a better way.
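
If the hook is moved aside by hand, it is easy to forget to put it back. One possible way to make that step safer is sketched below. The with_hook_disabled name and the ".disabled" suffix are made up for illustration, and the sketch does not handle being interrupted by a signal.

```shell
# Hypothetical helper, not part of IA.BAK.
# Move a hook file aside, run the given command, then restore the hook
# whether or not the command succeeded. Returns the command's exit status.
with_hook_disabled() {
    hook=$1
    shift
    mv "$hook" "$hook.disabled" || return 1
    status=0
    "$@" || status=$?
    mv "$hook.disabled" "$hook"
    return "$status"
}
```

From a checkout under /srv/shard/, this might be used as: with_hook_disabled /home/SHARD12/shard12.git/hooks/update git push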

(Something should be done to handle this automatically.)