INTERNETARCHIVE.BAK/torrents implementation

From Archiveteam

Split the IA into 42000 chunks of 500 GB, each a zip file.

Make 42000 torrents.

Make an interface to suggest a torrent, at random (or the one most needing seeds), to a user.
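
A minimal sketch of the "most needing seeds" choice, assuming the interface already has a table of seed counts per torrent. The function name `suggest_torrent` and the `seed_counts` mapping are hypothetical; how the counts are collected (for example by scraping a tracker) is left open here.

```python
import random

def suggest_torrent(seed_counts, rng=random):
    """Return the id of a torrent to suggest to a user.

    seed_counts maps torrent id -> number of seeds currently known.
    The torrent with the fewest seeds wins; ties are broken at random,
    so an all-equal table degrades to a purely random suggestion.
    """
    if not seed_counts:
        raise ValueError("no torrents to suggest")
    fewest = min(seed_counts.values())
    candidates = [tid for tid, n in seed_counts.items() if n == fewest]
    return rng.choice(candidates)
```

For example, `suggest_torrent({"a": 3, "b": 1, "c": 2})` picks `"b"`, the torrent with the fewest seeds.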

Let users add one or more torrents, and seed.

Every 500 GB added/changed in the Internet Archive, make a new zip file, and torrent, and wait for some users to add that one. (Maybe needs a mechanism to ensure that users who have free space remember to check for new torrents.)

This seems like the simplest possible solution.

Except for one little problem: How do you get thousands of 500 GB torrents initially seeded? The IA cannot seed all those torrents at the same time itself; it would need to store all the zip files, and this would double the IA's disk usage.

A solution is to start by making torrent #1. Get it seeded from the IA. Once there are enough peers that it's considered healthy, delete its zip file from the IA, and let the peers take over seeding it. The IA moves on to create torrent #2, etc.
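
The one-at-a-time scheme above can be sketched as a small state machine. This is only an illustration: the names `BootstrapState` and `step` are hypothetical, and the threshold of 4 peers for "healthy" is an assumed placeholder, since the text does not say how many peers are enough.

```python
from dataclasses import dataclass

@dataclass
class BootstrapState:
    current: int = 1         # number of the torrent the IA is seeding now
    ia_has_zip: bool = True  # whether the IA still stores that zip file

def step(state, peer_seeds, healthy_seeds=4, total=42000):
    """Advance the bootstrap after one check of the current swarm.

    peer_seeds is the number of seeds other than the IA.
    healthy_seeds is an assumed threshold, not a value from the plan.
    """
    if peer_seeds < healthy_seeds:
        return state  # not healthy yet: the IA keeps seeding this torrent
    if state.current >= total:
        # Last torrent is healthy: delete its zip, nothing left to create.
        return BootstrapState(state.current, ia_has_zip=False)
    # Delete the current zip and create the zip for the next torrent.
    return BootstrapState(state.current + 1, ia_has_zip=True)
```

At any moment the IA stores at most one zip file, which is the point of the scheme.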

What if torrent #1 eventually loses all its seeds? In this situation, we can do one of these things:

  • Send out a call (possibly automated) for anyone who has a copy of the file to bring it online and get it seeded again.
  • Try to recreate the original torrent using files from the IA. If we succeed, start seeding it again from the IA until it gets enough healthy peers.
  • Sometimes we won't be able to recreate the torrent. IA items go dark, or are modified, and without the identical files that went in, we can't. So give up on torrent #1, and make a new torrent #1B that contains all the files we could find that were in torrent #1. Get #1B seeded.
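
Choosing between the last two options comes down to checking whether every file that went into the torrent is still available unchanged. A rough sketch, assuming a manifest of content hashes was kept when the torrent was made (the function name `plan_recovery` and both mappings are hypothetical):

```python
def plan_recovery(original, current):
    """Decide how to bring back a torrent that has lost its seeds.

    original maps file path -> content hash recorded when the torrent
    was created; current maps file path -> hash of what the IA holds now.
    Returns (action, files), where files lists the paths still identical.
    """
    unchanged = sorted(p for p, h in original.items() if current.get(p) == h)
    if len(unchanged) == len(original):
        # Every input is byte-identical, so the same torrent can be rebuilt.
        return ("recreate_original", unchanged)
    # Something went dark or was modified: build a "#1B" from what is left.
    return ("make_replacement", unchanged)
```

If an item has gone dark entirely, `current` has no entry for its files and they simply drop out of the replacement torrent.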


Note that some BitTorrent trackers have torrents that sum to a larger total size than this, seeded healthily. Their torrents tend to be smaller than 500 GB though.

The Geocities torrent, at 900 GB, was an exceedingly large torrent, and there was some trouble keeping it seeded.

At 500 GB, this leaves out users who have some smaller fraction of a disk available to donate. This might reduce contributors significantly. A smaller chunk size might be better.
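
The trade-off can be put in numbers. The sketch below takes the total implied by the plan above (42000 chunks of 500 GB) and shows how chunk size affects both the number of torrents and how much of a user's free space is usable. The function names are hypothetical, and the 100 GB alternative is only an example, not a proposal from the text.

```python
TOTAL_GB = 42000 * 500  # total implied by the plan: 21,000,000 GB

def torrents_needed(chunk_gb):
    """Number of torrents needed to cover the total at this chunk size."""
    return -(-TOTAL_GB // chunk_gb)  # ceiling division

def usable_gb(free_gb, chunk_gb):
    """Space a user can actually donate: whole chunks only."""
    return (free_gb // chunk_gb) * chunk_gb

# A user with 300 GB free contributes nothing at 500 GB chunks,
# but all 300 GB at 100 GB chunks, at the cost of 5x as many torrents.
print(usable_gb(300, 500), usable_gb(300, 100))
print(torrents_needed(500), torrents_needed(100))
```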

The user needs to keep their torrent client running, or they won't be counted as a seed. Offline or rarely online storage can be used, but won't be counted. So counting seeds will undercount the number of copies.

.zip files don't recover well if some piece in the middle is missing. It would be better to use a file format that allows extracting the available files when part of it is missing.

a simplification

Every IA item already has a torrent associated with it. The torrent includes the derived files, but that can be amended (each item could have the current torrent plus one that includes only original files). The simplest possible solution then is to get a few seeders into each of these swarms (IA is used as a web seed). One way to accomplish that is to write a custom BitTorrent client which automates the process of deciding which swarms each user joins, allows the user to decide how much space to use, etc. A custom BitTorrent client wouldn't be a very simple thing on its own, but it could be quite simple for users who just want to donate some space without having to think about BitTorrent.

This has the additional advantages of storing the backed-up files on disk in a format which is readily usable by the user, and of requiring little to no additional work on IA's part.
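
The part of such a client that decides which swarms a user joins could be as simple as a greedy fill of the user's space budget, preferring items with the fewest seeders. This is a sketch of one possible policy, not a description of any existing client; `choose_swarms` and the tuple layout are hypothetical.

```python
def choose_swarms(items, budget_bytes):
    """Pick item swarms for one user to join.

    items is a list of (identifier, size_bytes, seeders) tuples.
    Items with the fewest seeders are considered first (smaller items
    first among equals), and an item is taken only if it still fits
    in what remains of the user's budget.
    """
    chosen = []
    remaining = budget_bytes
    for identifier, size, seeders in sorted(items, key=lambda it: (it[2], it[1])):
        if size <= remaining:
            chosen.append(identifier)
            remaining -= size
    return chosen
```

A greedy fill does not pack the budget optimally, but it keeps the priority on poorly seeded items, which is the goal here.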

> Except, there are millions of IA items. Even with a custom controller, that presents problems such as: scalability when loading half a million torrents in a torrent client; tracker scalability; analyzing so many torrents to find ones that need more seeders assigned, etc --closure


As closure notes, a torrent swarm uses a constant, low level of traffic to find out who has which pieces and whatnot, so a client will not be able to join thousands of swarms. This makes the "simpler" solution need more users, each donating less storage, so as not to swamp each contributor's bandwidth and local memory. Possibly this traffic drops off to manageable levels once the swarm consists entirely of seeders, provided that the seeders don't leave and rejoin the swarm too frequently. Someone will need to run the numbers and find out.
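
The shape of that calculation is simple even though the inputs are unknown. In the sketch below every number passed in is a placeholder: the per-swarm overhead would have to be measured, and the function names are hypothetical.

```python
def max_swarms(bandwidth_budget_bps, per_swarm_overhead_bps):
    """How many swarms one client can stay in within a bandwidth budget."""
    if per_swarm_overhead_bps <= 0:
        raise ValueError("overhead must be positive")
    return int(bandwidth_budget_bps // per_swarm_overhead_bps)

def users_needed(total_swarms, swarms_per_user, copies=1):
    """Users required so that every swarm gets `copies` seeders."""
    return -(-(total_swarms * copies) // swarms_per_user)  # ceiling division

# Placeholder inputs only: 10,000 B/s of budget and 50 B/s of idle
# overhead per swarm are made-up figures, not measurements.
per_user = max_swarms(10_000, 50)
print(per_user, users_needed(1_000_000, per_user, copies=3))
```

Memory per loaded torrent would need the same kind of estimate, and whichever limit is lower sets the number of swarms per user.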

On the other hand, he also notes that dividing the archive up into fewer torrents would require quite a significant amount of disk space, as the items are not already organized along those lines.