Difference between revisions of "INTERNETARCHIVE.BAK/torrents implementation"

From Archiveteam



Revision as of 05:02, 5 March 2015

Split the IA into 42000 chunks of 500 GB, each stored as a zip file.

Make 42000 torrents.

Make an interface to suggest a torrent, at random (or the one most needing seeds), to a user.

Let users add one or more torrents, and seed.

For every 500 GB added or changed in the Internet Archive, make a new zip file and torrent, and wait for some users to add it. (This may need a mechanism to ensure that users who have free space remember to check for new torrents.)

This seems like the simplest possible solution.
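
The suggestion step above could be sketched roughly as follows. This is only an illustration: the torrent ids and the `seed_counts` mapping are hypothetical, and it assumes some service already knows how many seeds each torrent has.

```python
import random

def suggest_torrent(seed_counts, rng=random):
    """Pick a torrent to offer to the next user.

    seed_counts maps a (hypothetical) torrent id to its current seed count.
    The torrent with the fewest seeds is chosen; ties are broken at random.
    """
    if not seed_counts:
        raise ValueError("no torrents to suggest")
    fewest = min(seed_counts.values())
    neediest = [tid for tid, n in seed_counts.items() if n == fewest]
    return rng.choice(neediest)

# Made-up example data, not real swarm statistics.
seed_counts = {"chunk-00001": 3, "chunk-00002": 0, "chunk-00003": 5}
print(suggest_torrent(seed_counts))
```

If every torrent has the same seed count, this degenerates to a purely random suggestion, which covers both options named in the proposal.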

comments

Note that some BitTorrent trackers have torrents that sum to a larger total size than this and are healthily seeded. Their torrents tend to be smaller than 500 GB, though.

The Geocities torrent, at 900 GB, was an exceedingly large torrent, and there was some trouble keeping it seeded.

At 500 GB, this leaves out users who have some smaller fraction of a disk available to donate. This might reduce contributors significantly. A smaller chunk size might be better.
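
The trade-off can be made concrete with a little arithmetic. The sketch below takes the total implied by the proposal (42000 chunks of 500 GB) and counts how many torrents other chunk sizes would need; the alternative sizes are arbitrary examples, not recommendations.

```python
# Total size implied by the proposal: 42000 chunks x 500 GB.
TOTAL_GB = 42_000 * 500

def torrents_needed(chunk_gb, total_gb=TOTAL_GB):
    """Number of chunks (and so torrents) needed, rounding up."""
    return -(-total_gb // chunk_gb)  # integer ceiling division

# Example chunk sizes, chosen only for illustration.
for chunk_gb in (500, 100, 50, 10):
    print(f"{chunk_gb:>4} GB chunks -> {torrents_needed(chunk_gb)} torrents")
```

Halving the chunk size doubles the number of torrents that have to be created, tracked and kept seeded.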

The user needs to keep their torrent client running, or they won't be counted as a seed. Offline or rarely online storage can be used, but won't be counted. So counting seeds will undercount the number of copies.

Someone needs to seed all these torrents in the first place for users to download. Who? The IA can't double their storage to store all those zip files.

.zip files don't recover well if some piece in the middle is missing. It would be better to use a file format that allows the available files to be extracted when part of the archive is missing.

a simplification

Every IA item already has a torrent associated with it. The torrent includes the derived files, but that can be amended (each item could have the current torrent plus one that includes only the original files). The simplest possible solution then is to get a few seeders into each of these swarms (IA is used as a web seed). One way to accomplish that is to write a custom BitTorrent client which automates the process of deciding which swarms each user joins, allows the user to decide how much space to use, etc. A custom BitTorrent client wouldn't be a very simple thing on its own, but it could be quite simple for users who just want to donate some space without having to think about BitTorrent.

This has the additional advantages of storing the backed-up files on disk in a format which is readily usable by the user, and of requiring little to no additional work on IA's part.
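
As a rough illustration of how such a client might decide which swarms a user joins, here is a greedy sketch. Everything in it is an assumption: the `(item_id, size_bytes, seeders)` tuples are hypothetical, and a real client would need live seeder counts from somewhere.

```python
def choose_items(items, budget_bytes):
    """Choose items for one user to seed within a space budget.

    items is a list of (item_id, size_bytes, seeders) tuples (hypothetical
    shape). Items with the fewest seeders are taken first; among equally
    seeded items, smaller ones go first. Items that no longer fit are skipped.
    """
    chosen = []
    remaining = budget_bytes
    for item_id, size, _seeders in sorted(items, key=lambda it: (it[2], it[1])):
        if size <= remaining:
            chosen.append(item_id)
            remaining -= size
    return chosen

# Made-up items; sizes are in arbitrary units for readability.
items = [("item-a", 40, 0), ("item-b", 70, 1), ("item-c", 30, 1), ("item-d", 20, 4)]
print(choose_items(items, budget_bytes=100))
```

A greedy pass like this is not an optimal packing; it only shows that the user-facing decision can reduce to a single space budget.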

> Except, there are millions of IA items. Even with a custom controller, that presents problems such as: scalability when loading half a million torrents in a torrent client; tracker scalability; analyzing so many torrents to find ones that need more seeders assigned, etc. --closure

Problems

As closure notes, a torrent swarm uses a constant, low level of traffic to find out who has which pieces and so on, so a client will not be able to join thousands of swarms. This means the "simpler" solution needs more users, each donating less storage, so as not to swamp each contributor's bandwidth and local memory. Possibly this traffic drops off to manageable levels once the swarm consists entirely of seeders, provided that the seeders don't leave and rejoin the swarm too frequently. Someone will need to run the numbers and find out.

On the other hand, he also notes that dividing the archive up into fewer torrents would require quite a significant amount of disk space, as the items are not already organized along those lines.
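
One way to start running those numbers is a back-of-envelope estimate of tracker announce load per client. The sketch below covers tracker announces only and ignores peer-to-peer chatter and DHT traffic; the interval and message size are placeholder assumptions, not measurements.

```python
def announce_load(swarms, announce_interval_s, bytes_per_announce):
    """Return (announces per hour, bytes per hour) for one client.

    Only tracker announces are counted. All three inputs are assumptions
    supplied by the caller, not measured values.
    """
    announces_per_hour = swarms * 3600 / announce_interval_s
    return announces_per_hour, announces_per_hour * bytes_per_announce

# Placeholder inputs: 10000 swarms, one announce per swarm every 1800 s,
# and a guessed 1000 bytes per announce round trip.
per_hour, bytes_per_hour = announce_load(10_000, 1800, 1000)
print(per_hour, bytes_per_hour)
```

Real figures would have to come from measuring an actual client, since announce intervals are set by the tracker and peer traffic depends on swarm activity.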