This wiki page is a collection of ideas for Project Valhalla.
This project/discussion has come about because, several times a year, Archive Team produces a class of data that sits in the Internet Archive with "large, but nominal" status: a massive amount of data whose storage cost far outstrips its current use. The largest example is currently MobileMe, which occupies hundreds of terabytes in the Internet Archive system (and is in need of WARC conversion). Another is TwitPic, which is currently available (and might continue to be available) but which has shown itself to be a bad actor with regard to longevity and the predictability of its sunset.
Therefore, there is an argument that there could be a "third place" where data collected by Archive Team could sit until the Internet Archive (or another entity) grows its coffers/storage enough that 80-100TB is "no big deal", just as 1TB of data was annoying in 2009 and is now totally understandable for the value, e.g. Geocities.
This is for short-term (or potentially also long-term) storage options, say five years or less, of data generated by Archive Team.
- What options are out there, generally?
- What are the costs, roughly?
- What are the positives and negatives?
There has been a lot of study in this area over the years, of course, so links to known authorities and debates will be welcome as well.
Join the discussion in #huntinggrounds.
- 1 Goals
- 2 What does the Internet Archive do for this Situation, Anyway?
- 3 Physical Options
- 4 Software Options
- 5 Non-options
- 6 Alternatives
- 7 From IRC
- 8 Where are you going to put it?
- 9 No, seriously, how are you going to actually DO it
- 10 What Can You Contribute?
- 11 Project-specific suggestions
- 12 See Also
- 13 References
Goals
We want to:
- Dump an unlimited amount of data into something.
- Recover that data at any point.
We do not care about:
- Immediate or continuous availability.
We absolutely require:
- Low (ideally, zero) human time for maintenance. If we have substantial human maintenance needs, we're probably going to need a Committee of Elders or something.
- Data integrity. The storage medium must either be impossibly durable or make it inexpensive and easy to copy and verify the data onto a fresh medium (a minimal checksum-manifest sketch follows these lists).
It would be nice to have:
- No special environmental requirements that could not be handled by a third party. (So nobody in Archive Team would have to set up some sort of climate-controlled data-cave; however, if this is already something that e.g. IA does and they are willing to lease space, that's cool.)
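As a concrete illustration of the data-integrity requirement, here is a minimal sketch (in Python) of generating a checksum manifest for an archive and later verifying a copy on a fresh medium against it. The paths and manifest format are placeholders, not an agreed Archive Team convention; real tooling would need to be wrapped into whatever pool software is eventually chosen.

```python
import hashlib, os, sys

def sha256_of(path, bufsize=1 << 20):
    """Stream a file through SHA-256 so multi-GB WARCs don't need to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(root, manifest_path):
    """Record 'hash  relative/path' for every file under root."""
    with open(manifest_path, "w") as out:
        for dirpath, _, files in os.walk(root):
            for name in sorted(files):
                full = os.path.join(dirpath, name)
                rel = os.path.relpath(full, root)
                out.write(f"{sha256_of(full)}  {rel}\n")

def verify_manifest(root, manifest_path):
    """Re-hash a (possibly fresh) copy and report any mismatched or missing files."""
    ok = True
    with open(manifest_path) as m:
        for line in m:
            digest, rel = line.rstrip("\n").split("  ", 1)
            full = os.path.join(root, rel)
            if not os.path.exists(full):
                print(f"MISSING  {rel}")
                ok = False
            elif sha256_of(full) != digest:
                print(f"BAD HASH {rel}")
                ok = False
    return ok

if __name__ == "__main__":
    # e.g. python manifest.py write /mnt/archive manifest.sha256
    #      python manifest.py verify /mnt/fresh_copy manifest.sha256
    cmd, root, manifest = sys.argv[1:4]
    if cmd == "write":
        write_manifest(root, manifest)
    else:
        sys.exit(0 if verify_manifest(root, manifest) else 1)
```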
What does the Internet Archive do for this Situation, Anyway?
This section has not been cleared by the Internet Archive, and so should be considered a rough sketch.
The Internet Archive primarily wants "access" to the data it stores, so the primary storage methodology is spinning hard drives connected to a high-speed connection from multiple locations. These hard drives are between 4 and 6TB (as of 2014) and are of general grade, as is most of the hardware - the theory is that replacing cheap hardware is better than spending a lot of money on super-grade hardware (whatever that may be) and not being able to make the dollars stretch. Hundreds of drives die in a given month, and the resiliency of the system allows replacements to be hot-swapped in.
There are multiple warehouses for storing the original books that are scanned, as well as materials like CD-ROMs and even hard drives. There are collections of tapes and CD-ROMs from previous iterations of storage, although they are thought of as drop-dead options instead of long-term archival storage - the preference is, first and foremost, the spinning hard drives.
The Archive does not generally use tape technology, having run into the classic "whoops, no tape drive on earth reads these any more" and "whoops, this tape no longer works properly".
The Archive has indicated that if Archive Team uses a physical storage method, such as tapes, paper, hard drives or anything else, that they are willing to store these materials "as long as they are exceedingly labelled".
Physical Options
|Storage type||Cost ($/TB/year)||Storage density (m³/TB)||Theoretical lifespan||Practical, tested lifespan||Notes|
|Hard drives (simple distributed pool)||$150 (full cost of best reasonable 1TB+ external HD)||As of September 2014, the best reasonable 1TB+ external HD is a 4TB WD. 25+ pool members would need one HD each, plus a computer, plus software to distribute data across the entire pool.|
|Hard drives (dedicated distributed pool)||An off-the-shelf or otherwise specified, dedicated, network storage device used exclusively as part of a distributed pool.|
|Hard drives (SPOF) ||$62 (but you have to buy 180TB)||For a single location to provide all storage needs, building a Backblaze Storage Pod 4.0 runs an average of $11,000, providing 180TB of non-redundant, not-highly-available storage. (You really want more than one pod mirroring your data, but this is the most effective way to get that much storage in one place.)|
|Commercial / archival-grade tapes|
|Consumer tape systems (VHS, Betamax, cassette tapes, ...)|
|PaperBack||500KB per letter sheet means 1TB is 2,199,024 sheets, or ~4400 reams (500 sheets each), or an 8'x16' room filled with 6' tall stacks. It would take 63.6 days of continuous printing to do this.|
|Optar||At 200KB per page, this has less than half the storage density of Paperback.|
|Blu-Ray||$40 (50 pack spindle of 25GB BD-Rs)||30 years||Lasts a LOT longer than CD/DVD, but should not be assumed to last more than a decade. Could be used Raidz3-style, doing backups in groups of 15 discs. Comes to under $.04/GB, which is cheap, and the initial investment (drives) is low too!|
|M-DISC||Unproven technology, but potentially interesting.|
|Flash media||Very durable for online use, and usually fails from lots of writes. A drive might never wear out from cold-storage usage. Newer drives can have 10-year warranties. But capacitors may leak charge over time. JEDEC JESD218A only specifies 101 weeks (almost two years) retention without power, so we'd have to check the spec of the specific drives, or power them up and re-write the data to refresh it about once a year. Soliciting donations for old flash media from people, or sponsorship from flash companies?|
|Amazon Glacier||$122.88 (storage only, retrieval billed separately)||average annual durability of 99.999999999% ||Retrieval is billed separately. 5% or less per month into S3 is free (5% of 100TB is 5TB), and data can be copied out from S3 to a SATA HD for $2.50/hr. plus media handling and shipping fees. Downloading 5TB from S3 would cost $614.40 (~$122.88/TB), but only $44.82 to transfer to HD via USB 3 or SATA (USB 2 is slower).|
|Dropbox for Business||$160* ($795/year)||Dropbox for Business provides a shared pool of 1TB per user, at $795/year (five user minimum, 5TB), and $125 each additional user/year.|
|Box.com for Business||$180* ("unlimited" storage for $900/year)||Box.com for Business provides "unlimited" storage at $15/user/month, five user minimum, or $900/year.|
|Dedicated colocated storage servers||$100* (e.g. $1300 for one year of 12TB rackmount server rental)||Rent storage servers from managed hosting colocation providers, and pool data across them. Benefits include bandwidth and electricity being included in the cost, and files could be made available online immediately. Negatives include needing to administer tens of servers.|
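To make the table's arithmetic easier to check, here is a small back-of-the-envelope script reproducing a few of the figures above (PaperBack sheet counts, Blu-ray cost per GB, Glacier storage and S3 egress costs). The prices are the 2014 numbers quoted in the table, not current ones.

```python
# Back-of-the-envelope checks for a few figures quoted in the table above.
# Prices are the 2014 numbers from the table, not current ones.

TIB = 1024 ** 4  # the table's "1TB" figures behave like binary terabytes

# PaperBack: 500KB per letter sheet
sheets = TIB / 500_000
reams = sheets / 500
print(f"PaperBack: {sheets:,.0f} sheets (~{reams:,.0f} reams) per TB")
# -> roughly 2,199,023 sheets and ~4,398 reams, matching the table

# Blu-ray: $40 for a 50-pack of 25GB BD-Rs
blu_cost_per_gb = 40 / (50 * 25)
print(f"Blu-ray: ${blu_cost_per_gb:.3f}/GB")  # -> $0.032/GB, i.e. "under $.04/GB"

# Amazon Glacier (2014 pricing): $0.01/GB-month for storage
glacier_per_tb_year = 0.01 * 1024 * 12
print(f"Glacier storage: ${glacier_per_tb_year:.2f}/TB/year")  # -> $122.88

# S3 egress at the time: $0.12/GB, so pulling 5TB out over the network
s3_egress_5tb = 0.12 * 5 * 1024
print(f"S3 egress for 5TB: ${s3_egress_5tb:.2f}")  # -> $614.40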
Software Options
Some of the physical options require supporting software.
Removable media requires a centralized index of who has what discs, where they are, how they are labeled, and what the process for retrieval/distribution is. It could just be a wiki page, but it does require something.
A simple pool of HDs ("simple pool"), one without a shared filesystem, just people offering up HDs, requires software running on Windows, Linux and/or Mac hardware so that Archive Team workers can learn who has free disk space and save content to those disks. This could be just an IRC conversation and SFTP, but the more centralized and automated it is, the more likely it is that available disk space will actually be used. Software that is not cross-platform cannot be used here.
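As one hedged example of the glue such a simple pool would need, the following sketch reports a volunteer's free space in a form that could be pasted into IRC or fed to a coordination bot. The archive path and the reporting mechanism are placeholders, not an agreed-upon Archive Team protocol.

```python
import json
import platform
import shutil
import socket

# Path of the volunteer's archive drive -- a placeholder, since nothing is standardized.
ARCHIVE_PATH = "/mnt/archiveteam" if platform.system() != "Windows" else "E:\\archiveteam"

def free_space_report(path=ARCHIVE_PATH):
    """Return a small dict describing who we are and how much space we can offer."""
    usage = shutil.disk_usage(path)  # works on Windows, Mac and Linux (Python 3.3+)
    return {
        "host": socket.gethostname(),
        "os": platform.system(),
        "path": path,
        "free_gb": round(usage.free / 1024**3, 1),
        "total_gb": round(usage.total / 1024**3, 1),
    }

if __name__ == "__main__":
    # Print as JSON so a coordination bot (or a human on IRC) can pick it up.
    print(json.dumps(free_space_report(), indent=2))
```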
A simple distributed and redundant pool of HDs ("distributed pool") requires software running on Windows, Linux and Mac hardware to manage a global filesystem or object store, and distribute uploads across the entire pool of available space, and make multiple copies on an ongoing basis to ensure preservation of data if a pool member goes offline. This has to be automated and relatively maintenance-free, and ideally low-impact on CPU and memory if it will be running on personal machines with multi-TB USB drives hanging off them. Software that is not cross-platform cannot be used here.
A dedicated distributed and redundant pool of HDs ("dedicated pool") requires a selection of dedicated hardware and disks for maximum availability, and software to run on that hardware to manage a global filesystem or object store. It has to be automated and relatively maintenance-free, but would be the only thing running on its dedicated hardware, and as such does not have to be cross-platform.
|Software name||Filesystem or Object Store?||Platform(s)||License||Good for which pool?||Pros||Cons||Notes|
|Tahoe-LAFS||Filesystem||Windows, Mac, Linux||GPL 2+||Distributed, dedicated||Uses what people already have, can spread expenses out, could be a solution done with only software||Barrier to leaving is non-existent, might cause data-loss even with auto-fixing infrastructure. Too slow to be a primary offloading site. ||Accounting is experimental, meaning "in practice is that anybody running a storage node can also automatically shove shit onto it, with no way to track down who uploaded how much or where or what it is" -joepie91 on IRC|
|Ceph||Object store, Filesystem||Linux||LGPL||Dedicated|
|GlusterFS||Filesystem||Linux, BSD, OpenSolaris||GPL 3||Dedicated|
|Gfarm||Filesystem||Mac, Linux, BSD, Solaris||X11||Dedicated|
|Quantcast||Filesystem||Linux||Apache||Dedicated||Like HDFS, intended for MapReduce processing, which writes large files, and doesn't delete them. Random access and erasing or moving data around may not be performant.|
|HDFS||Filesystem||Java||Apache||Distributed, dedicated||Like Quantcast, intended for MapReduce processing, which writes large files, and doesn't delete them. Random access and erasing or moving data around may not be performant.|
|MogileFS||Object store||Linux||GPL||Dedicated||Understands distributing files across multiple networks, not just multiple disks||As an object store, you can't just mount it as a disk and dump files onto it, you have to push them into it through its API, and retrieve them the same way.|
|Riak CS||Object store||Mac, Linux, BSD||Apache||Dedicated||S3 API compatible||Multi-datacenter replication (which is roughly what a pool of disparate users on different networks amounts to) is only available in the commercial offering.||A former Basho employee suggests this might not be a good fit due to the high latency and unstable connections we'd be dealing with. Datacenter-to-datacenter sync is an "entirely different implementation" than local replication, and would require the enterprise offering.|
|MongoDB GridFS||Object store||Windows, Mac, Linux||AGPL||Distributed, dedicated|
|LeoFS||Object store||Mac, Linux||Apache||Dedicated||S3-compatible interface, beta NFS interface, supports multi-datacenter replication, designed with GUI administration in mind|
|BitTorrent Sync||Synchronization||Windows, Mac, Linux, BSD, NAS||Proprietary||Simple||Commercially supported software||As straight synchronization software, it mirrors folders across devices. Individual users would have to make synched folders available to get copies of archives, and then they would be mirrored, and that's it.||Synchronization software in general is not the right solution for this problem.|
|Syncthing||Synchronization||Windows, Mac, Linux, BSD, NAS||GPL||Simple||Open source software, active development, rights management||As straight synchronization software, it mirrors folders across devices. Individual users would have to make synced folders available to get copies of archives, and then they would be mirrored, and that's it. Rights management allows members to only download, but not change, the files in the pool.||Synchronization software in general is not the right solution for this problem.|
|BitTorrent||Filesystem||All||various||Distributed||Readily available technology, easily understood distribution model, contributors can join or leave at any time||Harder to get people interested in contributing if they have to join BitTorrent swarms||Breaking a large item up into smaller torrents makes contributing smaller chunks of space possible, and a custom client could be created which would let the user dedicate some space and automatically join the swarms which have the fewest peers. Getting the initial seeds requires coordination to distribute the data across available seeds by other means, creating the sub-torrents, etc.|
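Several of the object stores above (Riak CS, LeoFS) expose an S3-compatible API, which illustrates the point from the MogileFS row: you push data through an API rather than mounting a disk. A minimal sketch using boto3 against a hypothetical pool endpoint might look like the following; the endpoint, credentials, bucket and file names are made up for illustration.

```python
import boto3

# Hypothetical S3-compatible gateway (e.g. a LeoFS or Riak CS node) run by the pool.
s3 = boto3.client(
    "s3",
    endpoint_url="http://storage-pool.example.org:8080",
    aws_access_key_id="ARCHIVETEAM_KEY",
    aws_secret_access_key="ARCHIVETEAM_SECRET",
)

bucket = "valhalla"  # illustrative bucket name

# Unlike a mounted filesystem, every object has to be pushed and pulled explicitly.
s3.upload_file("twitpic-batch-0001.warc.gz", bucket, "twitpic/batch-0001.warc.gz")
s3.download_file(bucket, "twitpic/batch-0001.warc.gz", "restored-batch-0001.warc.gz")
```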
From IRC
For completeness' sake:
<Drevkevac> we are looking to store 100TB+ of media offline for 25+ years
<Drevkevac> if anyone wants to drop in, I will pastebin the chat log
<rat> DVDR and BR-R are not high volume. When you have massive amounts of data, raid arrays have too many points of failure.
<rat> Drevkevac: I work in a tv studio. We have 30+ years worth of tapes. And all of them are still good.
<rat> find a hard drive from 30 years ago and see how well it hooks up ;)
<brousch_> 1500 Taiyo Yuden Gold CD-Rs http://www.mediasupply.com/taiyo-yuden-gold-cd-rs.html
<Drevkevac> still, if its true, you could do, perhaps, raidz3s in groups of 15 disks or so?
<SketchCow> Please add paperbak to the wiki page.
<SketchCow> Fuck Optical Media. not an option;.
<Drevkevac> that would give you ~300GB per disk group, with 3 disks
Where are you going to put it?
Okay, so you have the tech. Now you need a place for it to live.
Multiple copies would of course be great.
No, seriously, how are you going to actually DO it
There are only a few practical hardware+software+process combinations. In order of cost to each volunteer:
Blu-ray discs
This probably requires a minimum of three volunteers per TB per project. It is probably best to pre-split the data into <25GB chunks so each disc can be labeled identically and expected to hold the same data. Fifty 25GB discs is a little more than a TB, and you can expect to lose a few to bad burns each time, so it might be worth buying more than a spindle and generating parity files onto additional discs.
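A hedged sketch of the pre-splitting step described above: cut a large archive into chunks that fit on a 25GB BD-R, and record a checksum per chunk so bad burns can be caught before discs are labeled and shelved. The chunk size and file names are assumptions, not an agreed convention; re-assembly is just concatenating the parts in order, and parity files on extra discs would cover the expected bad burns.

```python
import hashlib
import os

CHUNK_BYTES = 24 * 1000**3  # stay safely under a 25GB BD-R, leaving room for filesystem overhead

def split_for_discs(source, out_dir):
    """Split `source` into disc-sized chunks and return (chunk_name, sha256) pairs."""
    os.makedirs(out_dir, exist_ok=True)
    results = []
    with open(source, "rb") as src:
        index = 0
        while True:
            chunk_name = f"{os.path.basename(source)}.part{index:04d}"
            chunk_path = os.path.join(out_dir, chunk_name)
            h = hashlib.sha256()
            written = 0
            with open(chunk_path, "wb") as dst:
                while written < CHUNK_BYTES:
                    buf = src.read(min(1 << 20, CHUNK_BYTES - written))
                    if not buf:
                        break
                    dst.write(buf)
                    h.update(buf)
                    written += len(buf)
            if written == 0:
                os.remove(chunk_path)  # source exhausted exactly at a chunk boundary
                break
            results.append((chunk_name, h.hexdigest()))
            index += 1
    return results

if __name__ == "__main__":
    # e.g. split a hypothetical project tarball into burnable chunks and print a manifest
    for name, digest in split_for_discs("mobileme-panic-0001.tar", "discs/"):
        print(digest, name)
```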
External hard drives
Same as with Blu-rays, and not really any more expensive ($150 == $37.50 for 1TB of Blu-rays * 4, or one 4TB HD), except look at all that disc-swapping time and effort you don't have to do. You don't have to split data into chunks, but you do want to download it in a resumable fashion and verify it afterwards, so: checksums, parity files, something. You also risk losing a lot more if a drive fails, and the cost per volunteer is higher (replacing a whole drive versus replacing individual discs or spindles). As such, you still probably want a minimum of three volunteers per TB per project (so a 2TB project needs six volunteers with 1TB each, not three volunteers holding all 2TB each).
Consumer NAS devices
These units provide dramatically improved reliability for content, enough that perhaps you only need two volunteers per project, and no need to split by TB, since each volunteer would have two copies. Having everyone buy the same hardware means reduced administration time overall, especially if custom scripts are involved. QNAP and Synology both have official SDKs, and all of them run some flavor of Linux, with Synology supporting SSH logins out of the box. The Pogoplug is the most underpowered of the options, but even it should be powerful enough to run a MogileFS storage node, or a script that downloads to one HD and copies to the other. (Checksums would be really slow, though; a sketch of such a script follows the caveats below.) This is moderately expensive per volunteer, with an upfront cost of $320-$500.
Consumer NAS devices can have severe firmware issues, potentially causing full data loss on a trivial operation. Such a case was previously observed after flashing a new official firmware image onto a QNAP Pro series 4-bay NAS (700€ empty) while the RAID was presumably resyncing; it has to be expected that such a device prefers reinitializing the array over being stuck with an error.
HDD compatibility is limited and needs close investigation; 2TB WD Green drives, for example, tend to frequently degrade the RAID array and accumulate load cycles from frequent head parking.
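A sketch of the "download to one HD, copy to the other" idea mentioned above, including the slow-but-thorough checksum comparison the text warns about. The mount points are hypothetical, and a real deployment would be wrapped in whatever the device's SDK or cron facility provides.

```python
import hashlib
import os
import shutil

# Hypothetical mount points for the two drives in a consumer NAS.
PRIMARY = "/share/disk1/archiveteam"
MIRROR = "/share/disk2/archiveteam"

def sha256_of(path, bufsize=1 << 20):
    """Stream a file through SHA-256."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

def mirror_and_verify(primary=PRIMARY, mirror=MIRROR):
    """Copy anything new from the primary drive to the mirror, then hash-compare both copies."""
    for dirpath, _, files in os.walk(primary):
        rel_dir = os.path.relpath(dirpath, primary)
        os.makedirs(os.path.join(mirror, rel_dir), exist_ok=True)
        for name in files:
            src = os.path.join(dirpath, name)
            dst = os.path.join(mirror, rel_dir, name)
            if not os.path.exists(dst):
                shutil.copy2(src, dst)
            # The slow part the wiki text warns about: re-reading both copies to compare.
            if sha256_of(src) != sha256_of(dst):
                print(f"MISMATCH: {src}")

if __name__ == "__main__":
    mirror_and_verify()
```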
DIY / prosumer NAS devices
A set of volunteers with (comparatively) expensive network-attached storage gives you a lot of storage in a lot of locations, potentially tens of redundant TB in each one, depending on the size of the chassis. You want everyone running the same NAS software, but the hardware can vary somewhat; however, the hardware should all have ECC RAM, and the more of it the better. MogileFS storage nodes are known to run on NexentaStor, and FreeNAS supports plugins, so it could be adapted to run there, or you could figure out e.g. LeoFS (which also expects ZFS). This is the most expensive option per volunteer, with upfront costs starting at around $1300 for a DIY box with four 4TB WD Red drives.
Rented or colocated servers
A rented server has no hardware maintenance costs: replacing a failed HD is the responsibility of the hosting provider, in both materials and labor. This is not the case with a purchased, colocated server, where someone would have to buy a replacement hard drive and either bring it to the colocation center and swap it themselves, or ship it there and be billed for the labor involved in replacing it.
What Can You Contribute?
Twitch.tv (and other video services)