Valhalla

[Image: Ms internet on a disc.jpg]

This wiki page is a collection of ideas for Project Valhalla.

This project/discussion has come about because, several times a year, a class of data turns up as a massive collection with "large, but nominal" status within the Internet Archive. The largest example is currently MobileMe, which occupies hundreds of terabytes in the Internet Archive system (and is in need of WARC conversion), at a cost far outstripping its use. Another is TwitPic, which is currently available (and might continue to be available) but which has shown itself to be a bad actor with regard to longevity and the predictability of its sunset.

Therefore, there is an argument that there could be a "third place" where data collected by Archive Team could sit until the Internet Archive (or another entity) grows its coffers/storage enough that 80-100TB is "no big deal", just as 1TB of data was annoying in 2009 but is now considered totally worth it for the value, e.g. Geocities.

This page is about short-term (and potentially also long-term) storage options, say five years or less, for data generated by Archive Team.

  • What options are out there, generally?
  • What are the costs, roughly?
  • What are the positives and negatives?

There has been a lot of study in this area over the years, of course, so links to known authorities and debates will be welcome as well.

Join the discussion in #huntinggrounds.

Goals

We want to:

  • Dump an unlimited[1] amount of data into something.
  • Recover that data at any point.

We do not care about:

  • Immediate or continuous availability.

We absolutely require:

  • Low (ideally, zero) human time for maintenance.
  • Data integrity. The storage medium must either be extraordinarily durable or make it inexpensive/easy to copy and verify the data onto a fresh medium (see the verification sketch below).
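
One cheap way to satisfy the "copy and verify" half of that requirement is a checksum manifest written next to the data when a copy is made, then re-checked against every fresh copy before the old medium is retired. A minimal sketch in Python follows; the manifest format, the script name, and the choice of SHA-256 are illustrative assumptions, not anything specified on this page.

import hashlib
import os
import sys

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so huge archives fit in constant memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()

def build_manifest(root, manifest_path):
    """Record 'digest  relative-path' for every file under root."""
    with open(manifest_path, "w") as out:
        for dirpath, _, names in os.walk(root):
            for name in sorted(names):
                full = os.path.join(dirpath, name)
                rel = os.path.relpath(full, root)
                out.write("%s  %s\n" % (sha256_of(full), rel))

def verify_manifest(root, manifest_path):
    """Return the relative paths that are missing or whose digest changed."""
    failures = []
    with open(manifest_path) as f:
        for line in f:
            digest, rel = line.rstrip("\n").split("  ", 1)
            full = os.path.join(root, rel)
            if not os.path.exists(full) or sha256_of(full) != digest:
                failures.append(rel)
    return failures

if __name__ == "__main__":
    # usage: python manifest.py build|verify <data-directory> <manifest-file>
    command, root, manifest = sys.argv[1:4]
    if command == "build":
        build_manifest(root, manifest)
    else:
        bad = verify_manifest(root, manifest)
        print("%d file(s) failed verification" % len(bad))

Re-running the verify step each time data is migrated keeps the human cost close to zero, which is the point of the requirement above.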

It would be nice to have:

  • No special environmental conditions that could not be handled by a third party. (So nobody in Archive Team would have to set up some sort of climate-controlled data-cave; however, if this is already something that e.g. IA does and they are willing to lease space, that's cool.)

What does the Internet Archive do for this Situation, Anyway?

This section has not been cleared by the Internet Archive, and so should be considered a rough sketch.

The Internet Archive primarily wants "access" to the data it stores, so the primary storage methodology is spinning hard drives connected to a high-speed connection from multiple locations. These hard drives are between 4TB and 6TB (as of 2014) and are of general grade, as is most of the hardware - the theory is that replacing cheap hardware is better than spending a lot of money on super-grade hardware (whatever that may be) and not being able to make the dollars stretch. Hundreds of drives die in a month, and the resiliency of the system allows replacements to be hot-swapped in.

There are multiple warehouses for storing the original books that are scanned, as well as materials like CD-ROMs and even hard drives. There are collections of tapes and CD-ROMs from previous iterations of storage, although they are thought of as drop-dead options instead of long-term archival storage - the preference is, first and foremost, the spinning hard drives.

The Archive does not generally use tape technology, having run into the classic "whoops, no tape drive on earth reads these any more" and "whoops, this tape no longer works properly".

The Archive has indicated that if Archive Team uses a physical storage method, such as tapes, paper, hard drives or anything else, it is willing to store these materials "as long as they are exceedingly labelled".

Options

For each option, the relevant comparison points are cost ($/TB/year), storage density (m³/TB), theoretical lifespan, practical/tested lifespan, and notes.

  • Hard drives (simple distributed pool)
    • Cost: ~$150/TB (full cost of the best reasonable 1TB+ external HD). As of September 2014, the best reasonable 1TB+ external HD is a 4TB WD. 25+ pool members would each need one HD, plus a computer, plus software to distribute data across the entire pool.
  • Hard drives (dedicated distributed pool)
    • An off-the-shelf or otherwise specified, dedicated network storage device used exclusively as part of a distributed pool.
  • Hard drives (SPOF)[2]
    • Cost: ~$62/TB (but you have to buy 180TB). For a single location to provide all storage needs, building a Backblaze Storage Pod 4.0 runs an average of $11,000, providing 180TB of non-redundant, not-highly-available storage. (You really want more than one pod mirroring your data, but this is the most effective way to get that much storage in one place.)
  • Commercial / archival-grade tapes
  • Consumer tape systems (VHS, Betamax, cassette tapes, ...)
  • Vinyl
  • PaperBack
    • At 500KB per letter sheet, 1TB is 2,199,024 sheets, or ~4,400 reams (500 sheets each), or an 8'x16' room filled with 6'-tall stacks.
  • Optar
    • At 200KB per page, this has less than half the storage density of PaperBack.
  • Blu-Ray
    • Cost: ~$40/TB (one 50-pack spindle of 25GB BD-Rs). Theoretical lifespan: 30 years.[3] Lasts a LOT longer than CD/DVD, but should not be assumed to last more than a decade. Raidz3 over groups of 15 discs comes to under $0.04/GB, which is cheap, with a low initial investment (drives) too. Specifically, a 50-pack spindle of 25GB BD-Rs can readily hold 1TB of data for $30-50 per spindle; 50GB and 100GB discs are more expensive per GB.
  • M-DISC
    • Unproven technology, but potentially interesting.
  • Flash media
    • Wears out quickly; not-so-good for long-term storage. Soliciting donations of old flash media from people, or sponsorship from flash companies?
  • Glass/metal etching
  • Amazon Glacier
    • Cost: $122.88/TB (storage only; retrieval billed separately). Durability: average annual durability of 99.999999999%.[4] Retrieval of 5% or less per month into S3 is free (5% of 100TB is 5TB), and data can be copied out from S3 to a SATA HD for $2.50/hr plus media handling and shipping fees. Downloading 5TB from S3 over the internet would cost $614.40 (~$122.88/TB), but only $44.82 to transfer to HD via USB 3 or SATA (USB 2 is slower).
  • Dropbox for Business
    • Cost: ~$160/TB ($795/year for 5TB). Dropbox for Business provides a shared pool of 1TB per user at $795/year (five-user minimum, 5TB), and $125 per additional user/year.
  • Box.com for Business
    • Cost: ~$180/TB ("unlimited" storage for $900/year). Box.com for Business provides "unlimited" storage at $15/user/month with a five-user minimum, or $900/year.
  • Dedicated colocated storage servers
    • Cost: ~$100/TB (e.g. $1,300 for one year of a 12TB rackmount server rental). Rent storage servers from managed hosting/colocation providers and pool data across them. Benefits: bandwidth and electricity are included in the cost, and files could be made available online immediately. Negatives: needing to administer tens of servers.
  • Tahoe-LAFS
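
The per-TB figures above are straightforward arithmetic on the prices quoted in the table. A quick sanity check in Python follows; every input is a 2014-era number taken from the table itself, except the $0.01/GB/month Glacier rate, which is inferred from the $122.88/TB/year figure, and the 3-parity/12-data split, which is simply what raidz3 means for a 15-disc group.

import math

# Re-derive the rough per-TB numbers quoted above from the table's own prices.

TB_BYTES = 2 ** 40  # the definition of 1TB used for the PaperBack estimate

# PaperBack: 500KB (500,000 bytes) per letter-size sheet
sheets = math.ceil(TB_BYTES / 500_000)
print("PaperBack: %d sheets per TB, ~%d reams of 500" % (sheets, math.ceil(sheets / 500)))

# Blu-Ray: $40 for a 50-pack spindle of 25GB BD-Rs
print("BD-R: $%.3f per GB" % (40 / (50 * 25)))        # comes in under $0.04/GB

# Raidz3 in groups of 15 discs: 3 parity discs + 12 data discs of 25GB each
print("raidz3 group of 15 BD-Rs: %d GB usable" % (12 * 25))

# Backblaze Storage Pod 4.0: ~$11,000 for 180TB, non-redundant
print("Storage Pod: $%.2f per TB" % (11_000 / 180))   # the table rounds this to $62

# Amazon Glacier: $0.01/GB/month, storage only
print("Glacier: $%.2f per TB per year" % (0.01 * 1024 * 12))

# Dropbox for Business: $795/year for a five-user, 5TB pool
print("Dropbox: $%.0f per TB per year" % (795 / 5))

# Dedicated colocated 12TB server: ~$1,300/year
print("Colo server: $%.2f per TB per year" % (1_300 / 12))  # the table rounds this to ~$100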

Non-options

  • Ink-based Consumer Optical Media (CDs, DVD, etc.)
    • Differences between Blu-Ray and DVD? DVDs do not last very long. The fact is, the history of optical writable media has been one of chicanery, failure, and overpromising while under-delivering. Some DVDs failed within a year. There are claims Blu-Ray is different, but fool me 3,504 times, shame on me.
  • BitTorrent Sync
    • Proprietary (currently), so not a good idea to use as an archival format/platform
  • Amazon S3 / Google Cloud Storage / Microsoft Azure Storage
    • Amazon S3 might be a viable waypoint for intra-month storage ($30.68/TB/month), but retrieval over the internet, as with Glacier, is expensive: $8,499.08 for 100TB. Google's and Microsoft's offerings are all in the same price range.
  • Floppies
    • "Because 1.4 trillion floppies exists less than 700 billion floppies. HYPOTHETICALLY, if you set twenty stacks side by side, figure a quarter centimeter per floppy thickness, excluded the size of the drive needed to read the floppies you would still need a structure 175,000 ft. high to house them. Let's also assume that the failure rate for floppies is about 5% (everyone knows that varies by brand, usage, time of manufacture, materials used, etc, but lets say 5% per year). 70 million of those 1.4 trillion floppies are unusuable. Figuring 1.4 MB per floppy disk, you are losing approximately 100MB of porn each year. Assuming it takes 5 seconds to replace a bad floppy, you would have to spend 97,222 hrs/yr to replace them. Considering there are only 8,760 hrs per year, you would require a staff of 12 people replacing floppies around the clock or 24 people on 12 hr shifts. Figuring $7/hr you would spend $367,920 on labor alone. Figuring a nickel per bad floppy, you would need $3,500,000 annually in floppy disks, bringing your 1TB floppy raid operating costs (excluding electricity, etc) to $3,867, 920 and a whole landfill of corrupted porn. Thank you for destroying the planet and bankrupting a small country with your floppy based porn RAID." (source)

From IRC

<Drevkevac> we are looking to store 100TB+ of media offline for 25+ years
<Drevkevac> if anyone wants to drop in, I will pastebin the chat log
<rat> DVDR and BR-R are not high volume. When you have massive amounts of data, raid arrays have too many points of failure.
<rat> Drevkevac: I work in a tv studio. We have 30+ years worth of tapes. And all of them are still good.
<rat> find a hard drive from 30 years ago and see how well it hooks up ;)
<brousch_> 1500 Taiyo Yuden Gold CD-Rs http://www.mediasupply.com/taiyo-yuden-gold-cd-rs.html
<Drevkevac> still, if its true, you could do, perhaps, raidz3s in groups of 15 disks or so?
<SketchCow> Please add paperbak to the wiki page.
<SketchCow> Fuck Optical Media. not an option;.
<Drevkevac> that would give you ~300GB per disk group, with 3 disks

Where are you going to put it?

Okay, so you have the tech. Now you need a place for it to live.

Possibilities:

  • The Internet Archive Physical Warehouse, Richmond, CA
    • The Internet Archive has several physical storage facilities, including warehouses in Richmond, CA (home of the Physical Archive) and the main location in San Francisco, CA. They have indicated they are willing to take copies of Archive Team-sponsored physical materials, with the intent that they be ingested into the Archive at large over time as costs come down and 100TB collections are not as big a drain (or a rash of funding arrives elsewhere).
  • Living Computer Museum, Seattle, WA
    • In discussions with Jason Scott, the Living Computer Museum has indicated they will have physical storage available for computer historical materials. Depending on the items being saved by Archive Team, they may be willing to host/hold copies for the foreseeable future.
  • Library of Congress, Washington, DC
    • The Library of Congress may be willing to take a donation of physical storage, although it is not indicated what they may do long-term with it.

Multiple copies would of course be great.

Project-specific suggestions

Twitch.tv (and other video services)

  • Keep the original video files in (semi-)offline storage, and store transcoded (compressed) versions on the Internet Archive.
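
A minimal sketch of that split, assuming ffmpeg is installed and the originals are FLV files: the originals directory stays untouched for (semi-)offline storage, and only the compressed H.264/AAC copies are written out for upload to the Internet Archive. The directory names, source extension, and encoder settings here are illustrative assumptions, not a settled pipeline.

import pathlib
import subprocess

ORIGINALS = pathlib.Path("originals")   # kept as-is for (semi-)offline storage
COMPRESSED = pathlib.Path("for-ia")     # transcoded copies destined for the Internet Archive
COMPRESSED.mkdir(exist_ok=True)

for src in sorted(ORIGINALS.glob("*.flv")):
    dst = COMPRESSED / (src.stem + ".mp4")
    if dst.exists():
        continue  # already transcoded on a previous run
    subprocess.run(
        [
            "ffmpeg", "-n",                      # never overwrite an existing output file
            "-i", str(src),
            "-c:v", "libx264", "-crf", "23", "-preset", "slow",
            "-c:a", "aac", "-b:a", "128k",
            str(dst),
        ],
        check=True,                              # stop if any transcode fails
    )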

See Also

References

  1. Yes, unlimited means infinite. This is one thing that makes this hard. Take "impossible" to slashdot.
  2. The Internet Archive's cost per TB, with 24/7 online hard drives, is approximately $2,000 to keep the data forever.
  3. On the basis of the described studies and assuming adequate consideration of the specified conditions for storage and handling, as well as verification of data after writing, we estimate the Imation CD, DVD or Blu-ray media to have a theoretical readability of up to 30 years. The primary caveat is how you handle and store the media. http://support.tdkperformance.com/app/answers/detail/a_id/1685/~/life-expectancy-of-optical-media
  4. "Amazon Glacier is designed to provide average annual durability of 99.999999999% for an archive. The service redundantly stores data in multiple facilities and on multiple devices within each facility. To increase durability, Amazon Glacier synchronously stores your data across multiple facilities before returning SUCCESS on uploading archives. Glacier performs regular, systematic data integrity checks and is built to be automatically self-healing." Maciej Ceglowski thinks that's kinda bullshit compared to the failure events you don't plan for, of course.