Places to store data

From Archiveteam
Revision as of 07:17, 5 November 2017 by JesseW (talk | contribs) (add something for all the categories)

This is a list/essay about places to store data, of varying permanence, cost, scale, etc -- mainly for use with ArchiveTeam material, but it could apply to any data.

The first and most obvious choice is IA, the Internet Archive. They accept pretty much anything, of any scale, intend to keep all of it forever, and will distribute a very large range of material, particularly if no-one grumbles at them about it. And aside from direct uploading, for a lot of the other places listed below, you can save a copy of the page hosting the data into the Wayback Machine using the Save Page feature.

But it's good to put stuff in other places, too -- duplication keeps stuff safe, and something can't be erased from history if all the copies can't be found.

The first thing to think about when looking for a place to store data is: How much data is it? There are lots more places to dump a few kilobytes than are willing and able to accept a petabyte that needs a loving home.

Some good size categories (in order of increasing size) include: a few hashes; a few pages of text; a short video clip; an optical disk (one or two gigabytes); a terabyte; a petabyte. Alternatively, with the Penn Jillette scale, a few; a bunch; a lot; a shitload; a motherfucking shitload; more than a motherfucking shitload.

Another important consideration is: human-readable text or not? While any byte sequence can be converted into human-readable (if boring) text, this generally expands it considerably, and tends to make people suspicious. So it's good to know what type of material a given place expects.
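To put a rough number on that expansion, here is a minimal Python sketch comparing two common text encodings, hexadecimal and Base64. The function name `text_expansion` and the sample size are illustrative choices, not anything from the list above.

```python
import base64
import os


def text_expansion(data: bytes) -> dict:
    """Return the size of each text encoding relative to the raw bytes."""
    encodings = {
        "hex": data.hex().encode("ascii"),   # two characters per byte
        "base64": base64.b64encode(data),    # four characters per three bytes
    }
    return {name: len(encoded) / len(data) for name, encoded in encodings.items()}


# Arbitrary binary data; the length is a multiple of 3 so Base64 needs no padding.
sample = os.urandom(3000)
ratios = text_expansion(sample)
print(ratios)
```

Hex doubles the size and Base64 adds about a third, and both produce long runs of text that do not look like anything a person would write, which is the part that tends to draw attention.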

Below, grouped by size, are various suggested places to store data. Generally, all the bigger places can also be used for the smaller ones. Additional suggestions and commentary welcomed!

A few hashes

This size category refers to short strings; things like the toplevel hash of some larger pile of data, or a password or other secret, or a significant phrase.

  • Things this short can be stored in an arbitrary website's log files by simply appending them to the domain name in a requested URL; this will generate an error report containing the string. This doesn't provide any distribution, however.
  • Many sites' usernames are flexible and long enough to be used as a distribution mechanism for such short strings. And while bigger sites have this happen often enough that they have ways to hide usernames they don't want to distribute, many other sites don't -- and all of them have to notice before they will take any action.
  • Short strings can be embedded in various crypto-currency blockchains, although this generally isn't free.
  • Various URL shortening sites allow for custom short codes; if the string is short enough (or the URL site is flexible enough), you could likely put it there, although that doesn't help much for distribution. The string could also be put in the destination URL.
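As an illustration of what a "toplevel hash of some larger pile of data" might look like, here is a minimal Python sketch that hashes every file under a directory and then hashes the resulting manifest. The manifest layout (hash, two spaces, relative path) and the names `file_sha256` and `toplevel_hash` are assumptions made for this example; any fixed, documented scheme would do as well.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path) -> str:
    """SHA-256 of one file, read in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def toplevel_hash(root: Path) -> str:
    """Hash a manifest of (file hash, relative path) lines, sorted by path."""
    relative_paths = sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )
    manifest = hashlib.sha256()
    for relative in relative_paths:
        line = f"{file_sha256(root / relative)}  {relative}\n"
        manifest.update(line.encode("utf-8"))
    return manifest.hexdigest()
```

Calling `toplevel_hash(Path("some/directory"))` yields a 64-character hex string, short enough for any of the places above. It only proves something later if the manifest scheme is recorded alongside it.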

A few pages of text

This size category refers to data in the kilobyte range; individual essays in plain text, single images, a large bunch of hashes, etc.

  • The obvious option here is a pastebin site, of which there are many. Make sure to select the "keep forever" option, and know that they probably won't, anyway. But they are (mostly) anonymous, free, and quick. And you can duplicate the data into the Wayback Machine with the Save Page feature, in most cases.
  • Image hosting sites are another option, although, since they tend to lack a business model, they are generally not very good for long-term storage -- but converting text into an image (or vice-versa) is a useful way to create a harder-to-find version.
  • Many wikis can be (mis-)used to host arbitrary content, especially if you insert it into an existing page, and (ideally from a different account) promptly revert the page back to its previous content.
  • All of the ideas documented at the famed DeCSS Gallery (someone add a link, please) are options, although generally labor-intensive.

A short video clip

This size category refers to data in the range of tens to hundreds of megabytes; video clips (not full movies), photo albums, the text content of entire blogs or small forums, etc.

  • YouTube is the obvious option here. But it isn't (entirely) a monopoly -- alternatives like Vimeo (and others even less well known) exist, and are worth using when you are trying to get copies in as many places as possible.
  • Most social media and blogging services (Facebook, Blogger, Tumblr, WordPress, etc.) will gladly accept content of this scale, and will generally host it unless it gets really popular or attracts complaints.

An optical disk

This size category refers to data in the half to a few gigabytes range; CD-ROMs, DVD-ROMs, highly compressed movies, WARCs of small to medium websites, etc.

  • Cloud storage providers are obvious choices here: Google Drive, Amazon S3, Microsoft Azure, and the other less well known ones. Some are free, some aren't -- none are particularly anonymous, and all have limits/fees for additional distribution bandwidth.
  • As the title suggests, local storage on optical media is another feasible choice; this doesn't directly provide distribution (except via physical shipment), but has the advantage of not being findable online, which can be useful for controversial material.
  • Torrents can be useful in this size range (or smaller); while they are merely a distribution mechanism, if you can run a seedbox, they provide a way to make material available without relying on other providers.

A terabyte

This is 1000 gigabytes; a pile of losslessly compressed movies, WARCs of large (or very inefficient) websites, a library of printed works (with page images).

  • Few places (aside from IA) will store this much data for free -- although the cloud providers are delighted to sell you this level of storage.
  • Local hard drives can be a good choice, although keeping *two* local copies is a good idea, to guard against data loss from hardware failure.
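Two local copies only help if they actually match. As a minimal sketch, assuming each copy is an ordinary directory tree, the following Python compares the two by per-file SHA-256 and reports anything missing or different. The names `index_tree` and `compare_copies` are hypothetical helpers written for this example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """SHA-256 of one file, read in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def index_tree(root: Path) -> dict:
    """Map each file's path (relative to root) to its SHA-256 hash."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in root.rglob("*")
        if p.is_file()
    }


def compare_copies(first: Path, second: Path) -> dict:
    """Report files present in only one copy, or present in both but differing."""
    a, b = index_tree(first), index_tree(second)
    return {
        "only_in_first": sorted(a.keys() - b.keys()),
        "only_in_second": sorted(b.keys() - a.keys()),
        "differing": sorted(n for n in a.keys() & b.keys() if a[n] != b[n]),
    }
```

Running `compare_copies(Path("/mnt/drive_a/archive"), Path("/mnt/drive_b/archive"))` (both paths are placeholders) returns three lists; all three being empty means the copies agree. At terabyte scale this reads every byte on both drives, so expect it to take hours.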

A petabyte (More Than A Motherfucking Shitload)

This size category refers to data in the petabyte range; digital archives, file hosts, etc.

  • Few reliable resources exist when needs are in the petabyte range. While the Internet Archive does store petabytes, there's no way one will casually drop a petabyte on them and not raise eyebrows or ring alarm bells.
  • Other options, like Backblaze B2, Amazon S3/Glacier, Microsoft Azure, and others, are meant for enterprises and command enterprise pricing. (Hint: even on B2, a petabyte isn't cheap.)