Talk:INTERNETARCHIVE.BAK

From Archiveteam

A note on the end-user drives

I feel it is really critical that the drives or directories sitting in the end-user's location be absolutely readable, as a file directory, containing the files, even if that directory is inside a .tar, .zip, or .gz file. Making it into an encrypted item should not happen, unless we make a VERY SPECIFIC and redundant channel for such a thing. --Jscott 00:01, 2 March 2015 (EST)

Potential solutions to the storage problem

  • Tahoe-LAFS - decentralized (mostly), client-side encrypted file storage grid
    • Requires central introducer and possibly gateway nodes
    • Any storage node could perform a Sybil attack until a feature for client-side storage node choice is added to Tahoe.
  • git-annex - allows tracking copies of files in git without them being stored in a repository
    • Also provides a way to know what sources exist for a given item. git-annex is not (AFAIK) locked to any specific storage medium. -- yipdw

Right now, git-annex seems to be in the lead. Besides being flexible about the sources of the material in question, the developer is a member of Archive Team AND has been addressing all the big-picture problems for over a year.

Other anticipated problems

  • Users tampering with data - how do we know data a user stored has not been modified since it was pulled from IA?
    • Proposed solution: have multiple people make their own collection of checksums of IA files. --Mhazinsk 00:10, 2 March 2015 (EST)
    • All IA items already include checksums in their _files.xml. So there could be an effort to back up these XML files in more locations than the data itself (this should be feasible, since they are individually quite small).
  • "Dark" items (e.g. the "Internet Records" collection)
    • There are classifications of items within the Archive that should be considered for later waves, and not this initial effort. That includes dark items, television, and others.
      • It seems like this would include a lot of what we would want to back up the most, though; e.g. a substantial percentage of the scanned books are post-1923 and not in the public domain.
  • Data which may be illegal in certain countries/jurisdictions and expose volunteers to legal risk (terrorist propaganda, pornography, etc.)
    • Interesting! Several solutions come to mind. --Jscott 02:35, 2 March 2015 (EST)
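
As a rough illustration of the tampering check discussed above (comparing a volunteer's stored files against the checksums in an item's _files.xml), here is a minimal Python sketch. It assumes the XML has a `<file name="...">` element per file with an `<md5>` child; the function names (`load_expected_md5`, `md5_of`, `verify_item`) are made up for this example and are not part of any existing tool. It only checks MD5, so it shows the idea rather than a complete verification scheme.

```python
import hashlib
import xml.etree.ElementTree as ET
from pathlib import Path


def load_expected_md5(files_xml_text):
    """Map file name -> lowercase MD5 hex digest.

    Assumes a _files.xml-style layout: <file name="..."><md5>...</md5></file>.
    Entries without a name or an md5 child are skipped.
    """
    root = ET.fromstring(files_xml_text)
    expected = {}
    for entry in root.iter("file"):
        name = entry.get("name")
        md5 = entry.findtext("md5")
        if name and md5:
            expected[name] = md5.strip().lower()
    return expected


def md5_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_item(item_dir, files_xml_text):
    """Return {file name: 'ok' | 'modified' | 'missing'} for one item directory."""
    results = {}
    for name, want in load_expected_md5(files_xml_text).items():
        path = Path(item_dir) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of(path) == want:
            results[name] = "ok"
        else:
            results[name] = "modified"
    return results
```

Calling `verify_item("/path/to/item", xml_text)` with a separately stored copy of the XML would flag any file whose contents no longer match. The check is only as trustworthy as the copy of the XML used, which is why keeping those small files in more places than the data itself matters.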