Difference between revisions of "Talk:INTERNETARCHIVE.BAK"

From Archiveteam
Revision as of 16:42, 2 March 2015

A note on the end-user drives

I feel it is really critical that the drives or directories sitting in the end-user's location be absolutely readable, as a file directory, containing the files, even if that directory is inside a .tar or .zip or .gz file. Making it into an encrypted item should not happen, unless we make a VERY SPECIFIC, and redundant channel of such a thing. --Jscott 00:01, 2 March 2015 (EST)

  • A possibility is that it's encrypted but easy to decrypt, so that it's harder to fake hashes for it but it can still be unpacked into useful items even without the main support network there.

Potential solutions to the storage problem

  • Tahoe-LAFS - decentralized (mostly), client-side encrypted file storage grid
    • Requires central introducer and possibly gateway nodes
    • Any storage node could perform a Sybil attack until a feature for client-side storage node choice is added to Tahoe.
  • STORJ - blockchain-based private cloud storage.
  • git-annex - allows tracking copies of files in git without them being stored in a repository
    • Also provides a way to know what sources exist for a given item. git-annex is not (AFAIK) locked to any specific storage medium. -- yipdw

Right now, git-annex seems to be in the lead. Besides being flexible about the sources of the material in question, the developer is a member of Archive Team AND has been addressing all the big-picture problems for over a year.

Opinion from Joey Hess of git-annex, raising points that would rule it out as the entire solution:

  • It would need to be under a million files. git gets janky with too many files in a repository. tar files are fine of course.
  • As the model is essentially a shared git repo that anyone in the world can write to, there will be bad actors. Stupid pushes would need to be filtered out.
  • You want periodic verification that nodes still have their content. In git-annex terms, a fsck. Currently git-annex does not record fsck results in the git repo, and I think it would need to for this application (it's doable)

Other anticipated problems

  • Users tampering with data - how do we know data a user stored has not been modified since it was pulled from IA?
    • Proposed solution: have multiple people make their own collection of checksums of IA files. --Mhazinsk 00:10, 2 March 2015 (EST)
      • All IA items already include checksums in the _files.xml. So there could be an effort to back up these xml files in more locations than the data itself (should be feasible since they are individually quite small).
  • "Dark" items (e.g. the "Internet Records" collection)
    • There are classifications of items within the Archive that should be considered for later waves, and not this initial effort. That includes dark items, television, and others.
      • It seems like this would include a lot of what we would want to back up the most though, e.g. a substantial percentage of the books scanned are post-1923 and not public
  • Data which may be illegal in certain countries/jurisdictions and expose volunteers to legal risk (terrorist propaganda, pornography, etc.)
    • Interesting! Several solutions come to mind. --Jscott 02:35, 2 March 2015 (EST)
  • User bandwidth (particularly upstream)
  • Latency in swapping disks - assume we may be using cold storage
    • Tiered storage? e.g. one for cloud, one for online trusted users' storage, and one for cold storage
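
For the tampering concern above, a check of a volunteer's copy against the checksums in an item's _files.xml could be sketched roughly as follows. This is only an illustration: it assumes each `file` element carries a `name` attribute and an `md5` child element, and the function names (`load_checksums`, `verify_item`) are made up for this example. It also assumes the _files.xml being used comes from a trusted copy, not from the same disk being checked.

```python
import hashlib
import xml.etree.ElementTree as ET
from pathlib import Path


def load_checksums(files_xml_text):
    """Map file name -> expected MD5 hex digest from a _files.xml document.

    Assumed layout: <files><file name="..."><md5>...</md5></file></files>.
    Entries without an md5 element are skipped.
    """
    root = ET.fromstring(files_xml_text)
    checksums = {}
    for entry in root.iter("file"):
        name = entry.get("name")
        md5 = entry.findtext("md5")
        if name and md5:
            checksums[name] = md5.strip().lower()
    return checksums


def md5_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large items do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_item(item_dir, files_xml_text):
    """Return {file name: 'ok' | 'modified' | 'missing'} for one item directory."""
    results = {}
    for name, expected in load_checksums(files_xml_text).items():
        path = Path(item_dir) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of(path) == expected:
            results[name] = "ok"
        else:
            results[name] = "modified"
    return results
```

Because the directory on the volunteer's drive stays a plain file tree, a check like this needs nothing beyond the item directory and a copy of its _files.xml.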

Project Lab and Corner

  • Projects are much easier with the Internet Archive tool, available here.
  • There is a _files.xml in each item indicating which files are originals and which are derivatives.
  • Please step forward and write a script that, given a collection, finds all the items in that collection and adds up all the sizes of the original files.

Some good test collections to do this against:

  • Computer Magazines [1]
  • Software Library [2]
  • Ephemeral Films [3]
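
As a starting point for the size script requested above, here is a minimal sketch of the counting step only. It assumes each `file` element in _files.xml has a `source` attribute equal to `original` for original files and a `size` child element in bytes; both function names are hypothetical. Listing the items in a collection and downloading each item's _files.xml (for example with the Internet Archive tool) is deliberately left out, so the functions take the XML text as input.

```python
import xml.etree.ElementTree as ET


def original_bytes(files_xml_text):
    """Sum the sizes of files marked source="original" in one _files.xml."""
    root = ET.fromstring(files_xml_text)
    total = 0
    for entry in root.iter("file"):
        if entry.get("source") != "original":
            continue  # skip derivatives and anything unlabelled
        size = entry.findtext("size")
        # Some entries may carry no size; count those as zero.
        if size and size.strip().isdigit():
            total += int(size.strip())
    return total


def collection_original_bytes(files_xml_by_item):
    """Given {item identifier: _files.xml text}, return (total, per-item sizes)."""
    per_item = {
        identifier: original_bytes(text)
        for identifier, text in files_xml_by_item.items()
    }
    return sum(per_item.values()), per_item
```

Keeping the per-item sizes alongside the total makes it easier to spot which items dominate a collection when planning how to split it across volunteers.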