Difference between revisions of "Talk:INTERNETARCHIVE.BAK"

Revision as of 06:36, 6 March 2015

A note on the end-user drives

I feel it is really critical that the drives or directories sitting in the end-user's location be absolutely readable, as a file directory, containing the files, even if that directory is inside a .tar or .zip or .gz file. Making it into an encrypted item should not happen, unless we make a VERY SPECIFIC and redundant channel for such a thing. --Jscott 00:01, 2 March 2015 (EST)

A possibility is that it's encrypted but easy to decrypt, so that it's harder to fake hashes for it, but it can still be unpacked into useful items even without the main support network there.

Potential solutions to the storage problem

Tahoe-LAFS

Tahoe-LAFS - decentralized (mostly), client-side encrypted file storage grid
- Requires central introducer and possibly gateway nodes
- Any storage node could perform a Sybil attack until a feature for client-side storage node choice is added to Tahoe.

git-annex

git-annex - allows tracking copies of files in git without them being stored in a repository
- Also provides a way to know what sources exist for a given item. git-annex is not (AFAIK) locked to any specific storage medium. -- yipdw

Right now, git-annex seems to be in the lead. Besides being flexible about the sources of the material in question, the developer is a member of Archive Team AND has been addressing all the big-picture problems for over a year.

Full worked proposed design for using git-annex for this: https://git-annex.branchable.com/design/iabackup/ -- joeyh

Other

STORJ - blockchain based private cloud storage.
IPFS - "You can loosely think of ipfs as git + bittorrent + dht + web."
Permacoin - Repurposing Bitcoin Work for Data Preservation
Compact Proofs of Retrievability

Project Lab and Corner

Projects are much easier with the Internet Archive tool, available here.
There is a _files.xml in each item indicating which files are originals and which are derivatives.
Please step forward and write a script that, given a collection, finds all the items in that collection and adds up all the sizes of the original files.
- https://gist.github.com/EricIO/f77f094032110a7b51e7 running `python ia-collection-size.py <collection-name>` will give you the size of the original files and the total.
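The per-item step of such a script can be sketched as follows. This is only an illustration, not the code from the gist above: it assumes each `<file>` entry in _files.xml carries a `source` attribute (with the value `original` for original files) and a `<size>` child element giving the size in bytes. The function name `sum_file_sizes` and the sample XML are made up for the example.

```python
import xml.etree.ElementTree as ET
from typing import Tuple


def sum_file_sizes(files_xml: str) -> Tuple[int, int]:
    """Return (original_bytes, total_bytes) for one item's _files.xml text.

    Assumed layout: <files><file name=... source=...><size>N</size></file>...</files>
    """
    root = ET.fromstring(files_xml)
    original = 0
    total = 0
    for entry in root.iter("file"):
        size_text = entry.findtext("size")
        if size_text is None:
            # Skip entries that do not report a size.
            continue
        size = int(size_text)
        total += size
        if entry.get("source") == "original":
            original += size
    return original, total


# Invented sample, for illustration only.
SAMPLE = """<files>
  <file name="film.mpeg" source="original"><size>1000</size></file>
  <file name="film.mp4" source="derivative"><size>250</size></file>
  <file name="film_meta.xml" source="metadata"><size>50</size></file>
</files>"""

print(sum_file_sizes(SAMPLE))
```

Summing these pairs over every item in a collection would give the "Original Files Size" and "Total Size" figures; how the list of items is fetched is left out here.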

Some results so far:

Collection	Link to Collection	Number of Items	Total Size	Original Files Size	% of Total
Ephemeral Films	https://archive.org/details/ephemera	2932	10971882551213 (10.9 TB)	9453160185702 (9.4 TB)	86%
Computer Magazines	https://archive.org/details/computermagazines	13066	3392870124693 (3.3 TB)	1897118607284 (1.8 TB)	55%
Software Library	https://archive.org/details/softwarelibrary	27861	63140205942 (63.5 GB)	61142015946 (61.5 GB)	96%
Prelinger Archive	https://archive.org/details/prelinger	6477	14603406806901 (14.6 TB)	13792309835153 (13.7 TB)
Grateful Dead	https://archive.org/details/GratefulDead	10006
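For reference, the "% of Total" column can be reproduced from the two byte counts. The sketch below assumes the percentage is truncated to a whole number rather than rounded, which is a guess that happens to match the filled-in rows; `original_share` is a hypothetical helper name.

```python
def original_share(total_bytes: int, original_bytes: int) -> int:
    """Original files as a whole-number percentage of the total (truncated)."""
    return original_bytes * 100 // total_bytes


# Byte counts taken from the Ephemeral Films row.
print(original_share(10971882551213, 9453160185702))
```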

Rough Count

According to one of the internal counters, there are 24,598,934 "items" at the Archive. This number should be considered rough and suspect but can give some insight into the scope of the project. A more prev

Case Studies

If you implement it, will users use it?

BOINC

Why do people participate in BOINC projects?
Why do projects use BOINC?
How does BOINC keep track of work units?
How does BOINC deal with bad actors?
Why do BOINC projects share project users and points among other projects?
What makes people download the client software and install it?

Stack Overflow

Q & A sites existed before Stack Overflow. What makes Stack Overflow so successful?
How does Stack Overflow eliminate bad questions and answers?
What makes Stack Exchange grow so large?
How does it deal with spam?

@@ Line 64: / Line 64: @@
 ! Collection
 ! Link to Collection
+! Number of Items
 ! Total Size
 ! Original Files Size
@@ Line 70: / Line 71: @@
 | Ephemeral Films
 | [https://archive.org/details/ephemera]
+| 2932
 | 10971882551213 (10.9tb)
 | 9453160185702 (9.4tb)
@@ Line 76: / Line 78: @@
 | Computer Magazines
 | [https://archive.org/details/computermagazines]
+| 13066
 | 3392870124693 (3.3tb)
 | 1897118607284 (1.8tb)
@@ Line 82: / Line 85: @@
 | Software Library
 | [https://archive.org/details/softwarelibrary]
+| 27861
 | 63140205942 (63.5gb)
 | 61142015946 (61.5gb)
 | 96%
+|-
+| Prelinger Archive
+| [https://archive.org/details/prelinger]
+| 6477
+| 14603406806901 (14.6tb)
+| 13792309835153 (13.7tb)
+|-
+| Grateful Dead
+| [https://archive.org/details/GratefulDead]
+| 10006
 |}
 == Rough Count ==
-According to one of the internal counters, there are 24,598,934 "items" at the Archive. This number should be considered rough and suspect but can give some insight into the scope of the project.
+According to one of the internal counters, there are 24,598,934 "items" at the Archive. This number should be considered rough and suspect but can give some insight into the scope of the project. A more prev
 == Case Studies ==
