Internet Archive Census

The '''Internet Archive Censuses''' are unofficial attempts to count and account for the files available on the Internet Archive, both the directly downloadable public files and the private files that are available through interfaces like the Wayback Machine or the TV News Archive. The purpose of the project is multi-fold: collecting the reported hashes of all the files, determining the sizes of various collections, and setting priorities for backing up portions of the Internet Archive's data stores.


The first Census was conducted in March 2015. Its results are on the Archive at {{IA id|ia-bak-census_20150304}}.


A re-check of the items in the first census was done in January 2016. The results are on IA under {{IA id|IACensusData}}.

A third census was done in April 2016, {{IA id|ia_census_201604}}, based on an updated list of identifiers and including the sha1 hashes as well as the md5s.


== Purpose of the Census ==

The original census was called for as a stepping stone in the [[INTERNETARCHIVE.BAK]] project, an experiment to have Archive Team back up the Internet Archive. While officially the Internet Archive has 21 petabytes of information in its data stores (as of March 2015), some of that data is related to system overhead, or is stream-only/not available. With a full run-through of the entire collection of items at the Archive, the next phases of the INTERNETARCHIVE.BAK experiment (testing methodologies) can move forward.

The data is also useful for talking about what the Internet Archive does, and what kinds of items are in the stacks: collections can be found with very large or manageable amounts of data, and audiences/researchers outside the backup experiment can do their own data access and acquisition. Search engines can be experimented with, as well as data visualization.
== Contents of the Census ==


Each Census is a very large collection of JSON-formatted records, consisting of a subset of the metadata of each item in the archive. The metadata is downloaded with Jake Johnson's [https://github.com/jjjake/iamine ia-mine utility], then processed with the '''jq''' tool. Like all such projects, ''the data should not be considered perfect'', although a large percentage should accurately reflect the site. Since there have been three censuses, some limited comparisons of growth or file change can be made. (There are also earlier reports of total files or other activity, but none at the level of detail of the JSON material the Censuses provide.)
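
For orientation, here is a minimal sketch of the pipeline's shape, not the actual command (it assumes ia-mine reads an identifier list from a file and emits one JSON metadata record per item; the real filters are the make_census_data.jq scripts listed below):

 # feed the itemlist to ia-mine, keep a metadata subset per item
 ia-mine itemlist.txt \
   | jq -c '{id: .metadata.identifier, collection: .metadata.collection,
             files: [.files[]? | {name, size, md5, sha1}]}' \
   > census_data.json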


The first two censuses used the same itemlist: {{IA file|ia-bak-census_20150304|metamgr-norm-ids-20150304205357.txt.gz}} (135.7M compressed; 372M uncompressed). It contains 14,926,080 item identifiers (including exactly one duplicate, https://archive.org/details/e-dv212_boston_14_harvardsquare_09-05_001.ogg, for some bizarre reason). The 2nd census made a sorted (and uncompressed) version of it available: {{IA file|IACensusData|metamgr-norm-ids-20150304205357_sorted.txt}}. The 3rd census used an updated itemlist: {{IA file|ia_census_201604|identifier_list_20160411221100}} (486.9M uncompressed), which is sorted and contains 19,134,984 item identifiers.
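
Both of those counts are straightforward to re-check from the published list, e.g. (a sketch, assuming GNU coreutils):

 zcat metamgr-norm-ids-20150304205357.txt.gz | wc -l           # 14,926,080 identifiers
 zcat metamgr-norm-ids-20150304205357.txt.gz | sort | uniq -d  # prints the lone duplicate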


The main data file for the first census is {{IA file|ia-bak-census_20150304|public-file-size-md_20150304205357.json.gz}} (6073671780 bytes (5.7G) compressed; 22522862598 bytes (21G) uncompressed). It contains one item without any identifier at all, which, from the file names, appears to be {{IA id|lecture_10195}} (which had its _meta.xml file re-created soon after the census was run). Oddly, it contains only 13,075,195 normal string identifiers, with 113 duplicates.
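
Those figures can be re-derived with the same .id handling idiom the extraction commands below use, since identifiers appear sometimes as strings and sometimes as arrays (a sketch; first-census-ids.txt is just a scratch file name):

 zcat public-file-size-md_20150304205357.json.gz \
   | jq -r '(.id[0]? // .id) // empty' > first-census-ids.txt
 wc -l < first-census-ids.txt                   # identifiers found (the no-id item is skipped)
 sort first-census-ids.txt | uniq -d | wc -l    # identifiers appearing more than once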


The main data file for the second census is {{IA file|IACensusData|file-size-md_20150304205357_recheck_20160120112813.json.gz}} (8796507585 bytes (9G) compressed; 35966005026 bytes (36G) uncompressed). An additional data file covers items re-grabbed after failing the first time: {{IA file|IACensusData|file-size-md_20150304205357_recheck_leftovers_20160225231428.json}} (10486657 bytes (11M)).


The main data file for the third census is {{IA file|ia_census_201604|census_data_20160411221100_public.json.gz}} (11219518595 bytes (10.4G) compressed). It contains metadata only for items all of whose files can be downloaded without restriction. There is also a data file {{IA file|ia_census_201604|census_data_20160411221100_private.json.gz}} (5993343635 bytes (5.6G) compressed) containing metadata for the other items from the itemlist for which metadata was available.
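
As one basic example of working with these files, here is a sketch of totaling the size of all public files (it assumes .size may be a string, as in the Archive's metadata API, so it is coerced defensively; awk's floating point makes the total approximate):

 # sum file sizes within each item record, then total across items
 zcat census_data_20160411221100_public.json.gz \
   | jq '[.files[]? | .size? | tonumber? // 0] | add // 0' \
   | awk '{t += $1} END {printf "%.2f PB\n", t / 1e15}'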
 
The second and third censuses also include tab-separated-value files listing only the identifiers, file paths, and md5 hashes from the main data files (two from the third census, two from the second, and one from the first census's main data file, included in the second census's item for historical reasons):
* In the 2nd census item:
** {{IA file|IACensusData|file-hashes_20150304205357_recheck_20160120112813.tsv.gz}} (7218470882 bytes (7G) compressed)
** {{IA file|IACensusData|file-hashes_20150304205357_recheck_leftovers_20160225231428.tsv}} (7066587 bytes (7M))
** {{IA file|IACensusData|public-file-hashes_20150304205357_unsorted.tsv.gz}} (4791601837 bytes (5G) compressed)
* In the 3rd census item:
** {{IA file|ia_census_201604|file_hashes_md5_20160411221100_public.tsv.gz}} (5212673170 bytes (4.9G) compressed)
** {{IA file|ia_census_201604|file_hashes_md5_20160411221100_private.tsv.gz}} (2690158285 bytes (2.5G) compressed)
 
The third census also includes similar files for the sha1 hashes: {{IA file|ia_census_201604|file_hashes_sha1_20160411221100_public.tsv.gz}} (6120162493 bytes (5.7G) compressed) and {{IA file|ia_census_201604|file_hashes_sha1_20160411221100_private.tsv.gz}} (3202141568 bytes (3G) compressed).
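
The hash TSVs make duplication questions cheap to ask; for example, a sketch of counting files that are byte-for-byte copies of another file (this assumes the md5 is the third tab-separated column, per the identifier/path/hash ordering described above, and that equal MD5s mean equal files):

 # md5 assumed to be the third tab-separated column
 zcat file_hashes_md5_20160411221100_public.tsv.gz \
   | cut -f3 | sort | uniq -c \
   | awk '$1 > 1 {dup += $1 - 1} END {print dup}'   # files that are copies of another file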
 
The retrieved itemlist from the original census, {{IA file|ia-bak-census_20150304|all-ids-got-sorted.txt.gz}} (91215211 bytes (87M) compressed; 389853688 bytes (372M) uncompressed), contains 14,921,581 item identifiers, with no duplicates.


The un-retrieved itemlist from the original census, {{IA file|ia-bak-census_20150304|unretrievable-items.txt}} (141247 bytes), contains 4,508 items, with no duplicates.
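
These lists can be cross-checked against one another, though note the published counts do not quite reconcile: 14,921,581 retrieved plus 4,508 unretrievable is 14,926,089, ten more than the 14,926,079 unique identifiers in the original list. A sketch, assuming bash and C-locale sorting:

 # bash process substitution; C-locale sort so comm's ordering matches
 export LC_ALL=C
 comm -23 <(zcat metamgr-norm-ids-20150304205357.txt.gz | sort -u) \
          <(zcat all-ids-got-sorted.txt.gz | sort -u) \
   | wc -l    # identifiers in the full list that were never retrieved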


The second census also includes three auxiliary lists of identifiers: {{IA file|IACensusData|recheck_leftovers_20160225231428_identifiers}} (71052 bytes (70K)), the 2,886 identifiers in the leftovers re-grab; {{IA file|IACensusData|recheck_identifiers_dark}} (7875713 bytes (8M)), 256,352 previously-used (i.e. "dark") identifiers; and {{IA file|IACensusData|deleted_identifiers}} (1566 bytes (2K)), 68 identifiers only identifiable as having been previously used by looking at the /history/ page.
 
The third census also includes six scripts written during the process of generating it, which may be useful in future ones:
*{{IA file|ia_census_201604|do_census.sh}}
*{{IA file|ia_census_201604|do_census_without_ia_mine.sh}}
*{{IA file|ia_census_201604|extract_file_hashes.sh}}
*{{IA file|ia_census_201604|make_census_data.jq}}
*{{IA file|ia_census_201604|make_census_data_v2.jq}}
*{{IA file|ia_census_201604|make_file_hashes.jq}}


== Some Relevant Information from the Census ==

Based on the output of the Census:
* The size of the listed data is 14.23 petabytes.
* The census only contains "original" data, not derivations created by the system. (For example, if a .AVI file is uploaded, the census only counts the .AVI, and not a .MP4 or .GIF derived from the original file).
* The vast majority of the data is compressed in some way. By far the largest kind of file is gzip, with 9PB uploaded! Most files that are not in an archive format are compressed videos, music, pictures, etc.
* The largest single file (that is not just a tar of other files) is TELSEY_004.MOV (449GB), in item [https://archive.org/details/TELSEY_004 TELSEY_004] in the [https://archive.org/details/xfrstn xfrstn] collection.
* There are 22,596,286 files which are copies of other files. The duplicate files take up 1.06PB of space. (Assuming all files with the same MD5 are duplicates.)
* The largest duplicated file is all-20150219205226/part-0235.cdx.gz (195GB) in item [https://archive.org/details/wbsrv-0235-1 wbsrv-0235-1]. The entire wbsrv-0235-1 item is a duplicate of wbsrv-0235-0; that's 600GB. This is intentional: these items are part of the waybackcdx collection, used to re-check already-archived URLs in the Wayback Machine, and the whole index is duplicated to decrease the risk of loss.

== Extracting data ==

As hinted by the IA guys, the '''jq''' tool is well-suited to working with the census.

Here is a command line that will generate a file containing "md5 size collection url" format lines for everything in the census:

 zcat public-file-size-md_20150304205357.json.gz | ./jq --raw-output '(.collection[]? // .collection) as $coll | (.id[]? // .id) as $id | .files[] | "\(.md5)\t\(.size)\t\($coll)\thttps://archive.org/download/\($id)/\(.name)"' > md5_collection_url.txt

Some files are in multiple collections, and even in multiple items. The above command line generates all the permutations in those cases, and so outputs 296 million lines. Here is a variant that picks a single item and collection when a file is in multiple ones; it outputs 177 million lines:


 jq --raw-output '(.collection[0]? // .collection) as $coll | (.id[0]? // .id) as $id | .files[] | "\(.md5)\t\(.size)\t\($coll)\thttps://archive.org/download/\($id)/\(.name)"'
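
The resulting file is easy to slice further; for example, a sketch of listing the largest collections by bytes, using the md5/size/collection/url column order of the md5_collection_url.txt generated above:

 # columns of md5_collection_url.txt: md5, size, collection, url
 awk -F'\t' '{bytes[$3] += $2} END {for (c in bytes) printf "%.3f TB\t%s\n", bytes[c] / 1e12, c}' md5_collection_url.txt \
   | sort -rn | head
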
== See Also ==
* {{IA item|engbooksall}} - identifiers of all English language texts with djvutext files as of 4/11/11 (32 MB uncompressed text file)
* {{IA item|text_identifiers}} - multiple files of text identifiers per year, generated 12/11/2009
* {{IA item|IA_book_ids}} - 7 MB list of identifiers of books, generated 2008
* {{IA item|archiveteam_census_2016}} and {{IA item|archiveteam_census_2017}} - monthly lists of all the (searchable) identifiers on IA.
** [https://gitlab.com/bwn/cron-census-identifiers Source code for the script used to generate them].
* [https://petertodd.org/2017/carbon-dating-the-internet-archive-with-opentimestamps How OpenTimestamps 'Carbon Dated' (almost) The Entire Internet With One Bitcoin Transaction] (May 25, 2017)


{{Navigation box}}
