User:Vitzli

From Archiveteam

Saved stuff

  1. JBG Travels YouTube channel, partial download, 847 videos total: part 1, part 2, part 3.
    Several videos were either marked private or removed at the request of the channel owner's employer, although they contained only road footage.
  2. Encyclopedia Astronautica snapshot (2015-10-22); according to Alive... OR ARE THEY, the site is on the watchlist.
  3. Pole shift survival library: not updated since 2013, was quite popular among survival/prepping folks. Not endangered, as the website is still online, but the torrent is decaying.
  4. Amazon reviews webdata 1995-2013 — still available, but links were hidden.
  5. CGP Grey YouTube channel, tar archive per year: 2010, 2011, 2012, 2013, 2014, 2015
  6. SmarterEveryDay YouTube channel, tar archive per year: 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015

Prospecting IA.BAK collections

Tools required: Python 3 libraries/modules (internetarchive, ia-mine); jq for JSON processing; GNU parallel for running many jobs at once, for-each style.

An archive.org account (S3 keys) is required for the ia-mine and internetarchive (ia) tools.

2016-02-03 census

  • 10 shards
  • 79 collections
  • 142462 items total, 106054 unique items (my mistake: deduplicate the item list before starting a large batch)
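
The gap between total and unique items is the reason to deduplicate first. A minimal sketch of that step, assuming the per-collection identifier lists are plain text with one identifier per line; the function name dedupe_items and the paths in the usage comment are hypothetical. It uses sort -u rather than plain uniq, because uniq only drops adjacent duplicate lines and so needs sorted input anyway.

```shell
#!/bin/sh
# Merge several identifier lists (one identifier per line) into one
# de-duplicated, sorted list on stdout.
# Assumption: inputs are plain text lists, not JSON arrays.
dedupe_items() {
    # LC_ALL=C gives a stable byte-wise ordering regardless of locale;
    # sort -u sorts and removes duplicate lines in a single pass.
    LC_ALL=C sort -u -- "$@"
}

# Usage (hypothetical paths):
#   dedupe_items shard-*/*.items.json > all-unique.items.txt
#   wc -l < all-unique.items.txt    # number of unique items
```

Running this before mining means each item is fetched once, even when it appears in several collections.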

jq code

Remove 'collection' items:

parallel --jobs 4 'jq -r '"'"'select(.mediatype != "collection") | .identifier'"'"' '"$F_PREFIX"'/{}.col.json > '"$F_PREFIX"'/{}.items.json'

Remove 'uploader' field:

parallel --jobs 4 'jq -c '"'"'del(.metadata.uploader)'"'"' '"$F_PREFIX"'/{}.mined.json > '"SHARDS-20160203-cleaned/$F_PREFIX"'/{}.cleaned.json'
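
To see what the two filters do without the parallel quoting, here is a small sketch that runs them on made-up records. The sample data and the function names (sample_col, sample_mined, list_item_ids, strip_uploader) are hypothetical, and the record shapes are only an assumption about what the .col.json and .mined.json files contain. The first filter uses jq -r, which prints raw strings and has the same effect as stripping the double quotes afterwards.

```shell
#!/bin/sh
# Made-up search-result records, one JSON object per line (not census data).
sample_col() {
    printf '%s\n' \
        '{"identifier":"example-item","mediatype":"movies"}' \
        '{"identifier":"example-collection","mediatype":"collection"}'
}

# Made-up mined metadata record with an uploader field.
sample_mined() {
    printf '%s\n' \
        '{"metadata":{"identifier":"example-item","uploader":"someone@example.com"}}'
}

# Keep identifiers of everything that is not a collection.
# -r prints raw strings, so no surrounding double quotes.
list_item_ids() {
    jq -r 'select(.mediatype != "collection") | .identifier'
}

# Delete the uploader field; -c keeps one compact object per line.
strip_uploader() {
    jq -c 'del(.metadata.uploader)'
}

# Usage:
#   sample_col | list_item_ids       # prints: example-item
#   sample_mined | strip_uploader    # same record without "uploader"
```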