Difference between revisions of "Lytro"

Revision as of 19:29, 3 November 2018

Lytro
Proprietary image hosting service
URL http://pictures.lytro.com
Status Offline
Archiving status Partially saved
Archiving type Unknown
IRC channel #archiveteam-bs (on hackint)
Project lead Vitorio

Result

After testing to ensure the archived content would play back correctly, ~200GB of JSON and image files were recovered from Lytro's CDN and provided as WARCs to the Internet Archive. They were ingested in early 2018, and Lytro "living picture" JavaScript embeds captured in the Wayback Machine prior to November 30, 2017, should be viewable once again.

(Reverse-engineering and crawling of the entire site was not attempted, just the restoration of already-partially-archived content.)

Background

Lytro manufactured "light field" cameras, and offered free hosting for the exported "living picture" images on their pictures.lytro.com service. They recently discontinued the hosting service, breaking all embeds. See e.g. https://www.theverge.com/2017/12/6/16742314/lytro-focus-photos-support-cameras-illum

For an example of broken embeds, see e.g. https://www.theverge.com/2014/7/30/5949913/lytro-illum-review

Lytro at one point announced plans to open-source their viewer, but that never happened: https://www.lytro.com/press/releases/lytro-unleashes-interactive-power-of-living-pictures-to-the-web-with-new-lytro-webgl-player

The Internet Archive Wayback Machine captured many Lytro embeds and galleries, but none of these currently work. The embedded Lytro web player references several JSON files, which IA captured but did not parse, so none of the URLs referenced in the JSON were retrieved.

WARCs

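The following Internet Archive items hold ~200GB of JSON and image files, which should allow many of the Wayback Machine's captured Lytro embeds to work again:

https://archive.org/details/lytro-hosted-partial-missing-files
https://archive.org/details/lytro-hosted-partial-missing-files-1
https://archive.org/details/lytro-hosted-partial-missing-files-2
https://archive.org/details/lytro-hosted-partial-missing-files-3

The first item should have its mediatype changed to "web".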

Details

https://archive.org/details/lytro-hosted-partial-missing-files contains three sets of WARCs which capture many of the missing files. (The WARC files were captured using wget 1.19, which puts brackets around WARC-Target-URI headers. These WARCs were rewritten using warcio to remove those brackets. The original WARCs are in the BracketedWARCTargetURI folder.)

The lfes-not-in-ia-4.txt file in that item contains a list of ~1.2M URLs which are the image assets referenced in the JSON files.

https://archive.org/details/lytro-hosted-partial-missing-files-1 through -3 contain the captures of the ~1.2M URLs, which are supporting image files.

Worklog

pictures.lytro.com and lfe-cdn.lytro.com were downloaded from the Wayback Machine, using the date just prior to the shutdown as the cutoff timestamp:

$ ~/.gem/ruby/2.0.0/bin/wayback_machine_downloader https://lfe-cdn.lytro.com/ -t 20171129
$ ~/.gem/ruby/2.0.0/bin/wayback_machine_downloader https://pictures.lytro.com/ -t 20171129

A regex search across all of the pictures.lytro.com files was used to find lfe-cdn references:

["'\(]https?:\/\/lfe-cdn\.lytro\.com.*?["'\)]
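As a rough illustration, the same pattern can be applied from Python 3. This is only a sketch: the helper name `find_lfe_refs` is made up here, and the worklog does not say which tool actually ran the search.

```python
import re

# Same pattern as above: a quote or "(" followed by an lfe-cdn URL,
# matched lazily up to the next quote or ")".
LFE_REF = re.compile(r"""["'\(]https?:\/\/lfe-cdn\.lytro\.com.*?["'\)]""")


def find_lfe_refs(text):
    """Return lfe-cdn URLs found in text, with the surrounding delimiters stripped."""
    # find_lfe_refs is a hypothetical helper name, not part of the original worklog.
    return [match.strip("\"'()") for match in LFE_REF.findall(text)]
```

Because `.` does not match newlines by default, a reference has to open and close on the same line to be found.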

Downloaded all the URLs from that list which weren't already present:

>>> import os
>>> with open('lfe-urls-in-pictures.txt', 'r') as f:
...     lfes = f.readlines()
... 
>>> len(lfes)
47045
>>> fresh = []
>>> for a in lfes:
...     b = a.strip("\"'()\n")
...     if not os.path.exists(b.split('https://')[1]):
...             fresh.append(b)
... 
>>> len(fresh)
39798
>>> with open('lfes-not-in-ia-1.txt', 'w') as f:
...     f.write('\n'.join(fresh))
... 
>>> 

$ ~/bin/wget --ca-certificate=$HOME/Downloads/curl-7.57.0/lib/ca-bundle.crt -x --warc-file=lfes-not-in-ia-1 --warc-cdx --wait=1 --random-wait -i lfes-not-in-ia-1.txt 
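The filtering step in the transcript above could be written as a Python 3 function along these lines. It is a sketch, not the code that was run: `urls_missing_locally` and its `mirror_root` parameter are hypothetical, and it assumes the mirror is laid out as `host/path` under that root, as the transcript's `os.path.exists` check implies.

```python
import os


def urls_missing_locally(lines, mirror_root="."):
    """Return the URLs whose mirrored file does not exist under mirror_root."""
    # mirror_root is a hypothetical parameter; the transcript used the current directory.
    missing = []
    for line in lines:
        # Strip the quote/parenthesis delimiters left over from the regex match.
        url = line.strip("\"'()\n")
        if "://" not in url:
            continue  # blank or malformed line
        # 'https://host/path' -> 'host/path'
        local_path = url.split("://", 1)[1]
        if not os.path.exists(os.path.join(mirror_root, local_path)):
            missing.append(url)
    return missing
```

The result can be written out with `'\n'.join(...)` and passed to `wget -i` as in the command above.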

That list wasn't deduplicated; there were ~10k duplicate URLs out of ~40k. Also, there were a lot of 500 errors off the CDN, mostly for URLs without a "v2" in the URL. For example, these fail:

https://lfe-cdn.lytro.com/lfe/2020944a-25a4-11e3-abd1-22000a8914f9/carousel_preview.jpg
https://lfe-cdn.lytro.com/lfe/2516a14c-25a4-11e3-9166-1231393ff52e/output.html5.json

but if I rewrite the second one to:

https://lfe-cdn.lytro.com/lfe/2516a14c-25a4-11e3-9166-1231393ff52e/v2/output.html5.json

it exists.
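A minimal Python 3 sketch of the two observations above: de-duplicating the list while keeping its order, and rewriting a URL to its "v2" form. The function names are hypothetical, and the worklog only confirms the rewrite for the `output.html5.json` example; other filenames may not exist under `v2`.

```python
def dedupe(urls):
    """Drop repeated URLs while keeping first-seen order."""
    return list(dict.fromkeys(urls))


def to_v2(url):
    """Insert a 'v2' path segment before the filename, if it is not already there."""
    head, _, name = url.rpartition("/")
    if head.endswith("/v2"):
        return url
    return "{}/v2/{}".format(head, name)
```

For example, `to_v2` turns the failing `output.html5.json` URL above into the working `v2` one.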

The v2 (I guess) player is their "new" WebGL-based one, which asks for these paths in the JS, e.g.:

LYT.PICTURE_ORIGINAL_URL = "https://lfe-cdn.lytro.com/lfe/8112c254-17f8-11e4-8fa7-22000a0d84a4/v2/output.html5.json";
LYT.PICTURE_NORMAL_URL = "https://lfe-cdn.lytro.com/lfe/8112c254-17f8-11e4-8fa7-22000a0d84a4/v2/output.html5_normal.json";
LYT.PICTURE_SMALL_URL = "https://lfe-cdn.lytro.com/lfe/8112c254-17f8-11e4-8fa7-22000a0d84a4/v2/output.html5_small.json";
LYT.PICTURE_TINY_URL = "https://lfe-cdn.lytro.com/lfe/8112c254-17f8-11e4-8fa7-22000a0d84a4/v2/output.html5_tiny.json";
LYT.PREVIEW_URL = "https://lfe-cdn.lytro.com/lfe/8112c254-17f8-11e4-8fa7-22000a0d84a4/v2/player_preview.jpeg";

Went through all the URLs and pulled out all the UUID keys, to see whether any of those additional paths still need to be fetched.

>>> with open('lfe-urls-in-pictures.txt', 'r') as f:
...     lfes = f.readlines()
... 
>>> len(lfes)
47045
>>> v2s = []
>>> uuids = []
>>> for a in lfes:
...     b = a.strip("\"'()\n").split('/')
...     if len(b[4]) == 36:
...             uuids.append(b[4])
...     elif b[4] == 'announce':
...             uuids.append(b[5])
...     elif b[4] == 'players':
...             pass
...     else:
...             print b
... 
>>> len(uuids)
47007
>>> uuids = set(uuids)
>>> len(uuids)
10866
>>> import os
>>> for a in uuids:
...     for b in ['output.html5.json', 'output.html5_normal.json', 'output.html5_small.json', 'output.html5_tiny.json', 'player_preview.jpeg']:
...             if not os.path.exists('lfe-cdn.lytro.com/lfe/{}/v2/{}'.format(a, b)):
...                     v2s.append('https://lfe-cdn.lytro.com/lfe/{}/v2/{}'.format(a, b))
... 
>>> len(v2s)
30438
>>> len(set(v2s))
30438
>>> with open('lfes-not-in-ia-2.txt', 'w') as f:
...     f.write('\n'.join(v2s))
... 
>>> 

$ ~/bin/wget --ca-certificate=/Users/vitorio/Downloads/curl-7.57.0/lib/ca-bundle.crt -x --warc-file=lfes-not-in-ia-2 --warc-cdx --warc-max-size=1G --wait=1 --random-wait -i lfes-not-in-ia-2.txt
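For reference, the UUID extraction and URL generation above might look like this in Python 3. This is a sketch under the same assumptions as the transcript (the UUID is the fifth `/`-separated field, or the sixth after `announce`); the function names are made up, and the check for files already on disk is left out.

```python
# The five per-picture files requested by the player JS shown earlier.
V2_FILES = [
    "output.html5.json",
    "output.html5_normal.json",
    "output.html5_small.json",
    "output.html5_tiny.json",
    "player_preview.jpeg",
]


def uuid_from_url(url):
    """Return the UUID path segment of an lfe-cdn URL, or None if there is none."""
    # 'https://lfe-cdn.lytro.com/lfe/<uuid>/...' splits into
    # ['https:', '', 'lfe-cdn.lytro.com', 'lfe', '<uuid>', ...]
    parts = url.split("/")
    if len(parts) > 4 and len(parts[4]) == 36:
        return parts[4]
    if len(parts) > 5 and parts[4] == "announce":
        return parts[5]
    return None


def v2_candidates(urls):
    """Build every v2 URL for each distinct UUID found in urls."""
    uuids = sorted({u for u in map(uuid_from_url, urls) if u})
    return [
        "https://lfe-cdn.lytro.com/lfe/{}/v2/{}".format(u, name)
        for u in uuids
        for name in V2_FILES
    ]
```

Collecting the UUIDs into a set first means each picture contributes its five candidates only once, however often it appears in the input.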

Could also back up schema.lytro.com, an S3 bucket which stores all the JSON schemas for the JSON files.

Note that the JSON files include an `asset_base` URL reference to lfe-cdn.lytro.com, but that isn't actually part of the JSON schema, and it doesn't appear to be checked by the last version of the player JS, so no need to rewrite it.

Now that we have all the UUIDs, let's check all the folders in the CDN directory to make sure we haven't missed any JSON downloads.

>>> assets = os.listdir('lfe-cdn.lytro.com/assets')
>>> lfe = os.listdir('lfe-cdn.lytro.com/lfe')
>>> len(lfe)
10896
>>> len(assets)
34
>>> uuids = assets + lfe
>>> len(uuids)
10930
>>> uuids = [x for x in uuids if len(x) == 36]
>>> len(uuids)
10927
>>> uuids = set(uuids)
>>> len(uuids)
10894
>>> v2s = []
>>> for a in uuids:
...     for b in ['output.html5.json', 'output.html5_normal.json', 'output.html5_small.json', 'output.html5_tiny.json', 'player_preview.jpeg']:
...             if not os.path.exists('lfe-cdn.lytro.com/lfe/{}/v2/{}'.format(a, b)):
...                     v2s.append('https://lfe-cdn.lytro.com/lfe/{}/v2/{}'.format(a, b))
... 
>>> len(v2s)
165
>>> with open('lfes-not-in-ia-3.txt', 'w') as f:
...     f.write('\n'.join(v2s))
... 
>>>

$ ~/bin/wget --ca-certificate=/Users/vitorio/Downloads/curl-7.57.0/lib/ca-bundle.crt -x --warc-file=lfes-not-in-ia-3 --warc-cdx --warc-max-size=1G --wait=1 --random-wait -i lfes-not-in-ia-3.txt
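The transcript above treats any 36-character directory name as a UUID. A stricter check is possible with the standard `uuid` module; the sketch below is an optional variation, not something the worklog did, and `is_uuid_name` is a hypothetical name.

```python
import uuid


def is_uuid_name(name):
    """True if name is a canonical hyphenated UUID string (36 characters)."""
    if len(name) != 36:
        return False
    try:
        # Round-trip through uuid.UUID to reject non-hex or oddly placed hyphens.
        return str(uuid.UUID(name)) == name.lower()
    except ValueError:
        return False
```

It could stand in for the `len(x) == 36` filter, e.g. `[x for x in names if is_uuid_name(x)]`.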

IA erroneously served at least some of the JSON files without gzip-decompressing them:

lfe-cdn.lytro.com/lfe/02c39bca-1146-11e4-a2b8-22000ab80aeb/v2/output.html5_small.json
lfe-cdn.lytro.com/lfe/0f16b1d2-1128-11e4-9af7-22000a8a0b0d/v2/output.html5_small.json
lfe-cdn.lytro.com/lfe/175c8c20-1206-11e4-970d-22000ab80aeb/v2/output.html5_small.json
lfe-cdn.lytro.com/lfe/1e4f739e-1206-11e4-970d-22000ab80aeb/v2/output.html5_small.json
lfe-cdn.lytro.com/lfe/28ad3d90-112e-11e4-be98-22000a2d8f66/v2/output.html5_small.json
lfe-cdn.lytro.com/lfe/36066900-1131-11e4-8bd7-22000a2d8f66/v2/output.html5_small.json
lfe-cdn.lytro.com/lfe/493f6fc8-1201-11e4-bfe5-22000a2b9e9d/v2/output.html5_small.json
lfe-cdn.lytro.com/lfe/8a4b167c-1132-11e4-95dc-22000a2d8f66/v2/output.html5_small.json
lfe-cdn.lytro.com/lfe/8bb527ea-1133-11e4-8bd7-22000a2d8f66/v2/output.html5_small.json
lfe-cdn.lytro.com/lfe/90a1a35a-1200-11e4-9a06-22000a2b9e9d/v2/output.html5_small.json
lfe-cdn.lytro.com/lfe/90d88df2-1200-11e4-bfe5-22000a2b9e9d/v2/output.html5_small.json
lfe-cdn.lytro.com/lfe/91427488-1200-11e4-a093-22000ab80aeb/v2/output.html5_small.json
lfe-cdn.lytro.com/lfe/91f617fe-1133-11e4-9af7-22000a8a0b0d/v2/output.html5_small.json
lfe-cdn.lytro.com/lfe/926eb8c0-1139-11e4-84d3-22000a2b9e9d/v2/output.html5_small.json
lfe-cdn.lytro.com/lfe/92bb1d64-1139-11e4-b404-22000a2b9e9d/v2/output.html5_small.json
lfe-cdn.lytro.com/lfe/9664811e-1204-11e4-9573-22000a2b9e9d/v2/output.html5_small.json
lfe-cdn.lytro.com/lfe/96a5e7ee-1204-11e4-bfe5-22000a2b9e9d/v2/output.html5_small.json
lfe-cdn.lytro.com/lfe/970dea7e-1204-11e4-b34f-22000a2b9e9d/v2/output.html5_small.json
lfe-cdn.lytro.com/lfe/9728bc6e-1204-11e4-b8c1-22000ab80aeb/v2/output.html5_small.json
lfe-cdn.lytro.com/lfe/9b00b186-112b-11e4-966b-22000a2d8f66/v2/output.html5_small.json
lfe-cdn.lytro.com/lfe/a1ea6af6-11fc-11e4-b258-22000ab80aeb/v2/output.html5_small.json
lfe-cdn.lytro.com/lfe/db00af6a-0ed9-11e4-8a95-22000a4184f0/v2/output.html5_small.json
lfe-cdn.lytro.com/lfe/f1c74502-0ed8-11e4-967c-12313b04e812/v2/output.html5_small.json
lfe-cdn.lytro.com/lfe/f2509528-0ed8-11e4-8023-12313b04e812/v2/output.html5_small.json
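One way to cope with such files is to check for the gzip magic bytes before parsing. This Python 3 sketch assumes the JSON is UTF-8 encoded, and the function name is hypothetical; the worklog itself instead tries a plain parse first and falls back to `gzip.open`.

```python
import gzip
import json


def load_json_maybe_gzipped(path):
    """Load a JSON file that may or may not still be gzip-compressed."""
    with open(path, "rb") as f:
        raw = f.read()
    # Every gzip stream starts with the two bytes 0x1f 0x8b.
    if raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    # Assumes UTF-8 content; this is not confirmed by the source.
    return json.loads(raw.decode("utf-8"))
```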

The filenames generated by Lytro's processes seem to be fairly unique per embed, so there is no point in generating URLs that aren't represented in the JSON files.

Parse all the JSON files and generate the list of image dependencies:

>>> import os, nested_lookup, json, gzip
>>> urls = []
>>> assets = []
>>> for root, dirs, files in os.walk('lfe-cdn.lytro.com/lfe'):
...     for a in files:
...             if os.path.splitext(a)[1] == '.json':
...                     try:
...                             with open(os.path.join(root, a), 'r') as f:
...                                     j = json.load(f)
...                     except:
...                             print os.path.join(root, a)
...                             try:
...                                     with gzip.open(os.path.join(root, a), 'rb') as f:
...                                             j = json.load(f)
...                             except:
...                                     print 'not gzip', os.path.join(root, a)
...                                     continue
...                     imgs = nested_lookup.nested_lookup('imageUrl', j)
...                     for i in imgs:
...                             assets.append(i)
...                             urls.append('https://{}'.format(os.path.join(root, i)))
... 
>>> len(urls)
1285432
>>> len(set(urls))
1285432
>>> len(assets)
1285432
>>> len(set(assets))
275454
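The lookup above relies on the third-party `nested_lookup` package. A dependency-free Python 3 stand-in is sketched below; `find_key` and `image_urls` are hypothetical names, and the traversal is not guaranteed to match `nested_lookup` in every edge case. It uses `posixpath.join` so URLs keep forward slashes on any platform.

```python
import posixpath


def find_key(node, key):
    """Yield every value stored under key anywhere in nested dicts and lists."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            yield from find_key(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


def image_urls(json_dir, doc):
    """Resolve each imageUrl in doc relative to the directory holding the JSON file."""
    # json_dir is the mirrored path, e.g. 'lfe-cdn.lytro.com/lfe/<uuid>/v2'
    return [
        "https://" + posixpath.join(json_dir, name)
        for name in find_key(doc, "imageUrl")
    ]
```

The counts above (1,285,432 distinct URLs but only 275,454 distinct asset names) show why the names are resolved against each JSON file's own directory rather than treated as global.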

wget 1.19 writes WARC-Target-URI headers with angle brackets around the URL, breaking some AT/IA/WB software. Rewrite these headers using warcio.

>>> from warcio.archiveiterator import ArchiveIterator
>>> from warcio.warcwriter import WARCWriter
>>> output = open('lfes-not-in-ia-1.warc.gz', 'wb')
>>> writer = WARCWriter(output, gzip=True)
>>> with open('brackets.lfes-not-in-ia-1.warc.gz', 'rb') as stream:
...     for record in ArchiveIterator(stream):
...             if 'WARC-Target-URI' in record.rec_headers:                     
...                     record.rec_headers['WARC-Target-URI'] = record.rec_headers['WARC-Target-URI'].lstrip('<').rstrip('>')
...             writer.write_record(record)                                     
... 
>>> output.close()
>>> output = open('lfes-not-in-ia-2-00000.warc.gz', 'wb')
>>> writer = WARCWriter(output, gzip=True)
>>> with open('brackets.lfes-not-in-ia-2-00000.warc.gz', 'rb') as stream:
...     for record in ArchiveIterator(stream):
...             if 'WARC-Target-URI' in record.rec_headers:                     
...                     record.rec_headers['WARC-Target-URI'] = record.rec_headers['WARC-Target-URI'].lstrip('<').rstrip('>')
...             writer.write_record(record)
... 
>>> output.close()
>>> output = open('lfes-not-in-ia-2-meta.warc.gz', 'wb')
>>> writer = WARCWriter(output, gzip=True)
>>> with open('brackets.lfes-not-in-ia-2-meta.warc.gz', 'rb') as stream:
...     for record in ArchiveIterator(stream):
...             if 'WARC-Target-URI' in record.rec_headers:
...                     record.rec_headers['WARC-Target-URI'] = record.rec_headers['WARC-Target-URI'].lstrip('<').rstrip('>')
...             writer.write_record(record)
... 
>>> output.close()
>>> output = open('lfes-not-in-ia-3-00000.warc.gz', 'wb')
>>> writer = WARCWriter(output, gzip=True)
>>> with open('brackets.lfes-not-in-ia-3-00000.warc.gz', 'rb') as stream:
...     for record in ArchiveIterator(stream):
...             if 'WARC-Target-URI' in record.rec_headers:
...                     record.rec_headers['WARC-Target-URI'] = record.rec_headers['WARC-Target-URI'].lstrip('<').rstrip('>')
...             writer.write_record(record)
... 
>>> output.close()
>>> output = open('lfes-not-in-ia-3-meta.warc.gz', 'wb')
>>> writer = WARCWriter(output, gzip=True)
>>> with open('brackets.lfes-not-in-ia-3-meta.warc.gz', 'rb') as stream:
...     for record in ArchiveIterator(stream):
...             if 'WARC-Target-URI' in record.rec_headers:                     
...                     record.rec_headers['WARC-Target-URI'] = record.rec_headers['WARC-Target-URI'].lstrip('<').rstrip('>')
...             writer.write_record(record)                                     
... 
>>> output.close()
>>> ^D
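The five near-identical passes above could be folded into one helper. The Python 3 sketch below is an illustration rather than the code that was run: the function names are made up, it uses warcio's `get_header` and `replace_header` instead of item access, and it strips the brackets only when both are present.

```python
def strip_angle_brackets(value):
    """Remove one enclosing <...> pair from a header value, if present."""
    if value.startswith("<") and value.endswith(">"):
        return value[1:-1]
    return value


def rewrite_warc(src_path, dst_path):
    """Copy a WARC, un-bracketing every WARC-Target-URI header along the way."""
    # warcio is a third-party package; it is imported here so that
    # strip_angle_brackets can be used without it installed.
    from warcio.archiveiterator import ArchiveIterator
    from warcio.warcwriter import WARCWriter

    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        writer = WARCWriter(dst, gzip=True)
        for record in ArchiveIterator(src):
            uri = record.rec_headers.get_header("WARC-Target-URI")
            if uri is not None:
                record.rec_headers.replace_header(
                    "WARC-Target-URI", strip_angle_brackets(uri)
                )
            writer.write_record(record)
```

Each file would then be handled with a call such as `rewrite_warc('brackets.lfes-not-in-ia-1.warc.gz', 'lfes-not-in-ia-1.warc.gz')`.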