Difference between revisions of "The WARC Ecosystem"
Revision as of 19:50, 12 April 2013
Everything about the WARC format and the tools that support it.
Tools
- name
* 1 license
* 2 programming language
* 3 test suite
* 4 # of authors
* 5 description
- wget v1.14+
* GPL v3+
* C
* Has a test suite, but it does not test any WARC functionality
* 2+ according to the changelog
* A non-interactive network downloader.
- https://github.com/internetarchive/warc
* GPL v2
* Python
* Appears to have a test suite - https://github.com/internetarchive/warc/blob/master/warc/tests/test_warc.py
* 3 committers on GitHub
* Library for working with WARC files - docs: http://warc.readthedocs.org/en/latest/

* BSD
* Python
* NO TEST SUITE
* 1 author
* A simple HTTP proxy that saves all HTTP traffic to a file

* MIT License
* Python 2.6
* NO TEST SUITE
* 4 committers
* WARC validator, dump, search, index

* No license information
* Python
* NO TEST SUITE
* 1 author
* WARC viewer for browsing the contents of a WARC file - needs a Firefox add-on installed to work

* No license information
* Python
* NO TEST SUITE
* 1 author
* Merges many small WARCs into a large one

* No license information
* Python
* NO TEST SUITE
* 1 author
* An HTTP-based WARC-to-ZIP converter

* Split into 2 new repos: ia-web-commons & ia-hadoop-tools

* Generates 50 GB WARC files from existing WARC files
* Uploads to archive.org
* No license information
- http://archive.org/web/researcher/cdx_legend.php
- cdx from warc - https://github.com/rajbot/CDX-Writer
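The list above describes one tool only as merging many small WARCs into a large one. One way such a merge can work is plain byte-level concatenation, since a WARC file is a sequence of records and a .warc.gz is normally one gzip member per record. The following is a minimal sketch under that assumption, using only the Python standard library; the function name `merge_warcs` and the file names are hypothetical, and this is not the listed tool's actual implementation.

```python
import shutil
from pathlib import Path
from typing import Iterable


def merge_warcs(sources: Iterable[str], destination: str) -> int:
    """Append each source file to destination byte-for-byte.

    Returns the total number of bytes copied. Assumes all inputs use the
    same encoding (all plain .warc or all per-record gzipped .warc.gz) and
    that destination is not one of the sources.
    """
    total = 0
    with open(destination, "wb") as out:
        for src in sources:
            with open(src, "rb") as inp:
                # Stream in chunks so large WARCs are never fully loaded in memory.
                shutil.copyfileobj(inp, out)
            total += Path(src).stat().st_size
    return total
```

A call would look like `merge_warcs(["a.warc.gz", "b.warc.gz"], "merged.warc.gz")`. The sketch does no validation, so running a WARC validator over the result is still a good idea.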
Deprecated
- http://archive-access.sourceforge.net/warc/ - bunch of docs
- https://code.google.com/p/warc-tools/ - old, discontinued project