Difference between revisions of "The WARC Ecosystem"

From Archiveteam
(Created page with "Everything about the WARC format and the tools that support it. == Tools == * name 1 license 2 programming language 3 test suite 4 # of authors 5 description *...")
 
Line 12: Line 12:
   5 description


* wget v1.14+
  * GPL v3+
  * C
  * Has a test suite, but it does not test any WARC functionality
  * 2+ according to the changelog
  * A non-interactive network downloader.


* https://github.com/internetarchive/warc
Line 71: Line 77:
* http://archive.org/web/researcher/cdx_legend.php
* cdx from warc - https://github.com/rajbot/CDX-Writer


== Deprecated ==

Revision as of 19:50, 12 April 2013

Everything about the WARC format and the tools that support it.


Tools

  • name
 1 license
 2 programming language
 3 test suite
 4 # of authors
 5 description
  • wget v1.14+
 * GPL v3+
 * C
 * Has a test suite, but it does not test any WARC functionality
 * 2+ according to the changelog
 * A non-interactive network downloader.
  • https://github.com/internetarchive/warc
 * GPL v2
 * Python
 * Appears to have a test suite - https://github.com/internetarchive/warc/blob/master/warc/tests/test_warc.py
 * 3 committers on GitHub
 * Library for working with WARC files
 - Docs: http://warc.readthedocs.org/en/latest/
 * BSD
 * Python
 * NO TEST SUITE
 * 1 author
 * A simple HTTP proxy that saves all HTTP traffic to a file
 * MIT License
 * Python 2.6
 * NO TEST SUITE
 * 4 committers
 * WARC validator, dump, search, index
 * no license information
 * Python
 * NO TEST SUITE
 * 1 author
 * WARC viewer for browsing the contents of a WARC file.
 - Needs a Firefox add-on installed to work
 * no license information
 * Python
 * NO TEST SUITE
 * 1 author
 * Merges many small WARCs into a large one
 * no license information
 * Python
 * NO TEST SUITE
 * 1 author
 * An HTTP-based WARC-to-ZIP converter
 * Split into 2 new repos: ia-web-commons and ia-hadoop-tools
 * Generates 50 GB WARC files from existing WARC files
 * Uploads to archive.org
 * no license information
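
Two of the tools above combine existing WARC files into larger ones. The idea can be sketched in a few lines of Python, since a WARC file is a plain concatenation of records and a .warc.gz with per-record compression is a concatenation of gzip members. This is only an illustration and is not taken from any of the listed tools: the function name merge_warcs, the output naming scheme and the max_bytes default are all made up for the example. It assumes the inputs are either all gzip-compressed or all uncompressed.

```python
import gzip
import shutil
from pathlib import Path


def merge_warcs(sources, dest_dir, prefix="merged", max_bytes=50 * 1000**3):
    """Concatenate WARC files into outputs of roughly max_bytes each.

    Hypothetical helper for illustration only. A single input larger
    than max_bytes still gets an output file of its own.
    Returns the list of output paths that were written.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    out = None
    written = 0
    try:
        for src in map(Path, sources):
            size = src.stat().st_size
            # Start a new output when the current one would overflow.
            if out is None or (written and written + size > max_bytes):
                if out is not None:
                    out.close()
                path = dest_dir / f"{prefix}-{len(outputs):05d}.warc.gz"
                out = open(path, "wb")
                outputs.append(path)
                written = 0
            # Byte-level copy: records (or gzip members) are kept intact.
            with open(src, "rb") as fh:
                shutil.copyfileobj(fh, out)
            written += size
    finally:
        if out is not None:
            out.close()
    return outputs
```

Calling merge_warcs(["a.warc.gz", "b.warc.gz"], "out") would write out/merged-00000.warc.gz. The sketch does not validate records, rewrite warcinfo records or upload anything, so it covers only the concatenation step.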

Deprecated