Difference between revisions of "The WARC Ecosystem"
Revision as of 19:53, 12 April 2013
Everything about the WARC format and the tools that support it.
Information
- https://en.wikipedia.org/wiki/Web_ARChive
- https://webarchive.jira.com/wiki/pages/viewpage.action?pageId=4817
- http://www.archiveteam.org/index.php?title=Wget_with_WARC_output
- ISO 28500 - The WARC File Format: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf
- http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
Tools
- Fields listed for each tool: license, programming language, test suite, number of authors, description.
- wget v1.14+. License: GPL v3+. Language: C. Test suite: yes, but it does not test any WARC functionality. Authors: 2+ according to the changelog. Description: a non-interactive network downloader.
- License: GPL v2. Language: Python. Test suite: appears to have one (https://github.com/internetarchive/warc/blob/master/warc/tests/test_warc.py). Authors: 3 committers on GitHub. Description: library for working with WARC files; docs at http://warc.readthedocs.org/en/latest/
- License: BSD. Language: Python. Test suite: none. Authors: 1. Description: a simple HTTP proxy that saves all HTTP traffic to a file.
- License: MIT. Language: Python 2.6. Test suite: none. Authors: 4 committers. Description: WARC validator, dump, search, and index tools.
- License: no license information. Language: Python. Test suite: none. Authors: 1. Description: WARC viewer for browsing the contents of a WARC file; needs a Firefox add-on installed to work.
- License: no license information. Language: Python. Test suite: none. Authors: 1. Description: merges many small WARCs into a large one.
- License: no license information. Language: Python. Test suite: none. Authors: 1. Description: an HTTP-based WARC-to-ZIP converter.
- Split into 2 new repos: ia-web-commons and ia-hadoop-tools.
- Generates 50 GB WARC files from existing WARC files and uploads them to archive.org. License: no license information.
- http://archive.org/web/researcher/cdx_legend.php
- CDX from WARC: https://github.com/rajbot/CDX-Writer
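Several of the tools above dump, search, or index WARC files. As a rough illustration of what that involves, here is a minimal Python sketch that walks the records of an uncompressed WARC stream and reports each record's byte offset. It relies only on the basic record layout (a `WARC/` version line, named header fields, a blank line, `Content-Length` bytes of content, then two CRLFs). The function names `iter_warc_records` and `make_record` are hypothetical, not taken from any tool listed here. The sketch makes simplifying assumptions: no gzip handling (real WARCs are usually compressed per record), no folded header lines, and header names matched exactly as written. The offsets it produces are a simplified index, not the actual CDX format.

```python
import io
import uuid


def make_record(record_type, uri, body):
    """Build one uncompressed WARC/1.0 record (hypothetical helper for the demo)."""
    header_lines = [
        "WARC/1.0",
        f"WARC-Type: {record_type}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "WARC-Date: 2013-04-12T19:53:00Z",
        f"WARC-Target-URI: {uri}",
        f"Content-Length: {len(body)}",
    ]
    head = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    # A record is: header block, blank line, content block, then two CRLFs.
    return head + body + b"\r\n\r\n"


def iter_warc_records(stream):
    """Yield (offset, headers, payload) for each record in a seekable binary stream."""
    while True:
        offset = stream.tell()
        line = stream.readline()
        if not line:
            return  # end of stream
        if line in (b"\r\n", b"\n"):
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected WARC version line at offset {offset}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Read exactly Content-Length bytes, so payload bytes are never parsed as headers.
        payload = stream.read(int(headers["Content-Length"]))
        yield offset, headers, payload


if __name__ == "__main__":
    data = make_record("resource", "http://example.com/a", b"hello")
    data += make_record("resource", "http://example.com/b", b"world!!")
    for offset, headers, payload in iter_warc_records(io.BytesIO(data)):
        print(offset, headers["WARC-Target-URI"], len(payload))
```

Because each record carries its own length, plain concatenation of uncompressed WARC files yields another parseable record stream, which is the property that merging tools depend on.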
Deprecated
- http://archive-access.sourceforge.net/warc/ - bunch of docs
- https://code.google.com/p/warc-tools/ - Old, discontinued shit