The WARC Ecosystem
Everything about the WARC format and the tools that support it.
Information
- https://en.wikipedia.org/wiki/Web_ARChive
- https://webarchive.jira.com/wiki/pages/viewpage.action?pageId=4817
- http://www.archiveteam.org/index.php?title=Wget_with_WARC_output
- ISO 28500 - The WARC File Format
- http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
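For orientation, the record layout described in the references above can be sketched in a few lines of Python. This is a minimal illustration, not a conforming implementation: it assumes WARC/1.0, writes only a handful of named fields, and skips compression, digests, and validation. The helper names `build_record` and `parse_record` are hypothetical and do not come from any of the tools listed on this page.

```python
import uuid
from datetime import datetime, timezone

CRLF = b"\r\n"


def build_record(record_type, target_uri, payload):
    """Serialize one uncompressed WARC/1.0 record (illustrative only)."""
    headers = [
        ("WARC-Type", record_type),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Length", str(len(payload))),
    ]
    head = b"WARC/1.0" + CRLF
    for name, value in headers:
        head += f"{name}: {value}".encode("utf-8") + CRLF
    # A blank line ends the header; two CRLFs follow the content block.
    return head + CRLF + payload + CRLF + CRLF


def parse_record(data):
    """Split a single record into (version, headers, content block)."""
    head, _, rest = data.partition(CRLF + CRLF)
    lines = head.split(CRLF)
    version = lines[0].decode("utf-8")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.decode("utf-8").partition(":")
        headers[name.strip()] = value.strip()
    length = int(headers["Content-Length"])
    return version, headers, rest[:length]


record = build_record("resource", "http://example.com/", b"hello")
version, headers, body = parse_record(record)
print(version, headers["WARC-Type"], body)
```

For real work, one of the libraries in the Tools list below is a better starting point, since it will handle gzip members and the full set of record types.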
Tools
- name
* license * programming language * test suite * # of authors * description
- wget v1.14+
* GPL v3+ * C * Has a test suite, but it does not test any WARC functionality * 2+ authors according to the changelog * A non-interactive network downloader.
* GPL v2 * Python * Appears to have a test suite - https://github.com/internetarchive/warc/blob/master/warc/tests/test_warc.py * 3 committers on GitHub * Library for working with WARC files - docs: http://warc.readthedocs.org/en/latest/
* BSD * Python * NO TEST SUITE * 1 author * A simple HTTP proxy that saves all HTTP traffic to a file
* MIT License * Python 2.6 * NO TEST SUITE * 4 committers * WARC validator, dump, search, index
* no license information * Python * NO TEST SUITE * 1 author * WARC viewer for browsing the contents of a WARC file - needs a Firefox add-on installed to work
* no license information * Python * NO TEST SUITE * 1 author * Merges many small WARCs into a large one
* no license information * Python * NO TEST SUITE * 1 author * An HTTP-based WARC-to-ZIP converter
* split into 2 new repos: ia-web-commons & ia-hadoop-tools
* Generates 50 GB WARC files from existing WARC files * Uploads to archive.org * no license information
- http://archive.org/web/researcher/cdx_legend.php
- CDX from WARC - https://github.com/rajbot/CDX-Writer
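To make the CDX legend linked above more concrete, here is a rough sketch of reading a CDX index in Python. It assumes the common 11-field header ` CDX N b a m s k r M S V g` and treats the first character of the header line as the field delimiter. The dictionary key names are my own labels for the legend letters, `parse_cdx` is a hypothetical helper, and the sample line is made up for illustration rather than taken from a real index.

```python
# Readable labels for the legend letters used in the common 11-field header.
FIELD_NAMES = {
    "N": "massaged_url",
    "b": "date",
    "a": "original_url",
    "m": "mime_type",
    "s": "response_code",
    "k": "checksum",
    "r": "redirect",
    "M": "meta_tags",
    "S": "compressed_record_size",
    "V": "compressed_offset",
    "g": "file_name",
}


def parse_cdx(lines):
    """Turn CDX lines (header first) into a list of dicts."""
    header = lines[0]
    delimiter = header[0]                      # first character is the delimiter
    letters = header[1:].split(delimiter)[1:]  # drop the leading "CDX" token
    keys = [FIELD_NAMES.get(letter, letter) for letter in letters]
    rows = []
    for line in lines[1:]:
        if line.strip():
            rows.append(dict(zip(keys, line.split(delimiter))))
    return rows


sample = [
    " CDX N b a m s k r M S V g",
    "com,example)/ 20130101000000 http://example.com/ text/html 200 ABCDEF - - 1234 5678 example.warc.gz",
]
for row in parse_cdx(sample):
    print(row["original_url"], row["compressed_offset"], row["file_name"])
```

The offset and file name fields are what let a replay tool seek straight to a record inside a large WARC instead of scanning the whole file.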
Deprecated
- http://archive-access.sourceforge.net/warc/ - bunch of docs
- https://code.google.com/p/warc-tools/ - Old, discontinued shit