The WARC Ecosystem
Everything about the WARC format and the tools that support it.
- ISO 28500 - The WARC File Format
Each entry below lists, in order: (1) license, (2) programming language, (3) test suite, (4) documentation, (5) number of authors, (6) description.
- wget v1.14+
* GPL v3+ * C * Has a test suite, but it does not test any WARC functionality * Man pages, website, blog posts all over the net * 2+ according to the changelog * A non-interactive network downloader. wget also generates duplicate record IDs in WARC files.
* GPL v2 * Python * Appears to have a test suite: https://github.com/internetarchive/warc/blob/master/warc/tests/test_warc.py * A readme with examples, online at http://warc.readthedocs.org/en/latest/ * 3 committers on GitHub * Library to work with WARC files
* BSD * Python * NO TEST SUITE * A readme file * 1 author * A simple HTTP proxy that saves all HTTP traffic to a file
* MIT License * Python 2.6 * NO TEST SUITE * A readme file * 4 committers * WARC validator, dump, search, index
* no license information * Python * NO TEST SUITE * A readme file * 1 author * WARC viewer for browsing the contents of a WARC file; needs a Firefox add-on installed to work
* no license information * Python * NO TEST SUITE * A readme file * 1 author * Merges many small WARCs into a large one
* no license information * Python * NO TEST SUITE * A readme file * 1 author * An HTTP-based WARC-to-zip converter
* GPL v3 * Python 3 * Has a test suite * A readme file * 1 author * WARCAT: Web ARChive (WARC) Archiving Tool
- https://github.com/internetarchive/archive-commons split into 2 new repos: ia-web-commons & ia-hadoop-tools
* Generates 50 GB WARC files from existing WARC files * Uploads to archive.org * no license information
- CDX from WARC - https://github.com/rajbot/CDX-Writer
- http://archive-access.sourceforge.net/warc/ - a collection of docs
- https://code.google.com/p/warc-tools/ - old, discontinued
The WARC format
- A .warc file contains one or more WARC records.
- Compression is optional.
- When compression is used, each record is compressed separately via gzip. A gzip file supports multiple "members", so the compressed records can simply be concatenated.
- Compressed WARCs end in .warc.gz
- According to the guidelines, WARC files should top out at 1 GB.
- content block
- two newlines
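The point about gzip "members" can be illustrated with a minimal Python sketch using only the standard library. The helper names `write_members` and `iter_members` are hypothetical, and the record bodies are placeholders rather than real WARC records; the sketch only shows that independently compressed members can be concatenated and then read back either as one stream or one member at a time.

```python
import gzip
import zlib


def write_members(records):
    """Compress each record as its own gzip member and concatenate them."""
    return b"".join(gzip.compress(record) for record in records)


def iter_members(data):
    """Yield the decompressed bytes of each gzip member, one at a time."""
    while data:
        # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
        decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)
        yield decompressor.decompress(data)
        # Whatever follows the end of this member belongs to the next one.
        data = decompressor.unused_data


# Placeholder payloads standing in for WARC records (assumption).
records = [b"record one\r\n\r\n", b"record two\r\n\r\n"]
blob = write_members(records)

# A plain gzip reader sees the concatenation as a single stream.
whole = gzip.decompress(blob)

# Reading member by member recovers the individual records.
members = list(iter_members(blob))
```

Because each member is self-contained, a reader that knows the byte offset of a record can start decompressing there without touching the rest of the file.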
WARC record header
The beginning of a WARC record, consisting of one first line declaring the record to be in the WARC format with a given version number, followed by lines of named fields up to a blank line.
    WARC/1.0
    WARC-Type: request
    WARC-Target-URI: http://xbox.gamespy.com/
    Content-Type: application/http;msgtype=request
    WARC-Date: 2013-04-02T16:12:40Z
    WARC-Record-ID: <urn:uuid:08d9edb9-0ab8-4352-ba56-6cbbd590f34f>
    WARC-IP-Address: 22.214.171.124
    WARC-Warcinfo-ID: <urn:uuid:2b6ad3f1-efab-4e37-8faa-fc8ad112692f>
    WARC-Block-Digest: sha1:T6PJSZTTP7HBNA6OFZACXAFK25GGLVT4
    Content-Length: 150
WARC named fields
Set of elements consisting of a name, a colon, and a value, with long values continued on indented lines.
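
As a rough illustration of the header layout described above (a version line, then named fields up to a blank line, with long values continued on indented lines), here is a minimal Python sketch. The function name `parse_warc_header` is hypothetical and not taken from any of the tools listed; it does no validation beyond the basics, and the sample text is a shortened, hand-folded variant of the example header.

```python
def parse_warc_header(text):
    """Parse a WARC record header into (version, [(name, value), ...]).

    Fields are returned as a list of pairs rather than a dict, since this
    sketch does not assume that field names are unique.
    """
    lines = text.splitlines()
    if not lines or not lines[0].startswith("WARC/"):
        raise ValueError("not a WARC record header")
    version = lines[0].split("/", 1)[1].strip()

    fields = []
    for line in lines[1:]:
        if line == "":
            break  # a blank line ends the header
        if line[0] in " \t":
            # An indented line continues the previous field's value.
            if not fields:
                raise ValueError("continuation line before any field")
            name, value = fields[-1]
            fields[-1] = (name, value + " " + line.strip())
        else:
            name, sep, value = line.partition(":")
            if not sep:
                raise ValueError("field without a colon: %r" % line)
            fields.append((name.strip(), value.strip()))
    return version, fields


# Shortened sample; the folded Content-Type line is an illustrative assumption.
sample = (
    "WARC/1.0\r\n"
    "WARC-Type: request\r\n"
    "WARC-Target-URI: http://xbox.gamespy.com/\r\n"
    "Content-Type: application/http;\r\n"
    " msgtype=request\r\n"
    "Content-Length: 150\r\n"
    "\r\n"
)
version, fields = parse_warc_header(sample)
```

Splitting on the first colon only keeps values such as URIs and `urn:uuid:` identifiers intact.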