Difference between revisions of "The WARC Ecosystem"

From Archiveteam
Revision as of 20:07, 12 April 2013

Everything about the WARC format and the tools that support it.

Information


Tools

  • name
 1 license
 2 programming language
 3 test suite
 4 # of authors
 5 description
  • wget v1.14+
 * GPL v3+
 * C
 * Has a test suite, but it does not cover any WARC functionality
 * 2+ according to the changelog
 * A non-interactive network downloader. Note that wget generates duplicate record IDs in WARC files.
 * GPL v2
 * Python
 * appears to have a test suite - https://github.com/internetarchive/warc/blob/master/warc/tests/test_warc.py
 * 3 committers on GitHub
 * library to work with WARC files
 - docs http://warc.readthedocs.org/en/latest/
 * BSD
 * python
 * NO TEST SUITE
 * 1 author
 * a simple HTTP proxy that saves all HTTP traffic to a file
 * MIT License
 * python 2.6
 * NO TEST SUITE
 * 4 committers
 * warc validator, dump, search, index
 * no license information
 * python
 * NO TEST SUITE
 * 1 author
 * WARC viewer for browsing the contents of a WARC file.
 - needs a firefox addon installed to work
 * no license information
 * python
 * NO TEST SUITE
 * 1 author
 * Merge many small warcs into a large one
 * no license information
 * python
 * NO TEST SUITE
 * 1 author
 * An HTTP-based warc-to-zip converter
 * split into 2 new repos: ia-web-commons & ia-hadoop-tools
 * Generates 50 GB WARC files from existing WARC files
 * Uploads to archive.org
 * no license information

Deprecated

The WARC format

  • A .warc file contains one or more WARC records.
  • Compression is optional.
  • When compression is used, each record is compressed separately with gzip and stored as its own gzip "member"; a single gzip file may contain multiple members.
  • Compressed WARC files end in .warc.gz.
  • According to the guidelines, WARC files should top out at 1 GB.
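
The per-record gzip layout can be illustrated with a minimal Python sketch. It assumes the whole file fits in memory, and the helper name `iter_gzip_members` and the placeholder payloads are illustrative only; they are not taken from any of the tools listed above.

```python
import gzip
import zlib


def iter_gzip_members(data: bytes):
    """Yield the decompressed payload of each gzip member in `data`."""
    while data:
        # wbits=31 tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(wbits=31)
        payload = decomp.decompress(data)
        payload += decomp.flush()
        yield payload
        # Bytes after the member just read belong to the next member.
        data = decomp.unused_data


# Two placeholder "records" (not real WARC records), each compressed
# as its own gzip member and then concatenated into one blob.
records = [b"first record\r\n\r\n", b"second record\r\n\r\n"]
blob = b"".join(gzip.compress(r) for r in records)

members = list(iter_gzip_members(blob))
```

Because each record is its own member, a reader that knows a record's byte offset can start decompressing there without touching the rest of the file. `gzip.decompress(blob)` would also work, but it returns all members joined together and loses the record boundaries.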


WARC record

  • header
  • content block
  • two newlines (CRLF CRLF)
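
As a rough sketch of this layout, the following Python assembles a record from a list of named fields and a content block. The function name `build_record` and the request body are hypothetical, and the sketch does no validation of field names or required fields.

```python
CRLF = b"\r\n"


def build_record(fields, block: bytes) -> bytes:
    """Serialize one record: header, blank line, content block, two newlines."""
    header_lines = [b"WARC/1.0"]
    for name, value in fields:
        header_lines.append(f"{name}: {value}".encode("utf-8"))
    # Content-Length is the size of the content block in bytes.
    header_lines.append(f"Content-Length: {len(block)}".encode("ascii"))
    header = CRLF.join(header_lines) + CRLF
    # A blank line ends the header; two newlines end the record.
    return header + CRLF + block + CRLF + CRLF


# Hypothetical content block for illustration only.
block = b"GET / HTTP/1.1\r\nHost: xbox.gamespy.com\r\n\r\n"
record = build_record(
    [
        ("WARC-Type", "request"),
        ("WARC-Date", "2013-04-02T16:12:40Z"),
        ("WARC-Record-ID", "<urn:uuid:08d9edb9-0ab8-4352-ba56-6cbbd590f34f>"),
    ],
    block,
)
```

A reader should use Content-Length to find the end of the content block rather than searching for the trailing newlines, since the block itself may contain blank lines.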

WARC record header

The beginning of a WARC record: a first line declaring the record to be in the WARC format with a given version number, followed by lines of named fields, up to a blank line.

Example:

 WARC/1.0
 WARC-Type: request
 WARC-Target-URI: http://xbox.gamespy.com/
 Content-Type: application/http;msgtype=request
 WARC-Date: 2013-04-02T16:12:40Z
 WARC-Record-ID: <urn:uuid:08d9edb9-0ab8-4352-ba56-6cbbd590f34f>
 WARC-IP-Address: 213.248.112.146
 WARC-Warcinfo-ID: <urn:uuid:2b6ad3f1-efab-4e37-8faa-fc8ad112692f>
 WARC-Block-Digest: sha1:T6PJSZTTP7HBNA6OFZACXAFK25GGLVT4
 Content-Length: 150

WARC named fields

A set of elements, each consisting of a name, a colon, and a value; long values are continued on indented lines.
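
A minimal Python sketch of parsing a header into its version line and named fields, including values continued on indented lines, might look like this. The function name `parse_warc_header` is hypothetical, and the folded Content-Type value in the sample is made up for illustration; it is not how the example record above was written.

```python
def parse_warc_header(text: str):
    """Return (version, fields) where fields is a list of (name, value) pairs."""
    lines = text.splitlines()
    version = lines[0].strip()
    fields = []
    for line in lines[1:]:
        if not line.strip():
            break  # a blank line ends the header
        if line[0] in " \t" and fields:
            # An indented line continues the previous field's value.
            name, value = fields[-1]
            fields[-1] = (name, value + " " + line.strip())
            continue
        # Split on the first colon only; values such as URIs contain colons.
        name, _, value = line.partition(":")
        fields.append((name.strip(), value.strip()))
    return version, fields


sample = (
    "WARC/1.0\r\n"
    "WARC-Type: request\r\n"
    "WARC-Target-URI: http://xbox.gamespy.com/\r\n"
    "Content-Type: application/http;\r\n"
    " msgtype=request\r\n"
    "Content-Length: 150\r\n"
    "\r\n"
)
version, fields = parse_warc_header(sample)
```

The fields are kept as a list of pairs rather than a dictionary so that a field name appearing more than once is not silently overwritten.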

WARC content block