Wget with WARC output


From the discussion about Working with ARCHIVE.ORG, we learn that it is important to save not just files but also HTTP headers. With Wget, that's difficult. With a few tricks you can keep the response headers, but there is no option to save the request headers. You also lose the headers of responses that don't produce a saved page: Wget doesn't save redirects or 404 responses.

That should be easier. I (Alard) have been working on a way to have Wget write its results to a WARC (Web ARChive file format) file, just like Heritrix and other archiving tools. With the WARC format, it's possible to save both the request and the response headers. It also provides a clean way to store redirects and 404 responses.

There is an additional advantage: if Wget writes these headers to a WARC file, it is no longer necessary to use the --save-headers option to save them at the top of each downloaded file. There is no need to remove these headers afterwards to produce a clean copy: the mirrors produced by Wget are usable without post-processing.
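
For comparison, here is a rough sketch of the post-processing step that WARC output makes unnecessary. It assumes a text file that was saved with --save-headers, so that the response headers come first and a blank line separates them from the body. The function name strip_saved_headers is made up for this example and is not part of Wget.

```shell
# Hypothetical helper (not part of Wget): print a file saved with
# --save-headers without its leading HTTP response headers.
# Assumes a text file whose headers end at the first blank line.
strip_saved_headers() {
    awk '
        body { print; next }         # past the headers: copy lines through
        { sub(/\r$/, "") }           # header lines may end in CR LF
        $0 == "" { body = 1 }        # first blank line ends the headers
    ' "$1"
}
```

You would call it as strip_saved_headers index.html > index.clean.html. Binary downloads would need a different approach, since awk works line by line.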

About the modified version of Wget

I modified the current development version of Wget and added the WARC-writing code to it. I used the open source warc-tools library to provide the WARC storage functions.

Current experimental version: wget-warc-20110704-1.12-2507.tar.bz2

GitHub repository: https://github.com/alard/wget-warc/

Warning: This is all very experimental and I'm not even sure if this is a good idea. But please give it a try and let me know what you think!

Compiling

To compile this modified version of Wget, you'll need to install the uuid library (e.g. sudo apt-get install uuid-dev). You can then configure and build Wget with

./configure && make

Usage

To download a file and save the request and response data to a WARC file, run this:

src/wget "http://www.archiveteam.org/" --warc-file="at.warc"

This will download the file to index.html, but it will also create a file at.warc. If you open this file, you will see that it contains the request and response headers (of the initial redirect and of the wiki homepage) and the HTML data.
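
If you don't want to read through the whole file, a small helper can list the records in it. The following is only a sketch: warc_list_records is a made-up name, and it assumes that every record starts with a WARC/ version line and that the header fields are spelled WARC-Type and WARC-Target-URI. A line in a downloaded page that happens to start with WARC/ followed by a digit would confuse it.

```shell
# Hypothetical helper: print "type<TAB>target URI" for each record in an
# uncompressed WARC file. Assumes each record starts with a "WARC/<version>"
# line followed by named header fields and a blank line.
warc_list_records() {
    LC_ALL=C awk '
        { sub(/\r$/, "") }                                  # drop CR of CR LF
        /^WARC\/[0-9]/ && !in_header { in_header = 1; type = ""; uri = "-"; next }
        in_header && /^WARC-Type:/       { type = $2 }
        in_header && /^WARC-Target-URI:/ { uri = $2 }
        in_header && $0 == ""            { print type "\t" uri; in_header = 0 }
    ' "$1"
}
```

Running warc_list_records at.warc should print one line per record. Records without a target URI are shown with a - in the second column.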

If you want to have a compressed WARC file, simply add .gz to the file name:

src/wget "http://www.archiveteam.org/" --warc-file="at.warc.gz"

Saving one file is nice, but the --warc-file option becomes even more powerful if you combine it with Wget's --mirror option. (You may want to try this with a smaller site than the AT wiki.)

src/wget "http://www.archiveteam.org/" --mirror --warc-file="at.warc.gz"

If you uncompress at.warc.gz and look at it, you'll see that it contains WARC records for every request and response: it is a complete copy of the mirrored site, while at the same time Wget also created the normal mirror of the site.
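
For a quick overview of a large compressed WARC file, you can count the records of each type without unpacking it to disk. This is a rough sketch: warc_count_types is a made-up name, grep -a is assumed to be available (it is in GNU and BSD grep), and a downloaded page containing a line that starts with WARC-Type: would be counted as well.

```shell
# Hypothetical helper: count the records of each WARC-Type in a
# gzip-compressed WARC file. Matches header lines by prefix only,
# so the counts are approximate if payloads contain similar lines.
warc_count_types() {
    gzip -cd -- "$1" |
        LC_ALL=C grep -a '^WARC-Type:' |   # -a: treat binary data as text
        tr -d '\r' |                       # header lines end in CR LF
        sort | uniq -c
}
```

After a mirror run, warc_count_types at.warc.gz should show roughly as many request records as response records.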

WARC file format

The WARC file format is an ISO standard. The official specification, ISO 28500:2009, is not available for free. However, the final draft is freely available and is supposed to be technically equivalent to the official standard.

The WARC usage task force has published WARC implementation guidelines with additional recommendations.