Wget

From Archiveteam
Revision as of 12:42, 7 January 2009 by Jscott (talk | contribs)

GNU Wget is a free utility for non-interactive download of files from the Web. With Wget, it is possible to grab a large chunk of data, or mirror an entire website with its complete directory tree, using a single command. In the tool belt of the renegade archivist, Wget tends to get an awful lot of use.

This guide will not attempt to explain all possible uses of Wget; rather, it is intended as a concise intro to using Wget, specifically geared towards archiving data such as podcasts, PDFs, or entire websites. Issues such as using Wget to get around user-agent checks or robots.txt restrictions will be outlined as well.

Tricks and Traps

A common way to prevent scraping of a website is to block access based on the user agent string. Wget is a good web citizen and identifies itself. Renegade archivists are not good web citizens in this sense. The --user-agent option lets Wget identify itself as something else.
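
As a rough sketch of how --user-agent might be used, the snippet below wraps wget in a small shell function so that every request carries a browser-like identity. The user agent string, the URL, and the function name wget_as_browser are illustrative assumptions, not values taken from this page; substitute whatever the target site is known to accept.

```shell
#!/bin/sh
# Assumption: this user agent string is only an example of a browser-like
# identity; any string the target site accepts would do.
UA='Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:115.0) Gecko/20100101 Firefox/115.0'

# Assumption: placeholder URL, not a real archiving target.
URL='http://www.example.com/'

# Hypothetical helper: run wget with the browser-like user agent.
# "$@" passes any further options and URLs straight through to wget.
wget_as_browser() {
    wget --user-agent="$UA" "$@"
}

# Usage (needs network access, so it is left commented out here):
# wget_as_browser "$URL"
```

Because the helper passes its arguments through unchanged, it can be combined with other Wget options on the same command line.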

Essays and Reading on the Use of Wget