[http://www.gnu.org/software/wget/ GNU Wget] is a free utility for non-interactive download of files from the Web. Using wget, it is possible to grab a large chunk of data, or mirror an entire website with its complete directory tree, using a single command. In the tool belt of the renegade archivist, Wget tends to get an awful lot of use.
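As a rough sketch of that single-command mirror (the URL below is only a placeholder), an invocation along these lines asks wget to recurse through a site, fetch the images and stylesheets each page needs, rewrite links for local browsing, and stay out of parent directories:

 wget --mirror --convert-links --page-requisites --no-parent http://example.com/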


This guide will not attempt to explain all possible uses of wget; rather, it is intended as a concise introduction to wget, specifically geared toward using the tool to archive data such as podcasts, PDFs, or entire websites. Issues such as using wget to circumvent user-agent checks or robots.txt restrictions will be outlined as well.
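As a rough example of the podcast and PDF use case (the URL and depth here are placeholders), recursive retrieval can be limited to particular file types with the '''-A''' (accept) option; the same filter works for mp3 files on a podcast's download page:

 wget -r -l 1 -A pdf http://example.com/papers/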


== Tricks and Traps ==
 
A standard method of preventing websites from being scraped is to block access based on the user-agent string. Wget is a good web citizen and identifies itself honestly. Renegade archivists are not good web citizens in this sense: the '''--user-agent''' option lets you identify as something else.
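For instance (the user-agent string and URL here are just placeholders), the following makes wget introduce itself as a garden-variety browser; the companion setting '''-e robots=off''' tells it to ignore robots.txt as well:

 wget --user-agent="Mozilla/5.0 (Windows; U; Windows NT 5.1)" http://example.com/page.html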
 
== Essays and Reading on the Use of Wget ==
 
* [http://lifehacker.com/software/top/geek-to-live--mastering-wget-161202.php Mastering Wget] by Gina Trapani
* [http://psung.blogspot.com/2008/06/using-wget-or-curl-to-download-web.html Using wget or curl to download web sites for archival] by Phil Sung
