Difference between revisions of "Wget"

From Archiveteam
[http://www.gnu.org/software/wget/ GNU Wget] is a free utility for non-interactive download of files from the Web. Using wget, it is possible to grab a large chunk of data, or mirror an entire website with complete directory tree, with a single command. In the tool belt of the renegade archivist, Wget tends to get an awful lot of use. (Note: Some people prefer to use [http://curl.haxx.se/ cURL])
  
This guide will not attempt to explain all possible uses of wget; rather, it is intended as a concise introduction to using wget, specifically geared toward archiving data such as podcasts, PDFs, or entire websites. Issues such as using wget to circumvent user-agent checks or robots.txt restrictions are outlined as well.
  
== Mirroring a website ==

When you run something like this:

<pre>
wget http://icanhascheezburger.com/
</pre>
...wget will just grab the first page it hits, usually something like index.html. If you give it the -m flag:
<pre>
wget -m http://icanhascheezburger.com/
</pre>

...then wget will happily slurp down anything within reach of its greedy claws, putting files in a complete directory structure. Go make a sandwich or something.

You'll probably want to pair -m with -c (which tells wget to resume partially complete downloads) and -b (which tells wget to fork to the background, logging to wget-log).

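Putting those three flags together looks like this (a sketch, using the example host from above; substitute any URL):

```shell
# -m: mirror recursively; -c: resume partially complete files;
# -b: fork to the background, logging to wget-log in the current directory.
wget -m -c -b http://icanhascheezburger.com/

# The grab now runs in the background; follow its progress with:
tail -f wget-log
```

If you background wget more than once in the same directory, subsequent runs log to wget-log.1, wget-log.2, and so on.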
If you want to grab everything in a specific directory (say, the SICP directory on the MIT Press website), use the -np (no-parent) flag:
<pre>
wget -mbc -np http://mitpress.mit.edu/sicp
</pre>
 
This tells wget not to go up the directory tree; it will only descend.

== User-agents and robots.txt ==

By default, wget plays nicely with a website's robots.txt. This can lead to situations where wget won't grab anything at all, because the site's robots.txt disallows wget.

To get around this, first try the --user-agent option:
<pre>
wget -mbc --user-agent="" http://website.com/
</pre>

This instructs wget not to send a user-agent string at all. Another option is:
<pre>
wget -mbc -e robots=off http://website.com/
</pre>

...which tells wget to ignore robots.txt directives altogether.

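In stubborn cases the two options can be combined in a single invocation (same placeholder host as the examples above):

```shell
# Run the .wgetrc-style command "robots = off" (-e) and send an
# empty user-agent string, while mirroring in the background as before.
wget -mbc -e robots=off --user-agent="" http://website.com/
```

Since -e executes a .wgetrc-style command before the transfer starts, you can make the setting permanent by putting the line <code>robots = off</code> into ~/.wgetrc instead.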
== Tricks and Traps ==

* A standard method to prevent scraping of websites is to block access based on the user-agent string. Wget is a good web citizen and identifies itself; renegade archivists are not good web citizens in this sense. The '''--user-agent''' option allows you to identify as something else.
* Some websites are actually aggregates of multiple machines and subdomains working together. (For example, a site called ''dyingwebsite.com'' may have additional machines like ''download.dyingwebsite.com'' or ''mp3.dyingwebsite.com''.) To account for this, add the following options: '''-H -Ddyingwebsite.com'''

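Combining both tricks, a grab of the hypothetical dyingwebsite.com and its subdomains might look like this (the user-agent string is only an illustration; any browser string will do):

```shell
# -H: allow spanning to other hosts; -D: but only hosts under dyingwebsite.com.
# The --user-agent value mimics a desktop Firefox instead of identifying as wget.
wget -mbc -H -Ddyingwebsite.com \
     --user-agent="Mozilla/5.0 (X11; Linux x86_64; rv:5.0) Gecko/20100101 Firefox/5.0" \
     http://dyingwebsite.com/
```

Without -D to fence it in, -H would let the mirror follow links out to every external site the pages reference.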
== Essays and Reading on the Use of WGET ==

* [http://lifehacker.com/software/top/geek-to-live--mastering-wget-161202.php Mastering WGET] by Gina Trapani
* [http://psung.blogspot.com/2008/06/using-wget-or-curl-to-download-web.html Using wget or curl to download web sites for archival] by Phil Sung
* [http://linux.about.com/od/commands/l/blcmdl1_wget.htm about.com Wget] list of commands

Revision as of 18:59, 27 October 2009
