[http://www.gnu.org/software/wget/ GNU Wget] is a free utility for non-interactive download of files from the Web. Using Wget, it is possible to grab a large chunk of data, or mirror an entire website with its complete directory tree, using a single command. In the tool belt of the renegade archivist, Wget tends to get an awful lot of use. (Note: Some people prefer to use [http://curl.haxx.se/ cURL]. If it can back up data, it's useful.)

This guide will not attempt to explain all possible uses of Wget; rather, it is intended as a concise introduction to Wget, specifically geared towards using the tool to archive data such as podcasts, PDF documents, or entire websites. Issues such as using Wget to circumvent user-agent checks or robots.txt restrictions will be outlined as well.

== Mirroring a website ==
When you run something like this:
<pre>
wget http://icanhascheezburger.com/
</pre>
...Wget will just grab the first page it hits, usually something like index.html. If you give it the -m flag:
<pre>
wget -m http://icanhascheezburger.com/
</pre>
...then Wget will happily slurp down anything within reach of its greedy claws, putting files in a complete directory structure. Go make a sandwich or something.


You'll probably want to pair -m with -c (which tells Wget to continue partially complete downloads) and -b (which tells Wget to fork to the background, logging to wget-log).

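Combined, a typical invocation looks like this (reusing the example site from above; since -b detaches, progress goes to wget-log):
<pre>
wget -mbc http://icanhascheezburger.com/
</pre>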

If you want to grab everything in a specific directory (say, the SICP directory on the MIT Press website), use the -np ("no parent") flag:
<pre>
wget -mbc -np http://mitpress.mit.edu/sicp
</pre>


This tells Wget not to go up the directory tree, only downwards.


== User-agents and robots.txt ==


By default, Wget plays nicely with a website's robots.txt. This can lead to situations where Wget won't grab anything, since the robots.txt disallows Wget.


To avoid this, first try the --user-agent option:
<pre>
wget -mbc --user-agent="" http://website.com/
</pre>
This instructs Wget not to send any user-agent string at all. Another option is:
<pre>
wget -mbc -e robots=off http://website.com/
</pre>
...which tells Wget to ignore robots.txt directives altogether.

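If an empty user agent is also blocked, masquerading as a browser sometimes works; the exact string below is only an illustration, not a recommendation of any particular browser:
<pre>
wget -mbc --user-agent="Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.0" http://website.com/
</pre>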

You can also add --wait=1 to pause one second between requests, to be nice to the server.
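For instance (same placeholder site as above):
<pre>
wget -mbc --wait=1 http://website.com/
</pre>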
== Tricks and Traps ==
* A standard methodology to prevent scraping of websites is to block access via user agent string. Wget is a good web citizen and identifies itself. Renegade archivists are not good web citizens in this sense. The '''--user-agent''' option will allow you to act like something else.
* Some websites are actually aggregates of multiple machines and subdomains, working together. (For example, a site called ''dyingwebsite.com'' may have additional machines like ''download.dyingwebsite.com'' or ''mp3.dyingwebsite.com''.) To account for this, add the options '''-H -Ddomain.com''', as in the sketch after this list.
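A short sketch of the second trick, using the article's placeholder domain: -H allows Wget to span hosts, while -D restricts it to hosts under the listed domain, so it will follow links to the site's subdomains but not wander off across the web.
<pre>
wget -mbc -H -Ddyingwebsite.com http://dyingwebsite.com/
</pre>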
== Wget for Windows ==
Windows users can download [http://gnuwin32.sourceforge.net/packages/wget.htm Wget for Windows], part of the [http://gnuwin32.sourceforge.net/ GnuWin32 project].  After installation, you will probably want to add it to your Path so that you can run it directly from the command prompt instead of specifying its absolute file path (i.e. "wget" instead of "C:\Program Files\GnuWin32\bin\wget.exe").
These are the instructions for Windows 7 users.  Prior versions should be relatively similar.
#Install Wget
#Right-click My Computer and select Properties
#Select Advanced System Settings from the left
#Click the Environment Variables button in the bottom-right corner
#Under System Variables, find the Path variable and click Edit
#Carefully insert the path to Wget's bin folder followed by a semicolon.  Getting this wrong could cause some nasty system problems.
#*Your Wget path should be inserted like this: C:\Program Files\GnuWin32\bin;
#When done, click OK through all the dialog boxes you opened
#The changes should apply immediately under Windows 7.  Older versions may require a reboot.
#To test the settings, open a command prompt and enter "wget" (see the quick check below)
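As a quick check, the following should print Wget's version banner rather than a "not recognized" error if the Path edit worked (open a new command prompt first; existing ones keep the old Path):
<pre>
wget --version
</pre>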


== Parallel downloading ==
http://keramida.wordpress.com/2010/01/19/parallel-downloads-with-python-and-gnu-wget/
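
The post linked above drives Wget from Python; a rough shell-only sketch of the same idea, assuming a hypothetical urls.txt with one URL per line:
<pre>
# run up to four Wget processes at once, one URL each, resuming partial files
xargs -P 4 -n 1 wget -c < urls.txt
</pre>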
== Essays and Reading on the Use of WGET ==
* [http://lifehacker.com/software/top/geek-to-live--mastering-wget-161202.php Mastering WGET] by Gina Trapani
* [http://psung.blogspot.com/2008/06/using-wget-or-curl-to-download-web.html Using Wget or curl to download web sites for archival] by Phil Sung
* [http://linux.about.com/od/commands/l/blcmdl1_wget.htm about.com Wget] list of commands


[[Category:Tools]]