HTTrack options


Good options to use with httrack when mirroring a large-ish site.

Quick copy and paste

httrack --connection-per-second=50 --sockets=80 --keep-alive --display --verbose --advanced-progressinfo --disable-security-limits -n -i -s0 -m -F 'Mozilla/5.0 (X11; U; Linux i686; en-GB; rv:1.9.1) Gecko/20090624 Ubuntu/9.04 (jaunty) Firefox/3.5' -A100000000 -#L500000000 'YOURURL'
  • ignores robots.txt
  • allows for a queue of 500M unfetched URLs
  • custom useragent
  • pretty fast (uses several connections at once)
  • will re-write links so they work offline
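
If you mirror sites often, the one-liner is easier to reuse as a small shell script. The sketch below assumes a POSIX shell; the script name, the default output directory, and the -O flag are additions not taken from the command above (-O is HTTrack's standard option for setting the mirror/log directory):

#!/bin/sh
# Minimal wrapper around the command above. URL is the first argument;
# the output directory defaults to ~/mirrors (a placeholder) unless a
# second argument is given.
URL="$1"
OUT="${2:-$HOME/mirrors}"
httrack --connection-per-second=50 --sockets=80 --keep-alive \
  --display --verbose --advanced-progressinfo --disable-security-limits \
  -n -i -s0 -m \
  -F 'Mozilla/5.0 (X11; U; Linux i686; en-GB; rv:1.9.1) Gecko/20090624 Ubuntu/9.04 (jaunty) Firefox/3.5' \
  -A100000000 -#L500000000 \
  -O "$OUT" "$URL"

Invoked as, for example, sh mirror.sh 'http://example.com/' /data/mirrors (both arguments here are placeholders).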

A rundown of the previous options

  • --connection-per-second=50: This allows for up to 50 connections per second.
  • --sockets=80: Opens up to 80 sockets. If this gives you errors, lower this to 48.
  • --disable-security-limits, -A100000000: By default, HTTrack tries to play nicely with web servers and avoids overloading them by capping the download rate at 25 KB/s. On text-based sites this is normally fine, but it becomes a hassle when the site is image-heavy. The first option disables the forced limit and the second raises the cap; -A takes bytes per second, so 100000000 is roughly 100 MB/s. A gentler variant is sketched after this list.
  • -s0: Tells HTTrack to ignore robots.txt and robots meta tags.
  • -F: Sets the user agent.
  • -#L500000000: Raises the maximum number of links HTTrack will queue to 500M. Raise if needed.
  • -n: Gets all "near" files (files referenced on a page, such as images), rather than only those on the starting domain, which is HTTrack's default behavior.
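
The settings above are deliberately aggressive. As a rough sketch (the lowered numbers are illustrative, not recommendations from this page), a gentler run against a fragile server keeps HTTrack's default bandwidth cap and robots.txt handling and opens far fewer connections:

# Illustrative gentler variant: omits --disable-security-limits, -A,
# and -s0, so HTTrack's default rate limit and robots.txt handling
# stay in effect, and uses far fewer simultaneous connections.
httrack --connection-per-second=4 --sockets=8 --keep-alive \
  --display --verbose -n -i -m 'YOURURL'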

Other options


NOTE: httrack appears to be limited to 2GB of RAM (it is written in C, not Java; the limit is presumably the 32-bit address space). A 64-bit build may allow for a larger crawl queue.
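
To see which kind of binary you have installed, the following should work on most Linux systems (the exact output format varies by platform):

# Report the architecture of the installed httrack binary;
# look for "ELF 32-bit" or "ELF 64-bit" in the output.
file "$(command -v httrack)"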