Wget with Lua hooks

  • New idea: add Lua scripting to wget.

Example usage:

wget http://www.archiveteam.org/ -r --lua-script=lua-example/print_parameters.lua

Installation

apt install build-essential git autoconf automake autopoint texinfo flex gperf autogen shtool liblua5.1-0-dev gnutls-dev
git clone https://github.com/ArchiveTeam/wget-lua
cd wget-lua
./bootstrap
./configure
make
mkdir -p ~/bin/ && cp ./src/wget ~/bin/wget-lua
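If the build succeeded, the new binary can be smoke-tested right away. Running it with the example script that ships in the repository (the same lua-example/print_parameters.lua used above) confirms that the Lua hooks work; the paths below assume you are still inside the wget-lua checkout.

~/bin/wget-lua --version
~/bin/wget-lua http://www.archiveteam.org/ -r --lua-script=lua-example/print_parameters.lua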

Why would this be useful?

Custom error handling

What should wget do when it hits an error? Sometimes you want it to retry the URL if it gets a server error.

wget.callbacks.httploop_result = function(url, err, http_stat)
  if http_stat.statcode == 500 then
    -- try again
    return wget.actions.CONTINUE
  elseif http_stat.statcode == 404 then
    -- stop
    return wget.actions.EXIT
  else
    -- let wget decide
    return wget.actions.NOTHING
  end
end
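As with the example at the top of the page, the callback lives in an ordinary Lua file that is passed to wget with --lua-script; the file name below is only a placeholder.

~/bin/wget-lua http://www.archiveteam.org/ -r --lua-script=retry_on_500.lua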

Custom decide rules

Should wget download this URL or not?

wget.callbacks.download_child_p = function(urlpos, parent, depth, start_url_parsed, iri, verdict)
  if string.find(urlpos.url, "textfiles.com") then
    -- always download
    return true
  elseif string.find(urlpos.url, "archive.org") then
    -- never!
    return false
  else
    -- follow wget's advice
    return verdict
  end
end
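The other callback arguments can feed into the same decision. Below is a minimal sketch that also caps recursion depth, assuming (as real grab scripts do) that depth is the numeric recursion depth wget has reached for this link.

wget.callbacks.download_child_p = function(urlpos, parent, depth, start_url_parsed, iri, verdict)
  -- assumption: depth is the current numeric recursion depth
  if depth > 3 then
    -- cut the crawl off past three levels, whatever wget would have decided
    return false
  else
    -- follow wget's advice
    return verdict
  end
end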

Custom URL extraction/generation

Sometimes it is useful to write your own URL extraction code, for example to add URLs that don't actually appear on the page.

wget.callbacks.get_urls = function(file, url, is_css, iri)
  if string.find(url, ".com/profile/[^/]+/$") then
    -- make sure wget downloads the user's photo page
    -- and custom profile photo
    return {
      { url=url.."photo.html",
        link_expect_html=1,
        link_expect_css=0 },
      { url=url.."photo.jpg",
        link_expect_html=0,
        link_expect_css=0 }
    }
  else
    -- no new urls to add
    return {}
  end
end
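These callbacks are not mutually exclusive: a single script file can define any combination of them and is passed to wget once with --lua-script, exactly as in the example at the top of the page. A minimal sketch, with a placeholder file name:

-- site-grab.lua (placeholder name): several callbacks in one script
wget.callbacks.httploop_result = function(url, err, http_stat)
  return wget.actions.NOTHING
end

wget.callbacks.get_urls = function(file, url, is_css, iri)
  return {}
end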

More examples

Archive Team has real-world scripts on the Archive Team GitHub organization; look for recent -grab projects. The Lua scripts range from simple checks to complex URL scraping.