Wget with Lua hooks

  • New idea: add Lua scripting to wget.

Example usage:

wget http://www.archiveteam.org/ -r --lua-script=lua-example/print_parameters.lua
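print_parameters.lua ships with the repository; the idea is simply to dump whatever wget passes to a hook and then stay out of the way. Below is a rough sketch in that spirit, using the httploop_result callback documented further down this page rather than the bundled script's exact contents:

-- Not the bundled script; a minimal stand-in that prints its parameters.
wget.callbacks.httploop_result = function(url, err, http_stat)
  print("httploop_result:", url, err, http_stat.statcode)
  -- return NOTHING so wget behaves exactly as it would without the hook
  return wget.actions.NOTHING
end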

Installation

apt install build-essential git autoconf automake autopoint texinfo flex gperf autogen shtool liblua5.1-0-dev gnutls-dev
git clone https://github.com/ArchiveTeam/wget-lua
cd wget-lua
./bootstrap
./configure
make
mkdir -p ~/bin/ && cp ./src/wget ~/bin/wget-lua

Why would this be useful?

Custom error handling

What should happen when a request fails? Sometimes you want wget to retry the URL when the server returns an error.

wget.callbacks.httploop_result = function(url, err, http_stat)
  if http_stat.statcode == 500 then
    -- try again
    return wget.actions.CONTINUE
  elseif http_stat.statcode == 404 then
    -- stop
    return wget.actions.EXIT
  else
    -- let wget decide
    return wget.actions.NOTHING
  end
end
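In real grab scripts this hook is also a natural place to slow down on temporary failures before retrying. A minimal sketch, assuming a POSIX sleep command is available; the status codes and the five-second pause are arbitrary choices, not part of the wget-lua API:

wget.callbacks.httploop_result = function(url, err, http_stat)
  local code = http_stat.statcode
  if code == 500 or code == 503 then
    -- back off a little before asking wget to retry
    os.execute("sleep 5")
    return wget.actions.CONTINUE
  end
  -- anything else: let wget decide
  return wget.actions.NOTHING
end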

Custom decide rules

Should wget download this URL or not?

wget.callbacks.download_child_p = function(urlpos, parent, depth, start_url_parsed, iri, verdict)
  if string.find(urlpos.url, "textfiles.com") then
    -- always download
    return true
  elseif string.find(urlpos.url, "archive.org") then
    -- never!
    return false
  else
    -- follow wget's advice
    return verdict
  end
end
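The same hook can also enforce a hard recursion cut-off on top of pattern rules like the ones above. A minimal sketch; the depth limit of 5 is an arbitrary example, not a wget-lua default:

wget.callbacks.download_child_p = function(urlpos, parent, depth, start_url_parsed, iri, verdict)
  -- refuse anything nested deeper than five levels
  if depth > 5 then
    return false
  end
  -- otherwise follow wget's advice
  return verdict
end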

Custom url extraction/generation

Sometimes it's useful to write your own URL extraction code, for example to add URLs that aren't actually linked on the page.

wget.callbacks.get_urls = function(file, url, is_css, iri)
  if string.find(url, "%.com/profile/[^/]+/$") then
    -- make sure wget downloads the user's photo page
    -- and custom profile photo
    return {
      { url=url.."photo.html",
        link_expect_html=1,
        link_expect_css=0 },
      { url=url.."photo.jpg",
        link_expect_html=0,
        link_expect_css=0 }
    }
  else
    -- no new urls to add
    return {}
  end
end
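The file argument is the local copy of the document that was just downloaded, so the hook can also scan the body itself instead of relying on wget's own link extraction. A rough sketch of that idea, where the "photo_url" JSON field and its format are made up for illustration:

wget.callbacks.get_urls = function(file, url, is_css, iri)
  local urls = {}
  local f = io.open(file, "rb")
  if f then
    local body = f:read("*all")
    f:close()
    -- hypothetical: queue every value of a "photo_url" field in the response
    for found in string.gmatch(body, '"photo_url"%s*:%s*"(https?://[^"]+)"') do
      table.insert(urls, { url=found })
    end
  end
  return urls
end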

More Examples

Archive Team has real-life scripts in the Archive Team GitHub organization. Look for recent -grab projects. The Lua scripts range from simple checks to complex URL scraping.