Wget with Lua hooks

* New idea: add Lua scripting to wget.


* Get the source from: https://github.com/ArchiveTeam/wget-lua
** The Lua scripting is patched on the "lua" branch. You can use the [https://github.com/alard/wget-lua/compare/lua#files_bucket compare branch feature] on GitHub to see the differences.
* Old repo is located at https://github.com/alard/wget-lua/tree/lua
 
<!-- If you get errors about 'lua_open' while compiling, try applying [http://paste.archivingyoursh.it/raw/manavagose this] patch. -->
* If you get errors about 'wget.pod' while compiling, try applying [http://paste.archivingyoursh.it/raw/dekasuroda this] patch.
 
* Documentation (possibly outdated): https://github.com/alard/wget-lua/wiki/Wget-with-Lua-hooks


Example usage:
<pre>
wget http://www.archiveteam.org/ -r --lua-script=lua-example/print_parameters.lua
</pre>
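The command above loads <code>lua-example/print_parameters.lua</code> from the repository. A hook script is just a Lua file that assigns functions to the <code>wget.callbacks</code> table. A minimal sketch, using only the callback names and return values shown in the sections below (the file name is arbitrary):

<pre>
-- save as e.g. print_status.lua and pass it with --lua-script=print_status.lua
wget.callbacks.httploop_result = function(url, err, http_stat)
  -- log the HTTP status code of each request, then let wget decide what to do
  io.stdout:write("got status " .. tostring(http_stat.statcode) .. "\n")
  return wget.actions.NOTHING
end
</pre>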


= Installation =
 
<pre>
apt install build-essential git autoconf automake autopoint texinfo flex gperf autogen shtool liblua5.1-0-dev gnutls-dev
git clone https://github.com/ArchiveTeam/wget-lua
cd wget-lua
./bootstrap
./configure
make
mkdir -p ~/bin/ && cp ./src/wget ~/bin/wget-lua
</pre>
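To confirm that the build actually picked up the Lua patch, it can help to check that the freshly built binary runs and mentions the Lua option in its help output (a quick sanity check, not part of the original instructions; if nothing about Lua shows up, <code>configure</code> probably did not find liblua):

<pre>
~/bin/wget-lua --version | head -n 1
~/bin/wget-lua --help | grep -i lua
</pre>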
 
= Why would this be useful? =
 
== Custom error handling ==
What to do in case of an error? Sometimes you want wget to retry the url if it gets a server error, or to give up entirely on a 404.
 
<pre>
wget.callbacks.httploop_result = function(url, err, http_stat)
  if http_stat.statcode == 500 then
    -- try again
    return wget.actions.CONTINUE
  elseif http_stat.statcode == 404 then
    -- stop
    return wget.actions.EXIT
  else
    -- let wget decide
    return wget.actions.NOTHING
  end
end
</pre>
 
== Custom decide rules ==
Download this url or not?
 
<pre>
wget.callbacks.download_child_p = function(urlpos, parent, depth, start_url_parsed, iri, verdict)
  if string.find(urlpos.url, "textfiles.com") then
    -- always download
    return true
  elseif string.find(urlpos.url, "archive.org") then
    -- never!
    return false
  else
    -- follow wget's advice
    return verdict
  end
end
</pre>
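Note that <code>string.find</code> treats its second argument as a Lua pattern, so the dot in <code>"textfiles.com"</code> matches any character. If a literal substring match is wanted, Lua's plain-text mode can be used instead; a small sketch of the same callback with that change:

<pre>
wget.callbacks.download_child_p = function(urlpos, parent, depth, start_url_parsed, iri, verdict)
  -- passing 1 (start position) and true (plain) makes string.find do a
  -- literal substring search instead of Lua pattern matching
  if string.find(urlpos.url, "textfiles.com", 1, true) then
    return true
  end
  return verdict
end
</pre>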
 
== Custom url extraction/generation ==
Sometimes it's useful to write your own url extraction code, for example to add urls that aren't actually linked on the page.
 
<pre>
wget.callbacks.get_urls = function(file, url, is_css, iri)
  if string.find(url, ".com/profile/[^/]+/$") then
    -- make sure wget downloads the user's photo page
    -- and custom profile photo
    return {
      { url=url.."photo.html",
        link_expect_html=1,
        link_expect_css=0 },
      { url=url.."photo.jpg",
        link_expect_html=0,
        link_expect_css=0 }
    }
  else
    -- no new urls to add
    return {}
  end
end
</pre>
 
== More Examples ==


Archive Team has real-life scripts in the [https://github.com/archiveteam Archive Team GitHub organization]. Look for recent <code>-grab</code> projects. The Lua scripts range from simple checks to complex URL scraping.


* [https://github.com/ArchiveTeam/zapd-grab/blob/master/zapd.lua zapd-grab/zapd.lua]: Avoids a JavaScript monstrosity by scraping anything that looks like a URL on the CDN.
* [https://github.com/ArchiveTeam/puush-grab/blob/master/puush.lua puush-grab/puush.lua]: Checks the status code and the contents and returns custom error codes.
* [https://github.com/ArchiveTeam/posterous-grab/blob/master/posterous.lua posterous-grab/posterous.lua]: Checks the status code and delays if needed.
* [https://github.com/ArchiveTeam/xanga-grab/blob/master/xanga.lua xanga-grab/xanga.lua]: Implements its own URL scraping.
* [https://github.com/ArchiveTeam/patch-grab/blob/master/patch.lua patch-grab/patch.lua]: Scrapes URLs as it goes along and sends them off to a server to be processed later.
* [https://github.com/ArchiveTeam/formspring-grab/blob/master/formspring.lua formspring-grab/formspring.lua]: Manually reproduces the site's JavaScript behavior and builds its own request URLs.
* [https://github.com/ArchiveTeam/hyves-grab/blob/master/hyves.lua hyves-grab/hyves.lua]: Works around JavaScript-driven pagination. Includes calling an external process to decrypt ciphertext.
* [https://github.com/ArchiveTeam/ArchiveBot/blob/master/pipeline/archivebot.lua ArchiveBot/pipeline/archivebot.lua]: Logs results in Redis and implements custom URL checking.


[[Category:Tools]]


{{Navigation box}}
