Wget with Lua hooks
* New idea: add Lua scripting to wget.
* Get the source from: https://github.com/ArchiveTeam/wget-lua
* Old repo is located at https://github.com/alard/wget-lua/tree/lua
* If you get errors about 'wget.pod' while compiling, try applying [http://paste.archivingyoursh.it/raw/dekasuroda this] patch.
* Documentation (possibly outdated): https://github.com/alard/wget-lua/wiki/Wget-with-Lua-hooks
Example usage:
<pre>
wget http://www.archiveteam.org/ -r --lua-script=lua-example/print_parameters.lua
</pre>
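The <code>lua-example/print_parameters.lua</code> script referenced above is not reproduced here. As a rough sketch of what such a script could look like (an assumption, using only the <code>httploop_result</code> hook documented further down this page), it might simply print whatever wget passes to the hook:

<pre>
-- Hypothetical sketch, not the bundled example script: print the hook's
-- arguments, then return NOTHING so wget behaves exactly as it normally would.
wget.callbacks.httploop_result = function(url, err, http_stat)
  print("url:      " .. tostring(url))
  print("err:      " .. tostring(err))
  print("statcode: " .. tostring(http_stat.statcode))
  return wget.actions.NOTHING
end
</pre>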
= Installation =
<pre>
apt install build-essential git autoconf automake autopoint texinfo flex gperf autogen shtool liblua5.1-0-dev gnutls-dev
git clone https://github.com/ArchiveTeam/wget-lua
cd wget-lua
./bootstrap
./configure
make
mkdir -p ~/bin/ && cp ./src/wget ~/bin/wget-lua
</pre>
= Why would this be useful? =

== Custom error handling ==
What should happen when a request fails? Sometimes you want wget to retry the URL when it gets a server error.
<pre>
wget.callbacks.httploop_result = function(url, err, http_stat)
  if http_stat.statcode == 500 then
    -- try again
    return wget.actions.CONTINUE
  elseif http_stat.statcode == 404 then
    -- stop
    return wget.actions.EXIT
  else
    -- let wget decide
    return wget.actions.NOTHING
  end
end
</pre>
== Custom decide rules ==
Should this URL be downloaded or not?
<pre>
wget.callbacks.download_child_p = function(urlpos, parent, depth, start_url_parsed, iri, verdict)
  if string.find(urlpos.url, "textfiles.com") then
    -- always download
    return true
  elseif string.find(urlpos.url, "archive.org") then
    -- never!
    return false
  else
    -- follow wget's advice
    return verdict
  end
end
</pre>
== Custom url extraction/generation ==
Sometimes it's useful to write your own URL extraction code, for example to add URLs that aren't actually linked on the page.
<pre>
wget.callbacks.get_urls = function(file, url, is_css, iri)
  if string.find(url, ".com/profile/[^/]+/$") then
    -- make sure wget downloads the user's photo page
    -- and custom profile photo
    return {
      { url=url.."photo.html",
        link_expect_html=1,
        link_expect_css=0 },
      { url=url.."photo.jpg",
        link_expect_html=0,
        link_expect_css=0 }
    }
  else
    -- no new urls to add
    return {}
  end
end
</pre>
== More Examples ==
Archive Team has real-life scripts on the [https://github.com/archiveteam Archive Team GitHub organization]. Look for recent <code>-grab</code> projects. The Lua scripts range from simple checks to complex URL scraping; a simplified sketch of one recurring pattern follows the list.
* [https://github.com/ArchiveTeam/zapd-grab/blob/master/zapd.lua zapd-grab/zapd.lua]: Avoids a JavaScript monstrosity by scraping anything that looks like a CDN URL.
* [https://github.com/ArchiveTeam/puush-grab/blob/master/puush.lua puush-grab/puush.lua]: Checks the status code and the contents and returns custom error codes.
* [https://github.com/ArchiveTeam/posterous-grab/blob/master/posterous.lua posterous-grab/posterous.lua]: Checks the status code and delays if needed.
* [https://github.com/ArchiveTeam/xanga-grab/blob/master/xanga.lua xanga-grab/xanga.lua]: Implements its own URL scraping.
* [https://github.com/ArchiveTeam/patch-grab/blob/master/patch.lua patch-grab/patch.lua]: Scrapes URLs as it goes along and sends them off to a server to be fetched later.
* [https://github.com/ArchiveTeam/formspring-grab/blob/master/formspring.lua formspring-grab/formspring.lua]: Manually mimics the site's JavaScript and builds its own request URLs.
* [https://github.com/ArchiveTeam/hyves-grab/blob/master/hyves.lua hyves-grab/hyves.lua]: Works around JavaScript calls to pagination. Includes calling an external process to decrypt ciphertext.
* [https://github.com/ArchiveTeam/ArchiveBot/blob/master/pipeline/archivebot.lua ArchiveBot/pipeline/archivebot.lua]: Logs results in Redis and implements custom URL checking.
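To illustrate the "check the status code and delay" pattern mentioned above, here is a simplified sketch. It is an assumption for illustration only, not code taken from any of the projects listed; the status codes, delay length, and use of <code>os.execute</code> are illustrative choices.

<pre>
-- Simplified sketch of a delay-and-retry hook. The retry conditions and
-- os.execute("sleep ...") are illustrative, not taken from any grab script.
wget.callbacks.httploop_result = function(url, err, http_stat)
  local code = http_stat.statcode
  if code == 429 or code >= 500 then
    -- the server is struggling: wait a bit, then ask wget to retry this URL
    os.execute("sleep 10")
    return wget.actions.CONTINUE
  end
  -- anything else: let wget decide what to do
  return wget.actions.NOTHING
end
</pre>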
[[Category:Tools]]
{{Navigation box}}