Wget with Lua hooks
- New idea: add Lua scripting to wget.
- Get the source from: https://github.com/ArchiveTeam/wget-lua
- Old repo is located at https://github.com/alard/wget-lua/tree/lua
- If you get errors about 'wget.pod' while compiling, try applying this patch.
- Documentation: https://github.com/ArchiveTeam/wget-lua/wiki#wget-with-lua-hooks
Example usage:
wget http://www.archiveteam.org/ -r --lua-script=lua-example/print_parameters.lua
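The print_parameters.lua example lives in the repository's lua-example directory. A hook script can be as small as the following sketch, which only uses the httploop_result callback and the http_stat.statcode field shown further down this page (the filename minimal-status.lua is just an example):

-- minimal-status.lua: log the status code of every HTTP response
wget.callbacks.httploop_result = function(url, err, http_stat)
  print("status: " .. http_stat.statcode)
  -- NOTHING leaves the decision entirely to wget
  return wget.actions.NOTHING
end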
Installation
apt install build-essential git autoconf automake autopoint texinfo flex gperf autogen shtool liblua5.1-0-dev gnutls-dev
git clone https://github.com/ArchiveTeam/wget-lua
cd wget-lua
./bootstrap
./configure
make
mkdir -p ~/bin/ && cp ./src/wget ~/bin/wget-lua
Why would this be useful?
Custom error handling
What should wget do in case of an error? Sometimes you want it to retry the URL when it gets a server error.
wget.callbacks.httploop_result = function(url, err, http_stat)
  if http_stat.statcode == 500 then
    -- try again
    return wget.actions.CONTINUE
  elseif http_stat.statcode == 404 then
    -- stop
    return wget.actions.EXIT
  else
    -- let wget decide
    return wget.actions.NOTHING
  end
end
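A common variation in Archive Team grab scripts is to pause before retrying when the server signals overload. A minimal sketch along those lines, using Lua's os.execute to shell out to sleep (POSIX) before asking wget to try again:

wget.callbacks.httploop_result = function(url, err, http_stat)
  if http_stat.statcode == 429 or http_stat.statcode == 503 then
    -- back off for ten seconds, then retry the same URL
    os.execute("sleep 10")
    return wget.actions.CONTINUE
  end
  -- everything else: let wget decide
  return wget.actions.NOTHING
end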
Custom decide rules
Download this URL or not?
wget.callbacks.download_child_p = function(urlpos, parent, depth, start_url_parsed, iri, verdict)
  if string.find(urlpos.url, "textfiles.com") then
    -- always download
    return true
  elseif string.find(urlpos.url, "archive.org") then
    -- never!
    return false
  else
    -- follow wget's advice
    return verdict
  end
end
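The other parameters can drive the decision too. For example, a sketch that stops recursion past a fixed depth, assuming depth is the numeric recursion depth as the parameter name suggests:

wget.callbacks.download_child_p = function(urlpos, parent, depth, start_url_parsed, iri, verdict)
  -- never descend more than five levels; otherwise trust wget
  if depth > 5 then
    return false
  end
  return verdict
end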
Custom URL extraction/generation
Sometimes it's useful to write your own URL extraction code, for example to add URLs that aren't actually on the page.
wget.callbacks.get_urls = function(file, url, is_css, iri)
  if string.find(url, ".com/profile/[^/]+/$") then
    -- make sure wget downloads the user's photo page
    -- and custom profile photo
    return {
      { url=url.."photo.html", link_expect_html=1, link_expect_css=0 },
      { url=url.."photo.jpg", link_expect_html=0, link_expect_css=0 }
    }
  else
    -- no new urls to add
    return {}
  end
end
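The same callback can also do its own extraction from the downloaded document, the route projects like xanga-grab take (see the list below). A sketch, assuming file is the local path of the saved page as in the example above; the href pattern is deliberately naive:

wget.callbacks.get_urls = function(file, url, is_css, iri)
  local urls = {}
  local f = io.open(file, "rb")
  if f then
    local html = f:read("*all")
    f:close()
    -- collect everything that appears in an absolute href attribute
    for found in string.gmatch(html, 'href="(https?://[^"]+)"') do
      table.insert(urls, { url=found })
    end
  end
  return urls
end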
More Examples
Archive Team has real-life scripts in the Archive Team GitHub organization; look for recent projects whose names end in -grab. The Lua scripts range from simple checks to complex URL scraping.
- zapd-grab/zapd.lua: Avoids a JavaScript monstrosity by scraping anything that looks like a URL on the CDN.
- puush-grab/puush.lua: Checks the status code and the contents and returns custom error codes (see the sketch after this list).
- posterous-grab/posterous.lua: Checks the status code and delays if needed.
- xanga-grab/xanga.lua: Implements its own URL scraping.
- patch-grab/patch.lua: Scrapes URLs as it goes along and sends them off to a server to be processed later.
- formspring-grab/formspring.lua: Mimics the site's JavaScript by hand and builds its own request URLs.
- hyves-grab/hyves.lua: Works around JavaScript-driven pagination. Includes calling an external process to decrypt ciphertext.
- ArchiveBot/pipeline/archivebot.lua: Logs results in Redis and implements custom URL checking.
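The custom-error-code pattern from puush.lua reduces to something like the following sketch. wget.actions.ABORT and the before_exit callback appear in these grab scripts, but treat the exact signatures here as assumptions:

local aborted = false

wget.callbacks.httploop_result = function(url, err, http_stat)
  -- anything other than success or a plain 404 is treated as fatal
  if http_stat.statcode ~= 200 and http_stat.statcode ~= 404 then
    aborted = true
    return wget.actions.ABORT
  end
  return wget.actions.NOTHING
end

wget.callbacks.before_exit = function(exit_status, exit_status_string)
  -- return a custom exit code so a wrapper script can tell
  -- "the server misbehaved" apart from ordinary wget failures
  if aborted then
    return 42
  end
  return exit_status
end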