Software

From Archiveteam
Revision as of 19:24, 5 December 2017 by Ola norsk (→ Hosted tools: added link to webrecorder github repo)

WARC Tools

The WARC Ecosystem page covers wget, Heritrix, and many small but handy tools for creating, reading, and processing WARC files.
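
For a rough sense of what such tools deal with, here is a minimal sketch (not taken from any of them) of how an uncompressed WARC file is laid out: each record is a `WARC/1.0` version line, a block of `Name: value` headers, a blank line, and then exactly `Content-Length` bytes of content, followed by two blank lines. The function name `iter_warc_records` and the sample record are illustrative assumptions. Real WARC files are usually gzip-compressed record by record and may use header features this sketch ignores, so a dedicated library is the better choice for actual work.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, content) for each record in an uncompressed WARC stream."""
    while True:
        version = stream.readline()
        if not version:
            return  # end of stream
        if not version.strip():
            continue  # blank lines separate one record from the next
        if not version.startswith(b"WARC/"):
            raise ValueError(f"not a WARC record: {version!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The content block is exactly Content-Length bytes long.
        length = int(headers["Content-Length"])
        yield headers, stream.read(length)


# A hand-written sample record (hypothetical data, for illustration only).
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
)

for headers, content in iter_warc_records(io.BytesIO(sample)):
    print(headers["WARC-Type"], headers["WARC-Target-URI"], len(content))
```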

General Tools

  • GNU Wget
    • Backing up a WordPress site: "wget --no-parent --no-clobber --html-extension --recursive --convert-links --page-requisites --user=<username> --password=<password> <path>"
  • cURL
  • HTTrack - HTTrack options
  • Pavuk -- a bit flaky, but very flexible
  • http://warrick.cs.odu.edu/warrick.html
  • Beautiful Soup - Python library for web scraping
  • Scrapy - Fast Python framework for web scraping
  • Splinter - Web app acceptance testing library for Python -- could be used along with a scraping lib to extract data from hard-to-reach places
  • WiLiSe WikiLink Search - Python script that finds links to specific pages of a site by searching a MediaWiki-type wiki that has api.php accessible or the LinkSearch extension enabled (the project is still very immature, and at the moment the code is only available in this SVN repository).
  • Mobile Phone Applications -- some notes on preserving old versions of mobile apps
  • freeyourstuff.cc -- Extensible open-source Chrome plugin that lets users export their own content (reviews, posts, etc.). It exports to JSON and can optionally publish to freeyourstuff.cc and its mirrors under the Creative Commons CC0 license. Supports Yelp, IMDB, TripAdvisor, Amazon, GoodReads, and Quora as of 22:52, 11 June 2016 (EDT)
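
The wget command listed above for backing up a WordPress site can also be assembled from a script, which makes it easier to reuse across several sites. The following is a minimal sketch using only the flags from that command; the function name `build_wget_backup_command` and the example URL and credentials are hypothetical placeholders, not part of the original command.

```python
import shlex


def build_wget_backup_command(url, username, password):
    """Return the wget argument list for the site backup described above."""
    return [
        "wget",
        "--no-parent",        # never ascend above the starting directory
        "--no-clobber",       # skip files that already exist locally
        "--html-extension",   # save HTML pages with an .html suffix
        "--recursive",        # follow links and download them too
        "--convert-links",    # rewrite links so the copy works offline
        "--page-requisites",  # also fetch images, CSS, and scripts
        f"--user={username}",
        f"--password={password}",
        url,
    ]


# Placeholder values for illustration only.
command = build_wget_backup_command("https://example.com/blog/", "alice", "s3cret")

# Print the shell-quoted command instead of running it.
print(shlex.join(command))
```

To actually run the backup, the list can be passed to `subprocess.run(command, check=True)`. Note that a password given on the command line is visible to other users of the machine in the process list.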

Hosted tools

  • Pinboard is a convenient social bookmarking service that will archive copies of all your bookmarks for online viewing. The catch is that it costs $9.25 just to join, plus $25/year for the archival feature, and you can only download archives of your 25 most recent bookmarks in a particular category. This may pose problems if you ever need to get your data out in a hurry.
  • Webrecorder is both a tool for creating high-fidelity, interactive web archives, in WARC format, of any website you browse and a platform for making those recordings accessible. See their FAQ page or their GitHub repo for more information.

Site-Specific

Format Specific

Web scraping
