Difference between revisions of "Software"

From Archiveteam
(Added Pinboard.in)
Line 9:
* [http://pavuk.sourceforge.net/ Pavuk] -- a bit flaky, but very flexible
* http://warrick.cs.odu.edu/warrick.html
* [http://www.crummy.com/software/BeautifulSoup/ Beautiful Soup] -- Python library for web scraping
* [http://scrapy.org/ Scrapy] -- fast Python library for web scraping
* [http://splinter.cobrateam.info/ Splinter] -- web app acceptance testing library for Python; could be used along with a scraping library to extract data from hard-to-reach places


== Hosted tools ==

Revision as of 03:27, 17 May 2011

General Tools

  • GNU WGET
    • Backing up a Wordpress site: "wget --no-parent --no-clobber --html-extension --recursive --convert-links --page-requisites --user=<username> --password=<password> <path>"
  • cURL
  • HTTrack - HTTrack options
  • Heritrix -- what archive.org uses
  • Pavuk -- a bit flaky, but very flexible
  • http://warrick.cs.odu.edu/warrick.html
  • Beautiful Soup -- Python library for web scraping
  • Scrapy -- fast Python library for web scraping
  • Splinter -- web app acceptance testing library for Python; could be used along with a scraping library to extract data from hard-to-reach places
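
The WordPress backup command listed under GNU WGET can also be assembled from a script, which avoids shell-quoting problems with passwords. The following is only a minimal sketch: the helper name `build_wget_command`, the example URL and the example credentials are all hypothetical, and the flags are copied unchanged from the command above. It builds the argument list and prints it; actually running wget is left as a commented-out step.

```python
import shlex


def build_wget_command(url, username, password):
    """Return the wget argument list for the WordPress backup command above.

    Hypothetical helper for illustration; the flags are the ones quoted
    in the list, nothing is added or removed.
    """
    return [
        "wget",
        "--no-parent",        # never ascend above the starting directory
        "--no-clobber",       # skip files that already exist locally
        "--html-extension",   # save HTML pages with an .html suffix
        "--recursive",        # follow links and download recursively
        "--convert-links",    # rewrite links so the copy works offline
        "--page-requisites",  # also fetch images, CSS and similar assets
        f"--user={username}",
        f"--password={password}",
        url,
    ]


if __name__ == "__main__":
    # Placeholder values; substitute your own site and login.
    cmd = build_wget_command("http://blog.example.com/", "alice", "s3cret")
    print(shlex.join(cmd))
    # To run the download, assuming wget is installed:
    # import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than one string means a password containing spaces or shell metacharacters is handed to wget untouched.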

Hosted tools

Pinboard is a convenient social bookmarking service that will archive copies of all your bookmarks for online viewing. The catch is the cost and the export limit: it costs $9.25 just to join, plus $25/year for the archival feature, and you can only download archives of your 25 most recent bookmarks in a particular category. This may pose problems if you ever need to get your data out in a hurry.

Site-Specific

Format Specific