__NOTOC__
== WARC Tools ==
[[The WARC Ecosystem]] has information on tools to create, read and process WARC files.


== General Tools ==
* [[Wget|GNU WGET]]
** Backing up a WordPress site: <code><nowiki>wget --no-parent --no-clobber --html-extension --recursive --convert-links --page-requisites --user=<username> --password=<password> <path></nowiki></code>
* [https://curl.haxx.se/ cURL]
* [https://www.httrack.com/ HTTrack] - [[HTTrack options]]
* [http://pavuk.sourceforge.net/ Pavuk] -- a bit flaky, but very flexible
* [https://github.com/oduwsdl/warrick Warrick] - Tool to recover lost websites using various online archives and caches.
* [https://www.crummy.com/software/BeautifulSoup/ Beautiful Soup] - Python library for web scraping (see the sketch after this list)
* [https://scrapy.org/ Scrapy] - Fast Python library for web scraping
* [https://github.com/JustAnotherArchivist/snscrape snscrape] - Tool to scrape social networking services.
* [https://splinter.readthedocs.io/ Splinter] - Web app acceptance testing library for Python -- could be used along with a scraping lib to extract data from hard-to-reach places
* [https://sourceforge.net/projects/wilise/ WiLiSe] '''Wi'''ki'''Li'''nk '''Se'''arch - Python script to collect links to specific pages of a site by searching a wiki ([[wikipedia:MediaWiki|MediaWiki]]-type) that has [http://www.mediawiki.org/wiki/Api.php api.php] accessible or the [http://www.mediawiki.org/wiki/Extension:LinkSearch LinkSearch extension] enabled (see the api.php sketch after this list; the project is still very immature and at the moment the code is only available in [http://sourceforge.net/p/wilise/code/1/tree/code/trunk/ this SVN repository]).
* [[Mobile Phone Applications]] -- some notes on preserving old versions of mobile apps
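As a rough illustration of the kind of job the Python scraping libraries above handle, here is a minimal Beautiful Soup sketch (assuming Python 3 with the <code>requests</code> and <code>beautifulsoup4</code> packages installed; the URL is only a placeholder) that fetches a page and lists every link on it:

<pre>
import requests
from bs4 import BeautifulSoup

# Placeholder URL: replace with the page you actually want to scrape.
url = "https://example.com/"

# Fetch the page and parse the returned HTML.
response = requests.get(url)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# Print the href target of every link element found on the page.
for link in soup.find_all("a", href=True):
    print(link["href"])
</pre>

Scrapy covers the same ground but structures crawls as spider classes, which tends to scale better for large, multi-page jobs.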
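For MediaWiki sites, the sort of link search that WiLiSe automates can also be done by hand against <code>api.php</code>. This is only a minimal sketch, assuming the target wiki exposes the standard <code>exturlusage</code> query module; the endpoint and search string are placeholders:

<pre>
import requests

# Placeholder endpoint and search string: replace with the wiki you are
# querying and the external URL you are hunting for.
API_URL = "https://example.org/w/api.php"
SEARCH = "example.com"

# Ask the MediaWiki API for pages whose external links match the search string.
params = {
    "action": "query",
    "list": "exturlusage",
    "euquery": SEARCH,
    "eulimit": "50",
    "format": "json",
}
data = requests.get(API_URL, params=params).json()

# Each result carries the page title and the matching external URL.
for entry in data["query"]["exturlusage"]:
    print(entry["title"], entry["url"])
</pre>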


== Hosted tools ==
* [https://pinboard.in/ Pinboard] is a convenient social bookmarking service that will [http://pinboard.in/blog/153/ archive copies of all your bookmarks] for online viewing. The catch is that it costs $11/year ($25/year if you want the archival feature), and you can only download archives of your 25 most recent bookmarks in a particular category. This may pose problems if you ever need to get your data out in a hurry.
* [https://freeyourstuff.cc/ freeyourstuff.cc] -- Extensible open-source ([https://github.com/eloquence/freeyourstuff.cc source]) Chrome plugin allowing users to export their own content (reviews, posts, etc.). Exports to JSON format; content can optionally be published to freeyourstuff.cc and its mirrors under the Creative Commons CC0 license. Supports Yelp, [[IMDB]], TripAdvisor, [[Amazon]], GoodReads, and [[Quora]] as of July 2019.


== Site-Specific ==
* [[Twitter]]
* [http://code.google.com/p/somaseek/ SomaFM]
* [https://www.allmytweets.net/ All My Tweets] - Download the last 3,200 tweets from any user.


== Format Specific ==

* [http://www.shlock.co.uk/Utils/OmniFlop/OmniFlop.htm OmniFlop]
== Proposed ==
* [https://solidproject.org/ Solid project] attempts to make data portability a reality
* [https://datatransferproject.dev/ Data Transfer Project] is (a promise of) a quick implementation of [[wikipedia:GDPR|GDPR]] data portability by the [[wikipedia:GAFA|GAFA]] companies plus Twitter
== Web scraping ==
* See [[Site exploration]]
{{Navigation pager
| previous = Why Back Up?
| next = Formats
}}
{{Navigation box}}


[[Category:Tools| ]]
