Difference between revisions of "Software"

From Archiveteam
Jump to: navigation, search
(minor updates)
(proposals)
 
(17 intermediate revisions by 10 users not shown)
Line 1: Line 1:
 
__NOTOC__
 
__NOTOC__
 +
== WARC Tools ==
 +
[[The WARC Ecosystem]] has information on tools to create, read and process WARC files.
 +
 
== General Tools ==
 
== General Tools ==
  
 
* [[Wget|GNU WGET]]
 
* [[Wget|GNU WGET]]
 
** Backing up a Wordpress site: "wget --no-parent --no-clobber --html-extension --recursive --convert-links --page-requisites --user=<username> --password=<password> <path>"
 
** Backing up a Wordpress site: "wget --no-parent --no-clobber --html-extension --recursive --convert-links --page-requisites --user=<username> --password=<password> <path>"
* [[Wget with WARC output]]
+
* [https://curl.haxx.se/ cURL]
* [http://curl.haxx.se/ cURL]
+
* [https://www.httrack.com/ HTTrack] - [[HTTrack options]]
* [http://www.httrack.com/ HTTrack] - [[HTTrack options]]
 
* [http://crawler.archive.org/ Heritrix] -- what archive.org use
 
 
* [http://pavuk.sourceforge.net/ Pavuk] -- a bit flaky, but very flexible
 
* [http://pavuk.sourceforge.net/ Pavuk] -- a bit flaky, but very flexible
* http://warrick.cs.odu.edu/warrick.html
+
* [https://github.com/oduwsdl/warrick Warrick] - Tool to recover lost websites using various online archives and caches.
* [http://www.crummy.com/software/BeautifulSoup/ Beautiful Soup] - Python library for web scraping
+
* [https://www.crummy.com/software/BeautifulSoup/ Beautiful Soup] - Python library for web scraping
* [http://scrapy.org/ Scrapy] - Fast python library for web scraping
+
* [https://scrapy.org/ Scrapy] - Fast python library for web scraping
* [http://splinter.cobrateam.info/ Splinter] - Web app acceptance testing library for Python -- could be used along with a scraping lib to extract data from hard-to-reach places
+
* [https://github.com/JustAnotherArchivist/snscrape snscrape] - Tool to scrape social networking services.
* [http://sourceforge.net/projects/wilise/ WiLiSe] '''Wi'''ki'''Li'''nk '''Se'''arch - Python script to get links to specific pages of a site through the search in a Wiki ([[wikipedia:MediaWiki|MediaWiki]]-type) has the [http://www.mediawiki.org/wiki/Api.php api.php] accessible or [http://www.mediawiki.org/wiki/Extension:LinkSearch extension LinkSearch] enabled (the project is still very immature and at the moment the code is only available in [http://sourceforge.net/p/wilise/code/1/tree/code/trunk/ this SVN repository]).
+
* [https://splinter.readthedocs.io/ Splinter] - Web app acceptance testing library for Python -- could be used along with a scraping lib to extract data from hard-to-reach places
 +
* [https://sourceforge.net/projects/wilise/ WiLiSe] '''Wi'''ki'''Li'''nk '''Se'''arch - Python script to get links to specific pages of a site through the search in a Wiki ([[wikipedia:MediaWiki|MediaWiki]]-type) has the [http://www.mediawiki.org/wiki/Api.php api.php] accessible or [http://www.mediawiki.org/wiki/Extension:LinkSearch extension LinkSearch] enabled (the project is still very immature and at the moment the code is only available in [http://sourceforge.net/p/wilise/code/1/tree/code/trunk/ this SVN repository]).
 +
* [[Mobile Phone Applications]] -- some notes on preserving old versions of mobile apps
  
 
== Hosted tools ==
 
== Hosted tools ==
[http://www.pinboard.in Pinboard] is a convenient social bookmarking service that will [http://pinboard.in/blog/153/ archive copies of all your bookmarks] for online viewing. The catch is that it costs $9.25 just to join, plus $25/year for the archival feature and you can only download archives of your 25 most recent bookmarks in a particular category. This may pose problems if you ever need to get your data out in a hurry.
+
* [https://pinboard.in/ Pinboard] is a convenient social bookmarking service that will [http://pinboard.in/blog/153/ archive copies of all your bookmarks] for online viewing. The catch is that it costs $11/year, or $25/year if you want the archival feature and you can only download archives of your 25 most recent bookmarks in a particular category. This may pose problems if you ever need to get your data out in a hurry.
 +
* [https://freeyourstuff.cc/ freeyourstuff.cc] -- Extensible open-source ([https://github.com/eloquence/freeyourstuff.cc source]) Chrome plugin allowing users to export their own content (reviews, posts, etc.). Exports to JSON format, optionally publish to freeyourstuff.cc & mirrors under Creative Commons CC0 license. Supports Yelp, [[IMDB]], TripAdvisor, [[Amazon]], GoodReads, and [[Quora]] as of July 2019.
  
 
== Site-Specific ==
 
== Site-Specific ==
Line 24: Line 28:
 
* [[Twitter]]
 
* [[Twitter]]
 
* [http://code.google.com/p/somaseek/ SomaFM]
 
* [http://code.google.com/p/somaseek/ SomaFM]
 +
* https://www.allmytweets.net/ - Download the last 3,200 tweets from any user.
  
 
== Format Specific ==
 
== Format Specific ==
Line 29: Line 34:
 
* [http://www.shlock.co.uk/Utils/OmniFlop/OmniFlop.htm OmniFlop]
 
* [http://www.shlock.co.uk/Utils/OmniFlop/OmniFlop.htm OmniFlop]
  
[[Category:Tools| ]]
+
== Proposed ==
  
== The Business 9 Women Kept A Secret For Three Decades ==
+
* [https://solidproject.org/ Solid project] attempts to make data portability a reality
 +
* [https://datatransferproject.dev/ Data transfer project] is a (promise of) a quick implementation of [[wikipedia:GDPR|GDPR]] data portability by the [[wikipedia:GAFA|GAFA]] + Twitter
  
Somewhere in West Tennessee, not far from Graceland, nine women -- or "The 9 Nanas," as they prefer to be called -- gather in the darkness of night. At 4am they begin their daily routine -- a ritual that no one, not even their husbands, knew about for 30 years. They have one mission and one mission only: to create happiness. And it all begins with baked goods.One of us starts sifting the flour and another washing the eggs,
+
== Web scraping ==
  
[[http://goodvillenews.com/The-Business-9-Women-Kept-A-Secret-For-Three-Decades-hOoGN8.html The Business 9 Women Kept A Secret For Three Decades]]
+
* See [[Site exploration]]
  
[[http://goodvillenews.com/wk.html GoodvilleNews.com - good, positive news, inspirational stories, articles]]
+
{{Navigation pager
 +
| previous = Why Back Up?
 +
| next = Formats
 +
}}
 +
{{Navigation box}}
  
== Have the Courage to Walk Alone ==
+
[[Category:Tools| ]]
 
 
The woman who follows the crowd will usually go no further than the crowd. The woman who walks alone is likely to find herself in places no one has ever been before.
 
 
 
[[http://goodvillenews.com/Have-the-Courage-to-Walk-Alone-JKIF7y.html Have the Courage to Walk Alone]]
 
 
 
[[http://goodvillenews.com/wk.html GoodvilleNews.com - good, positive news, inspirational stories, articles]]
 
 
 
== Student Goes From Homeless to Harvard ==
 
 
 
Despite being abandoned to homelessness by her parents, Dawn Loggins worked as a high school custodian by day and studied hard by night to become the first person from her school to ever be admitted to Harvard.
 
 
 
[[http://goodvillenews.com/Student-Goes-From-Homeless-to-Harvard-QX9Vg4.html Student Goes From Homeless to Harvard]]
 
 
 
[[http://goodvillenews.com/wk.html GoodvilleNews.com - good, positive news, inspirational stories, articles]]
 
 
 
== Your Life Is What You Make It ==
 
 
 
If you limit your choices only to what seems possible or reasonable, you disconnect yourself from what you truly want, and all that is left is compromise. Robert Fritz
 
 
 
[[http://goodvillenews.com/Your-Life-Is-What-You-Make-It-aTRqC9.html Your Life Is What You Make It]]
 
 
 
[[http://goodvillenews.com/wk.html GoodvilleNews.com - good, positive news, inspirational stories, articles]]
 
 
 
== Not Your Usual Panhandler ==
 
 
 
Doug Eaton wanted to celebrate his birthday on June 11 in a big way, so he turned to his friends for ideas -- ended up marking the day with random acts of kindness, including handing out free money to people passing by.
 
 
 
[[http://goodvillenews.com/Not-Your-Usual-Panhandler-6kQtGi.html Not Your Usual Panhandler]]
 
 
 
[[http://goodvillenews.com/wk.html GoodvilleNews.com - good, positive news, inspirational stories, articles]]
 

Latest revision as of 05:20, 4 December 2019

WARC Tools

The WARC Ecosystem has information on tools to create, read and process WARC files.

General Tools

  • GNU WGET
    • Backing up a Wordpress site: "wget --no-parent --no-clobber --html-extension --recursive --convert-links --page-requisites --user=<username> --password=<password> <path>"
  • cURL
  • HTTrack - HTTrack options
  • Pavuk -- a bit flaky, but very flexible
  • Warrick - Tool to recover lost websites using various online archives and caches.
  • Beautiful Soup - Python library for web scraping
  • Scrapy - Fast python library for web scraping
  • snscrape - Tool to scrape social networking services.
  • Splinter - Web app acceptance testing library for Python -- could be used along with a scraping lib to extract data from hard-to-reach places
  • WiLiSe WikiLink Search - Python script to get links to specific pages of a site through the search in a Wiki (MediaWiki-type) has the api.php accessible or extension LinkSearch enabled (the project is still very immature and at the moment the code is only available in this SVN repository).
  • Mobile Phone Applications -- some notes on preserving old versions of mobile apps

Hosted tools

  • Pinboard is a convenient social bookmarking service that will archive copies of all your bookmarks for online viewing. The catch is that it costs $11/year, or $25/year if you want the archival feature and you can only download archives of your 25 most recent bookmarks in a particular category. This may pose problems if you ever need to get your data out in a hurry.
  • freeyourstuff.cc -- Extensible open-source (source) Chrome plugin allowing users to export their own content (reviews, posts, etc.). Exports to JSON format, optionally publish to freeyourstuff.cc & mirrors under Creative Commons CC0 license. Supports Yelp, IMDB, TripAdvisor, Amazon, GoodReads, and Quora as of July 2019.

Site-Specific

Format Specific

Proposed

Web scraping

Why Back Up?SoftwareFormats


v · t · e         Archive Team
Current events

Alive... OR ARE THEY · Deathwatch · Projects

Archiveteam.jpg
Archiving projects

APKMirror · Archive.is · BetaArchive · Government Backup (#datarefuge · ftp-gov· Gmane · Internet Archive · It Died · Megalodon.jp · OldApps.com · OldVersion.com · OSBetaArchive · TEXTFILES.COM · The Dead, the Dying & The Damned · The Mail Archive · UK Web Archive · WebCite · Vaporwave.me

Blogging

Blog.pl · Blogger · Blogster · Blogter.hu · Freeblog.hu · Fuelmyblog · Jux · LiveJournal · My Opera · Nolblog.hu · Open Diary · ownlog.com · Posterous · Powerblogs · Proust · Roon · Splinder · Tumblr · Vox · Weblog.nl · Windows Live Spaces · Wordpress.com · Xanga · Yahoo! Blog · Zapd

Cloud hosting/file sharing

aDrive · AnyHub · Box · Dropbox · Docstoc · Google Drive · Google Groups Files · iCloud · Fileplanet · LayerVault · MediaCrush · MediaFire · Mega · MegaUpload · MobileMe · OneDrive · Pomf.se · RapidShare · Ubuntu One · Yahoo! Briefcase

Corporations

Apple · IBM · Google · Loblaw · Lycos Europe · Microsoft · Yahoo!

Events

Arab Spring · Great Ape-Snake War · Spanish Revolution

Font Repos

DaFont · Google Web Fonts · GNU FreeFont · Fontspace

Forums/Message boards

4chan · Captain Luffy Forums · College Confidential · DSLReports · ESPN Forums · forums.starwars.com · HeavenGames · Invisionfree · NeoGAF · The Classic Horror Film Board · Yahoo! Messages · Yahoo! Neighbors · Yuku.com

Gaming

Atomicgamer · Bazaar.tf · City of Heroes · Club Nintendo · Counter-Strike: Global Offensive · CS:GO Lounge · Desura · Dota 2 · Dota 2 Lounge · Emulation Zone · ESEA · GameBanana · GameMaker Sandbox · GameTrailers · Halo · HLTV.org · HQ Trivia · Infinite Crisis · joinDOTA · League of Legends · Liquipedia · Minecraft.net · Player.me · Playfire · Raptr · Steam · SteamDB · SteamGridDB · Team Fortress 2 · TF2 Outpost · Warhammer · Xfire

Image hosting

500px · AOL Pictures · Blipfoto · Blingee · Canv.as · Camera+ · Cameroid · DailyBooth · Degree Confluence Project · deviantART · Demotivalo.net · Flickr · Fotoalbum.hu · Fotolog.com · Fotopedia · Frontback · Geograph Britain and Ireland · Giphy · GTF Képhost · ImageShack · Imgh.us · Imgur · Inkblazers · Instagram · Kepfeltoltes.hu · Kephost.com · Kephost.hu · Kepkezelo.com · Keptarad.hu · Madden GIFERATOR · MLKSHK · Microsoft Clip Art · Microsoft Photosynth · Nokia Memories · noob.hu · Odysee · Panoramio · Photobucket · Picasa · Picplz · Pixiv · Portalgraphics.net · PSharing · Ptch · puu.sh · Rawporter · Relay.im · ScreenshotsDatabase.com · Snapjoy · Streetfiles · Tabblo · Tinypic · Trovebox · TwitPic · Wallbase · Wallhaven · Webshots · Wikimedia Commons

Knowledge/Wikis

arXiv · Citizendium · Clipboard.com · Deletionpedia · EditThis · Encyclopedia Dramatica · Etherpad · Everything2 · infoAnarchy · GeoNames · GNUPedia · Google Books (Google Books Ngram· Horror Movie Database · Insurgency Wiki · Knol · Lost Media Wiki · Neoseeker.com · Notepad.cc · Nupedia · OpenCourseWare · OpenStreetMap · Orain · Pastebin · Patch.com · Project Gutenberg · Puella Magi · Referata · Resedagboken · SongMeanings · ShoutWiki · The Internet Movie Database · TropicalWikis · Uncyclopedia · Urban Dictionary · Urban Exploration Resource · Webmonkey · Wikia · Wikidot · WikiHow · Wikkii · WikiLeaks · Wikipedia (Simple English Wikipedia· Wikispaces · Wikispot · Wik.is · Wiki-Site · WikiTravel · Word Count Journal

Magazines/Blogs/News

Cyberpunkreview.com · Game Developer Magazine · Gigaom · Hardware Canucks · Helium · JPG Magazine · Make Magazine · Polygamia.pl · San Fransisco Bay Guardian · Scoop · Regretsy · Yahoo! Voices

Microblogging

Heello · Identi.ca · Jaiku · Mommo.hu · Plurk · Sina Weibo · Twitter · TwitLonger

Music/Audio

AOL Music · Audimated.com · Cinch · digCCmixter · Dogmazic.net · Earbits · exfm · Free Music Archive · Gogoyoko · Indaba Music · Instacast · Jamendo · Last.fm · Music Unlimited · MOG · PureVolume · Reverbnation · ShareTheMusic · SoundCloud · Soundpedia · This Is My Jam · TuneWiki · Twaud.io · WinAmp

People

Aaron Swartz · Michael S. Hart · Steve Jobs · Mark Pilgrim · Dennis Ritchie · Len Sassaman Project

Protocols/Infrastructure

FTP · Gopher · IRC · Usenet · World Wide Web
BitTorrent DHT

Q&A

Askville · Answerbag · Answers.com · Ask.com · Askalo · Baidu Knows · Blurtit · ChaCha · Experts Exchange · Formspring · GirlsAskGuys · Google Answers · Google Baraza · JustAnswer · MetaFilter · Quora · Retrospring · StackExchange · The AnswerBank · The Internet Oracle · Uclue · WikiAnswers · Yahoo! Answers

Recipes/Food

Allrecipes · Epicurious · Food.com · Foodily · Food Network · Punchfork · ZipList

Social bookmarking

Addinto · Backflip · Balatarin · BibSonomy · Bkmrx · Blinklist · BlogMarks · BookmarkSync · CiteULike · Connotea · Delicious · Designer News · Digg · Diigo · Dir.eccion.es · Evernote · Excite Bookmark · Faves · Favilous · folkd · Freelish · Getboo · GiveALink.org · Gnolia · Google Bookmarks · Hacker News · HeyStaks · IndianPad · Kippt · Knowledge Plaza · Licorize · Linkwad · Menéame · Microsoft Developer Network · myVIP · Mister Wong · My Web · Mylink Vault · Newsvine · Oneview · Pearltrees · Pinboard · Pocket · Propeller.com · Reddit · sabros.us · Scloog · Scuttle · Simpy · SiteBar · Slashdot · Squidoo · StumbleUpon · Twine · Vizited · Yummymarks · Xmarks · Yahoo! Buzz · Zootool · Zotero

Social networks

Bebo · BlackPlanet · Classmates.com · Cyworld · Dogster · Dopplr · douban · Ello · Facebook · Flixster · FriendFeed · Friendster · Friends Reunited · Gaia Online · Google+ · Habbo · hi5 · Hyves · iWiW · LinkedIn · Miiverse · mixi · MyHeritage · MyLife · Myspace · myVIP · Netlog · Odnoklassniki · Orkut · Plaxo · Qzone · Renren · Skyrock · Sonico.com · Storylane · Tagged · tvtag · Upcoming · Viadeo · Vine · Vkontakte · WeeWorld · Weibo · Wretch · Yahoo! Groups · Yahoo! Stars India · Yahoo! Upcoming · more sites...

Shopping/Retail

Alibaba · AliExpress · Amazon · Apple Store · Barnes & Noble · DirectCanada · eBay · Kmart · NCIX · Printfection · RadioShack · Sears · Sears Canada · Target · The Book Depository · ThinkGeek · Toys "R" Us · Walmart

Software/code hosting

Android Development · Alioth · Assembla · BerliOS · Betavine · Bitbucket · BountySource · Codecademy · CodePlex · Freepository · Free Software Foundation · GNU Savannah · GitHost  · GitHub · GitHub Downloads · Gitorious · Gna! · Google Code · ibiblio · java.net · JavaForge · KnowledgeForge · Launchpad · LuaForge · Maemo · mozdev · OSOR.eu · OW2 Consortium · Openmoko · OpenSolaris · Ourproject.org · Ovi Store · Project Kenai · RubyForge · SEUL.org · SourceForge · Stypi · TestFlight · tigris.org · Transifex · TuxFamily · Yahoo! Downloads

Television/Radio

ABC · Austin City Limits · BBC · CBC · CBS · Computer Chronicles · CTV · Fox · G4 · Global TV · Jeopardy! · NBC · NHK · PBS · Penn & Teller: Bullshit! · The Howard Stern Show · TV News Archive (Understanding 9/11)

Torrenting/Piracy

ExtraTorrent · EZTV · isoHunt · KickassTorrents · The Pirate Bay · Torrentz · Library Genesis

Video hosting

Academic Earth · Bambuser · Blip.tv · Epic · Google Video · Justin.tv · Niconico · Nokia Trailers · Oddshot.tv · Plays.tv · Qwiki · Skillfeed · Stickam · TED Talks · Ticker.tv · Twitch.tv · Ustream · Videoplayer.hu · Viddler · Viddy · Vidme · Vimeo · Vine · Vstreamers · Yahoo! Video · YouTube · Famous Internet videos (Me at the zoo)

Web hosting

Angelfire · Brace.io · BT Internet · CableAmerica Personal Web Space · Claranet Netherlands Personal Web Pages · Comcast Personal Web Pages · Extra.hu · FortuneCity · Free ProHosting · GeoCities (patch· Google Business Sitebuilder · Google Sites · Internet Centrum · MBinternet · MSN TV · Nifty · Nwnyet · Parodius Networking · Prodigy.net · Saunalahti Iso G · Swipnet · Telenor · Tripod · University of Michigan personal webpages · Verizon Mysite · Verizon Personal Web Space · Webzdarma · Virgin Media

Web applications

Mailman · MediaWiki · phpBB · Simple Machines Forum · vBulletin

Information

A Million Ways to Die on the Web · Backup Tips · Cheap storage · Collecting items randomly · Data compression algorithms and tools · Dev · Discovery Data · DOS Floppies · Fortress of Solitude · Keywords · Naughty List · Nightmare Projects · Rescuing floppy disks · Rescuing optical media · Site exploration · The WARC Ecosystem · Working with ARCHIVE.ORG

Projects

ArchiveCorps · Audit2014 · Emularity · Faceoff · FlickrFckr · Froogle · INTERNETARCHIVE.BAK (Internet Archive Census· IRC Quotes · JSMESS · JSVLC · Just Solve the Problem · NewsGrabber · Project Newsletter · Valhalla · Web Roasting (ISP Hosting · University Web Hosting· Woohoo

Tools

ArchiveBot · ArchiveTeam Warrior (Tracker· Google Takeout · HTTrack · Video downloaders · Wget (Lua · WARC)

Teams

Bibliotheca Anonoma · LibreTeam · URLTeam · Yahoo Video Warroom · WikiTeam

Other

800notes · AOL · Akoha · Ancestry.com · April Fools' Day · Amplicate · AutoAdmit · Bre.ad · Circavie · Cobook · Co.mments · Countdown · Discourse · Distill · Dmoz · Easel · Eircode · Electronic Frontier Foundation · FanFiction.Net · Feedly · Ficlets · Forrst · FunnyExam.com · FurAffinity · Google Helpouts · Google Moderator · Google Reader · ICQmail · IFTTT · Jajah · JuniorNet · Lulu Poetry · Mobile Phone Applications · Mochi Media · Mozilla Firefox · MyBlogLog · NBII · Neopets · Quantcast · Quizilla · Salon Table Talk · Shutdownify · Slidecast · Stack Overflow · SOPA blackout pages · starwars.yahoo.com · TechNet · Toshiba Support · USA-Gov · Volán · Widgetbox · Windows Technical Preview · Wunderlist · YTMND · Zoocasa

About Archive Team

Introduction · Philosophy · Who We Are · Our stance on robots.txt · Why Back Up? · Software · Formats · Storage Media · Recommended Reading · Films and documentaries about archiving · Talks · In The Media · FAQ