GitHub

From Archiveteam
Jump to: navigation, search
GitHub
GitHub logo
A screen shot of the GitHub home page taken on 2015-11-08
A screen shot of the GitHub home page taken on 2015-11-08
URL GitHub [IA] [WebCite] [archive.today]
Project status Online!
Archiving status Not saved yet
Project source Unknown
Project tracker Unknown
IRC channel #archiveteam
See also GitHub Downloads

GitHub is a software repository powered by Git. Does not seem to have any site issues, often 24 hours uptime (see site status). Looks pretty sunny at the moment, but when disaster strikes it would be a problem archiving the private repositories.

As of 12th August 2012: 1,963,652 people hosting over 3,460,582 repositories 1,117,147 public repositories are forks, which greatly reduces the amount of data required to archive it. As of 22 November 2015: There are 32,000,000 repositories, with a similar fork ratio. Back-of-the-envelope calculations suggest 120TB of data in git repositories.

Backup tools

"git clone" is the simplest one. However, it does not get some project data that is not stored in git, including issue reports, comments, pull requests.

github-backup runs in a git repository and chases down that information, committing it to a "github" branch. It also chases down the forks and efficiently downloads them as well.

githubarchive.org and GHTorrent are both creating archives of the GitHub "timeline", that is, all events like git pushes, forks, created issues, pull requests, etc.

codearchive.org Effort to backup all the versions of all the repos on GitHub and other sources. Slides from a talk about it.

See also Software Heritage.

Github Replacement Engines

If we ever have to archive the data out of Github, the data will need to be exportable to a Github-style engine.

Currently, the best Github-style engine that has a Wiki, issues, Git Repo hosting, and is free and open source to use is GitLab. The engine is used by and paid for by many major organizations, so it is likely to live on in a stable way.

We will need a complete migration system to move a git repository and all related Github service information of a repository to GitLab.

Things to Scrape

In case of emergency, these are the items we need to grab.

  • Git Repository - Accomplished by github-backup
    • Forked Repositories - Accomplished by github-backup
    • Notes on Commits/Lines of Code - Not supported by github-backup yet. API support was just added for it.
  • Github Gollum Wiki - No tool yet, but just clone the whole thing, and then push it to GitLab.
  • Releases - Tags on Github can have binaries attached. These are of high priority to archive.

List of Repositories

A list of repositories from Github API data are maintained by an archive team member at za3k.com. It scrapes continuously. Public downloads are updated once a day. This list does not include gists.

Github Archive

The metadata generated by the Github API are archived to Google BigQuery every hour by GithubArchive.

It obviously doesn't grab events dating before 2011, so a targeted repository scrape may still be ideal.

But at least it could be possible to grab all info about a single repository using Google BigQuery's free version, since it would use a low amount of CPU power. However, we need to create such an export script for it when the time comes.

  • Issues + Comments - Accomplished by github-backup
    • Milestones - Github Backup currently does not archive this yet.
    • Labels - Github Backup currently does not archive this yet.
  • Hooks - Needs some kind of tool to archive Github Hooks

External links


[view]  [edit]                   Archive Team                  
Current events Alive... OR ARE THEY  · Deathwatch  · Projects
Archiveteam.jpg
Archiving projects APKMirror  · Archive.is  · BetaArchive  · Gmane  · Internet Archive  · It Died  · Megalodon.jp  · OldApps.com  · OldVersion.com  · OSBetaArchive  · TEXTFILES.COM  · The Dead, the Dying & The Damned  · The Mail Archive  · UK Web Archive  · WebCite  · Vaporwave.me
Blogging Blog.pl  · Blogger  · Blogster  · Blogter.hu  · Freeblog.hu  · Fuelmyblog  · Jux  · LiveJournal  · My Opera  · Nolblog.hu  · Open Diary  · ownlog.com  · Posterous  · Powerblogs  · Proust  · Roon  · Splinder  · Tumblr  · Vox  · Weblog.nl  · Windows Live Spaces  · Wordpress.com  · Xanga  · Yahoo! Blog  · Zapd
Cloud hosting/file sharing aDrive  · AnyHub  · Box  · Dropbox  · Docstoc  · Google Drive  · Google Groups Files  · iCloud  · Fileplanet  · LayerVault  · MediaCrush  · MediaFire  · Mega  · MegaUpload  · MobileMe  · OneDrive  · Pomf.se  · RapidShare  · Ubuntu One  · Yahoo! Briefcase
Corporations Apple  · IBM  · Google  · Lycos Europe  · Microsoft  · Yahoo!
Events Arab Spring  · Occupy movement  · Spanish Revolution
Font Repos Google Web Fonts  · GNU FreeFont  · Fontspace
Forums/Message boards 4chan  · Captain Luffy Forums  · College Confidential  · DSLReports  · ESPN Forums  · forums.starwars.com  · HeavenGames  · Invisionfree  · The Classic Horror Film Board  · Yahoo! Messages  · Yahoo! Neighbors  · Yuku.com
Gaming Atomicgamer  · City of Heroes  · Club Nintendo  · CS:GO Lounge  · Desura  · Dota 2 Lounge  · Emulation Zone  · GameMaker Sandbox  · GameTrailers  · Halo  · HLTV.org  · Infinite Crisis  · Minecraft.net  · Player.me  · Playfire  · Steam  · SteamDB  · Warhammer  · Xfire
Image hosting 500px  · AOL Pictures  · Blipfoto  · Blingee  · Canv.as  · Camera+  · Cameroid  · DailyBooth  · Degree Confluence Project  · deviantART  · Demotivalo.net  · Flickr  · Fotoalbum.hu  · Fotolog.com  · Fotopedia  · Frontback  · Geograph Britain and Ireland  · GTF Képhost  · ImageShack  · Imgur  · Inkblazers  · Instagr.am  · Kepfeltoltes.hu  · Kephost.com  · Kephost.hu  · Kepkezelo.com  · Keptarad.hu  · Madden GIFERATOR  · MLKSHK  · Microsoft Clip Art  · Nokia Memories  · noob.hu  · Odysee  · Panoramio  · Photobucket  · Picasa  · Picplz  · PSharing  · Ptch  · puu.sh  · Rawporter  · Relay.im  · ScreenshotsDatabase.com  · Snapjoy  · Streetfiles  · Tabblo  · Trovebox  · TwitPic  · Wallbase  · Wallhaven  · Webshots  · Wikimedia Commons
Knowledge/Wikis arXiv  · Citizendium  · Clipboard.com  · Deletionpedia  · EditThis  · Encyclopedia Dramatica  · Etherpad  · Everything2  · infoAnarchy  · GeoNames  · GNUPedia  · Google Books (Google Books Ngram)  · Horror Movie Database  · Insurgency Wiki  · Knol  · Library Genesis  · Lost Media Wiki  · Neoseeker.com  · Notepad.cc  · Nupedia  · OpenCourseWare  · OpenStreetMap  · Orain  · Pastebin  · Patch.com  · Project Gutenberg  · Puella Magi  · Referata  · Resedagboken  · SongMeanings  · ShoutWiki  · The Internet Movie Database  · TropicalWikis  · Uncyclopedia  · Urban Dictionary  · Webmonkey  · Wikia  · Wikidot  · WikiHow  · Wikkii  · WikiLeaks  · Wikipedia (Simple English Wikipedia)  · Wikispaces  · Wikispot  · Wik.is  · Wiki-Site  · WikiTravel  · Word Count Journal
Magazines/Blogs/News Cyberpunkreview.com  · Game Developer Magazine  · Gigaom  · Helium  · JPG Magazine  · Polygamia.pl  · San Fransisco Bay Guardian  · Scoop  · Regretsy  · Yahoo! Voices
Microblogging Heello  · Identi.ca  · Jaiku  · Mommo.hu  · Plurk  · Sina Weibo  · Twitter  · TwitLonger
Music/Audio AOL Music  · Audimated.com  · Cinch  · digCCmixter  · Dogmazic.net  · Earbits  · exfm  · Free Music Archive  · Gogoyoko  · Indaba Music  · Instacast  · Jamendo  · Last.fm  · Music Unlimited  · MOG  · PureVolume  · Reverbnation  · ShareTheMusic  · SoundCloud  · Soundpedia  · This Is My Jam  · TuneWiki  · Twaud.io  · WinAmp
People Aaron Swartz  · Michael S. Hart  · Steve Jobs  · Mark Pilgrim  · Dennis Ritchie  · Len Sassaman Project
Protocols/Infrastructure FTP  · Gopher  · IRC  · Usenet  · World Wide Web
Q&A Askville  · Answerbag  · Answers.com  · Ask.com  · Askalo  · Baidu Knows  · Blurtit  · ChaCha  · Experts Exchange  · Formspring  · GirlsAskGuys  · Google Answers  · Google Baraza  · JustAnswer  · MetaFilter  · Quora  · Retrospring  · StackExchange  · The AnswerBank  · The Internet Oracle  · Uclue  · WikiAnswers  · Yahoo! Answers
Recipes/Food Allrecipes  · Epicurious  · Food.com  · Foodily  · Food Network  · Punchfork  · ZipList
Social bookmarking Addinto  · Backflip  · Balatarin  · BibSonomy  · Bkmrx  · Blinklist  · BlogMarks  · BookmarkSync  · CiteULike  · Connotea  · Delicious  · Designer News  · Digg  · Diigo  · Dir.eccion.es  · Evernote  · Excite Bookmark  · Faves  · Favilous  · folkd  · Freelish  · Getboo  · GiveALink.org  · Gnolia  · Google Bookmarks  · Hacker News  · HeyStaks  · IndianPad  · Kippt  · Knowledge Plaza  · Licorize  · Linkwad  · Menéame  · Microsoft Developer Network  · myVIP  · Mister Wong  · My Web  · Mylink Vault  · Newsvine  · Oneview  · Pearltrees  · Pinboard  · Pocket  · Propeller.com  · Reddit  · sabros.us  · Scloog  · Scuttle  · Simpy  · SiteBar  · Slashdot  · Squidoo  · StumbleUpon  · Twine  · Vizited  · Yummymarks  · Xmarks  · Yahoo! Buzz  · Zootool  · Zotero
Social networks Bebo  · BlackPlanet  · Classmates.com  · Cyworld  · Dogster  · Dopplr  · douban  · Ello  · Facebook  · Flixster  · FriendFeed  · Friendster  · Friends Reunited  · Gaia Online  · Google+  · Habbo  · hi5  · Hyves  · iWiW  · LinkedIn  · Miiverse  · mixi  · MyHeritage  · MyLife  · Myspace  · myVIP  · Netlog  · Odnoklassniki  · Orkut  · Plaxo  · Qzone  · Renren  · Skyrock  · Sonico.com  · Storylane  · Tagged  · tvtag  · Upcoming  · Viadeo  · Vkontakte  · WeeWorld  · Weibo  · Wretch  · Yahoo! Groups  · Yahoo! Stars India  · Yahoo! Upcoming  · more sites...
Shopping/Retail Alibaba  · AliExpress  · Amazon  · Apple Store  · eBay  · Printfection  · RadioShack  · Sears  · Target  · The Book Depository  · ThinkGeek  · Walmart
Software/code hosting Android Development  · Alioth  · Assembla  · BerliOS  · Betavine  · Bitbucket  · BountySource  · Codecademy  · CodePlex  · Freepository  · Free Software Foundation  · GNU Savannah  · GitHost  · GitHub  · GitHub Downloads  · Gitorious  · Gna!  · Google Code  · ibiblio  · java.net  · JavaForge  · KnowledgeForge  · Launchpad  · LuaForge  · Maemo  · mozdev  · OSOR.eu  · OW2 Consortium  · Openmoko  · OpenSolaris  · Ourproject.org  · Ovi Store  · Project Kenai  · RubyForge  · SEUL.org  · SourceForge  · Stypi  · TestFlight  · tigris.org  · Transifex  · TuxFamily  · Yahoo! Downloads
Torrenting/Piracy ExtraTorrent  · EZTV  · isoHunt  · KickassTorrents  · The Pirate Bay  · Torrentz
Video hosting Academic Earth  · Blip.tv  · Epic  · Google Video  · Justin.tv  · Niconico  · Nokia Trailers  · Qwiki  · Skillfeed  · Stickam  · TED Talks  · Ticker.tv  · Twitch.tv  · Ustream  · Videoplayer.hu  · Viddler  · Viddy  · Vimeo  · Vstreamers  · Yahoo! Video  · YouTube  · Famous Internet videos (Me at the zoo)
Web hosting Angelfire  · Brace.io  · BT Internet  · CableAmerica Personal Web Space  · Claranet Netherlands Personal Web Pages  · Comcast Personal Web Pages  · Extra.hu  · FortuneCity  · Free ProHosting  · GeoCities (patch)  · Google Business Sitebuilder  · Google Sites  · Internet Centrum  · MBinternet  · MSN TV  · Nwnyet  · Parodius Networking  · Prodigy.net  · Saunalahti Iso G  · Swipnet  · Telenor  · Tripod  · University of Michigan personal webpages  · Verizon Mysite  · Verizon Personal Web Space  · Webzdarma  · Virgin Media
Web applications Mailman  · MediaWiki  · phpBB  · Simple Machines Forum  · vBulletin
Other 800notes  · AOL  · Akoha  · Ancestry.com  · April Fools' Day  · Amplicate  · AutoAdmit  · Bre.ad  · Circavie  · Cobook  · Co.mments  · Countdown  · Distill  · Dmoz  · Easel  · Eircode  · Electronic Frontier Foundation  · FanFiction.Net  · Feedly  · Ficlets  · Forrst  · FunnyExam.com  · FurAffinity  · Google Helpouts  · Google Moderator  · Google Reader  · ICQmail  · IFTTT  · Jajah  · JuniorNet  · Lulu Poetry  · Mobile Phone Applications  · Mochi Media  · Mozilla Firefox  · MyBlogLog  · NBII  · Neopets  · Quantcast  · Quizilla  · Salon Table Talk  · Shutdownify  · Slidecast  · SOPA blackout pages  · starwars.yahoo.com  · TechNet  · Toshiba Support  · Volán  · Widgetbox  · Windows Technical Preview  · Wunderlist  · Zoocasa
Information A Million Ways to Die on the Web  · Backup Tips  · Cheap storage  · Collecting items randomly  · Data compression algorithms and tools  · Dev  · Discovery Data  · DOS Floppies  · Fortress of Solitude  · Keywords  · Naughty List  · Nightmare Projects  · Rescuing floppy disks  · Rescuing optical media  · Site exploration  · The WARC Ecosystem  · Working with ARCHIVE.ORG
Projects ArchiveCorps  · Audit2014  · Emularity  · Faceoff  · FlickrFckr  · Froogle  · INTERNETARCHIVE.BAK (Internet Archive Census)  · IRC Quotes  · JSMESS  · JSVLC  · Just Solve the Problem  · NewsGrabber  · Project Newsletter  · Valhalla  · Web Roasting (ISP Hosting  · University Web Hosting)  · Woohoo
Tools ArchiveBot  · ArchiveTeam Warrior (Tracker)  · Google Takeout  · HTTrack  · Video downloaders  · Wget (Lua  · WARC)
Teams Bibliotheca Anonoma  · LibreTeam  · URLTeam  · Yahoo Video Warroom  · WikiTeam
About Archive Team Introduction  · Philosophy  · Who We Are  · Our stance on robots.txt  · Why Back Up?  · Software  · Formats  · Storage Media  · Recommended Reading  · Films and documentaries about archiving  · Talks  · In The Media  · FAQ