Difference between revisions of "GitHub"

From Archiveteam
(Relabel to be clearer that less data is present)
(3 intermediate revisions by 3 users not shown)
Line 3: Line 3:
| logo = GitHub_logo.png
| logo = GitHub_logo.png
| image = GitHub 1303511667338.png
| image = GitHub 1303511667338.png
| description = A screen shot of the GitHub home page taken on 07 August 2014.
| description = A screen shot of the GitHub home page taken on 2015-11-08
| URL = {{url|1=https://github.com/|2=GitHub}}
| URL = {{url|1=https://github.com/|2=GitHub}}
| project_status = {{online}}
| project_status = {{online}}
Line 14: Line 14:


As of 12th August 2012: 1,963,652 people hosting over 3,460,582 repositories [https://github.com/search?type=Repositories&q=fork%3Atrue 1,117,147 public repositories] are forks, which greatly reduces the amount of data required to archive it.
As of 12th August 2012: 1,963,652 people hosting over 3,460,582 repositories [https://github.com/search?type=Repositories&q=fork%3Atrue 1,117,147 public repositories] are forks, which greatly reduces the amount of data required to archive it.
As of 22 November 2015: There are 32,000,000 repositories, with a similar fork ratio. Back-of-the-envelope calculations suggest 120TB of data in git repositories.


== Backup tools ==
== Backup tools ==
Line 41: Line 42:
* '''Github Gollum Wiki''' - No tool yet, but just clone the whole thing, and then push it to GitLab.
* '''Github Gollum Wiki''' - No tool yet, but just clone the whole thing, and then push it to GitLab.
* '''Releases''' - Tags on Github can have binaries attached. These are of high priority to archive.
* '''Releases''' - Tags on Github can have binaries attached. These are of high priority to archive.
=== List of Repositories ===
A list of repositories from Github API data are maintained by an archive team member at [https://za3k.com/github za3k.com]. It scrapes continuously. Public downloads are updated once a day. This list does not include gists.


=== Github Archive ===
=== Github Archive ===
Line 48: Line 53:
It obviously doesn't grab events '''dating before 2011''', so a targeted repository scrape may still be ideal.
It obviously doesn't grab events '''dating before 2011''', so a targeted repository scrape may still be ideal.


But at least it could be possible to grab all info about a single repository using Google BigQuery's free version, which would be lower bandwidth. However, we need to create such an export script for it when the time comes.
But at least it could be possible to grab all info about a single repository using Google BigQuery's free version, since it would use a low amount of CPU power. However, we need to create such an export script for it when the time comes.


* Issues + Comments - Accomplished by [http://github.com/joeyh/github-backup github-backup]
* Issues + Comments - Accomplished by [http://github.com/joeyh/github-backup github-backup]

Revision as of 07:15, 23 November 2015

GitHub
GitHub logo
A screen shot of the GitHub home page taken on 2015-11-08
URL GitHub (https://github.com/)
Status Online!
Archiving status Not saved yet
Archiving type Unknown
IRC channel #archiveteam-bs (on hackint)
See also GitHub Downloads

GitHub is a software repository hosting service powered by Git. It does not seem to have any site issues and is usually up around the clock (see site status). Things look pretty sunny at the moment, but if disaster strikes, archiving the private repositories would be a problem.

As of 12 August 2012: 1,963,652 people were hosting over 3,460,582 repositories. 1,117,147 public repositories are forks, which greatly reduces the amount of data required to archive the site. As of 22 November 2015: there are 32,000,000 repositories, with a similar fork ratio. Back-of-the-envelope calculations suggest 120 TB of data in git repositories.
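The figures above can be turned into a rough worked estimate. This is only a sketch: it assumes the 2012 fork count can be divided by the 2012 repository total even though one figure is labelled "public" and the other is not, that the 2015 fork ratio really is the same, and that "120 TB" means decimal terabytes covering all 32,000,000 repositories. None of those assumptions is confirmed by the text.

```python
# Figures quoted in the paragraph above.
REPOS_2012 = 3_460_582
PUBLIC_FORKS_2012 = 1_117_147
REPOS_2015 = 32_000_000
TOTAL_BYTES_2015 = 120 * 10**12  # assumption: 120 TB in decimal terabytes


def fork_ratio(forks, repos):
    """Share of repositories that are forks."""
    return forks / repos


# Roughly a third of the 2012 repositories were forks.
ratio_2012 = fork_ratio(PUBLIC_FORKS_2012, REPOS_2012)

# Assumption: the 2015 fork ratio equals the 2012 one ("a similar fork ratio").
non_fork_repos_2015 = round(REPOS_2015 * (1 - ratio_2012))

# Average size per repository if the 120 TB covers every repository.
avg_mb_per_repo = TOTAL_BYTES_2015 / REPOS_2015 / 10**6

print(f"fork ratio (2012): {ratio_2012:.1%}")
print(f"estimated non-fork repositories (2015): {non_fork_repos_2015:,}")
print(f"average size per repository: {avg_mb_per_repo} MB")
```

Under these assumptions the average works out to 3.75 MB per repository, which is small enough that the total is dominated by repository count rather than by a few large projects.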

Backup tools

"git clone" is the simplest one. However, it does not get some project data that is not stored in git, including issue reports, comments, pull requests.

github-backup runs in a git repository and chases down that information, committing it to a "github" branch. It also chases down the forks and efficiently downloads them as well.

githubarchive.org is creating an archive of the GitHub "timeline", that is, all events such as git pushes, forks, created issues, pull requests, and so on.
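The simplest of the tools above, git clone, is easy to script. The sketch below uses `git clone --mirror`, which copies all refs rather than only the default branch. The helper names `mirror_clone_command` and `mirror_clone` are hypothetical, and the example still captures none of the issue, comment, or pull request discussion data described above.

```python
import subprocess


def mirror_clone_command(owner, repo, dest):
    """Build the git command line for a bare mirror clone (hypothetical helper)."""
    url = f"https://github.com/{owner}/{repo}.git"
    return ["git", "clone", "--mirror", url, dest]


def mirror_clone(owner, repo, dest):
    """Run the clone; raises CalledProcessError if git exits with an error."""
    subprocess.run(mirror_clone_command(owner, repo, dest), check=True)
```

Calling `mirror_clone("joeyh", "github-backup", "github-backup.git")` would need git installed and network access; building the command separately keeps the logic testable without either.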

Github Replacement Engines

If we ever have to archive the data out of Github, the data will need to be exportable to a Github-style engine.

Currently, the best GitHub-style engine that offers a wiki, issues, and git repository hosting, and that is free and open source, is GitLab. It is used and paid for by many major organizations, so it is likely to live on in a stable way.

We will need a complete migration system to move a git repository and all related Github service information of a repository to GitLab.

Things to Scrape

In case of emergency, these are the items we need to grab.

  • Git Repository - Accomplished by github-backup
    • Forked Repositories - Accomplished by github-backup
    • Notes on Commits/Lines of Code - Not supported by github-backup yet. API support was just added for it.
  • Github Gollum Wiki - No tool yet, but just clone the whole thing, and then push it to GitLab.
  • Releases - Tags on Github can have binaries attached. These are of high priority to archive.
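Two of the items in the list above can be sketched without a dedicated tool. GitHub serves each project wiki as a separate git repository whose URL ends in `.wiki.git`, and the releases endpoint of the GitHub API returns a JSON list in which each release carries an `assets` array with a `browser_download_url` per attached binary. The helper names below are hypothetical, and the code works on an already-downloaded JSON response rather than fetching it.

```python
def wiki_clone_url(owner, repo):
    """URL of the git repository that backs a project's wiki."""
    return f"https://github.com/{owner}/{repo}.wiki.git"


def release_asset_urls(releases):
    """Flatten a parsed releases response into (tag, asset name, download URL).

    `releases` is the decoded JSON list from
    GET https://api.github.com/repos/{owner}/{repo}/releases
    """
    return [
        (release["tag_name"], asset["name"], asset["browser_download_url"])
        for release in releases
        for asset in release.get("assets", [])
    ]
```

The wiki URL can be handed to an ordinary `git clone`; the asset tuples give a download list for the attached binaries, which the text marks as high priority.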

List of Repositories

A list of repositories built from GitHub API data is maintained by an Archive Team member at za3k.com. It scrapes continuously. Public downloads are updated once a day. This list does not include gists.
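How the za3k.com scraper works is not described here, so the following is only a sketch of one way to enumerate repositories. It uses the GitHub API endpoint `GET /repositories`, which lists public repositories in ID order and takes a `since` parameter holding the last repository ID already seen. The function names are hypothetical, and a real run would be limited by the API rate limits and would likely need authentication.

```python
import json
import urllib.request

API = "https://api.github.com/repositories"


def fetch_page(since):
    """Fetch one page of public repositories with an ID greater than `since`."""
    with urllib.request.urlopen(f"{API}?since={since}") as response:
        return json.load(response)


def iter_public_repos(fetch=fetch_page, since=0, max_pages=None):
    """Yield (id, full_name, is_fork) for each public repository.

    `fetch` is injectable so the paging logic can be exercised offline.
    """
    pages = 0
    while max_pages is None or pages < max_pages:
        page = fetch(since)
        if not page:
            return  # an empty page means the listing is exhausted
        for repo in page:
            yield repo["id"], repo["full_name"], repo["fork"]
        since = page[-1]["id"]  # resume after the last ID seen
        pages += 1
```

Recording the `fork` flag alongside each name makes it possible to skip forks later, which is where most of the storage saving mentioned earlier comes from.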

Github Archive

The metadata generated by the Github API are archived to Google BigQuery every hour by GithubArchive.

It obviously doesn't grab events dating before 2011, so a targeted repository scrape may still be ideal.

Still, it should be possible to grab all the information about a single repository using the free version of Google BigQuery, since such a query would use little CPU power. No export script exists yet, so one will need to be written when the time comes.
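No BigQuery export script is given here, and the sketch below does not use BigQuery. It swaps in a different approach: GithubArchive also publishes its data as hourly gzip-compressed files with one JSON event per line, and those can be filtered locally for a single repository. The function name is hypothetical, and the `repo.name` field is an assumption that holds for events in the current GitHub API event format; older archive files use a different layout.

```python
import gzip
import io
import json


def events_for_repo(gz_bytes, full_name):
    """Yield events for one repository from a gzip newline-delimited JSON dump.

    `gz_bytes` is the raw content of one hourly archive file and
    `full_name` has the form "owner/repo".
    """
    with gzip.open(io.BytesIO(gz_bytes), "rt", encoding="utf-8") as lines:
        for line in lines:
            if not line.strip():
                continue  # tolerate blank lines
            event = json.loads(line)
            # Assumption: events carry {"repo": {"name": "owner/repo"}}.
            if event.get("repo", {}).get("name") == full_name:
                yield event
```

This trades bandwidth for simplicity: every hourly file has to be downloaded in full, whereas a BigQuery query would return only the matching rows.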

  • Issues + Comments - Accomplished by github-backup
    • Milestones - github-backup does not archive these yet.
    • Labels - github-backup does not archive these yet.
  • Hooks - Needs some kind of tool to archive Github Hooks
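For the items in the list above that have no tool yet, the GitHub API exposes per-repository endpoints for milestones, labels, and hooks. The sketch below only builds the request URLs; the helper name `metadata_urls` is hypothetical. Listing hooks requires authentication with admin rights on the repository, so hooks are generally not retrievable by an outside archiver.

```python
API_ROOT = "https://api.github.com"


def metadata_urls(owner, repo):
    """Build API URLs for repository metadata that is not stored in git."""
    base = f"{API_ROOT}/repos/{owner}/{repo}"
    return {
        # state=all includes closed milestones, not only open ones.
        "milestones": f"{base}/milestones?state=all&per_page=100",
        "labels": f"{base}/labels?per_page=100",
        # Needs admin access to the repository.
        "hooks": f"{base}/hooks?per_page=100",
    }
```

Each URL returns a JSON list that could be stored next to the cloned repository, in the same spirit as the "github" branch that github-backup writes.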

External links