https://wiki.archiveteam.org/api.php?action=feedcontributions&user=Za3k&feedformat=atomArchiveteam - User contributions [en]2024-03-29T13:17:02ZUser contributionsMediaWiki 1.37.1https://wiki.archiveteam.org/index.php?title=User:Za3k&diff=24764User:Za3k2015-11-29T20:34:12Z<p>Za3k: </p>
<hr />
<div>A 20-something programmer in San Francisco. Tutorials on archiving and system administration are on my [https://blog.za3k.com/ blog]. A list of GitHub repositories is linked at [[Github]].</div>Za3khttps://wiki.archiveteam.org/index.php?title=User:Za3k&diff=24763User:Za3k2015-11-29T20:33:48Z<p>Za3k: About me</p>
<hr />
<div>A 20-something programmer in San Francisco. Tutorials on archiving and system administration are on my [https://blog.za3k.com/ blog]. A list of GitHub repositories is linked at [[Github]].</div>Za3khttps://wiki.archiveteam.org/index.php?title=User_talk:JesseW&diff=24761User talk:JesseW2015-11-29T18:53:30Z<p>Za3k: github summary page</p>
<hr />
<div>No, you're mistaken; the sample size of 1000 is just confusing. Two of the numbers are 4.3M (per repository) because the total for the sample of 1000 repositories is 4300M = 4.3G. <br />
<br />
I have no idea how to do communication on a wiki, is this right?</div>Za3khttps://wiki.archiveteam.org/index.php?title=GitHub&diff=24740GitHub2015-11-23T07:15:37Z<p>Za3k: Relabel to be clearer that less data is present</p>
<hr />
<div>{{Infobox project<br />
| title = GitHub<br />
| logo = GitHub_logo.png<br />
| image = GitHub 1303511667338.png<br />
| description = A screen shot of the GitHub home page taken on 2015-11-08<br />
| URL = {{url|1=https://github.com/|2=GitHub}}<br />
| project_status = {{online}}<br />
| archiving_status = {{nosavedyet}}<br />
}}<br />
<br />
:''See also [[GitHub Downloads]]''<br />
<br />
'''GitHub''' is a software repository hosting service powered by Git. It does not seem to have site issues and often reports 24-hour uptime (see [http://status.github.com/ site status]). Things look pretty sunny at the moment, but if disaster strikes, archiving the private repositories would be a problem.<br />
<br />
As of 12 August 2012: 1,963,652 people were hosting over 3,460,582 repositories; [https://github.com/search?type=Repositories&q=fork%3Atrue 1,117,147 public repositories] are forks, which greatly reduces the amount of data required to archive the site.<br />
As of 22 November 2015: there are 32,000,000 repositories, with a similar fork ratio. Back-of-the-envelope calculations suggest about 120 TB of data in git repositories.<br />
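<br />
The estimate can be sanity-checked with simple arithmetic. A minimal sketch: the ~4.3 MB average repository size is an assumption taken from a 1000-repository sample (which totalled ~4.3 GB, see the related talk-page discussion), not an official figure.<br />
<br />
```python
# Back-of-the-envelope check of the ~120 TB figure.
# Assumptions: ~4.3 MB average repository size (1000-repo sample totalling
# ~4.3 GB) and 32,000,000 repositories as of November 2015.  Decimal units.
AVG_REPO_MB = 4.3
TOTAL_REPOS = 32_000_000

total_tb = AVG_REPO_MB * TOTAL_REPOS / 1_000_000  # MB -> TB
print(f"~{total_tb:.0f} TB before discounting forks")  # ~138 TB
```
<br />
Discounting the roughly one-third of repositories that are forks brings this into the same ballpark as the ~120 TB figure above.<br />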
<br />
== Backup tools ==<br />
<br />
"git clone" is the simplest one. However, it does not get some project data that is not stored in git, including issue reports, comments, pull requests. <br />
<br />
[http://github.com/joeyh/github-backup github-backup] runs in a git repository and chases down that information,<br />
committing it to a "github" branch. It also finds the repository's forks and downloads them efficiently.<br />
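<br />
The workflow these two paragraphs describe can be sketched as follows. This is a minimal sketch, not a definitive implementation: it assumes git and github-backup are installed and on the PATH, and the repository names in the usage comment are placeholders.<br />
<br />
```python
# Sketch: back up one repository with git + github-backup.
import subprocess

def clone_url(owner: str, repo: str) -> str:
    # Public HTTPS clone URL for a GitHub repository.
    return f"https://github.com/{owner}/{repo}.git"

def backup_repo(owner: str, repo: str, dest: str) -> None:
    # Plain clone first; github-backup, run inside the clone, then commits
    # the non-git data (issues, comments, pull requests) to a "github" branch.
    subprocess.run(["git", "clone", clone_url(owner, repo), dest], check=True)
    subprocess.run(["github-backup"], cwd=dest, check=True)

# backup_repo("joeyh", "github-backup", "github-backup-clone")
```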
<br />
[http://www.githubarchive.org/ githubarchive.org] is creating an archive of the GitHub "timeline", that is, all events such as git pushes, forks, created issues, and pull requests.<br />
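<br />
The timeline is published as one gzipped JSON file per hour. A sketch of the URL scheme, based on the examples githubarchive.org publishes (the pattern is an assumption here and may change):<br />
<br />
```python
# Build the download URL for one hour of the githubarchive.org timeline.
# Assumed file-name pattern: unpadded hour, e.g. ...2015-11-22-7.json.gz
from datetime import datetime

def hourly_archive_url(ts: datetime) -> str:
    return f"http://data.githubarchive.org/{ts:%Y-%m-%d}-{ts.hour}.json.gz"

print(hourly_archive_url(datetime(2015, 11, 22, 7)))
```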
<br />
== GitHub Replacement Engines ==<br />
<br />
If we ever have to archive the data out of GitHub, it will need to be exportable to a GitHub-style engine.<br />
<br />
Currently, the best GitHub-style engine that offers a wiki, issues, and Git repository hosting, and is free and open source, is [http://gitlab.com GitLab]. It is used by and paid for by many major organizations, so it is likely to remain stable.<br />
<br />
We will need a complete migration system to move a git repository, along with all of its related GitHub service information, to GitLab.<br />
<br />
=== Things to Scrape ===<br />
<br />
In case of emergency, these are the items we need to grab.<br />
<br />
* Git Repository - Accomplished by [http://github.com/joeyh/github-backup github-backup]<br />
** Forked Repositories - Accomplished by [http://github.com/joeyh/github-backup github-backup]<br />
** '''Notes on Commits/Lines of Code''' - Not supported by github-backup yet. API support was just added for it.<br />
* '''Github Gollum Wiki''' - No tool yet, but the wiki is itself a git repository, so it can simply be cloned and then pushed to GitLab.<br />
* '''Releases''' - Tags on Github can have binaries attached. These are of high priority to archive.<br />
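<br />
For the releases item above, the REST API exposes attached binaries directly. A hedged sketch, assuming the public v3 API's endpoint and field names:<br />
<br />
```python
# Sketch: list the downloadable release assets of one repository.
import json
import urllib.request

def releases_url(owner: str, repo: str) -> str:
    return f"https://api.github.com/repos/{owner}/{repo}/releases"

def release_assets(owner: str, repo: str):
    # Each release carries an "assets" list; every asset has a direct
    # "browser_download_url" suitable for archiving alongside the tag.
    with urllib.request.urlopen(releases_url(owner, repo)) as resp:
        releases = json.load(resp)
    return [(r["tag_name"], a["browser_download_url"])
            for r in releases for a in r.get("assets", [])]

# release_assets("joeyh", "github-backup")
```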
<br />
=== List of Repositories ===<br />
<br />
A list of repositories built from GitHub API data is maintained by an Archive Team member at [https://za3k.com/github za3k.com]. The scrape runs continuously; public downloads are updated once a day. This list does not include gists.<br />
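<br />
Such a continuous scrape can page through GitHub's <code>/repositories</code> endpoint, which returns public repositories in id order and takes the last id seen as a <code>since</code> cursor. A minimal sketch (the fetch function is separated out so the paging logic stands alone):<br />
<br />
```python
# Sketch: enumerate every public repository via /repositories paging.
import json
import urllib.request

API = "https://api.github.com/repositories"

def scrape(fetch, since=0):
    """Yield repository records; fetch(url) must return a parsed JSON page."""
    while True:
        page = fetch(f"{API}?since={since}")
        if not page:
            return
        yield from page
        since = page[-1]["id"]  # resume after the last id seen

def http_fetch(url):
    # Unauthenticated requests are heavily rate-limited; a real scrape
    # needs an API token and backoff handling.
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

# for repo in scrape(http_fetch): print(repo["full_name"])
```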
<br />
=== Github Archive ===<br />
<br />
The event metadata generated by the GitHub API is archived to Google BigQuery every hour by [https://www.githubarchive.org/ GithubArchive]. <br />
<br />
Note that it does not capture events '''from before 2011''', so a targeted repository scrape may still be needed.<br />
<br />
It should at least be possible to grab all the information about a single repository using Google BigQuery's free tier, since such a query uses little processing. However, an export script for this still needs to be written.<br />
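<br />
A sketch of what such a per-repository export query could look like. The table naming below follows the public githubarchive dataset layout of the time and is an assumption here; check githubarchive.org for the current scheme.<br />
<br />
```python
# Build a legacy-SQL BigQuery query for one repository's events on one day.
# The table name ("githubarchive:day.events_YYYYMMDD") is an assumption.
def events_query(repo_full_name: str, day: str) -> str:
    table = f"githubarchive:day.events_{day}"  # day given as "YYYYMMDD"
    return (f"SELECT type, created_at, payload FROM [{table}] "
            f"WHERE repo.name = '{repo_full_name}'")

print(events_query("joeyh/github-backup", "20151122"))
```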
<br />
* Issues + Comments - Accomplished by [http://github.com/joeyh/github-backup github-backup]<br />
** '''Milestones''' - ''Github Backup currently does not archive this yet.''<br />
** '''Labels''' - ''Github Backup currently does not archive this yet.''<br />
* '''Hooks''' - Needs some kind of tool to archive Github Hooks<br />
<br />
== External links ==<br />
* {{url|1=https://github.com/|2=GitHub}}<br />
<br />
{{Navigation box}}</div>Za3k