Integrate Software Heritage and GHTorrent (internship)
Revision as of 17:17, 16 February 2020
Context: Software Heritage is an ambitious initiative whose goal is to collect, preserve forever, and make publicly available the entire body of software, in the preferred form for making modifications to it.
Description: Software Heritage is building the largest source code repository in existence, initially populated with all projects from GitHub. The GHTorrent project collects and archives data from the GitHub API, including issues, teams, pull requests and commits. The purpose of this internship is to integrate the processes that build the two datasets. The goal is to let the two projects be updated independently while also creating a fusion point where updates from either project's database are integrated, in a streaming fashion, into a centralized, queryable archive.
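As a rough illustration of the fusion point described above, the following Python sketch merges two independently produced, timestamp-ordered update streams into a single archive keyed by repository. It is only a toy model under stated assumptions: the record fields (ts, source, repo, payload), the function names merge_updates and integrate, and the sample data are all hypothetical and do not reflect the actual Software Heritage or GHTorrent schemas or pipelines. A real integration would more likely read from a message queue such as Kafka and write to a database.

```python
import heapq
from collections import defaultdict


def merge_updates(swh_updates, ght_updates):
    """Lazily merge two timestamp-ordered streams into one ordered stream.

    Both inputs are assumed (hypothetically) to be iterables of dicts
    that are already sorted by their "ts" field.
    """
    return heapq.merge(swh_updates, ght_updates, key=lambda u: u["ts"])


def integrate(stream):
    """Fold a merged update stream into an in-memory archive keyed by repo."""
    archive = defaultdict(lambda: {"revisions": [], "metadata": []})
    for update in stream:
        # Source code updates and GitHub API metadata are kept side by side.
        kind = "revisions" if update["source"] == "swh" else "metadata"
        archive[update["repo"]][kind].append(update["payload"])
    return dict(archive)


# Hypothetical sample updates; each producer emits independently.
swh = [
    {"ts": 1, "source": "swh", "repo": "octocat/hello", "payload": "rev a1"},
    {"ts": 4, "source": "swh", "repo": "octocat/hello", "payload": "rev b2"},
]
ght = [
    {"ts": 2, "source": "ght", "repo": "octocat/hello", "payload": "issue 1 opened"},
    {"ts": 3, "source": "ght", "repo": "octocat/hello", "payload": "pull request 2 opened"},
]

archive = integrate(merge_updates(swh, ght))
print(archive["octocat/hello"])
```

Because heapq.merge consumes its inputs lazily, neither stream has to be fully loaded before integration starts, which is the property a streaming design relies on.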
Desirable skills to obtain this internship:
- knowledge of streaming data technologies
- familiarity with the internals of Git
- familiarity with the GitHub API
- working knowledge of one or more of Python, Kafka, Postgres, MySQL, and MongoDB would be a plus
Workplace: on site at Inria Paris (contact mentors for remote opportunities)
Environment: you will work shoulder to shoulder with all members of the Software Heritage team, and you will have a chance to witness from within the construction of the great library of source code.
Internship mentors:
- Georgios Gousios <g.gousios@tudelft.nl>
- Stefano Zacchiroli <zack@upsilon.cc>