Difference between revisions of "Integrate Software Heritage and GHTorrent (internship)"

From Software Heritage Wiki
(new internship topic, initially proposed by Georgios Gousios)
 
(add IRC nicknames)
(7 intermediate revisions by 2 users not shown)

Revision as of 10:34, 2 March 2021

Context: Software Heritage is an ambitious research project whose goal is to collect, preserve in the very long term, and share all publicly accessible Free/Open Source Software (FOSS) in source code form.

Description: Software Heritage is building the largest source code repository in existence, initially populated with all projects from GitHub. The GHTorrent project collects and archives data from the GitHub API, including issues, teams, pull requests, and commits. The purpose of this internship is to integrate the construction processes of the two datasets. The goal is to let the two projects keep updating independently while also creating a fusion point where updates from either project's database are integrated, in a streaming fashion, into a centralized, queryable archive.
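
To make the idea of a fusion point more concrete, here is a minimal sketch in Python. It is only an illustration, not a design used by either project: the `Update` record, its fields, the `"swh"` and `"ghtorrent"` source labels, and the `fuse` and `build_archive` helpers are all hypothetical. It assumes each project emits its updates already ordered by timestamp, and it stands in for a real streaming system with plain in-memory iterables.

```python
import heapq
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List


@dataclass(frozen=True)
class Update:
    """Hypothetical update record; field names are assumptions of this sketch."""
    timestamp: int  # event time, e.g. seconds since the epoch
    source: str     # which project produced the update
    repo: str       # repository identifier shared by both sources
    payload: str    # opaque description of what changed


def fuse(*streams: Iterable[Update]) -> Iterator[Update]:
    """Lazily merge streams that are each sorted by timestamp."""
    return heapq.merge(*streams, key=lambda u: u.timestamp)


def build_archive(updates: Iterable[Update]) -> Dict[str, List[Update]]:
    """Group the fused stream by repository so it can be queried by repo."""
    archive = defaultdict(list)
    for update in updates:
        archive[update.repo].append(update)
    return dict(archive)


# Each source keeps producing its own updates independently.
swh_updates = [
    Update(1, "swh", "org/project", "new revision"),
    Update(4, "swh", "org/project", "new release"),
]
ght_updates = [
    Update(2, "ghtorrent", "org/project", "issue opened"),
    Update(3, "ghtorrent", "org/other", "pull request merged"),
]

archive = build_archive(fuse(swh_updates, ght_updates))
for update in archive["org/project"]:
    print(update.timestamp, update.source, update.payload)
```

Because `heapq.merge` is lazy, neither input has to be fully loaded before the fused stream starts producing results, which is the property a streaming integration would need. In a real deployment the in-memory lists would presumably be replaced by consumers of a message broker or of database change feeds.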

Desirable skills to obtain this internship:

  • knowledge of streaming data technologies
  • familiarity with the internals of Git
  • familiarity with the GitHub API
  • working knowledge of one or more of Python, Kafka, Postgres, MySQL, and MongoDB would be a plus

Workplace: on site at Inria Paris (contact mentors for remote opportunities)

Environment: you will work shoulder to shoulder with all members of the Software Heritage team, and you will have a chance to witness, from the inside, the construction of the great library of source code.

Internship mentors:

  • Georgios Gousios <g.gousios@tudelft.nl>
  • Stefano Zacchiroli <zack@upsilon.cc> (zack on IRC)

See also