Language and infrastructure for analyzing the archive (internship)

From Software Heritage Wiki

Latest revision as of 15:09, 4 February 2024

(see also: French version of this topic)


Context: Software Heritage is an ambitious initiative whose goal is to collect, preserve forever, and make publicly available the entire body of software, in the preferred form for making modifications to it.

Description: The Software Heritage archive is structured as a graph (specifically, a Merkle DAG) and is huge: tens of billions of nodes and hundreds of billions of edges. The graph exhibits a lot of sharing: the same source code files and directories can be reached starting from many different commits (e.g., different commits in the same repository), and the same commits can be reached starting from many different repositories (e.g., repositories that are "forks" of one another). When analyzing source code at a very large scale (e.g., all the commits of the same large repository, or even all projects hosted on GitHub), it is a waste of resources to re-analyze source code artifacts that have already been analyzed and are encountered again due to sharing in the graph.
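To make the effect of sharing concrete, here is a minimal sketch of a toy Merkle DAG in Python. Everything in it is an assumption made for illustration: the `ToyArchive` class, the `node_id` hashing scheme, and the line-counting analysis are hypothetical and do not reflect the actual Software Heritage data model or API. The point is only that, when node identifiers are derived from content, a result computed once for a shared subgraph can be reused whenever that subgraph is reached again.

```python
import hashlib


def node_id(kind: str, payload: bytes, children: tuple[str, ...]) -> str:
    """Hypothetical content-derived identifier: hash of kind, payload and child ids."""
    h = hashlib.sha1()
    h.update(kind.encode() + b"\0" + payload)
    for child in children:
        h.update(b"\0" + child.encode())
    return h.hexdigest()


class ToyArchive:
    """In-memory stand-in for the archive graph (illustration only)."""

    def __init__(self):
        self.nodes = {}  # id -> (kind, payload, children)

    def add(self, kind, payload=b"", children=()):
        children = tuple(children)
        nid = node_id(kind, payload, children)
        self.nodes[nid] = (kind, payload, children)  # identical content -> same id
        return nid


def count_lines(archive, root, cache, stats):
    """Count lines in all files reachable from root, reusing cached sub-results."""
    if root in cache:
        return cache[root]  # shared node: already analyzed, no work needed
    stats["analyzed"] += 1
    kind, payload, children = archive.nodes[root]
    total = payload.count(b"\n") if kind == "file" else 0
    total += sum(count_lines(archive, c, cache, stats) for c in children)
    cache[root] = total
    return total


archive = ToyArchive()
a = archive.add("file", b"x\ny\n")
b = archive.add("file", b"z\n")
c = archive.add("file", b"1\n2\n3\n")
src = archive.add("dir", children=[a, b])        # directory shared by both commits
root1 = archive.add("dir", children=[src])
root2 = archive.add("dir", children=[src, c])
commit1 = archive.add("commit", b"first", [root1])
commit2 = archive.add("commit", b"second", [root2])

cache, stats = {}, {"analyzed": 0}
lines1 = count_lines(archive, commit1, cache, stats)
lines2 = count_lines(archive, commit2, cache, stats)
print(lines1, lines2, stats["analyzed"])
```

In this toy example the second commit reuses the result for the shared `src` directory, so 8 nodes are analyzed in total instead of the 11 that two independent traversals would visit.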

The goal of this internship is to design and implement a prototype platform (similar in spirit to Boa) that allows users to describe empirical experiments to be run on the Software Heritage archive, exploiting artifact sharing as a way to speed up the analysis. The platform will consist of a simple language to describe experiments (e.g., "start from these repositories and run this script on all files in each commit") and a runtime implementing the language that transparently handles caching of previous results. As a stretch goal, the runtime will delegate the actual computation to multiple workers, running either on a single machine or distributed over a cluster.
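As a rough idea of what such a platform could look like, the following Python sketch pairs a declarative experiment description with a runtime that caches per-file results. It is not a proposed design: the `Experiment` and `Runtime` classes, the cache key (experiment name plus file identifier), and the toy `REPOS`, `COMMITS` and `FILES` tables are all hypothetical, and the distributed-workers stretch goal is left out.

```python
from dataclasses import dataclass, field
from typing import Callable

# Toy data invented for illustration: repositories -> commits -> files -> content.
FILES = {
    "f1": "import os\n",
    "f2": "print('hi')\n",
    "f3": "import sys\nimport os\n",
}
COMMITS = {"c1": ["f1", "f2"], "c2": ["f1", "f3"]}
REPOS = {"repo/a": ["c1", "c2"], "repo/fork-of-a": ["c1"]}


@dataclass(frozen=True)
class Experiment:
    """'Start from these repositories and run this function on all files in each commit.'"""
    name: str
    origins: tuple[str, ...]
    per_file: Callable[[str], int]


@dataclass
class Runtime:
    """Runs experiments, transparently caching results per (experiment, file)."""
    cache: dict = field(default_factory=dict)
    calls: int = 0  # number of times the user function was actually executed

    def run(self, exp: Experiment) -> dict:
        results = {}
        for repo in exp.origins:
            for commit in REPOS[repo]:
                results[(repo, commit)] = sum(
                    self._analyze(exp, file_id) for file_id in COMMITS[commit]
                )
        return results

    def _analyze(self, exp: Experiment, file_id: str) -> int:
        key = (exp.name, file_id)
        if key not in self.cache:  # only compute for files not seen before
            self.calls += 1
            self.cache[key] = exp.per_file(FILES[file_id])
        return self.cache[key]


def count_imports(source: str) -> int:
    """Example user script: count lines that start with 'import '."""
    return sum(line.startswith("import ") for line in source.splitlines())


exp = Experiment("count-imports", ("repo/a", "repo/fork-of-a"), count_imports)
runtime = Runtime()
results = runtime.run(exp)
print(results, runtime.calls)
```

Here the experiment visits six (commit, file) pairs across a repository and its fork, but the user function runs only three times, once per distinct file. A real runtime would additionally need persistent storage for the cache and a way to invalidate results when the analysis itself changes.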

If the platform is successfully implemented, the internship will conclude with a demonstration (e.g., in the form of a paper) that benchmarks, in practice, the performance advantages of the proposed approach over a naive implementation.

Desirable skills to obtain this internship:

  • Python development
  • experience with functional programming

The following will be considered a plus:

  • experience with programming language theory and implementation
  • experience with the MapReduce programming model

Workplace: on site at Inria Paris (contact mentors for remote opportunities)

Environment: you will work shoulder to shoulder with all members of the Software Heritage team, and you will have a chance to witness from within the construction of the great library of source code.

Internship mentors:

  • Stefano Zacchiroli <zack@upsilon.cc> (Zack on Matrix)
