Language and infrastructure for analyzing the archive (internship)

From Software Heritage Wiki
Revision as of 11:37, 23 September 2021 by StefanoZacchiroli (talk | contribs) (mark internship as available)

(see also: French version of this internship topic)


Context: Software Heritage is an ambitious research project whose goal is to collect, preserve in the very long term, and share all publicly accessible Free/Open Source Software (FOSS) in source code form.

Description: The Software Heritage archive is structured as a graph (specifically, a Merkle DAG) and is huge: tens of billions of nodes, hundreds of billions of edges. The graph exhibits a lot of sharing: the same source code files and directories can be reached starting from many different commits (e.g., different commits in the same repository), and the same commits can be reached starting from many different repositories (e.g., repositories that are "forks" of one another). When analyzing source code at a very large scale (e.g., all the commits of the same large repository, or even all projects hosted on GitHub), it is pointless and a waste of resources to re-analyze source code artifacts that have already been analyzed and are encountered again due to sharing in the graph.
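To make the effect of sharing concrete, here is a minimal sketch in Python of a toy Merkle DAG, where each node identifier is a hash of the node's content and of its children's identifiers. The names `node_id`, `ToyGraph` and `count_visits` are hypothetical and invented for this illustration; they are not part of any Software Heritage API, and the hashing scheme is a simplification, not the one used by the archive.

```python
import hashlib


def node_id(kind: str, payload: bytes, children: tuple = ()) -> str:
    """Toy Merkle identifier: hash of the node kind, payload and child ids."""
    h = hashlib.sha1()
    h.update(kind.encode() + b"\0" + payload + b"\0")
    for child in children:
        h.update(child.encode())
    return h.hexdigest()


class ToyGraph:
    """Hypothetical in-memory DAG; identical content yields the same node."""

    def __init__(self):
        self.children = {}  # node id -> tuple of child ids

    def add(self, kind, payload=b"", children=()):
        nid = node_id(kind, payload, tuple(children))
        self.children[nid] = tuple(children)
        return nid


def count_visits(graph, roots, deduplicate):
    """Count node visits from the given roots, optionally skipping seen nodes."""
    seen = set()
    visits = 0

    def visit(nid):
        nonlocal visits
        if deduplicate and nid in seen:
            return  # shared artifact: already handled, skip the whole subgraph
        seen.add(nid)
        visits += 1
        for child in graph.children[nid]:
            visit(child)

    for root in roots:
        visit(root)
    return visits


g = ToyGraph()
file_a = g.add("file", b"print('a')")
file_b = g.add("file", b"print('b')")
file_a2 = g.add("file", b"print('a, v2')")
dir_1 = g.add("dir", children=(file_a, file_b))
dir_2 = g.add("dir", children=(file_a2, file_b))  # file_b is shared
commit_1 = g.add("commit", b"initial import", (dir_1,))
commit_2 = g.add("commit", b"update a", (dir_2, commit_1))  # parent: commit_1

naive = count_visits(g, [commit_1, commit_2], deduplicate=False)
shared = count_visits(g, [commit_1, commit_2], deduplicate=True)
print(naive, shared)  # 12 visits without deduplication, 7 with it
```

Even in this tiny graph of 7 nodes, a traversal that ignores sharing performs 12 visits; the gap grows quickly as more commits and forks reference the same files and directories.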

The goal of this internship is to design and implement a prototype platform (similar in spirit to Boa) that allows users to describe empirical experiments to be run on the Software Heritage archive, exploiting artifact sharing as a way to speed up the analysis. The platform will consist of a simple language to describe experiments (e.g., "start from these repositories and run this script on all files in each commit") and of a runtime implementing the language that transparently handles caching of previous results. As a stretch goal, the runtime will delegate actual compute to multiple workers, running either on a single machine or distributed over a cluster.
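As an illustration only, the kind of experiment quoted above ("start from these repositories and run this script on all files in each commit") could be sketched as follows in Python. Everything here is an assumption made for the example: `ToyArchive`, `CachingRuntime`, `run_on_files` and `count_lines` are hypothetical names, the archive is a small in-memory stand-in, and the actual language and runtime design are precisely what the internship is meant to define.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class ToyArchive:
    """Hypothetical in-memory stand-in for the archive."""
    repos: Dict[str, List[str]]    # repository name -> commit ids
    commits: Dict[str, List[str]]  # commit id -> file ids
    files: Dict[str, bytes]        # file id -> file content


@dataclass
class CachingRuntime:
    """Runs an analysis on every file of every commit, caching per file id."""
    archive: ToyArchive
    cache: Dict[Tuple[str, str], object] = field(default_factory=dict)
    calls: int = 0  # number of times an analysis was actually executed

    def run_on_files(self, repos: List[str], analysis: Callable[[bytes], object]):
        results = {}
        for repo in repos:
            for commit in self.archive.repos[repo]:
                for fid in self.archive.commits[commit]:
                    # Assumption: the function name identifies the analysis.
                    key = (analysis.__qualname__, fid)
                    if key not in self.cache:
                        self.cache[key] = analysis(self.archive.files[fid])
                        self.calls += 1
                    results[(repo, commit, fid)] = self.cache[key]
        return results


def count_lines(content: bytes) -> int:
    """Example 'script': count newline characters in a file."""
    return content.count(b"\n")


archive = ToyArchive(
    repos={"orig": ["c1", "c2"], "fork": ["c1", "c2", "c3"]},
    commits={"c1": ["f1", "f2"], "c2": ["f1", "f3"], "c3": ["f1", "f3", "f4"]},
    files={"f1": b"a\nb\n", "f2": b"x\n", "f3": b"x\ny\nz\n", "f4": b""},
)
runtime = CachingRuntime(archive)
results = runtime.run_on_files(["orig", "fork"], count_lines)
print(len(results), runtime.calls)  # 11 results from only 4 analysis runs
```

In this sketch the cache is keyed on the file identifier, so a file shared between commits or between a repository and its fork is analyzed once. A real runtime would also need persistent storage for the cache and a way to dispatch the analysis calls to workers, which this example does not attempt.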

If the platform is successfully implemented, the internship will conclude with a demonstration (e.g., in the form of a paper) that benchmarks the practical performance advantages of the proposed approach over a naive implementation.

Desirable skills to obtain this internship:

  • Python development
  • experience with functional programming

The following will be considered a plus:

  • experience with programming language theory and implementation
  • experience with the MapReduce programming model

Workplace: on site at Inria Paris (contact mentors for remote opportunities)

Environment: you will work shoulder to shoulder with all members of the Software Heritage team, and you will have a chance to witness from within the construction of the great library of source code.

Internship mentors:

  • Antoine Pietri (seirl on IRC)
  • Stefano Zacchiroli <zack@upsilon.cc> (zack on IRC)

See also