Difference between revisions of "Language and infrastructure for analyzing the archive (internship)"

Revision as of 10:48, 16 December 2020

Context: Software Heritage is an ambitious research project whose goal is to collect, preserve in the very long term, and share all publicly accessible Free/Open Source Software (FOSS) in source code form.

Description: The Software Heritage archive is structured as a graph (specifically, a Merkle DAG) and is huge: tens of billions of nodes, hundreds of billions of edges. The graph exhibits a lot of sharing: the same source code files and directories can be reached starting from many different commits (e.g., different commits in the same repository), and the same commits can be reached starting from many different repositories (e.g., repositories that are "forks" of one another). When analyzing source code at a very large scale (e.g., all the commits of the same large repository, or even all projects hosted on GitHub), it is pointless, and a waste of resources, to re-analyze source code artifacts that have already been analyzed and are encountered again later because of sharing in the graph.
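The sharing described above can be illustrated with a minimal sketch of a content-addressed graph. Everything here is a toy assumption made for illustration: the `node_id` hashing scheme, the `ToyArchive` class, and the simplification that a revision points only to its root directory are hypothetical and do not reflect the actual Software Heritage data model or identifiers.

```python
import hashlib


def node_id(kind, payload, children=()):
    """Hypothetical identifier: a hash of the node kind, payload, and child ids."""
    h = hashlib.sha1()
    h.update(kind.encode())
    h.update(b"\0")
    h.update(payload)
    for child in children:
        h.update(b"\0")
        h.update(child.encode())
    return h.hexdigest()


class ToyArchive:
    """In-memory toy Merkle DAG: identical nodes always get the same id."""

    def __init__(self):
        self.children = {}  # node id -> tuple of child ids

    def add(self, kind, payload, children=()):
        nid = node_id(kind, payload, tuple(children))
        self.children[nid] = tuple(children)
        return nid

    def count_visits(self, root):
        """Node visits made by a naive traversal that never deduplicates."""
        return 1 + sum(self.count_visits(c) for c in self.children[root])

    def reachable(self, roots):
        """Set of distinct nodes reachable from the given roots."""
        seen, stack = set(), list(roots)
        while stack:
            nid = stack.pop()
            if nid not in seen:
                seen.add(nid)
                stack.extend(self.children[nid])
        return seen


archive = ToyArchive()
readme = archive.add("content", b"hello")
main_v1 = archive.add("content", b"print(1)")
main_v2 = archive.add("content", b"print(2)")
dir_v1 = archive.add("directory", b"", [readme, main_v1])
dir_v2 = archive.add("directory", b"", [readme, main_v2])
commit_1 = archive.add("revision", b"first", [dir_v1])
commit_2 = archive.add("revision", b"second", [dir_v2])

# The unchanged README is stored once but reached from both commits.
naive_visits = archive.count_visits(commit_1) + archive.count_visits(commit_2)
unique_nodes = archive.reachable([commit_1, commit_2])
print(naive_visits, len(unique_nodes))
```

In this toy setting the gap between naive visits and distinct nodes is small; the point of the sketch is only that the gap grows with every commit or fork that reuses existing files and directories.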

The goal of this internship is to design and implement a prototype platform (similar in spirit to Boa) that allows users to describe empirical experiments to be run on the Software Heritage archive, exploiting artifact sharing as a way to speed up the analysis. The platform will consist of a simple language to describe experiments (e.g., "start from these repositories and run this script on all files in each commit") and of a runtime implementing the language that transparently handles caching of previous results. As a stretch goal, the runtime will delegate the actual computation to multiple workers, running either on a single machine or distributed over a cluster.
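As a rough illustration of what "transparently handles caching" could mean, the following sketch runs a script on all files in each commit of some repositories and reuses results for files already seen. The graph layout, the identifiers, and the `CachingRuntime` class are all hypothetical; this is not the platform's design, only one possible reading of the example experiment quoted above.

```python
# Hypothetical toy graph: repositories -> commits -> files.
GRAPH = {
    "repo:A": ["commit:1", "commit:2"],
    "repo:B": ["commit:2"],  # a "fork" sharing commit:2 with repo:A
    "commit:1": ["file:readme", "file:main_v1"],
    "commit:2": ["file:readme", "file:main_v2"],
}
FILES = {
    "file:readme": "hello\n",
    "file:main_v1": "a\nb\n",
    "file:main_v2": "a\nb\nc\n",
}


class CachingRuntime:
    """Runs a script on every file of every commit, analyzing each file once."""

    def __init__(self, script):
        self.script = script
        self.cache = {}  # file id -> cached script result
        self.calls = 0   # how many times the script actually ran

    def analyze_file(self, file_id):
        if file_id not in self.cache:
            self.calls += 1
            self.cache[file_id] = self.script(FILES[file_id])
        return self.cache[file_id]

    def run(self, repos):
        results = {}
        for repo in repos:
            for commit in GRAPH[repo]:
                if commit in results:
                    continue  # commit already handled via another repository
                results[commit] = {
                    f: self.analyze_file(f) for f in GRAPH[commit]
                }
        return results


def count_lines(text):
    """Example experiment script: number of lines in a file."""
    return len(text.splitlines())


runtime = CachingRuntime(count_lines)
results = runtime.run(["repo:A", "repo:B"])
print(results)
print("script invocations:", runtime.calls)
```

A naive runtime would invoke the script six times here (four files for repo:A, two for repo:B), while the cached version invokes it three times, once per distinct file. Keying the cache on content identifiers is what would make such reuse safe in a Merkle DAG, since equal identifiers imply equal content.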

If the platform is successfully implemented, the internship will conclude with a demonstration (e.g., in the form of a paper) that benchmarks the practical performance advantages of the proposed approach over a naive implementation.

Desirable skills to obtain this internship:

Python development
experience with functional programming

Will be considered a plus:

experience with programming language theory and implementation
experience with the MapReduce programming model

Workplace: on site at Inria Paris (contact mentors for remote opportunities)

Environment: you will work shoulder to shoulder with all members of the Software Heritage team, and you will have a chance to witness from within the construction of the great library of source code.

Internship mentors:

Antoine Pietri
Stefano Zacchiroli (main contact for this internship: zack@upsilon.cc)

Revision as of 10:46, 16 December 2020, by StefanoZacchiroli (minor edit): moved page Language and runtime for efficient Software Heritage analysis (internship) to Language and infrastructure for analyzing the Software Heritage archive without leaving a redirect

Revision as of 10:48, 16 December 2020, by StefanoZacchiroli (minor edit): moved page Language and infrastructure for analyzing the Software Heritage archive to Language and infrastructure for analyzing the Software Heritage archive (internship)
(No difference)