Fine-grained tracking of source code provenance (internship)

From Software Heritage Wiki
Revision as of 10:33, 2 March 2021 by StefanoZacchiroli (talk | contribs) (add IRC nicknames)

Context: Software Heritage is an ambitious research project whose goal is to collect, preserve in the very long term, and share all publicly accessible Free/Open Source Software (FOSS) in source code form.

Description: Software Heritage is the largest existing public archive of software source code. It also keeps track of where and when source code files have been observed in the wild: given the checksum of a source code file and the current state of the archive, one can produce a list of all the places where that file has been publicly distributed in the past. The goal of this internship is to experiment with increasing the granularity at which source code can be tracked, from entire files (the current solution) to code snippets and/or individual lines of code. Different techniques will be explored, implemented, and benchmarked on subsets of the archive to estimate their viability.
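To make the difference between file-level and finer-grained tracking concrete, here is a minimal sketch in Python. It is not the Software Heritage implementation: the `ProvenanceIndex` class, its method names, the choice of SHA-1, the whitespace normalization, and the three-line window are all assumptions made for illustration. It indexes each file both by a checksum of its whole content and by checksums of every run of consecutive non-blank lines, so that a copied function can still be traced after the surrounding file has changed.

```python
import hashlib
from collections import defaultdict


def file_checksum(content: str) -> str:
    """Hash an entire file; any change to the file yields a different digest."""
    return hashlib.sha1(content.encode("utf-8")).hexdigest()


def snippet_checksums(content: str, window: int = 3) -> set:
    """Hash every run of `window` consecutive non-blank lines.

    Lines are stripped of leading/trailing whitespace first (an assumed,
    deliberately simple normalization), so re-indented copies still match.
    """
    lines = [line.strip() for line in content.splitlines() if line.strip()]
    return {
        hashlib.sha1("\n".join(lines[i:i + window]).encode("utf-8")).hexdigest()
        for i in range(len(lines) - window + 1)
    }


class ProvenanceIndex:
    """Hypothetical in-memory index from checksums to (origin, path) pairs."""

    def __init__(self, window: int = 3):
        self.window = window
        self.by_file = defaultdict(set)
        self.by_snippet = defaultdict(set)

    def add(self, origin: str, path: str, content: str) -> None:
        """Record that `content` was observed at `path` in `origin`."""
        place = (origin, path)
        self.by_file[file_checksum(content)].add(place)
        for digest in snippet_checksums(content, self.window):
            self.by_snippet[digest].add(place)

    def find_file(self, content: str) -> set:
        """File-level lookup: places where this exact file was observed."""
        return set(self.by_file.get(file_checksum(content), set()))

    def find_snippets(self, content: str) -> set:
        """Snippet-level lookup: places sharing at least one line window."""
        places = set()
        for digest in snippet_checksums(content, self.window):
            places |= self.by_snippet.get(digest, set())
        return places


# Example: a helper copied into another file with an extra comment on top.
index = ProvenanceIndex(window=3)
original = "def add(a, b):\n    total = a + b\n    return total\n"
index.add("https://example.org/project-a", "util.py", original)

modified = "# copied helper\n" + original
exact_matches = index.find_file(modified)        # empty set: the file checksum changed
snippet_matches = index.find_snippets(modified)  # still finds project-a's util.py
```

In this toy setup the snippet index grows with the number of lines rather than the number of files, which hints at why candidate techniques would need to be benchmarked on archive subsets before being applied at scale.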

Desired skills for this internship:

  • Python development

Considered a plus:

  • experience with source code indexing and/or search
  • experience with software audit solutions (for license compliance issues, security vulnerabilities, etc.)

Workplace: on site at Inria Paris (contact mentors for remote opportunities)

Environment: you will work shoulder to shoulder with all members of the Software Heritage team, and you will have the chance to witness from the inside the construction of the great library of source code.

Internship mentors:

  • Guillaume Rousseau <>
  • Stefano Zacchiroli <> (zack on IRC)

See also