Fine-grained tracking of source code provenance (internship)

From Software Heritage Wiki

Context: Software Heritage is an ambitious initiative whose goal is to collect, preserve forever, and make publicly available the entire body of software, in the preferred form for making modifications to it.

Description: Software Heritage is the largest existing public archive of software source code. It also keeps track of where and when source code files have been observed in the wild: given the checksum of a source code file and the current state of the archive, one can list all the places where that file has been publicly distributed in the past. The goal of this internship is to experiment with increasing the granularity at which source code can be tracked, from entire files (the current solution) to code snippets and/or individual lines of code. Different techniques will be explored, implemented, and benchmarked on subsets of the archive to estimate their viability.
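To make the difference between file-level and line-level tracking concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not the Software Heritage implementation or API: the `ProvenanceIndex` class, its method names, the origin labels, and the choice of SHA-1 over whitespace-stripped lines are all hypothetical.

```python
import hashlib
from collections import defaultdict


def checksum(data: bytes) -> str:
    """Hex digest used as lookup key (SHA-1 is an arbitrary choice here)."""
    return hashlib.sha1(data).hexdigest()


class ProvenanceIndex:
    """Toy in-memory index; a real archive would need persistent storage."""

    def __init__(self):
        self.by_file = defaultdict(set)  # file checksum -> origins
        self.by_line = defaultdict(set)  # line checksum -> origins

    def add(self, origin: str, content: bytes) -> None:
        """Record that `content` was observed at `origin`."""
        self.by_file[checksum(content)].add(origin)
        for line in content.splitlines():
            stripped = line.strip()
            if stripped:  # skip blank lines, which would match everywhere
                self.by_line[checksum(stripped)].add(origin)

    def file_origins(self, content: bytes) -> set:
        """File granularity: origins where the exact same file was seen."""
        return set(self.by_file.get(checksum(content), set()))

    def line_origins(self, line: bytes) -> set:
        """Line granularity: origins where this line was seen,
        ignoring leading and trailing whitespace."""
        return set(self.by_line.get(checksum(line.strip()), set()))
```

For instance, after adding two different files that share the line `return x;`, `file_origins` on the first file returns only its own origin, while `line_origins(b"return x;")` returns both. Very common lines would match a huge number of origins, which is one reason several techniques need to be compared and benchmarked.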

Desirable skills for this internship:

  • Python development

The following will be considered a plus:

  • experience with source code indexing and/or search
  • experience with software audit solutions (for license compliance issues, security vulnerabilities, etc.)

Workplace: on site at Inria Paris (contact mentors for remote opportunities)

Environment: you will work side by side with all members of the Software Heritage team, and you will have the chance to witness from within the construction of the great library of source code.

Internship mentors:

  • Guillaume Rousseau <guillaume.rousseau@univ-paris-diderot.fr>
  • Stefano Zacchiroli <zack@upsilon.cc> (zack on IRC)

See also