Difference between revisions of "Code scanner (internship)"

From Software Heritage Wiki
(port markup to internship template)
Line 1:
  {{Internship
- |description=Companies shipping software as part of their products review
- the source code they ship against databases of known FOSS components to make
- sure they are not shipping unexpected pieces of code. The goal of this
- internship is developing a source code scanner that will be run on a software
- project to determine which parts of it are already known/archived in the
- Software Heritage archive. The scanning should be as efficient as possible and
- the results should be displayed in simple graphical ways (e.g.,
- [https://en.wikipedia.org/wiki/Treemapping treemaps]).
+ |description=
+ Software Heritage is the largest existing public archive of software source code, which also keeps track of where and when source code files have been observed in the wild.
+ Given the checksum of a source code file and the current state of the archive, one can produce a list of all the places where said file has been (publicly) published in the past.
+ The goal of this internship is to experiment with increasing the granularity at which source code can be tracked, from entire files (current solution) to code snippets and/or individual lines of code.
+ Different techniques will be explored, implemented, and benchmarked on archive subsets to estimate their viability.

  |skills=
  * Python development
+
+ Will be considered a plus:
+ * experience with source code indexing and/or search
+ * experience with software audit solutions (for license compliance issues, security vulnerabilities, etc.)

  |mentors=

Revision as of 21:12, 16 February 2020

Context: Software Heritage is an ambitious initiative whose goal is to collect, preserve forever, and make publicly available the entire body of software, in the preferred form for making modifications to it.

Description: Software Heritage is the largest existing public archive of software source code, which also keeps track of where and when source code files have been observed in the wild. Given the checksum of a source code file and the current state of the archive, one can produce a list of all the places where said file has been (publicly) published in the past. The goal of this internship is to experiment with increasing the granularity at which source code can be tracked, from entire files (current solution) to code snippets and/or individual lines of code. Different techniques will be explored, implemented, and benchmarked on archive subsets to estimate their viability.
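As a rough illustration of the two granularities mentioned above, the sketch below computes a whole-file checksum and a set of per-line fingerprints. It assumes a Git-style blob SHA-1 as the file checksum (the description only says "checksum"), and the line-level part is a deliberately naive, hypothetical technique, not the method the internship will settle on. The function names `content_id`, `line_fingerprints`, and `known_lines` are made up for this example.

```python
import hashlib


def content_id(data: bytes) -> str:
    """File-level checksum: SHA-1 over a Git-style blob header plus content.

    Assumption: the archive's file checksum is modelled here as a Git blob hash.
    """
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def line_fingerprints(data: bytes) -> list:
    """Line-level fingerprints: (line number, SHA-1 of the stripped line).

    Blank lines are skipped; surrounding whitespace is ignored so that
    re-indented copies of a line still match. This normalisation is an
    illustrative choice only.
    """
    result = []
    for number, line in enumerate(data.splitlines(), start=1):
        stripped = line.strip()
        if stripped:
            result.append((number, hashlib.sha1(stripped).hexdigest()))
    return result


def known_lines(candidate: bytes, known: bytes) -> list:
    """Line numbers of `candidate` whose fingerprint also occurs in `known`."""
    seen = {digest for _, digest in line_fingerprints(known)}
    return [n for n, digest in line_fingerprints(candidate) if digest in seen]


if __name__ == "__main__":
    scanned = b"int x = 1;\n\nreturn x;\n"
    archived = b"  return x;\nint y = 2;\n"
    print(content_id(scanned))             # whole-file identifier
    print(known_lines(scanned, archived))  # lines already seen elsewhere
```

The two files in the usage example have different whole-file checksums, so a file-level lookup would report no match, while the line-level comparison still finds the shared `return x;` line. A real index over the archive would also have to deal with very common lines (such as a lone closing brace), which is one of the viability questions the benchmarking is meant to answer.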

Desirable skills to obtain this internship:

  • Python development

Will be considered a plus:

  • experience with source code indexing and/or search
  • experience with software audit solutions (for license compliance issues, security vulnerabilities, etc.)

Workplace: on site at Inria Paris (contact mentors for remote opportunities)

Environment: you will work shoulder to shoulder with all members of the Software Heritage team, and you will have a chance to witness from within the construction of the great library of source code.

Internship mentors:

  • Guillaume Rousseau <guillaume.rousseau@univ-paris-diderot.fr>
  • Stefano Zacchiroli <zack@upsilon.cc>

See also