Difference between revisions of "Code scanner (internship)"

From Software Heritage Wiki
(mark as completed)
 
(One intermediate revision by the same user not shown)
Line 1: Line 1:
  {{Internship
- |description=
- Software Heritage is the largest existing public archive of software source code, which also keeps track of where and when source code files have been observed in the wild.
- Given the checksum of a source code file and the current state of the archive, one can produce a list of all the places where said file has been (publicly) published in the past.
- The goal of this internship is to experiment with increasing the granularity at which source code can be tracked, from entire files (current solution) to code snippets and/or individual lines of code.
- Different techniques will be explored, implemented, and benchmarked on archive subsets to estimate their viability.
+ |description=Companies shipping software as part of their products review
+ the source code they ship against databases of known FOSS components to make
+ sure they are not shipping unexpected pieces of code. The goal of this
+ internship is developing a source code scanner that will be run on a software
+ project to determine which parts of it are already known/archived in the
+ Software Heritage archive. The scanning should be as efficient as possible and
+ the results should be displayed in simple graphical ways (e.g.,
+ [https://en.wikipedia.org/wiki/Treemapping treemaps]).
  
  |skills=
  * Python development
- Will be considered a plus:
- * experience with source code indexing and/or search
- * experience with software audit solutions (for license compliance issues, security vulnerabilities, etc.)
  
  |mentors=
Line 18: Line 17:
  }}
  
- [[Category:Ongoing internship]]
+ [[Category:Completed internship]]

Latest revision as of 09:43, 14 December 2020

Context: Software Heritage is an ambitious initiative whose goal is to collect, preserve forever, and make publicly available the entire body of software, in the preferred form for making modifications to it.

Description: Companies that ship software as part of their products review the source code they ship against databases of known FOSS components, to make sure they are not shipping unexpected pieces of code. The goal of this internship is to develop a source code scanner that runs on a software project and determines which parts of it are already present in the Software Heritage archive. The scanning should be as efficient as possible, and the results should be displayed in simple graphical ways (e.g., treemaps).
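
To make the idea concrete, here is a minimal Python sketch of what such a scanner could look like; it is an illustration under stated assumptions, not the design the internship produced. It relies on the fact that Software Heritage identifies file contents with the same hash Git uses for blobs, written as swh:1:cnt:<sha1>. The function names (content_swhid, scan, known_ratio_per_directory) and the is_known callback are hypothetical: is_known stands in for whatever archive lookup a real scanner would use, and the per-directory ratios are only one possible input for a treemap.

```python
import hashlib
import os
from collections import defaultdict


def content_swhid(data: bytes) -> str:
    """Return the content identifier of a file: the Git blob SHA-1,
    prefixed with "swh:1:cnt:"."""
    header = b"blob %d\x00" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


def scan(root: str, is_known) -> dict:
    """Map each file path (relative to root) to True if its content is known.

    is_known is a hypothetical callback taking an identifier string and
    returning a bool; a real scanner would query the archive here.
    """
    results = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip version-control metadata (an assumption of this sketch).
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                swhid = content_swhid(f.read())
            results[os.path.relpath(path, root)] = is_known(swhid)
    return results


def known_ratio_per_directory(results: dict) -> dict:
    """Aggregate scan results into the fraction of known files per directory."""
    counts = defaultdict(lambda: [0, 0])  # directory -> [known, total]
    for path, known in results.items():
        directory = os.path.dirname(path) or "."
        counts[directory][0] += int(known)
        counts[directory][1] += 1
    return {d: k / t for d, (k, t) in counts.items()}
```

Hashing locally and sending only identifiers to the lookup keeps the scan cheap: file contents never leave the machine, and a real implementation could batch many identifiers per request. Reading each file fully into memory is a simplification; large files would call for chunked hashing.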

Desirable skills to obtain this internship:

  • Python development

Workplace: on site at Inria Paris (contact mentors for remote opportunities)

Environment: you will work shoulder to shoulder with all members of the Software Heritage team, and you will have a chance to witness from within the construction of the great library of source code.

Internship mentors:

  • Guillaume Rousseau <guillaume.rousseau@univ-paris-diderot.fr>
  • Stefano Zacchiroli <zack@upsilon.cc>

See also