Difference between revisions of "Large-scale programming language detection (internship)"

From Software Heritage Wiki
Previous revision: (add programming language detection internship)
This revision: m (reread)
Line 5:
  '''Description''':
- The [[Software Heritage archive]] has already assembled the largest collection of
+ The [[Software Heritage archive]] has assembled the largest public collection of
  source code to date: about 4 billion unique files and 1 billion unique commits
  coming from more than 70 million development projects.
Line 12:
  scale is no easy feat. The goal of this internship is, first, to review
  state-of-the-art techniques and research results on large-scale detection of
- programming languages, to assess which ones are good fits for Software Heritage;
- second, practically experiment with (a subset of) the chosen techniques to
+ programming languages, to assess which ones are good candidates for application
+ to the Software Heritage archive; second, practically experiment with (a subset
+ of) the chosen techniques on (a subset of) the Software Heritage archive to
  quantitatively compare their effectiveness.

Revision as of 11:31, 20 January 2018

Context: Software Heritage is an ambitious research project whose goal is to collect, preserve for the very long term, and share all publicly accessible Free/Open Source Software (FOSS) in source code form.

Description: The Software Heritage archive has assembled the largest public collection of source code to date: about 4 billion unique files and 1 billion unique commits coming from more than 70 million development projects.

Detecting the programming language of each individual source code file at this scale is no easy feat. The goal of this internship is, first, to review state-of-the-art techniques and research results on large-scale detection of programming languages, to assess which ones are good candidates for application to the Software Heritage archive; second, practically experiment with (a subset of) the chosen techniques on (a subset of) the Software Heritage archive to quantitatively compare their effectiveness.
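As a purely illustrative sketch of what "quantitatively compare their effectiveness" could look like in practice, the following Python snippet scores two toy detectors against a small hand-labeled sample. Everything in it is an assumption made for illustration: the sample files, the `EXTENSIONS` and `INTERPRETERS` tables, and the function names are hypothetical and are not part of Software Heritage or of any technique the internship would actually evaluate.

```python
# Hypothetical labeled sample: (file name, content, true language).
SAMPLE = [
    ("hello.py", "print('hi')\n", "Python"),
    ("main.c", "#include <stdio.h>\nint main(void) { return 0; }\n", "C"),
    ("run", "#!/usr/bin/env python3\nprint('hi')\n", "Python"),
    ("build", "#!/bin/sh\necho building\n", "Shell"),
]

# Toy lookup tables (assumptions, far from exhaustive).
EXTENSIONS = {".py": "Python", ".c": "C", ".sh": "Shell"}
INTERPRETERS = {"python": "Python", "python3": "Python", "sh": "Shell", "bash": "Shell"}


def detect_by_extension(name, content):
    """Guess the language from the file extension only."""
    dot = name.rfind(".")
    return EXTENSIONS.get(name[dot:]) if dot != -1 else None


def detect_by_shebang(name, content):
    """Guess the language from a '#!' first line, else fall back to the extension."""
    first = content.split("\n", 1)[0]
    if first.startswith("#!"):
        words = first[2:].split()
        if words:
            # '#!/usr/bin/env python3' names the interpreter in the second word.
            use_env = words[0].endswith("/env") and len(words) > 1
            prog = words[1] if use_env else words[0]
            lang = INTERPRETERS.get(prog.rsplit("/", 1)[-1])
            if lang:
                return lang
    return detect_by_extension(name, content)


def accuracy(detector, sample):
    """Fraction of files whose detected language matches the label."""
    hits = sum(1 for name, content, truth in sample if detector(name, content) == truth)
    return hits / len(sample)


if __name__ == "__main__":
    for detector in (detect_by_extension, detect_by_shebang):
        print(detector.__name__, accuracy(detector, SAMPLE))
```

A real comparison would need a much larger labeled sample and richer metrics (for example per-language precision and recall), but the shape stays the same: a common detector interface plus a shared scoring function.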

Desirable skills for this internship:

  • information retrieval
  • knowledge modeling and representation
  • markup languages and manipulation of semi-structured data (HTML, XML, etc.)
  • working knowledge of automatic classification and machine learning would be a plus

Workplace: Inria Paris

Environment: you will work shoulder to shoulder with all members of the Software Heritage team, and you will have a chance to witness from within the construction of the ultimate source code archive.

Internship mentors:

  • Roberto Di Cosmo <roberto@dicosmo.org>
  • Stefano Zacchiroli <zack@upsilon.cc>