Large-scale programming language detection (internship)

From Software Heritage Wiki
Revision as of 21:00, 27 January 2018 by StefanoZacchiroli (talk | contribs) (add skeleton for french translation)

Large-scale programming language detection

(full English description follows)

Context: Software Heritage is a large-scale research project whose goal is to collect, archive in the very long term, and share all publicly accessible Free Software in source code form.

Description: TODO

Desirable skills for this internship: TODO

Host institution: Inria Paris

Environment: you will be fully immersed in the team building the Software Heritage archive, and you will be able to observe closely the construction of a project of worldwide scope.

Mentors:

  • Roberto Di Cosmo <roberto@dicosmo.org>
  • Stefano Zacchiroli <zack@upsilon.cc>


Large-scale programming language detection

Context: Software Heritage is an ambitious research project whose goal is to collect, preserve in the very long term, and share all publicly accessible Free/Open Source Software (FOSS) in source code form.

Description: The Software Heritage archive has assembled the largest public collection of source code to date: about 4 billion unique files and 1 billion unique commits coming from more than 70 million development projects. Detecting the programming language of each individual source code file at this scale is no easy feat.

The goal of this internship is twofold: first, to review state-of-the-art techniques and research results on large-scale detection of programming languages, and to assess which ones are good candidates for application to the Software Heritage archive; second, to experiment in practice with (a subset of) the chosen techniques on (a subset of) the Software Heritage archive, in order to compare their effectiveness quantitatively.
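As an illustration of what such a quantitative comparison might look like, here is a minimal Python sketch. It is not the project's method: the two detectors (a file-extension lookup and a toy keyword counter), the extension table, the marker tokens, and the four labelled samples are all hypothetical and chosen only to keep the example small. The sketch scores each detector on hand-labelled files and reports overall accuracy plus per-language precision and recall.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical extension table; a real study would need a far larger mapping.
EXTENSION_MAP = {".py": "Python", ".c": "C", ".rs": "Rust"}

# Hypothetical marker tokens for the toy content-based detector.
MARKERS = {
    "Python": ("def ", "import ", "self"),
    "C": ("#include", "int main", "printf"),
    "Rust": ("fn ", "let mut", "impl "),
}


def detect_by_extension(filename, content):
    """Baseline: look only at the file extension and ignore the content."""
    suffix = PurePosixPath(filename).suffix.lower()
    return EXTENSION_MAP.get(suffix, "unknown")


def detect_by_keywords(filename, content):
    """Toy content-based detector: pick the language whose markers occur most."""
    scores = {
        lang: sum(content.count(token) for token in tokens)
        for lang, tokens in MARKERS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def evaluate(detector, samples):
    """Return (accuracy, per-language precision/recall) on labelled samples."""
    correct, predicted, actual = Counter(), Counter(), Counter()
    for filename, content, label in samples:
        guess = detector(filename, content)
        predicted[guess] += 1
        actual[label] += 1
        if guess == label:
            correct[label] += 1
    accuracy = sum(correct.values()) / len(samples)
    per_language = {
        lang: {
            "precision": correct[lang] / predicted[lang] if predicted[lang] else 0.0,
            "recall": correct[lang] / actual[lang],
        }
        for lang in actual
    }
    return accuracy, per_language


# Hand-labelled toy samples: (filename, content, true language).
SAMPLES = [
    ("hello.py", "import sys\ndef main():\n    print(sys.argv)\n", "Python"),
    ("hello.c", '#include <stdio.h>\nint main(void) { printf("hi"); return 0; }\n', "C"),
    ("script", "import os\ndef run():\n    pass\n", "Python"),  # no extension
    ("lib.rs", "fn add(a: i32, b: i32) -> i32 { a + b }\n", "Rust"),
]

if __name__ == "__main__":
    for detector in (detect_by_extension, detect_by_keywords):
        accuracy, per_language = evaluate(detector, SAMPLES)
        print(detector.__name__, accuracy, per_language)
```

The extension-less file shows why filenames alone are not enough for an archive of deduplicated file contents. On a realistic sample the labelled set would have to be much larger and the detectors would be actual tools or trained models rather than these stand-ins.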

Desirable skills for this internship:

  • Python development
  • basic statistics for data analysis
  • working knowledge of one of the following would be a plus:
    • natural language processing
    • machine learning
    • GNU R

Workplace: Inria Paris

Environment: you will work shoulder to shoulder with all members of the Software Heritage team, and you will have a chance to witness from within the construction of the ultimate source code archive.

Internship mentors:

  • Roberto Di Cosmo <roberto@dicosmo.org>
  • Stefano Zacchiroli <zack@upsilon.cc>