Large-scale license text recognition (internship)

Context: Software Heritage is an ambitious research project whose goal is to collect, preserve over the very long term, and share all publicly available Free/Open Source Software (FOSS) in source code form.

Description: A number of free/open source software (FOSS) tools are available to automatically detect software licenses declared in source code files, e.g., FOSSology, ScanCode, Ninka. Most of them rely on carefully maintained heuristics that have been tuned over many years to detect the licenses (FOSS or otherwise) found in the wild. Only relatively recently have machine-learning techniques been applied to the license-detection problem, in prototypes like FOSSologyML. The goal of this internship is to apply machine-learning techniques to a limited sub-problem of license detection: recognizing full license texts as they are commonly found in top-level files such as `LICENSE`, `COPYING`, etc., at the scale of the Software Heritage archive. All such files will be extracted from the archive, and suitable machine-learning models will be designed and tested on the resulting corpus.
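To give a feel for the sub-problem, here is a minimal sketch in plain Python. It is only an illustrative baseline, not the approach the internship prescribes: the file-name pattern, the function names (`is_license_file`, `classify`), the similarity threshold, and the two short reference excerpts are all assumptions made for this example. It selects top-level files named like `LICENSE` or `COPYING`, then labels a text by bag-of-words cosine similarity to known license excerpts. A real model would be trained on the extracted corpus, possibly with one of the frameworks listed below.

```python
import math
import re
from collections import Counter

# Hypothetical file-name filter (assumption: the actual extraction
# criteria for the archive are not specified in this page).
LICENSE_NAME = re.compile(r"^(LICEN[CS]E|COPYING)(\.[A-Za-z0-9]+)?$", re.IGNORECASE)


def is_license_file(path):
    """Return True for top-level paths named like LICENSE or COPYING."""
    return "/" not in path and LICENSE_NAME.match(path) is not None


def vectorize(text):
    """Turn a text into a bag of lowercase alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def classify(text, references, threshold=0.5):
    """Return the label of the most similar reference text, or None
    when no reference reaches the (arbitrary) similarity threshold."""
    vec = vectorize(text)
    best_label, best_score = None, 0.0
    for label, ref in references.items():
        score = cosine(vec, vectorize(ref))
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= threshold else None


# Toy reference set: short excerpts only, for illustration.
REFERENCES = {
    "MIT": "Permission is hereby granted, free of charge, to any person "
           "obtaining a copy of this software",
    "GPL": "This program is free software: you can redistribute it and/or "
           "modify it under the terms of the GNU General Public License",
}

if __name__ == "__main__":
    sample = "Permission is hereby granted, free of charge, to any person obtaining a copy"
    print(is_license_file("COPYING"), classify(sample, REFERENCES))
```

Nearest-reference matching of this kind is close in spirit to the heuristic tools mentioned above; the point of the internship is to see how learned models compare on the much larger and noisier corpus found in the archive.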

Desirable skills to obtain this internship:

  • Python development
  • machine learning training and experience
  • working knowledge of one or more machine learning frameworks (e.g., Keras, TensorFlow, scikit-learn)

Considered a plus:

  • natural language processing (NLP) training and experience

Workplace: on site at Inria Paris (contact mentors for remote opportunities)

Environment: you will work shoulder to shoulder with all members of the Software Heritage team, and you will have a chance to witness from within the construction of the great library of source code.

Internship mentors:

  • Stefano Zacchiroli <zack@upsilon.cc>

See also