Large-scale license text recognition (internship)
Context: Software Heritage is an ambitious initiative whose goal is to collect, preserve forever, and make publicly available the entire body of software, in the preferred form for making modifications to it.
Description: A number of free/open source software (FOSS) tools are available to automatically detect software licenses declared in source code files, e.g., FOSSology, ScanCode, Ninka.
Most of them rely on carefully maintained heuristics that have been tuned over many years to detect licenses (FOSS or otherwise) that can be found in the wild.
Only relatively recently have machine-learning techniques been applied to the license-detection problem, in prototypes like FOSSologyML.
The goal of this internship is to apply machine-learning techniques to a limited sub-problem of license detection, namely recognizing full license texts as they are commonly found in top-level files such as LICENSE, COPYING, etc., at the scale of Software Heritage.
All such files will be extracted from the archive, and suitable machine learning models will be designed and tested on the obtained corpus.
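To make the task concrete, here is a minimal sketch of the two steps described above: selecting candidate files by name, then classifying their text. It is an illustration under stated assumptions, not the method the internship will use. The file-name pattern, the class and function names, the labels, and the three-snippet training set are all hypothetical, and a small multinomial naive Bayes classifier written with the standard library stands in for whatever model and framework are eventually chosen.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical pattern for top-level license files (LICENSE, COPYING, ...).
# The real selection criteria for the archive are an open design choice.
LICENSE_FILE_RE = re.compile(r"^(LICEN[SC]E|COPYING)(\.(txt|md))?$", re.IGNORECASE)


def is_license_filename(name):
    """Return True if a file name looks like a top-level license file."""
    return LICENSE_FILE_RE.match(name) is not None


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesLicenseClassifier:
    """Multinomial naive Bayes over word counts, with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()              # label -> number of documents
        self.vocabulary = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(n_docs / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data: short excerpts standing in for full license texts.
TRAIN = [
    ("Permission is hereby granted, free of charge, to any person "
     "obtaining a copy of this software", "MIT"),
    ("This program is free software: you can redistribute it and/or modify "
     "it under the terms of the GNU General Public License", "GPL"),
    ("Licensed under the Apache License, Version 2.0; you may not use this "
     "file except in compliance with the License", "Apache-2.0"),
]

classifier = NaiveBayesLicenseClassifier().fit(
    [text for text, _ in TRAIN], [label for _, label in TRAIN]
)

candidates = ["LICENSE", "COPYING", "README.md", "main.py"]
print([name for name in candidates if is_license_filename(name)])
print(classifier.predict("Permission is hereby granted to any person obtaining a copy"))
```

At the scale of the archive, the training set would be the extracted corpus rather than three excerpts, and the word-count model would likely give way to richer features and a framework such as scikit-learn, Keras, or TensorFlow.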
Desirable skills for this internship:
- Python development
- machine learning training and experience
- working knowledge of one or more machine learning frameworks (e.g., Keras, TensorFlow, scikit-learn)
The following will be considered a plus:
- natural language processing (NLP) training and experience
Workplace: on site at Inria Paris (contact mentors for remote opportunities)
Environment: you will work shoulder to shoulder with all members of the Software Heritage team, and you will have a chance to witness from within the construction of the great library of source code.
Internship mentors:
- Stefano Zacchiroli <zack@upsilon.cc>