Source code search engine prototype (internship)
Context: Software Heritage is an ambitious initiative whose goal is to collect, preserve forever, and make publicly available the entire body of software, in the preferred form for making modifications to it.
Description: The current archive search engine supports searching archived projects (or "origins") by URL or metadata. We would like to extend it to also support searching within archived source code files, based on their textual content. Indexing all source code files archived by Software Heritage (~500-600 TB) is a major undertaking in terms of indexing time and storage. The goal of this internship is to design and implement a medium-scale prototype of such an index (covering, e.g., 0.1 to 1% of the archive) that will make it possible to evaluate the best indexing approach (e.g., which kind of index, which tokenizer) as well as the time and resources that indexing the full archive would require (e.g., via extrapolation). The first technology to be tried is Elasticsearch, as it is already deployed for other Software Heritage needs; depending on the candidate, other search engines can also be tested during the internship.
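To illustrate the extrapolation step mentioned in the description, here is a minimal Python sketch that scales figures measured on a prototype up to the full archive. It assumes that indexing time and index size grow linearly with the amount of indexed content, a simplification the prototype itself would have to validate. The names PrototypeMeasurement and extrapolate, as well as the sample numbers, are hypothetical and only serve the example.

```python
from dataclasses import dataclass


@dataclass
class PrototypeMeasurement:
    """Figures measured on the prototype (all values used below are hypothetical)."""
    sample_fraction: float  # share of the archive indexed, e.g. 0.001 for 0.1%
    indexing_hours: float   # wall-clock time spent indexing the sample
    index_size_tb: float    # on-disk size of the resulting index, in TB


def extrapolate(m: PrototypeMeasurement) -> tuple[float, float]:
    """Linearly scale indexing time and index size to the full archive.

    Assumption: cost is proportional to the amount of indexed content.
    Real indexes may scale differently, so treat the result as a rough estimate.
    """
    if not 0 < m.sample_fraction <= 1:
        raise ValueError("sample_fraction must be in (0, 1]")
    scale = 1 / m.sample_fraction
    return m.indexing_hours * scale, m.index_size_tb * scale


# Hypothetical measurement: 1% of the archive indexed in 12 hours, 2 TB index.
hours, size_tb = extrapolate(PrototypeMeasurement(0.01, 12.0, 2.0))
print(f"full archive: ~{hours / 24:.0f} days, ~{size_tb:.0f} TB of index")
```

Running the same calculation for several sample sizes (for instance 0.1%, 0.5%, and 1%) would show whether the linear assumption holds or whether the estimate needs a different scaling model.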
Desirable skills to obtain this internship:
- Python development
- database administration experience (SQL and/or NoSQL and/or document-oriented)
The following will be considered a plus:
- experience with full-text or code search
Workplace: on site at Inria Paris (contact mentors for remote opportunities)
Environment: you will work shoulder to shoulder with all members of the Software Heritage team, and you will have a chance to witness from within the construction of the great library of source code.
Internship mentors: