Source code search engine prototype (internship)


Context: Software Heritage is an ambitious research project whose goal is to collect, preserve in the very long term, and share all publicly accessible Free/Open Source Software (FOSS) in source code form.

Description: The current archive search engine supports searching archived projects (or "origins") by URL or metadata. We would like to extend it to also support searching within archived source code files, based on their textual content. Indexing all source code files archived by Software Heritage (~200-300 TB) is a major undertaking in terms of both indexing time and storage. The goal of this internship is to design and implement a medium-scale prototype of such an index (covering, e.g., 1% of the archive). The prototype will be used to evaluate the best indexing approach (e.g., which kind of index, which tokenizer, etc.) and to estimate, e.g., via extrapolation, the time and resources that indexing the full archive would require. The first technology to be tried is ElasticSearch, as it is already deployed for other Software Heritage needs; depending on the candidate, other search engines can also be tested during the internship.
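As a rough illustration of the extrapolation step mentioned above, here is a minimal sketch in Python. It assumes that index size and indexing time grow linearly with the amount of source code indexed, which is only a first approximation that the prototype itself would have to validate. The names SampleMeasurement and extrapolate are hypothetical, and the figures in the usage example are placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass
class SampleMeasurement:
    """Measurements taken while indexing a sample of the archive (hypothetical type)."""
    sample_bytes: int         # total size of the source files that were indexed
    index_bytes: int          # on-disk size of the resulting index
    indexing_seconds: float   # wall-clock time spent indexing the sample


def extrapolate(measurement: SampleMeasurement, archive_bytes: int) -> dict:
    """Scale sample measurements up to the full archive.

    Assumption: cost grows linearly with input size. Real indexes may
    behave differently (e.g., segment merging, deduplicated content).
    """
    if measurement.sample_bytes <= 0:
        raise ValueError("sample_bytes must be positive")
    scale = archive_bytes / measurement.sample_bytes
    return {
        "scale_factor": scale,
        "index_bytes": measurement.index_bytes * scale,
        "indexing_seconds": measurement.indexing_seconds * scale,
    }


if __name__ == "__main__":
    TB = 10**12
    # Placeholder figures: a 1% sample of a 250 TB archive (the midpoint of
    # the ~200-300 TB range above); index size and time are invented.
    sample = SampleMeasurement(
        sample_bytes=250 * TB // 100,
        index_bytes=1 * TB,
        indexing_seconds=48 * 3600.0,
    )
    estimate = extrapolate(sample, archive_bytes=250 * TB)
    print(f"estimated index size: {estimate['index_bytes'] / TB:.0f} TB")
    print(f"estimated indexing time: {estimate['indexing_seconds'] / 86400:.0f} days")
```

Running the same extrapolation on samples of several sizes (e.g., 0.1%, 0.5%, 1%) would show whether the linear assumption holds before trusting the full-archive estimate.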

Desired skills for this internship:

  • Python development
  • database administration experience (SQL and/or NoSQL and/or document-oriented)

The following will be considered a plus:

  • experience with full-text or code search

Workplace: Inria Paris

Environment: you will work shoulder to shoulder with all members of the Software Heritage team, and you will have a chance to witness from within the construction of the great library of source code.

Internship mentors:

  • Stefano Zacchiroli <zack@upsilon.cc>