Add sources to the project search engine (GSoC task)


Introduction

The homepage of the Software Heritage archive features a small search engine that searches project URLs and project metadata. Project metadata includes name, description, authors, etc.

This is implemented by a Python service backed by an Elasticsearch database, which contains one document for each project. Each document contains metadata mined from the project itself.
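
To make the "one document per project" idea concrete, here is a minimal sketch in plain Python. The field names (`url`, `metadata`, `name`, `description`, `authors`) and the `matches` helper are illustrative assumptions, not the actual index mapping or the service's query logic.

```python
# Hypothetical shape of one per-project document. The field names are
# illustrative assumptions, not the real Elasticsearch mapping.
project_doc = {
    "url": "https://example.org/alice/hello-world",
    "metadata": {
        "name": "hello-world",
        "description": "A small demo project",
        "authors": ["Alice Example"],
    },
}


def matches(doc: dict, query: str) -> bool:
    """Return True if every query word appears in the URL or the metadata."""
    meta = doc["metadata"]
    haystack = " ".join(
        [doc["url"], meta["name"], meta["description"], *meta["authors"]]
    ).lower()
    return all(word in haystack for word in query.lower().split())
```

A real deployment would delegate matching and ranking to Elasticsearch; the substring check above only illustrates which parts of a document are searched.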

Task description

We would like to add more data sources to the Elasticsearch database; typically sources that are not authoritative but usually provide good-quality metadata. These sources include data received from deposit clients like HAL and data we will archive from swMATH, GitHub.com, ...

This comes with the following challenges:

  1. there are multiple sources, and their contents must work together
  2. sources have different reliability, which should be taken into account when ranking search results

Therefore, this task will require making a plan to address these challenges, defining a data model, and finally implementing it in a backend. It may involve some frontend work if necessary, to provide an interface for these.
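
One possible starting point for such a data model, sketched in plain Python under stated assumptions: the source labels (`intrinsic`, `deposit`, `external`), the reliability weights, and the `merge` and `score` helpers are all hypothetical, and choosing them properly is part of the task itself.

```python
from dataclasses import dataclass

# Hypothetical reliability weights in [0, 1]. Both the labels and the values
# are assumptions made for illustration only.
RELIABILITY = {"intrinsic": 1.0, "deposit": 0.8, "external": 0.5}


@dataclass(frozen=True)
class SourceRecord:
    source: str     # key into RELIABILITY
    metadata: dict  # field name -> value, as reported by this source


def merge(records):
    """For each field, keep the value from the most reliable source."""
    merged = {}
    # Iterate from least to most reliable so that more reliable sources
    # overwrite the values written by less reliable ones.
    for rec in sorted(records, key=lambda r: RELIABILITY[r.source]):
        for field, value in rec.metadata.items():
            merged[field] = {"value": value, "source": rec.source}
    return merged


def score(merged, query):
    """Sum the reliability of every field whose value contains the query."""
    q = query.lower()
    return sum(
        RELIABILITY[entry["source"]]
        for entry in merged.values()
        if q in str(entry["value"]).lower()
    )
```

Keeping the originating source next to each value lets both challenges be handled in one structure: fields from different sources coexist in a single document, and a match on a less reliable field contributes less to the ranking score.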

Expected duration

175 or 350 hours, at your option (longer duration means you can handle more data sources). Difficulty: medium

Desirable skills

  • Python 3 and Git are a must to work on any Software Heritage project
  • Elasticsearch
  • Experience with cross-referenced data mining would be appreciated

Potential mentors

  • Kumar Shivendu (KShivendu on IRC)
  • Valentin Lorentz (vlorentz on IRC)
  • Vincent Sellier (vsellier on IRC)

Other relevant (but independent) tasks

This task is only about adding data we have already collected to the existing Elasticsearch database. You may also be interested in Mine information from archived content (GSoC task) and Mine information from external sources (GSoC task) to fill this database, but those are completely independent tasks.

This database only contains project URLs and metadata, not source code. Source code search is more complex, but is available as an internship topic.