Add sources to the project search engine (GSoC task)
Introduction
The homepage of the Software Heritage archive features a small search engine that searches project URLs and project metadata. Project metadata includes the name, description, authors, etc.
This is implemented by a Python service backed by an Elasticsearch database, which contains one document for each project; each document contains metadata mined from the project itself.
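As an illustration of "one document for each project", a document could look roughly like the sketch below. The field names (`url`, `metadata`, `name`, `description`, `authors`), the helper `make_project_document` and the sample values are assumptions made for this example, not the actual Software Heritage schema.

```python
import json


def make_project_document(url, name, description, authors):
    """Build one search document for a project.

    Hypothetical layout: the URL at the top level, and the metadata
    mined from the project itself grouped under "metadata".
    """
    return {
        "url": url,
        "metadata": {
            "name": name,
            "description": description,
            "authors": list(authors),
        },
    }


# Sample values are invented for illustration only.
doc = make_project_document(
    url="https://example.org/user/project",
    name="project",
    description="An example project",
    authors=["Jane Doe"],
)
print(json.dumps(doc, indent=2))
```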
Task description
We would like to add more data sources to the Elasticsearch database, typically sources that are not authoritative but usually provide metadata of good quality.
This comes with the following challenges:
- there are multiple sources, and their contents must work together
- sources differ in reliability, which should be taken into account when ranking search results
Therefore, this task will require making a plan to address these challenges, defining a data model, and finally implementing it in the backend. It may also involve some frontend work, if necessary, to provide an interface for them.
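One possible starting point for such a data model is sketched below; it is not a prescribed design. It assumes each source contributes its own metadata record tagged with a reliability weight, and that for each field the value from the most reliable source providing it is kept. The names `SourceRecord` and `merge_records`, the source labels and the weights are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceRecord:
    source: str         # hypothetical source label
    reliability: float  # hypothetical weight; higher means more trusted
    metadata: dict      # metadata fields provided by this source


def merge_records(records):
    """Merge per-source metadata into a single view of a project.

    For each field, keep the non-empty value from the most reliable
    source that provides it, and remember which source it came from.
    """
    merged = {}
    provenance = {}
    # Visit sources from most to least reliable; the first value seen wins.
    for record in sorted(records, key=lambda r: r.reliability, reverse=True):
        for field, value in record.metadata.items():
            if field not in merged and value not in (None, "", []):
                merged[field] = value
                provenance[field] = record.source
    return merged, provenance


# Invented sample data: metadata mined from the project itself, plus a
# non-authoritative external source with a lower weight.
records = [
    SourceRecord("intrinsic", 1.0, {"name": "project", "description": ""}),
    SourceRecord(
        "external-registry",
        0.6,
        {
            "name": "Project!",
            "description": "A demo project",
            "authors": ["Jane Doe"],
        },
    ),
]
merged, provenance = merge_records(records)
print(merged)
print(provenance)
```

Keeping the provenance next to the merged values leaves room for using source reliability when ranking search results later, for example by weighting matches according to the source each field came from.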
Desirable skills
- Python 3 and Git are a must to work on any Software Heritage project
- Elasticsearch
- Experience with cross-referenced data mining would be appreciated
Potential mentors
- Kumar Shivendu (KShivendu on IRC)
- Valentin Lorentz (vlorentz on IRC)
- Vincent Sellier (vsellier on IRC)
Other relevant (but independent) tasks
This task is only about adding data we have already collected to the existing Elasticsearch database. You may also be interested in Mine information from archived content (GSoC task) and Mine information from external sources (GSoC task), which fill this database, but those are completely independent tasks.
This database only contains project URLs and metadata, not source code. Source code search is more complex, but is available as an internship topic.