Difference between revisions of "Add sources to the project search engine (GSoC task)"
Revision as of 09:08, 18 March 2022
Introduction
The homepage of the Software Heritage archive features a small search engine that searches project URLs and project metadata. Project metadata includes name, description, authors, etc.
This is implemented by a Python service backed by an ElasticSearch database, which contains one document for each project. Each document contains metadata mined from the project itself.
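To make the "one document per project" idea concrete, here is a minimal in-memory sketch in plain Python. It does not use ElasticSearch and does not reflect the actual schema: the `ProjectDocument` class, its field names, and the `search` helper are all hypothetical, chosen only to mirror the metadata listed above (name, description, authors).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProjectDocument:
    """Hypothetical shape of one per-project document (not the real schema)."""
    url: str
    name: str = ""
    description: str = ""
    authors: List[str] = field(default_factory=list)


def matches(doc: ProjectDocument, query: str) -> bool:
    """Return True if the query occurs in the URL or in any metadata field."""
    needle = query.lower()
    haystack = [doc.url, doc.name, doc.description, *doc.authors]
    return any(needle in text.lower() for text in haystack)


def search(docs: List[ProjectDocument], query: str) -> List[str]:
    """Return the URLs of all matching documents, in input order."""
    return [doc.url for doc in docs if matches(doc, query)]
```

A real deployment would delegate matching and ranking to ElasticSearch; the substring match here only illustrates that a query is checked against both the URL and the metadata of each document.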
Task description
We would like to add more data sources to the ElasticSearch database, typically sources that are not authoritative but usually provide good-quality metadata.
This comes with the following challenges:
- there are multiple sources, and their contents must work together
- sources differ in reliability, which should be taken into account when ranking search results
Therefore, this task will require making a plan to address these challenges, defining a data model, and finally implementing it in the backend. It may also involve some frontend work, if necessary, to provide an interface for the new data.
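As a starting point for discussion, the two challenges above could be sketched as follows. This is only one possible approach under stated assumptions, not a proposed design: the source names, the reliability weights, and the `SourceRecord`, `merge_records`, `score`, and `rank` names are all invented for illustration. Metadata from each source is kept side by side per project, and a project's score is the sum of the reliability weights of the sources whose metadata mentions the query.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical reliability weights in [0, 1]; source names and values are invented.
SOURCE_RELIABILITY: Dict[str, float] = {
    "intrinsic": 1.0,   # metadata mined from the project itself
    "registry": 0.8,    # e.g. a package registry
    "forge_api": 0.6,   # e.g. a forge's API
}


@dataclass
class SourceRecord:
    """Metadata about one project, as reported by one source."""
    source: str
    project_url: str
    metadata: Dict[str, str]


def merge_records(records: List[SourceRecord]) -> Dict[str, Dict[str, Dict[str, str]]]:
    """Group records by project URL, keeping each source's metadata separate."""
    merged: Dict[str, Dict[str, Dict[str, str]]] = {}
    for record in records:
        merged.setdefault(record.project_url, {})[record.source] = record.metadata
    return merged


def score(per_source: Dict[str, Dict[str, str]], query: str) -> float:
    """Sum the reliability of every source whose metadata mentions the query."""
    needle = query.lower()
    total = 0.0
    for source, metadata in per_source.items():
        if any(needle in value.lower() for value in metadata.values()):
            # Sources without a known weight contribute nothing.
            total += SOURCE_RELIABILITY.get(source, 0.0)
    return total


def rank(records: List[SourceRecord], query: str) -> List[str]:
    """Return matching project URLs, highest score first (ties broken by URL)."""
    merged = merge_records(records)
    scored = [(url, score(per_source, query)) for url, per_source in merged.items()]
    hits = [(url, s) for url, s in scored if s > 0]
    return [url for url, _ in sorted(hits, key=lambda pair: (-pair[1], pair[0]))]
```

Keeping per-source metadata separate, rather than overwriting fields, is a deliberate choice in this sketch: it avoids deciding up front which source wins when they disagree, and leaves that decision to the ranking step.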
Desirable skills
- Python 3 and Git are a must to work on any Software Heritage project
- ElasticSearch
- Experience with cross-referenced data mining would be appreciated
Potential mentors
- Kumar Shivendu (KShivendu on IRC)
- Valentin Lorentz (vlorentz on IRC)
- Vincent Sellier (vsellier on IRC)
Other relevant (but independent) tasks
This task is only about adding data we have already collected to the existing Elasticsearch database. You may also be interested in Mine information from archived content (GSoC task) and Mine information from external sources (GSoC task), which aim to fill this database, but those are completely independent tasks.
This database only contains project URLs and metadata, not source code. Source code search is more complex, but is available as an internship topic.