Add sources to the project search engine (GSoC task)
The homepage of the Software Heritage archive features a small search engine that searches project URLs and project metadata. Project metadata includes the name, description, authors, etc.
This is implemented by a Python service backed by an Elasticsearch database, which contains one document per project; each document holds metadata mined from the project itself.
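To make the "one document per project" idea concrete, here is a minimal sketch of what such a document could look like. The field names below are illustrative assumptions, not the actual Software Heritage schema:

```python
# Hypothetical shape of one Elasticsearch document per project.
# Field names ("url", "intrinsic_metadata", ...) are illustrative
# assumptions, not the real Software Heritage mapping.
doc = {
    "url": "https://github.com/example/project",
    "intrinsic_metadata": {  # metadata mined from the project itself
        "name": "project",
        "description": "An example project",
        "authors": ["Jane Doe"],
    },
}

print(sorted(doc))
```

A search query would then match against fields like `url` and `intrinsic_metadata.description`.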
We would like to add more data sources to the Elasticsearch database; typically sources that are not authoritative but usually provide metadata of good quality. These sources include data received from deposit clients such as HAL, and data we will archive from swMATH, GitHub.com, ...
This comes with the following challenges:
- there are multiple sources, and their contents must be combined consistently
- sources have different levels of reliability, which should be taken into account when ranking search results
Therefore, this task will require making a plan to address these challenges, defining a data model, and finally implementing it in a backend. It may also involve some frontend work, if necessary, to expose these sources in the search interface.
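One possible way to think about the two challenges above is to keep per-source metadata side by side and resolve conflicts by source reliability. The sketch below illustrates this idea only; the source names, weights, and merge policy are assumptions for this example, not a proposed design:

```python
# Sketch: combining metadata from multiple sources of different reliability.
# Source names and weights are made up for illustration.
SOURCE_WEIGHT = {
    "intrinsic": 1.0,  # mined from the project itself
    "deposit": 0.9,    # deposit clients such as HAL
    "github": 0.7,     # metadata collected from a forge
}

def merged_field(documents, field):
    """Return the value of `field` from the most reliable source that has it."""
    candidates = [
        (SOURCE_WEIGHT.get(d["source"], 0.0), d["metadata"].get(field))
        for d in documents
        if field in d["metadata"]
    ]
    if not candidates:
        return None
    # Pick the candidate with the highest source weight.
    return max(candidates)[1]

docs = [
    {"source": "github", "metadata": {"description": "short forge description"}},
    {"source": "deposit", "metadata": {"description": "curated description",
                                       "authors": ["Jane Doe"]}},
]

print(merged_field(docs, "description"))  # the deposit source outweighs github
```

The same weights could also feed into result ranking, for instance by boosting matches that come from more reliable sources at query time.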
175 or 350 hours, at your option (longer duration means you can handle more data sources). Difficulty: medium
- Python 3 and Git are a must to work on any Software Heritage project
- Experience with cross-referenced data mining would be appreciated
- Kumar Shivendu (KShivendu on IRC)
- Valentin Lorentz (vlorentz on IRC)
- Vincent Sellier (vsellier on IRC)
Other relevant (but independent) tasks
This task is only about adding data we have already collected to the existing Elasticsearch database; you may also be interested in Mine information from archived content (GSoC task) and Mine information from external sources (GSoC task) to fill this database, but those are completely independent tasks.
This database only contains project URLs and metadata, not source code. Source code search is more complex, but is available as an internship topic.