Add sources to the project search engine (GSoC task)

Introduction

The homepage of the Software Heritage archive features a small search engine that searches project URLs and project metadata. Project metadata includes the project's name, description, authors, etc.

This is implemented by a Python service backed by an Elasticsearch database that contains one document per project; each document holds metadata mined from the project itself.
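For concreteness, here is a minimal sketch of that "one document per project" setup, using the elasticsearch Python client (8.x API). The index name, document identifier, and field names are illustrative only, not the actual swh-search schema.

  from elasticsearch import Elasticsearch

  es = Elasticsearch("http://localhost:9200")

  # Index one document per project (origin), carrying metadata mined from it.
  # "origin" and the field layout are hypothetical placeholders.
  es.index(
      index="origin",
      id="https://github.com/example/project",
      document={
          "url": "https://github.com/example/project",
          "intrinsic_metadata": {  # e.g. mined from a codemeta.json file
              "name": "example-project",
              "description": "A small example project",
              "author": [{"name": "Jane Doe"}],
          },
      },
  )

  # The homepage search then boils down to a full-text query on these fields.
  hits = es.search(
      index="origin",
      query={
          "multi_match": {
              "query": "example project",
              "fields": ["url", "intrinsic_metadata.name", "intrinsic_metadata.description"],
          }
      },
  )
  for hit in hits["hits"]["hits"]:
      print(hit["_source"]["url"])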

Task description

We would like to add more data sources to the Elasticsearch database: typically sources that are not authoritative, but that provide metadata of usually good quality. These sources include data received from deposit clients such as HAL, and data we will archive from swMATH, GitHub.com, ...

This comes with the following challenges:

  1. there are multiple sources, and their contents must work together
  2. sources have different reliability, which should be taken into account when ranking search results

Therefore, this task will require making a plan to address these challenges, defining a data model, and finally implementing it in a backend. It may also involve some frontend work, if necessary, to provide an interface for these.
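As a starting point for that discussion, the sketch below shows one possible data model and ranking scheme, not a decided design: metadata from each source is stored under its own key so that conflicting values can coexist, and per-field boosts in the query express how much each source is trusted. The index name, field names, and boost values are all assumptions.

  from elasticsearch import Elasticsearch

  es = Elasticsearch("http://localhost:9200")

  # Hypothetical document layout: one sub-document per source, so non-authoritative
  # metadata does not overwrite what was mined from the project itself.
  es.index(
      index="origin",
      id="https://github.com/example/project",
      document={
          "url": "https://github.com/example/project",
          # authoritative: mined from the project itself
          "intrinsic_metadata": {"description": "A small example project"},
          # non-authoritative sources, kept separate and clearly labelled
          "hal_metadata": {"description": "Curated description deposited via HAL"},
          "github_metadata": {"description": "Repository description from the GitHub API"},
      },
  )

  # Per-source reliability expressed as query-time field boosts: intrinsic
  # metadata counts most, curated HAL deposits next, GitHub descriptions least.
  hits = es.search(
      index="origin",
      query={
          "multi_match": {
              "query": "example project",
              "fields": [
                  "intrinsic_metadata.description^3",
                  "hal_metadata.description^2",
                  "github_metadata.description",
              ],
          }
      },
  )
  for hit in hits["hits"]["hits"]:
      print(hit["_score"], hit["_source"]["url"])

An alternative would be index-time weighting, for example a per-source reliability score combined with the text score in a function_score query; choosing between such options is precisely the design work this task calls for.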

Expected duration

175 or 350 hours, at your option (the longer duration means you can handle more data sources). Difficulty: medium.

Desirable skills

  • Python 3 and Git are a must to work on any Software Heritage project
  • Elasticsearch
  • Experience with cross-referenced data mining would be appreciated

Potential mentors

  • Kumar Shivendu (KShivendu on IRC)
  • Valentin Lorentz (vlorentz on IRC)
  • Vincent Sellier (vsellier on IRC)

Other relevant (but independent) tasks

This task is only about adding data we have already collected to the existing Elasticsearch database; you may also be interested in Mine information from archived content (GSoC task) and Mine information from external sources (GSoC task) to fill this database, but those are completely independent tasks.

This database only contains project URLs and metadata, not source code. Source code search is more complex, but it is available as an internship topic.