Difference between revisions of "Add sources to the project search engine (GSoC task)"

Latest revision as of 06:33, 2 April 2022

Introduction

The homepage of the Software Heritage archive features a small search engine that searches project URLs and project metadata. Project metadata includes name, description, authors, etc.

This is implemented by a Python service backed by an ElasticSearch database, which contains one document for each project. Each document contains metadata mined from the project itself.

Task description

We would like to add more data sources to the ElasticSearch database, typically sources that are not authoritative but usually provide metadata of good quality. These sources include data received from deposit clients like HAL and data we will archive from swMATH, GitHub.com, ...

This comes with the following challenges:

there are multiple sources, and their contents must work together
sources have different reliability, which should be taken into account when ranking search results

Therefore, this task will require making a plan to address these challenges, defining a data model, and finally implementing it in a backend. It may also involve some frontend work, if necessary, to provide an interface for the new data.
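As an illustration of the two challenges above, here is a minimal sketch of one possible data model, in plain Python with no ElasticSearch dependency. It is not the existing implementation: the `ProjectDocument` class, the `search` function, the source names and the weight values are all hypothetical, chosen only to show metadata being kept per source and source reliability feeding into ranking.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical reliability weights per source; names and values are
# illustrative assumptions, not taken from the Software Heritage codebase.
SOURCE_WEIGHTS: Dict[str, float] = {
    "intrinsic": 1.0,  # metadata mined from the project itself
    "deposit": 0.8,    # e.g. received from a deposit client
    "swmath": 0.6,
    "github": 0.5,
}


@dataclass
class ProjectDocument:
    """One document per project, with metadata stored separately per source."""

    url: str
    metadata_by_source: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def add_source(self, source: str, metadata: Dict[str, str]) -> None:
        # Merge new fields into this source's metadata without touching
        # what other sources said, so provenance is preserved.
        self.metadata_by_source.setdefault(source, {}).update(metadata)

    def score(self, query: str) -> float:
        # Sum the weights of every source in which at least one metadata
        # value contains the query (case-insensitive substring match).
        needle = query.lower()
        total = 0.0
        for source, metadata in self.metadata_by_source.items():
            if any(needle in value.lower() for value in metadata.values()):
                total += SOURCE_WEIGHTS.get(source, 0.0)
        return total


def search(documents: List[ProjectDocument], query: str) -> List[ProjectDocument]:
    """Return matching documents, most reliable matches first."""
    scored = [(doc.score(query), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored if score > 0]
```

In this sketch a match found in a more reliable source ranks above the same match found in a less reliable one, and a project confirmed by several sources accumulates their weights. A real implementation would more likely express this inside the ElasticSearch queries and mappings rather than in application code.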

Expected duration

175 or 350 hours, at your option (longer duration means you can handle more data sources). Difficulty: medium

Desirable skills

Python 3 and Git are a must to work on any Software Heritage project
ElasticSearch
Experience with cross-referenced data mining would be appreciated

Potential mentors

Kumar Shivendu (KShivendu on IRC)
Valentin Lorentz (vlorentz on IRC)
Vincent Sellier (vsellier on IRC)

Other relevant (but independent) tasks

This task is only about adding data we have already collected to the existing ElasticSearch database. You may also be interested in Mine information from archived content (GSoC task) and Mine information from external sources (GSoC task), which would fill this database, but those are completely independent tasks.

This database only contains project URLs and metadata, not source code. Source code search is more complex, but is available as an internship topic.

@@ Line 14: / Line 14: @@
 typically sources that are not authoritative, but provide metadata of usually
 good quality.
+These sources include data received from [https://docs.softwareheritage.org/devel/swh-deposit/index.html deposit clients] like HAL and data we will archive from [https://swmath.org/ swMATH], [https://github.com/ GitHub.com], ...
 This comes with the following challenges:
-. there are multiple sources, and their contents must work together
+# there are multiple sources, and their contents must work together
-. sources have different reliability, that should be taken into account
+# sources have different reliability, that should be taken into account when ranking search results
-   when ranking search results
 Therefore, this task will require making a plan to address these,
@@ Line 25: / Line 25: @@
 It may involve some frontend work if necessary, to provide an interface for
 these.
+== Expected duration ==
+175 or 350 hours, at your option (longer duration means you can handle more data sources). Difficulty: medium
 == Desirable skills ==
@@ Line 34: / Line 37: @@
 == Potential mentors ==
+* Kumar Shivendu (KShivendu on [[IRC]])
 * Valentin Lorentz (vlorentz on [[IRC]])
-* Kumar Shivendu (KShivendu on [[IRC]])
+* Vincent Sellier (vsellier on [[IRC]])
 == Other relevant (but independent) tasks ==
@@ Line 49: / Line 53: @@
 [[Category:GSoC task]]
+[[Category:Available GSoC task]]
