Add sources to the project search engine (GSoC task)
Latest revision as of 06:33, 2 April 2022
Introduction
The homepage of the Software Heritage archive features a small search engine that searches project URLs and project metadata. Project metadata includes name, description, authors, etc.
This is implemented by a Python service backed by an Elasticsearch database, which contains one document for each project; each document contains metadata mined from the project itself.
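To make the "one document per project" idea concrete, here is a minimal sketch in plain Python. The `ProjectDocument` class, its field names, and the `matches` helper are hypothetical illustrations, not the actual schema or code of the Software Heritage search service, and the substring match is only a naive stand-in for full-text search.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ProjectDocument:
    """Hypothetical shape of a search document: one per project."""

    url: str
    name: str = ""
    description: str = ""
    authors: list[str] = field(default_factory=list)


def matches(doc: ProjectDocument, query: str) -> bool:
    """Case-insensitive substring match over the URL and the metadata fields."""
    needle = query.lower()
    haystack = [doc.url, doc.name, doc.description, *doc.authors]
    return any(needle in text.lower() for text in haystack)


# Made-up example documents.
docs = [
    ProjectDocument(
        url="https://example.org/alice/parser",
        name="parser",
        description="A small JSON parser",
        authors=["Alice"],
    ),
    ProjectDocument(
        url="https://example.org/bob/plotter",
        name="plotter",
        description="Plotting helpers",
        authors=["Bob"],
    ),
]

found = [doc.url for doc in docs if matches(doc, "json")]
```

Here `found` contains only the URL of the first project, because "json" appears in its description and nowhere in the second document.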
Task description
We would like to add more data sources to the Elasticsearch database; typically sources that are not authoritative, but that provide metadata that is usually of good quality. These sources include data received from deposit clients like HAL and data we will archive from swMATH, GitHub.com, ...
This comes with the following challenges:
- there are multiple sources, and their contents must work together
- sources differ in reliability, which should be taken into account when ranking search results
Therefore, this task will require planning how to address these challenges, defining a data model, and finally implementing it in a backend. It may involve some frontend work if necessary, to provide an interface for these.
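One possible starting point for such a data model is sketched below, under stated assumptions: metadata is stored per source inside each project document, and each source carries a reliability weight that scales its contribution to the ranking score. The source names, the weight values, and the scoring rule are all hypothetical; choosing them is precisely part of the task.

```python
from dataclasses import dataclass, field

# Hypothetical reliability weights (assumption: metadata mined from the
# project itself, called "intrinsic" here, is trusted most).
SOURCE_WEIGHTS = {"intrinsic": 1.0, "deposit": 0.8, "swmath": 0.6, "github": 0.5}
DEFAULT_WEIGHT = 0.1  # assumed weight for sources not listed above


@dataclass
class ProjectDocument:
    url: str
    # Metadata keyed by source name, e.g. {"github": {"description": "..."}}.
    metadata: dict = field(default_factory=dict)

    def add_source(self, source: str, fields: dict) -> None:
        """Merge fields from one source without touching the other sources."""
        self.metadata.setdefault(source, {}).update(fields)


def score(doc: ProjectDocument, query: str) -> float:
    """Sum, over sources, of weight times the number of fields matching the query."""
    needle = query.lower()
    total = 0.0
    for source, fields in doc.metadata.items():
        hits = sum(1 for value in fields.values() if needle in str(value).lower())
        total += SOURCE_WEIGHTS.get(source, DEFAULT_WEIGHT) * hits
    return total


def search(docs: list, query: str) -> list:
    """Return matching documents, highest score first."""
    scored = [(score(doc, query), doc) for doc in docs]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [doc for value, doc in ranked if value > 0]


# Made-up example data.
a = ProjectDocument("https://example.org/a")
a.add_source("intrinsic", {"description": "fast json parser"})
b = ProjectDocument("https://example.org/b")
b.add_source("github", {"description": "json tools"})
c = ProjectDocument("https://example.org/c")
c.add_source("swmath", {"description": "linear algebra"})

results = [doc.url for doc in search([b, a, c], "json")]
```

With these made-up weights, project `a` ranks above `b` because its match comes from a more trusted source, and `c` is dropped because nothing matches. In Elasticsearch itself, a comparable effect might be obtained with per-field boosts in a `multi_match` query (for example a field listed as `intrinsic.description^2`), assuming each source's metadata is indexed under its own field prefix.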
Expected duration
175 or 350 hours, at your option (longer duration means you can handle more data sources). Difficulty: medium
Desirable skills
- Python 3 and Git are a must to work on any Software Heritage project
- Elasticsearch
- Experience with cross-referenced data mining would be appreciated
Potential mentors
- Kumar Shivendu (KShivendu on IRC)
- Valentin Lorentz (vlorentz on IRC)
- Vincent Sellier (vsellier on IRC)
Other relevant (but independent) tasks
This task is only about adding data we have already collected to the existing Elasticsearch database. You may also be interested in Mine information from archived content (GSoC task) and Mine information from external sources (GSoC task) to fill this database, but those are completely independent tasks.
This database only contains project URLs and metadata, not source code. Source code search is more complex, but is available as an internship topic.