Difference between revisions of "Improve project search engine (GSoC task)"

Latest revision as of 09:41, 11 March 2021

Introduction

The homepage of the Software Heritage archive (https://archive.softwareheritage.org/) features a small search engine that searches project URLs and project metadata. Project metadata includes name, description, authors, etc.

By the time GSoC starts, it will be implemented by a very small Python service (under 100 lines of code) backed by Elasticsearch.

Task description

This service is an MVP (Minimum Viable Product) that was written to replace an older service, based on PostgreSQL, which was too slow. There is therefore a lot of room for improvement, both in adding features and in making results more relevant.

Most of the features we have in mind would allow finer-grained search on project metadata (for example, searching a specific field) instead of simply doing a full-text search on the entire metadata; but we are open to suggestions.
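To illustrate the difference between the two kinds of search, here is a minimal sketch that only builds Elasticsearch query bodies as Python dictionaries; it does not contact any server. The function names and the metadata field names ("name", "author") are hypothetical and chosen for illustration; the fields actually indexed by the Software Heritage search service may be named and structured differently.

```python
import json


def full_text_query(text):
    """Build a query body matching `text` against every indexed field."""
    # "fields": ["*"] asks Elasticsearch to search all eligible fields.
    return {"query": {"multi_match": {"query": text, "fields": ["*"]}}}


def fielded_query(**criteria):
    """Build a query body requiring each given field to match its value.

    The keyword names are hypothetical metadata field names.
    """
    clauses = [{"match": {field: value}} for field, value in criteria.items()]
    # A bool/must query only returns documents matching every clause.
    return {"query": {"bool": {"must": clauses}}}


if __name__ == "__main__":
    # Current behaviour, roughly: one search string over all metadata.
    print(json.dumps(full_text_query("parser"), indent=2))
    # Finer search: constrain individual metadata fields separately.
    print(json.dumps(fielded_query(name="parser", author="Jane Doe"), indent=2))
```

With a fielded query, a search for an author name would no longer match projects that merely mention that name in their description, which is one way results could become more relevant.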

Depending on your preferences, the work can either consist purely of backend changes or also involve changes to the web interface to use these new features.

Desirable skills

  • Python 3 and Git are a must to work on any Software Heritage project
  • either Elasticsearch (if you want to improve the existing search component) or an alternative (if you want to rewrite it with a different backend)

Potential mentors

  • Valentin Lorentz (vlorentz on IRC)
  • Vincent Sellier (vsellier on IRC)

Other relevant (but independent) tasks

This task is only about searching the existing Elasticsearch database. You may also be interested in Mine information from archived content (GSoC task) and Mine information from external sources (GSoC task), which are about filling this database, but those are completely independent tasks.

This database only contains project URLs and metadata, not source code. Source code search is more complex, but is available as an internship topic.