Difference between revisions of "Archive search query language (internship)"

From Software Heritage Wiki
 
(3 intermediate revisions by 2 users not shown)
Line 1: Line 1:
'''Context''': {{Internship context}}
+
{{Internship
 
+
|description=The current [https://archive.softwareheritage.org/browse/search/ archive search engine] accepts a single list of tokens that are searched across either origin URLs or [https://www.softwareheritage.org/2019/05/28/mining-software-metadata-for-80-m-projects-and-even-more/ extracted metadata].
'''Description''': The current [https://archive.softwareheritage.org/browse/search/ archive search engine] accepts a single list of tokens that are searched across either origin URLs or [https://www.softwareheritage.org/2019/05/28/mining-software-metadata-for-80-m-projects-and-even-more/ extracted metadata].
 
 
The goal of this internship is to design and implement an archive query language that allows mixing terms, boolean connectors, and operators (as [https://support.google.com/websearch/answer/2466433 other] [https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html search] [https://help.qwant.com/help/qwant-junior/refine-search-with-operators/ engines] do).
 
The goal of this internship is to design and implement an archive query language that allows mixing terms, boolean connectors, and operators (as [https://support.google.com/websearch/answer/2466433 other] [https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html search] [https://help.qwant.com/help/qwant-junior/refine-search-with-operators/ engines] do).
 
Part of this internship is language design, part language parsing, and part language evaluation (by linking together query results returned by the already existing search backends for the archive).
 
Part of this internship is language design, part language parsing, and part language evaluation (by linking together query results returned by the already existing search backends for the archive).
  
'''Desirable skills''' to obtain this internship:
+
|skills=
 
* Python development
 
* Python development
  
Line 11: Line 10:
 
* experience with data mining and/or information retrieval and/or web searches
 
* experience with data mining and/or information retrieval and/or web searches
  
'''Workplace''': {{Internship workplace}}
+
|mentors=
 
+
* Antoine Lambert (anlambert on [[IRC]])
'''Environment''': {{Internship environment}}
+
* Valentin Lorentz (vlorentz on [[IRC]])
 
+
* Stefano Zacchiroli <zack@upsilon.cc> (zack on [[IRC]])
'''Internship mentors''':
+
}}
* Antoine Lambert
 
* Valentin Lorentz
 
* Stefano Zacchiroli <zack@upsilon.cc>
 
 
 
  
[[Category:Available internship]]
+
[[Category:Completed internship]]
[[Category:Internship]]
 
[[Category:Lang:English]]
 

Latest revision as of 12:43, 16 February 2022

Context: Software Heritage is an ambitious research project whose goal is to collect, preserve in the very long term, and share the whole publicly accessible Free/Open Source Software (FOSS) in source code form.

Description: The current archive search engine accepts a single list of tokens that are searched across either origin URLs or extracted metadata. The goal of this internship is to design and implement an archive query language that allows mixing terms, boolean connectors, and operators (as other search engines do). The internship is part language design, part language parsing, and part language evaluation (by combining the query results returned by the existing search backends for the archive).
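To make the three parts concrete, here is a minimal Python sketch of what such a language could look like. Everything in it is an assumption made for illustration: the syntax (bare terms, `field:value` operators, `AND`/`OR`/`NOT`, parentheses), the function names, and the in-memory `ORIGINS` table that stands in for the real search backends. It is not the design that the internship is expected to produce.

```python
import re

# Hypothetical stand-in for the existing search backends (not real archive data).
ORIGINS = {
    "https://github.com/python/cpython": {"language": "c", "license": "psf"},
    "https://gitlab.com/inkscape/inkscape": {"language": "c++", "license": "gpl"},
    "https://github.com/rust-lang/rust": {"language": "rust", "license": "mit"},
}
METADATA_FIELDS = {"language", "license"}  # assumed operator names
TOKEN_RE = re.compile(r"\(|\)|[^\s()]+")


def tokenize(query):
    """Split a query into parentheses and whitespace-delimited words."""
    return TOKEN_RE.findall(query)


def search_url(word):
    """Stand-in backend: origins whose URL contains the word."""
    return {url for url in ORIGINS if word.lower() in url.lower()}


def search_metadata(field, value):
    """Stand-in backend: origins whose metadata field equals the value."""
    return {u for u, meta in ORIGINS.items() if meta.get(field) == value.lower()}


class Parser:
    """Recursive-descent parser that evaluates while parsing.

    Precedence, lowest to highest: OR, AND (explicit or implicit), NOT.
    """

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def next(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def parse_or(self):
        result = self.parse_and()
        while self.peek() == "OR":
            self.next()
            result = result | self.parse_and()  # union of both result sets
        return result

    def parse_and(self):
        result = self.parse_not()
        while self.peek() not in (None, "OR", ")"):
            if self.peek() == "AND":  # AND is optional between adjacent terms
                self.next()
            result = result & self.parse_not()  # intersection
        return result

    def parse_not(self):
        if self.peek() == "NOT":
            self.next()
            return set(ORIGINS) - self.parse_not()  # complement
        return self.parse_atom()

    def parse_atom(self):
        tok = self.next()
        if tok is None or tok in (")", "AND", "OR"):
            raise ValueError(f"unexpected token {tok!r}")
        if tok == "(":
            result = self.parse_or()
            if self.next() != ")":
                raise ValueError("missing closing parenthesis")
            return result
        field, sep, value = tok.partition(":")
        if sep and field in METADATA_FIELDS and value:
            return search_metadata(field, value)
        return search_url(tok)


def evaluate(query):
    """Parse and evaluate a query, returning the set of matching origin URLs."""
    parser = Parser(tokenize(query))
    result = parser.parse_or()
    if parser.peek() is not None:
        raise ValueError(f"unexpected token {parser.peek()!r}")
    return result
```

With this sketch, `evaluate("github AND NOT language:rust")` intersects the URL matches for `github` with the complement of the `language:rust` matches. A real implementation would more likely build a syntax tree first and translate it into backend queries, rather than materializing result sets in memory.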

Desirable skills to obtain this internship:

  • Python development

Will be considered a plus:

  • experience with data mining and/or information retrieval and/or web searches

Workplace: on site at Inria Paris (contact mentors for remote opportunities)

Environment: you will work shoulder to shoulder with all members of the Software Heritage team, and you will have a chance to witness from within the construction of the great library of source code.

Internship mentors:

  • Antoine Lambert (anlambert on IRC)
  • Valentin Lorentz (vlorentz on IRC)
  • Stefano Zacchiroli <zack@upsilon.cc> (zack on IRC)

See also