Difference between revisions of "Mine information from external sources (GSoC task)"

From Software Heritage Wiki

Latest revision as of 06:42, 11 March 2022

Introduction

In addition to archiving source code artifacts, Software Heritage is interested in archiving metadata from external sources and correlating it with those artifacts. This would also enable semantic searches on the archive and support scientific research.

Collecting this extrinsic metadata is a work in progress, and you are welcome to contribute to its implementation.

Task description

You would contribute to the design of our metadata-fetching architecture. This includes:

  • Review what metadata we want to fetch
  • Work out how to fetch it efficiently at regular intervals and how to store it
  • Implement metadata fetching from at least one source, in a way that can be generalized to other sources
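
To make the last two points more concrete, here is a minimal sketch of one possible shape for a source-agnostic fetcher. It is not the Software Heritage API: every name in it (RawMetadata, MetadataFetcher, StaticFetcher, InMemoryStore, refresh) is hypothetical, the "source" is a plain dictionary so that the example runs without network access, and the in-memory store stands in for whatever persistent storage the real design would use.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class RawMetadata:
    """One metadata document, kept as fetched (hypothetical record type)."""
    origin_url: str       # source code origin the metadata describes
    source_name: str      # external source that provided it
    fetched_at: datetime  # time of the fetch
    payload: bytes        # unparsed document, stored verbatim


class MetadataFetcher(ABC):
    """Interface each external source would implement (hypothetical)."""
    source_name: str

    @abstractmethod
    def fetch(self, origin_url: str) -> Optional[bytes]:
        """Return raw metadata for origin_url, or None if the source has none."""


class StaticFetcher(MetadataFetcher):
    """Stand-in source backed by a dict; a real one would query a remote service."""
    source_name = "static-example"

    def __init__(self, documents: Dict[str, bytes]) -> None:
        self._documents = documents

    def fetch(self, origin_url: str) -> Optional[bytes]:
        return self._documents.get(origin_url)


class InMemoryStore:
    """Append-only store keyed by (origin, source); assumed to be persistent in practice."""

    def __init__(self) -> None:
        self._records: Dict[Tuple[str, str], List[RawMetadata]] = {}

    def add(self, record: RawMetadata) -> None:
        key = (record.origin_url, record.source_name)
        self._records.setdefault(key, []).append(record)

    def latest(self, origin_url: str, source_name: str) -> Optional[RawMetadata]:
        records = self._records.get((origin_url, source_name))
        return records[-1] if records else None


def refresh(fetcher: MetadataFetcher, store: InMemoryStore, origin_url: str,
            now: datetime, interval: timedelta) -> bool:
    """Fetch and store metadata unless a record younger than `interval` exists.

    Returns True only when a new record was stored.
    """
    last = store.latest(origin_url, fetcher.source_name)
    if last is not None and now - last.fetched_at < interval:
        return False  # still fresh, skip this round
    payload = fetcher.fetch(origin_url)
    if payload is None:
        return False  # the source knows nothing about this origin
    store.add(RawMetadata(origin_url, fetcher.source_name, now, payload))
    return True


# Usage: a scheduler would call refresh() periodically for each origin.
example_origin = "https://example.org/repo.git"  # made-up origin URL
example_fetcher = StaticFetcher({example_origin: b'{"license": "GPL-3.0"}'})
example_store = InMemoryStore()
start = datetime(2022, 3, 11, tzinfo=timezone.utc)
stored_first = refresh(example_fetcher, example_store, example_origin,
                       start, timedelta(days=1))
```

Supporting another source would then mean writing one more MetadataFetcher subclass, while the scheduling and storage code stays unchanged. Storing the payload verbatim is a design choice of this sketch: it defers parsing, so formats can be added later without fetching again.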

Expected duration

175 or 350 hours, at your option (a longer duration means you can tackle more formats).

Difficulty: hard

Desirable skills

  • Python 3 and Git are required for work on any Software Heritage project
  • Prior experience in working with software metadata is a plus, but not required

Potential mentors

  • Stefano Zacchiroli <zack@upsilon.cc> (zack on IRC)
  • Valentin Lorentz (vlorentz on IRC)
  • Kumar Shivendu (KShivendu on IRC)