Mine information from archived content (GSoC task)

From Software Heritage Wiki

Latest revision as of 06:42, 11 March 2022

Introduction

In addition to archival, Software Heritage indexes the retrieved source code artifacts, to enable semantic searches on the archive and scientific research.

Indexing can happen at the level of individual files (e.g., detecting the programming language a file is written in, or the license declared in its header), or at a coarser granularity (e.g., extracting the metadata declared for the most recently archived version of a given project).

A number of indexes are currently supported, such as:

  • file level mining:
    • MIME type detection (using libmagic)
    • license detection (using FOSSology/nomossa)
    • language detection (using Pygments)
    • ctags extraction (using universal-ctags)
  • project level mining:
    • Ruby gemspec metadata
    • Python PKG-INFO metadata
    • Maven pom.xml metadata
    • NPM package.json metadata
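
As an illustration of project-level mining, here is a minimal sketch of extracting a few fields from a Python PKG-INFO file. It relies on the fact that PKG-INFO uses RFC 822-style headers, which the standard library's email parser can read. The function name `parse_pkg_info`, the sample data, and the selection of fields are assumptions made for this example; this is not the actual Software Heritage indexer code.

```python
from email.parser import Parser


def parse_pkg_info(text: str) -> dict:
    """Extract a few fields from PKG-INFO text (hypothetical helper)."""
    # PKG-INFO is a block of RFC 822-style headers, optionally followed
    # by a free-form description after the first blank line.
    msg = Parser().parsestr(text)
    return {
        "name": msg.get("Name"),
        "version": msg.get("Version"),
        "license": msg.get("License"),
        # A header may appear several times; get_all collects every value.
        "classifiers": msg.get_all("Classifier") or [],
    }


# Made-up sample data, for illustration only.
SAMPLE = """\
Metadata-Version: 2.1
Name: example-package
Version: 1.0.0
License: MIT
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Software Development

A made-up package used only for this example.
"""

if __name__ == "__main__":
    print(parse_pkg_info(SAMPLE))
```

A real indexer would also need to handle missing files, unusual encodings, and a mapping of the extracted fields to a common metadata vocabulary.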

Task description

Writing additional indexers that extract more information from archived source code is welcome and would constitute a suitable GSoC project.

Name the kind of data mining you want to do!

For inspiration, you can have a look at Libraries.io: most package formats and package managers offer dedicated ways of expressing metadata, and we support only a small number of them so far. But do not restrict your ambition to those; any kind of data extraction or mining you want to do on the archive could work.
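
To give an idea of what supporting an additional format might involve, here is a rough sketch of a registry that maps well-known file names to metadata extractors. Everything in it (`EXTRACTORS`, `register`, `index_file`, and the choice of PHP's composer.json as the extra format) is hypothetical and chosen for illustration; it does not reflect the actual Software Heritage indexer API.

```python
import json
from typing import Callable, Dict, Optional

# Hypothetical registry: manifest file name -> extractor function.
Extractor = Callable[[str], dict]
EXTRACTORS: Dict[str, Extractor] = {}


def register(filename: str) -> Callable[[Extractor], Extractor]:
    """Decorator that registers an extractor for a given file name."""
    def decorator(func: Extractor) -> Extractor:
        EXTRACTORS[filename] = func
        return func
    return decorator


def _common_json_fields(content: str) -> dict:
    """Pick a few fields shared by several JSON-based manifests."""
    data = json.loads(content)
    return {
        "name": data.get("name"),
        "version": data.get("version"),
        "license": data.get("license"),
    }


@register("package.json")
def extract_npm(content: str) -> dict:
    return _common_json_fields(content)


@register("composer.json")
def extract_composer(content: str) -> dict:
    # composer.json is used here only as an example of an extra format.
    return _common_json_fields(content)


def index_file(path: str, content: str) -> Optional[dict]:
    """Return extracted metadata, or None if no extractor matches."""
    filename = path.rsplit("/", 1)[-1]
    extractor = EXTRACTORS.get(filename)
    return extractor(content) if extractor else None


if __name__ == "__main__":
    print(index_file("project/package.json", '{"name": "demo", "version": "1.2.3"}'))
    print(index_file("project/README.md", "# demo"))
```

With this kind of structure, adding a format amounts to writing one function and registering it under the file name that identifies the format.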

You may also add support for multiple formats at once, using an external tool, such as Bolognese or bibliothecary.

Expected duration

175 or 350 hours, at your option (a longer duration lets you tackle more formats). Difficulty: medium.

Desirable skills

  • Python 3 and Git are a must to work on any Software Heritage project
  • Prior experience in working with (source code) metadata is a plus, but not required

Potential mentors

  • Stefano Zacchiroli <zack@upsilon.cc> (zack on IRC)
  • Valentin Lorentz (vlorentz on IRC)
  • Kumar Shivendu (KShivendu on IRC)