WG/Source Discovery and Ingestion

From Software Heritage Wiki


Collecting publicly available source code is an essential part of Software Heritage's mission. To fulfill this mission, we need to discover, harvest, and keep up to date content coming from a very diverse set of origins, such as:

  • several kinds of source code repositories, including:
    • mainstream development platforms, like GitHub, Bitbucket, SourceForge or CodePlex;
    • institutional forges, like Inria's, Cenatic's or Adullact's FusionForge;
    • community repositories, like Debian's FusionForge, Joomla's GForge, GNU's Savane, and the Apache or Eclipse custom repositories;
    • a variety of source code archives, ranging from the GNU FTP server to individual web pages.

This is a challenging task, and succeeding at it will require the involvement of a large community.


The mission of the Source Discovery and Ingestion (SODI) working group is to foster the development and adoption of software components that, for a given origin, make it discoverable and list its contents. Listing should be possible both in full, which is useful for a newly discovered origin, and incrementally, which is useful for keeping the Software Heritage archive up to date with the origin's evolving content.
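
As an illustration of the two listing modes described above, here is a minimal sketch in Python. It is not the Software Heritage API: the names `Item`, `InMemoryOrigin`, `list_full`, `list_since` and `sync`, as well as the integer cursor, are assumptions made for this example only.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass(frozen=True)
class Item:
    """One piece of content at an origin (hypothetical model)."""
    url: str
    revision: int  # position in the origin's update order (assumed to exist)


@dataclass
class InMemoryOrigin:
    """Toy stand-in for a forge; a real component would query the origin."""
    items: List[Item] = field(default_factory=list)

    def list_full(self) -> Iterator[Item]:
        """Full listing: every item, as needed for a newly discovered origin."""
        yield from self.items

    def list_since(self, cursor: Optional[int]) -> Iterator[Item]:
        """Incremental listing: only items newer than the saved cursor."""
        for item in self.items:
            if cursor is None or item.revision > cursor:
                yield item


def sync(origin: InMemoryOrigin, cursor: Optional[int]) -> Tuple[List[Item], Optional[int]]:
    """Fetch what changed since `cursor` and return it with the new cursor."""
    new_items = list(origin.list_since(cursor))
    if new_items:
        cursor = max(item.revision for item in new_items)
    return new_items, cursor
```

In this sketch the archive would store the returned cursor per origin, so that a later call to `sync` only has to handle content that appeared since the previous visit. A cursor of `None` stands for an origin that has never been visited.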


This working group is open-ended.

Expected outcomes

The main expected outcomes of the SODI working group are listed below:

APIs for discovering and tracking origins: The SODI working group will define and evolve, in collaboration with the Software Heritage core team, standard APIs for software components that can be plugged into the Software Heritage infrastructure to track an origin or a class of origins. Whenever possible, proactive mechanisms that inform the Software Heritage infrastructure of content updates, such as event feeds, will be preferred to approaches that require periodic polling of an origin.

Adoption: The SODI working group will strive, whenever possible, to have such components incorporated into the upstream code of the forges and widely adopted.

Awareness: The SODI working group will establish the relevant connections to raise awareness among all interested parties.
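
The preference stated under the first outcome, event feeds over periodic polling, could look roughly like the following Python sketch. Everything here is hypothetical: `FeedTracker`, `PollingTracker` and `attach` are illustrative names, not part of any existing Software Heritage or forge API.

```python
from typing import Callable, Iterable, List, Set


class FeedTracker:
    """Origin that pushes update events to subscribers (hypothetical)."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[str], None]] = []

    def subscribe(self, callback: Callable[[str], None]) -> None:
        self._subscribers.append(callback)

    def emit(self, update: str) -> None:
        # Called on the origin side whenever its content changes.
        for callback in self._subscribers:
            callback(update)


class PollingTracker:
    """Origin that must be asked periodically for its state (hypothetical)."""

    def __init__(self, fetch: Callable[[], Iterable[str]]) -> None:
        self._fetch = fetch
        self._seen: Set[str] = set()

    def poll(self) -> List[str]:
        # Report only entries that were not seen during earlier polls.
        new = [entry for entry in self._fetch() if entry not in self._seen]
        self._seen.update(new)
        return new


def attach(tracker, on_update: Callable[[str], None]) -> str:
    """Prefer event subscription; fall back to a single polling pass."""
    if hasattr(tracker, "subscribe"):
        tracker.subscribe(on_update)
        return "push"
    for update in tracker.poll():
        on_update(update)
    return "poll"
```

With a push-capable origin, `attach` registers the callback once and the origin reports changes as they happen. With a polling-only origin, the infrastructure has to schedule repeated calls to `poll`, which is the cost the working group would like to avoid where it can.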


A first set of requirements for the API and a preliminary draft of the API are expected to emerge from the work performed to ensure that Inria's own forge(s) are properly tracked in the Software Heritage architecture.

Related working groups

This working group is related to: Modeling and Ingesting Version control systems (MIV)

Team contact(s)


Documents produced by the working group will be listed in this section.


Active or planned connections to other initiatives and activities will be listed in this section.


Mailing list