Difference between revisions of "WG/Source Discovery and Ingestion"

From Software Heritage Wiki

Latest revision as of 13:43, 31 July 2016

Charter

Collecting the source code that is publicly available is an essential part of Software Heritage's mission. To fulfill this mission, we need to discover, harvest, and keep up to date content coming from a very diverse set of possible origins, such as:

  • several kinds of source code repositories, including:
    • mainstream development platforms, like GitHub, Bitbucket, SourceForge or CodePlex;
    • institutional forges, like Inria's, Cenatic's or Adullact's FusionForge;
    • community repositories, like Debian's FusionForge or Joomla's GForge, GNU's Savane, and the Apache or Eclipse custom repositories;
    • a variety of different source code archives, ranging from the GNU FTP server to individual web pages.

This is a challenging task, and it can only succeed with the involvement of a large community.

Mission

The mission of the Source Discovery and Ingestion (SODI) working group is to foster the development and the adoption of software components that, for a given origin, make it discoverable and list its contents in two ways: in its entirety, which is useful for a newly discovered origin, and incrementally, which is useful for keeping the Software Heritage archive up to date with the origin's evolving content.
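The two listing modes described above can be illustrated with a minimal Python sketch. It is not an API defined by the working group: the names `Item`, `OriginLister`, `list_full`, `list_since` and `InMemoryLister`, as well as the `forge.example.org` URLs, are hypothetical and chosen only for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterator, List


@dataclass(frozen=True)
class Item:
    """One piece of content hosted at an origin (hypothetical model)."""
    url: str
    last_modified: datetime


class OriginLister(ABC):
    """Hypothetical interface for a component that lists one origin."""

    @abstractmethod
    def list_full(self) -> Iterator[Item]:
        """Yield every item at the origin (for a newly discovered origin)."""

    @abstractmethod
    def list_since(self, moment: datetime) -> Iterator[Item]:
        """Yield only the items modified after `moment` (to stay up to date)."""


class InMemoryLister(OriginLister):
    """Toy lister backed by a list, standing in for a real forge."""

    def __init__(self, items: List[Item]) -> None:
        self._items = list(items)

    def list_full(self) -> Iterator[Item]:
        return iter(self._items)

    def list_since(self, moment: datetime) -> Iterator[Item]:
        return (item for item in self._items if item.last_modified > moment)


# Usage: one full listing, then an incremental listing from 1 March 2016.
forge = InMemoryLister([
    Item("https://forge.example.org/project-a",
         datetime(2016, 1, 10, tzinfo=timezone.utc)),
    Item("https://forge.example.org/project-b",
         datetime(2016, 6, 1, tzinfo=timezone.utc)),
])
full = [item.url for item in forge.list_full()]
recent = [item.url
          for item in forge.list_since(datetime(2016, 3, 1, tzinfo=timezone.utc))]
```

In this sketch `full` contains both projects, while `recent` contains only the project modified after the given moment. A real component would replace the in-memory list with calls to the origin itself.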

Duration

This working group is open-ended.

Expected outcomes

The main expected outcomes of the SODI working group are listed below:

APIs for discovering and tracking origins
The SODI working group will define and evolve, in collaboration with the Software Heritage core team, standard APIs for software components that can be plugged into the Software Heritage infrastructure to track an origin or a class of origins. Whenever possible, proactive mechanisms that inform the Software Heritage infrastructure of content updates, such as event feeds, will be preferred to approaches that require periodic polling of an origin.
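The preference for proactive notification over periodic polling can be sketched as follows. This is only an illustration under stated assumptions, not the API the working group will define: `UpdateQueue`, `PushTracker`, `PollingTracker`, the `fetch_state` callback and the example URLs are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class UpdateQueue:
    """Hypothetical archive-side queue of origin URLs waiting to be revisited."""
    pending: List[str] = field(default_factory=list)

    def schedule(self, origin_url: str) -> None:
        # Avoid queuing the same origin twice.
        if origin_url not in self.pending:
            self.pending.append(origin_url)


class PushTracker:
    """Preferred mode: the origin announces its own updates (e.g. an event feed)."""

    def __init__(self, queue: UpdateQueue) -> None:
        self._queue = queue

    def on_event(self, origin_url: str) -> None:
        self._queue.schedule(origin_url)


class PollingTracker:
    """Fallback mode: the archive periodically asks the origin for its state."""

    def __init__(self, queue: UpdateQueue,
                 fetch_state: Callable[[str], str]) -> None:
        self._queue = queue
        self._fetch_state = fetch_state      # hypothetical: returns a state marker
        self._last_seen: Dict[str, str] = {}

    def poll(self, origin_url: str) -> bool:
        """Return True and schedule a visit if the origin changed since last poll."""
        state = self._fetch_state(origin_url)
        changed = self._last_seen.get(origin_url) != state
        self._last_seen[origin_url] = state
        if changed:
            self._queue.schedule(origin_url)
        return changed


# Usage: project-a pushes an event; project-b has no feed and must be polled.
url_a = "https://forge.example.org/project-a"
url_b = "https://forge.example.org/project-b"
queue = UpdateQueue()
PushTracker(queue).on_event(url_a)

states = {url_b: "rev-1"}
poller = PollingTracker(queue, lambda url: states[url])
first = poller.poll(url_b)    # origin never seen before, so it counts as changed
second = poller.poll(url_b)   # nothing changed since the previous poll
states[url_b] = "rev-2"
third = poller.poll(url_b)    # the state marker moved, so a change is detected
```

With push, the archive does work only when the origin reports a change. With polling, every call to `poll` costs a request even when nothing has changed, which is why the fallback is less attractive for large numbers of origins.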

Adoption
The SODI working group will strive, whenever possible, to have such components incorporated into the upstream code of the forges and widely adopted.

Awareness
The SODI working group will establish the relevant connections in order to raise awareness among all interested parties.

Milestones

A first set of requirements for the API and a preliminary draft of the API are expected to emerge from the work performed to ensure that Inria's own forge(s) are properly tracked in the Software Heritage architecture.

Related working groups

This working group is related to: Modeling and Ingesting Version control systems (MIV)

Team contact(s)

  • Stefano Zacchiroli (http://upsilon.cc/~zack/)

Documents

Documents produced by the working group will be listed in this section.

Connections

Active or planned connections to other initiatives and activities will be listed in this section.

Infrastructure

Mailing list

  • https://sympa.inria.fr/sympa/info/sodi-wg-swh