Software Heritage in a bottle - local repository mining toolchain (internship)

From Software Heritage Wiki

Context: Software Heritage is an ambitious initiative whose goal is to collect, preserve forever, and make publicly available the entire body of software, in the preferred form for making modifications to it.

Description: The goal of this internship is to develop a fully automated "mini Software Heritage" pipeline, capable of crawling a (potentially large) set of Version Control System repositories and storing all their information in a local deployment of Software Heritage software components.

This will make it possible to analyze all crawled data locally and efficiently. Intended use cases include:

  1. mining of private (or hybrid) development forges, e.g., for inner source industry scenarios
  2. testing and validation of mining software repositories (MSR) analyses, e.g., for the research needs of the SWHSec project

Objectives: The primary goal of this internship is to develop scripts and tools to generate a local, fully functional mini Software Heritage environment. This environment will mimic the larger SWH infrastructure and support complete export and import functionalities. It will be used to test algorithms and validate scenarios by applying various manipulations to software repositories.
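As an illustration of what "applying various manipulations to software repositories" could look like in practice, here is a minimal sketch that turns a declarative scenario into the git command lines needed to build it. The `Commit` class and the `plan_git_commands` function are hypothetical names chosen for this example, not part of any Software Heritage component; only standard git subcommands and options (`init`, `add`, `commit -m`, `--allow-empty`, `-C`) are assumed.

```python
from dataclasses import dataclass, field


@dataclass
class Commit:
    """One scripted commit (hypothetical helper type for this sketch)."""
    message: str
    # Relative path -> file content to write before committing.
    files: dict = field(default_factory=dict)


def plan_git_commands(repo_dir, commits):
    """Return the git command lines (argv lists) that would build the scenario.

    Nothing is executed here. A real driver would write each file to disk
    before running the matching `git add`, then run the commands in order.
    """
    cmds = [["git", "init", repo_dir]]
    for commit in commits:
        # Sort paths so the generated plan is deterministic.
        for path in sorted(commit.files):
            cmds.append(["git", "-C", repo_dir, "add", "--", path])
        commit_cmd = ["git", "-C", repo_dir, "commit"]
        if not commit.files:
            # git refuses a commit with no changes unless told otherwise.
            commit_cmd.append("--allow-empty")
        commit_cmd += ["-m", commit.message]
        cmds.append(commit_cmd)
    return cmds
```

Keeping the plan separate from its execution makes a scenario easy to inspect and to compare across runs before any repository is actually created.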

Specific objectives:

  1. Scenarization: automate the creation and management of local Git repositories (for testing purposes).
  2. Deploy a local SWH ingestion pipeline.
  3. Automate the initial, periodic, and on-demand (re)crawling of Git repositories.
  4. Export all indexed data, periodically and on demand, in the same formats as those exported by the SWH archive (compressed graph and ORC formats).
  5. Develop a local SWH object storage instance to allow accessing individual file contents.
  6. Automate all relevant workflows in a Continuous Integration (CI) environment.
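
The distinction in objective 3 between initial, periodic, and on-demand (re)crawling can be sketched as a small selection rule. The code below is an assumption-laden illustration, not the SWH scheduler: the `Origin` record, its fields, and the `due_origins` and `mark_visited` functions are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Origin:
    """A repository to crawl (hypothetical record for this sketch)."""
    url: str
    last_visit: Optional[float] = None  # seconds since epoch; None = never crawled
    requested: bool = False             # set when someone asks for an on-demand crawl


def due_origins(origins, now, period):
    """Return the URLs that should be crawled at time `now`.

    An origin is due if it was never visited (initial crawl), if a crawl
    was explicitly requested (on demand), or if its last visit is at
    least `period` seconds old (periodic recrawl).
    """
    due = []
    for origin in origins:
        never_visited = origin.last_visit is None
        stale = (not never_visited) and (now - origin.last_visit >= period)
        if never_visited or origin.requested or stale:
            due.append(origin.url)
    return due


def mark_visited(origin, now):
    """Record a completed crawl and clear any pending on-demand request."""
    origin.last_visit = now
    origin.requested = False
```

A driver loop would call `due_origins` at regular intervals, crawl each returned URL, and call `mark_visited` afterwards; passing `now` explicitly rather than reading the clock inside the function keeps the rule easy to test.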

Expected outcomes: By the end of the internship, we expect the following deliverables:

  1. A set of scripts and tools to create local Git repositories, apply manipulations, index them, and export the data.
  2. A minimal local server setup that replicates the API functionalities of the SWH project.
  3. Automated test cases integrated into a CI system to ensure ongoing validation of different manipulation scenarios.
  4. Documentation and guidelines on using the developed tools and reproducing the test scenarios.
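
To give an idea of what accessing individual file contents through a local storage could involve, here is a minimal in-memory sketch keyed by the Git-compatible blob hash (the SHA-1 of a `blob <length>` header, a NUL byte, and the data), which is also the hash used in SWHIDs for contents (`swh:1:cnt:...`). The `ContentStore` class and the helper function names are hypothetical and stand in for a real object storage; they are not the API of any Software Heritage component.

```python
import hashlib


def sha1_git(data: bytes) -> str:
    """Return the Git-compatible blob hash of `data` as a hex string."""
    # Git hashes the header "blob <length>" + NUL byte, followed by the data.
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def content_swhid(data: bytes) -> str:
    """Return the SWHID of a file content (version 1, object type `cnt`)."""
    return "swh:1:cnt:" + sha1_git(data)


class ContentStore:
    """Hypothetical in-memory stand-in for a local object storage."""

    def __init__(self):
        self._blobs = {}

    def add(self, data: bytes) -> str:
        """Store `data` and return the key under which it can be retrieved."""
        key = sha1_git(data)
        self._blobs[key] = data
        return key

    def get(self, key: str) -> bytes:
        """Return the content stored under `key`; raises KeyError if absent."""
        return self._blobs[key]
```

Because the key is derived from the content itself, adding the same file twice stores it only once, and the empty file always maps to `e69de29bb2d1d6434b8b29ae775ad8c2e48c5391`, the same identifier Git assigns to an empty blob.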

Desirable skills for this internship:

  • Strong proficiency in Python.
  • Knowledge of Rust is a plus.
  • Familiarity with Git and related tools.
  • Basic understanding of software version control and indexing.
  • Experience with API development and server setup.
  • Ability to work independently and collaboratively in a team environment.

Workplace: on site at Télécom Paris (contact mentors for alternative hosting options)

Environment: The intern will be supervised by members of the SWH and SWHSec project teams, which include experts in software preservation, security, and data management.

Internship mentors:

  • Samuel Tardieu <samuel.tardieu@telecom-paris.fr> (Sam on Matrix)
  • Stefano Zacchiroli <stefano.zacchiroli@telecom-paris.fr> (Zack on Matrix)
