Reverse project phylogenesis (internship)

Context: Software Heritage is an ambitious research project whose goal is to collect, preserve in the very long term, and share all publicly accessible Free/Open Source Software (FOSS) in source code form.

Description: The Software Heritage Archive contains a large number of projects (over one million!) salvaged from code hosting platforms that have been closed down, ranging from large ones like Google Code to small institutional ones like the old Inria gForge, and from platforms that have phased out support for some version control systems, like Bitbucket. Some of these projects migrated to other platforms, where they continued their development.

This means that documents, research articles, blog posts, documentation, and many other sources now contain broken links: the Software Heritage archive provides an easy way to find the archived version of a project, but does not help identify the new repository to which development may have migrated.

The goal of this internship is to explore heuristics that exploit the special features of the Software Heritage Merkle graph to identify repositories that may be the new development strand of an old repository saved from a discontinued platform, and to show these links in the relevant repositories: this corresponds to developing the phylogenesis of a software project.

One of the challenges will be to compare various heuristics and scale the approach up to the millions of repositories involved.
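
As an illustration only, here is a minimal sketch of one possible heuristic, assuming that each repository (origin) can be summarized as the set of revision identifiers reachable from it. Because a Merkle graph stores each revision once regardless of how many origins contain it, two origins that share revisions share history. The function names, the threshold, and the toy data below are hypothetical; this is not the Software Heritage API and not a method prescribed by the internship.

```python
from collections import defaultdict


def build_index(origins):
    """Map each revision ID to the set of origins that contain it."""
    index = defaultdict(set)
    for origin, revisions in origins.items():
        for rev in revisions:
            index[rev].add(origin)
    return index


def candidate_successors(old_origin, origins, index, threshold=0.5):
    """Rank other origins by the share of old_origin's revisions they contain."""
    old_revs = origins[old_origin]
    if not old_revs:
        return []
    # Count shared revisions per origin through the inverted index,
    # which avoids comparing old_origin against every other origin.
    shared = defaultdict(int)
    for rev in old_revs:
        for other in index[rev]:
            if other != old_origin:
                shared[other] += 1
    scored = [(other, count / len(old_revs)) for other, count in shared.items()]
    # Keep candidates above the (hypothetical) threshold, best score first.
    return sorted(
        [(o, s) for o, s in scored if s >= threshold],
        key=lambda pair: (-pair[1], pair[0]),
    )


# Toy data: origin URLs and revision IDs are placeholders, not real hashes.
origins = {
    "https://old-forge.example/proj": {"r1", "r2", "r3", "r4"},
    "https://new-forge.example/proj": {"r1", "r2", "r3", "r4", "r5", "r6"},
    "https://new-forge.example/fork": {"r1", "r2", "r7"},
    "https://new-forge.example/unrelated": {"r8", "r9"},
}
index = build_index(origins)
ranking = candidate_successors("https://old-forge.example/proj", origins, index)
```

The score is a containment ratio rather than a symmetric similarity, since a successor repository is expected to keep growing after the migration. Looking candidates up through the inverted index, instead of comparing all pairs of repositories, is one way such a heuristic might be scaled up.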

Desirable skills to obtain this internship:

  • Python development
  • understanding of version control systems (git in particular) and familiarity with code hosting platforms

Workplace: on site at Inria Paris (contact mentors for remote opportunities)

Environment: you will work shoulder to shoulder with all members of the Software Heritage team, and you will have a chance to witness from within the construction of the great library of source code.

Internship mentors:

  • Roberto Di Cosmo (rdicosmo on IRC)

