Difference between revisions of "Graph query language for the archive (internship)"

From Software Heritage Wiki
Jump to: navigation, search
(not listed yet)
m
 
(13 intermediate revisions by 2 users not shown)
Line 1: Line 1:
 
{{Internship
 
{{Internship
|description=The Software Heritage archive is structured as a graph (specifically, a [https://en.wikipedia.org/wiki/Merkle_tree Merkle DAG]) and is huge: tens of billion nodes, hundreds of billion edges.
+
|description=The Software Heritage archive is structured as a graph (specifically, a [https://en.wikipedia.org/wiki/Merkle_tree Merkle DAG]) and is huge: 20 billion nodes, 200 billion edges.
It has recently been verified that a compressed representation of the graph structure can fit in memory, whereas node/edge properties can be memory-mapped to secondary storage (see: [https://docs.softwareheritage.org/devel/swh-graph/ documentation] and in particular the paper referenced there).
+
It has recently been verified that a compressed representation of the graph structure can fit in memory, whereas node/edge properties can be memory-mapped to secondary storage (see: [https://docs.softwareheritage.org/devel/swh-graph/ documentation] and in particular the [https://arxiv.org/abs/2001.08647 SANER 2020 paper referenced there]).
 
An [https://docs.softwareheritage.org/devel/swh-graph/api.html ad hoc API] is available to traverse the graph, with very limited querying capabilities.
 
An [https://docs.softwareheritage.org/devel/swh-graph/api.html ad hoc API] is available to traverse the graph, with very limited querying capabilities.
 
The goal of this internship is to experiment with the possibility of querying the archive graph via state-of-the-art [https://en.wikipedia.org/wiki/Graph_database#Graph_query-programming_languages graph query languages].
 
The goal of this internship is to experiment with the possibility of querying the archive graph via state-of-the-art [https://en.wikipedia.org/wiki/Graph_database#Graph_query-programming_languages graph query languages].
 +
 
The ideal outcome of the internship will be a prototype of query engine that answers queries on top of the compressed graph representation plus associated property maps.
 
The ideal outcome of the internship will be a prototype of query engine that answers queries on top of the compressed graph representation plus associated property maps.
The tentative target platform for implementation is [https://neo4j.com/ Neo4j] extended with a compatibility layer for WebGraph [http://webgraph.di.unimi.it/docs/it/unimi/dsi/webgraph/ImmutableGraph.html ImmutableGraph]s, but other options might be considered.
+
A prototype implementation will be developed for a target platform to be chosen during the internship.
 +
Tentative target platforms include: [https://neo4j.com/ Neo4j] and [http://tinkerpop.apache.org/ Apache TinkerPop].
 +
In either case a backend/compatibility layer for WebGraph [http://webgraph.di.unimi.it/docs/it/unimi/dsi/webgraph/ImmutableGraph.html ImmutableGraph] will be developed.
  
 
|skills=
 
|skills=
Line 12: Line 15:
  
 
Will be considered a plus:
 
Will be considered a plus:
 +
* experience with graph traversal languages (e.g., [https://en.wikipedia.org/wiki/Gremlin_(query_language) Gremlin])
 
* experience with graph databases (e.g., [https://neo4j.com/ Neo4j])
 
* experience with graph databases (e.g., [https://neo4j.com/ Neo4j])
 +
 +
|workplace=[https://liris.cnrs.fr/en LIRIS (Univ. Lyon 1, Lyon)] or [https://www.inria.fr/en/centre-inria-de-paris Inria Paris] or contact mentors for remote opportunities
  
 
|mentors=
 
|mentors=
* [https://perso.liris.cnrs.fr/angela.bonifati/ Angela Bonifati] <angela.bonifati@gmail.com>
+
* [https://perso.liris.cnrs.fr/angela.bonifati/ Angela Bonifati] <angela.bonifati@univ-lyon1.fr>
* Stefano Zacchiroli <zack@upsilon.cc>
+
* Stefano Zacchiroli <zack@upsilon.cc> (Zack on [[Matrix]])
 
}}
 
}}
  
[[Category:Internship]]
+
[[Category:Available internship]]
[[Category:Lang:English]]
 

Latest revision as of 15:08, 4 February 2024

Context: Software Heritage is an ambitious research project whose goal is to collect, preserve in the very long term, and share the whole publicly accessible Free/Open Source Software (FOSS) in source code form.

Description: The Software Heritage archive is structured as a graph (specifically, a Merkle DAG) and is huge: 20 billion nodes, 200 billion edges. It has recently been verified that a compressed representation of the graph structure can fit in memory, whereas node/edge properties can be memory-mapped to secondary storage (see: documentation and in particular the SANER 2020 paper referenced there). An ad hoc API is available to traverse the graph, with very limited querying capabilities. The goal of this internship is to experiment with the possibility of querying the archive graph via state-of-the-art graph query languages.

The ideal outcome of the internship will be a prototype of query engine that answers queries on top of the compressed graph representation plus associated property maps. A prototype implementation will be developed for a target platform to be chosen during the internship. Tentative target platforms include: Neo4j and Apache TinkerPop. In either case a backend/compatibility layer for WebGraph ImmutableGraph will be developed.

Desirable skills to obtain this internship:

  • Java development
  • Query languages for structured, semi-structured, or graph data (e.g., one or more among: SQL, Xquery, GraphQL, SPARQL, GQL, etc.)

Will be considered a plus:

  • experience with graph traversal languages (e.g., Gremlin)
  • experience with graph databases (e.g., Neo4j)

Workplace: LIRIS (Univ. Lyon 1, Lyon) or Inria Paris or contact mentors for remote opportunities

Environment: you will work shoulder to shoulder with all members of the Software Heritage team, and you will have a chance to witness from within the construction of the great library of source code.

Internship mentors:

See also