Graph compression on the development history of software (internship)
Graph compression on the development history of software
Context: Software Heritage is an ambitious research project whose goal is to collect, preserve in the very long term, and share the whole publicly accessible Free/Open Source Software (FOSS) in source code form.
Description: The Software Heritage data model is a big Merkle DAG made of nodes like revisions, releases, directories, etc. It is a very big graph, with ~10 B nodes and ~100 B edges, which makes it hard to fit in memory using naive approaches. Graph compression techniques have been successfully used to compress the Web graph (which is slightly larger than the Software Heritage one) and make it fit in memory. The goal of this internship is review existing graph compression techniques and apply the most appropriate one to the Software Heritage case, enabling in-memory processing of its Merkle DAG.
Desirable skills to obtain this internship:
- familiarity with system programming
- familiarity with basic graph algorithms
- C development
- working knowledge of basic compression techniques from information theory would be a plus
Workplace: Inria Paris
Environment: you will work shoulder to shoulder with all members of the Software Heritage team, and you will have a chance to witness from within the construction of the ultimate source code archive.
Internship mentors:
- Francesca Bassi <francesca.bassi@l2s.centralesupelec.fr>
- Stefano Zacchiroli <zack@upsilon.cc>
References:
- Paolo Boldi, Sebastiano Vigna, The webgraph framework I: compression techniques, Proceedings of the 13th international conference on World Wide Web. ACM, 2004. pdf
- Paolo Boldi, Marco Rosa, Massimo Santini, Sebastiano Vigna, Layered label propagation: A multiresolution coordinate-free ordering for compressing social networks. arxiv
- Laxman Dhulipala, Igor Kabiljo, Brian Karrer, Giuseppe Ottaviano, Sergey Pupyrev, Alon Shalita, Compressing Graphs and Indexes with Recursive Graph Bisection. arxiv
- WebGraph (Java framework that implements the techniques above)