Difference between revisions of "Google Summer of Code 2019/Graph compression"

From Software Heritage Wiki
Jump to navigation Jump to search
m (Update GSoC portal link)
 
(32 intermediate revisions by 2 users not shown)
Line 1: Line 1:
* '''Title:''' graph compression
+
* '''Title:''' Graph compression
* '''Description:''' The Software Heritage [https://docs.softwareheritage.org/devel/swh-model/data-model.html data model] is a big [https://en.wikipedia.org/wiki/Merkle_tree Merkle] DAG made of nodes like revisions, releases, directories, etc. It is a very big graph, with ~10 B nodes and ~100 B edges, which makes it hard to fit in memory using naive approaches. Graph compression techniques have been successfully used to compress the Web graph (which is slightly larger than the Software Heritage one) and make it fit in memory. The goal of this GSoC project is review existing graph compression techniques and apply the most appropriate one to the Software Heritage case, enabling in-memory processing of its Merkle DAG.
+
 
* '''Student:''' Thibault Allançon
+
* '''Description:''' The Software Heritage [https://docs.softwareheritage.org/devel/swh-model/data-model.html data model] is a big [https://en.wikipedia.org/wiki/Merkle_tree Merkle] DAG made of nodes like revisions, releases, directories, etc. It is a very big graph, with 12 billion nodes and 165 billion edges, which makes it hard to fit in memory using naive approaches. Graph compression techniques have been successfully used to compress the Web graph (which is slightly larger than the Software Heritage one) and make it fit in memory. The goal of this GSoC project is review existing graph compression techniques and apply the most appropriate one to the Software Heritage case, enabling in-memory processing of its Merkle DAG.
 +
 
 +
* '''Student:''' Thibault 'haltode' Allançon
 +
** [https://forge.softwareheritage.org/p/haltode/ forge activity]
 +
** [https://summerofcode.withgoogle.com/archive/2019/projects/6283725761937408/ project on the GSoC portal]
 +
 
 +
* '''What was done:'''
 +
** Git repo: [https://forge.softwareheritage.org/source/swh-graph/ swh-graph] (list of my [https://wiki.softwareheritage.org/wiki/Google_Summer_of_Code_2019/Graph_compression/Commit_list commits])
 +
** Research on graph compression: evaluate/experiment feasibility and compression rate of multiple techniques
 +
** Docker environment and scripts to automate the entire compression pipeline
 +
** REST API to query the graph
 +
*** Java server side: load the compressed graph, run API endpoints traversal algorithms, unit tests, javadoc
 +
*** Python client side: integration with SWH infrastructure, integration tests
 +
** Automated benchmarking tools with statistical analysis
 +
** General documentation on docker environment, compression steps, graph query use-cases, etc.
 +
 
 +
* '''Highlights:'''
 +
** Compression rates: 4.48 bits/edge (transposed graph) and 4.91 bits/edge (direct graph)
 +
** Memory requirements for loading both graphs: 200GB of RAM
 +
** Total compression time: 1 week
 +
** Node successors lookup times: below 2μs
 +
 
 
* '''Mentors:'''
 
* '''Mentors:'''
 
** [[User:StefanoZacchiroli|Stefano Zacchiroli]]
 
** [[User:StefanoZacchiroli|Stefano Zacchiroli]]
 
** Antoine Pietri
 
** Antoine Pietri
* '''Weekly reports:'''
 
** [https://sympa.inria.fr/sympa/arc/swh-devel/2019-05/msg00000.html 2019/19]
 
  
== Links ==
+
* '''Activity reports:'''
 +
** [https://haltode.fr/gsoc2019/march.html March 2019]
 +
** [https://haltode.fr/gsoc2019/april.html April 2019]
 +
** May 2019
 +
*** [https://sympa.inria.fr/sympa/arc/swh-devel/2019-05/msg00000.html week 2019/19]
 +
*** [https://sympa.inria.fr/sympa/arc/swh-devel/2019-05/msg00002.html week 2019/20]
 +
*** [https://sympa.inria.fr/sympa/arc/swh-devel/2019-05/msg00005.html week 2019/21]
 +
*** [https://sympa.inria.fr/sympa/arc/swh-devel/2019-05/msg00015.html week 2019/22]
 +
** June 2019
 +
*** [https://sympa.inria.fr/sympa/arc/swh-devel/2019-06/msg00007.html week 2019/23]
 +
*** [https://sympa.inria.fr/sympa/arc/swh-devel/2019-06/msg00013.html week 2019/24]
 +
*** [https://sympa.inria.fr/sympa/arc/swh-devel/2019-06/msg00023.html week 2019/25]
 +
*** [https://sympa.inria.fr/sympa/arc/swh-devel/2019-06/msg00031.html week 2019/26]
 +
** July 2019
 +
*** [https://sympa.inria.fr/sympa/arc/swh-devel/2019-07/msg00001.html week 2019/27]
 +
*** [https://sympa.inria.fr/sympa/arc/swh-devel/2019-07/msg00005.html week 2019/28]
 +
*** [https://sympa.inria.fr/sympa/arc/swh-devel/2019-07/msg00009.html week 2019/29]
 +
** August 2019
 +
*** [https://sympa.inria.fr/sympa/arc/swh-devel/2019-08/msg00000.html week 2019/31]
 +
*** [https://sympa.inria.fr/sympa/arc/swh-devel/2019-08/msg00003.html week 2019/32]
 +
*** [https://sympa.inria.fr/sympa/arc/swh-devel/2019-08/msg00006.html week 2019/33]
 +
*** [https://sympa.inria.fr/sympa/arc/swh-devel/2019-08/msg00010.html week 2019/34]
  
* [https://forge.softwareheritage.org/source/graph-compression/ source code repository]
+
[[Category:Google Summer of Code]]
* see project [https://summerofcode.withgoogle.com/projects/#4670175868092416 on the GSoC portal]
+
[[Category:Google Summer of Code 2019]]

Latest revision as of 16:53, 11 December 2019

  • Title: Graph compression
  • Description: The Software Heritage data model is a big Merkle DAG made of nodes like revisions, releases, directories, etc. It is a very big graph, with 12 billion nodes and 165 billion edges, which makes it hard to fit in memory using naive approaches. Graph compression techniques have been successfully used to compress the Web graph (which is slightly larger than the Software Heritage one) and make it fit in memory. The goal of this GSoC project is review existing graph compression techniques and apply the most appropriate one to the Software Heritage case, enabling in-memory processing of its Merkle DAG.
  • What was done:
    • Git repo: swh-graph (list of my commits)
    • Research on graph compression: evaluate/experiment feasibility and compression rate of multiple techniques
    • Docker environment and scripts to automate the entire compression pipeline
    • REST API to query the graph
      • Java server side: load the compressed graph, run API endpoints traversal algorithms, unit tests, javadoc
      • Python client side: integration with SWH infrastructure, integration tests
    • Automated benchmarking tools with statistical analysis
    • General documentation on docker environment, compression steps, graph query use-cases, etc.
  • Highlights:
    • Compression rates: 4.48 bits/edge (transposed graph) and 4.91 bits/edge (direct graph)
    • Memory requirements for loading both graphs: 200GB of RAM
    • Total compression time: 1 week
    • Node successors lookup times: below 2μs