Improve the Vault (GSoC task)
Potential mentors
- Nicolas Dandrimont (olasd on IRC)
- Stefano Zacchiroli <zack@upsilon.cc> (zack on IRC)
- Valentin Lorentz (vlorentz on IRC)
Latest revision as of 10:33, 2 March 2021
Introduction
The Software Heritage archive allows archived objects to be retrieved in different formats. Once an object has been chosen for retrieval, it can be "cooked" using the Software Heritage Vault (https://docs.softwareheritage.org/devel/swh-vault/index.html).
Task description
Right now the Vault has several limitations: it only handles two kinds of objects (revisions and directories), it has to query the database recursively to get the full subgraph of an object, and it generates revisions in an impractical format (git fast-import).
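To make the cost of the second limitation concrete, here is a minimal sketch of fetching a subgraph one object at a time. It is not the Vault's code: `EDGES`, `CountingStore` and `fetch_subgraph` are hypothetical names, the object identifiers are invented, and the traversal is written iteratively with a queue rather than with recursion. The point is only that every reachable object costs one lookup.

```python
from collections import deque

# Hypothetical toy "database": object id -> ids of the objects it references.
# In the real archive these would be revisions, directories and contents.
EDGES = {
    "rev1": ["dir1", "rev0"],
    "rev0": ["dir0"],
    "dir1": ["file_a", "file_b"],
    "dir0": ["file_a"],
    "file_a": [],
    "file_b": [],
}


class CountingStore:
    """Stand-in for a database client that counts lookups (round trips)."""

    def __init__(self, edges):
        self._edges = edges
        self.queries = 0

    def children(self, obj_id):
        self.queries += 1
        return self._edges[obj_id]


def fetch_subgraph(store, root):
    """Collect every object reachable from root, one lookup per object."""
    seen = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in store.children(current):
            if child not in seen:  # shared objects are only visited once
                seen.add(child)
                queue.append(child)
    return seen


store = CountingStore(EDGES)
subgraph = fetch_subgraph(store, "rev1")
print(len(subgraph), "objects fetched with", store.queries, "lookups")
```

In this toy model the number of lookups grows with the number of objects in the subgraph, which is why a large revision history is expensive to cook when each lookup is a database round trip.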
Several improvements are possible:
- add coverage for new kinds of objects (releases, snapshots and even origins?)
- use our in-memory graph database, swh-graph, to speed up fetching the necessary subgraphs
- write cookers to output new formats (e.g. Git tarballs, Git bundles, or even other VCSs?)
- improve unit and end-to-end testing
- other general code improvements (better progress/error reporting in the frontend, etc.)
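As a rough illustration of what writing a cooker involves, the sketch below packs a small directory tree into a gzip-compressed tarball in memory, using only the Python standard library. It does not use the Vault's cooker interface: `cook_directory`, `_add_entries` and `EXAMPLE_TREE` are hypothetical, and the nested dictionary merely stands in for a directory object fetched from the archive.

```python
import io
import tarfile


def cook_directory(tree, prefix="cooked"):
    """Pack a nested dict (name -> bytes or sub-dict) into a .tar.gz in memory."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        _add_entries(archive, tree, prefix)
    return buffer.getvalue()


def _add_entries(archive, tree, path):
    # Sort entries so the archive layout does not depend on dict ordering.
    for name, entry in sorted(tree.items()):
        entry_path = f"{path}/{name}"
        info = tarfile.TarInfo(entry_path)
        if isinstance(entry, dict):
            # Sub-directory: emit a directory member, then recurse into it.
            info.type = tarfile.DIRTYPE
            info.mode = 0o755
            archive.addfile(info)
            _add_entries(archive, entry, entry_path)
        else:
            # Regular file: the size must be set before the data is added.
            info.size = len(entry)
            info.mode = 0o644
            archive.addfile(info, io.BytesIO(entry))


# Hypothetical directory content, standing in for an archived object.
EXAMPLE_TREE = {
    "README": b"hello\n",
    "src": {"main.py": b"print('hi')\n"},
}
bundle = cook_directory(EXAMPLE_TREE)
```

A cooker for another output format would keep the same overall shape (walk the object graph, emit entries in the target format) while replacing the `tarfile` calls with the writer for that format.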
Desirable skills
- Python 3 and Git are required to work on any Software Heritage project
- Good understanding of the Git data model
- Basic understanding of asynchronous Python programming will be useful to read the existing code
- Django and web development, if you want to work on the frontend