Improve the Code Scanner (GSoC task)
Revision as of 21:21, 9 March 2021 by StefanoZacchiroli (talk | contribs) (initial page skeleton, still WIP)
WORK IN PROGRESS
Introduction
The Software Heritage archive is the most comprehensive open data knowledge base about openly published source code.
As such, it can be used to scan local source code bases and detect which parts of them come from public code, including Free and Open Source Software.
The Software Heritage scanner (swh-scanner) (documentation, code) is a command-line tool that performs such scans.
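To make the idea concrete, here is a minimal sketch of the first step such a scan involves: computing an identifier for each local file so that it can be looked up in the archive. This is not swh-scanner's actual implementation, and the function names content_swhid and scan_files are hypothetical. It relies only on the fact that a content SWHID has the form swh:1:cnt:<hash>, where the hash is the same SHA-1 that Git computes for a blob.

```python
import hashlib
from pathlib import Path


def content_swhid(data: bytes) -> str:
    """Return the SWHID of a file content (hypothetical helper).

    The hash is the Git blob hash: SHA-1 over the header
    b"blob <length>\\0" followed by the raw bytes.
    """
    header = b"blob %d\0" % len(data)
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


def scan_files(root: Path) -> dict[str, str]:
    """Map each regular file under root to its content SWHID (hypothetical helper).

    Symbolic links are skipped to keep the sketch simple.
    """
    result = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and not path.is_symlink():
            result[str(path.relative_to(root))] = content_swhid(path.read_bytes())
    return result


# The empty file has a well-known Git blob hash.
print(content_swhid(b""))  # swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Checking whether the resulting identifiers are known to the archive is a separate step that is not shown here.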
Task description
swh-scanner is currently an experimental tool. It works well in practice, but needs some polishing before it is usable in production for real use cases.
Several improvements are possible:
- make
- use our in-memory graph database swh-graph to speed up fetching the necessary subgraphs.
- write cookers to output new formats (e.g. git tarballs/git bundles or even other VCS?)
- improve unit and end-to-end testing
- other general code improvements (better progress/error reporting in the frontend, etc.)
Desirable skills
- Python 3 and Git are a must to work on any Software Heritage project
- Basic understanding of the Software Heritage data model and of SWHID identifiers
- JavaScript and front-end web development, if you want to work on the interactive dashboard
Potential mentors
- Stefano Zacchiroli <zack@upsilon.cc> (zack on IRC)