Improve the Code Scanner (GSoC task)
WORK IN PROGRESS
The Software Heritage archive is the most comprehensive openly published knowledge base about source code.
As such, it can be used to scan local source code bases and detect which parts of them come from public code, including Free and Open Source Software.
swh-scanner is currently an experimental tool that works well in practice but needs some polishing before it is usable in production on real use cases.
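To make the idea of scanning a local code base concrete, here is a minimal sketch of the first step: computing a content SWHID for every file in a directory tree. It relies on the fact that a content SWHID is `swh:1:cnt:` followed by the same SHA-1 that Git computes for a blob. The function names `content_swhid` and `scan_tree` are hypothetical and chosen for illustration; this is not how swh-scanner itself is implemented.

```python
import hashlib
import os


def content_swhid(data: bytes) -> str:
    """Return the SWHID of a file content.

    The hash is the Git blob hash: SHA-1 over the header
    b"blob <length>\\0" followed by the raw bytes.
    """
    header = b"blob %d\0" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


def scan_tree(root: str) -> dict:
    """Map each regular file under `root` (relative path) to its content SWHID.

    Hypothetical helper for illustration: it skips `.git` directories and
    symbolic links, and reads each file fully into memory.
    """
    swhids = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune VCS metadata so that only the working tree is hashed.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            with open(path, "rb") as f:
                swhids[os.path.relpath(path, root)] = content_swhid(f.read())
    return swhids


if __name__ == "__main__":
    for path, swhid in sorted(scan_tree(".").items()):
        print(swhid, path)
```

The resulting identifiers could then be checked against the archive, for instance through the Web API endpoint that reports whether a list of SWHIDs is known. That lookup is left out here because it requires network access.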
Several improvements are possible:
- use our in-memory graph database swh-graph to speed up fetching the necessary subgraphs.
- write cookers to output new formats (e.g. git tarballs, git bundles, or even other VCS?)
- improve unit and end-to-end testing
- other general code improvements (better progress/error reporting in the frontend, etc.)
Prerequisites:
- Python 3 and Git, which are a must to work on any Software Heritage project
- Basic understanding of the Software Heritage data model and of SWHIDs
Mentor:
- Stefano Zacchiroli <email@example.com> (zack on IRC)