Improve the Code Scanner (GSoC task)

From Software Heritage Wiki
Revision as of 21:21, 9 March 2021 by StefanoZacchiroli (talk | contribs) (initial page skeleton, still WIP)



The Software Heritage archive is the most comprehensive openly published knowledge base about source code.

As such, it can be used to scan local source code bases and detect which parts of them come from public code, including Free and Open Source Software.

The Software Heritage scanner (swh-scanner) (documentation, code) is a command-line tool for doing that.
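
To give an idea of what scanning involves, here is a minimal sketch of how a file can be mapped to the identifier under which the archive would know it. It relies on the fact that the SWHID of a file content is built from the same SHA1 that Git computes for a blob. The function names content_swhid and scan_tree are hypothetical and chosen for illustration; they are not taken from swh-scanner, which may be organized differently.

```python
import hashlib
import os


def content_swhid(data: bytes) -> str:
    """Return the SWHID of a file content (illustrative helper).

    The hash is the Git blob hash: SHA1 over the header
    "blob <length>\\0" followed by the raw bytes.
    """
    header = b"blob %d\0" % len(data)
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


def scan_tree(root: str) -> dict:
    """Map each regular file under root to its content SWHID.

    Hypothetical helper: a real scanner would also identify
    directories and then ask the archive which identifiers it knows.
    """
    result = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks: they are not plain file contents.
            if os.path.islink(path):
                continue
            with open(path, "rb") as f:
                result[path] = content_swhid(f.read())
    return result
```

The resulting identifiers are what a scanner would then look up in the archive; files whose identifier is known come from public code.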

Task description

swh-scanner is currently an experimental tool. It works well in practice, but needs some polishing to make it usable in production for real use cases.

Several improvements are possible:

  • make
  • use our in-memory graph database swh-graph to speed up fetching the necessary subgraphs.
  • write cookers to output new formats (e.g. git tarballs/git bundles, or even other VCS?)
  • improve unit and end-to-end testing
  • other general code improvements (better progress/error reporting in the frontend, etc.)

Desirable skills

  • Python 3 and Git are a must to work on any Software Heritage project
  • Basic understanding of the Software Heritage data model and of SWHID identifiers
  • JavaScript and front-end web development, if you want to work on the interactive dashboard

Potential mentors

  • Stefano Zacchiroli <> (zack on IRC)