Improve the Code Scanner (GSoC task)
Latest revision as of 16:30, 13 February 2022
Introduction
The Software Heritage archive is the most comprehensive open data knowledge base about source code that has been published openly.
As such, it can be used to scan local source code bases to detect which parts of it come from public code, including Free and Open Source Software.
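The Software Heritage scanner, swh-scanner (documentation: https://docs.softwareheritage.org/devel/swh-scanner/, code: https://forge.softwareheritage.org/source/swh-scanner/), is a command-line tool that enables doing that.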
Task description
swh-scanner is currently an experimental tool that works well in practice but needs some polishing to be usable in production for real use cases.
Several improvements are possible:
- integrate provenance information in results, using swh-graph (https://docs.softwareheritage.org/devel/swh-graph/) and/or swh-provenance (https://forge.softwareheritage.org/source/swh-provenance/)
- make the different algorithms used to query the backend user-selectable (see the benchmark branch in Git: https://forge.softwareheritage.org/source/swh-scanner/history/benchmark/)
- minimize the number of queries to the /known API endpoint (https://archive.softwareheritage.org/api/1/known/doc/), so as to consume less of the API rate limit and be more efficient overall
- be adaptive in how the backend is queried, e.g., for code trees that contain fewer than 1000 files it is more efficient to query all of them at once rather than follow the DAG structure (even though following the DAG is, in theory, the faster approach)
- improve the web-based dashboard view (--interactive), making it more user-friendly
- add support for scanning from within repositories (e.g., if the code is in Git, we can look up the commit ID directly)
- add progress reporting during scanning, in particular for large code bases
- add on-disk caching, in particular for large code bases
- integrate into the generated output other information available from the archive, e.g., license information, metadata, provenance information, etc.
- general code improvements, including refactoring and deduplication w.r.t. the rest of the Software Heritage code base (see open tasks)
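The adaptive querying idea from the list above can be illustrated with a minimal sketch. It is not the swh-scanner implementation: Node, scan, query_in_batches and the known_fn callable are hypothetical names introduced here, known_fn stands in for whatever sends a batch of SWHIDs to the /known endpoint, and the batch size of 1000 is an assumed per-request cap. The threshold counts all nodes of the tree rather than files only, which is a simplification. The DAG walk relies on the assumption that a directory known to the archive implies that everything below it is known as well.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Iterator, List

BATCH_SIZE = 1000  # assumed cap on SWHIDs per request


@dataclass
class Node:
    """Hypothetical in-memory view of a file or directory, labeled by its SWHID."""
    swhid: str
    children: List["Node"] = field(default_factory=list)


# Stand-in for the /known endpoint: takes SWHIDs, says which ones are archived.
KnownFn = Callable[[List[str]], Dict[str, bool]]


def query_in_batches(swhids: Iterable[str], known_fn: KnownFn,
                     batch_size: int = BATCH_SIZE) -> Dict[str, bool]:
    """Deduplicate the SWHIDs and send them in as few batches as possible."""
    ids = list(dict.fromkeys(swhids))
    result: Dict[str, bool] = {}
    for start in range(0, len(ids), batch_size):
        result.update(known_fn(ids[start:start + batch_size]))
    return result


def all_nodes(root: Node) -> Iterator[Node]:
    """Yield the root and every node below it."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        stack.extend(node.children)


def scan(root: Node, known_fn: KnownFn, flat_threshold: int = 1000) -> Dict[str, bool]:
    """Query small trees in one go; walk large trees level by level."""
    nodes = list(all_nodes(root))
    if len(nodes) < flat_threshold:
        # Small tree: a flat query costs fewer round trips than a walk.
        return query_in_batches((n.swhid for n in nodes), known_fn)

    result: Dict[str, bool] = {}
    frontier = [root]
    while frontier:
        answers = query_in_batches((n.swhid for n in frontier), known_fn)
        next_frontier: List[Node] = []
        for node in frontier:
            known = answers[node.swhid]
            result[node.swhid] = known
            if known:
                # Assumption: a known directory implies known contents,
                # so the whole subtree is marked without further queries.
                for descendant in all_nodes(node):
                    result[descendant.swhid] = True
            else:
                next_frontier.extend(node.children)
        frontier = next_frontier
    return result
```

Keeping the transport behind known_fn means the same logic can be exercised with a fake backend, and a user-selectable algorithm could be exposed by choosing flat_threshold (or the strategy itself) from the command line.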
Desirable skills
- Python 3 and Git are a must to work on any Software Heritage project
- Basic understanding of the Software Heritage data model and of SWHID identifiers
- JavaScript and front-end web development, if you want to work on the interactive dashboard
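As a quick orientation on SWHIDs, the sketch below computes the identifier of a single file's content. It assumes the content flavor of the scheme, where the hash is the same one Git assigns to a blob (SHA-1 over a "blob <length>" header, a NUL byte, and the file bytes); content_swhid is a hypothetical helper name, not a function of any Software Heritage library.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Return the content SWHID for raw file bytes (Git blob hashing assumed)."""
    header = b"blob %d\0" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


# The empty file maps to Git's well-known empty-blob hash.
print(content_swhid(b""))  # swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Identifiers of this form are what a scanner would send to the archive to ask whether a given file is already known.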
Potential mentors
- Stefano Zacchiroli <zack@upsilon.cc> (zack on IRC)