<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.softwareheritage.org/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=StefanoZacchiroli</id>
	<title>Software Heritage Wiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.softwareheritage.org/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=StefanoZacchiroli"/>
	<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/wiki/Special:Contributions/StefanoZacchiroli"/>
	<updated>2026-04-20T12:37:09Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.39.10</generator>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=User:StefanoZacchiroli&amp;diff=1860</id>
		<title>User:StefanoZacchiroli</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=User:StefanoZacchiroli&amp;diff=1860"/>
		<updated>2024-09-18T11:15:21Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Heya, I'm Stefano Zacchiroli, or simply &amp;quot;Zack&amp;quot;, and I'm [[Software Heritage]] [https://www.softwareheritage.org/people/ co-founder and CSO].&lt;br /&gt;
&lt;br /&gt;
* Nickname: Zack&lt;br /&gt;
* Full name: Stefano Zacchiroli&lt;br /&gt;
* Homepage: http://upsilon.cc/~zack&lt;br /&gt;
* Mastodon: [https://mastodon.xyz/@zacchiro @zacchiro@mastodon.xyz]&lt;br /&gt;
* Email: [mailto:zack@upsilon.cc zack@upsilon.cc]&lt;br /&gt;
* GPG fingerprint: [http://pgp.cs.uu.nl/stats/6D866396.html 4900 707D DC5C 07F2 DECB  0283 9C31 503C 6D86 6396]&lt;br /&gt;
&lt;br /&gt;
== Scratch ==&lt;br /&gt;
&lt;br /&gt;
* [[/Content deduplication]]&lt;br /&gt;
* [[/Repository snapshot objects]]&lt;br /&gt;
* [[/Scratch]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Software_Heritage_in_a_bottle_-_local_repository_mining_toolchain_(internship)&amp;diff=1859</id>
		<title>Software Heritage in a bottle - local repository mining toolchain (internship)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Software_Heritage_in_a_bottle_-_local_repository_mining_toolchain_(internship)&amp;diff=1859"/>
		<updated>2024-09-13T13:41:56Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Internship&lt;br /&gt;
|description=The goal of this internship is to develop a fully automated &amp;quot;mini [https://www.softwareheritage.org Software Heritage]&amp;quot; pipeline, capable of crawling a (potentially large) set of Version Control System repositories and storing all their information in a local deployment of Software Heritage software components.&lt;br /&gt;
&lt;br /&gt;
This will allow analyzing all crawled data locally in an efficient manner. Some sample intended use cases for this are:&lt;br /&gt;
&lt;br /&gt;
# mining of private (or hybrid) development forges, e.g., for [https://en.wikipedia.org/wiki/Inner_source inner source] industry scenarios&lt;br /&gt;
# testing and validation of mining software repository (MSR) analyses, e.g., for the research needs of the [https://swhsec.github.io/ SWHSec project]&lt;br /&gt;
&lt;br /&gt;
'''Objectives:'''&lt;br /&gt;
The primary goal of this internship is to develop scripts and tools to generate a local, fully functional mini Software Heritage environment. This environment will mimic the larger SWH infrastructure and support complete export and import functionalities. It will be used to test algorithms and validate scenarios by applying various manipulations to software repositories.&lt;br /&gt;
&lt;br /&gt;
'''Specific objectives:'''&lt;br /&gt;
# Scenarization: automate the creation and management of local Git repositories (for testing purposes).&lt;br /&gt;
# Deploy a local SWH ingestion pipeline&lt;br /&gt;
# Automate the initial, periodic, and on-demand (re)crawling of Git repositories&lt;br /&gt;
# Export all indexed data, periodically and on demand, in the [https://docs.softwareheritage.org/devel/swh-dataset/graph/ same formats] exported by the SWH archive (compressed graph and ORC formats).&lt;br /&gt;
# Develop a local [https://docs.softwareheritage.org/devel/swh-objstorage/ SWH object storage] instance to allow accessing individual file contents.&lt;br /&gt;
# Automate all relevant workflows in a Continuous Integration (CI) environment.&lt;br /&gt;
&lt;br /&gt;
'''Expected outcomes:'''&lt;br /&gt;
By the end of the internship, we expect the following deliverables:&lt;br /&gt;
# A set of scripts and tools to create local Git repositories, apply manipulations, index them, and export the data.&lt;br /&gt;
# A minimal local server setup that replicates the API functionalities of the SWH project.&lt;br /&gt;
# Automated test cases integrated into a CI system to ensure ongoing validation of different manipulation scenarios.&lt;br /&gt;
# Documentation and guidelines on using the developed tools and reproducing the test scenarios.&lt;br /&gt;
&lt;br /&gt;
|skills=&lt;br /&gt;
* Strong proficiency in Python.&lt;br /&gt;
* Knowledge of Rust is a plus.&lt;br /&gt;
* Familiarity with Git and related tools.&lt;br /&gt;
* Basic understanding of software version control and indexing.&lt;br /&gt;
* Experience with API development and server setup.&lt;br /&gt;
* Ability to work independently and collaboratively in a team environment.&lt;br /&gt;
&lt;br /&gt;
|workplace=on site at [https://www.telecom-paris.fr/en/home Télécom Paris] (contact mentors for alternative hosting options)&lt;br /&gt;
&lt;br /&gt;
|environment=The intern will be supervised by members of the SWH and SWHSec project teams, which include experts in software preservation, security, and data management.&lt;br /&gt;
&lt;br /&gt;
|mentors=&lt;br /&gt;
* Samuel Tardieu &amp;lt;samuel.tardieu@telecom-paris.fr&amp;gt; (Sam on [[Matrix]])&lt;br /&gt;
* Stefano Zacchiroli &amp;lt;stefano.zacchiroli@telecom-paris.fr&amp;gt; (Zack on [[Matrix]])&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
[[Category:Available internship]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Software_Heritage_in_a_bottle_-_local_repository_mining_toolchain_(internship)&amp;diff=1858</id>
		<title>Software Heritage in a bottle - local repository mining toolchain (internship)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Software_Heritage_in_a_bottle_-_local_repository_mining_toolchain_(internship)&amp;diff=1858"/>
		<updated>2024-09-13T13:39:13Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Internship&lt;br /&gt;
|description=The goal of this internship is to develop a fully automated &amp;quot;mini [https://www.softwareheritage.org Software Heritage]&amp;quot; pipeline, capable of crawling a (potentially large) set of Version Control System repositories and storing all their information in a local deployment of Software Heritage software components.&lt;br /&gt;
&lt;br /&gt;
This will allow analyzing all crawled data locally in an efficient manner. Some sample intended use cases for this are:&lt;br /&gt;
&lt;br /&gt;
# mining of private (or hybrid) development forges, e.g., for [https://en.wikipedia.org/wiki/Inner_source inner source] industry scenarios&lt;br /&gt;
# testing and validation of mining software repository (MSR) analyses, e.g., for the research needs of the [https://swhsec.github.io/ SWHSec project]&lt;br /&gt;
&lt;br /&gt;
'''Objectives:'''&lt;br /&gt;
The primary goal of this internship is to develop scripts and tools to generate a local, fully functional mini Software Heritage environment. This environment will mimic the larger SWH infrastructure and support complete export and import functionalities. It will be used to test algorithms and validate scenarios by applying various manipulations to software repositories.&lt;br /&gt;
&lt;br /&gt;
'''Specific objectives:'''&lt;br /&gt;
# Scenarization: automate the creation and management of local Git repositories (for testing purposes).&lt;br /&gt;
# Deploy a local SWH ingestion pipeline&lt;br /&gt;
# Automate the initial, periodic, and on-demand (re)crawling of Git repositories&lt;br /&gt;
# Export all indexed data, periodically and on demand, in the [https://docs.softwareheritage.org/devel/swh-dataset/graph/ same formats] exported by the SWH archive (compressed graph and ORC formats).&lt;br /&gt;
# Develop a local [https://docs.softwareheritage.org/devel/swh-objstorage/ SWH object storage] instance to allow accessing individual file contents.&lt;br /&gt;
# Automate all relevant workflows in a Continuous Integration (CI) environment.&lt;br /&gt;
&lt;br /&gt;
'''Expected outcomes:'''&lt;br /&gt;
By the end of the internship, we expect the following deliverables:&lt;br /&gt;
# A set of scripts and tools to create local Git repositories, apply manipulations, index them, and export the data.&lt;br /&gt;
# A minimal local server setup that replicates the API functionalities of the SWH project.&lt;br /&gt;
# Automated test cases integrated into a CI system to ensure ongoing validation of different manipulation scenarios.&lt;br /&gt;
# Documentation and guidelines on using the developed tools and reproducing the test scenarios.&lt;br /&gt;
&lt;br /&gt;
|skills=&lt;br /&gt;
* Strong proficiency in Python.&lt;br /&gt;
* Knowledge of Rust is a plus.&lt;br /&gt;
* Familiarity with Git and related tools.&lt;br /&gt;
* Basic understanding of software version control and indexing.&lt;br /&gt;
* Experience with API development and server setup.&lt;br /&gt;
* Ability to work independently and collaboratively in a team environment.&lt;br /&gt;
&lt;br /&gt;
|workplace=on site at [https://www.telecom-paris.fr/en/home Télécom Paris] (contact mentors for remote opportunities)&lt;br /&gt;
&lt;br /&gt;
|environment=The intern will be supervised by members of the SWH and SWHSec project teams, which include experts in software preservation, security, and data management.&lt;br /&gt;
&lt;br /&gt;
|mentors=&lt;br /&gt;
* Samuel Tardieu &amp;lt;samuel.tardieu@telecom-paris.fr&amp;gt; (Sam on [[Matrix]])&lt;br /&gt;
* Stefano Zacchiroli &amp;lt;stefano.zacchiroli@telecom-paris.fr&amp;gt; (Zack on [[Matrix]])&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
[[Category:Available internship]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Software_Heritage_in_a_bottle_-_local_repository_mining_toolchain_(internship)&amp;diff=1857</id>
		<title>Software Heritage in a bottle - local repository mining toolchain (internship)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Software_Heritage_in_a_bottle_-_local_repository_mining_toolchain_(internship)&amp;diff=1857"/>
		<updated>2024-09-13T13:37:29Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: StefanoZacchiroli moved page Software Heritage in a bottle (internship) to Software Heritage in a bottle - local repository mining toolchain (internship) without leaving a redirect&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Internship&lt;br /&gt;
|description=The goal of this internship is to develop a fully automated &amp;quot;mini Software Heritage&amp;quot; pipeline, capable of crawling a (potentially large) set of Version Control System repositories and storing all their information in a local deployment of Software Heritage software components.&lt;br /&gt;
&lt;br /&gt;
This will allow analyzing all crawled data locally in an efficient manner. Some sample intended use cases for this are:&lt;br /&gt;
&lt;br /&gt;
# mining of private (or hybrid) development forges, e.g., for [https://en.wikipedia.org/wiki/Inner_source inner source] industry scenarios&lt;br /&gt;
# testing and validation of mining software repository (MSR) analyses, e.g., for the research needs of the [https://swhsec.github.io/ SWHSec project]&lt;br /&gt;
&lt;br /&gt;
'''Objectives:'''&lt;br /&gt;
The primary goal of this internship is to develop scripts and tools to generate a local, fully functional mini Software Heritage environment. This environment will mimic the larger SWH infrastructure and support complete export and import functionalities. It will be used to test algorithms and validate scenarios by applying various manipulations to software repositories.&lt;br /&gt;
&lt;br /&gt;
'''Specific objectives:'''&lt;br /&gt;
# Scenarization: automate the creation and management of local Git repositories (for testing purposes).&lt;br /&gt;
# Deploy a local SWH ingestion pipeline&lt;br /&gt;
# Automate the initial, periodic, and on-demand (re)crawling of Git repositories&lt;br /&gt;
# Export all indexed data, periodically and on demand, in the same formats exported by the SWH archive (compressed graph and ORC formats).&lt;br /&gt;
# Develop a local SWH object storage instance to allow accessing individual file contents.&lt;br /&gt;
# Automate all relevant workflows in a Continuous Integration (CI) environment.&lt;br /&gt;
&lt;br /&gt;
'''Expected outcomes:'''&lt;br /&gt;
By the end of the internship, we expect the following deliverables:&lt;br /&gt;
# A set of scripts and tools to create local Git repositories, apply manipulations, index them, and export the data.&lt;br /&gt;
# A minimal local server setup that replicates the API functionalities of the SWH project.&lt;br /&gt;
# Automated test cases integrated into a CI system to ensure ongoing validation of different manipulation scenarios.&lt;br /&gt;
# Documentation and guidelines on using the developed tools and reproducing the test scenarios.&lt;br /&gt;
&lt;br /&gt;
|skills=&lt;br /&gt;
* Strong proficiency in Python.&lt;br /&gt;
* Knowledge of Rust is a plus.&lt;br /&gt;
* Familiarity with Git and related tools.&lt;br /&gt;
* Basic understanding of software version control and indexing.&lt;br /&gt;
* Experience with API development and server setup.&lt;br /&gt;
* Ability to work independently and collaboratively in a team environment.&lt;br /&gt;
&lt;br /&gt;
|workplace=on site at [https://www.telecom-paris.fr/en/home Télécom Paris] (contact mentors for remote opportunities)&lt;br /&gt;
&lt;br /&gt;
|environment=The intern will be supervised by members of the SWH and SWHSec project teams, which include experts in software preservation, security, and data management.&lt;br /&gt;
&lt;br /&gt;
|mentors=&lt;br /&gt;
* Samuel Tardieu &amp;lt;samuel.tardieu@telecom-paris.fr&amp;gt; (Sam on [[Matrix]])&lt;br /&gt;
* Stefano Zacchiroli &amp;lt;stefano.zacchiroli@telecom-paris.fr&amp;gt; (Zack on [[Matrix]])&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
[[Category:Available internship]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Software_Heritage_in_a_bottle_-_local_repository_mining_toolchain_(internship)&amp;diff=1856</id>
		<title>Software Heritage in a bottle - local repository mining toolchain (internship)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Software_Heritage_in_a_bottle_-_local_repository_mining_toolchain_(internship)&amp;diff=1856"/>
		<updated>2024-09-13T13:36:37Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Internship&lt;br /&gt;
|description=The goal of this internship is to develop a fully automated &amp;quot;mini Software Heritage&amp;quot; pipeline, capable of crawling a (potentially large) set of Version Control System repositories and storing all their information in a local deployment of Software Heritage software components.&lt;br /&gt;
&lt;br /&gt;
This will allow analyzing all crawled data locally in an efficient manner. Some sample intended use cases for this are:&lt;br /&gt;
&lt;br /&gt;
# mining of private (or hybrid) development forges, e.g., for [https://en.wikipedia.org/wiki/Inner_source inner source] industry scenarios&lt;br /&gt;
# testing and validation of mining software repository (MSR) analyses, e.g., for the research needs of the [https://swhsec.github.io/ SWHSec project]&lt;br /&gt;
&lt;br /&gt;
'''Objectives:'''&lt;br /&gt;
The primary goal of this internship is to develop scripts and tools to generate a local, fully functional mini Software Heritage environment. This environment will mimic the larger SWH infrastructure and support complete export and import functionalities. It will be used to test algorithms and validate scenarios by applying various manipulations to software repositories.&lt;br /&gt;
&lt;br /&gt;
'''Specific objectives:'''&lt;br /&gt;
# Scenarization: automate the creation and management of local Git repositories (for testing purposes).&lt;br /&gt;
# Deploy a local SWH ingestion pipeline&lt;br /&gt;
# Automate the initial, periodic, and on-demand (re)crawling of Git repositories&lt;br /&gt;
# Export all indexed data, periodically and on demand, in the same formats exported by the SWH archive (compressed graph and ORC formats).&lt;br /&gt;
# Develop a local SWH object storage instance to allow accessing individual file contents.&lt;br /&gt;
# Automate all relevant workflows in a Continuous Integration (CI) environment.&lt;br /&gt;
&lt;br /&gt;
'''Expected outcomes:'''&lt;br /&gt;
By the end of the internship, we expect the following deliverables:&lt;br /&gt;
# A set of scripts and tools to create local Git repositories, apply manipulations, index them, and export the data.&lt;br /&gt;
# A minimal local server setup that replicates the API functionalities of the SWH project.&lt;br /&gt;
# Automated test cases integrated into a CI system to ensure ongoing validation of different manipulation scenarios.&lt;br /&gt;
# Documentation and guidelines on using the developed tools and reproducing the test scenarios.&lt;br /&gt;
&lt;br /&gt;
|skills=&lt;br /&gt;
* Strong proficiency in Python.&lt;br /&gt;
* Knowledge of Rust is a plus.&lt;br /&gt;
* Familiarity with Git and related tools.&lt;br /&gt;
* Basic understanding of software version control and indexing.&lt;br /&gt;
* Experience with API development and server setup.&lt;br /&gt;
* Ability to work independently and collaboratively in a team environment.&lt;br /&gt;
&lt;br /&gt;
|workplace=on site at [https://www.telecom-paris.fr/en/home Télécom Paris] (contact mentors for remote opportunities)&lt;br /&gt;
&lt;br /&gt;
|environment=The intern will be supervised by members of the SWH and SWHSec project teams, which include experts in software preservation, security, and data management.&lt;br /&gt;
&lt;br /&gt;
|mentors=&lt;br /&gt;
* Samuel Tardieu &amp;lt;samuel.tardieu@telecom-paris.fr&amp;gt; (Sam on [[Matrix]])&lt;br /&gt;
* Stefano Zacchiroli &amp;lt;stefano.zacchiroli@telecom-paris.fr&amp;gt; (Zack on [[Matrix]])&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
[[Category:Available internship]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Software_Heritage_in_a_bottle_-_local_repository_mining_toolchain_(internship)&amp;diff=1855</id>
		<title>Software Heritage in a bottle - local repository mining toolchain (internship)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Software_Heritage_in_a_bottle_-_local_repository_mining_toolchain_(internship)&amp;diff=1855"/>
		<updated>2024-09-13T13:33:51Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: Created page with &amp;quot;{{Internship |description=The goal of this internship is to develop a fully-automated &amp;quot;mini Software Heritage&amp;quot; pipeline, capable of crawling a (potentially large) set of Versi...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Internship&lt;br /&gt;
|description=The goal of this internship is to develop a fully automated &amp;quot;mini Software Heritage&amp;quot; pipeline, capable of crawling a (potentially large) set of Version Control System repositories and storing all their information in a local deployment of Software Heritage software components.&lt;br /&gt;
&lt;br /&gt;
This will allow analyzing all crawled data locally in an efficient manner. Some sample intended use cases for this are:&lt;br /&gt;
&lt;br /&gt;
# mining of private (or hybrid) development forges, e.g., for [https://en.wikipedia.org/wiki/Inner_source inner source] industry scenarios&lt;br /&gt;
# testing and validation of mining software repository (MSR) analyses, e.g., for the research needs of the [https://swhsec.github.io/ SWHSec project]&lt;br /&gt;
&lt;br /&gt;
'''Objectives:'''&lt;br /&gt;
The primary goal of this internship is to develop scripts and tools to generate a local, fully functional mini Software Heritage environment. This environment will mimic the larger SWH infrastructure and support complete export and import functionalities. It will be used to test algorithms and validate scenarios by applying various manipulations to software repositories.&lt;br /&gt;
&lt;br /&gt;
'''Specific objectives:'''&lt;br /&gt;
# Scenarization: automate the creation and management of local Git repositories (for testing purposes).&lt;br /&gt;
# Deploy a local SWH ingestion pipeline&lt;br /&gt;
# Automate the initial, periodic, and on-demand (re)crawling of Git repositories&lt;br /&gt;
# Export all indexed data, periodically and on demand, in the same formats exported by the SWH archive (compressed graph and ORC formats).&lt;br /&gt;
# Develop a local SWH object storage instance to allow accessing individual file contents.&lt;br /&gt;
# Automate all relevant workflows in a Continuous Integration (CI) environment.&lt;br /&gt;
&lt;br /&gt;
'''Expected outcomes:'''&lt;br /&gt;
By the end of the internship, we expect the following deliverables:&lt;br /&gt;
# A set of scripts and tools to create local Git repositories, apply manipulations, index them, and export the data.&lt;br /&gt;
# A minimal local server setup that replicates the API functionalities of the SWH project.&lt;br /&gt;
# Automated test cases integrated into a CI system to ensure ongoing validation of different manipulation scenarios.&lt;br /&gt;
# Documentation and guidelines on using the developed tools and reproducing the test scenarios.&lt;br /&gt;
&lt;br /&gt;
|skills=&lt;br /&gt;
* Strong proficiency in Python.&lt;br /&gt;
* Knowledge of Rust is a plus.&lt;br /&gt;
* Familiarity with Git and related tools.&lt;br /&gt;
* Basic understanding of software version control and indexing.&lt;br /&gt;
* Experience with API development and server setup.&lt;br /&gt;
* Ability to work independently and collaboratively in a team environment.&lt;br /&gt;
&lt;br /&gt;
|mentors=&lt;br /&gt;
* Samuel Tardieu &amp;lt;samuel.tardieu@telecom-paris.fr&amp;gt; (Sam on [[Matrix]])&lt;br /&gt;
* Stefano Zacchiroli &amp;lt;stefano.zacchiroli@telecom-paris.fr&amp;gt; (Zack on [[Matrix]])&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
[[Category:Available internship]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Template:Internship_context&amp;diff=1854</id>
		<title>Template:Internship context</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Template:Internship_context&amp;diff=1854"/>
		<updated>2024-09-13T10:05:45Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[https://www.softwareheritage.org/ Software Heritage] is an ambitious initiative whose goal is to collect, preserve forever, and make publicly available the entire body of software, in the preferred form for making modifications to it.&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Source_code_search_engine_prototype_(internship)&amp;diff=1823</id>
		<title>Source code search engine prototype (internship)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Source_code_search_engine_prototype_(internship)&amp;diff=1823"/>
		<updated>2024-02-04T15:11:59Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Internship&lt;br /&gt;
|description=The current [https://archive.softwareheritage.org/browse/search/ archive search engine] supports searching archived projects (or &amp;quot;origins&amp;quot;) via their URL or metadata.&lt;br /&gt;
We would like to extend search to also support searching within archived source code files, based on their textual content.&lt;br /&gt;
Indexing all source code files archived by Software Heritage (~500-600 TB) is a major undertaking in terms of indexing time and storage.&lt;br /&gt;
The goal of this internship is to design and implement a medium-scale prototype of such an index (covering, e.g., 0.1 to 1% of the archive) that will make it possible to evaluate the best indexing approach (e.g., which kind of index, tokenizer, etc.) as well as the time and resources that indexing the full archive would require (e.g., via extrapolation).&lt;br /&gt;
The first technology that will be tried is [https://www.elastic.co/ ElasticSearch], as it's already deployed for other Software Heritage needs, but, depending on the candidate, other search engines can also be tested in the context of the internship.&lt;br /&gt;
&lt;br /&gt;
|skills=&lt;br /&gt;
* Python development&lt;br /&gt;
* database administration experience (SQL and/or NoSQL and/or document-oriented)&lt;br /&gt;
&lt;br /&gt;
Will be considered a plus:&lt;br /&gt;
* experience with full-text or code search&lt;br /&gt;
&lt;br /&gt;
|mentors=&lt;br /&gt;
* Valentin Lorentz (vlorentz on [[Matrix]])&lt;br /&gt;
* Stefano Zacchiroli &amp;lt;zack@upsilon.cc&amp;gt; (Zack on [[Matrix]])&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
[[Category:Available internship]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Reverse_project_phylogenesis_(internship)&amp;diff=1822</id>
		<title>Reverse project phylogenesis (internship)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Reverse_project_phylogenesis_(internship)&amp;diff=1822"/>
		<updated>2024-02-04T15:11:37Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Internship&lt;br /&gt;
|description=The [https://archive.softwareheritage.org/ Software Heritage Archive] contains a large number of projects (over one million!) salvaged from code hosting platforms that have been closed down, ranging from large ones like Google Code or Gitorious.org, to small institutional ones, like the old Inria gForge, and from platforms that have phased out support for some version control systems, like Bitbucket.&lt;br /&gt;
Some of these projects migrated to other platforms, where they continued their development.&lt;br /&gt;
&lt;br /&gt;
This means that there are documents, research articles, blog posts, documentation, and many other sources out there that contain broken links: the Software Heritage archive provides a way to easily find the archived version of the project, but does not help identify the new repository to which development may have migrated.&lt;br /&gt;
&lt;br /&gt;
The goal of this internship is to explore heuristics that exploit the special feature of the Software Heritage Merkle graph to identify repositories that may be the new development strand of an old repository saved from a discontinued platform, and show these links in the relevant repositories: this corresponds to developing the [https://www.vocabulary.com/dictionary/phylogenesis phylogenesis] of a software project.&lt;br /&gt;
&lt;br /&gt;
One of the challenges will be to compare various heuristics and scale the approach up to the millions of repositories involved.&lt;br /&gt;
&lt;br /&gt;
|skills=&lt;br /&gt;
* Python development&lt;br /&gt;
* understanding of version control systems (git in particular) and familiarity with code hosting platforms&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|mentors=&lt;br /&gt;
* Roberto Di Cosmo &amp;lt;roberto@dicosmo.org&amp;gt; (rdicosmo on [[Matrix]])&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
[[Category:Available internship]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Large-scale_license_text_recognition_(internship)/fr&amp;diff=1821</id>
		<title>Large-scale license text recognition (internship)/fr</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Large-scale_license_text_recognition_(internship)/fr&amp;diff=1821"/>
		<updated>2024-02-04T15:11:03Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;(see: [[Large-scale_license_text_recognition_(internship)|English version]] of this internship description)&lt;br /&gt;
----&lt;br /&gt;
{{Internship&lt;br /&gt;
|description=Several open source solutions exist to recognize the software license declared in a source file, e.g., [https://www.fossology.org/ Fossology], [https://github.com/nexB/scancode-toolkit ScanCode], [http://ninka.turingmachine.org/ Ninka].&lt;br /&gt;
Most of these solutions rely on heuristics that have been maintained over the years based on existing [https://spdx.org/licenses/ licenses] (open source or otherwise).&lt;br /&gt;
Only recently have machine learning techniques been applied to the license recognition problem, with prototypes such as [https://github.com/fossology/FOSSologyML FOSSologyML].&lt;br /&gt;
The goal of this internship is to apply machine learning techniques to a specific sub-problem of license recognition: classifying a full license text (and not just a short license notice, as found in source file headers), as found in files such as &amp;lt;code&amp;gt;LICENSE&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;COPYING&amp;lt;/code&amp;gt;, etc., at the root of software repositories, and doing so at the scale of Software Heritage.&lt;br /&gt;
A dataset of these files will be extracted from the archive, and several machine learning techniques will be tried out to identify the most effective and efficient method.&lt;br /&gt;
&lt;br /&gt;
|skills=&lt;br /&gt;
* Python development skills&lt;br /&gt;
* experience with machine learning&lt;br /&gt;
* knowledge of one or more machine learning frameworks (e.g., [https://keras.io/ Keras], [https://www.tensorflow.org/ TensorFlow], [https://scikit-learn.org/stable/ scikit-learn])&lt;br /&gt;
&lt;br /&gt;
Will be considered a plus:&lt;br /&gt;
* experience with [https://fr.wikipedia.org/wiki/Traitement_automatique_des_langues natural language processing (NLP)]&lt;br /&gt;
&lt;br /&gt;
|mentors=&lt;br /&gt;
* Stefano Zacchiroli &amp;lt;zack@upsilon.cc&amp;gt; (Zack on [[Matrix]])&lt;br /&gt;
}}&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Large-scale_license_text_recognition_(internship)&amp;diff=1820</id>
		<title>Large-scale license text recognition (internship)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Large-scale_license_text_recognition_(internship)&amp;diff=1820"/>
		<updated>2024-02-04T15:10:42Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;(see also: [[Large-scale license text recognition (internship)/fr|French version]] of this internship description)&lt;br /&gt;
----&lt;br /&gt;
{{Internship&lt;br /&gt;
|description=A number of free/open source software (FOSS) tools are available to automatically detect software licenses declared in source code files, e.g., [https://www.fossology.org/ Fossology], [https://github.com/nexB/scancode-toolkit ScanCode], [http://ninka.turingmachine.org/ Ninka].&lt;br /&gt;
Most of them rely on carefully maintained heuristics that have been tuned over many years to detect [https://spdx.org/licenses/ licenses] (FOSS or otherwise) that can be found in the wild.&lt;br /&gt;
Only relatively recently have machine-learning techniques been applied to the license-detection problem, in prototypes like [https://github.com/fossology/FOSSologyML FOSSologyML].&lt;br /&gt;
The goal of this internship is to apply machine-learning techniques to a limited sub-problem of license-detection, i.e., recognizing full license texts as they are commonly found in top-level files such as &amp;lt;code&amp;gt;LICENSE&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;COPYING&amp;lt;/code&amp;gt;, etc., at the scale of Software Heritage.&lt;br /&gt;
''All'' such files will be extracted from the archive, and suitable machine learning models will be designed and tested on the obtained corpus.&lt;br /&gt;
&lt;br /&gt;
|skills=&lt;br /&gt;
* Python development&lt;br /&gt;
* machine learning training and experience&lt;br /&gt;
* working knowledge of one or more machine learning frameworks (e.g., [https://keras.io/ Keras], [https://www.tensorflow.org/ TensorFlow], [https://scikit-learn.org/stable/ scikit-learn])&lt;br /&gt;
&lt;br /&gt;
Will be considered a plus:&lt;br /&gt;
* [https://en.wikipedia.org/wiki/Natural_language_processing natural language processing (NLP)] training and experience&lt;br /&gt;
&lt;br /&gt;
|mentors=&lt;br /&gt;
* Stefano Zacchiroli &amp;lt;zack@upsilon.cc&amp;gt; (Zack on [[Matrix]])&lt;br /&gt;
}}&lt;br /&gt;
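&lt;br /&gt;
As a rough illustration of the classification task described above, here is a minimal sketch of a multinomial naive Bayes text classifier written with the Python standard library only. It is a toy baseline, not the method the internship will adopt: the class name, the four-sentence training corpus, and the choice of naive Bayes are all assumptions made for the example, and a real experiment would train on license files extracted from the archive, most likely with one of the frameworks listed in the skills section.&lt;br /&gt;

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesLicenseClassifier:
    """Hypothetical toy classifier: multinomial naive Bayes, add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label to token counts
        self.doc_counts = Counter()              # label to number of documents
        self.vocabulary = set()

    def fit(self, samples):
        """Train on an iterable of (license_text, label) pairs."""
        for text, label in samples:
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocabulary.update(tokens)
        return self

    def log_score(self, tokens, label):
        """Log prior plus smoothed log likelihood of the tokens under a label."""
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        label_total = sum(self.word_counts[label].values())
        denominator = label_total + len(self.vocabulary)
        for token in tokens:
            score += math.log((self.word_counts[label][token] + 1) / denominator)
        return score

    def predict(self, text):
        """Return the label with the highest log score for the given text."""
        tokens = tokenize(text)
        return max(self.doc_counts, key=lambda label: self.log_score(tokens, label))


# Made-up training snippets, labeled with SPDX identifiers (illustrative only).
TOY_CORPUS = [
    ("Permission is hereby granted, free of charge, to any person obtaining a copy", "MIT"),
    ("The software is provided as is, without warranty of any kind", "MIT"),
    ("This program is free software: you can redistribute it and/or modify it "
     "under the terms of the GNU General Public License", "GPL-3.0-or-later"),
    ("You should have received a copy of the GNU General Public License", "GPL-3.0-or-later"),
]

classifier = NaiveBayesLicenseClassifier().fit(TOY_CORPUS)
print(classifier.predict("redistribute under the terms of the GNU General Public License"))
```

Bag-of-words models of this kind ignore word order, which is one reason why the internship calls for comparing several techniques on the full corpus.&lt;br /&gt;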
&lt;br /&gt;
[[Category:Available internship]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Language_and_infrastructure_for_analyzing_the_archive_(internship)/fr&amp;diff=1819</id>
		<title>Language and infrastructure for analyzing the archive (internship)/fr</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Language_and_infrastructure_for_analyzing_the_archive_(internship)/fr&amp;diff=1819"/>
		<updated>2024-02-04T15:10:21Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;(see: [[Language_and_infrastructure_for_analyzing_the_archive_(internship)|english version]] of this internship description)&lt;br /&gt;
----&lt;br /&gt;
{{Internship&lt;br /&gt;
|description=&lt;br /&gt;
&lt;br /&gt;
The Software Heritage archive is structured as a graph (more precisely, a Merkle tree), and this&lt;br /&gt;
graph is huge:&lt;br /&gt;
tens of billions of nodes, hundreds of billions of edges. The graph shows that many&lt;br /&gt;
different commits lead to the same files and directories, and that the same commits can&lt;br /&gt;
be reached from several repositories (repositories that are forks of other repositories).&lt;br /&gt;
&lt;br /&gt;
When analyzing source code at a very large scale (for example, all projects hosted on GitHub),&lt;br /&gt;
this abundance of shared data leads to wasted resources and time, since the same artifacts get analyzed repeatedly.&lt;br /&gt;
&lt;br /&gt;
The goal of this internship is to design and implement a prototype platform (inspired by Boa) that&lt;br /&gt;
would let us describe empirical experiments to run on Software Heritage, exploiting the data&lt;br /&gt;
shared in the graph to speed up the analyses.&lt;br /&gt;
The platform will consist of a simple language for describing experiments, and of a runtime implementing the&lt;br /&gt;
language that transparently handles caching of previous results.&lt;br /&gt;
The goal is ambitious: the runtime will distribute the computation over a single machine or a cluster.&lt;br /&gt;
&lt;br /&gt;
If the implementation is successful, the internship will conclude with a demonstration (in the form of a paper) of&lt;br /&gt;
the practical performance of our solution compared to a naive implementation.&lt;br /&gt;
&lt;br /&gt;
|skills=&lt;br /&gt;
* Python development&lt;br /&gt;
* experience with functional programming&lt;br /&gt;
&lt;br /&gt;
Will be considered a plus:&lt;br /&gt;
* experience with programming language theory/implementation&lt;br /&gt;
* experience with the MapReduce computation model&lt;br /&gt;
&lt;br /&gt;
|mentors=&lt;br /&gt;
* [https://upsilon.cc/~zack Stefano Zacchiroli] (zack@upsilon.cc, Zack on [[Matrix]])&lt;br /&gt;
}}&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Language_and_infrastructure_for_analyzing_the_archive_(internship)&amp;diff=1818</id>
		<title>Language and infrastructure for analyzing the archive (internship)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Language_and_infrastructure_for_analyzing_the_archive_(internship)&amp;diff=1818"/>
		<updated>2024-02-04T15:09:57Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;(see also: [[Language_and_infrastructure_for_analyzing_the_archive_(internship)/fr|French version]] of this internship description)&lt;br /&gt;
----&lt;br /&gt;
{{Internship&lt;br /&gt;
|description=&lt;br /&gt;
&lt;br /&gt;
The Software Heritage archive is structured as a graph (specifically, a&lt;br /&gt;
[https://en.wikipedia.org/wiki/Merkle_tree Merkle DAG]) and is huge: tens of&lt;br /&gt;
billions of nodes, hundreds of billions of edges. The graph exhibits a lot of sharing:&lt;br /&gt;
the same source code files and directories can be reached starting from many&lt;br /&gt;
different commits (e.g., different commits in the same repository), and the&lt;br /&gt;
same commits can be reached starting from many different repositories (e.g.,&lt;br /&gt;
repositories that are &amp;quot;forks&amp;quot; of one another).&lt;br /&gt;
When analyzing source code at a very large scale (e.g., all the commits of the&lt;br /&gt;
same large repository, or even all projects hosted on GitHub), it is a waste&lt;br /&gt;
of resources to re-analyze source code artifacts that have already been analyzed&lt;br /&gt;
and are encountered again due to sharing in the&lt;br /&gt;
graph.&lt;br /&gt;
&lt;br /&gt;
The goal of this internship is to design and implement a prototype platform&lt;br /&gt;
(similar in spirit to [http://design.cs.iastate.edu/papers/ICSE-13/icse13.pdf Boa])&lt;br /&gt;
that makes it possible to describe empirical experiments to be run on the Software&lt;br /&gt;
Heritage archive, exploiting artifact sharing as a way to speed up the&lt;br /&gt;
analysis. The platform will consist of a simple language to describe&lt;br /&gt;
experiments (e.g., &amp;quot;start from these repositories and run this script on all&lt;br /&gt;
files in each commit&amp;quot;) and of a runtime implementing the language that&lt;br /&gt;
transparently handles caching of previous results.&lt;br /&gt;
As a stretch goal, the runtime will delegate the actual computation to multiple&lt;br /&gt;
workers, running either on a single machine or distributed over a cluster.&lt;br /&gt;
&lt;br /&gt;
If successfully implemented, the internship will conclude with a demonstration&lt;br /&gt;
(e.g., in the form of a paper) benchmarking in practice the performance&lt;br /&gt;
advantages of the proposed approach over a naive implementation.&lt;br /&gt;
&lt;br /&gt;
|skills=&lt;br /&gt;
* Python development&lt;br /&gt;
* experience with functional programming&lt;br /&gt;
&lt;br /&gt;
Will be considered a plus:&lt;br /&gt;
* experience with programming language theory and implementation&lt;br /&gt;
* experience with the [https://en.wikipedia.org/wiki/MapReduce MapReduce] programming model&lt;br /&gt;
&lt;br /&gt;
|mentors=&lt;br /&gt;
* Stefano Zacchiroli &amp;lt;zack@upsilon.cc&amp;gt; (Zack on [[Matrix]])&lt;br /&gt;
}}&lt;br /&gt;
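&lt;br /&gt;
The caching idea in the description above can be sketched as follows. This is only an illustration under stated assumptions: the node identifiers, the three toy dictionaries standing in for the archive graph, the &amp;lt;code&amp;gt;CachingRuntime&amp;lt;/code&amp;gt; class, and the choice of summing per-file results are all hypothetical and do not reflect an existing Software Heritage API.&lt;br /&gt;

```python
# Toy Merkle DAG with hypothetical, hand-written node identifiers.
# Files map to their raw bytes, directories to child file ids,
# and commits to the id of their root directory.
FILES = {
    "f1": b"int main(void) { return 0; }\n",
    "f2": b"print('hello')\nprint('world')\n",
}
DIRECTORIES = {
    "d1": ["f1", "f2"],
    "d2": ["f2"],
}
COMMITS = {
    "c1": "d1",
    "c2": "d1",  # same root directory as c1: fully shared
    "c3": "d2",  # shares file f2 with the other commits
}


class CachingRuntime:
    """Runs a per-file analysis and memoizes results by node identifier."""

    def __init__(self, analysis):
        self.analysis = analysis
        self.cache = {}
        self.analysis_calls = 0  # how many times the analysis really ran

    def run_on_file(self, file_id):
        if file_id not in self.cache:
            self.analysis_calls += 1
            self.cache[file_id] = self.analysis(FILES[file_id])
        return self.cache[file_id]

    def run_on_directory(self, dir_id):
        # Assumption: per-file results are numbers that combine by addition.
        if dir_id not in self.cache:
            self.cache[dir_id] = sum(
                self.run_on_file(file_id) for file_id in DIRECTORIES[dir_id]
            )
        return self.cache[dir_id]

    def run_on_commit(self, commit_id):
        return self.run_on_directory(COMMITS[commit_id])


def count_lines(data):
    """Example analysis: number of newline-terminated lines in a file."""
    return data.count(b"\n")


runtime = CachingRuntime(count_lines)
results = {commit_id: runtime.run_on_commit(commit_id) for commit_id in COMMITS}
print(results, runtime.analysis_calls)
```

In this toy graph the three commits reference files four times in total, yet the analysis itself runs only twice, because identical identifiers denote identical content.&lt;br /&gt;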
&lt;br /&gt;
[[Category:Available internship]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Integrate_Software_Heritage_and_ClearlyDefined_(internship)&amp;diff=1817</id>
		<title>Integrate Software Heritage and ClearlyDefined (internship)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Integrate_Software_Heritage_and_ClearlyDefined_(internship)&amp;diff=1817"/>
		<updated>2024-02-04T15:09:31Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Internship&lt;br /&gt;
|description=[https://clearlydefined.io/ ClearlyDefined] is a project whose goal is to collaboratively and semi-automatically curate information about Free/Open Source Software (FOSS) projects, including licensing and vulnerability information.&lt;br /&gt;
As one of its main outputs, ClearlyDefined maintains an open data knowledge base that cross-references FOSS source code artifacts found in version control systems, package repositories, etc. with curated information about their licenses and vulnerabilities. The same source code artifacts are archived by Software Heritage for long-term preservation purposes.&lt;br /&gt;
The goal of this internship is to integrate ClearlyDefined and Software Heritage, for mutual benefit.&lt;br /&gt;
Software Heritage will benefit from mirroring ClearlyDefined data, making it possible to query them both while navigating the archive and at scale; ClearlyDefined will benefit from learning about the existence of FOSS projects that have not been analyzed for &amp;quot;clarity&amp;quot; yet.&lt;br /&gt;
&lt;br /&gt;
|skills=&lt;br /&gt;
* JavaScript / NodeJS&lt;br /&gt;
* Python&lt;br /&gt;
* experience with database management systems (of any kind)&lt;br /&gt;
&lt;br /&gt;
|mentors=&lt;br /&gt;
* Valentin Lorentz&lt;br /&gt;
* Stefano Zacchiroli &amp;lt;zack@upsilon.cc&amp;gt; (Zack on [[Matrix]])&lt;br /&gt;
&lt;br /&gt;
|environment=&lt;br /&gt;
You will work shoulder to shoulder with members of the [https://www.softwareheritage.org/people/ Software Heritage] and [https://clearlydefined.io/about ClearlyDefined] projects, with mentors from both projects.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
== Clearlydefined ==&lt;br /&gt;
&lt;br /&gt;
* [[ClearlyDefinedObject]]&lt;br /&gt;
&lt;br /&gt;
[[Category:Available internship]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Ingest_Wikidata_software_origins_(internship)&amp;diff=1816</id>
		<title>Ingest Wikidata software origins (internship)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Ingest_Wikidata_software_origins_(internship)&amp;diff=1816"/>
		<updated>2024-02-04T15:09:09Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Internship&lt;br /&gt;
|description=The Software Heritage archive currently contains source code&lt;br /&gt;
coming mostly from major development forges and distributions.&lt;br /&gt;
[https://www.wikidata.org/ Wikidata] is a free and open knowledge base about&lt;br /&gt;
''everything'', including software development projects. The goal of this&lt;br /&gt;
internship is to list ''software origins'' described in Wikidata (in particular,&lt;br /&gt;
but not exclusively, version control systems) and make sure they get periodically crawled&lt;br /&gt;
and ingested into the Software Heritage archive.&lt;br /&gt;
&lt;br /&gt;
|skills=&lt;br /&gt;
* familiarity with version control systems&lt;br /&gt;
* familiarity with Wikipedia and/or Wikidata&lt;br /&gt;
* Python development&lt;br /&gt;
&lt;br /&gt;
|mentors=&lt;br /&gt;
* TBD (ask on [[Matrix]])&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
[[Category:Available internship]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Ingest_all_Debian_derivatives_(internship)&amp;diff=1815</id>
		<title>Ingest all Debian derivatives (internship)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Ingest_all_Debian_derivatives_(internship)&amp;diff=1815"/>
		<updated>2024-02-04T15:08:39Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Internship&lt;br /&gt;
|description=The Software Heritage archive currently archives&lt;br /&gt;
[https://www.debian.org Debian] as the only supported GNU/Linux distribution.&lt;br /&gt;
The goal of this internship is to extend archive coverage to all Debian&lt;br /&gt;
derivatives, as listed in the&lt;br /&gt;
[https://wiki.debian.org/Derivatives/Census Debian Derivatives Census].&lt;br /&gt;
As part of this internship Software Heritage will be integrated with the&lt;br /&gt;
Derivatives Census so that distributions listed there are automatically and&lt;br /&gt;
periodically archived.&lt;br /&gt;
&lt;br /&gt;
|skills=&lt;br /&gt;
* familiarity with the Debian distribution&lt;br /&gt;
* Python development&lt;br /&gt;
&lt;br /&gt;
|mentors=&lt;br /&gt;
* Nicolas Dandrimont (olasd on [[IRC]])&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
[[Category:Available internship]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Graph_query_language_for_the_archive_(internship)&amp;diff=1814</id>
		<title>Graph query language for the archive (internship)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Graph_query_language_for_the_archive_(internship)&amp;diff=1814"/>
		<updated>2024-02-04T15:08:26Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Internship&lt;br /&gt;
|description=The Software Heritage archive is structured as a graph (specifically, a [https://en.wikipedia.org/wiki/Merkle_tree Merkle DAG]) and is huge: 20 billion nodes, 200 billion edges.&lt;br /&gt;
It has recently been verified that a compressed representation of the graph structure can fit in memory, whereas node/edge properties can be memory-mapped to secondary storage (see: [https://docs.softwareheritage.org/devel/swh-graph/ documentation] and in particular the [https://arxiv.org/abs/2001.08647 SANER 2020 paper referenced there]).&lt;br /&gt;
An [https://docs.softwareheritage.org/devel/swh-graph/api.html ad hoc API] is available to traverse the graph, with very limited querying capabilities.&lt;br /&gt;
The goal of this internship is to experiment with the possibility of querying the archive graph via state-of-the-art [https://en.wikipedia.org/wiki/Graph_database#Graph_query-programming_languages graph query languages].&lt;br /&gt;
&lt;br /&gt;
The ideal outcome of the internship will be a prototype query engine that answers queries on top of the compressed graph representation plus associated property maps.&lt;br /&gt;
A prototype implementation will be developed for a target platform to be chosen during the internship.&lt;br /&gt;
Tentative target platforms include: [https://neo4j.com/ Neo4j] and [http://tinkerpop.apache.org/ Apache TinkerPop].&lt;br /&gt;
In either case a backend/compatibility layer for WebGraph [http://webgraph.di.unimi.it/docs/it/unimi/dsi/webgraph/ImmutableGraph.html ImmutableGraph] will be developed.&lt;br /&gt;
&lt;br /&gt;
|skills=&lt;br /&gt;
* Java development&lt;br /&gt;
* Query languages for structured, semi-structured, or graph data (e.g., one or more among: SQL, XQuery, GraphQL, SPARQL, GQL, etc.)&lt;br /&gt;
&lt;br /&gt;
Will be considered a plus:&lt;br /&gt;
* experience with graph traversal languages (e.g., [https://en.wikipedia.org/wiki/Gremlin_(query_language) Gremlin])&lt;br /&gt;
* experience with graph databases (e.g., [https://neo4j.com/ Neo4j])&lt;br /&gt;
&lt;br /&gt;
|workplace=[https://liris.cnrs.fr/en LIRIS (Univ. Lyon 1, Lyon)] or [https://www.inria.fr/en/centre-inria-de-paris Inria Paris] or contact mentors for remote opportunities&lt;br /&gt;
&lt;br /&gt;
|mentors=&lt;br /&gt;
* [https://perso.liris.cnrs.fr/angela.bonifati/ Angela Bonifati] &amp;lt;angela.bonifati@univ-lyon1.fr&amp;gt;&lt;br /&gt;
* Stefano Zacchiroli &amp;lt;zack@upsilon.cc&amp;gt; (Zack on [[Matrix]])&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
[[Category:Available internship]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Fine-grained_tracking_of_source_code_provenance_(internship)&amp;diff=1813</id>
		<title>Fine-grained tracking of source code provenance (internship)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Fine-grained_tracking_of_source_code_provenance_(internship)&amp;diff=1813"/>
		<updated>2024-02-04T15:07:44Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: archive internship&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Internship&lt;br /&gt;
|description=&lt;br /&gt;
Software Heritage is the largest existing public archive of software source code, which also keeps track of where and when source code files have been observed in the wild.&lt;br /&gt;
Given the checksum of a source code file and the current state of the archive, one can produce a list of all the places where said file has been (publicly) published in the past.&lt;br /&gt;
The goal of this internship is to experiment with increasing the granularity at which source code can be tracked, from entire files (current solution) to code snippets and/or individual lines of code.&lt;br /&gt;
Different techniques will be explored, implemented, and benchmarked on archive subsets to estimate their viability.&lt;br /&gt;
&lt;br /&gt;
|skills=&lt;br /&gt;
* Python development&lt;br /&gt;
&lt;br /&gt;
Will be considered a plus:&lt;br /&gt;
* experience with source code indexing and/or search&lt;br /&gt;
* experience with software audit solutions (for license compliance issues, security vulnerabilities, etc.)&lt;br /&gt;
&lt;br /&gt;
|mentors=&lt;br /&gt;
* Guillaume Rousseau &amp;lt;guillaume.rousseau@univ-paris-diderot.fr&amp;gt;&lt;br /&gt;
* Stefano Zacchiroli &amp;lt;zack@upsilon.cc&amp;gt; (zack on [[IRC]])&lt;br /&gt;
}}&lt;br /&gt;
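&lt;br /&gt;
To make the difference between file-level and line-level tracking concrete, here is a minimal sketch. It assumes the Git blob hashing scheme as the file checksum and uses a naive per-line SHA-1 fingerprint as one possible finer granularity; the function names and the two sample files are hypothetical, and the techniques actually explored during the internship may differ substantially.&lt;br /&gt;

```python
import hashlib
from collections import defaultdict


def sha1_git(data):
    """Git blob hash: SHA-1 over a 'blob LENGTH NUL' header plus the content."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def line_fingerprints(data):
    """Hash every non-blank line after stripping surrounding whitespace."""
    return [
        hashlib.sha1(line.strip()).hexdigest()
        for line in data.splitlines()
        if line.strip()
    ]


def build_line_index(files):
    """Map each line fingerprint to the set of file checksums containing it."""
    index = defaultdict(set)
    for data in files:
        file_id = sha1_git(data)
        for fingerprint in line_fingerprints(data):
            index[fingerprint].add(file_id)
    return index


# Two made-up files: the second one embeds the first with extra lines around it.
original = b"def add(a, b):\n    return a + b\n"
derived = b"# copied helper\ndef add(a, b):\n    return a + b\n\nprint(add(1, 2))\n"

index = build_line_index([original, derived])
shared = [fp for fp in line_fingerprints(original) if len(index[fp]) == 2]
print(sha1_git(original) == sha1_git(derived), len(shared))
```

The two files have different checksums, so file-level tracking treats them as unrelated, while the line index still finds both lines of the original in the derived file. Per-line hashing is only the simplest option and is sensitive to any edit within a line.&lt;br /&gt;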
&lt;br /&gt;
[[Category:Archived internship]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Expand_package_metadata_coverage_(internship)&amp;diff=1812</id>
		<title>Expand package metadata coverage (internship)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Expand_package_metadata_coverage_(internship)&amp;diff=1812"/>
		<updated>2024-02-04T15:07:27Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Internship&lt;br /&gt;
|description=[https://archive.softwareheritage.org/browse/search/ Searching] for projects in the Software Heritage archive is currently possible either by (parts of) their URL or by [https://www.softwareheritage.org/2019/05/28/mining-software-metadata-for-80-m-projects-and-even-more/ package metadata].&lt;br /&gt;
Currently, only a limited number of package metadata formats are [https://docs.softwareheritage.org/devel/swh-indexer/metadata-workflow.html#supported-intrinsic-metadata supported], including Maven, NPM, PyPI, and Gems.&lt;br /&gt;
The goal of this internship is to extend the coverage of supported metadata to additional package managers, with the long-term goal of supporting all package managers indexed by [https://libraries.io/ Libraries.io].&lt;br /&gt;
&lt;br /&gt;
For more information on the existing tools, you can read our [https://www.softwareheritage.org/2019/05/28/mining-software-metadata-for-80-m-projects-and-even-more/ metadata blog post] or dive into the [https://docs.softwareheritage.org/devel/swh-indexer/metadata-workflow.html#adding-support-for-additional-ecosystem-specific-metadata technical tutorial].&lt;br /&gt;
&lt;br /&gt;
|skills=&lt;br /&gt;
* Python development&lt;br /&gt;
&lt;br /&gt;
Will be considered a plus:&lt;br /&gt;
* knowledge of linked data technologies and ontologies (e.g., RDFa, JSON-LD, OWL, etc.)&lt;br /&gt;
&lt;br /&gt;
|mentors=&lt;br /&gt;
* Morane Gruenpeter (moranegg on [[IRC]])&lt;br /&gt;
* Valentin Lorentz (vlorentz on [[IRC]])&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
[[Category:Available internship]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Graph_query_language_for_the_archive_(internship)&amp;diff=1811</id>
		<title>Graph query language for the archive (internship)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Graph_query_language_for_the_archive_(internship)&amp;diff=1811"/>
		<updated>2024-02-04T15:06:20Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: re-enable&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Internship&lt;br /&gt;
|description=The Software Heritage archive is structured as a graph (specifically, a [https://en.wikipedia.org/wiki/Merkle_tree Merkle DAG]) and is huge: 20 billion nodes, 200 billion edges.&lt;br /&gt;
It has recently been verified that a compressed representation of the graph structure can fit in memory, whereas node/edge properties can be memory-mapped to secondary storage (see: [https://docs.softwareheritage.org/devel/swh-graph/ documentation] and in particular the [https://arxiv.org/abs/2001.08647 SANER 2020 paper referenced there]).&lt;br /&gt;
An [https://docs.softwareheritage.org/devel/swh-graph/api.html ad hoc API] is available to traverse the graph, with very limited querying capabilities.&lt;br /&gt;
The goal of this internship is to experiment with the possibility of querying the archive graph via state-of-the-art [https://en.wikipedia.org/wiki/Graph_database#Graph_query-programming_languages graph query languages].&lt;br /&gt;
&lt;br /&gt;
The ideal outcome of the internship will be a prototype query engine that answers queries on top of the compressed graph representation plus associated property maps.&lt;br /&gt;
A prototype implementation will be developed for a target platform to be chosen during the internship.&lt;br /&gt;
Tentative target platforms include: [https://neo4j.com/ Neo4j] and [http://tinkerpop.apache.org/ Apache TinkerPop].&lt;br /&gt;
In either case a backend/compatibility layer for WebGraph [http://webgraph.di.unimi.it/docs/it/unimi/dsi/webgraph/ImmutableGraph.html ImmutableGraph] will be developed.&lt;br /&gt;
&lt;br /&gt;
|skills=&lt;br /&gt;
* Java development&lt;br /&gt;
* Query languages for structured, semi-structured, or graph data (e.g., one or more among: SQL, XQuery, GraphQL, SPARQL, GQL, etc.)&lt;br /&gt;
&lt;br /&gt;
Will be considered a plus:&lt;br /&gt;
* experience with graph traversal languages (e.g., [https://en.wikipedia.org/wiki/Gremlin_(query_language) Gremlin])&lt;br /&gt;
* experience with graph databases (e.g., [https://neo4j.com/ Neo4j])&lt;br /&gt;
&lt;br /&gt;
|workplace=[https://liris.cnrs.fr/en LIRIS (Univ. Lyon 1, Lyon)] or [https://www.inria.fr/en/centre-inria-de-paris Inria Paris] or contact mentors for remote opportunities&lt;br /&gt;
&lt;br /&gt;
|mentors=&lt;br /&gt;
* [https://perso.liris.cnrs.fr/angela.bonifati/ Angela Bonifati] &amp;lt;angela.bonifati@univ-lyon1.fr&amp;gt;&lt;br /&gt;
* Stefano Zacchiroli &amp;lt;zack@upsilon.cc&amp;gt; (zack on [[IRC]])&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
[[Category:Available internship]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Python_bindings_for_WebGraph_(internship)&amp;diff=1810</id>
		<title>Python bindings for WebGraph (internship)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Python_bindings_for_WebGraph_(internship)&amp;diff=1810"/>
		<updated>2024-02-04T15:05:50Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Internship&lt;br /&gt;
|description=Software Heritage uses the [http://webgraph.di.unimi.it/ WebGraph] framework for graph compression. This makes it possible to manipulate the huge archive Merkle DAG efficiently in RAM, via the [https://docs.softwareheritage.org/devel/swh-graph/ swh-graph component]. Since WebGraph is written in Java, we would like to have Python bindings, both for Software Heritage's own needs and to allow researchers used to the [https://www.scipy.org/ Python scientific ecosystem] to analyze and exploit the [https://annex.softwareheritage.org/public/dataset/graph/latest/compressed/ compressed representation of the Software Heritage archive].&lt;br /&gt;
The goal of this internship is to design and implement efficient Python bindings to the WebGraph framework, using the most appropriate [https://wiki.python.org/moin/IntegratingPythonWithOtherLanguages#Java bridge technology] between the two languages.&lt;br /&gt;
&lt;br /&gt;
|skills=&lt;br /&gt;
* Java development&lt;br /&gt;
* Python development&lt;br /&gt;
&lt;br /&gt;
Will be considered a plus:&lt;br /&gt;
* experience with system programming in the C language&lt;br /&gt;
&lt;br /&gt;
|mentors=&lt;br /&gt;
* Antoine Pietri (seirl on [[IRC]])&lt;br /&gt;
* [http://vigna.di.unimi.it/ Sebastiano Vigna]&lt;br /&gt;
* Stefano Zacchiroli &amp;lt;zack@upsilon.cc&amp;gt; (zack on [[IRC]])&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
[[Category:Archived internship]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Python_bindings_for_WebGraph_(internship)&amp;diff=1809</id>
		<title>Python bindings for WebGraph (internship)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Python_bindings_for_WebGraph_(internship)&amp;diff=1809"/>
		<updated>2024-02-04T15:04:49Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: disable internship (obsolete now with webgraph-rs)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Internship&lt;br /&gt;
|description=Software Heritage uses the [http://webgraph.di.unimi.it/ WebGraph] framework for graph compression. This makes it possible to manipulate the huge archive Merkle DAG efficiently in RAM, via the [https://docs.softwareheritage.org/devel/swh-graph/ swh-graph component]. Since WebGraph is written in Java, we would like to have Python bindings, both for Software Heritage's own needs and to allow researchers used to the [https://www.scipy.org/ Python scientific ecosystem] to analyze and exploit the [https://annex.softwareheritage.org/public/dataset/graph/latest/compressed/ compressed representation of the Software Heritage archive].&lt;br /&gt;
The goal of this internship is to design and implement efficient Python bindings to the WebGraph framework, using the most appropriate [https://wiki.python.org/moin/IntegratingPythonWithOtherLanguages#Java bridge technology] between the two languages.&lt;br /&gt;
&lt;br /&gt;
|skills=&lt;br /&gt;
* Java development&lt;br /&gt;
* Python development&lt;br /&gt;
&lt;br /&gt;
Will be considered a plus:&lt;br /&gt;
* experience with system programming in the C language&lt;br /&gt;
&lt;br /&gt;
|mentors=&lt;br /&gt;
* Antoine Pietri (seirl on [[IRC]])&lt;br /&gt;
* [http://vigna.di.unimi.it/ Sebastiano Vigna]&lt;br /&gt;
* Stefano Zacchiroli &amp;lt;zack@upsilon.cc&amp;gt; (zack on [[IRC]])&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
[[Category:Internship]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Git_remote_support_for_Software_Heritage_(internship)&amp;diff=1808</id>
		<title>Git remote support for Software Heritage (internship)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Git_remote_support_for_Software_Heritage_(internship)&amp;diff=1808"/>
		<updated>2024-02-04T15:04:13Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Internship&lt;br /&gt;
|description=Content from the Software Heritage [https://archive.softwareheritage.org/ archive] can be accessed object-by-object (e.g., one file, one commit, etc.) or reconstructed asynchronously in batch using [https://docs.softwareheritage.org/devel/swh-vault/index.html the Vault service] to download a tarball or a [https://git-scm.com/docs/git-bundle git bundle].&lt;br /&gt;
For local use it would be more convenient to have direct support in [https://git-scm.com/ git], so that developers can directly &amp;lt;code&amp;gt;git clone&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;git pull&amp;lt;/code&amp;gt; from the Software Heritage archive, using just git.&lt;br /&gt;
The goal of this internship is to explore the feasibility of this approach, and to design and implement prototype support for it in Git and Software Heritage.&lt;br /&gt;
The intended building blocks that we plan to leverage to this end are git [https://git-scm.com/docs/partial-clone partial clones] and [https://git-scm.com/docs/gitremote-helpers remote helpers].&lt;br /&gt;
&lt;br /&gt;
|skills=&lt;br /&gt;
* C development&lt;br /&gt;
* Python development&lt;br /&gt;
* Git&lt;br /&gt;
&lt;br /&gt;
Will be considered a plus:&lt;br /&gt;
* experience with [https://git-scm.com/book/en/v2/Git-Internals-Plumbing-and-Porcelain Git internals and plumbing]&lt;br /&gt;
&lt;br /&gt;
|mentors=&lt;br /&gt;
* Stefano Zacchiroli &amp;lt;zack@upsilon.cc&amp;gt; (Zack on [[Matrix]])&lt;br /&gt;
* Christian Couder&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
[[Category:Available internship]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Main_Page&amp;diff=1807</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Main_Page&amp;diff=1807"/>
		<updated>2024-02-04T15:03:21Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: /* Students */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Welcome to the wiki of [[Software Heritage]].&lt;br /&gt;
&lt;br /&gt;
We are just getting started, so please bear with us while we organize the content of the wiki.&amp;lt;br &amp;gt;&lt;br /&gt;
In the meantime you can find below entry points for various communities we're collaborating with.&lt;br /&gt;
&lt;br /&gt;
== General ==&lt;br /&gt;
&lt;br /&gt;
* [[Suggestion box]] for software we should add to the [[Archive]] ← add your entry here!&lt;br /&gt;
* [[Talks]] about Software Heritage&lt;br /&gt;
&lt;br /&gt;
== Developers ==&lt;br /&gt;
&lt;br /&gt;
* Read the [https://docs.softwareheritage.org/devel/ docs]&lt;br /&gt;
* Dive into the [https://forge.softwareheritage.org/ code]&lt;br /&gt;
* Subscribe to the [[Mailing lists]]&lt;br /&gt;
* Chat with us on [[Matrix]]&lt;br /&gt;
&lt;br /&gt;
== Scientists ==&lt;br /&gt;
&lt;br /&gt;
* [[:Category:Related work|Related work]]&lt;br /&gt;
* [[Working groups]]&lt;br /&gt;
&lt;br /&gt;
== Students ==&lt;br /&gt;
&lt;br /&gt;
* [[Internships|Internship opportunities]]&lt;br /&gt;
&lt;br /&gt;
== Ambassadors ==&lt;br /&gt;
* [[Ambassadors onboarding]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Integrate_Software_Heritage_and_ClearlyDefined_(internship)&amp;diff=1735</id>
		<title>Integrate Software Heritage and ClearlyDefined (internship)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Integrate_Software_Heritage_and_ClearlyDefined_(internship)&amp;diff=1735"/>
		<updated>2023-03-26T09:23:08Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Internship&lt;br /&gt;
|description=[https://clearlydefined.io/ ClearlyDefined] is a project whose goal is to collaboratively and semi-automatically curate information about Free/Open Source Software (FOSS) projects, including licensing and vulnerability information.&lt;br /&gt;
As one of its main outputs, ClearlyDefined maintains an open data knowledge base that cross-references FOSS source code artifacts found in version control systems, package repositories, etc. with curated information about their licenses and vulnerabilities. The same source code artifacts are archived by Software Heritage for long-term preservation purposes.&lt;br /&gt;
The goal of this internship is to integrate ClearlyDefined and Software Heritage, for mutual benefit.&lt;br /&gt;
Software Heritage will benefit from mirroring ClearlyDefined data, making it possible to query the data while navigating the archive and at scale; ClearlyDefined will benefit from learning about the existence of FOSS projects that have not yet been analyzed for &amp;quot;clarity&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
|skills=&lt;br /&gt;
* JavaScript / NodeJS&lt;br /&gt;
* Python&lt;br /&gt;
* experience with database management systems (of any kind)&lt;br /&gt;
&lt;br /&gt;
|mentors=&lt;br /&gt;
* Philippe Ombredanne (nexB)&lt;br /&gt;
* Valentin Lorentz&lt;br /&gt;
* Stefano Zacchiroli &amp;lt;zack@upsilon.cc&amp;gt; (Software Heritage)&lt;br /&gt;
&lt;br /&gt;
|environment=&lt;br /&gt;
you will work shoulder to shoulder with members of the [https://www.softwareheritage.org/people/ Software Heritage] and [https://clearlydefined.io/about ClearlyDefined] projects, with mentors from both projects&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
== ClearlyDefined ==&lt;br /&gt;
&lt;br /&gt;
* [[ClearlyDefinedObject]]&lt;br /&gt;
&lt;br /&gt;
[[Category:Available internship]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Graph_query_language_for_the_archive_(internship)&amp;diff=1729</id>
		<title>Graph query language for the archive (internship)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Graph_query_language_for_the_archive_(internship)&amp;diff=1729"/>
		<updated>2022-11-10T13:17:12Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Internship&lt;br /&gt;
|description=The Software Heritage archive is structured as a graph (specifically, a [https://en.wikipedia.org/wiki/Merkle_tree Merkle DAG]) and is huge: 20 billion nodes, 200 billion edges.&lt;br /&gt;
It has recently been verified that a compressed representation of the graph structure can fit in memory, whereas node/edge properties can be memory-mapped to secondary storage (see: [https://docs.softwareheritage.org/devel/swh-graph/ documentation] and in particular the [https://arxiv.org/abs/2001.08647 SANER 2020 paper referenced there]).&lt;br /&gt;
An [https://docs.softwareheritage.org/devel/swh-graph/api.html ad hoc API] is available to traverse the graph, with very limited querying capabilities.&lt;br /&gt;
The goal of this internship is to experiment with the possibility of querying the archive graph via state-of-the-art [https://en.wikipedia.org/wiki/Graph_database#Graph_query-programming_languages graph query languages].&lt;br /&gt;
&lt;br /&gt;
The ideal outcome of the internship will be a prototype query engine that answers queries on top of the compressed graph representation plus associated property maps.&lt;br /&gt;
A prototype implementation will be developed for a target platform to be chosen during the internship.&lt;br /&gt;
Tentative target platforms include: [https://neo4j.com/ Neo4j] and [http://tinkerpop.apache.org/ Apache TinkerPop].&lt;br /&gt;
In either case a backend/compatibility layer for WebGraph [http://webgraph.di.unimi.it/docs/it/unimi/dsi/webgraph/ImmutableGraph.html ImmutableGraph] will be developed.&lt;br /&gt;
&lt;br /&gt;
|skills=&lt;br /&gt;
* Java development&lt;br /&gt;
* Query languages for structured, semi-structured, or graph data (e.g., one or more among: SQL, XQuery, GraphQL, SPARQL, GQL, etc.)&lt;br /&gt;
&lt;br /&gt;
Will be considered a plus:&lt;br /&gt;
* experience with graph traversal languages (e.g., [https://en.wikipedia.org/wiki/Gremlin_(query_language) Gremlin])&lt;br /&gt;
* experience with graph databases (e.g., [https://neo4j.com/ Neo4j])&lt;br /&gt;
&lt;br /&gt;
|workplace=[https://liris.cnrs.fr/en LIRIS (Univ. Lyon 1, Lyon)] or [https://www.inria.fr/en/centre-inria-de-paris Inria Paris] or contact mentors for remote opportunities&lt;br /&gt;
&lt;br /&gt;
|mentors=&lt;br /&gt;
* [https://perso.liris.cnrs.fr/angela.bonifati/ Angela Bonifati] &amp;lt;angela.bonifati@univ-lyon1.fr&amp;gt;&lt;br /&gt;
* Stefano Zacchiroli &amp;lt;zack@upsilon.cc&amp;gt; (zack on [[IRC]])&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
[[Category:Ongoing internship]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=TinkerPop_Gremlin_backend_for_WebGraph_(internship)&amp;diff=1718</id>
		<title>TinkerPop Gremlin backend for WebGraph (internship)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=TinkerPop_Gremlin_backend_for_WebGraph_(internship)&amp;diff=1718"/>
		<updated>2022-07-15T09:58:19Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: mark internship as completed&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Internship&lt;br /&gt;
|description=Software Heritage uses the [http://webgraph.di.unimi.it/ WebGraph] framework for graph compression. This allows us to manipulate the huge archive Merkle DAG efficiently in RAM, via the [https://docs.softwareheritage.org/devel/swh-graph/ swh-graph component]. The [https://docs.softwareheritage.org/devel/swh-graph/api.html current RPC API] for navigating the graph is, however, very limited and ad hoc. We would like to exploit the current compressed graph representation using a standard graph traversal language such as [https://en.wikipedia.org/wiki/Gremlin_(query_language) Gremlin].&lt;br /&gt;
The goal of this internship is to design, implement, and experiment with a backend for [http://tinkerpop.apache.org/ Apache TinkerPop] (a popular open source implementation of Gremlin) that sits on top of WebGraph. If successful, it will make it possible to traverse the huge Software Heritage graph with both the current efficiency and the convenience of a high-level and expressive graph traversal language.&lt;br /&gt;
&lt;br /&gt;
|skills=&lt;br /&gt;
* Java development&lt;br /&gt;
* basic knowledge of graph theory&lt;br /&gt;
&lt;br /&gt;
Will be considered a plus:&lt;br /&gt;
* experience with the implementation of graph-based applications&lt;br /&gt;
&lt;br /&gt;
|mentors=&lt;br /&gt;
* Antoine Pietri (seirl on [[IRC]])&lt;br /&gt;
* [http://vigna.di.unimi.it/ Sebastiano Vigna]&lt;br /&gt;
* Stefano Zacchiroli &amp;lt;zack@upsilon.cc&amp;gt; (zack on [[IRC]])&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
[[Category:Completed internship]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Add_sources_to_the_project_search_engine_(GSoC_task)&amp;diff=1708</id>
		<title>Add sources to the project search engine (GSoC task)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Add_sources_to_the_project_search_engine_(GSoC_task)&amp;diff=1708"/>
		<updated>2022-03-18T09:08:03Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: /* Task description */ fix markup for ordered list&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
The [https://archive.softwareheritage.org/ homepage of the Software Heritage archive]&lt;br /&gt;
features a small search engine that searches project URLs and project metadata.&lt;br /&gt;
Project metadata includes name, description, authors, etc.&lt;br /&gt;
&lt;br /&gt;
This is implemented by a Python service backed by an ElasticSearch database,&lt;br /&gt;
which contains one document for each project; each document contains metadata&lt;br /&gt;
mined from the project itself.&lt;br /&gt;
&lt;br /&gt;
== Task description ==&lt;br /&gt;
&lt;br /&gt;
We would like to add more data sources to the ElasticSearch database;&lt;br /&gt;
typically sources that are not authoritative, but provide metadata of usually&lt;br /&gt;
good quality.&lt;br /&gt;
&lt;br /&gt;
This comes with the following challenges:&lt;br /&gt;
&lt;br /&gt;
# there are multiple sources, and their contents must work together&lt;br /&gt;
# sources differ in reliability, which should be taken into account when ranking search results&lt;br /&gt;
&lt;br /&gt;
Therefore, this task will require making a plan to address these challenges,&lt;br /&gt;
defining a data model, and finally implementing it in a backend.&lt;br /&gt;
It may involve some frontend work if necessary, to provide an interface for&lt;br /&gt;
these.&lt;br /&gt;
&lt;br /&gt;
== Desirable skills ==&lt;br /&gt;
&lt;br /&gt;
* Python 3 and Git are a must to work on any Software Heritage project&lt;br /&gt;
* ElasticSearch&lt;br /&gt;
* Experience with cross-referenced data mining would be appreciated&lt;br /&gt;
&lt;br /&gt;
== Potential mentors ==&lt;br /&gt;
&lt;br /&gt;
* Kumar Shivendu (KShivendu on [[IRC]])&lt;br /&gt;
* Valentin Lorentz (vlorentz on [[IRC]])&lt;br /&gt;
* Vincent Sellier (vsellier on [[IRC]])&lt;br /&gt;
&lt;br /&gt;
== Other relevant (but independent) tasks ==&lt;br /&gt;
&lt;br /&gt;
This task is only about adding data we have already collected to the existing Elasticsearch database;&lt;br /&gt;
you may also be interested in [[Mine information from archived content (GSoC task)]]&lt;br /&gt;
and [[Mine information from external sources (GSoC task)]] to fill this&lt;br /&gt;
database; but those are completely independent tasks.&lt;br /&gt;
&lt;br /&gt;
This database only contains project URLs and metadata, not source code.&lt;br /&gt;
Source code search is more complex, but is available as an&lt;br /&gt;
[[Source code search engine prototype (internship)|internship topic]].&lt;br /&gt;
&lt;br /&gt;
[[Category:GSoC task]]&lt;br /&gt;
[[Category:Available GSoC task]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Google_Summer_of_Code_2022&amp;diff=1694</id>
		<title>Google Summer of Code 2022</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Google_Summer_of_Code_2022&amp;diff=1694"/>
		<updated>2022-02-25T10:12:55Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: /* Ideas list */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;div style=&amp;quot;text-align: center; font-size: 1.2em; border: solid 1px black; padding: 1em;&amp;quot;&amp;gt;&lt;br /&gt;
This page is a work in progress; the list of retained organizations for Google Summer of Code 2022 has not been determined yet.&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:GSoCLogo.png|512px]]&lt;br /&gt;
&lt;br /&gt;
== General information ==&lt;br /&gt;
&lt;br /&gt;
This page is the central point of information for [[Software Heritage]] participation in the [https://summerofcode.withgoogle.com/ Google Summer of Code] program in 2022.&lt;br /&gt;
&lt;br /&gt;
Google Summer of Code is a program where Google pays students stipends to work over the (northern hemisphere) summer on free software projects such as Software Heritage. Each student works with mentors from the community to complete a software project.&lt;br /&gt;
&lt;br /&gt;
== I want to participate as a student ==&lt;br /&gt;
&lt;br /&gt;
Great! We are very glad for your interest in contributing to Software Heritage and we are looking forward to working together.&lt;br /&gt;
&lt;br /&gt;
=== Prerequisites ===&lt;br /&gt;
&lt;br /&gt;
The following prerequisites apply to all Software Heritage GSoC projects:&lt;br /&gt;
&lt;br /&gt;
* [https://www.python.org Python 3] is our language of choice; you should be fluent in it to apply&lt;br /&gt;
* [https://git-scm.com Git] is our version control system of choice; you should be familiar with it to apply&lt;br /&gt;
* basic knowledge of how to use a CLI&lt;br /&gt;
* additional prerequisites depend on the project you will work on; check project descriptions for details&lt;br /&gt;
&lt;br /&gt;
=== Before you apply ===&lt;br /&gt;
&lt;br /&gt;
Here are the steps you should follow before applying, to make sure you have a good grasp of what we are doing at Software Heritage and how we do it:&lt;br /&gt;
&lt;br /&gt;
# Follow our [https://docs.softwareheritage.org/devel/developer-setup.html developer setup tutorial]: it will make sure you have the source code of our software stack locally available and that you can run unit tests&lt;br /&gt;
# Create an account on our [https://forge.softwareheritage.org development forge]&lt;br /&gt;
# Familiarize yourself with our [[Code review in Phabricator|code review workflow]]&lt;br /&gt;
# Make at least one simple change to any one of our [https://docs.softwareheritage.org/devel/ software components] and submit it as a [https://forge.softwareheritage.org/differential/ diff] for code review, following the above workflow. [[Easy hacks]] and [https://forge.softwareheritage.org/project/view/20/ Web UI] issues are good options for what to fix, but feel free to submit any patch you think might be useful.&lt;br /&gt;
&lt;br /&gt;
=== What to include in your application ===&lt;br /&gt;
&lt;br /&gt;
Make sure that your application includes the following information:&lt;br /&gt;
&lt;br /&gt;
* Describe the '''specific project''' you want to work on. What do you want to achieve? Why is it important? Why is it useful for Software Heritage? The project might be one of the project ideas that we have prepared below, or something else entirely that you want to contribute to Software Heritage. Your source code archival pet peeve, surprise us!&lt;br /&gt;
* Detail your '''work plan''': a brief description of how you plan to go about your project, including a list of ''deliverables'' and a ''timeline'' of when you expect them to be available.&lt;br /&gt;
* Include a reference to '''the diff''' you submitted before applying (see the &amp;quot;Before you apply&amp;quot; section above).&lt;br /&gt;
&lt;br /&gt;
== Ideas list ==&lt;br /&gt;
&lt;br /&gt;
Below you can find a list of project ideas that are good options for a reasonably-sized GSoC project (check individual idea pages for expected duration and difficulty of each task):&lt;br /&gt;
&amp;lt;DynamicPageList&amp;gt;&lt;br /&gt;
category = Available GSoC task&lt;br /&gt;
ordermethod = sortkey&lt;br /&gt;
order = ascending&lt;br /&gt;
&amp;lt;/DynamicPageList&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We also have a [[Internships|list of internship topics]], which you can use for ideas when applying to GSoC with us.&lt;br /&gt;
Expect each internship topic to require 350 hours and to be harder than GSoC-specific tasks.&lt;br /&gt;
&lt;br /&gt;
All project ideas above are just suggestions; don't feel obliged to pick one of them if nothing fits your taste and abilities.&lt;br /&gt;
Feel free to propose something else that you are excited about and that contributes to improving the Software Heritage archive: we will be happy to consider it!&lt;br /&gt;
&lt;br /&gt;
== Contact ==&lt;br /&gt;
&lt;br /&gt;
GSoC students are encouraged to get in touch with the Software Heritage community using the standard development communication channels, and in particular our [[IRC]] channel (#swh-devel on [https://libera.chat/ Libera Chat]) and mailing list (swh-devel).&lt;br /&gt;
&lt;br /&gt;
See our [https://www.softwareheritage.org/community/developers/ development information page] for details.&lt;br /&gt;
&lt;br /&gt;
== Timeline ==&lt;br /&gt;
&lt;br /&gt;
See the official [https://developers.google.com/open-source/gsoc/timeline Google Summer of Code timeline].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Google Summer of Code]]&lt;br /&gt;
[[Category:Google Summer of Code 2022]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Google_Summer_of_Code_2022&amp;diff=1693</id>
		<title>Google Summer of Code 2022</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Google_Summer_of_Code_2022&amp;diff=1693"/>
		<updated>2022-02-25T10:11:19Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: /* Ideas list */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;div style=&amp;quot;text-align: center; font-size: 1.2em; border: solid 1px black; padding: 1em;&amp;quot;&amp;gt;&lt;br /&gt;
This page is a work in progress; the list of retained organizations for Google Summer of Code 2022 has not been determined yet.&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:GSoCLogo.png|512px]]&lt;br /&gt;
&lt;br /&gt;
== General information ==&lt;br /&gt;
&lt;br /&gt;
This page is the central point of information for [[Software Heritage]] participation in the [https://summerofcode.withgoogle.com/ Google Summer of Code] program in 2022.&lt;br /&gt;
&lt;br /&gt;
Google Summer of Code is a program where Google pays students stipends to work over the (northern hemisphere) summer on free software projects such as Software Heritage. Each student works with mentors from the community to complete a software project.&lt;br /&gt;
&lt;br /&gt;
== I want to participate as a student ==&lt;br /&gt;
&lt;br /&gt;
Great! We are very glad for your interest in contributing to Software Heritage and we are looking forward to working together.&lt;br /&gt;
&lt;br /&gt;
=== Prerequisites ===&lt;br /&gt;
&lt;br /&gt;
The following prerequisites apply to all Software Heritage GSoC projects:&lt;br /&gt;
&lt;br /&gt;
* [https://www.python.org Python 3] is our language of choice; you should be fluent in it to apply&lt;br /&gt;
* [https://git-scm.com Git] is our version control system of choice; you should be familiar with it to apply&lt;br /&gt;
* basic knowledge of how to use a CLI&lt;br /&gt;
* additional prerequisites depend on the project you will work on; check project descriptions for details&lt;br /&gt;
&lt;br /&gt;
=== Before you apply ===&lt;br /&gt;
&lt;br /&gt;
Here are the steps you should follow before applying, to make sure you have a good grasp of what we are doing at Software Heritage and how we do it:&lt;br /&gt;
&lt;br /&gt;
# Follow our [https://docs.softwareheritage.org/devel/developer-setup.html developer setup tutorial]: it will make sure you have the source code of our software stack locally available and that you can run unit tests&lt;br /&gt;
# Create an account on our [https://forge.softwareheritage.org development forge]&lt;br /&gt;
# Familiarize yourself with our [[Code review in Phabricator|code review workflow]]&lt;br /&gt;
# Make at least one simple change to any one of our [https://docs.softwareheritage.org/devel/ software components] and submit it as a [https://forge.softwareheritage.org/differential/ diff] for code review, following the above workflow. [[Easy hacks]] and [https://forge.softwareheritage.org/project/view/20/ Web UI] issues are good options for what to fix, but feel free to submit any patch you think might be useful.&lt;br /&gt;
&lt;br /&gt;
=== What to include in your application ===&lt;br /&gt;
&lt;br /&gt;
Make sure that your application includes the following information:&lt;br /&gt;
&lt;br /&gt;
* Describe the '''specific project''' you want to work on. What do you want to achieve? Why is it important? Why is it useful for Software Heritage? The project might be one of the project ideas that we have prepared below, or something else entirely that you want to contribute to Software Heritage. Your source code archival pet peeve, surprise us!&lt;br /&gt;
* Detail your '''work plan''': a brief description of how you plan to go about your project, including a list of ''deliverables'' and a ''timeline'' of when you expect them to be available.&lt;br /&gt;
* Include a reference to '''the diff''' you submitted before applying (see the &amp;quot;Before you apply&amp;quot; section above).&lt;br /&gt;
&lt;br /&gt;
== Ideas list ==&lt;br /&gt;
&lt;br /&gt;
Below you can find a list of project ideas that are good options for a reasonably-sized GSoC project:&lt;br /&gt;
&amp;lt;DynamicPageList&amp;gt;&lt;br /&gt;
category = Available GSoC task&lt;br /&gt;
ordermethod = sortkey&lt;br /&gt;
order = ascending&lt;br /&gt;
&amp;lt;/DynamicPageList&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We also have a [[Internships|list of internship topics]], which you can use for ideas when applying to GSoC with us.&lt;br /&gt;
Expect each internship topic to require 350 hours and to be harder than GSoC-specific tasks.&lt;br /&gt;
&lt;br /&gt;
All project ideas above are just suggestions; don't feel obliged to pick one of them if nothing fits your taste and abilities.&lt;br /&gt;
Feel free to propose something else that you are excited about and that contributes to improving the Software Heritage archive: we will be happy to consider it!&lt;br /&gt;
&lt;br /&gt;
== Contact ==&lt;br /&gt;
&lt;br /&gt;
GSoC students are encouraged to get in touch with the Software Heritage community using the standard development communication channels, and in particular our [[IRC]] channel (#swh-devel on [https://libera.chat/ Libera Chat]) and mailing list (swh-devel).&lt;br /&gt;
&lt;br /&gt;
See our [https://www.softwareheritage.org/community/developers/ development information page] for details.&lt;br /&gt;
&lt;br /&gt;
== Timeline ==&lt;br /&gt;
&lt;br /&gt;
See the official [https://developers.google.com/open-source/gsoc/timeline Google Summer of Code timeline].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Google Summer of Code]]&lt;br /&gt;
[[Category:Google Summer of Code 2022]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Google_Summer_of_Code_2022&amp;diff=1691</id>
		<title>Google Summer of Code 2022</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Google_Summer_of_Code_2022&amp;diff=1691"/>
		<updated>2022-02-25T10:01:51Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;div style=&amp;quot;text-align: center; font-size: 1.2em; border: solid 1px black; padding: 1em;&amp;quot;&amp;gt;&lt;br /&gt;
This page is a work in progress; the list of retained organizations for Google Summer of Code 2022 has not been determined yet.&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:GSoCLogo.png|512px]]&lt;br /&gt;
&lt;br /&gt;
== General information ==&lt;br /&gt;
&lt;br /&gt;
This page is the central point of information for [[Software Heritage]] participation in the [https://summerofcode.withgoogle.com/ Google Summer of Code] program in 2022.&lt;br /&gt;
&lt;br /&gt;
Google Summer of Code is a program where Google pays students stipends to work over the (northern hemisphere) summer on free software projects such as Software Heritage. Each student works with mentors from the community to complete a software project.&lt;br /&gt;
&lt;br /&gt;
== I want to participate as a student ==&lt;br /&gt;
&lt;br /&gt;
Great! We are very glad for your interest in contributing to Software Heritage and we are looking forward to working together.&lt;br /&gt;
&lt;br /&gt;
=== Prerequisites ===&lt;br /&gt;
&lt;br /&gt;
The following prerequisites apply to all Software Heritage GSoC projects:&lt;br /&gt;
&lt;br /&gt;
* [https://www.python.org Python 3] is our language of choice; you should be fluent in it to apply&lt;br /&gt;
* [https://git-scm.com Git] is our version control system of choice; you should be familiar with it to apply&lt;br /&gt;
* basic knowledge of how to use a CLI&lt;br /&gt;
* additional prerequisites depend on the project you will work on; check project descriptions for details&lt;br /&gt;
&lt;br /&gt;
=== Before you apply ===&lt;br /&gt;
&lt;br /&gt;
Here are the steps you should follow before applying, to make sure you have a good grasp of what we are doing at Software Heritage and how we do it:&lt;br /&gt;
&lt;br /&gt;
# Follow our [https://docs.softwareheritage.org/devel/developer-setup.html developer setup tutorial]: it will make sure you have the source code of our software stack locally available and that you can run unit tests&lt;br /&gt;
# Create an account on our [https://forge.softwareheritage.org development forge]&lt;br /&gt;
# Familiarize yourself with our [[Code review in Phabricator|code review workflow]]&lt;br /&gt;
# Make at least one simple change to any one of our [https://docs.softwareheritage.org/devel/ software components] and submit it as a [https://forge.softwareheritage.org/differential/ diff] for code review, following the above workflow. [[Easy hacks]] and [https://forge.softwareheritage.org/project/view/20/ Web UI] issues are good options for what to fix, but feel free to submit any patch you think might be useful.&lt;br /&gt;
&lt;br /&gt;
=== What to include in your application ===&lt;br /&gt;
&lt;br /&gt;
Make sure that your application includes the following information:&lt;br /&gt;
&lt;br /&gt;
* Describe the '''specific project''' you want to work on. What do you want to achieve? Why is it important? Why is it useful for Software Heritage? The project might be one of the project ideas that we have prepared below, or something else entirely that you want to contribute to Software Heritage. Your source code archival pet peeve, surprise us!&lt;br /&gt;
* Detail your '''work plan''': a brief description of how you plan to go about your project, including a list of ''deliverables'' and a ''timeline'' of when you expect them to be available.&lt;br /&gt;
* Include a reference to '''the diff''' you submitted before applying (see the &amp;quot;Before you apply&amp;quot; section above).&lt;br /&gt;
&lt;br /&gt;
== Ideas list ==&lt;br /&gt;
&lt;br /&gt;
Below you can find a list of project ideas that are good options for a reasonably-sized GSoC project.&lt;br /&gt;
They include both GSoC-specific tasks (that are only available as part of our GSoC participation) and project ideas that are also available as [[Internships|internship]] topics outside of GSoC.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;DynamicPageList&amp;gt;&lt;br /&gt;
category = Available GSoC task&lt;br /&gt;
ordermethod = sortkey&lt;br /&gt;
order = ascending&lt;br /&gt;
&amp;lt;/DynamicPageList&amp;gt;&lt;br /&gt;
&amp;lt;DynamicPageList&amp;gt;&lt;br /&gt;
category = Available internship&lt;br /&gt;
ordermethod = sortkey&lt;br /&gt;
order = ascending&lt;br /&gt;
&amp;lt;/DynamicPageList&amp;gt;&lt;br /&gt;
&lt;br /&gt;
All project ideas above are just suggestions; don't feel obliged to pick one of them if nothing fits your taste and abilities.&lt;br /&gt;
Feel free to propose something else that you are excited about and that contributes to improving the Software Heritage archive: we will be happy to consider it!&lt;br /&gt;
&lt;br /&gt;
== Contact ==&lt;br /&gt;
&lt;br /&gt;
GSoC students are encouraged to get in touch with the Software Heritage community using the standard development communication channels, and in particular our [[IRC]] channel (#swh-devel on [https://libera.chat/ Libera Chat]) and mailing list (swh-devel).&lt;br /&gt;
&lt;br /&gt;
See our [https://www.softwareheritage.org/community/developers/ development information page] for details.&lt;br /&gt;
&lt;br /&gt;
== Timeline ==&lt;br /&gt;
&lt;br /&gt;
See the official [https://developers.google.com/open-source/gsoc/timeline Google Summer of Code timeline].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Google Summer of Code]]&lt;br /&gt;
[[Category:Google Summer of Code 2022]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Category:Google_Summer_of_Code_2022&amp;diff=1690</id>
		<title>Category:Google Summer of Code 2022</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Category:Google_Summer_of_Code_2022&amp;diff=1690"/>
		<updated>2022-02-25T10:00:59Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: Created page with &amp;quot;Pages related to the 2022 edition of the [https://summerofcode.withgoogle.com/ Google summer of Code program].  Category:Google Summer of Code&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Pages related to the 2022 edition of the [https://summerofcode.withgoogle.com/ Google Summer of Code program].&lt;br /&gt;
&lt;br /&gt;
[[Category:Google Summer of Code]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Category:Google_Summer_of_Code_2021&amp;diff=1689</id>
		<title>Category:Google Summer of Code 2021</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Category:Google_Summer_of_Code_2021&amp;diff=1689"/>
		<updated>2022-02-25T10:00:37Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Pages related to the 2021 edition of the [https://summerofcode.withgoogle.com/ Google Summer of Code program].&lt;br /&gt;
&lt;br /&gt;
[[Category:Google Summer of Code]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Category:Google_Summer_of_Code_2021&amp;diff=1688</id>
		<title>Category:Google Summer of Code 2021</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Category:Google_Summer_of_Code_2021&amp;diff=1688"/>
		<updated>2022-02-25T10:00:23Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Pages related to the 2021 edition of the [https://summerofcode.withgoogle.com/ Google Summer of Code program].&lt;br /&gt;
&lt;br /&gt;
[[Category:Google Summer of Code]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Mine_information_from_external_sources_(GSoC_task)&amp;diff=1687</id>
		<title>Mine information from external sources (GSoC task)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Mine_information_from_external_sources_(GSoC_task)&amp;diff=1687"/>
		<updated>2022-02-25T09:57:46Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: /* Task description */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
In addition to archiving source code artifacts, Software Heritage is interested in&lt;br /&gt;
archiving metadata from external sources and correlating it with source code artifacts.&lt;br /&gt;
This will also enable semantic searches on the archive and support scientific research.&lt;br /&gt;
&lt;br /&gt;
Collecting this extrinsic metadata is a&lt;br /&gt;
[https://forge.softwareheritage.org/T1739 work in progress], and you are welcome&lt;br /&gt;
to contribute to its implementation.&lt;br /&gt;
&lt;br /&gt;
== Task description ==&lt;br /&gt;
&lt;br /&gt;
You would contribute to the design of our metadata-fetching architecture.&lt;br /&gt;
This includes:&lt;br /&gt;
&lt;br /&gt;
* Reviewing what metadata we want to fetch&lt;br /&gt;
* Working out how to efficiently fetch it at regular intervals and store it&lt;br /&gt;
* Implementing metadata fetching from at least one source, in a way that can be generalized to other sources&lt;br /&gt;
&lt;br /&gt;
=== Expected duration ===&lt;br /&gt;
&lt;br /&gt;
175 or 350 hours, at your option (longer duration means you can tackle more formats).&lt;br /&gt;
Difficulty: hard&lt;br /&gt;
&lt;br /&gt;
== Desirable skills ==&lt;br /&gt;
&lt;br /&gt;
* Python 3 and Git are a must to work on any Software Heritage project&lt;br /&gt;
* Prior experience in working with software metadata is a plus, but not required&lt;br /&gt;
&lt;br /&gt;
== Potential mentors ==&lt;br /&gt;
&lt;br /&gt;
* Stefano Zacchiroli &amp;lt;zack@upsilon.cc&amp;gt; (zack on [[IRC]])&lt;br /&gt;
* Valentin Lorentz (vlorentz on [[IRC]])&lt;br /&gt;
&lt;br /&gt;
[[Category:Available GSoC task]]&lt;br /&gt;
[[Category:GSoC task]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Mine_information_from_archived_content_(GSoC_task)&amp;diff=1686</id>
		<title>Mine information from archived content (GSoC task)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Mine_information_from_archived_content_(GSoC_task)&amp;diff=1686"/>
		<updated>2022-02-25T09:57:30Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
In addition to archival, Software Heritage indexes the retrieved source code&lt;br /&gt;
artifacts, to enable semantic searches on the archive and scientific research.&lt;br /&gt;
&lt;br /&gt;
Indexing can happen at the individual file level (e.g., detecting the programming&lt;br /&gt;
language a file is written in or the license declared in its header), or at a&lt;br /&gt;
coarser granularity (e.g., extracting the metadata declared for the most&lt;br /&gt;
recently archived version of a given project).&lt;br /&gt;
&lt;br /&gt;
A number of indexes are [https://forge.softwareheritage.org/source/swh-indexer/ currently supported],&lt;br /&gt;
such as:&lt;br /&gt;
&lt;br /&gt;
* file-level mining:&lt;br /&gt;
** MIME type detection (using libmagic)&lt;br /&gt;
** license detection (using FOSSology/nomossa)&lt;br /&gt;
** language detection (using Pygments)&lt;br /&gt;
** ctags extraction (using universal-ctags)&lt;br /&gt;
* project-level mining:&lt;br /&gt;
** Ruby gemspec metadata&lt;br /&gt;
** Python PKG-INFO metadata&lt;br /&gt;
** Maven pom.xml metadata&lt;br /&gt;
** NPM package.json metadata&lt;br /&gt;
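&lt;br /&gt;
As a rough illustration of project-level mining, the sketch below pulls a few descriptive fields out of the text of an NPM package.json file, using only the Python standard library.&lt;br /&gt;
The function name, the selection of fields, and the sample input are hypothetical and chosen for illustration; this is not meant to reflect how the existing swh-indexer code is structured.&lt;br /&gt;
&lt;br /&gt;
```python
import json


def extract_npm_metadata(raw: str) -> dict:
    """Return a few descriptive fields from the text of a package.json file.

    Hypothetical helper for illustration only; not part of swh-indexer.
    """
    data = json.loads(raw)
    license_field = data.get("license")
    # Some older packages declare the license as an object, e.g. {"type": "MIT"}.
    if isinstance(license_field, dict):
        license_field = license_field.get("type")
    return {
        "name": data.get("name"),
        "version": data.get("version"),
        "license": license_field,
        "keywords": data.get("keywords", []),
    }


# Made-up sample input, not taken from a real package.
sample = '{"name": "example-package", "version": "1.0.0", "license": {"type": "MIT"}}'
metadata = extract_npm_metadata(sample)
```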
&lt;br /&gt;
== Task description ==&lt;br /&gt;
&lt;br /&gt;
Writing additional indexers that extract more information from archived source&lt;br /&gt;
code is welcome and would constitute a suitable GSoC project.&lt;br /&gt;
&lt;br /&gt;
Name the kind of data mining you want to do!&lt;br /&gt;
&lt;br /&gt;
For inspiration you can have a look at [https://libraries.io Libraries.io]:&lt;br /&gt;
most package formats/package managers have dedicated ways of expressing&lt;br /&gt;
metadata, and we only support a small number of them so far. But do not&lt;br /&gt;
restrict your ambition to those: any kind of data extraction/mining you want to&lt;br /&gt;
do on the archive could work.&lt;br /&gt;
&lt;br /&gt;
You may also add support for multiple formats at once, using an external tool,&lt;br /&gt;
such as [https://github.com/datacite/bolognese Bolognese] or&lt;br /&gt;
[https://github.com/librariesio/bibliothecary/ bibliothecary].&lt;br /&gt;
&lt;br /&gt;
=== Expected duration ===&lt;br /&gt;
&lt;br /&gt;
175 or 350 hours, at your option (longer duration means you can tackle more formats). Difficulty: medium&lt;br /&gt;
&lt;br /&gt;
== Desirable skills ==&lt;br /&gt;
&lt;br /&gt;
* Python 3 and Git are a must to work on any Software Heritage project&lt;br /&gt;
* Prior experience in working with (source code) metadata is a plus, but not required&lt;br /&gt;
&lt;br /&gt;
== Potential mentors ==&lt;br /&gt;
&lt;br /&gt;
* Stefano Zacchiroli &amp;lt;zack@upsilon.cc&amp;gt; (zack on [[IRC]])&lt;br /&gt;
* Valentin Lorentz (vlorentz on [[IRC]])&lt;br /&gt;
&lt;br /&gt;
[[Category:Available GSoC task]]&lt;br /&gt;
[[Category:GSoC task]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Make_the_Deposit_modular_(GSoC_task)&amp;diff=1685</id>
		<title>Make the Deposit modular (GSoC task)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Make_the_Deposit_modular_(GSoC_task)&amp;diff=1685"/>
		<updated>2022-02-25T09:57:14Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: StefanoZacchiroli moved page Make the Deposit modular (GSoC task) to Make the software deposit service (swh-deposit) modular (GSoC task)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;#REDIRECT [[Make the software deposit service (swh-deposit) modular (GSoC task)]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Make_the_software_deposit_service_(swh-deposit)_modular_(GSoC_task)&amp;diff=1684</id>
		<title>Make the software deposit service (swh-deposit) modular (GSoC task)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Make_the_software_deposit_service_(swh-deposit)_modular_(GSoC_task)&amp;diff=1684"/>
		<updated>2022-02-25T09:57:14Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: StefanoZacchiroli moved page Make the Deposit modular (GSoC task) to Make the software deposit service (swh-deposit) modular (GSoC task)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
The Software Heritage archive is the most comprehensive open data knowledge base about source code that has been published openly.&lt;br /&gt;
&lt;br /&gt;
In addition to fetching source code from public repositories, it offers a Deposit service, which allows platforms to send code for Software Heritage to archive.&lt;br /&gt;
&lt;br /&gt;
This service is currently written as a monolith that grew over the years to include a complete [http://swordapp.github.io/SWORDv2-Profile/SWORDProfile.html SWORDv2] server, a partial SWORDv2 client with extensions, and business logic specific to Software Heritage in both. This makes the current code hard to maintain and impossible to reuse.&lt;br /&gt;
&lt;br /&gt;
== Task description ==&lt;br /&gt;
&lt;br /&gt;
[https://forge.softwareheritage.org/source/swh-deposit/ swh-deposit] would need to be split into the following parts:&lt;br /&gt;
&lt;br /&gt;
* a generic SWORDv2 server (based on Django)&lt;br /&gt;
* a generic SWORDv2 client&lt;br /&gt;
* server-side business logic (currently implemented mostly in swh/deposit/api/common.py, but tightly coupled with the views)&lt;br /&gt;
* client-side business logic&lt;br /&gt;
&lt;br /&gt;
The generic server and client will need to be extensively documented, so they can be reused by other software projects.&lt;br /&gt;
&lt;br /&gt;
Possible extensions include:&lt;br /&gt;
&lt;br /&gt;
* The code should also be designed to allow extensions to support SWORDv3, if we ever need to support it&lt;br /&gt;
* A new administration front-end and/or addition of administrative tools in [https://forge.softwareheritage.org/source/swh-web/ swh-web]&lt;br /&gt;
&lt;br /&gt;
=== Expected duration ===&lt;br /&gt;
&lt;br /&gt;
175 hours. Difficulty: medium&lt;br /&gt;
&lt;br /&gt;
== Desirable skills ==&lt;br /&gt;
&lt;br /&gt;
* Python 3 and Git are a must to work on any Software Heritage project&lt;br /&gt;
* Basic understanding of the Software Heritage [https://docs.softwareheritage.org/devel/swh-model/data-model.html data model] and of [https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html SWHID identifiers]&lt;br /&gt;
* Experience with Django&lt;br /&gt;
&lt;br /&gt;
== Potential mentors ==&lt;br /&gt;
&lt;br /&gt;
* Antoine Dumont &amp;lt;ardumont@softwareheritage.org&amp;gt; (ardumont on [[IRC]])&lt;br /&gt;
* Valentin Lorentz &amp;lt;vlorentz@softwareheritage.org&amp;gt; (vlorentz on [[IRC]])&lt;br /&gt;
&lt;br /&gt;
[[Category:Available GSoC task]]&lt;br /&gt;
[[Category:GSoC task]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Make_the_software_deposit_service_(swh-deposit)_modular_(GSoC_task)&amp;diff=1683</id>
		<title>Make the software deposit service (swh-deposit) modular (GSoC task)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Make_the_software_deposit_service_(swh-deposit)_modular_(GSoC_task)&amp;diff=1683"/>
		<updated>2022-02-25T09:56:48Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: /* Task description */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
The Software Heritage archive is the most comprehensive open data knowledge base about source code that has been published openly.&lt;br /&gt;
&lt;br /&gt;
In addition to fetching source code from public repositories, it offers a Deposit service, which allows platforms to send code for Software Heritage to archive.&lt;br /&gt;
&lt;br /&gt;
This service is currently written as a monolith that grew over the years to include a complete [http://swordapp.github.io/SWORDv2-Profile/SWORDProfile.html SWORDv2] server, a partial SWORDv2 client with extensions, and business logic specific to Software Heritage in both. This makes the current code hard to maintain and impossible to reuse.&lt;br /&gt;
&lt;br /&gt;
== Task description ==&lt;br /&gt;
&lt;br /&gt;
[https://forge.softwareheritage.org/source/swh-deposit/ swh-deposit] would need to be split into the following parts:&lt;br /&gt;
&lt;br /&gt;
* a generic SWORDv2 server (based on Django)&lt;br /&gt;
* a generic SWORDv2 client&lt;br /&gt;
* server-side business logic (currently implemented mostly in swh/deposit/api/common.py, but tightly coupled with the views)&lt;br /&gt;
* client-side business logic&lt;br /&gt;
&lt;br /&gt;
The generic server and client will need to be extensively documented, so they can be reused by other software projects.&lt;br /&gt;
&lt;br /&gt;
Possible extensions include:&lt;br /&gt;
&lt;br /&gt;
* The code should also be designed to allow extensions to support SWORDv3, if we ever need to support it&lt;br /&gt;
* A new administration front-end and/or addition of administrative tools in [https://forge.softwareheritage.org/source/swh-web/ swh-web]&lt;br /&gt;
&lt;br /&gt;
=== Expected duration ===&lt;br /&gt;
&lt;br /&gt;
175 hours. Difficulty: medium&lt;br /&gt;
&lt;br /&gt;
== Desirable skills ==&lt;br /&gt;
&lt;br /&gt;
* Python 3 and Git are a must to work on any Software Heritage project&lt;br /&gt;
* Basic understanding of the Software Heritage [https://docs.softwareheritage.org/devel/swh-model/data-model.html data model] and of [https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html SWHID identifiers]&lt;br /&gt;
* Experience with Django&lt;br /&gt;
&lt;br /&gt;
== Potential mentors ==&lt;br /&gt;
&lt;br /&gt;
* Antoine Dumont &amp;lt;ardumont@softwareheritage.org&amp;gt; (ardumont on [[IRC]])&lt;br /&gt;
* Valentin Lorentz &amp;lt;vlorentz@softwareheritage.org&amp;gt; (vlorentz on [[IRC]])&lt;br /&gt;
&lt;br /&gt;
[[Category:Available GSoC task]]&lt;br /&gt;
[[Category:GSoC task]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Improve_and_extend_the_archive_Web_UI_(GSoC_task)&amp;diff=1682</id>
		<title>Improve and extend the archive Web UI (GSoC task)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Improve_and_extend_the_archive_Web_UI_(GSoC_task)&amp;diff=1682"/>
		<updated>2022-02-25T09:56:20Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: /* Task description */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
As you probably know already, the Software Heritage archive can be&lt;br /&gt;
[https://archive.softwareheritage.org browsed on the Web]. The&lt;br /&gt;
[https://forge.softwareheritage.org/source/swh-web/ code] powering that&lt;br /&gt;
interface is a Django application that also implements a&lt;br /&gt;
[https://archive.softwareheritage.org/api/ Web API].&lt;br /&gt;
&lt;br /&gt;
== Task description ==&lt;br /&gt;
&lt;br /&gt;
Several improvements are possible on the archive Web interface and would make&lt;br /&gt;
great GSoC projects. Some ideas to whet your appetite:&lt;br /&gt;
&lt;br /&gt;
* add developer-oriented features, e.g., source file history, blame/praise interface, in-browser edit (with patch download), ... (note that this will also require backend design and implementation)&lt;br /&gt;
* improve [https://www.w3.org/WAI/ accessibility]&lt;br /&gt;
* display metadata we already mined from the archive&lt;br /&gt;
&lt;br /&gt;
=== Expected duration ===&lt;br /&gt;
&lt;br /&gt;
* if you choose the developer-oriented features: 350 hours. Difficulty: hard&lt;br /&gt;
* for others mentioned above: 175 hours. Difficulty: easy&lt;br /&gt;
&lt;br /&gt;
== Desirable skills ==&lt;br /&gt;
&lt;br /&gt;
* Python 3 and Git are a must to work on any Software Heritage project&lt;br /&gt;
* Django&lt;br /&gt;
* web development and/or design&lt;br /&gt;
* JavaScript knowledge is useful but not required&lt;br /&gt;
&lt;br /&gt;
== Potential mentors ==&lt;br /&gt;
&lt;br /&gt;
* Antoine Lambert (anlambert on [[IRC]])&lt;br /&gt;
* Valentin Lorentz (vlorentz on [[IRC]])&lt;br /&gt;
* Jayesh Velayudhan (jayeshv on [[IRC]])&lt;br /&gt;
&lt;br /&gt;
== Other relevant (but independent) tasks ==&lt;br /&gt;
&lt;br /&gt;
[[Improve project search engine (GSoC task)]] is an independent task,&lt;br /&gt;
which may or may not involve improvements to the Web UI, depending on&lt;br /&gt;
your tastes.&lt;br /&gt;
&lt;br /&gt;
While this task can be about displaying metadata we already mined,&lt;br /&gt;
you may also be interested in [[Mine information from archived content (GSoC task)]]&lt;br /&gt;
and [[Mine information from external sources (GSoC task)]] to mine more&lt;br /&gt;
of this metadata; but those are completely independent tasks.&lt;br /&gt;
&lt;br /&gt;
[[Category:Available GSoC task]]&lt;br /&gt;
[[Category:GSoC task]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Dashboard_UI_for_the_Code_Scanner_(GSoC_task)&amp;diff=1681</id>
		<title>Dashboard UI for the Code Scanner (GSoC task)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Dashboard_UI_for_the_Code_Scanner_(GSoC_task)&amp;diff=1681"/>
		<updated>2022-02-25T09:56:00Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: /* Task description */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
The Software Heritage archive is the most comprehensive open data knowledge base about source code that has been published openly.&lt;br /&gt;
&lt;br /&gt;
As such, it can be used to scan local source code bases to detect which parts of it come from public code, including Free and Open Source Software.&lt;br /&gt;
&lt;br /&gt;
The Software Heritage scanner (&amp;lt;code&amp;gt;swh-scanner&amp;lt;/code&amp;gt;) ([https://docs.softwareheritage.org/devel/swh-scanner/ documentation], [https://forge.softwareheritage.org/source/swh-scanner/ code], [https://upsilon.cc/~zack/talks/2021/2021-04-07-llw.pdf slides of a 2021 presentation about swh-scanner]) is a command line tool that enables doing that.&lt;br /&gt;
&lt;br /&gt;
== Task description ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;swh-scanner&amp;lt;/code&amp;gt; is currently an experimental tool, which works well in practice but needs a real '''dashboard user interface''' to be useful.&lt;br /&gt;
Several output options are currently available when invoking the &amp;lt;code&amp;gt;swh scanner scan&amp;lt;/code&amp;gt; command, in particular batch output in textual and JSON format, and an interactive&lt;br /&gt;
dashboard (with the &amp;lt;code&amp;gt;-i/--interactive&amp;lt;/code&amp;gt; option).&lt;br /&gt;
&lt;br /&gt;
The interactive view currently works by producing a local HTML file and opening it using the local browser.&lt;br /&gt;
The goal of this project is to '''improve the interactive view''', making it a serious dashboard-style UI to peruse scanning results.&lt;br /&gt;
&lt;br /&gt;
The following improvements are suggested, although more can be proposed (and even more could be discovered during the project work):&lt;br /&gt;
&lt;br /&gt;
* Technology: generating a local HTML file is not necessarily the best way to render results. Alternative solutions should be explored, including a self-hosted web app, rendering results with state-of-the-art frontend web frameworks (CSS/HTML/JavaScript)&lt;br /&gt;
* Scalability: rendering currently doesn't work when scanning large code bases such as the Linux kernel. It should be made lazy, loading only the data to show when needed&lt;br /&gt;
* Functionality: dashboard rendering should be integrated with the possibility of opening the local source code files that have been scanned; e.g., users will want to be able to open in-browser files that have been detected as known/unknown, in order to figure out why&lt;br /&gt;
* Functionality: in the future, additional information will be added to scanning results, including license and provenance information. While this information is not yet available due to backend limitations, the proposed UI should plan ahead for how/where to display it&lt;br /&gt;
* Paper cuts: [https://forge.softwareheritage.org/tag/code_scanner/ various issues] affect the usability of swh-scanner; fixing them would be welcome as part of this project&lt;br /&gt;
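&lt;br /&gt;
To make the scalability point concrete, here is a minimal sketch of lazy loading: a tree node that lists a directory only the first time its children are requested, so that a large code base is never read in full up front.&lt;br /&gt;
The &amp;lt;code&amp;gt;LazyNode&amp;lt;/code&amp;gt; class is hypothetical and does not come from swh-scanner; a real dashboard would apply the same idea to scan results rather than to raw directory listings.&lt;br /&gt;
&lt;br /&gt;
```python
import os
from typing import List, Optional


class LazyNode:
    """A file-system entry whose children are listed only on first request.

    Hypothetical class for illustration; not part of swh-scanner.
    """

    def __init__(self, path: str) -> None:
        self.path = path
        self.name = os.path.basename(path)
        # None means the directory has not been read yet.
        self._children: Optional[List["LazyNode"]] = None

    @property
    def loaded(self) -> bool:
        """True once the children of this node have been listed."""
        return self._children is not None

    def children(self) -> List["LazyNode"]:
        """Return direct children, reading the directory on the first call only."""
        if self._children is None:
            if os.path.isdir(self.path):
                names = sorted(os.listdir(self.path))
                self._children = [
                    LazyNode(os.path.join(self.path, name)) for name in names
                ]
            else:
                self._children = []  # regular files have no children
        return self._children
```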
&lt;br /&gt;
=== Expected duration ===&lt;br /&gt;
&lt;br /&gt;
350 hours. Difficulty: medium&lt;br /&gt;
&lt;br /&gt;
== Desirable skills ==&lt;br /&gt;
&lt;br /&gt;
* Python 3 and Git are a must to work on any Software Heritage project&lt;br /&gt;
* Basic understanding of the Software Heritage [https://docs.softwareheritage.org/devel/swh-model/data-model.html data model] and of [https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html SWHID identifiers]&lt;br /&gt;
* HTML/CSS/JavaScript and web development in general&lt;br /&gt;
* Working knowledge of UI/UX design principles&lt;br /&gt;
&lt;br /&gt;
== Potential mentors ==&lt;br /&gt;
&lt;br /&gt;
* Stefano Zacchiroli &amp;lt;zack@upsilon.cc&amp;gt; (zack on [[IRC]])&lt;br /&gt;
&lt;br /&gt;
[[Category:Available GSoC task]]&lt;br /&gt;
[[Category:GSoC task]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Create_embeddable_widgets_(GSoC_task)&amp;diff=1680</id>
		<title>Create embeddable widgets (GSoC task)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Create_embeddable_widgets_(GSoC_task)&amp;diff=1680"/>
		<updated>2022-02-25T09:55:47Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: /* Task description */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
Create embeddable JS widgets to make SWH features easily available.&lt;br /&gt;
&lt;br /&gt;
== Task description ==&lt;br /&gt;
&lt;br /&gt;
The idea is to have a set of JS widgets that can be embedded in any web page with JS enabled.&lt;br /&gt;
Widgets will be smart enough to make their own API calls and render the results.&lt;br /&gt;
&lt;br /&gt;
Some widgets could be:&lt;br /&gt;
* SWH search box&lt;br /&gt;
* SWH search results&lt;br /&gt;
* SWH browse (for project/revision or file)&lt;br /&gt;
* Save Code Now&lt;br /&gt;
&lt;br /&gt;
=== Expected duration ===&lt;br /&gt;
&lt;br /&gt;
350 hours. Difficulty: hard&lt;br /&gt;
&lt;br /&gt;
== Desirable skills ==&lt;br /&gt;
* Python 3 and Git are a must to work on any Software Heritage project&lt;br /&gt;
* web development and/or design&lt;br /&gt;
* JavaScript&lt;br /&gt;
&lt;br /&gt;
== Potential mentors ==&lt;br /&gt;
* Jayesh Velayudhan (jayeshv on [[IRC]])&lt;br /&gt;
&lt;br /&gt;
[[Category:Available GSoC task]]&lt;br /&gt;
[[Category:GSoC task]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Create_a_browser_extension_(GSoC_task)&amp;diff=1679</id>
		<title>Create a browser extension (GSoC task)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Create_a_browser_extension_(GSoC_task)&amp;diff=1679"/>
		<updated>2022-02-25T09:55:20Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
As you probably know already, the Software Heritage archive can be&lt;br /&gt;
[https://archive.softwareheritage.org browsed on the Web]. The&lt;br /&gt;
[https://forge.softwareheritage.org/source/swh-web/ code] powering that&lt;br /&gt;
interface is a Django application that also implements a&lt;br /&gt;
[https://archive.softwareheritage.org/api/ Web API].&lt;br /&gt;
&lt;br /&gt;
== Task description ==&lt;br /&gt;
&lt;br /&gt;
When browsing a repository (on GitHub, GitLab, ...) or a package description (on NPM, Debian),&lt;br /&gt;
people may want to check whether (and when) this repository or package was last archived in Software Heritage.&lt;br /&gt;
Currently, this means opening the archive in a new tab, searching for the URL, and looking at the status.&lt;br /&gt;
Then, they can trigger a new archival with another few clicks (via the &amp;quot;Save Code Now&amp;quot; feature).&lt;br /&gt;
&lt;br /&gt;
This workflow may be streamlined by the [https://forge.softwareheritage.org/T3756 creation of a browser extension or bookmarklet].&lt;br /&gt;
This extension/bookmarklet would, for example, display an icon next to the URL bar indicating the status of the currently visited repository;&lt;br /&gt;
clicking it would show details (like the date of last visit) and run &amp;quot;Save Code Now&amp;quot; in just two clicks.&lt;br /&gt;
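&lt;br /&gt;
The two lookups such an extension would perform can be sketched as plain URL construction against the Web API mentioned above.&lt;br /&gt;
The &amp;lt;code&amp;gt;/api/1&amp;lt;/code&amp;gt; prefix and both endpoint paths are assumptions based on our reading of the public API documentation and should be checked there before use; the function names and the sample origin are hypothetical.&lt;br /&gt;
The sketch is in Python for consistency with the rest of the code base, although the extension itself would be written in JavaScript.&lt;br /&gt;
&lt;br /&gt;
```python
# Assumed API root; verify against the Web API documentation.
API_ROOT = "https://archive.softwareheritage.org/api/1"


def latest_visit_url(origin_url: str) -> str:
    """URL to query (GET) for the most recent visit of an origin.

    The path layout is an assumption, not confirmed by this page.
    """
    return f"{API_ROOT}/origin/{origin_url}/visit/latest/"


def save_code_now_url(origin_url: str, visit_type: str = "git") -> str:
    """URL to call (POST) to request a new archival of an origin.

    The path layout is an assumption, not confirmed by this page.
    """
    return f"{API_ROOT}/origin/save/{visit_type}/url/{origin_url}/"


# Hypothetical origin, used only to show the resulting URLs.
origin = "https://example.org/user/project.git"
status_url = latest_visit_url(origin)
save_url = save_code_now_url(origin)
```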
&lt;br /&gt;
=== Expected duration ===&lt;br /&gt;
&lt;br /&gt;
175 hours. Difficulty: easy&lt;br /&gt;
&lt;br /&gt;
== Desirable skills ==&lt;br /&gt;
&lt;br /&gt;
* Python 3 and Git are a must to work on any Software Heritage project&lt;br /&gt;
* JavaScript experience is also needed for this project&lt;br /&gt;
* Prior experience in working with browser extensions is a plus&lt;br /&gt;
&lt;br /&gt;
== Potential mentors ==&lt;br /&gt;
&lt;br /&gt;
* Antoine Lambert (anlambert on [[IRC]])&lt;br /&gt;
* Valentin Lorentz (vlorentz on [[IRC]])&lt;br /&gt;
* Jayesh Velayudhan (jayeshv on [[IRC]]) &lt;br /&gt;
&lt;br /&gt;
[[Category:Available GSoC task]]&lt;br /&gt;
[[Category:GSoC task]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Main_Page&amp;diff=1671</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Main_Page&amp;diff=1671"/>
		<updated>2022-02-16T12:44:55Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: /* Students */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Welcome to the wiki of [[Software Heritage]].&lt;br /&gt;
&lt;br /&gt;
We are just getting started, so please bear with us while we organize the content of the wiki.&amp;lt;br &amp;gt;&lt;br /&gt;
In the meantime you can find below entry points for various communities we're collaborating with.&lt;br /&gt;
&lt;br /&gt;
== General ==&lt;br /&gt;
&lt;br /&gt;
* [[Suggestion box]] for software we should add to the [[Archive]] ← add your entry here!&lt;br /&gt;
* [[Talks]] about Software Heritage&lt;br /&gt;
&lt;br /&gt;
== Developers ==&lt;br /&gt;
&lt;br /&gt;
* Read the [https://docs.softwareheritage.org/devel/ docs]&lt;br /&gt;
* Dive into the [https://forge.softwareheritage.org/ code]&lt;br /&gt;
* Subscribe to the [[Mailing lists]]&lt;br /&gt;
* Chat with us on [[IRC]]&lt;br /&gt;
&lt;br /&gt;
== Scientists ==&lt;br /&gt;
&lt;br /&gt;
* [[:Category:Related work|Related work]]&lt;br /&gt;
* [[Working groups]]&lt;br /&gt;
&lt;br /&gt;
== Students ==&lt;br /&gt;
&lt;br /&gt;
* [[Internships|Internship opportunities]]&lt;br /&gt;
* [[Google Summer of Code 2022]]&lt;br /&gt;
&lt;br /&gt;
== Ambassadors ==&lt;br /&gt;
* [[Ambassadors onboarding]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Archive_search_query_language_(internship)&amp;diff=1670</id>
		<title>Archive search query language (internship)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Archive_search_query_language_(internship)&amp;diff=1670"/>
		<updated>2022-02-16T12:43:31Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Internship&lt;br /&gt;
|description=The current [https://archive.softwareheritage.org/browse/search/ archive search engine] accepts a single list of tokens that are searched either across origin URLs or in [https://www.softwareheritage.org/2019/05/28/mining-software-metadata-for-80-m-projects-and-even-more/ extracted metadata].&lt;br /&gt;
The goal of this internship is to design and implement an archive query language that allows mixing terms, boolean connectors, and operators (as [https://support.google.com/websearch/answer/2466433 other] [https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html search] [https://help.qwant.com/help/qwant-junior/refine-search-with-operators/ engines] do).&lt;br /&gt;
Part of this internship is language design, part language parsing, and part language evaluation (by linking together query results returned by the already existing search backends for the archive).&lt;br /&gt;
&lt;br /&gt;
|skills=&lt;br /&gt;
* Python development&lt;br /&gt;
&lt;br /&gt;
Will be considered a plus:&lt;br /&gt;
* experience with data mining and/or information retrieval and/or web searches&lt;br /&gt;
&lt;br /&gt;
|mentors=&lt;br /&gt;
* Antoine Lambert (anlambert on [[IRC]])&lt;br /&gt;
* Valentin Lorentz (vlorentz on [[IRC]])&lt;br /&gt;
* Stefano Zacchiroli &amp;lt;zack@upsilon.cc&amp;gt; (zack on [[IRC]])&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
[[Category:Completed internship]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Expand_metadata_search_coverage_(internship)&amp;diff=1669</id>
		<title>Expand metadata search coverage (internship)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Expand_metadata_search_coverage_(internship)&amp;diff=1669"/>
		<updated>2022-02-16T12:42:48Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: StefanoZacchiroli moved page Expand metadata search coverage (internship) to Expand package metadata coverage (internship)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;#REDIRECT [[Expand package metadata coverage (internship)]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Expand_package_metadata_coverage_(internship)&amp;diff=1668</id>
		<title>Expand package metadata coverage (internship)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Expand_package_metadata_coverage_(internship)&amp;diff=1668"/>
		<updated>2022-02-16T12:42:48Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: StefanoZacchiroli moved page Expand metadata search coverage (internship) to Expand package metadata coverage (internship)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Internship&lt;br /&gt;
|description=[https://archive.softwareheritage.org/browse/search/ Searching] projects in the Software Heritage archive is currently possible either by (parts of) their URL or by [https://www.softwareheritage.org/2019/05/28/mining-software-metadata-for-80-m-projects-and-even-more/ package metadata].&lt;br /&gt;
Currently, only a limited number of package metadata formats are [https://docs.softwareheritage.org/devel/swh-indexer/metadata-workflow.html#supported-intrinsic-metadata supported], including Maven, NPM, PyPI, and Gems.&lt;br /&gt;
The goal of this internship is to extend the coverage of supported metadata to additional package managers, the long-term goal being to support all [https://libraries.io/ Libraries.io]-indexed package managers.&lt;br /&gt;
&lt;br /&gt;
For more information on the existing tools, you can read our [https://www.softwareheritage.org/2019/05/28/mining-software-metadata-for-80-m-projects-and-even-more/ metadata blog post] or dive into the [https://docs.softwareheritage.org/devel/swh-indexer/metadata-workflow.html#adding-support-for-additional-ecosystem-specific-metadata technical tutorial].&lt;br /&gt;
&lt;br /&gt;
|skills=&lt;br /&gt;
* Python development&lt;br /&gt;
&lt;br /&gt;
Will be considered a plus:&lt;br /&gt;
* knowledge of linked data technologies and ontologies (e.g., RDFa, JSON-LD, OWL, etc.)&lt;br /&gt;
&lt;br /&gt;
|mentors=&lt;br /&gt;
* Morane Gruenpeter (moranegg on [[IRC]])&lt;br /&gt;
* Valentin Lorentz (vlorentz on [[IRC]])&lt;br /&gt;
* Stefano Zacchiroli &amp;lt;zack@irif.fr&amp;gt;&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
[[Category:Available internship]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=TinkerPop_Gremlin_backend_for_WebGraph_(internship)&amp;diff=1667</id>
		<title>TinkerPop Gremlin backend for WebGraph (internship)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=TinkerPop_Gremlin_backend_for_WebGraph_(internship)&amp;diff=1667"/>
		<updated>2022-02-16T12:40:40Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Internship&lt;br /&gt;
|description=Software Heritage uses the [http://webgraph.di.unimi.it/ WebGraph] framework for graph compression. This makes it possible to manipulate the huge archive Merkle DAG in RAM efficiently, via the [https://docs.softwareheritage.org/devel/swh-graph/ swh-graph component]. The [https://docs.softwareheritage.org/devel/swh-graph/api.html current RPC API] to navigate the graph is, however, very limited and ad hoc. We would like to exploit the current compressed graph representation using a standard graph traversal language such as the [https://en.wikipedia.org/wiki/Gremlin_(query_language) Gremlin graph traversal language].&lt;br /&gt;
The goal of this internship is to design, implement, and experiment with a backend for [http://tinkerpop.apache.org/ Apache TinkerPop] (a popular open source implementation of Gremlin) that sits on top of WebGraph. If successful, it will make it possible to traverse the huge Software Heritage graph with both the current efficiency and the convenience of a high-level and expressive graph traversal language.&lt;br /&gt;
&lt;br /&gt;
|skills=&lt;br /&gt;
* Java development&lt;br /&gt;
* basic knowledge of graph theory&lt;br /&gt;
&lt;br /&gt;
The following will be considered a plus:&lt;br /&gt;
* experience with the implementation of graph-based applications&lt;br /&gt;
&lt;br /&gt;
|mentors=&lt;br /&gt;
* Antoine Pietri (seirl on [[IRC]])&lt;br /&gt;
* [http://vigna.di.unimi.it/ Sebastiano Vigna]&lt;br /&gt;
* Stefano Zacchiroli &amp;lt;zack@upsilon.cc&amp;gt; (zack on [[IRC]])&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
[[Category:Ongoing internship]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Integrate_Software_Heritage_and_GHTorrent_(internship)&amp;diff=1666</id>
		<title>Integrate Software Heritage and GHTorrent (internship)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Integrate_Software_Heritage_and_GHTorrent_(internship)&amp;diff=1666"/>
		<updated>2022-02-16T12:40:17Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Internship&lt;br /&gt;
|description=Software Heritage is building the largest source code repository in existence,&lt;br /&gt;
initially populated with all projects from GitHub.&lt;br /&gt;
The [http://ghtorrent.org/ GHTorrent] project collects&lt;br /&gt;
and archives data from the GitHub API, including issues, teams, pull requests and&lt;br /&gt;
commits. The purpose of this internship is to integrate the construction processes&lt;br /&gt;
of the respective datasets.&lt;br /&gt;
The goal is to allow the two projects to be updated independently while also creating&lt;br /&gt;
a fusion point where updates from either project's database are integrated into a&lt;br /&gt;
centralized, queryable archive in a streaming fashion.&lt;br /&gt;
&lt;br /&gt;
|skills=&lt;br /&gt;
* knowledge of streaming data technologies&lt;br /&gt;
* familiarity with the internals of Git&lt;br /&gt;
* familiarity with the GitHub API&lt;br /&gt;
* working knowledge of one or more of Python, Kafka, Postgres, MySQL, and MongoDB would be a plus&lt;br /&gt;
&lt;br /&gt;
|mentors=&lt;br /&gt;
* Georgios Gousios &amp;lt;g.gousios@tudelft.nl&amp;gt;&lt;br /&gt;
* Stefano Zacchiroli &amp;lt;zack@upsilon.cc&amp;gt; (zack on [[IRC]])&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
[[Category:Internship]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Gsoc&amp;diff=1665</id>
		<title>Gsoc</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Gsoc&amp;diff=1665"/>
		<updated>2022-02-16T12:39:06Z</updated>

		<summary type="html">&lt;p&gt;StefanoZacchiroli: Changed redirect target from Google Summer of Code 2021 to Google Summer of Code 2022&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;#REDIRECT [[Google Summer of Code 2022]]&lt;/div&gt;</summary>
		<author><name>StefanoZacchiroli</name></author>
	</entry>
</feed>