<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.softwareheritage.org/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Vlorentz</id>
	<title>Software Heritage Wiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.softwareheritage.org/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Vlorentz"/>
	<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/wiki/Special:Contributions/Vlorentz"/>
	<updated>2026-04-20T12:18:30Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.39.10</generator>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Matrix&amp;diff=1867</id>
		<title>Matrix</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Matrix&amp;diff=1867"/>
		<updated>2025-09-07T08:43:17Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: simplify phrasing&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Matrix rooms ==&lt;br /&gt;
&lt;br /&gt;
The following rooms have been registered on the [https://matrix.org/ Matrix] network for [[Software Heritage]] usage.&lt;br /&gt;
&lt;br /&gt;
* [https://matrix.to/#/#swh-devel:matrix.org '''#swh-devel:matrix.org''']: public development discussions&lt;br /&gt;
* [https://matrix.to/#/#swh-sysadm:matrix.org '''#swh-sysadm:matrix.org''']: operations team discussions/bots&lt;br /&gt;
* [https://matrix.to/#/#swh-offtopic:matrix.org '''#swh-offtopic:matrix.org''']: Off-topic discussions&lt;br /&gt;
* [https://matrix.to/#/#swh-team:matrix.org '''#swh-team:matrix.org''']: private discussions of the core team&lt;br /&gt;
* [https://matrix.to/#/#swh:matrix.org '''#swh:matrix.org''']: general discussions around Software Heritage&lt;br /&gt;
&lt;br /&gt;
They are all part of the [https://matrix.to/#/#swh-space:matrix.org '''#swh-space:matrix.org'''] Matrix space.&lt;br /&gt;
&lt;br /&gt;
You will be asked to create a Matrix account if you don't have one yet.&lt;br /&gt;
&lt;br /&gt;
Software Heritage closed their former [https://libera.chat libera.chat] IRC channels on 2023-12-08.&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Matrix&amp;diff=1866</id>
		<title>Matrix</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Matrix&amp;diff=1866"/>
		<updated>2025-09-07T07:26:38Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: Use complete room alias instead of just the local part&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Matrix rooms ==&lt;br /&gt;
&lt;br /&gt;
The following rooms have been registered on the [https://matrix.org/ Matrix] network for [[Software Heritage]] usage.&lt;br /&gt;
&lt;br /&gt;
* [https://matrix.to/#/#swh-devel:matrix.org '''#swh-devel:matrix.org''']: public development discussions&lt;br /&gt;
* [https://matrix.to/#/#swh-sysadm:matrix.org '''#swh-sysadm:matrix.org''']: operations team discussions/bots&lt;br /&gt;
* [https://matrix.to/#/#swh-offtopic:matrix.org '''#swh-offtopic:matrix.org''']: Off-topic discussions&lt;br /&gt;
* [https://matrix.to/#/#swh-team:matrix.org '''#swh-team:matrix.org''']: private discussions of the core team&lt;br /&gt;
* [https://matrix.to/#/#swh:matrix.org '''#swh:matrix.org''']: general discussions around Software Heritage&lt;br /&gt;
&lt;br /&gt;
They are all part of the [https://matrix.to/#/#swh-space:matrix.org '''#swh-space:matrix.org'''] Matrix space.&lt;br /&gt;
&lt;br /&gt;
You will be asked to create a Matrix account if you don't have one yet.&lt;br /&gt;
&lt;br /&gt;
Software Heritage has made the decision to close their former [https://libera.chat libera.chat] IRC channels on 2023-12-08&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Matrix&amp;diff=1865</id>
		<title>Matrix</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Matrix&amp;diff=1865"/>
		<updated>2025-09-07T07:26:08Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: /* Matrix rooms */ Matrix is the network, matrix.org is a homeserver (though also the front page of the network)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Matrix rooms ==&lt;br /&gt;
&lt;br /&gt;
The following rooms have been registered on the [https://matrix.org/ Matrix] network for [[Software Heritage]] usage.&lt;br /&gt;
&lt;br /&gt;
* [https://matrix.to/#/#swh-devel:matrix.org '''#swh-devel''']: public development discussions&lt;br /&gt;
* [https://matrix.to/#/#swh-sysadm:matrix.org '''#swh-sysadm''']: operations team discussions/bots&lt;br /&gt;
* [https://matrix.to/#/#swh-offtopic:matrix.org '''#swh-offtopic''']: Off-topic discussions&lt;br /&gt;
* [https://matrix.to/#/#swh-team:matrix.org '''#swh-team''']: private discussions of the core team&lt;br /&gt;
* [https://matrix.to/#/#swh:matrix.org '''#swh''']: general discussions around Software Heritage&lt;br /&gt;
&lt;br /&gt;
They are all part of the [https://matrix.to/#/#swh-space:matrix.org '''#swh-space'''] Matrix space.&lt;br /&gt;
&lt;br /&gt;
You will be asked to create a Matrix account if you don't have one yet.&lt;br /&gt;
&lt;br /&gt;
Software Heritage has made the decision to close their former [https://libera.chat libera.chat] IRC channels on 2023-12-08&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Matrix&amp;diff=1864</id>
		<title>Matrix</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Matrix&amp;diff=1864"/>
		<updated>2025-09-07T07:24:36Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Matrix rooms ==&lt;br /&gt;
&lt;br /&gt;
The following rooms have been registered on the [https://matrix.org/ matrix.org] network for [[Software Heritage]] usage.&lt;br /&gt;
&lt;br /&gt;
* [https://matrix.to/#/#swh-devel:matrix.org '''#swh-devel''']: public development discussions&lt;br /&gt;
* [https://matrix.to/#/#swh-sysadm:matrix.org '''#swh-sysadm''']: operations team discussions/bots&lt;br /&gt;
* [https://matrix.to/#/#swh-offtopic:matrix.org '''#swh-offtopic''']: Off-topic discussions&lt;br /&gt;
* [https://matrix.to/#/#swh-team:matrix.org '''#swh-team''']: private discussions of the core team&lt;br /&gt;
* [https://matrix.to/#/#swh:matrix.org '''#swh''']: general discussions around Software Heritage&lt;br /&gt;
&lt;br /&gt;
They are all part of the [https://matrix.to/#/#swh-space:matrix.org '''#swh-space'''] Matrix space.&lt;br /&gt;
&lt;br /&gt;
You will be asked to create a Matrix account if you don't have one yet.&lt;br /&gt;
&lt;br /&gt;
Software Heritage has made the decision to close their former [https://libera.chat libera.chat] IRC channels on 2023-12-08&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Matrix&amp;diff=1863</id>
		<title>Matrix</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Matrix&amp;diff=1863"/>
		<updated>2025-09-07T07:23:23Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: /* Matrix rooms */ Use matrix.to links instead of hardcoding the app.element.io client&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Matrix rooms ==&lt;br /&gt;
&lt;br /&gt;
The following rooms have been registered on the [https://matrix.org/ matrix.org] network for [[Software Heritage]] usage.&lt;br /&gt;
&lt;br /&gt;
* [https://matrix.to/#/#swh-devel:matrix.org '''#swh-devel''']: public development discussions&lt;br /&gt;
* [https://matrix.to/#/#swh-sysadm:matrix.org '''#swh-sysadm''']: operations team discussions/bots&lt;br /&gt;
* [https://matrix.to/#/#swh-offtopic:matrix.org '''#swh-offtopic''']: Off-topic discussions&lt;br /&gt;
* [https://matrix.to/#/#swh-team:matrix.org '''#swh-team''']: private discussions of the core team&lt;br /&gt;
* [https://matrix.to/#/#swh:matrix.org '''#swh''']: general discussions around Software Heritage&lt;br /&gt;
&lt;br /&gt;
You will be asked to create a Matrix account if you don't have one yet.&lt;br /&gt;
&lt;br /&gt;
Software Heritage has made the decision to close their former [https://libera.chat libera.chat] IRC channels on 2023-12-08&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Matrix&amp;diff=1853</id>
		<title>Matrix</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Matrix&amp;diff=1853"/>
		<updated>2024-09-12T15:19:24Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: s/Element account/Matrix account/&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Matrix rooms ==&lt;br /&gt;
&lt;br /&gt;
The following rooms have been registered on the [https://matrix.org/ matrix.org] network for [[Software Heritage]] usage.&lt;br /&gt;
&lt;br /&gt;
* [https://app.element.io/#/room/#swh-devel:matrix.org '''#swh-devel''']: public development discussions&lt;br /&gt;
* [https://app.element.io/#/room/#swh-sysadm:matrix.org '''#swh-sysadm''']: operations team discussions/bots&lt;br /&gt;
* [https://app.element.io/#/room/#swh-offtopic:matrix.org '''#swh-offtopic''']: Off-topic discussions&lt;br /&gt;
* [https://app.element.io/#/room/#swh-team:matrix.org '''#swh-team''']: private discussions of the core team&lt;br /&gt;
* [https://app.element.io/#/room/#swh:matrix.org '''#swh''']: general discussions around Software Heritage&lt;br /&gt;
&lt;br /&gt;
You will be asked to create a Matrix account if you don't have one yet.&lt;br /&gt;
&lt;br /&gt;
Software Heritage has made the decision to close their former [https://libera.chat libera.chat] IRC channels on 2023-12-08&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Easy_hacks&amp;diff=1825</id>
		<title>Easy hacks</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Easy_hacks&amp;diff=1825"/>
		<updated>2024-02-20T20:39:43Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: fix link&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;An '''easy hack''' (or an &amp;quot;easy bug&amp;quot;, or a &amp;quot;newcomer bug&amp;quot;, [http://wiki.openhatch.org/Easy_bugs_for_newcomers depending on the project]) is a pending task in the development of a collaborative Free/Open Source project that is considered well suited to new contributors.&lt;br /&gt;
It may be a trivial bug to fix or task to accomplish, or it may be something less trivial that is well suited to learning how a specific part or process of the project works.&lt;br /&gt;
&lt;br /&gt;
[[Software Heritage]] curates a [https://forge.softwareheritage.org/tag/easy_hack/ list of easy hacks] that are suitable for newcomers to the project.&lt;br /&gt;
Any task in the [[Phabricator|forge]] tagged with &amp;lt;code&amp;gt;Easy hack&amp;lt;/code&amp;gt; will appear on the list.&lt;br /&gt;
&lt;br /&gt;
* if '''you are a newcomer''' to Software Heritage wondering how you can help, have a look at [https://forge.softwareheritage.org/tag/easy_hack/ the list] and see if anything there inspires you&lt;br /&gt;
* if '''you are a contributor''' to Software Heritage already, keep an eye on tasks that are easy and/or suitable for newcomers and tag them with &amp;lt;code&amp;gt;Easy hack&amp;lt;/code&amp;gt; as needed&lt;br /&gt;
&lt;br /&gt;
== Links ==&lt;br /&gt;
&lt;br /&gt;
* [https://gitlab.softwareheritage.org/dashboard/issues?sort=created_date&amp;amp;state=opened&amp;amp;label_name%5B%5D=Easy+hack list of easy hacks]&lt;br /&gt;
* [https://gitlab.softwareheritage.org/dashboard/issues?sort=created_date&amp;amp;state=opened&amp;amp;assignee_id=None bug tracker], with all unassigned bugs/tasks&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Software development]]&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Easy_hacks&amp;diff=1824</id>
		<title>Easy hacks</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Easy_hacks&amp;diff=1824"/>
		<updated>2024-02-20T20:38:57Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: /* Links */ Link to Gitlab instead of Phabricator&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;An '''easy hack''' (or an &amp;quot;easy bug&amp;quot;, or a &amp;quot;newcomer bug&amp;quot;, [http://wiki.openhatch.org/Easy_bugs_for_newcomers depending on the project]) is a pending task in the development of a collaborative Free/Open Source project that is considered well suited to new contributors.&lt;br /&gt;
It may be a trivial bug to fix or task to accomplish, or it may be something less trivial that is well suited to learning how a specific part or process of the project works.&lt;br /&gt;
&lt;br /&gt;
[[Software Heritage]] curates a [https://forge.softwareheritage.org/tag/easy_hack/ list of easy hacks] that are suitable for newcomers to the project.&lt;br /&gt;
Any task in the [[Phabricator|forge]] tagged with &amp;lt;code&amp;gt;Easy hack&amp;lt;/code&amp;gt; will appear on the list.&lt;br /&gt;
&lt;br /&gt;
* if '''you are a newcomer''' to Software Heritage wondering how you can help, have a look at [https://forge.softwareheritage.org/tag/easy_hack/ the list] and see if anything there inspires you&lt;br /&gt;
* if '''you are a contributor''' to Software Heritage already, keep an eye on tasks that are easy and/or suitable for newcomers and tag them with &amp;lt;code&amp;gt;Easy hack&amp;lt;/code&amp;gt; as needed&lt;br /&gt;
&lt;br /&gt;
== Links ==&lt;br /&gt;
&lt;br /&gt;
* [https://gitlab.softwareheritage.org/dashboard/issues?sort=created_date&amp;amp;state=opened&amp;amp;label_name[]=Easy+hack list of easy hacks]&lt;br /&gt;
* [https://gitlab.softwareheritage.org/dashboard/issues?sort=created_date&amp;amp;state=opened&amp;amp;assignee_id=None bug tracker], with all unassigned bugs/tasks&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Software development]]&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Matrix&amp;diff=1802</id>
		<title>Matrix</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Matrix&amp;diff=1802"/>
		<updated>2023-11-30T10:05:42Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Matrix rooms ==&lt;br /&gt;
&lt;br /&gt;
The following rooms have been registered on the [https://matrix.org/ matrix.org] network for [[Software Heritage]] usage.&lt;br /&gt;
&lt;br /&gt;
* [https://app.element.io/#/room/#swh-devel:matrix.org '''#swh-devel''']: public development discussions&lt;br /&gt;
* [https://app.element.io/#/room/#swh-sysadm:matrix.org '''#swh-sysadm''']: operations team discussions/bots&lt;br /&gt;
* [https://app.element.io/#/room/#swh-offtopic:matrix.org '''#swh-offtopic''']: Off-topic discussions&lt;br /&gt;
* [https://app.element.io/#/room/#swh-team:matrix.org '''#swh-team''']: private discussions of the core team&lt;br /&gt;
* [https://app.element.io/#/room/#swh:matrix.org '''#swh''']: general discussions around Software Heritage&lt;br /&gt;
&lt;br /&gt;
You will be asked to create a [https://element.io/ Element] account if you don't have one yet.&lt;br /&gt;
&lt;br /&gt;
Software Heritage has made the decision to close their former [https://libera.chat libera.chat] IRC channels on 2023-12-08&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Matrix&amp;diff=1739</id>
		<title>Matrix</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Matrix&amp;diff=1739"/>
		<updated>2023-06-10T16:26:40Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: /* Matrix bridge */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== IRC channels ==&lt;br /&gt;
&lt;br /&gt;
The following channels have been registered on the [https://libera.chat/ libera.chat] IRC network for [[Software Heritage]] usage.&lt;br /&gt;
&lt;br /&gt;
* [https://app.element.io/#/room/#swh-devel:matrix.org '''#swh-devel''']: public development discussions&lt;br /&gt;
* [https://app.element.io/#/room/#swh-sysadm:matrix.org '''#swh-sysadm''']: operations team discussions/bots&lt;br /&gt;
* [https://app.element.io/#/room/#swh-offtopic:matrix.org '''#swh-offtopic''']: Off-topic discussions&lt;br /&gt;
* [https://app.element.io/#/room/#swh-team:matrix.org '''#swh-team''']: private discussions of the core team&lt;br /&gt;
* [https://app.element.io/#/room/#swh:matrix.org '''#swh''']: general discussions around Software Heritage&lt;br /&gt;
&lt;br /&gt;
If you use IRC, consider joining the channels.&lt;br /&gt;
&lt;br /&gt;
If you don't use IRC ''directly'', you can still join our chat channels from your web browser via a [https://matrix.org/ Matrix] bridge by clicking on the channel names in the list above. You will be asked to create a [https://element.io/ Element] account if you don't have one yet.&lt;br /&gt;
&lt;br /&gt;
== IRC authentication ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt;Libera.chat defaults to blocking private messages from unauthenticated users! All users should register their nicknames to be able to message one another privately, by following the instructions below.&amp;lt;/b&amp;gt; If you're really unable to register, you should ask your correspondent to [https://libera.chat/guides/usermodes consider setting usermode &amp;lt;tt&amp;gt;-R&amp;lt;/tt&amp;gt;, and &amp;lt;tt&amp;gt;+g&amp;lt;/tt&amp;gt;].&lt;br /&gt;
&lt;br /&gt;
To register an account with NickServ, please follow [https://libera.chat/guides/registration the registration instructions provided by libera.chat staff].&lt;br /&gt;
&lt;br /&gt;
You will then receive an e-mail containing a link to activate your account. After doing so, you need to configure your client to auto-authenticate. The recommended way of doing that is using [https://libera.chat/guides/sasl SASL authentication].&lt;br /&gt;
&lt;br /&gt;
For Matrix, the relevant documentation is here: https://github.com/matrix-org/matrix-appservice-irc/wiki/End-user-FAQ#how-do-i-registeridentify-to-nickserv&lt;br /&gt;
&lt;br /&gt;
libera.chat also supports authentication via [https://libera.chat/guides/certfp TLS client certificates (using SASL EXTERNAL)].&lt;br /&gt;
&lt;br /&gt;
=== Matrix bridge ===&lt;br /&gt;
&lt;br /&gt;
For registering an account through the Matrix bridge ([https://github.com/matrix-org/matrix-appservice-irc/wiki/End-user-FAQ#how-do-i-registeridentify-to-nickserv relevant docs here]), please follow these instructions:&lt;br /&gt;
&lt;br /&gt;
1. Choose a short nickname (the default nickname picked by the Matrix bridge has a [m] and can be quite long, as it defaults to your Matrix display name minus non-ASCII, non-alphanumerical characters):&lt;br /&gt;
&lt;br /&gt;
 /msg @appservice:libera.chat !nick &amp;lt;USERNAME&amp;gt;&lt;br /&gt;
&lt;br /&gt;
2. Send this command to NickServ to register your account:&lt;br /&gt;
&lt;br /&gt;
 /msg @NickServ:libera.chat register &amp;lt;PASSWORD&amp;gt; &amp;lt;EMAIL&amp;gt;&lt;br /&gt;
&lt;br /&gt;
3. Once you receive the confirmation email with a token, activate your account by using:&lt;br /&gt;
&lt;br /&gt;
 /msg @NickServ:libera.chat VERIFY REGISTER &amp;lt;USERNAME&amp;gt; &amp;lt;TOKEN RECEIVED BY EMAIL&amp;gt;&lt;br /&gt;
&lt;br /&gt;
4. Give the Matrix bridge appservice your password so that you get identified automatically when Matrix reconnects you to IRC:&lt;br /&gt;
&lt;br /&gt;
 /msg @appservice:libera.chat !username &amp;lt;USERNAME&amp;gt;&lt;br /&gt;
 /msg @appservice:libera.chat !storepass &amp;lt;PASSWORD&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== IRC access list ==&lt;br /&gt;
&lt;br /&gt;
To auto-voice people with a registered nick (only people with +fA access modes can do this), add them to the team channel access list:&lt;br /&gt;
&lt;br /&gt;
 /msg chanserv flags #swh-team add &amp;lt;nickname&amp;gt; Staff&lt;br /&gt;
&lt;br /&gt;
Other channels pick their ACLs from that of the #swh-team channel.&lt;br /&gt;
&lt;br /&gt;
If you already have the right (+o ChanServ flag), you can make yourself an operator, with:&lt;br /&gt;
&lt;br /&gt;
 /msg chanserv OP #swh-devel&lt;br /&gt;
&lt;br /&gt;
[[Category:Infrastructure]]&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=A_practical_approach_to_efficiently_store_100_billions_small_objects_in_Ceph&amp;diff=1737</id>
		<title>A practical approach to efficiently store 100 billions small objects in Ceph</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=A_practical_approach_to_efficiently_store_100_billions_small_objects_in_Ceph&amp;diff=1737"/>
		<updated>2023-04-30T13:35:32Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: Finish removing the EOS column (only partially removed by &amp;quot;EOS is not based on Ceph&amp;quot; causing cells to be mismatched)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The [https://en.wikipedia.org/wiki/Software_Heritage Software Heritage] project's mission is to collect, preserve and share all software that is available in source code form, with the goal of building a common, shared infrastructure at the service of industry, research, culture and society as a whole. As of February 2021, it contains 10 billion unique source code files (or “objects”, in the following) totaling ~750TB of (uncompressed) data, and it grows by 50TB every month. 75% of these objects are smaller than 16KB and 50% are smaller than 4KB. However, these small objects account for only ~5% of the 750TB: the 25% of objects larger than 16KB occupy ~700TB.&lt;br /&gt;
&lt;br /&gt;
The desired performance targets for '''10PB''' and '''100 billion objects''' are as follows:&lt;br /&gt;
&lt;br /&gt;
* The clients aggregated together can write at least 3,000 objects/s and at least 100MB/s.&lt;br /&gt;
* The clients aggregated together can read at least 3,000 objects/s and at least 100MB/s.&lt;br /&gt;
* There is no space amplification for small objects.&lt;br /&gt;
* Getting the first byte of any object never takes longer than 100ms.&lt;br /&gt;
* Objects can be enumerated in bulk, at least one million at a time.&lt;br /&gt;
* Mirroring the content of the Software Heritage archive can be done in bulk, at least one million objects at a time.&lt;br /&gt;
&lt;br /&gt;
Using an off-the-shelf object storage such as the [https://docs.ceph.com/en/latest/radosgw/ Ceph Object Gateway] or [https://min.io/ MinIO] does not meet the requirements:&lt;br /&gt;
&lt;br /&gt;
* There is a significant space amplification for small objects: at least 25%, depending on the object storage (see “How does packing Objects save space?” below for details)&lt;br /&gt;
* Mirroring the content of the archive can only be done one object at a time and not in bulk, which takes at least 10 times longer (see “How does packing Objects help with enumeration?” for details)&lt;br /&gt;
&lt;br /&gt;
A new solution must be implemented by re-using existing components and made available for system administrators to conveniently deploy and maintain in production. There are three ways to do that:&lt;br /&gt;
&lt;br /&gt;
* Contribute packaging and stable releases to a codebase such as [https://github.com/linkedin/ambry Ambry].&lt;br /&gt;
* Modify an object storage such as MinIO to support object packing.&lt;br /&gt;
* Get inspiration from an object storage design.&lt;br /&gt;
&lt;br /&gt;
For reasons explained below (see “Storage solutions and TCO”), it was decided to design a new object storage and implement it from scratch.&lt;br /&gt;
&lt;br /&gt;
= Proposed object storage design =&lt;br /&gt;
&lt;br /&gt;
In a nutshell, objects are written to databases running on a fixed number of machines (the Write Storage); that number is chosen to control the write throughput. When a database reaches a threshold (e.g. 100GB), all its objects are put together in a container (a Shard) and moved to a read-only storage that keeps expanding over time. After a successful write, a unique identifier (the Object ID) is returned to the client. It can be used to read the object back from the read-only storage. Reads scale out because the unique identifiers of the objects embed the name of the container (the Shard UUID). Writes also scale out because the Database is chosen randomly. This is Layer 0.&lt;br /&gt;
&lt;br /&gt;
Clients that cannot keep track of the name of the container can use an API backed by an index that maps all known object signatures (the Object HASH below) to the name of the container where they can be found. Although this index prevents writes from scaling out, the read-only storage can still scale out by adding copies of the index as needed. This is Layer 1.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
                      Layer 0 scales out&lt;br /&gt;
&lt;br /&gt;
      +--- write op ----+               +--- read  op ----+&lt;br /&gt;
      v                 ^               v                 ^&lt;br /&gt;
   Object &amp;amp;             |               |                 |&lt;br /&gt;
   Object HASH     Object ID         Object ID         Object&lt;br /&gt;
      |            Object HASH          |                 |&lt;br /&gt;
      v            Shard UUID           v                 ^&lt;br /&gt;
      |                 |               |                 |&lt;br /&gt;
      v                 ^               v                 ^&lt;br /&gt;
    +---- Write Storage --------+  +---- Read Storage --------+&lt;br /&gt;
    |                           |  |                          |&lt;br /&gt;
    | +----------+              |  | +-------+      +-------+ |&lt;br /&gt;
    | | Database |-&amp;gt;--Packing-&amp;gt;----&amp;gt; | Shard |      | Shard | |&lt;br /&gt;
    | +----------+              |  | +-------+      +-------+ |&lt;br /&gt;
    | +----------++----------+  |  | +-------+      +-------+ |&lt;br /&gt;
    | | Database || Database |  |  | | Shard |      | Shard | |&lt;br /&gt;
    | +----------++----------+  |  | +-------+      +-------+ |&lt;br /&gt;
    |                           |  | +-------+      +-------+ |&lt;br /&gt;
    +---------------------------+  | | Shard |      | Shard | |&lt;br /&gt;
                                   | +-------+      +-------+ |&lt;br /&gt;
                                   |            ...           |&lt;br /&gt;
                                   +--------------------------+&lt;br /&gt;
&lt;br /&gt;
                      Layer 1 reads scale out&lt;br /&gt;
&lt;br /&gt;
    +---- Write Storage --------+  +---- Read Storage ---------+&lt;br /&gt;
    |                           |  |                           |&lt;br /&gt;
    |+-------------------------+|  |+-------------------------+|&lt;br /&gt;
    ||Object HASH to Shard UUID||  ||Object HASH to Shard UUID||&lt;br /&gt;
    ||        index            |&amp;gt;&amp;gt;&amp;gt;&amp;gt;|        index            ||&lt;br /&gt;
    |+-------------------------+|  |+-------------------------+|&lt;br /&gt;
    +---------------------------+  |+-------------------------+|&lt;br /&gt;
       |                 |         ||Object HASH to Shard UUID||&lt;br /&gt;
       ^                 v         ||        index            ||&lt;br /&gt;
       |                 |         |+-------------------------+|&lt;br /&gt;
       ^                 v         |          ...              |&lt;br /&gt;
     Object              |         +---------------------------+&lt;br /&gt;
   Object HASH           v                |                 |&lt;br /&gt;
       |                 |                ^                 v&lt;br /&gt;
       ^                 v                |                 |&lt;br /&gt;
       |                 |            Object HASH        Object&lt;br /&gt;
       ^                 v                |                 |&lt;br /&gt;
       |                 |                ^                 v&lt;br /&gt;
       +--- write op ----+                +--- read  op ----+&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Ceph-objstorage-sw-architecture.svg]]&lt;br /&gt;
&lt;br /&gt;
== Glossary ==&lt;br /&gt;
&lt;br /&gt;
* Object: an opaque sequence of bytes.&lt;br /&gt;
* Object HASH: the hash of an Object, e.g., the checksum part of a [https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html#core-identifiers SWHID].&lt;br /&gt;
* Shard: a group of Objects, used to partition the full set of objects into manageable subsets.&lt;br /&gt;
* Shard UUID: the unique identifier of a Shard, as a [https://en.wikipedia.org/wiki/Universally_unique_identifier UUID].&lt;br /&gt;
* Object ID: a pair made of the Object HASH and the Shard UUID containing the object.&lt;br /&gt;
* Global Index: a table mapping the Object HASH to the Shard UUID that contains the Object.&lt;br /&gt;
* Read Storage: the unlimited size storage from which clients can only read Objects. It only contains Objects up to a given point in time.&lt;br /&gt;
* Write Storage: the fixed size storage from which clients can read or write. If an Object is not found in the Write storage, it must be retrieved from the Read Storage.&lt;br /&gt;
* Object Storage: the content of the Write Storage and the Read Storage combined.&lt;br /&gt;
* Database: [https://en.wikipedia.org/wiki/PostgreSQL PostgreSQL], [https://en.wikipedia.org/wiki/Apache_Cassandra Cassandra], etc.&lt;br /&gt;
* [https://en.wikipedia.org/wiki/Ceph_(software) Ceph]: a self-healing distributed storage.&lt;br /&gt;
* [https://docs.ceph.com/en/latest/rbd/ RBD] image: a Ceph block storage that can either be used via the librbd library or as a block device from /dev/rbd.&lt;br /&gt;
* [https://en.wikipedia.org/wiki/Total_cost_of_ownership TCO]: Total Cost of Ownership&lt;br /&gt;
&lt;br /&gt;
The key concepts are:&lt;br /&gt;
&lt;br /&gt;
* Packing millions of Objects together in Shards to:&lt;br /&gt;
** save space and,&lt;br /&gt;
** efficiently perform bulk actions such as mirroring or enumerations.&lt;br /&gt;
* Two distinct storages:&lt;br /&gt;
** Read Storage that takes advantage of the fact that Objects are immutable and never deleted and,&lt;br /&gt;
** Write Storage from which Shards are created and moved to the Read Storage.&lt;br /&gt;
* Identifying an object by its Object HASH and the Shard UUID that contains it so that its location can be determined from the Object ID.&lt;br /&gt;
&lt;br /&gt;
While the architecture based on these concepts scales out for writing and reading, it cannot be used to address Objects by their Object HASH alone, which is inconvenient for a number of use cases. An index mapping the Object HASH to the Shard UUID must be added to provide this feature, but it does not scale out writes.&lt;br /&gt;
&lt;br /&gt;
The content of the Object Storage (i.e., the Write Storage and the Read Storage combined) is '''strongly/strictly consistent'''. As soon as an Object is written (i.e., the write operation returns to the client), a reader can get the Object content from the Object Storage (with the caveat that it may require looking up the object from both the Write Storage and Read Storage).&lt;br /&gt;
&lt;br /&gt;
The Read Storage is '''eventually consistent'''. It does not contain the latest Objects inserted in the Write Storage but it will, eventually. It contains all objects inserted in the Object Storage, up to a given point in time.&lt;br /&gt;
&lt;br /&gt;
== Layer 0 (Object lookup requires a complete Object ID) ==&lt;br /&gt;
&lt;br /&gt;
=== Architecture ===&lt;br /&gt;
&lt;br /&gt;
* Write Storage:&lt;br /&gt;
** A fixed number of Databases&lt;br /&gt;
* Read Storage:&lt;br /&gt;
** Shards implemented as Ceph RBD images named after their Shard UUID&lt;br /&gt;
** The content of the Shard uses a format that allows retrieving an Object in O(1) given the Object HASH&lt;br /&gt;
&lt;br /&gt;
=== Writing ===&lt;br /&gt;
&lt;br /&gt;
The Object is stored in one of the Databases from the Write Storage. The Database is chosen at random. Each Database is associated with a unique Shard UUID, chosen at random. All Objects written to a Database will be stored in the same Shard.&lt;br /&gt;
&lt;br /&gt;
A successful Object write returns the Object ID. Writing the same object twice may return different Object IDs. The Object HASH will be the same because it is based on the content of the Object. But the Shard in which the Object is stored may be different since it is chosen at random.&lt;br /&gt;
&lt;br /&gt;
=== Packing ===&lt;br /&gt;
&lt;br /&gt;
When a Database grows bigger than a threshold (for instance 100GB), it stops accepting writes. A Shard is created in the Read Storage and Objects in the Database are sorted and copied to it. When the Shard is complete, the Database is deleted. Another Database is created, a new Shard UUID is allocated and it starts accepting writes.&lt;br /&gt;
&lt;br /&gt;
=== Reading ===&lt;br /&gt;
&lt;br /&gt;
The Shard UUID is extracted from the Object ID. If a Shard exists in the Read Storage, the Object HASH is used to look up the content of the Object. Otherwise the Database that owns the Shard UUID is looked up in the Write Storage and the Object HASH is used to look up the content of the Object. If the reader is not interested in the most up-to-date content, it can limit its search to the Read Storage.&lt;br /&gt;
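&lt;br /&gt;
The Layer 0 write, pack and read paths can be sketched with a toy in-memory model. This is only an illustration under stated assumptions: the class and method names are hypothetical, plain dictionaries stand in for the Databases and the RBD Shards, SHA-256 is assumed as the Object HASH, and packing is triggered by hand instead of by a size threshold.&lt;br /&gt;
&lt;br /&gt;
```python
import hashlib
import random
import uuid


class Layer0Store:
    '''Toy in-memory model: dicts stand in for Databases and RBD Shards.'''

    def __init__(self, database_count=2):
        # Write Storage: each Database is tied to one random Shard UUID.
        self.databases = {uuid.uuid4(): {} for _ in range(database_count)}
        # Read Storage: immutable Shards keyed by Shard UUID.
        self.shards = {}

    def write(self, content):
        '''Store content in a random Database and return the Object ID.'''
        object_hash = hashlib.sha256(content).hexdigest()  # assumed hash
        shard_uuid = random.choice(list(self.databases))
        self.databases[shard_uuid][object_hash] = content
        return (object_hash, shard_uuid)

    def pack(self, shard_uuid):
        '''Move a full Database to the Read Storage and open a new one.'''
        database = self.databases.pop(shard_uuid)
        # Objects are sorted before being copied to the Shard.
        self.shards[shard_uuid] = dict(sorted(database.items()))
        self.databases[uuid.uuid4()] = {}

    def read(self, object_id, read_storage_only=False):
        '''Look in the Read Storage first, then in the owning Database.'''
        object_hash, shard_uuid = object_id
        if shard_uuid in self.shards:
            return self.shards[shard_uuid][object_hash]
        if read_storage_only:
            return None
        return self.databases[shard_uuid][object_hash]


store = Layer0Store()
object_id = store.write(b'hello')
before_packing = store.read(object_id)   # served by the Write Storage
store.pack(object_id[1])
after_packing = store.read(object_id)    # served by the Read Storage
```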
&lt;br /&gt;
== Layer 1 (Objects can be looked up using the Object HASH alone) ==&lt;br /&gt;
&lt;br /&gt;
A Global Index mapping the Object HASH of all known Objects to the Shard UUID is used to:&lt;br /&gt;
&lt;br /&gt;
* allow clients to fetch Objects using their Object HASH only instead of their Object ID&lt;br /&gt;
* deduplicate identical Objects based on their Object HASH&lt;br /&gt;
&lt;br /&gt;
=== Architecture ===&lt;br /&gt;
&lt;br /&gt;
* Write Storage:&lt;br /&gt;
** Read/write Global Index of all known Objects in the Write Storage and the Read Storage&lt;br /&gt;
* Read Storage:&lt;br /&gt;
** Read/write Global Index of all known Objects in the Read Storage&lt;br /&gt;
** Multiple readonly replicas of the Global Index of all known Objects in the Read Storage&lt;br /&gt;
&lt;br /&gt;
=== Writing ===&lt;br /&gt;
&lt;br /&gt;
If the Object HASH exists in the Read Storage Global Index, do nothing. Otherwise perform the write and add the Object ID to the Write Storage Global Index. There may be duplicate Objects in the Write Storage. It is expected that they race to be inserted in the Write Storage Global Index.&lt;br /&gt;
&lt;br /&gt;
=== Packing ===&lt;br /&gt;
&lt;br /&gt;
During packing, each Object HASH is looked up in the Read Storage Global Index. If it exists, the object is discarded. Otherwise its Object ID is added to the Read Storage Global Index. When packing is complete:&lt;br /&gt;
&lt;br /&gt;
* Readonly replicas of the Read Storage Global Index are updated with the newly added Object IDs.&lt;br /&gt;
* Object HASHes that were found to be duplicates are updated in the Write Storage Global Index. Each Object HASH is mapped to the Shard UUID retrieved from the Read Storage Global Index.&lt;br /&gt;
&lt;br /&gt;
=== Reading ===&lt;br /&gt;
&lt;br /&gt;
If the Object HASH is found in the Read Storage Global Index, use the Shard UUID to read the Object content from the Shard found in the Read Storage. Otherwise look up the Object HASH in the Write Storage Global Index and read the content of the Object from the Database that owns the Shard UUID.&lt;br /&gt;
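&lt;br /&gt;
The interplay of the two Global Indexes can be illustrated with a small sketch. It is a simplification under stated assumptions: all names are hypothetical, dictionaries stand in for the indexes, Databases and Shards, SHA-256 is assumed as the Object HASH, readonly replicas are left out, and the caller picks the Database so that a duplicate can be produced deterministically.&lt;br /&gt;
&lt;br /&gt;
```python
import hashlib
import uuid


class Layer1Store:
    '''Toy model of Layer 1: dicts stand in for indexes and storage.'''

    def __init__(self, database_count=2):
        self.databases = {uuid.uuid4(): {} for _ in range(database_count)}
        self.shards = {}        # Read Storage: Shard UUID to Objects
        self.write_index = {}   # Write Storage Global Index
        self.read_index = {}    # Read Storage Global Index

    def write(self, content, shard_uuid):
        '''Write to the chosen Database unless the Object is already packed.'''
        object_hash = hashlib.sha256(content).hexdigest()  # assumed hash
        if object_hash not in self.read_index:
            self.databases[shard_uuid][object_hash] = content
            self.write_index[object_hash] = shard_uuid
        return object_hash

    def pack(self, shard_uuid):
        '''Copy new Objects to a Shard and discard known duplicates.'''
        database = self.databases.pop(shard_uuid)
        shard = {}
        for object_hash, content in sorted(database.items()):
            if object_hash in self.read_index:
                # Duplicate: repoint the Write Storage Global Index.
                self.write_index[object_hash] = self.read_index[object_hash]
            else:
                shard[object_hash] = content
                self.read_index[object_hash] = shard_uuid
        self.shards[shard_uuid] = shard
        self.databases[uuid.uuid4()] = {}

    def read(self, object_hash):
        '''Resolve the Shard UUID from the Object HASH alone.'''
        if object_hash in self.read_index:
            return self.shards[self.read_index[object_hash]][object_hash]
        shard_uuid = self.write_index[object_hash]
        return self.databases[shard_uuid][object_hash]


store = Layer1Store()
first, second = list(store.databases)
object_hash = store.write(b'dup', first)
store.write(b'dup', second)    # duplicate inside the Write Storage
store.pack(first)
store.pack(second)             # the duplicate is discarded here
content = store.read(object_hash)
```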
&lt;br /&gt;
= How does packing Objects save space? =&lt;br /&gt;
&lt;br /&gt;
The short answer is: it does not when Objects are big enough, but it does when there are a lot of small Objects.&lt;br /&gt;
&lt;br /&gt;
If there are billions of objects (i.e., less than one billion is not a lot) and 50% of them have a size smaller than 4KB and 75% of them have a size smaller than 16KB (i.e., bigger than 16KB is not small), then packing will save space.&lt;br /&gt;
&lt;br /&gt;
In the simplest method of packing (i.e., appending each Object after another in a file) and since the Object HASH has a fixed size, the only overhead for each Object is the field storing its size (8 bytes). Assuming the Shard containing the Objects is handled as a single 100GB Ceph RBD Image, it adds R bytes. If the underlying Ceph pool is erasure coded with k=4,m=2, an additional 50% must be added.&lt;br /&gt;
&lt;br /&gt;
Retrieving an Object from a Shard would be O(n) in this case because there is no index. It is more efficient to [https://en.wikipedia.org/wiki/Perfect_hash_function add a minimal hash table] to the Shard so that finding an object is O(1) instead. That optimization requires an additional 8 bytes per Object to store their offset, i.e. a total of 16 bytes per object.&lt;br /&gt;
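&lt;br /&gt;
A minimal sketch of this layout follows, assuming each Object is stored as an 8-byte size followed by its content. The function names are hypothetical, SHA-256 is an assumed Object HASH, and a Python dictionary stands in for the minimal perfect hash table; it only illustrates how a stored offset replaces a linear scan.&lt;br /&gt;
&lt;br /&gt;
```python
import hashlib
import struct


def pack_shard(objects):
    '''Append each Object as an 8-byte size followed by its content.'''
    payload = bytearray()
    offsets = {}  # stand-in for the minimal perfect hash table
    for content in objects:
        object_hash = hashlib.sha256(content).digest()  # assumed hash
        offsets[object_hash] = len(payload)
        payload += struct.pack('>Q', len(content)) + content
    return bytes(payload), offsets


def read_object(payload, offsets, object_hash):
    '''Jump straight to the Object using its stored offset.'''
    offset = offsets[object_hash]
    (size,) = struct.unpack_from('>Q', payload, offset)
    return payload[offset + 8:offset + 8 + size]


payload, offsets = pack_shard([b'first', b'second object'])
wanted = hashlib.sha256(b'second object').digest()
found = read_object(payload, offsets, wanted)
```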
&lt;br /&gt;
If Objects are not packed together, each of them requires at least B bytes, which is the minimum space overhead imposed by the underlying storage system, plus an additional 50% for durability. The space used by Objects that are smaller than a given threshold will be amplified, depending on the underlying storage. For instance, all objects in Ceph have a minimum size of 4KB, therefore the size of a 1KB Object will be amplified to 4KB which translates to a [https://forge.softwareheritage.org/T3052#58864 35% space amplification]. Another example is MinIO with [https://github.com/minio/minio/issues/7395#issuecomment-475161144 over 200% space amplification] or [https://wiki.openstack.org/wiki/Swift/ideas/small_files#Challenges Swift] for which [https://www.ovh.com/blog/dealing-with-small-files-with-openstack-swift-part-2/ packing small files was recently proposed].&lt;br /&gt;
&lt;br /&gt;
To summarize, the overhead of storing M Objects totaling S bytes with M=100 billion and S=10PB is:&lt;br /&gt;
&lt;br /&gt;
* '''packed:''' ~15.5PB&lt;br /&gt;
** (S / 100GB) * R == (10PB / 100GB) * R bytes = 10,000 * R bytes&lt;br /&gt;
** (M * 24) = 100G Objects * 24 bytes = 2.4TB&lt;br /&gt;
** 50% for durability = 10PB * 0.5 = 5PB&lt;br /&gt;
* '''not packed:''' ~17.5PB based on the optimistic assumption that the storage system has a 25% space overhead for small files&lt;br /&gt;
** 25% for space amplification = 10PB * 0.25 = 2.5PB&lt;br /&gt;
** 50% for durability = 10PB * 0.5 = 5PB&lt;br /&gt;
&lt;br /&gt;
= How does packing Objects help with enumeration? =&lt;br /&gt;
&lt;br /&gt;
For mirroring or running an algorithm on all objects, they must be enumerated. If they are not packed together in any way, which is the case with MinIO or Swift, they must be looked up individually. When they are packed together (one million or more), the reader can download an entire Shard instead, saving the accumulated delay imposed by millions of individual lookups.&lt;br /&gt;
&lt;br /&gt;
If looking up an individual Object takes 10 milliseconds and Shards can be read at 100MB/s:&lt;br /&gt;
&lt;br /&gt;
* Getting 1 billion objects requires 10 million seconds, which is over 100 days.&lt;br /&gt;
* One billion objects is 1/10 of the current content of Software Heritage, i.e. ~75TB, which can be transferred by reading the Shards in less than 10 days.&lt;br /&gt;
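&lt;br /&gt;
The two estimates can be checked with a few lines of arithmetic. The variable names are illustrative and decimal units are assumed (1TB = 10^12 bytes, 1MB = 10^6 bytes).&lt;br /&gt;
&lt;br /&gt;
```python
SECONDS_PER_DAY = 24 * 60 * 60

object_count = 1_000_000_000          # one billion Objects
lookup_seconds = 0.010                # 10 milliseconds per individual lookup
shard_bytes = 75 * 10**12             # ~75TB holding the same Objects
read_bytes_per_second = 100 * 10**6   # Shards read at 100MB/s

# Individual lookups: the per-Object latency accumulates.
individual_days = object_count * lookup_seconds / SECONDS_PER_DAY
# Bulk transfer: only the sequential read rate matters.
shard_days = shard_bytes / read_bytes_per_second / SECONDS_PER_DAY

print(round(individual_days, 1), round(shard_days, 1))
```
&lt;br /&gt;
Under these assumptions the individual lookups take roughly 116 days while reading the Shards takes a little under 9 days.&lt;br /&gt;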
&lt;br /&gt;
= Storage solutions and TCO =&lt;br /&gt;
&lt;br /&gt;
When looking for off-the-shelf solutions, all options were considered, [https://forge.softwareheritage.org/T3107 including distributed file systems such as IPFS and more], and most of them were discarded because they had at least one blocker that could not be fixed (e.g. no feature to guarantee the durability of an object). In the end a few remained, either including the following features or with the possibility for a third party to contribute them back to the project:&lt;br /&gt;
&lt;br /&gt;
* '''Scale''' to 100 billion objects&lt;br /&gt;
* Provide object '''packing'''&lt;br /&gt;
* Provide detailed '''documentation''' and community support for system administrators operating the storage&lt;br /&gt;
* Be thoroughly '''tested''' before a stable release is published&lt;br /&gt;
* Be '''packaged''' for at least one well known distribution&lt;br /&gt;
* Have '''stable releases''' maintained for at least two years&lt;br /&gt;
* Have a sound approach to addressing '''security''' problems (CVE etc.)&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
! Name&lt;br /&gt;
! RGW&lt;br /&gt;
! SeaweedFS&lt;br /&gt;
! MinIO&lt;br /&gt;
! Swift&lt;br /&gt;
! Ambry&lt;br /&gt;
|-&lt;br /&gt;
| Scaling&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
|-&lt;br /&gt;
| Packing&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
|-&lt;br /&gt;
| Documentation&lt;br /&gt;
| Good&lt;br /&gt;
| Terse&lt;br /&gt;
| Good&lt;br /&gt;
| Good&lt;br /&gt;
| Terse&lt;br /&gt;
|-&lt;br /&gt;
| Tests&lt;br /&gt;
| Good&lt;br /&gt;
| Few&lt;br /&gt;
| Average&lt;br /&gt;
| Good&lt;br /&gt;
| Few&lt;br /&gt;
|-&lt;br /&gt;
| Packages&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
|-&lt;br /&gt;
| Stable releases&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
|-&lt;br /&gt;
| Security&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| No&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Does not have stable releases and testing ==&lt;br /&gt;
&lt;br /&gt;
The performance goals, size distribution and the number of objects in Software Heritage are similar to what is described in the 2010 article “[https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf Finding a needle in Haystack: Facebook’s photo storage]” that motivated the implementation of [https://github.com/chrislusf/seaweedfs SeaweedFS] in 2013 or [https://github.com/linkedin/ambry Ambry], the object storage published in 2017 by LinkedIn to store and serve trillions of media objects in web companies.&lt;br /&gt;
&lt;br /&gt;
Contributing to SeaweedFS or Ambry so they can be deployed and maintained would require:&lt;br /&gt;
&lt;br /&gt;
* Creating packages for the target Operating System (e.g. Debian GNU/Linux), maintaining a repository to distribute them, and uploading them to the official distribution repository so that they are available in the next stable release (about two years from now)&lt;br /&gt;
* Creating Ansible roles or Puppet modules for deployment on multiple machines&lt;br /&gt;
* Improving the documentation with a configuration and architecture guide to deploy at scale&lt;br /&gt;
* Discussing with upstream to create stable releases, define their lifecycle and organize release management&lt;br /&gt;
* Establishing a security team in charge of handling CVEs&lt;br /&gt;
* Setting up an infrastructure and creating the software for integration testing to be run before a stable release is published, to reduce the risk of regressions or data loss. This is especially important because a significant part of the software is dedicated to data storage and replication: bugs can lead to data loss or corruption.&lt;br /&gt;
&lt;br /&gt;
== Does not provide object packing ==&lt;br /&gt;
&lt;br /&gt;
[https://min.io/ MinIO] and [https://docs.openstack.org/swift/latest/ Swift] suffer from a space amplification problem and they do not provide object packing. Although [https://docs.ceph.com/en/latest/radosgw/ Ceph Object Gateway] (also known as RGW) stores objects in RocksDB instead of files, it also suffers from a space amplification problem and does not provide object packing.&lt;br /&gt;
&lt;br /&gt;
Contributing to RGW, MinIO or Swift to add object packing would require:&lt;br /&gt;
&lt;br /&gt;
* Creating a blueprint to modify the internals to add object packing&lt;br /&gt;
* Discussing with upstream to validate the blueprint&lt;br /&gt;
* Implementing the blueprint and the associated tests&lt;br /&gt;
&lt;br /&gt;
== Estimating the TCO ==&lt;br /&gt;
&lt;br /&gt;
Since no solution can be used as is, some work must be done in each case and the effort it requires should be compared. It is however difficult because the nature of the effort is different. The following factors were considered and aggregated in a TCO estimate.&lt;br /&gt;
&lt;br /&gt;
* '''Data loss risk:''' if a bug in the work done implies the risk of losing data, it makes the work significantly more complicated. It is the case if packing must be implemented in the internals of an existing object storage such as Swift. It is also the case if an object storage does not have integration testing to verify upgrading to a newer version won’t lead to a regression, which is the case with Ambry. It is likely that the Ambry upstream has extensive integration tests, but they are not published.&lt;br /&gt;
* '''Large codebase:''' a large codebase means modifying it (to implement packing) or distributing it (packaging and documentation) is more difficult&lt;br /&gt;
* '''Language:''' if the language and its environment are familiar to the developers and the system administrators, the work is less difficult&lt;br /&gt;
* '''Skills:''' if the work requires highly specialized skills (such as an intimate understanding of how a distributed storage system guarantees a strict consistency of the data, or running integration tests that require a cluster of machines) it is more difficult&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!&lt;br /&gt;
! RGW&lt;br /&gt;
! SeaweedFS&lt;br /&gt;
! MinIO&lt;br /&gt;
! Swift&lt;br /&gt;
! Ambry&lt;br /&gt;
|-&lt;br /&gt;
| Data loss risk&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
|-&lt;br /&gt;
| Large codebase&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
| Yes&lt;br /&gt;
|-&lt;br /&gt;
| Language&lt;br /&gt;
| C++&lt;br /&gt;
| Go&lt;br /&gt;
| Go&lt;br /&gt;
| Python&lt;br /&gt;
| Java&lt;br /&gt;
|-&lt;br /&gt;
| Skills&lt;br /&gt;
| Medium&lt;br /&gt;
| High&lt;br /&gt;
| High&lt;br /&gt;
| High&lt;br /&gt;
| High&lt;br /&gt;
|-&lt;br /&gt;
| TCO estimate&lt;br /&gt;
| High&lt;br /&gt;
| High&lt;br /&gt;
| High&lt;br /&gt;
| High&lt;br /&gt;
| High&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
In a nutshell, implementing a system from scratch has the lowest TCO estimate, primarily because it is independent of the underlying distributed storage.&lt;br /&gt;
&lt;br /&gt;
[[Category:Software development]]&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Suggestion_box:_source_code_to_add&amp;diff=1719</id>
		<title>Suggestion box: source code to add</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Suggestion_box:_source_code_to_add&amp;diff=1719"/>
		<updated>2022-07-19T21:15:32Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: /* Suggestions */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The [[Archive]] is growing organically. We started &amp;quot;small&amp;quot;, tracking 3 '''software origins''' (GitHub + Debian + GNU), and we will be adding new origins bit by bit, depending on the urgency of archiving them and available development energies to integrate them into Software Heritage.&lt;br /&gt;
&lt;br /&gt;
Using this page you can add suggestions of software origins that we aren't following yet, but we should. You can include information about who to contact for technical collaboration, the urgency of archival, and other useful information. To that end, just add a row to the table below. Here is some information about the meaning of the various columns.&lt;br /&gt;
&lt;br /&gt;
Entries are currently listed simply in order of addition to this page; we will add more structure when the list starts growing.&lt;br /&gt;
&lt;br /&gt;
=== Legend ===&lt;br /&gt;
&lt;br /&gt;
;Software origin&lt;br /&gt;
: any (publicly accessible) &amp;quot;place&amp;quot; on the Internet that hosts software in source code form. Please provide a title for it and hyperlink it to the relevant URL&lt;br /&gt;
;Type of origin&lt;br /&gt;
: information about the kind of hosting, e.g., whether it is a forge, a collection of repositories, a homepage publishing tarballs, or a one shot source code repository. For all kinds of repositories, please specify which VCS is in use (Git, SVN, CVS, etc.)&lt;br /&gt;
;Contact&lt;br /&gt;
: who to contact for technical collaboration on how to best archive source code hosted on the software origin. You can list yourself if you're the relevant person, or provide the most relevant contact point if you know it&lt;br /&gt;
;Conservation status&lt;br /&gt;
: information about how likely it is that the software origin will disappear; high likelihood will make it more urgent for us to archive software hosted there. We suggest using the [https://en.wikipedia.org/wiki/Conservation_status species conservation status], i.e., one of: Critically endangered (CR), Endangered (EN), Vulnerable (VU), Near threatened (NT), Least concern (LC).&lt;br /&gt;
;How to mirror&lt;br /&gt;
: (pointers to) technical information on how to do a full mirror of ''all'' the source code available at the software origin, ideally one shot and in batch&lt;br /&gt;
;How to keep up&lt;br /&gt;
: (pointers to) technical information on how to incrementally retrieve new source code accumulated since the last visit; usually this should be based on some kind of incremental change feed or event API&lt;br /&gt;
;Notes&lt;br /&gt;
: anything else you think we should know about this software origin&lt;br /&gt;
;Forge task&lt;br /&gt;
: pointer to the task on our [[forge]] tracking the work needed to ingest the software origin&lt;br /&gt;
&lt;br /&gt;
== Suggestions ==&lt;br /&gt;
&lt;br /&gt;
* https://notabug.org/ (customized gogs)&lt;br /&gt;
* https://gitgud.io/ (GitLab, run by Sapphire, a user-supported open source company)&lt;br /&gt;
* https://octo.sh/ (GitLab)&lt;br /&gt;
* https://chaos.expert/explore (GitLab by Chaos Computer Club)&lt;br /&gt;
* https://gitlab.coko.foundation/public (GitLab)&lt;br /&gt;
* https://git.teknik.io/explore/repos (Gitea)&lt;br /&gt;
* https://gitlab.gnome.org/explore/groups (GNOME software)&lt;br /&gt;
* https://launchpad.net/ (used by Ubuntu and others)&lt;br /&gt;
* https://archive.codeplex.com/ (was Microsoft's free, open source project hosting). It is now dead, but there seems to be a dump here: https://archive.org/details/sylirana_ms_codeplex_zips&lt;br /&gt;
* https://wiki.p2pfoundation.net/List_of_Community-Hosted_GitLab_Instances&lt;br /&gt;
* https://0xacab.org/explore (GitLab)&lt;br /&gt;
* https://git.fosscommunity.in/explore/projects (GitLab by Free Software Community of India)&lt;br /&gt;
* https://git.sdf.org/humanacollaborator/humanacollabora/src/branch/master/forge_comparison.md &amp;lt;= list of forges to eventually sync with the table below.  Note that there may be some [https://framablog.org/2019/09/26/lets-de-frama-tify-the-internet  urgency] to harvest framagit.org.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Software origin&lt;br /&gt;
!Type of origin&lt;br /&gt;
!Contact&lt;br /&gt;
!Conservation status&lt;br /&gt;
!How to mirror&lt;br /&gt;
!How to keep up&lt;br /&gt;
!Notes&lt;br /&gt;
![https://forge.softwareheritage.org/ Forge] task&lt;br /&gt;
|-&lt;br /&gt;
|'''(sample entry)''' ''GitHubBub forge''&lt;br /&gt;
|''Git hosting''&lt;br /&gt;
|''John Doe &amp;lt;john@example.com&amp;gt;''&lt;br /&gt;
|''LC''&lt;br /&gt;
|''retrieve full repo list at /api/list, then git clone on each entry''&lt;br /&gt;
|''poll RSS feed at /api/updates?since=YYYY-MM-DD''&lt;br /&gt;
|''nothing special to add''&lt;br /&gt;
|''[https://forge.softwareheritage.org/T123456 T123456]''&lt;br /&gt;
|-&lt;br /&gt;
|[https://bitbucket.org/ Bitbucket]&lt;br /&gt;
|Git and hg/Mercurial hosting&lt;br /&gt;
|&lt;br /&gt;
|style=&amp;quot;background-color: lightgreen&amp;quot;| LC&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|[https://forge.softwareheritage.org/T561 T561]&lt;br /&gt;
|-&lt;br /&gt;
|[https://sourceforge.net/ SourceForge]&lt;br /&gt;
|CVS, SVN, Mercurial, Git&lt;br /&gt;
|&lt;br /&gt;
|style=&amp;quot;background-color: orange&amp;quot;|VU&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|[https://wiki.debian.org/Derivatives/Census all Debian derivatives]&lt;br /&gt;
|Debian-based distros&lt;br /&gt;
|Paul Wise &amp;lt;pabs@debian.org&amp;gt;&lt;br /&gt;
|varying, depending on the distro&lt;br /&gt;
|see [[Suggestion_box:_source_code_to_add/Debian_derivatives|details]]&lt;br /&gt;
|see [[Suggestion_box:_source_code_to_add/Debian_derivatives|details]]&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.gentoo.org/ Gentoo]&lt;br /&gt;
|&lt;br /&gt;
|Johannes Kellner &amp;lt;gentoo@johannes-kellner.eu&amp;gt;&lt;br /&gt;
|style=&amp;quot;background-color: lightgreen&amp;quot;|LC&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|[http://pauillac.inria.fr/~huet/cea.html Gérard Huet's seminal work on 3D]&lt;br /&gt;
|Scanned source code&lt;br /&gt;
|Gérard Huet &amp;lt;gerard.huet@inria.fr&amp;gt;&lt;br /&gt;
|style=&amp;quot;background-color: red&amp;quot;|EN&lt;br /&gt;
|retrieve listing images from the web pages&lt;br /&gt;
|N/A&lt;br /&gt;
|links are half broken, yquem should be replaced with pauillac everywhere it appears&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|[http://www.softwarepreservation.org/projects Software Preservation Project]&lt;br /&gt;
|Website with a collection of archives&lt;br /&gt;
|Paul McJones &amp;lt;paul@mcjones.org&amp;gt;&lt;br /&gt;
|style=&amp;quot;background-color: lightgreen&amp;quot;|LC&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|[https://code.nasa.gov/ 253 NASA open source software projects]&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|style=&amp;quot;background-color: lightgreen&amp;quot;|LC&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|[http://smaky.ch/ Smaky], the Swiss micro-computer series&lt;br /&gt;
|[http://infini.smaky.ch/sources.html Partial code dump]&lt;br /&gt;
|[mailto:arnaud@epsitec.ch Pierre Arnaud] (current CEO of Epsitec) and/or [mailto:jean-daniel.nicoud@epfl.ch Jean-Daniel Nicoud] (founder of the computer series)&lt;br /&gt;
|style=&amp;quot;background-color: red&amp;quot;|EN&lt;br /&gt;
|Probably manually&lt;br /&gt;
|No new updates&lt;br /&gt;
|Some references to this history: [http://www.memoires-informatiques.org/ Fondation Mémoires Informatiques], [http://smaky.ch/ Smaky.ch] (in particular, [http://smaky.ch/theme.php?id=lami the short history])&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.wikidata.org/wiki/Wikidata:WikiProject_Informatics/FLOSS#Conservation_status_2 wikidata endangered software]&lt;br /&gt;
|depends on the &amp;quot;source code repository&amp;quot; property&lt;br /&gt;
|Loic Dachary &amp;lt;loic@dachary.org&amp;gt;&lt;br /&gt;
|style=&amp;quot;background-color: yellow&amp;quot;|The risk is higher than [https://www.wikidata.org/wiki/Property_talk:P141 LC]&lt;br /&gt;
|A script should obtain the &amp;quot;source code repository&amp;quot; property for the software and mirror it depending on the [https://www.wikidata.org/wiki/Wikidata:WikiProject_Informatics/FLOSS#source_code_repository protocol] qualifier. If the &amp;quot;source code repository&amp;quot; is &amp;quot;no value&amp;quot;, the [https://www.wikidata.org/wiki/Wikidata:WikiProject_Informatics/Software#streaming_media_URL streaming media URL] of the &amp;quot;preferred&amp;quot; [https://www.wikidata.org/wiki/Wikidata:WikiProject_Informatics/Software#software_version_.28P348.29 software version] should be downloaded instead.&lt;br /&gt;
|Once a copy is secured by software heritage, a URL to the software heritage repository should be added to the &amp;quot;source code repository&amp;quot; property and the &amp;quot;conservation status&amp;quot; property should be removed, meaning it is &amp;quot;least concerned&amp;quot; by default. The software will no longer show in the list of endangered software.&lt;br /&gt;
|This is work in progress, part of the [https://www.wikidata.org/wiki/Wikidata:WikiProject_Informatics/FLOSS wikidata FLOSS project] and the scripts do not exist yet.&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|historical KDE repos&lt;br /&gt;
|CVS/SVN/Git&lt;br /&gt;
|KDE sysadmin team &amp;lt;sysadmin@kde.org&amp;gt;&lt;br /&gt;
|style=&amp;quot;background-color: yellow&amp;quot;|NT&lt;br /&gt;
|See [[Suggestion box: source code to add/KDE|details]]&lt;br /&gt;
|See [[Suggestion box: source code to add/KDE|details]]&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|[https://java.net/projects Java.net] &amp;amp; [https://kenai.com/ Kenai.com]&lt;br /&gt;
|hg, git, svn&lt;br /&gt;
|communitymanager@java.net&lt;br /&gt;
|style=&amp;quot;background-color: black; color: white&amp;quot;|CR&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|[https://community.oracle.com/community/java/javanet-forge-sunset Shutting down on April 28, 2017]&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|[https://fedorahosted.org/ fedorahosted.org]&lt;br /&gt;
|git, svn, hg, bzr&lt;br /&gt;
|[https://lists.fedoraproject.org/admin/lists/infrastructure@lists.fedoraproject.org infrastructure@lists.fedoraproject.org]&lt;br /&gt;
|style=&amp;quot;background-color: black; color: white&amp;quot;|CR&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|[https://communityblog.fedoraproject.org/fedorahosted-sunset-2017-02-28/ Shutting down on Feb. 28, 2017]&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|[http://www.societe-informatique-de-france.fr/wp-content/uploads/2015/12/1024-no7-Baude.pdf Langage symbolique d'Enseignement (LSE)]&lt;br /&gt;
|archives&lt;br /&gt;
|Association Enseignement Public et Informatique (EPI) &amp;lt;bureau@epi.asso.fr&amp;gt;&lt;br /&gt;
|style=&amp;quot;background-color: black; color: white&amp;quot;|CR&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
| + educational software (INRP-CNDP)&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|[http://www.netlib.org The Netlib collection of numerical software]&lt;br /&gt;
|structured website with links to archives&lt;br /&gt;
|&lt;br /&gt;
|style=&amp;quot;background-color: lightgreen&amp;quot;|LC&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|many of these libraries are mirrored in sources already collected in Software Heritage; there is sure value in the curation information.&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|[https://codebender.cc/ codebender]&lt;br /&gt;
|IoT and educational resources&lt;br /&gt;
|&lt;br /&gt;
|style=&amp;quot;background-color: black; color: white&amp;quot;|CR&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|closure expected on Dec. 31st, 2016, according to this announcement by the founders: https://codebender.cc/next-chapter&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|[https://git.oschina.net/ OS China]&lt;br /&gt;
|Chinese GitHub equivalent&lt;br /&gt;
|&lt;br /&gt;
|style=&amp;quot;background-color: lightgreen&amp;quot;|LC&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|interesting test case for all the unicode tooling in Software Heritage&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|Usenet source code archives&lt;br /&gt;
|NNTP&lt;br /&gt;
|&lt;br /&gt;
|style=&amp;quot;background-color: yellow&amp;quot;|NT&lt;br /&gt;
|crawl relevant newsgroup archives (e.g., at Google Groups), retrieve archives (possibly chunked), ingest&lt;br /&gt;
|one shot might be enough?&lt;br /&gt;
|suggestion by John Gilmore&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.x.org/releases/ X11/XOrg archives]&lt;br /&gt;
|http&lt;br /&gt;
|&lt;br /&gt;
|style=&amp;quot;background-color: lightgreen&amp;quot;|LC&lt;br /&gt;
|&lt;br /&gt;
|one shot might be enough&lt;br /&gt;
|&lt;br /&gt;
|''[https://forge.softwareheritage.org/T1774 T1774]''&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.curseforge.com/ Curse mods]&lt;br /&gt;
|Code distributed as versioned tarballs&lt;br /&gt;
|&lt;br /&gt;
|style=&amp;quot;background-color: lightgreen&amp;quot;|LC&lt;br /&gt;
|The API should be sufficient, maybe some scraping will be required&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|Symbian source code&lt;br /&gt;
|HG&lt;br /&gt;
|carlo.daffara@nodeweaver.eu&lt;br /&gt;
|style=&amp;quot;background-color: orange&amp;quot;|VU&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|Cloned from Nokia's Symbian Mercurial repository, a few days before the closure of the repo and the change to a proprietary license. The Mercurial (hg) files are the only copy publicly available; I have made a snapshot of the code and placed it on SourceForge, but that snapshot is missing the entire project history and commit log.&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|[https://gist.github.com/ GitHub's gists]&lt;br /&gt;
|git&lt;br /&gt;
|&lt;br /&gt;
|style=&amp;quot;background-color: lightgreen&amp;quot;|LC&lt;br /&gt;
|No proper listing API, but there's [https://developer.github.com/v3/gists/#list-all-public-gists an endpoint] to get gists created after a given date&lt;br /&gt;
|use the same endpoint&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|[https://codeberg.org/ Codeberg]&lt;br /&gt;
|git&lt;br /&gt;
|contact@codeberg.org&lt;br /&gt;
|style=&amp;quot;background-color: lightgreen&amp;quot;|LC&lt;br /&gt;
|[https://github.com/go-gitea Gitea] API&lt;br /&gt;
|[https://github.com/go-gitea Gitea] API&lt;br /&gt;
|Codeberg e.V. is a Non-Profit Collaboration Community for Free and Open Source Projects&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|[https://puszcza.gnu.org.ua/ Puszcza]&lt;br /&gt;
|cvs/git/hg/svn, VCS snapshot tarballs, tarballs&lt;br /&gt;
|https://puszcza.gnu.org.ua/contact.php&lt;br /&gt;
|style=&amp;quot;background-color: lightgreen&amp;quot;|LC&lt;br /&gt;
|Savane instance, https://download.gnu.org.ua/ (also has ftp), http://git.gnu.org.ua/&lt;br /&gt;
|Savane instance&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|[https://directory.fsf.org/ Free Software Directory]&lt;br /&gt;
|solely contains links (to VCS and tarballs)&lt;br /&gt;
|https://lists.gnu.org/archive/html/directory-discuss/&lt;br /&gt;
|style=&amp;quot;background-color: lightgreen&amp;quot;|LC&lt;br /&gt;
|MediaWiki instance, use API to clone (or git-remote-mediawiki)&lt;br /&gt;
|MediaWiki instance, use API for updates (or git-remote-mediawiki)&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|[https://kb.netgear.com/2649/NETGEAR-Open-Source-Code-for-Programmers-GPL NetGear GPL tarballs]&lt;br /&gt;
|Tarballs&lt;br /&gt;
|&lt;br /&gt;
|style=&amp;quot;background-color: yellow&amp;quot;|NT&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.ui.com/download/edgemax/edgerouter/default/edgerouter-er-8erpro-8ep-r8-firmware-v11011 Ubiquiti firmware GPL tarballs]&lt;br /&gt;
|Tarballs&lt;br /&gt;
|&lt;br /&gt;
|style=&amp;quot;background-color: yellow&amp;quot;|NT&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
[[Category:Archive]]&lt;br /&gt;
[[Category:Suggestions]]&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Google_Summer_of_Code_2022&amp;diff=1713</id>
		<title>Google Summer of Code 2022</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Google_Summer_of_Code_2022&amp;diff=1713"/>
		<updated>2022-04-18T17:23:16Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: /* I want to participate as a student */ Update link&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:GSoCLogo.png|512px]]&lt;br /&gt;
&lt;br /&gt;
== General information ==&lt;br /&gt;
&lt;br /&gt;
This page is the central point of information for [[Software Heritage]] participation in the [https://summerofcode.withgoogle.com/ Google Summer of Code] program in 2022.&lt;br /&gt;
&lt;br /&gt;
Google Summer of Code is a program where Google pays students stipends to work over the (northern hemisphere) summer on free software projects such as Software Heritage. Each student works with mentors from the community to complete a software project.&lt;br /&gt;
&lt;br /&gt;
== I want to participate as a student ==&lt;br /&gt;
&lt;br /&gt;
Great! We are very glad about your interest in contributing to Software Heritage and we are looking forward to working together.&lt;br /&gt;
&lt;br /&gt;
=== Prerequisites ===&lt;br /&gt;
&lt;br /&gt;
The following prerequisites apply to all Software Heritage GSoC projects:&lt;br /&gt;
&lt;br /&gt;
* [https://www.python.org Python 3] is our language of choice; you should be fluent in that language to apply&lt;br /&gt;
* [https://git-scm.com Git] is our version control system of choice; you should be familiar with it to apply&lt;br /&gt;
* basic knowledge of using a CLI&lt;br /&gt;
* additional prerequisites depend on the project you will work on; check project descriptions for details&lt;br /&gt;
&lt;br /&gt;
=== Before you apply ===&lt;br /&gt;
&lt;br /&gt;
Here are the steps you should follow before applying, to make sure you have a good grasp of what we are doing at Software Heritage and how we do it:&lt;br /&gt;
&lt;br /&gt;
# Follow our [https://docs.softwareheritage.org/devel/developer-setup.html developer setup tutorial]: it will make sure you have the source code of our software stack locally available and that you can run unit tests&lt;br /&gt;
# Create an account on our [https://forge.softwareheritage.org development forge]&lt;br /&gt;
# Familiarize yourself with our [https://docs.softwareheritage.org/devel/contributing/phabricator.html? code review workflow]&lt;br /&gt;
# Make at least one simple change to any one of our [https://docs.softwareheritage.org/devel/ software components] and submit it as a [https://forge.softwareheritage.org/differential/ diff] for code review, following the above workflow (no need to ask for permissions). [[Easy hacks]] and [https://forge.softwareheritage.org/project/view/20/ Web UI] issues are good options for what to fix, but feel free to submit any patch you think might be useful.&lt;br /&gt;
&lt;br /&gt;
=== What to include in your application ===&lt;br /&gt;
&lt;br /&gt;
Make sure that your application includes the following information:&lt;br /&gt;
&lt;br /&gt;
* Describe the '''specific project''' you want to work on. What do you want to achieve? Why is it important? Why is it useful for Software Heritage? The project might be one of the project ideas that we have prepared below, or something else entirely that you want to contribute to Software Heritage. Your source code archival pet peeve, surprise us!&lt;br /&gt;
* Detail your '''work plan''': a brief description of how you plan to go about your project, including a list of ''deliverables'' and a ''timeline'' of when you expect them to be available.&lt;br /&gt;
* Include a reference to '''the diff''' you submitted before applying (see the &amp;quot;Before you apply&amp;quot; section above).&lt;br /&gt;
&lt;br /&gt;
== Ideas list ==&lt;br /&gt;
&lt;br /&gt;
Below you can find a list of project ideas that are good options for a reasonably-sized GSoC project (check individual idea pages for expected duration and difficulty of each task):&lt;br /&gt;
&amp;lt;DynamicPageList&amp;gt;&lt;br /&gt;
category = Available GSoC task&lt;br /&gt;
ordermethod = sortkey&lt;br /&gt;
order = ascending&lt;br /&gt;
&amp;lt;/DynamicPageList&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We also have a list of internship topics below, which you can use for ideas when applying to GSoC with us.&lt;br /&gt;
Expect each internship topic to require 350 hours and to be harder than GSoC-specific tasks.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;DynamicPageList&amp;gt;&lt;br /&gt;
category = Available internship&lt;br /&gt;
ordermethod = sortkey&lt;br /&gt;
order = ascending&lt;br /&gt;
&amp;lt;/DynamicPageList&amp;gt;&lt;br /&gt;
&lt;br /&gt;
All project ideas above are just suggestions; don't feel obliged to pick one of them if there is nothing that fits your taste and abilities.&lt;br /&gt;
Feel free to propose something else that you are excited about and that contributes to improving the Software Heritage archive: we will be happy to consider it!&lt;br /&gt;
&lt;br /&gt;
== Contact ==&lt;br /&gt;
&lt;br /&gt;
GSoC students are encouraged to get in touch with the Software Heritage community using the standard development communication channels, and in particular our [[IRC]] channel (#swh-devel on [https://libera.chat/ Libera Chat]) and mailing list (swh-devel). Prefer public channels over contacting mentors directly.&lt;br /&gt;
&lt;br /&gt;
See our [https://www.softwareheritage.org/community/developers/ development information page] for details.&lt;br /&gt;
&lt;br /&gt;
== Timeline ==&lt;br /&gt;
&lt;br /&gt;
See the official [https://developers.google.com/open-source/gsoc/timeline Google Summer of Code timeline].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Google Summer of Code]]&lt;br /&gt;
[[Category:Google Summer of Code 2022]]&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Add_sources_to_the_project_search_engine_(GSoC_task)&amp;diff=1711</id>
		<title>Add sources to the project search engine (GSoC task)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Add_sources_to_the_project_search_engine_(GSoC_task)&amp;diff=1711"/>
		<updated>2022-04-02T06:33:48Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: /* Task description */ Add example of data sources&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
The [https://archive.softwareheritage.org/ homepage of the Software Heritage archive]&lt;br /&gt;
features a small search engine that searches project URLs and project metadata.&lt;br /&gt;
Project metadata includes name, description, authors, etc.&lt;br /&gt;
&lt;br /&gt;
This is implemented by a Python service backed by an ElasticSearch database,&lt;br /&gt;
which contains one document for each project; each document contains metadata&lt;br /&gt;
mined from the project itself.&lt;br /&gt;
&lt;br /&gt;
== Task description ==&lt;br /&gt;
&lt;br /&gt;
We would like to add more data sources to the ElasticSearch database;&lt;br /&gt;
typically sources that are not authoritative, but provide metadata of usually&lt;br /&gt;
good quality.&lt;br /&gt;
These sources include data received from [https://docs.softwareheritage.org/devel/swh-deposit/index.html deposit clients] like HAL and data we will archive from [https://swmath.org/ swMATH], [https://github.com/ GitHub.com], ...&lt;br /&gt;
&lt;br /&gt;
This comes with the following challenges:&lt;br /&gt;
&lt;br /&gt;
# there are multiple sources, and their contents must work together&lt;br /&gt;
# sources have different reliability, which should be taken into account when ranking search results&lt;br /&gt;
&lt;br /&gt;
Therefore, this task will require making a plan to address these,&lt;br /&gt;
defining a data model, and finally implementing it in a backend.&lt;br /&gt;
It may involve some frontend work if necessary, to provide an interface for&lt;br /&gt;
these.&lt;br /&gt;
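&lt;br /&gt;
As a purely illustrative sketch of the two challenges above (this is not the actual Software Heritage data model; the class, the function names, the source names and the weights are all hypothetical assumptions), each project document could keep metadata separate per source and weight matches by source reliability:&lt;br /&gt;

```python
from dataclasses import dataclass, field

# Hypothetical reliability weights per metadata source (illustrative values only).
SOURCE_RELIABILITY = {"intrinsic": 1.0, "deposit": 0.8, "github": 0.6}


@dataclass
class ProjectDocument:
    """One search document per project, keeping metadata separate per source."""

    url: str
    metadata_by_source: dict = field(default_factory=dict)

    def add_source(self, source, metadata):
        # Store each source under its own key so provenance is never lost.
        self.metadata_by_source[source] = dict(metadata)

    def score(self, term):
        # Sum the reliability of every source whose metadata mentions the term.
        needle = term.lower()
        total = 0.0
        for source, metadata in self.metadata_by_source.items():
            text = " ".join(str(value) for value in metadata.values()).lower()
            if needle in text:
                # Unknown sources get a low default weight (an assumption).
                total += SOURCE_RELIABILITY.get(source, 0.1)
        return total


def rank(documents, term):
    """Return documents matching the term, most reliable matches first."""
    scored = [(doc.score(term), doc) for doc in documents]
    matching = [pair for pair in scored if pair[0] != 0.0]
    matching.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in matching]
```

In a real deployment the weighting would more likely live in the search backend's own scoring rather than in Python; the sketch only shows why per-source provenance has to survive in the data model.&lt;br /&gt;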
&lt;br /&gt;
== Expected duration ==&lt;br /&gt;
175 or 350 hours, at your option (longer duration means you can handle more data sources). Difficulty: medium&lt;br /&gt;
&lt;br /&gt;
== Desirable skills ==&lt;br /&gt;
&lt;br /&gt;
* Python 3 and Git are a must to work on any Software Heritage project&lt;br /&gt;
* ElasticSearch&lt;br /&gt;
* Experience with cross-referenced data mining would be appreciated&lt;br /&gt;
&lt;br /&gt;
== Potential mentors ==&lt;br /&gt;
&lt;br /&gt;
* Kumar Shivendu (KShivendu on [[IRC]])&lt;br /&gt;
* Valentin Lorentz (vlorentz on [[IRC]])&lt;br /&gt;
* Vincent Sellier (vsellier on [[IRC]])&lt;br /&gt;
&lt;br /&gt;
== Other relevant (but independent) tasks ==&lt;br /&gt;
&lt;br /&gt;
This task is only about adding data we already collected to the existing Elasticsearch database;&lt;br /&gt;
you may also be interested in [[Mine information from archived content (GSoC task)]]&lt;br /&gt;
and [[Mine information from external sources (GSoC task)]] to fill this&lt;br /&gt;
database, but those are completely independent tasks.&lt;br /&gt;
&lt;br /&gt;
This database only contains project URLs and metadata, not source code.&lt;br /&gt;
Source code search is more complex, but is available as an&lt;br /&gt;
[[Source code search engine prototype (internship)|internship topic]].&lt;br /&gt;
&lt;br /&gt;
[[Category:GSoC task]]&lt;br /&gt;
[[Category:Available GSoC task]]&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Source_code_search_engine_prototype_(internship)&amp;diff=1707</id>
		<title>Source code search engine prototype (internship)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Source_code_search_engine_prototype_(internship)&amp;diff=1707"/>
		<updated>2022-03-16T11:12:37Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: update numbers&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Internship&lt;br /&gt;
|description=The current [https://archive.softwareheritage.org/browse/search/ archive search engine] supports searching archived projects (or &amp;quot;origins&amp;quot;) via their URL or metadata.&lt;br /&gt;
We would like to extend search to also support searching within archived source code files, based on their textual content.&lt;br /&gt;
Indexing all source code files archived by Software Heritage (~500-600 TB) is a major undertaking in terms of indexing time and storage.&lt;br /&gt;
The goal of this internship is to design and implement a medium-scale prototype of such an index (covering, e.g., 0.1 to 1% of the archive) that will make it possible to evaluate the best indexing approach (e.g., which kind of index, tokenizer, etc.) as well as the time and resources that doing so will require (e.g., via extrapolation).&lt;br /&gt;
The first technology that will be tried is [https://www.elastic.co/ ElasticSearch], as it's already deployed for other Software Heritage needs, but depending on the candidate other search engines can also be tested in the context of the internship.&lt;br /&gt;
&lt;br /&gt;
|skills=&lt;br /&gt;
* Python development&lt;br /&gt;
* database administration experience (SQL and/or NoSQL and/or document-oriented)&lt;br /&gt;
&lt;br /&gt;
Will be considered a plus:&lt;br /&gt;
* experience with full-text or code search&lt;br /&gt;
&lt;br /&gt;
|mentors=&lt;br /&gt;
* Valentin Lorentz (vlorentz on [[IRC]])&lt;br /&gt;
* Stefano Zacchiroli &amp;lt;zack@upsilon.cc&amp;gt; (zack on [[IRC]])&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
[[Category:Available internship]]&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Google_Summer_of_Code_2022&amp;diff=1706</id>
		<title>Google Summer of Code 2022</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Google_Summer_of_Code_2022&amp;diff=1706"/>
		<updated>2022-03-11T14:35:18Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: /* Contact */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:GSoCLogo.png|512px]]&lt;br /&gt;
&lt;br /&gt;
== General information ==&lt;br /&gt;
&lt;br /&gt;
This page is the central point of information for [[Software Heritage]] participation in the [https://summerofcode.withgoogle.com/ Google Summer of Code] program in 2022.&lt;br /&gt;
&lt;br /&gt;
Google Summer of Code is a program where Google pays students stipends to work over the (northern hemisphere) summer on free software projects such as Software Heritage. Each student works with mentors from the community to complete a software project.&lt;br /&gt;
&lt;br /&gt;
== I want to participate as a student ==&lt;br /&gt;
&lt;br /&gt;
Great! We are very glad about your interest in contributing to Software Heritage and we are looking forward to working together.&lt;br /&gt;
&lt;br /&gt;
=== Prerequisites ===&lt;br /&gt;
&lt;br /&gt;
The following prerequisites apply to all Software Heritage GSoC projects:&lt;br /&gt;
&lt;br /&gt;
* [https://www.python.org Python 3] is our language of choice; you should be fluent in that language to apply&lt;br /&gt;
* [https://git-scm.com Git] is our version control system of choice; you should be familiar with it to apply&lt;br /&gt;
* basic knowledge of using a CLI&lt;br /&gt;
* additional prerequisites depend on the project you will work on; check project descriptions for details&lt;br /&gt;
&lt;br /&gt;
=== Before you apply ===&lt;br /&gt;
&lt;br /&gt;
Here are the steps you should follow before applying, to make sure you have a good grasp of what we are doing at Software Heritage and how we do it:&lt;br /&gt;
&lt;br /&gt;
# Follow our [https://docs.softwareheritage.org/devel/developer-setup.html developer setup tutorial]: it will make sure you have the source code of our software stack locally available and that you can run unit tests&lt;br /&gt;
# Create an account on our [https://forge.softwareheritage.org development forge]&lt;br /&gt;
# Familiarize yourself with our [[Code review in Phabricator|code review workflow]]&lt;br /&gt;
# Make at least one simple change to any one of our [https://docs.softwareheritage.org/devel/ software components] and submit it as a [https://forge.softwareheritage.org/differential/ diff] for code review, following the above workflow. [[Easy hacks]] and [https://forge.softwareheritage.org/project/view/20/ Web UI] issues are good options for what to fix, but feel free to submit any patch you think might be useful.&lt;br /&gt;
&lt;br /&gt;
=== What to include in your application ===&lt;br /&gt;
&lt;br /&gt;
Make sure that your application includes the following information:&lt;br /&gt;
&lt;br /&gt;
* Describe the '''specific project''' you want to work on. What do you want to achieve? Why is it important? Why is it useful for Software Heritage? The project might be one of the project ideas that we have prepared below, or something else entirely that you want to contribute to Software Heritage. Your source code archival pet peeve, surprise us!&lt;br /&gt;
* Detail your '''work plan''': a brief description of how you plan to go about your project, including a list of ''deliverables'' and a ''timeline'' of when you expect them to be available.&lt;br /&gt;
* Include a reference to '''the diff''' you submitted before applying (see the &amp;quot;Before you apply&amp;quot; section above).&lt;br /&gt;
&lt;br /&gt;
== Ideas list ==&lt;br /&gt;
&lt;br /&gt;
Below you can find a list of project ideas that are good options for a reasonably-sized GSoC project (check individual idea pages for expected duration and difficulty of each task):&lt;br /&gt;
&amp;lt;DynamicPageList&amp;gt;&lt;br /&gt;
category = Available GSoC task&lt;br /&gt;
ordermethod = sortkey&lt;br /&gt;
order = ascending&lt;br /&gt;
&amp;lt;/DynamicPageList&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We also have a list of internship topics below, which you can use for ideas when applying to GSoC with us.&lt;br /&gt;
Expect each internship topic to require 350 hours and to be harder than GSoC-specific tasks.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;DynamicPageList&amp;gt;&lt;br /&gt;
category = Available internship&lt;br /&gt;
ordermethod = sortkey&lt;br /&gt;
order = ascending&lt;br /&gt;
&amp;lt;/DynamicPageList&amp;gt;&lt;br /&gt;
&lt;br /&gt;
All project ideas above are just suggestions; don't feel obliged to pick one of them if there is nothing that fits your taste and abilities.&lt;br /&gt;
Feel free to propose something else that you are excited about and that contributes to improving the Software Heritage archive: we will be happy to consider it!&lt;br /&gt;
&lt;br /&gt;
== Contact ==&lt;br /&gt;
&lt;br /&gt;
GSoC students are encouraged to get in touch with the Software Heritage community using the standard development communication channels, and in particular our [[IRC]] channel (#swh-devel on [https://libera.chat/ Libera Chat]) and mailing list (swh-devel). Prefer public channels over contacting mentors directly.&lt;br /&gt;
&lt;br /&gt;
See our [https://www.softwareheritage.org/community/developers/ development information page] for details.&lt;br /&gt;
&lt;br /&gt;
== Timeline ==&lt;br /&gt;
&lt;br /&gt;
See the official [https://developers.google.com/open-source/gsoc/timeline Google Summer of Code timeline].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Google Summer of Code]]&lt;br /&gt;
[[Category:Google Summer of Code 2022]]&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Add_sources_to_the_project_search_engine_(GSoC_task)&amp;diff=1702</id>
		<title>Add sources to the project search engine (GSoC task)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Add_sources_to_the_project_search_engine_(GSoC_task)&amp;diff=1702"/>
		<updated>2022-03-10T14:45:44Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: add vsellier as mentor&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
The [https://archive.softwareheritage.org/ homepage of the Software Heritage archive]&lt;br /&gt;
features a small search engine that searches project URLs and project metadata.&lt;br /&gt;
Project metadata includes name, description, authors, etc.&lt;br /&gt;
&lt;br /&gt;
This is implemented by a Python service backed by an ElasticSearch database,&lt;br /&gt;
which contains one document for each project; each document contains metadata&lt;br /&gt;
mined from the project itself.&lt;br /&gt;
&lt;br /&gt;
== Task description ==&lt;br /&gt;
&lt;br /&gt;
We would like to add more data sources to the ElasticSearch database;&lt;br /&gt;
typically sources that are not authoritative, but provide metadata of usually&lt;br /&gt;
good quality.&lt;br /&gt;
&lt;br /&gt;
This comes with the following challenges:&lt;br /&gt;
&lt;br /&gt;
1. there are multiple sources, and their contents must work together&lt;br /&gt;
2. sources have different reliability, which should be taken into account when ranking search results&lt;br /&gt;
&lt;br /&gt;
Therefore, this task will require making a plan to address these,&lt;br /&gt;
defining a data model, and finally implementing it in a backend.&lt;br /&gt;
It may involve some frontend work if necessary, to provide an interface for&lt;br /&gt;
these.&lt;br /&gt;
&lt;br /&gt;
== Desirable skills ==&lt;br /&gt;
&lt;br /&gt;
* Python 3 and Git are a must to work on any Software Heritage project&lt;br /&gt;
* ElasticSearch&lt;br /&gt;
* Experience with cross-referenced data mining would be appreciated&lt;br /&gt;
&lt;br /&gt;
== Potential mentors ==&lt;br /&gt;
&lt;br /&gt;
* Kumar Shivendu (KShivendu on [[IRC]])&lt;br /&gt;
* Valentin Lorentz (vlorentz on [[IRC]])&lt;br /&gt;
* Vincent Sellier (vsellier on [[IRC]])&lt;br /&gt;
&lt;br /&gt;
== Other relevant (but independent) tasks ==&lt;br /&gt;
&lt;br /&gt;
This task is only about adding data we already collected to the existing Elasticsearch database;&lt;br /&gt;
you may also be interested in [[Mine information from archived content (GSoC task)]]&lt;br /&gt;
and [[Mine information from external sources (GSoC task)]] to fill this&lt;br /&gt;
database, but those are completely independent tasks.&lt;br /&gt;
&lt;br /&gt;
This database only contains project URLs and metadata, not source code.&lt;br /&gt;
Source code search is more complex, but is available as an&lt;br /&gt;
[[Source code search engine prototype (internship)|internship topic]].&lt;br /&gt;
&lt;br /&gt;
[[Category:GSoC task]]&lt;br /&gt;
[[Category:Available GSoC task]]&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Add_sources_to_the_project_search_engine_(GSoC_task)&amp;diff=1701</id>
		<title>Add sources to the project search engine (GSoC task)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Add_sources_to_the_project_search_engine_(GSoC_task)&amp;diff=1701"/>
		<updated>2022-03-10T13:55:44Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
The [https://archive.softwareheritage.org/ homepage of the Software Heritage archive]&lt;br /&gt;
features a small search engine that searches project URLs and project metadata.&lt;br /&gt;
Project metadata includes name, description, authors, etc.&lt;br /&gt;
&lt;br /&gt;
This is implemented by a Python service backed by an ElasticSearch database,&lt;br /&gt;
which contains one document for each project; each document contains metadata&lt;br /&gt;
mined from the project itself.&lt;br /&gt;
&lt;br /&gt;
== Task description ==&lt;br /&gt;
&lt;br /&gt;
We would like to add more data sources to the ElasticSearch database;&lt;br /&gt;
typically sources that are not authoritative, but provide metadata of usually&lt;br /&gt;
good quality.&lt;br /&gt;
&lt;br /&gt;
This comes with the following challenges:&lt;br /&gt;
&lt;br /&gt;
1. there are multiple sources, and their contents must work together&lt;br /&gt;
2. sources have different reliability, which should be taken into account when ranking search results&lt;br /&gt;
&lt;br /&gt;
Therefore, this task will require making a plan to address these,&lt;br /&gt;
defining a data model, and finally implementing it in a backend.&lt;br /&gt;
It may involve some frontend work if necessary, to provide an interface for&lt;br /&gt;
these.&lt;br /&gt;
&lt;br /&gt;
== Desirable skills ==&lt;br /&gt;
&lt;br /&gt;
* Python 3 and Git are a must to work on any Software Heritage project&lt;br /&gt;
* ElasticSearch&lt;br /&gt;
* Experience with cross-referenced data mining would be appreciated&lt;br /&gt;
&lt;br /&gt;
== Potential mentors ==&lt;br /&gt;
&lt;br /&gt;
* Valentin Lorentz (vlorentz on [[IRC]])&lt;br /&gt;
* Kumar Shivendu (KShivendu on [[IRC]])&lt;br /&gt;
&lt;br /&gt;
== Other relevant (but independent) tasks ==&lt;br /&gt;
&lt;br /&gt;
This task is only about adding data we already collected to the existing Elasticsearch database;&lt;br /&gt;
you may also be interested in [[Mine information from archived content (GSoC task)]]&lt;br /&gt;
and [[Mine information from external sources (GSoC task)]] to fill this&lt;br /&gt;
database, but those are completely independent tasks.&lt;br /&gt;
&lt;br /&gt;
This database only contains project URLs and metadata, not source code.&lt;br /&gt;
Source code search is more complex, but is available as an&lt;br /&gt;
[[Source code search engine prototype (internship)|internship topic]].&lt;br /&gt;
&lt;br /&gt;
[[Category:GSoC task]]&lt;br /&gt;
[[Category:Available GSoC task]]&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Add_sources_to_the_project_search_engine_(GSoC_task)&amp;diff=1700</id>
		<title>Add sources to the project search engine (GSoC task)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Add_sources_to_the_project_search_engine_(GSoC_task)&amp;diff=1700"/>
		<updated>2022-03-10T13:54:57Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: /* Task description */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
The [https://archive.softwareheritage.org/ homepage of the Software Heritage archive]&lt;br /&gt;
features a small search engine that searches project URLs and project metadata.&lt;br /&gt;
Project metadata includes name, description, authors, etc.&lt;br /&gt;
&lt;br /&gt;
This is implemented by a Python service backed by an ElasticSearch database,&lt;br /&gt;
which contains one document for each project; each document contains metadata&lt;br /&gt;
mined from the project itself.&lt;br /&gt;
&lt;br /&gt;
== Task description ==&lt;br /&gt;
&lt;br /&gt;
We would like to add more data sources to the ElasticSearch database;&lt;br /&gt;
typically sources that are not authoritative, but provide metadata of usually&lt;br /&gt;
good quality.&lt;br /&gt;
&lt;br /&gt;
This comes with the following challenges:&lt;br /&gt;
&lt;br /&gt;
1. there are multiple sources, and their contents must work together&lt;br /&gt;
2. sources have different reliability, which should be taken into account when ranking search results&lt;br /&gt;
&lt;br /&gt;
Therefore, this task will require making a plan to address these,&lt;br /&gt;
defining a data model, and finally implementing it in a backend.&lt;br /&gt;
It may involve some frontend work if necessary, to provide an interface for&lt;br /&gt;
these.&lt;br /&gt;
&lt;br /&gt;
== Desirable skills ==&lt;br /&gt;
&lt;br /&gt;
* Python 3 and Git are a must to work on any Software Heritage project&lt;br /&gt;
* ElasticSearch&lt;br /&gt;
* Experience with cross-referenced data mining would be appreciated&lt;br /&gt;
&lt;br /&gt;
== Potential mentors ==&lt;br /&gt;
&lt;br /&gt;
* Valentin Lorentz (vlorentz on [[IRC]])&lt;br /&gt;
* Kumar Shivendu (KShivendu on [[IRC]])&lt;br /&gt;
&lt;br /&gt;
== Other relevant (but independent) tasks ==&lt;br /&gt;
&lt;br /&gt;
This task is only about adding data we already collected to the existing Elasticsearch database;&lt;br /&gt;
you may also be interested in [[Mine information from archived content (GSoC task)]]&lt;br /&gt;
and [[Mine information from external sources (GSoC task)]] to fill this&lt;br /&gt;
database, but those are completely independent tasks.&lt;br /&gt;
&lt;br /&gt;
This database only contains project URLs and metadata, not source code.&lt;br /&gt;
Source code search is more complex, but is available as an&lt;br /&gt;
[[Source code search engine prototype (internship)|internship topic]].&lt;br /&gt;
&lt;br /&gt;
[[Category:GSoC task]]&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Add_sources_to_the_project_search_engine_(GSoC_task)&amp;diff=1699</id>
		<title>Add sources to the project search engine (GSoC task)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Add_sources_to_the_project_search_engine_(GSoC_task)&amp;diff=1699"/>
		<updated>2022-03-10T13:54:50Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: Created page with &amp;quot;== Introduction ==  The [https://archive.softwareheritage.org/ homepage of the Software Heritage archive] features a small search engine, that searched in project URLs and pro...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
The [https://archive.softwareheritage.org/ homepage of the Software Heritage archive]&lt;br /&gt;
features a small search engine that searches project URLs and project metadata.&lt;br /&gt;
Project metadata includes name, description, authors, etc.&lt;br /&gt;
&lt;br /&gt;
This is implemented by a Python service backed by an ElasticSearch database,&lt;br /&gt;
which contains one document for each project; each document contains metadata&lt;br /&gt;
mined from the project itself.&lt;br /&gt;
&lt;br /&gt;
== Task description ==&lt;br /&gt;
&lt;br /&gt;
We would like to add more data sources to the ElasticSearch database;&lt;br /&gt;
typically sources that are not authoritative, but provide metadata of usually&lt;br /&gt;
good quality.&lt;br /&gt;
&lt;br /&gt;
This comes with the following challenges:&lt;br /&gt;
&lt;br /&gt;
1. there are multiple sources, and their contents must work together&lt;br /&gt;
2. sources have different reliability, which should be taken into account&lt;br /&gt;
   when ranking search results&lt;br /&gt;
&lt;br /&gt;
Therefore, this task will require making a plan to address these,&lt;br /&gt;
defining a data model, and finally implementing it in a backend.&lt;br /&gt;
It may involve some frontend work if necessary, to provide an interface for&lt;br /&gt;
these.&lt;br /&gt;
&lt;br /&gt;
== Desirable skills ==&lt;br /&gt;
&lt;br /&gt;
* Python 3 and Git are a must to work on any Software Heritage project&lt;br /&gt;
* ElasticSearch&lt;br /&gt;
* Experience with cross-referenced data mining would be appreciated&lt;br /&gt;
&lt;br /&gt;
== Potential mentors ==&lt;br /&gt;
&lt;br /&gt;
* Valentin Lorentz (vlorentz on [[IRC]])&lt;br /&gt;
* Kumar Shivendu (KShivendu on [[IRC]])&lt;br /&gt;
&lt;br /&gt;
== Other relevant (but independent) tasks ==&lt;br /&gt;
&lt;br /&gt;
This task is only about adding data we already collected to the existing Elasticsearch database;&lt;br /&gt;
you may also be interested in [[Mine information from archived content (GSoC task)]]&lt;br /&gt;
and [[Mine information from external sources (GSoC task)]] to fill this&lt;br /&gt;
database, but those are completely independent tasks.&lt;br /&gt;
&lt;br /&gt;
This database only contains project URLs and metadata, not source code.&lt;br /&gt;
Source code search is more complex, but is available as an&lt;br /&gt;
[[Source code search engine prototype (internship)|internship topic]].&lt;br /&gt;
&lt;br /&gt;
[[Category:GSoC task]]&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Google_Summer_of_Code_2022&amp;diff=1698</id>
		<title>Google Summer of Code 2022</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Google_Summer_of_Code_2022&amp;diff=1698"/>
		<updated>2022-03-09T19:10:27Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: /* Ideas list */ re-add the list of internship&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:GSoCLogo.png|512px]]&lt;br /&gt;
&lt;br /&gt;
== General information ==&lt;br /&gt;
&lt;br /&gt;
This page is the central point of information for [[Software Heritage]] participation in the [https://summerofcode.withgoogle.com/ Google Summer of Code] program in 2022.&lt;br /&gt;
&lt;br /&gt;
Google Summer of Code is a program where Google pays students stipends to work over the (northern hemisphere) summer on free software projects such as Software Heritage. Each student works with mentors from the community to complete a software project.&lt;br /&gt;
&lt;br /&gt;
== I want to participate as a student ==&lt;br /&gt;
&lt;br /&gt;
Great! We are very glad about your interest in contributing to Software Heritage and we are looking forward to working together.&lt;br /&gt;
&lt;br /&gt;
=== Prerequisites ===&lt;br /&gt;
&lt;br /&gt;
The following prerequisites apply to all Software Heritage GSoC projects:&lt;br /&gt;
&lt;br /&gt;
* [https://www.python.org Python 3] is our language of choice; you should be fluent in that language to apply&lt;br /&gt;
* [https://git-scm.com Git] is our version control system of choice; you should be familiar with it to apply&lt;br /&gt;
* basic knowledge of using a CLI&lt;br /&gt;
* additional prerequisites depend on the project you will work on; check project descriptions for details&lt;br /&gt;
&lt;br /&gt;
=== Before you apply ===&lt;br /&gt;
&lt;br /&gt;
Here are the steps you should follow before applying, to make sure you have a good grasp of what we are doing at Software Heritage and how we do it:&lt;br /&gt;
&lt;br /&gt;
# Follow our [https://docs.softwareheritage.org/devel/developer-setup.html developer setup tutorial]: it will make sure you have the source code of our software stack locally available and that you can run unit tests&lt;br /&gt;
# Create an account on our [https://forge.softwareheritage.org development forge]&lt;br /&gt;
# Familiarize yourself with our [[Code review in Phabricator|code review workflow]]&lt;br /&gt;
# Make at least one simple change to any one of our [https://docs.softwareheritage.org/devel/ software components] and submit it as a [https://forge.softwareheritage.org/differential/ diff] for code review, following the above workflow. [[Easy hacks]] and [https://forge.softwareheritage.org/project/view/20/ Web UI] issues are good options for what to fix, but feel free to submit any patch you think might be useful.&lt;br /&gt;
&lt;br /&gt;
=== What to include in your application ===&lt;br /&gt;
&lt;br /&gt;
Make sure that your application includes the following information:&lt;br /&gt;
&lt;br /&gt;
* Describe the '''specific project''' you want to work on. What do you want to achieve? Why is it important? Why is it useful for Software Heritage? The project might be one of the project ideas that we have prepared below, or something else entirely that you want to contribute to Software Heritage. Tackle your source code archival pet peeve; surprise us!&lt;br /&gt;
* Detail your '''work plan''': a brief description of how you plan to go about your project, including a list of ''deliverables'' and a ''timeline'' of when you expect them to be available.&lt;br /&gt;
* Include a reference to '''the diff''' you submitted before applying (see the &amp;quot;Before you apply&amp;quot; section above).&lt;br /&gt;
&lt;br /&gt;
== Ideas list ==&lt;br /&gt;
&lt;br /&gt;
Below you can find a list of project ideas that are good options for a reasonably-sized GSoC project (check individual idea pages for expected duration and difficulty of each task):&lt;br /&gt;
&amp;lt;DynamicPageList&amp;gt;&lt;br /&gt;
category = Available GSoC task&lt;br /&gt;
ordermethod = sortkey&lt;br /&gt;
order = ascending&lt;br /&gt;
&amp;lt;/DynamicPageList&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We also have a list of internship topics below, which you can use for ideas when applying to GSoC with us.&lt;br /&gt;
Expect each internship topic to require 350 hours and to be harder than GSoC-specific tasks.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;DynamicPageList&amp;gt;&lt;br /&gt;
category = Available internship&lt;br /&gt;
ordermethod = sortkey&lt;br /&gt;
order = ascending&lt;br /&gt;
&amp;lt;/DynamicPageList&amp;gt;&lt;br /&gt;
&lt;br /&gt;
All project ideas above are just suggestions; don't feel obliged to pick one of them if nothing fits your taste and abilities.&lt;br /&gt;
Feel free to propose something else that you are excited about and that contributes to improving the Software Heritage archive: we will be happy to consider it!&lt;br /&gt;
&lt;br /&gt;
== Contact ==&lt;br /&gt;
&lt;br /&gt;
GSoC students are encouraged to get in touch with the Software Heritage community using the standard development communication channels, and in particular our [[IRC]] channel (#swh-devel on [https://libera.chat/ Libera Chat]) and mailing list (swh-devel).&lt;br /&gt;
&lt;br /&gt;
See our [https://www.softwareheritage.org/community/developers/ development information page] for details.&lt;br /&gt;
&lt;br /&gt;
== Timeline ==&lt;br /&gt;
&lt;br /&gt;
See the official [https://developers.google.com/open-source/gsoc/timeline Google Summer of Code timeline].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Google Summer of Code]]&lt;br /&gt;
[[Category:Google Summer of Code 2022]]&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Create_a_browser_extension_(GSoC_task)&amp;diff=1695</id>
		<title>Create a browser extension (GSoC task)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Create_a_browser_extension_(GSoC_task)&amp;diff=1695"/>
		<updated>2022-03-07T20:01:22Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: /* Expected duration */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
As you probably know already, the Software Heritage archive can be&lt;br /&gt;
[https://archive.softwareheritage.org browsed on the Web]. The&lt;br /&gt;
[https://forge.softwareheritage.org/source/swh-web/ code] powering that&lt;br /&gt;
interface is a Django application that also implements a&lt;br /&gt;
[https://archive.softwareheritage.org/api/ Web API].&lt;br /&gt;
&lt;br /&gt;
== Task description ==&lt;br /&gt;
&lt;br /&gt;
When browsing a repository (on GitHub, GitLab, ...) or a package description (on NPM, Debian),&lt;br /&gt;
people may want to check when (and if) this repository or package was last archived in Software Heritage.&lt;br /&gt;
Currently, this means opening the archive in a new tab, searching for the URL, and looking at the status.&lt;br /&gt;
Then, they can trigger a new archival with another few clicks (via the &amp;quot;Save Code Now&amp;quot; feature).&lt;br /&gt;
&lt;br /&gt;
This workflow may be streamlined by the [https://forge.softwareheritage.org/T3756 creation of a browser extension or bookmarklet].&lt;br /&gt;
This extension/bookmarklet would, for example, display an icon next to the URL bar indicating the status of the currently visited repository;&lt;br /&gt;
and clicking it would show details (like the date of last visit) and run &amp;quot;Save Code Now&amp;quot; in just two clicks.&lt;br /&gt;
&lt;br /&gt;
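The status check described above can be sketched as follows. This is only an illustration, written in Python for consistency with the rest of the stack (a real extension would be written in JavaScript). It assumes that the Web API exposes the latest visit of an origin under &amp;quot;origin/.../visit/latest/&amp;quot; and that the response carries &amp;quot;date&amp;quot; and &amp;quot;status&amp;quot; fields; the function names and the sample payload are hypothetical.&lt;br /&gt;
&lt;br /&gt;

```python
import json
from urllib.parse import quote

# Root of the public Web API (version 1).
API_ROOT = "https://archive.softwareheritage.org/api/1"


def latest_visit_url(origin_url: str) -> str:
    """Build the (assumed) API URL describing the latest visit of an origin."""
    # Keep ':' and '/' as-is so the origin URL stays readable in the path.
    return f"{API_ROOT}/origin/{quote(origin_url, safe=':/')}/visit/latest/"


def summarize_visit(payload: str) -> str:
    """Turn a JSON visit description into a one-line status for the icon tooltip."""
    visit = json.loads(payload)
    return f"last visit: {visit['date']} ({visit['status']})"


# Hypothetical response body, trimmed to the two fields used above.
sample = '{"date": "2022-03-01T12:00:00+00:00", "status": "full"}'
print(latest_visit_url("https://github.com/python/cpython"))
print(summarize_visit(sample))
```

&lt;br /&gt;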
=== Expected duration ===&lt;br /&gt;
&lt;br /&gt;
175 hours. Difficulty: medium&lt;br /&gt;
&lt;br /&gt;
== Desirable skills ==&lt;br /&gt;
&lt;br /&gt;
* Python 3 and Git are a must to work on any Software Heritage project&lt;br /&gt;
* JavaScript experience is also needed for this project&lt;br /&gt;
* Prior experience in working with browser extensions is a plus&lt;br /&gt;
&lt;br /&gt;
== Potential mentors ==&lt;br /&gt;
&lt;br /&gt;
* Antoine Lambert (anlambert on [[IRC]])&lt;br /&gt;
* Valentin Lorentz (vlorentz on [[IRC]])&lt;br /&gt;
* Jayesh Velayudhan (jayeshv on [[IRC]]) &lt;br /&gt;
&lt;br /&gt;
[[Category:Available GSoC task]]&lt;br /&gt;
[[Category:GSoC task]]&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Google_Summer_of_Code_2022&amp;diff=1692</id>
		<title>Google Summer of Code 2022</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Google_Summer_of_Code_2022&amp;diff=1692"/>
		<updated>2022-02-25T10:06:41Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: /* Ideas list */ Replace the internship list with a link, as we can't provide the duration+difficulty information in GSoC terms directly on internship pages&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;div style=&amp;quot;text-align: center; font-size: 1.2em; border: solid 1px black; padding: 1em;&amp;quot;&amp;gt;&lt;br /&gt;
This page is a work in progress; the list of retained organizations for Google Summer of Code 2022 has not been determined yet.&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:GSoCLogo.png|512px]]&lt;br /&gt;
&lt;br /&gt;
== General information ==&lt;br /&gt;
&lt;br /&gt;
This page is the central point of information for [[Software Heritage]] participation in the [https://summerofcode.withgoogle.com/ Google Summer of Code] program in 2022.&lt;br /&gt;
&lt;br /&gt;
Google Summer of Code is a program where Google pays students stipends to work over the (northern hemisphere) summer on free software projects such as Software Heritage. Each student works with mentors from the community to complete a software project.&lt;br /&gt;
&lt;br /&gt;
== I want to participate as a student ==&lt;br /&gt;
&lt;br /&gt;
Great! We are very glad that you are interested in contributing to Software Heritage, and we look forward to working together.&lt;br /&gt;
&lt;br /&gt;
=== Prerequisites ===&lt;br /&gt;
&lt;br /&gt;
The following prerequisites apply to all Software Heritage GSoC projects:&lt;br /&gt;
&lt;br /&gt;
* [https://www.python.org Python 3] is our language of choice; you should be fluent in it to apply&lt;br /&gt;
* [https://git-scm.com Git] is our version control system of choice; you should be familiar with it to apply&lt;br /&gt;
* basic knowledge of using a command-line interface (CLI)&lt;br /&gt;
* additional prerequisites depend on the project you will work on; check project descriptions for details&lt;br /&gt;
&lt;br /&gt;
=== Before you apply ===&lt;br /&gt;
&lt;br /&gt;
Here are the steps you should follow before applying, to make sure you have a good grasp of what we are doing at Software Heritage and how we do it:&lt;br /&gt;
&lt;br /&gt;
# Follow our [https://docs.softwareheritage.org/devel/developer-setup.html developer setup tutorial]: it will make sure you have the source code of our software stack locally available and that you can run unit tests&lt;br /&gt;
# Create an account on our [https://forge.softwareheritage.org development forge]&lt;br /&gt;
# Familiarize yourself with our [[Code review in Phabricator|code review workflow]]&lt;br /&gt;
# Make at least one simple change to any one of our [https://docs.softwareheritage.org/devel/ software components] and submit it as a [https://forge.softwareheritage.org/differential/ diff] for code review, following the above workflow. [[Easy hacks]] and [https://forge.softwareheritage.org/project/view/20/ Web UI] issues are good options for what to fix, but feel free to submit any patch you think might be useful.&lt;br /&gt;
&lt;br /&gt;
=== What to include in your application ===&lt;br /&gt;
&lt;br /&gt;
Make sure that your application includes the following information:&lt;br /&gt;
&lt;br /&gt;
* Describe the '''specific project''' you want to work on. What do you want to achieve? Why is it important? Why is it useful for Software Heritage? The project might be one of the project ideas that we have prepared below, or something else entirely that you want to contribute to Software Heritage. Tackle your source code archival pet peeve; surprise us!&lt;br /&gt;
* Detail your '''work plan''': a brief description of how you plan to go about your project, including a list of ''deliverables'' and a ''timeline'' of when you expect them to be available.&lt;br /&gt;
* Include a reference to '''the diff''' you submitted before applying (see the &amp;quot;Before you apply&amp;quot; section above).&lt;br /&gt;
&lt;br /&gt;
== Ideas list ==&lt;br /&gt;
&lt;br /&gt;
Below you can find a list of project ideas that are good options for a reasonably-sized GSoC project.&lt;br /&gt;
These are GSoC-specific tasks (that are only available as part of our GSoC participation):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;DynamicPageList&amp;gt;&lt;br /&gt;
category = Available GSoC task&lt;br /&gt;
ordermethod = sortkey&lt;br /&gt;
order = ascending&lt;br /&gt;
&amp;lt;/DynamicPageList&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We also have a [[Internships|list of internship topics]], which you can use for ideas when applying to GSoC with us. Expect them to be harder than usual GSoC tasks and to require 350 hours.&lt;br /&gt;
&lt;br /&gt;
All project ideas above are just suggestions; don't feel obliged to pick one of them if nothing fits your taste and abilities.&lt;br /&gt;
Feel free to propose something else that you are excited about and that contributes to improving the Software Heritage archive: we will be happy to consider it!&lt;br /&gt;
&lt;br /&gt;
== Contact ==&lt;br /&gt;
&lt;br /&gt;
GSoC students are encouraged to get in touch with the Software Heritage community using the standard development communication channels, and in particular our [[IRC]] channel (#swh-devel on [https://libera.chat/ Libera Chat]) and mailing list (swh-devel).&lt;br /&gt;
&lt;br /&gt;
See our [https://www.softwareheritage.org/community/developers/ development information page] for details.&lt;br /&gt;
&lt;br /&gt;
== Timeline ==&lt;br /&gt;
&lt;br /&gt;
See the official [https://developers.google.com/open-source/gsoc/timeline Google Summer of Code timeline].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Google Summer of Code]]&lt;br /&gt;
[[Category:Google Summer of Code 2022]]&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Mine_information_from_external_sources_(GSoC_task)&amp;diff=1678</id>
		<title>Mine information from external sources (GSoC task)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Mine_information_from_external_sources_(GSoC_task)&amp;diff=1678"/>
		<updated>2022-02-25T07:18:23Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: Add missing required info&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
In addition to archiving source code artifacts, Software Heritage is interested in&lt;br /&gt;
archiving metadata from external sources and correlating it with source code artifacts.&lt;br /&gt;
This also enables semantic searches on the archive and scientific research.&lt;br /&gt;
&lt;br /&gt;
Collecting this extrinsic metadata is a&lt;br /&gt;
[https://forge.softwareheritage.org/T1739 work in progress], and you are welcome&lt;br /&gt;
to contribute to its implementation.&lt;br /&gt;
&lt;br /&gt;
== Task description ==&lt;br /&gt;
&lt;br /&gt;
You would contribute to the design of our metadata-fetching architecture.&lt;br /&gt;
This includes:&lt;br /&gt;
&lt;br /&gt;
* Reviewing what metadata we want to fetch&lt;br /&gt;
* Designing how to efficiently fetch it at regular intervals and store it&lt;br /&gt;
* Implementing metadata fetching from at least one source, in a way that can be generalized to other sources&lt;br /&gt;
&lt;br /&gt;
Expected duration: 175 or 350 hours, at your option (longer duration means you can tackle more formats). Difficulty: hard&lt;br /&gt;
&lt;br /&gt;
== Desirable skills ==&lt;br /&gt;
&lt;br /&gt;
* Python 3 and Git are a must to work on any Software Heritage project&lt;br /&gt;
* Prior experience in working with software metadata is a plus, but not required&lt;br /&gt;
&lt;br /&gt;
== Potential mentors ==&lt;br /&gt;
&lt;br /&gt;
* Stefano Zacchiroli &amp;lt;zack@upsilon.cc&amp;gt; (zack on [[IRC]])&lt;br /&gt;
* Valentin Lorentz (vlorentz on [[IRC]])&lt;br /&gt;
&lt;br /&gt;
[[Category:Available GSoC task]]&lt;br /&gt;
[[Category:GSoC task]]&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Mine_information_from_archived_content_(GSoC_task)&amp;diff=1677</id>
		<title>Mine information from archived content (GSoC task)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Mine_information_from_archived_content_(GSoC_task)&amp;diff=1677"/>
		<updated>2022-02-25T07:17:56Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: Add missing required info&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
In addition to archival, Software Heritage indexes the retrieved source code&lt;br /&gt;
artifacts, to enable semantic searches on the archive and scientific research.&lt;br /&gt;
&lt;br /&gt;
Indexing can happen at the individual file level (e.g., detect the programming&lt;br /&gt;
language the file is written in or the license declared in its header), or at&lt;br /&gt;
a coarser granularity (e.g., what metadata are declared for the most&lt;br /&gt;
recently archived version of a given project).&lt;br /&gt;
&lt;br /&gt;
A number of indexers are [https://forge.softwareheritage.org/source/swh-indexer/ currently supported],&lt;br /&gt;
such as:&lt;br /&gt;
&lt;br /&gt;
* file-level mining:&lt;br /&gt;
** MIME type detection (using libmagic)&lt;br /&gt;
** license detection (using FOSSology/nomossa)&lt;br /&gt;
** language detection (using Pygments)&lt;br /&gt;
** ctags extraction (using universal-ctags)&lt;br /&gt;
* project-level mining:&lt;br /&gt;
** Ruby gemspec metadata&lt;br /&gt;
** Python PKG-INFO metadata&lt;br /&gt;
** Maven pom.xml metadata&lt;br /&gt;
** NPM package.json metadata&lt;br /&gt;
&lt;br /&gt;
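As an illustration of project-level mining, here is a minimal sketch that keeps a few well-known fields from the text of an NPM package.json file. It is not the actual swh-indexer interface: the function name, the selection of fields, and the sample input are all hypothetical.&lt;br /&gt;
&lt;br /&gt;

```python
import json


def extract_npm_metadata(package_json: str) -> dict:
    """Keep only a few well-known fields from the text of a package.json file."""
    data = json.loads(package_json)
    wanted = ("name", "version", "description", "license")
    # Fields missing from the input are skipped rather than filled in.
    return {key: data[key] for key in wanted if key in data}


# Hypothetical package.json content.
sample = '{"name": "example", "version": "1.0.0", "license": "MIT", "scripts": {}}'
print(extract_npm_metadata(sample))
```

&lt;br /&gt;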
== Task description ==&lt;br /&gt;
&lt;br /&gt;
Writing additional indexers that extract more information from archived source&lt;br /&gt;
code is welcome and would constitute a suitable GSoC project.&lt;br /&gt;
&lt;br /&gt;
Name the kind of data mining you want to do!&lt;br /&gt;
&lt;br /&gt;
For inspiration you can have a look at [https://libraries.io Libraries.io], as&lt;br /&gt;
most package formats/package managers support dedicated ways of expressing&lt;br /&gt;
metadata and we only support a small number of them so far. But do not&lt;br /&gt;
restrict your ambition to those; any kind of data extraction/mining you want to&lt;br /&gt;
do on the archive could work.&lt;br /&gt;
&lt;br /&gt;
You may also add support for multiple formats at once, using an external tool,&lt;br /&gt;
such as [https://github.com/datacite/bolognese Bolognese] or&lt;br /&gt;
[https://github.com/librariesio/bibliothecary/ bibliothecary].&lt;br /&gt;
&lt;br /&gt;
Expected duration: 175 or 350 hours, at your option (longer duration means you can tackle more formats). Difficulty: medium&lt;br /&gt;
&lt;br /&gt;
== Desirable skills ==&lt;br /&gt;
&lt;br /&gt;
* Python 3 and Git are a must to work on any Software Heritage project&lt;br /&gt;
* Prior experience in working with (source code) metadata is a plus, but not required&lt;br /&gt;
&lt;br /&gt;
== Potential mentors ==&lt;br /&gt;
&lt;br /&gt;
* Stefano Zacchiroli &amp;lt;zack@upsilon.cc&amp;gt; (zack on [[IRC]])&lt;br /&gt;
* Valentin Lorentz (vlorentz on [[IRC]])&lt;br /&gt;
&lt;br /&gt;
[[Category:Available GSoC task]]&lt;br /&gt;
[[Category:GSoC task]]&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Make_the_software_deposit_service_(swh-deposit)_modular_(GSoC_task)&amp;diff=1676</id>
		<title>Make the software deposit service (swh-deposit) modular (GSoC task)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Make_the_software_deposit_service_(swh-deposit)_modular_(GSoC_task)&amp;diff=1676"/>
		<updated>2022-02-25T07:17:04Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: /* Task description */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
The Software Heritage archive is the most comprehensive open data knowledge base about source code that has been published openly.&lt;br /&gt;
&lt;br /&gt;
In addition to fetching source code from public repositories, it offers a Deposit service, which allows platforms to send code to Software Heritage for archival.&lt;br /&gt;
&lt;br /&gt;
This service is currently written as a monolith that grew over the years to include a complete [http://swordapp.github.io/SWORDv2-Profile/SWORDProfile.html SWORDv2] server, a partial SWORDv2 client with extensions, and business logic specific to Software Heritage in both. This makes the current code hard to maintain and impossible to reuse.&lt;br /&gt;
&lt;br /&gt;
== Task description ==&lt;br /&gt;
&lt;br /&gt;
[https://forge.softwareheritage.org/source/swh-deposit/ swh-deposit] would need to be split into the following parts:&lt;br /&gt;
&lt;br /&gt;
* a generic SWORDv2 server (based on Django)&lt;br /&gt;
* a generic SWORDv2 client&lt;br /&gt;
* server-side business logic (currently implemented mostly in swh/deposit/api/common.py, but tightly coupled with the views)&lt;br /&gt;
* client-side business logic &lt;br /&gt;
&lt;br /&gt;
The generic server and client will need to be extensively documented, so they can be reused by other software projects.&lt;br /&gt;
&lt;br /&gt;
Possible extensions include:&lt;br /&gt;
&lt;br /&gt;
* Designing the code so that it can later be extended to support SWORDv3, if we ever need to support it&lt;br /&gt;
* A new administration front-end and/or addition of administrative tools in [https://forge.softwareheritage.org/source/swh-web/ swh-web]&lt;br /&gt;
&lt;br /&gt;
Expected duration: 175 hours. Difficulty: medium&lt;br /&gt;
&lt;br /&gt;
== Desirable skills ==&lt;br /&gt;
&lt;br /&gt;
* Python 3 and Git are a must to work on any Software Heritage project&lt;br /&gt;
* Basic understanding of the Software Heritage [https://docs.softwareheritage.org/devel/swh-model/data-model.html data model] and of [https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html SWHID identifiers]&lt;br /&gt;
* Experience with Django&lt;br /&gt;
&lt;br /&gt;
== Potential mentors ==&lt;br /&gt;
&lt;br /&gt;
* Antoine Dumont &amp;lt;ardumont@softwareheritage.org&amp;gt; (ardumont on [[IRC]])&lt;br /&gt;
* Valentin Lorentz &amp;lt;vlorentz@softwareheritage.org&amp;gt; (vlorentz on [[IRC]])&lt;br /&gt;
&lt;br /&gt;
[[Category:Available GSoC task]]&lt;br /&gt;
[[Category:GSoC task]]&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Improve_and_extend_the_archive_Web_UI_(GSoC_task)&amp;diff=1675</id>
		<title>Improve and extend the archive Web UI (GSoC task)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Improve_and_extend_the_archive_Web_UI_(GSoC_task)&amp;diff=1675"/>
		<updated>2022-02-25T07:16:32Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: Add missing required info&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
As you probably know already, the Software Heritage archive can be&lt;br /&gt;
[https://archive.softwareheritage.org browsed on the Web]. The&lt;br /&gt;
[https://forge.softwareheritage.org/source/swh-web/ code] powering that&lt;br /&gt;
interface is a Django application that also implements a&lt;br /&gt;
[https://archive.softwareheritage.org/api/ Web API].&lt;br /&gt;
&lt;br /&gt;
== Task description ==&lt;br /&gt;
&lt;br /&gt;
Several improvements are possible on the archive Web interface and would make&lt;br /&gt;
great GSoC projects. Here are some ideas to whet your appetite:&lt;br /&gt;
&lt;br /&gt;
* add developer-oriented features, e.g., source file history, blame/praise interface, in-browser edit (with patch download), ... (note that this will also require backend design and implementation)&lt;br /&gt;
* improve [https://www.w3.org/WAI/ accessibility]&lt;br /&gt;
* display metadata we already mined from the archive&lt;br /&gt;
&lt;br /&gt;
Expected duration:&lt;br /&gt;
&lt;br /&gt;
* if you choose the developer-oriented features: 350 hours. Difficulty: hard&lt;br /&gt;
* for others mentioned above: 175 hours. Difficulty: easy&lt;br /&gt;
&lt;br /&gt;
== Desirable skills ==&lt;br /&gt;
&lt;br /&gt;
* Python 3 and Git are a must to work on any Software Heritage project&lt;br /&gt;
* Django&lt;br /&gt;
* web development and/or design&lt;br /&gt;
* JavaScript knowledge is useful but not required&lt;br /&gt;
&lt;br /&gt;
== Potential mentors ==&lt;br /&gt;
&lt;br /&gt;
* Antoine Lambert (anlambert on [[IRC]])&lt;br /&gt;
* Valentin Lorentz (vlorentz on [[IRC]])&lt;br /&gt;
* Jayesh Velayudhan (jayeshv on [[IRC]])&lt;br /&gt;
&lt;br /&gt;
== Other relevant (but independent) tasks ==&lt;br /&gt;
&lt;br /&gt;
[[Improve project search engine (GSoC task)]] is an independent task,&lt;br /&gt;
which may or may not involve improvements to the Web UI, depending on&lt;br /&gt;
your tastes.&lt;br /&gt;
&lt;br /&gt;
While this task can be about displaying metadata we already mined,&lt;br /&gt;
you may also be interested in [[Mine information from archived content (GSoC task)]]&lt;br /&gt;
and [[Mine information from external sources (GSoC task)]] to mine more&lt;br /&gt;
of this metadata; but those are completely independent tasks.&lt;br /&gt;
&lt;br /&gt;
[[Category:Available GSoC task]]&lt;br /&gt;
[[Category:GSoC task]]&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Dashboard_UI_for_the_Code_Scanner_(GSoC_task)&amp;diff=1674</id>
		<title>Dashboard UI for the Code Scanner (GSoC task)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Dashboard_UI_for_the_Code_Scanner_(GSoC_task)&amp;diff=1674"/>
		<updated>2022-02-25T07:14:37Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: /* Task description */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
The Software Heritage archive is the most comprehensive open data knowledge base about source code that has been published openly.&lt;br /&gt;
&lt;br /&gt;
As such, it can be used to scan local source code bases to detect which parts of it come from public code, including Free and Open Source Software.&lt;br /&gt;
&lt;br /&gt;
The Software Heritage scanner (&amp;lt;code&amp;gt;swh-scanner&amp;lt;/code&amp;gt;) ([https://docs.softwareheritage.org/devel/swh-scanner/ documentation], [https://forge.softwareheritage.org/source/swh-scanner/ code], [https://upsilon.cc/~zack/talks/2021/2021-04-07-llw.pdf slides of a 2021 presentation about swh-scanner]) is a command line tool that enables doing that.&lt;br /&gt;
&lt;br /&gt;
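The idea behind such a scan can be sketched as follows: each local file gets an intrinsic identifier (a SWHID) computed from its content alone, and that identifier can then be looked up in the archive. For file contents, the SWHID is built on the same SHA-1 that Git computes for a blob. The function below is a simplified illustration with a hypothetical name, not the code used by swh-scanner.&lt;br /&gt;
&lt;br /&gt;

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Compute the SWHID of a file content from its Git blob SHA-1."""
    # Git hashes a blob as the header "blob", a space, the length, a NUL byte,
    # followed by the raw content.
    header = b"blob %d\0" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


# The empty file has a well-known Git blob hash.
print(content_swhid(b""))  # swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

&lt;br /&gt;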
== Task description ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;swh-scanner&amp;lt;/code&amp;gt; is currently an experimental tool, which works well in practice, but needs a real '''dashboard user interface''' to be useful.&lt;br /&gt;
Several output options are currently available when invoking the &amp;lt;code&amp;gt;swh scanner scan&amp;lt;/code&amp;gt; command, in particular batch output in textual and JSON format, and an interactive dashboard (with the &amp;lt;code&amp;gt;-i/--interactive&amp;lt;/code&amp;gt; option).&lt;br /&gt;
&lt;br /&gt;
The interactive view currently works by producing a local HTML file and opening it using the local browser.&lt;br /&gt;
The goal of this project is to '''improve the interactive view''', making it a serious dashboard-style UI to peruse scanning results.&lt;br /&gt;
&lt;br /&gt;
The following improvements are suggested, although more can be proposed (and even more could be discovered during the project work):&lt;br /&gt;
&lt;br /&gt;
* Technology: generating a local HTML file is not necessarily the best way to render results; alternative solutions should be explored, including a self-hosted web app that renders results with state-of-the-art frontend web frameworks (CSS/HTML/JavaScript)&lt;br /&gt;
* Scalability: rendering currently doesn't work when scanning large code bases such as the Linux kernel; it should be made lazy, loading data only when it needs to be shown&lt;br /&gt;
* Functionality: dashboard rendering should be integrated with the possibility of opening the local source code files that have been scanned, e.g., users will want to be able to open in-browser files that have been detected as known/unknown, in order to figure out why&lt;br /&gt;
* Functionality: in the future, additional information will be added to scanning results, including license and provenance information. While it is not available yet due to backend limitations, the proposed UI should plan ahead for how/where to display such information&lt;br /&gt;
* Paper cuts: [https://forge.softwareheritage.org/tag/code_scanner/ various issues] affect the usability of swh-scanner; fixing them would be welcome as part of this project&lt;br /&gt;
&lt;br /&gt;
Expected duration: 350 hours. Difficulty: medium&lt;br /&gt;
&lt;br /&gt;
== Desirable skills ==&lt;br /&gt;
&lt;br /&gt;
* Python 3 and Git are a must to work on any Software Heritage project&lt;br /&gt;
* Basic understanding of the Software Heritage [https://docs.softwareheritage.org/devel/swh-model/data-model.html data model] and of [https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html SWHID identifiers]&lt;br /&gt;
* HTML/CSS/JavaScript and web development in general&lt;br /&gt;
* Working knowledge of UI/UX design principles&lt;br /&gt;
&lt;br /&gt;
== Potential mentors ==&lt;br /&gt;
&lt;br /&gt;
* Stefano Zacchiroli &amp;lt;zack@upsilon.cc&amp;gt; (zack on [[IRC]])&lt;br /&gt;
&lt;br /&gt;
[[Category:Available GSoC task]]&lt;br /&gt;
[[Category:GSoC task]]&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Create_embeddable_widgets_(GSoC_task)&amp;diff=1673</id>
		<title>Create embeddable widgets (GSoC task)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Create_embeddable_widgets_(GSoC_task)&amp;diff=1673"/>
		<updated>2022-02-25T07:14:22Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: Add missing required info&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
Create embeddable JS widgets to make SWH features easily available.&lt;br /&gt;
&lt;br /&gt;
== Task description ==&lt;br /&gt;
&lt;br /&gt;
The idea is to have a set of JS widgets that can be embedded in any web page with JS enabled.&lt;br /&gt;
Widgets will be smart enough to make their own API calls and render the results.&lt;br /&gt;
&lt;br /&gt;
Some widgets could be:&lt;br /&gt;
* SWH search box&lt;br /&gt;
* SWH search results&lt;br /&gt;
* SWH browse (for a project, revision, or file)&lt;br /&gt;
* Save Code Now&lt;br /&gt;
&lt;br /&gt;
Expected duration: 350 hours. Difficulty: hard&lt;br /&gt;
&lt;br /&gt;
== Desirable skills ==&lt;br /&gt;
* Python 3 and Git are a must to work on any Software Heritage project&lt;br /&gt;
* web development and/or design&lt;br /&gt;
* JavaScript&lt;br /&gt;
&lt;br /&gt;
== Potential mentors ==&lt;br /&gt;
* Jayesh Velayudhan (jayeshv on [[IRC]])&lt;br /&gt;
&lt;br /&gt;
[[Category:Available GSoC task]]&lt;br /&gt;
[[Category:GSoC task]]&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Create_a_browser_extension_(GSoC_task)&amp;diff=1672</id>
		<title>Create a browser extension (GSoC task)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Create_a_browser_extension_(GSoC_task)&amp;diff=1672"/>
		<updated>2022-02-25T07:13:55Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: Add missing required info&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
As you probably know already, the Software Heritage archive can be&lt;br /&gt;
[https://archive.softwareheritage.org browsed on the Web]. The&lt;br /&gt;
[https://forge.softwareheritage.org/source/swh-web/ code] powering that&lt;br /&gt;
interface is a Django application that also implements a&lt;br /&gt;
[https://archive.softwareheritage.org/api/ Web API].&lt;br /&gt;
&lt;br /&gt;
== Task description ==&lt;br /&gt;
&lt;br /&gt;
When browsing a repository (on GitHub, GitLab, ...) or a package description (on NPM, Debian),&lt;br /&gt;
people may want to check when (and if) this repository or package was last archived in Software Heritage.&lt;br /&gt;
Currently, this means opening the archive in a new tab, searching for the URL, and looking at the status.&lt;br /&gt;
Then, they can trigger a new archival with another few clicks (via the &amp;quot;Save Code Now&amp;quot; feature).&lt;br /&gt;
&lt;br /&gt;
This workflow may be streamlined by the [https://forge.softwareheritage.org/T3756 creation of a browser extension or bookmarklet].&lt;br /&gt;
This extension/bookmarklet would, for example, display an icon next to the URL bar indicating the status of the currently visited repository;&lt;br /&gt;
and clicking it would show details (like the date of last visit) and run &amp;quot;Save Code Now&amp;quot; in just two clicks.&lt;br /&gt;
&lt;br /&gt;
Expected duration: 175 hours. Difficulty: easy&lt;br /&gt;
&lt;br /&gt;
== Desirable skills ==&lt;br /&gt;
&lt;br /&gt;
* Python 3 and Git are a must to work on any Software Heritage project&lt;br /&gt;
* JavaScript experience is also needed for this project&lt;br /&gt;
* Prior experience in working with browser extensions is a plus&lt;br /&gt;
&lt;br /&gt;
== Potential mentors ==&lt;br /&gt;
&lt;br /&gt;
* Antoine Lambert (anlambert on [[IRC]])&lt;br /&gt;
* Valentin Lorentz (vlorentz on [[IRC]])&lt;br /&gt;
* Jayesh Velayudhan (jayeshv on [[IRC]])&lt;br /&gt;
&lt;br /&gt;
[[Category:Available GSoC task]]&lt;br /&gt;
[[Category:GSoC task]]&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Create_a_browser_extension_(GSoC_task)&amp;diff=1648</id>
		<title>Create a browser extension (GSoC task)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Create_a_browser_extension_(GSoC_task)&amp;diff=1648"/>
		<updated>2022-02-11T10:02:37Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: link to the task&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
As you probably know already, the Software Heritage archive can be&lt;br /&gt;
[https://archive.softwareheritage.org browsed on the Web]. The&lt;br /&gt;
[https://forge.softwareheritage.org/source/swh-web/ code] powering that&lt;br /&gt;
interface is a Django application that also implements a&lt;br /&gt;
[https://archive.softwareheritage.org/api/ Web API].&lt;br /&gt;
&lt;br /&gt;
== Task description ==&lt;br /&gt;
&lt;br /&gt;
When browsing a repository (on GitHub, GitLab, ...) or a package description (on NPM, Debian),&lt;br /&gt;
people may want to check whether (and when) this repository or package was last archived in Software Heritage.&lt;br /&gt;
Currently, this means opening the archive in a new tab, searching for the URL, and looking at the status.&lt;br /&gt;
Then, they can trigger a new archival with a few more clicks (via the &amp;quot;Save Code Now&amp;quot; feature).&lt;br /&gt;
&lt;br /&gt;
This workflow may be streamlined by the [https://forge.softwareheritage.org/T3756 creation of a browser extension or bookmarklet].&lt;br /&gt;
This extension/bookmarklet would, for example, display an icon next to the URL bar indicating the status of the currently visited repository;&lt;br /&gt;
clicking it would show details (like the date of the last visit) and let the user run &amp;quot;Save Code Now&amp;quot; in just two clicks.&lt;br /&gt;
&lt;br /&gt;
== Desirable skills ==&lt;br /&gt;
&lt;br /&gt;
* Python 3 and Git are a must to work on any Software Heritage project&lt;br /&gt;
* JavaScript experience is also needed for this project&lt;br /&gt;
* Prior experience in working with browser extensions is a plus&lt;br /&gt;
&lt;br /&gt;
== Potential mentors ==&lt;br /&gt;
&lt;br /&gt;
* Antoine Lambert (anlambert on [[IRC]])&lt;br /&gt;
* Valentin Lorentz (vlorentz on [[IRC]])&lt;br /&gt;
&lt;br /&gt;
[[Category:Available GSoC task]]&lt;br /&gt;
[[Category:GSoC task]]&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Make_the_software_deposit_service_(swh-deposit)_modular_(GSoC_task)&amp;diff=1647</id>
		<title>Make the software deposit service (swh-deposit) modular (GSoC task)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Make_the_software_deposit_service_(swh-deposit)_modular_(GSoC_task)&amp;diff=1647"/>
		<updated>2022-02-10T16:12:45Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
The Software Heritage archive is the most comprehensive open data knowledge base about source code that has been published openly.&lt;br /&gt;
&lt;br /&gt;
In addition to fetching source code from public repositories, it offers a Deposit service, which allows platforms to send code to Software Heritage for archival.&lt;br /&gt;
&lt;br /&gt;
This service is currently written as a monolith that grew over the years to include a complete [http://swordapp.github.io/SWORDv2-Profile/SWORDProfile.html SWORDv2] server, a partial SWORDv2 client with extensions, and business logic specific to Software Heritage in both. This makes the current code hard to maintain and impossible to reuse.&lt;br /&gt;
&lt;br /&gt;
== Task description ==&lt;br /&gt;
&lt;br /&gt;
[https://forge.softwareheritage.org/source/swh-deposit/ swh-deposit] would need to be split into the following parts:&lt;br /&gt;
&lt;br /&gt;
* a generic SWORDv2 server (based on Django)&lt;br /&gt;
* a generic SWORDv2 client&lt;br /&gt;
* server-side business logic (currently implemented mostly in swh/deposit/api/common.py, but tightly coupled with the views)&lt;br /&gt;
* client-side business logic&lt;br /&gt;
&lt;br /&gt;
The generic server and client will need to be extensively documented, so they can be reused by other software projects.&lt;br /&gt;
&lt;br /&gt;
Possible extensions include:&lt;br /&gt;
&lt;br /&gt;
* Designing the code so that it can be extended to support SWORDv3, if we ever need to support it&lt;br /&gt;
* A new administration front-end and/or addition of administrative tools in [https://forge.softwareheritage.org/source/swh-web/ swh-web]&lt;br /&gt;
&lt;br /&gt;
== Desirable skills ==&lt;br /&gt;
&lt;br /&gt;
* Python 3 and Git are a must to work on any Software Heritage project&lt;br /&gt;
* Basic understanding of the Software Heritage [https://docs.softwareheritage.org/devel/swh-model/data-model.html data model] and of [https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html SWHID identifiers]&lt;br /&gt;
* Experience with Django&lt;br /&gt;
&lt;br /&gt;
== Potential mentors ==&lt;br /&gt;
&lt;br /&gt;
* Antoine Dumont &amp;lt;ardumont@softwareheritage.org&amp;gt; (ardumont on [[IRC]])&lt;br /&gt;
* Valentin Lorentz &amp;lt;vlorentz@softwareheritage.org&amp;gt; (vlorentz on [[IRC]])&lt;br /&gt;
&lt;br /&gt;
[[Category:Available GSoC task]]&lt;br /&gt;
[[Category:GSoC task]]&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Create_a_browser_extension_(GSoC_task)&amp;diff=1646</id>
		<title>Create a browser extension (GSoC task)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Create_a_browser_extension_(GSoC_task)&amp;diff=1646"/>
		<updated>2022-02-10T16:07:25Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: Created page with &amp;quot;== Introduction ==  As you probably know already, The Software Heritage archive can be [https://archive.softwareheritage.org browsed on the Web]. The [https://forge.softwarehe...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
As you probably know already, the Software Heritage archive can be&lt;br /&gt;
[https://archive.softwareheritage.org browsed on the Web]. The&lt;br /&gt;
[https://forge.softwareheritage.org/source/swh-web/ code] powering that&lt;br /&gt;
interface is a Django application that also implements a&lt;br /&gt;
[https://archive.softwareheritage.org/api/ Web API].&lt;br /&gt;
&lt;br /&gt;
== Task description ==&lt;br /&gt;
&lt;br /&gt;
When browsing a repository (on GitHub, GitLab, ...) or a package description (on NPM, Debian),&lt;br /&gt;
people may want to check whether (and when) this repository or package was last archived in Software Heritage.&lt;br /&gt;
Currently, this means opening the archive in a new tab, searching for the URL, and looking at the status.&lt;br /&gt;
Then, they can trigger a new archival with a few more clicks (via the &amp;quot;Save Code Now&amp;quot; feature).&lt;br /&gt;
&lt;br /&gt;
This workflow may be streamlined by the creation of a browser extension or bookmarklet.&lt;br /&gt;
This extension/bookmarklet would, for example, display an icon next to the URL bar indicating the status of the currently visited repository;&lt;br /&gt;
clicking it would show details (like the date of the last visit) and let the user run &amp;quot;Save Code Now&amp;quot; in just two clicks.&lt;br /&gt;
&lt;br /&gt;
== Desirable skills ==&lt;br /&gt;
&lt;br /&gt;
* Python 3 and Git are a must to work on any Software Heritage project&lt;br /&gt;
* JavaScript experience is also needed for this project&lt;br /&gt;
* Prior experience in working with browser extensions is a plus&lt;br /&gt;
&lt;br /&gt;
== Potential mentors ==&lt;br /&gt;
&lt;br /&gt;
* Antoine Lambert (anlambert on [[IRC]])&lt;br /&gt;
* Valentin Lorentz (vlorentz on [[IRC]])&lt;br /&gt;
&lt;br /&gt;
[[Category:Available GSoC task]]&lt;br /&gt;
[[Category:GSoC task]]&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Make_the_software_deposit_service_(swh-deposit)_modular_(GSoC_task)&amp;diff=1645</id>
		<title>Make the software deposit service (swh-deposit) modular (GSoC task)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Make_the_software_deposit_service_(swh-deposit)_modular_(GSoC_task)&amp;diff=1645"/>
		<updated>2022-02-10T15:58:18Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: /* Task description */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
The Software Heritage archive is the most comprehensive open data knowledge base about source code that has been published openly.&lt;br /&gt;
&lt;br /&gt;
In addition to fetching source code from public repositories, it offers a Deposit service, which allows platforms to send code to Software Heritage for archival.&lt;br /&gt;
&lt;br /&gt;
This service is currently written as a monolith that grew over the years to include a complete [http://swordapp.github.io/SWORDv2-Profile/SWORDProfile.html SWORDv2] server, a partial SWORDv2 client with extensions, and business logic specific to Software Heritage in both. This makes the current code hard to maintain and impossible to reuse.&lt;br /&gt;
&lt;br /&gt;
== Task description ==&lt;br /&gt;
&lt;br /&gt;
[https://forge.softwareheritage.org/source/swh-deposit/ swh-deposit] would need to be split into the following parts:&lt;br /&gt;
&lt;br /&gt;
* a generic SWORDv2 server (based on Django)&lt;br /&gt;
* a generic SWORDv2 client&lt;br /&gt;
* server-side business logic&lt;br /&gt;
* client-side business logic&lt;br /&gt;
&lt;br /&gt;
The generic server and client will need to be extensively documented, so they can be reused by other software projects.&lt;br /&gt;
&lt;br /&gt;
Possible extensions include:&lt;br /&gt;
&lt;br /&gt;
* Designing the code so that it can be extended to support SWORDv3, if we ever need to support it&lt;br /&gt;
* A new administration front-end and/or addition of administrative tools in [https://forge.softwareheritage.org/source/swh-web/ swh-web]&lt;br /&gt;
&lt;br /&gt;
== Desirable skills ==&lt;br /&gt;
&lt;br /&gt;
* Python 3 and Git are a must to work on any Software Heritage project&lt;br /&gt;
* Basic understanding of the Software Heritage [https://docs.softwareheritage.org/devel/swh-model/data-model.html data model] and of [https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html SWHID identifiers]&lt;br /&gt;
* Experience with Django&lt;br /&gt;
&lt;br /&gt;
== Potential mentors ==&lt;br /&gt;
&lt;br /&gt;
* Antoine Dumont &amp;lt;ardumont@softwareheritage.org&amp;gt; (ardumont on [[IRC]])&lt;br /&gt;
* Valentin Lorentz &amp;lt;vlorentz@softwareheritage.org&amp;gt; (vlorentz on [[IRC]])&lt;br /&gt;
&lt;br /&gt;
[[Category:Available GSoC task]]&lt;br /&gt;
[[Category:GSoC task]]&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Make_the_software_deposit_service_(swh-deposit)_modular_(GSoC_task)&amp;diff=1644</id>
		<title>Make the software deposit service (swh-deposit) modular (GSoC task)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Make_the_software_deposit_service_(swh-deposit)_modular_(GSoC_task)&amp;diff=1644"/>
		<updated>2022-02-10T15:48:44Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: replace zack with ardumont as mentor&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
The Software Heritage archive is the most comprehensive open data knowledge base about source code that has been published openly.&lt;br /&gt;
&lt;br /&gt;
In addition to fetching source code from public repositories, it offers a Deposit service, which allows platforms to send code to Software Heritage for archival.&lt;br /&gt;
&lt;br /&gt;
This service is currently written as a monolith that grew over the years to include a complete [http://swordapp.github.io/SWORDv2-Profile/SWORDProfile.html SWORDv2] server, a partial SWORDv2 client with extensions, and business logic specific to Software Heritage in both. This makes the current code hard to maintain and impossible to reuse.&lt;br /&gt;
&lt;br /&gt;
== Task description ==&lt;br /&gt;
&lt;br /&gt;
[https://forge.softwareheritage.org/source/swh-deposit/ swh-deposit] would need to be split into the following parts:&lt;br /&gt;
&lt;br /&gt;
* a generic SWORDv2 server (based on Django)&lt;br /&gt;
* a generic SWORDv2 client&lt;br /&gt;
* server-side business logic&lt;br /&gt;
* client-side business logic&lt;br /&gt;
&lt;br /&gt;
The generic server and client will need to be extensively documented, so they can be reused by other software projects.&lt;br /&gt;
&lt;br /&gt;
Stretch goals include:&lt;br /&gt;
&lt;br /&gt;
* Designing the code so that it can be extended to support SWORDv3, if we ever need to support it&lt;br /&gt;
* A new administration front-end and/or addition of administrative tools in [https://forge.softwareheritage.org/source/swh-web/ swh-web]&lt;br /&gt;
&lt;br /&gt;
== Desirable skills ==&lt;br /&gt;
&lt;br /&gt;
* Python 3 and Git are a must to work on any Software Heritage project&lt;br /&gt;
* Basic understanding of the Software Heritage [https://docs.softwareheritage.org/devel/swh-model/data-model.html data model] and of [https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html SWHID identifiers]&lt;br /&gt;
* Experience with Django&lt;br /&gt;
&lt;br /&gt;
== Potential mentors ==&lt;br /&gt;
&lt;br /&gt;
* Antoine Dumont &amp;lt;ardumont@softwareheritage.org&amp;gt; (ardumont on [[IRC]])&lt;br /&gt;
* Valentin Lorentz &amp;lt;vlorentz@softwareheritage.org&amp;gt; (vlorentz on [[IRC]])&lt;br /&gt;
&lt;br /&gt;
[[Category:Available GSoC task]]&lt;br /&gt;
[[Category:GSoC task]]&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Make_the_software_deposit_service_(swh-deposit)_modular_(GSoC_task)&amp;diff=1643</id>
		<title>Make the software deposit service (swh-deposit) modular (GSoC task)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Make_the_software_deposit_service_(swh-deposit)_modular_(GSoC_task)&amp;diff=1643"/>
		<updated>2022-02-10T15:48:08Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: Created page with &amp;quot;== Introduction ==  The Software Heritage archive is the most comprehensive open data knowledge base about source code that has been published openly.  In addition to fetching...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
The Software Heritage archive is the most comprehensive open data knowledge base about source code that has been published openly.&lt;br /&gt;
&lt;br /&gt;
In addition to fetching source code from public repositories, it offers a Deposit service, which allows platforms to send code to Software Heritage for archival.&lt;br /&gt;
&lt;br /&gt;
This service is currently written as a monolith that grew over the years to include a complete [http://swordapp.github.io/SWORDv2-Profile/SWORDProfile.html SWORDv2] server, a partial SWORDv2 client with extensions, and business logic specific to Software Heritage in both. This makes the current code hard to maintain and impossible to reuse.&lt;br /&gt;
&lt;br /&gt;
== Task description ==&lt;br /&gt;
&lt;br /&gt;
[https://forge.softwareheritage.org/source/swh-deposit/ swh-deposit] would need to be split into the following parts:&lt;br /&gt;
&lt;br /&gt;
* a generic SWORDv2 server (based on Django)&lt;br /&gt;
* a generic SWORDv2 client&lt;br /&gt;
* server-side business logic&lt;br /&gt;
* client-side business logic&lt;br /&gt;
&lt;br /&gt;
The generic server and client will need to be extensively documented, so they can be reused by other software projects.&lt;br /&gt;
&lt;br /&gt;
Stretch goals include:&lt;br /&gt;
&lt;br /&gt;
* Designing the code so that it can be extended to support SWORDv3, if we ever need to support it&lt;br /&gt;
* A new administration front-end and/or addition of administrative tools in [https://forge.softwareheritage.org/source/swh-web/ swh-web]&lt;br /&gt;
&lt;br /&gt;
== Desirable skills ==&lt;br /&gt;
&lt;br /&gt;
* Python 3 and Git are a must to work on any Software Heritage project&lt;br /&gt;
* Basic understanding of the Software Heritage [https://docs.softwareheritage.org/devel/swh-model/data-model.html data model] and of [https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html SWHID identifiers]&lt;br /&gt;
* Experience with Django&lt;br /&gt;
&lt;br /&gt;
== Potential mentors ==&lt;br /&gt;
&lt;br /&gt;
* Stefano Zacchiroli &amp;lt;zack@upsilon.cc&amp;gt; (zack on [[IRC]])&lt;br /&gt;
* Valentin Lorentz &amp;lt;vlorentz@softwareheritage.org&amp;gt; (vlorentz on [[IRC]])&lt;br /&gt;
&lt;br /&gt;
[[Category:Available GSoC task]]&lt;br /&gt;
[[Category:GSoC task]]&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Google_Summer_of_Code_2022&amp;diff=1642</id>
		<title>Google Summer of Code 2022</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Google_Summer_of_Code_2022&amp;diff=1642"/>
		<updated>2022-02-08T12:02:33Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;div style=&amp;quot;text-align: center; font-size: 1.2em; border: solid 1px black; padding: 1em;&amp;quot;&amp;gt;&lt;br /&gt;
This page is a work in progress; Software Heritage has not yet applied to Google Summer of Code 2022.&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:GSoCLogo.png|512px]]&lt;br /&gt;
&lt;br /&gt;
== General information ==&lt;br /&gt;
&lt;br /&gt;
This page is the central point of information for [[Software Heritage]] participation in the [https://summerofcode.withgoogle.com/ Google Summer of Code] program in 2022.&lt;br /&gt;
&lt;br /&gt;
Google Summer of Code is a program where Google pays students stipends to work over the (northern hemisphere) summer on free software projects such as Software Heritage. Each student works with mentors from the community to complete a software project.&lt;br /&gt;
&lt;br /&gt;
== I want to participate as a student ==&lt;br /&gt;
&lt;br /&gt;
Great! We are very glad that you are interested in contributing to Software Heritage, and we are looking forward to working together.&lt;br /&gt;
&lt;br /&gt;
=== Prerequisites ===&lt;br /&gt;
&lt;br /&gt;
The following prerequisites apply to all Software Heritage GSoC projects:&lt;br /&gt;
&lt;br /&gt;
* [https://www.python.org Python 3] is our language of choice; you should be fluent in it to apply&lt;br /&gt;
* [https://git-scm.com Git] is our version control system of choice; you should be familiar with it to apply&lt;br /&gt;
* basic knowledge of using a command-line interface (CLI)&lt;br /&gt;
* additional prerequisites depend on the project you will work on; check project descriptions for details&lt;br /&gt;
&lt;br /&gt;
=== Before you apply ===&lt;br /&gt;
&lt;br /&gt;
Here are the steps you should follow before applying, to make sure you have a good grasp of what we are doing at Software Heritage and how we do it:&lt;br /&gt;
&lt;br /&gt;
# Follow our [https://docs.softwareheritage.org/devel/developer-setup.html developer setup tutorial]: it will make sure you have the source code of our software stack locally available and that you can run unit tests&lt;br /&gt;
# Create an account on our [https://forge.softwareheritage.org development forge]&lt;br /&gt;
# Familiarize yourself with our [[Code review in Phabricator|code review workflow]]&lt;br /&gt;
# Make at least one simple change to any one of our [https://docs.softwareheritage.org/devel/ software components] and submit it as a [https://forge.softwareheritage.org/differential/ diff] for code review, following the above workflow. [[Easy hacks]] and [https://forge.softwareheritage.org/project/view/20/ Web UI] issues are good options for what to fix, but feel free to submit any patch you think might be useful.&lt;br /&gt;
&lt;br /&gt;
=== What to include in your application ===&lt;br /&gt;
&lt;br /&gt;
Make sure that your application includes the following information:&lt;br /&gt;
&lt;br /&gt;
* Describe the '''specific project''' you want to work on. What do you want to achieve? Why is it important? Why is it useful for Software Heritage? The project might be one of the project ideas that we have prepared below, or something else entirely that you want to contribute to Software Heritage. If you have a source code archival pet peeve, surprise us!&lt;br /&gt;
* Detail your '''work plan''': a brief description of how you plan to go about your project, including a list of ''deliverables'' and a ''timeline'' of when you expect them to be available.&lt;br /&gt;
* Include a reference to '''the diff''' you submitted before applying (see the &amp;quot;Before you apply&amp;quot; section above).&lt;br /&gt;
&lt;br /&gt;
== Ideas list ==&lt;br /&gt;
&lt;br /&gt;
Below you can find a list of project ideas that are good options for a&lt;br /&gt;
reasonably-sized GSoC project:&lt;br /&gt;
&amp;lt;DynamicPageList&amp;gt;&lt;br /&gt;
category = Available GSoC task&lt;br /&gt;
ordermethod = sortkey&lt;br /&gt;
order = ascending&lt;br /&gt;
&amp;lt;/DynamicPageList&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We also maintain the following list of [[Internships]].&lt;br /&gt;
They are usually reserved for on-site university students, but during GSoC they are also available as GSoC project ideas:&lt;br /&gt;
&amp;lt;DynamicPageList&amp;gt;&lt;br /&gt;
category = Available internship&lt;br /&gt;
ordermethod = sortkey&lt;br /&gt;
order = ascending&lt;br /&gt;
&amp;lt;/DynamicPageList&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Both GSoC tasks and internship topics are just suggestions, though; don't feel&lt;br /&gt;
obliged to pick one of them if there is nothing that fits your taste and&lt;br /&gt;
abilities. Feel free to propose something else that you are excited about and&lt;br /&gt;
that contributes to improving the Software Heritage archive: we will be happy to&lt;br /&gt;
consider it!&lt;br /&gt;
&lt;br /&gt;
== Contact ==&lt;br /&gt;
&lt;br /&gt;
GSoC students are encouraged to get in touch with the Software Heritage community using the standard development communication channels, and in particular our [[IRC]] channel (#swh-devel on [https://libera.chat/ Libera Chat]) and mailing list (swh-devel).&lt;br /&gt;
&lt;br /&gt;
See our [https://www.softwareheritage.org/community/developers/ development information page] for details.&lt;br /&gt;
&lt;br /&gt;
== Timeline ==&lt;br /&gt;
&lt;br /&gt;
See the official [https://developers.google.com/open-source/gsoc/timeline Google Summer of Code timeline].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Google Summer of Code]]&lt;br /&gt;
[[Category:Google Summer of Code 2022]]&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Category:GSoC_task&amp;diff=1641</id>
		<title>Category:GSoC task</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Category:GSoC_task&amp;diff=1641"/>
		<updated>2022-02-08T12:01:15Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Google Summer of Code tasks, past, present, and future.&lt;br /&gt;
&lt;br /&gt;
See the [[Google_Summer_of_Code_2022|main page for GSoC 2022]]&lt;br /&gt;
&lt;br /&gt;
See also: the [[:Category:Available_GSoC_task|list of available tasks]]&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Category:GSoC_task&amp;diff=1640</id>
		<title>Category:GSoC task</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Category:GSoC_task&amp;diff=1640"/>
		<updated>2022-02-08T12:00:46Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Google Summer of Code tasks, past, present, and future.&lt;br /&gt;
&lt;br /&gt;
See the [[Google_Summer_of_Code_2022|main page for GSoC 2022]]&lt;br /&gt;
&lt;br /&gt;
See also: the [[Category:Available_GSoC_task|list of available tasks]]&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Category:Available_GSoC_task&amp;diff=1639</id>
		<title>Category:Available GSoC task</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Category:Available_GSoC_task&amp;diff=1639"/>
		<updated>2022-02-08T11:59:51Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: Created page with &amp;quot;Google Summer of Code tasks  See the main page for GSoC 2022&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Google Summer of Code tasks&lt;br /&gt;
&lt;br /&gt;
See the [[Google_Summer_of_Code_2022|main page for GSoC 2022]]&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Category:GSoC_task&amp;diff=1638</id>
		<title>Category:GSoC task</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Category:GSoC_task&amp;diff=1638"/>
		<updated>2022-02-08T11:59:29Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Google Summer of Code tasks&lt;br /&gt;
&lt;br /&gt;
See the [[Google_Summer_of_Code_2022|main page for GSoC 2022]]&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Mine_information_from_external_sources_(GSoC_task)&amp;diff=1637</id>
		<title>Mine information from external sources (GSoC task)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Mine_information_from_external_sources_(GSoC_task)&amp;diff=1637"/>
		<updated>2022-02-08T11:59:15Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
In addition to archiving source code artifacts, Software Heritage is interested in&lt;br /&gt;
archiving metadata from external sources and correlating it with source code artifacts.&lt;br /&gt;
This would also enable semantic searches on the archive and scientific research.&lt;br /&gt;
&lt;br /&gt;
Collecting this extrinsic metadata is a&lt;br /&gt;
[https://forge.softwareheritage.org/T1739 work in progress], and you are welcome&lt;br /&gt;
to contribute to its implementation.&lt;br /&gt;
&lt;br /&gt;
== Task description ==&lt;br /&gt;
&lt;br /&gt;
You would contribute to the design of our metadata-fetching architecture.&lt;br /&gt;
This includes:&lt;br /&gt;
&lt;br /&gt;
* Reviewing what metadata we want to fetch&lt;br /&gt;
* Working out how to efficiently fetch it at regular intervals and store it&lt;br /&gt;
* Implementing metadata fetching from at least one source, in a way that can be generalized to other sources&lt;br /&gt;
&lt;br /&gt;
== Desirable skills ==&lt;br /&gt;
&lt;br /&gt;
* Python 3 and Git are a must to work on any Software Heritage project&lt;br /&gt;
* Prior experience in working with software metadata is a plus, but not required&lt;br /&gt;
&lt;br /&gt;
== Potential mentors ==&lt;br /&gt;
&lt;br /&gt;
* Stefano Zacchiroli &amp;lt;zack@upsilon.cc&amp;gt; (zack on [[IRC]])&lt;br /&gt;
* Valentin Lorentz (vlorentz on [[IRC]])&lt;br /&gt;
&lt;br /&gt;
[[Category:Available GSoC task]]&lt;br /&gt;
[[Category:GSoC task]]&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Mine_information_from_archived_content_(GSoC_task)&amp;diff=1636</id>
		<title>Mine information from archived content (GSoC task)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Mine_information_from_archived_content_(GSoC_task)&amp;diff=1636"/>
		<updated>2022-02-08T11:59:11Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
In addition to archival, Software Heritage indexes the retrieved source code&lt;br /&gt;
artifacts, to enable semantic searches on the archive and scientific research.&lt;br /&gt;
&lt;br /&gt;
Indexing can happen at the individual file level (e.g., detecting the programming&lt;br /&gt;
language the file is written in or the license declared in its header), or at&lt;br /&gt;
a coarser granularity (e.g., what metadata is declared for the most&lt;br /&gt;
recently archived version of a given project).&lt;br /&gt;
&lt;br /&gt;
A number of indexers are [https://forge.softwareheritage.org/source/swh-indexer/ currently supported],&lt;br /&gt;
such as:&lt;br /&gt;
&lt;br /&gt;
* file-level mining:&lt;br /&gt;
** MIME type detection (using libmagic)&lt;br /&gt;
** license detection (using FOSSology/nomossa)&lt;br /&gt;
** language detection (using Pygments)&lt;br /&gt;
** ctags extraction (using universal-ctags)&lt;br /&gt;
* project-level mining:&lt;br /&gt;
** Ruby gemspec metadata&lt;br /&gt;
** Python PKG-INFO metadata&lt;br /&gt;
** Maven pom.xml metadata&lt;br /&gt;
** NPM package.json metadata&lt;br /&gt;
&lt;br /&gt;
== Task description ==&lt;br /&gt;
&lt;br /&gt;
Writing additional indexers that extract more information from archived source&lt;br /&gt;
code is welcome and would constitute a suitable GSoC project.&lt;br /&gt;
&lt;br /&gt;
Name the kind of data mining you want to do!&lt;br /&gt;
&lt;br /&gt;
For inspiration, you can have a look at [https://libraries.io Libraries.io], as&lt;br /&gt;
most package formats/package managers support dedicated ways of expressing&lt;br /&gt;
metadata and we only support a small number of them so far. But do not&lt;br /&gt;
restrict your ambition to those; any kind of data extraction/mining you want to&lt;br /&gt;
do on the archive could work.&lt;br /&gt;
&lt;br /&gt;
You may also add support for multiple formats at once, using an external tool,&lt;br /&gt;
such as [https://github.com/datacite/bolognese Bolognese] or&lt;br /&gt;
[https://github.com/librariesio/bibliothecary/ bibliothecary].&lt;br /&gt;
&lt;br /&gt;
== Desirable skills ==&lt;br /&gt;
&lt;br /&gt;
* Python 3 and Git are a must to work on any Software Heritage project&lt;br /&gt;
* Prior experience in working with (source code) metadata is a plus, but not required&lt;br /&gt;
&lt;br /&gt;
== Potential mentors ==&lt;br /&gt;
&lt;br /&gt;
* Stefano Zacchiroli &amp;lt;zack@upsilon.cc&amp;gt; (zack on [[IRC]])&lt;br /&gt;
* Valentin Lorentz (vlorentz on [[IRC]])&lt;br /&gt;
&lt;br /&gt;
[[Category:Available GSoC task]]&lt;br /&gt;
[[Category:GSoC task]]&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Improve_the_Code_Scanner_(GSoC_task)&amp;diff=1635</id>
		<title>Improve the Code Scanner (GSoC task)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Improve_the_Code_Scanner_(GSoC_task)&amp;diff=1635"/>
		<updated>2022-02-08T11:59:06Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
The Software Heritage archive is the most comprehensive open data knowledge base about source code that has been published openly.&lt;br /&gt;
&lt;br /&gt;
As such, it can be used to scan local source code bases to detect which parts of them come from public code, including Free and Open Source Software.&lt;br /&gt;
&lt;br /&gt;
The Software Heritage scanner (&amp;lt;code&amp;gt;swh-scanner&amp;lt;/code&amp;gt;) ([https://docs.softwareheritage.org/devel/swh-scanner/ documentation], [https://forge.softwareheritage.org/source/swh-scanner/ code]) is a command-line tool that does this.&lt;br /&gt;
&lt;br /&gt;
== Task description ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;swh-scanner&amp;lt;/code&amp;gt; is currently an experimental tool, which works well in practice, but needs some polishing to make it usable in production in real use cases.&lt;br /&gt;
&lt;br /&gt;
Several improvements are possible:&lt;br /&gt;
&lt;br /&gt;
* integrate provenance information in results (using [https://docs.softwareheritage.org/devel/swh-graph/ swh-graph] and/or [https://forge.softwareheritage.org/source/swh-provenance/ swh-provenance])&lt;br /&gt;
* make the different algorithms (see [https://forge.softwareheritage.org/source/swh-scanner/history/benchmark/ benchmark branch] in Git) used to query the backend user-selectable&lt;br /&gt;
* minimize the number of queries to the [https://archive.softwareheritage.org/api/1/known/doc/ /known API endpoint], in order to consume less of the API rate limit and be generally more efficient&lt;br /&gt;
* be adaptive in how the backend is queried, e.g., for code trees that contain fewer than 1000 files it is more efficient to just query all of them at once, without following the DAG structure (even though following the DAG is the faster approach in theory)&lt;br /&gt;
* improve the web-based dashboard view (&amp;lt;code&amp;gt;--interactive&amp;lt;/code&amp;gt;), making it more user friendly&lt;br /&gt;
* add support for scanning from within repositories (e.g., if the code is in git, we can look up the commit ID directly)&lt;br /&gt;
* add progress reporting during scanning, in particular for large code bases&lt;br /&gt;
* add on-disk caching, in particular for large code bases&lt;br /&gt;
* integrate into the generated output other information available from the archive, e.g., license information, metadata, provenance information, etc.&lt;br /&gt;
* general code improvements, including refactoring and deduplication w.r.t. the rest of the Software Heritage code base (see [https://forge.softwareheritage.org/tag/code_scanner/ open tasks])&lt;br /&gt;
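&lt;br /&gt;
A minimal Python sketch of the adaptive querying idea from the list above. It is an illustration only, not the actual swh-scanner code: the in-memory tree, the &amp;lt;code&amp;gt;known&amp;lt;/code&amp;gt; callback (a stand-in for the /known endpoint), the threshold and the batch size are all assumptions made for this example. It also assumes that when a directory is known to the archive, everything below it is known too.&lt;br /&gt;

```python
from typing import Callable, Dict, Iterable, List, Set

# Hypothetical in-memory tree: directory id -> child ids.
# Ids that are not keys of the mapping are treated as files.
Tree = Dict[str, List[str]]
# Hypothetical stand-in for the /known endpoint: takes a batch of ids
# and returns the subset that the archive already knows.
KnownFn = Callable[[List[str]], Set[str]]


def subtree(tree: Tree, root: str) -> List[str]:
    """Return root and all of its descendants, breadth-first."""
    nodes = [root]
    for node in nodes:  # the list grows while we iterate over it
        nodes.extend(tree.get(node, []))
    return nodes


def batches(items: List[str], size: int) -> Iterable[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def scan_flat(tree: Tree, root: str, known: KnownFn, batch_size: int) -> Set[str]:
    """Query every node, ignoring the DAG structure."""
    found: Set[str] = set()
    for batch in batches(subtree(tree, root), batch_size):
        found |= known(batch)
    return found


def scan_dag(tree: Tree, root: str, known: KnownFn, batch_size: int) -> Set[str]:
    """Query level by level and do not descend into known directories."""
    found: Set[str] = set()
    level = [root]
    while level:
        hits: Set[str] = set()
        for batch in batches(level, batch_size):
            hits |= known(batch)
        for node in hits:
            # Assumption: a known directory implies a known subtree.
            found.update(subtree(tree, node))
        level = [
            child
            for node in level
            if node not in hits
            for child in tree.get(node, [])
        ]
    return found


def scan(tree: Tree, root: str, known: KnownFn,
         flat_threshold: int = 1000, batch_size: int = 1000) -> Set[str]:
    """Pick the flat strategy for small trees, the DAG walk otherwise."""
    n_files = sum(1 for node in subtree(tree, root) if node not in tree)
    if flat_threshold > n_files:
        return scan_flat(tree, root, known, batch_size)
    return scan_dag(tree, root, known, batch_size)
```

Both strategies return the same set of known identifiers; they differ only in how many queries they send. The flat strategy sends one query per batch of nodes, while the DAG walk sends at least one query per tree level but can skip whole subtrees.&lt;br /&gt;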
&lt;br /&gt;
== Desirable skills ==&lt;br /&gt;
&lt;br /&gt;
* Python 3 and Git are a must to work on any Software Heritage project&lt;br /&gt;
* Basic understanding of the Software Heritage [https://docs.softwareheritage.org/devel/swh-model/data-model.html data model] and of [https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html SWHID identifiers]&lt;br /&gt;
* JavaScript and front-end web development, if you want to work on the interactive dashboard&lt;br /&gt;
&lt;br /&gt;
== Potential mentors ==&lt;br /&gt;
&lt;br /&gt;
* Stefano Zacchiroli &amp;lt;zack@upsilon.cc&amp;gt; (zack on [[IRC]])&lt;br /&gt;
&lt;br /&gt;
[[Category:Available GSoC task]]&lt;br /&gt;
[[Category:GSoC task]]&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Improve_and_extend_the_archive_Web_UI_(GSoC_task)&amp;diff=1634</id>
		<title>Improve and extend the archive Web UI (GSoC task)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Improve_and_extend_the_archive_Web_UI_(GSoC_task)&amp;diff=1634"/>
		<updated>2022-02-08T11:58:54Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
As you probably know already, the Software Heritage archive can be&lt;br /&gt;
[https://archive.softwareheritage.org browsed on the Web]. The&lt;br /&gt;
[https://forge.softwareheritage.org/source/swh-web/ code] powering that&lt;br /&gt;
interface is a Django application that also implements a&lt;br /&gt;
[https://archive.softwareheritage.org/api/ Web API].&lt;br /&gt;
&lt;br /&gt;
== Task description ==&lt;br /&gt;
&lt;br /&gt;
Several improvements are possible on the archive Web interface and would make&lt;br /&gt;
great GSoC projects. Some ideas to whet your appetite:&lt;br /&gt;
&lt;br /&gt;
* add developer-oriented features, e.g., source file history, blame/praise interface, in-browser edit (with patch download), ... (note that this will also require backend design and implementation)&lt;br /&gt;
* improve [https://www.w3.org/WAI/ accessibility]&lt;br /&gt;
* display metadata we already mined from the archive&lt;br /&gt;
* help us design and implement our [https://forge.softwareheritage.org/T1805 next API version]&lt;br /&gt;
&lt;br /&gt;
== Desirable skills ==&lt;br /&gt;
&lt;br /&gt;
* Python 3 and Git are a must to work on any Software Heritage project&lt;br /&gt;
* Django&lt;br /&gt;
* web development and/or design&lt;br /&gt;
* JavaScript knowledge is useful but not required&lt;br /&gt;
&lt;br /&gt;
== Potential mentors ==&lt;br /&gt;
&lt;br /&gt;
* Antoine Lambert (anlambert on [[IRC]])&lt;br /&gt;
* Valentin Lorentz (vlorentz on [[IRC]])&lt;br /&gt;
&lt;br /&gt;
== Other relevant (but independent) tasks ==&lt;br /&gt;
&lt;br /&gt;
[[Improve project search engine (GSoC task)]] is an independent task,&lt;br /&gt;
which may or may not involve improvements to the Web UI, depending on&lt;br /&gt;
your tastes.&lt;br /&gt;
&lt;br /&gt;
While this task can be about displaying metadata we already mined,&lt;br /&gt;
you may also be interested in [[Mine information from archived content (GSoC task)]]&lt;br /&gt;
and [[Mine information from external sources (GSoC task)]] to mine more&lt;br /&gt;
of this metadata; but those are completely independent tasks.&lt;br /&gt;
&lt;br /&gt;
[[Category:Available GSoC task]]&lt;br /&gt;
[[Category:GSoC task]]&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Google_Summer_of_Code_2022&amp;diff=1633</id>
		<title>Google Summer of Code 2022</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Google_Summer_of_Code_2022&amp;diff=1633"/>
		<updated>2022-02-08T11:58:29Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: Created page with &amp;quot;512px  == General information ==  This page is the central point of information for Software Heritage participation into the [https://summerofcode.wi...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:GSoCLogo.png|512px]]&lt;br /&gt;
&lt;br /&gt;
== General information ==&lt;br /&gt;
&lt;br /&gt;
This page is the central point of information for [[Software Heritage]] participation in the [https://summerofcode.withgoogle.com/ Google Summer of Code] program in 2022.&lt;br /&gt;
&lt;br /&gt;
Google Summer of Code is a program where Google pays students stipends to work over the (northern hemisphere) summer on free software projects such as Software Heritage. Each student works with mentors from the community to complete a software project.&lt;br /&gt;
&lt;br /&gt;
== I want to participate as a student ==&lt;br /&gt;
&lt;br /&gt;
Great! We are very glad about your interest in contributing to Software Heritage and we are looking forward to working together.&lt;br /&gt;
&lt;br /&gt;
=== Prerequisites ===&lt;br /&gt;
&lt;br /&gt;
The following prerequisites apply to all Software Heritage GSoC projects:&lt;br /&gt;
&lt;br /&gt;
* [https://www.python.org Python 3] is our language of choice; you should be fluent in it to apply&lt;br /&gt;
* [https://git-scm.com Git] is our version control system of choice; you should be familiar with it to apply&lt;br /&gt;
* basic knowledge of using a command-line interface (CLI)&lt;br /&gt;
* additional prerequisites depend on the project you will work on; check project descriptions for details&lt;br /&gt;
&lt;br /&gt;
=== Before you apply ===&lt;br /&gt;
&lt;br /&gt;
Here are the steps you should follow before applying, to make sure you have a good grasp of what we are doing at Software Heritage and how we do it:&lt;br /&gt;
&lt;br /&gt;
# Follow our [https://docs.softwareheritage.org/devel/developer-setup.html developer setup tutorial]: it will make sure you have the source code of our software stack locally available and that you can run unit tests&lt;br /&gt;
# Create an account on our [https://forge.softwareheritage.org development forge]&lt;br /&gt;
# Familiarize yourself with our [[Code review in Phabricator|code review workflow]]&lt;br /&gt;
# Make at least one simple change to any one of our [https://docs.softwareheritage.org/devel/ software components] and submit it as a [https://forge.softwareheritage.org/differential/ diff] for code review, following the above workflow. [[Easy hacks]] and [https://forge.softwareheritage.org/project/view/20/ Web UI] issues are good options for what to fix, but feel free to submit any patch you think might be useful.&lt;br /&gt;
&lt;br /&gt;
=== What to include in your application ===&lt;br /&gt;
&lt;br /&gt;
Make sure that your application includes the following information:&lt;br /&gt;
&lt;br /&gt;
* Describe the '''specific project''' you want to work on. What do you want to achieve? Why is it important? Why is it useful for Software Heritage? The project might be one of the project ideas that we have prepared below, or something else entirely that you want to contribute to Software Heritage. Your source code archival pet peeve, surprise us!&lt;br /&gt;
* Detail your '''work plan''': a brief description of how you plan to go about your project, including a list of ''deliverables'' and a ''timeline'' of when you expect them to be available.&lt;br /&gt;
* Include a reference to '''the diff''' you submitted before applying (see the &amp;quot;Before you apply&amp;quot; section above).&lt;br /&gt;
&lt;br /&gt;
== Ideas list ==&lt;br /&gt;
&lt;br /&gt;
Below you can find a list of project ideas that are good options for a&lt;br /&gt;
reasonably-sized GSoC project:&lt;br /&gt;
&amp;lt;DynamicPageList&amp;gt;&lt;br /&gt;
category = Available GSoC task&lt;br /&gt;
ordermethod = sortkey&lt;br /&gt;
order = ascending&lt;br /&gt;
&amp;lt;/DynamicPageList&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We also maintain the following list of [[Internships]].&lt;br /&gt;
They are usually reserved for on-site university students, but during GSoC they are also available as GSoC project ideas:&lt;br /&gt;
&amp;lt;DynamicPageList&amp;gt;&lt;br /&gt;
category = Available internship&lt;br /&gt;
ordermethod = sortkey&lt;br /&gt;
order = ascending&lt;br /&gt;
&amp;lt;/DynamicPageList&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Both GSoC tasks and internship topics are just suggestions, though; don't feel&lt;br /&gt;
obliged to pick one of them if there is nothing that fits your taste and&lt;br /&gt;
abilities.  Feel free to propose something else that you are excited about and&lt;br /&gt;
that contributes to improving the Software Heritage archive: we will be happy to&lt;br /&gt;
consider it!&lt;br /&gt;
&lt;br /&gt;
== Contact ==&lt;br /&gt;
&lt;br /&gt;
GSoC students are encouraged to get in touch with the Software Heritage community using the standard development communication channels, and in particular our [[IRC]] channel (#swh-devel on [https://libera.chat/ Libera Chat]) and mailing list (swh-devel).&lt;br /&gt;
&lt;br /&gt;
See our [https://www.softwareheritage.org/community/developers/ development information page] for details.&lt;br /&gt;
&lt;br /&gt;
== Timeline ==&lt;br /&gt;
&lt;br /&gt;
See the official [https://developers.google.com/open-source/gsoc/timeline Google Summer of Code timeline].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Google Summer of Code]]&lt;br /&gt;
[[Category:Google Summer of Code 2022]]&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Language_and_infrastructure_for_analyzing_the_archive_(internship)&amp;diff=1563</id>
		<title>Language and infrastructure for analyzing the archive (internship)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Language_and_infrastructure_for_analyzing_the_archive_(internship)&amp;diff=1563"/>
		<updated>2021-05-07T11:42:18Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: Fix broken link to Boa&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;(see also: the [[Language_and_infrastructure_for_analyzing_the_archive_(internship)/fr|French version]] of this topic)&lt;br /&gt;
----&lt;br /&gt;
{{Internship&lt;br /&gt;
|description=&lt;br /&gt;
&lt;br /&gt;
The Software Heritage archive is structured as a graph (specifically, a&lt;br /&gt;
[https://en.wikipedia.org/wiki/Merkle_tree Merkle DAG]) and is huge: tens of&lt;br /&gt;
billions of nodes, hundreds of billions of edges.  The graph exhibits a lot of sharing:&lt;br /&gt;
the same source code files and directories can be reached starting from many&lt;br /&gt;
different commits (e.g., different commits in the same repository), and the&lt;br /&gt;
same commits can be reached starting from many different repositories (e.g.,&lt;br /&gt;
repositories that are &amp;quot;forks&amp;quot; of one another).&lt;br /&gt;
When analyzing source code at a very large scale (e.g., all the commits of the&lt;br /&gt;
same large repository, or even all projects hosted on GitHub) it is pointless,&lt;br /&gt;
and a waste of resources, to re-analyze source code artifacts that have already&lt;br /&gt;
been analyzed and are encountered again later due to sharing in the&lt;br /&gt;
graph.&lt;br /&gt;
&lt;br /&gt;
The goal of this internship is to design and implement a prototype platform&lt;br /&gt;
(similar in spirit to [http://design.cs.iastate.edu/papers/ICSE-13/icse13.pdf Boa])&lt;br /&gt;
that allows describing empirical experiments to be run on the Software&lt;br /&gt;
Heritage archive, exploiting artifact sharing as a way to speed up the&lt;br /&gt;
analysis. The platform will consist of a simple language to describe&lt;br /&gt;
experiments (e.g., &amp;quot;start from these repositories and run this script on all&lt;br /&gt;
files in each commit&amp;quot;) and of a runtime implementing the language that&lt;br /&gt;
transparently handles caching of previous results.&lt;br /&gt;
As a stretch goal: the runtime will delegate actual compute to multiple&lt;br /&gt;
workers, running either on a single machine or distributed over a cluster.&lt;br /&gt;
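&lt;br /&gt;
As a rough Python illustration of how caching can exploit sharing in the graph (not a design for the platform itself): the graph layout, the &amp;lt;code&amp;gt;analyze&amp;lt;/code&amp;gt; function and the per-file measurement callback below are hypothetical names chosen for this sketch. Each node is analyzed at most once, however many commits or repositories reach it.&lt;br /&gt;

```python
from typing import Callable, Dict, List

# Hypothetical toy graph: node id -> child ids.
# Ids that are not keys of the mapping are treated as files.
Graph = Dict[str, List[str]]


def analyze(graph: Graph, roots: List[str],
            measure_file: Callable[[str], int]) -> Dict[str, int]:
    """Sum a per-file measurement under each root, reusing cached results."""
    cache: Dict[str, int] = {}

    def visit(node: str) -> int:
        if node not in cache:
            children = graph.get(node)
            if children is None:
                # Leaf: run the (possibly expensive) analysis once.
                cache[node] = measure_file(node)
            else:
                # Inner node: combine the results of its children.
                cache[node] = sum(visit(child) for child in children)
        return cache[node]

    return {root: visit(root) for root in roots}


# Two commits share the directory "dirA"; its files are measured only once.
graph = {
    "commit1": ["dirA", "f3"],
    "commit2": ["dirA", "f4"],
    "dirA": ["f1", "f2"],
}
sizes = {"f1": 10, "f2": 20, "f3": 5, "f4": 7}
measured: List[str] = []


def measure(node: str) -> int:
    measured.append(node)
    return sizes[node]


result = analyze(graph, ["commit1", "commit2"], measure)
```

The recursive walk keeps the sketch short; a real runtime would need an iterative traversal and a persistent cache to cope with the depth and size of the actual graph.&lt;br /&gt;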
&lt;br /&gt;
If successfully implemented, the internship will conclude with a demonstration&lt;br /&gt;
(e.g., in the form of a paper) benchmarking in practice the performance&lt;br /&gt;
advantages of the proposed approach over a naive implementation.&lt;br /&gt;
&lt;br /&gt;
|skills=&lt;br /&gt;
* Python development&lt;br /&gt;
* experience with functional programming&lt;br /&gt;
&lt;br /&gt;
Will be considered a plus:&lt;br /&gt;
* experience with programming language theory and implementation&lt;br /&gt;
* experience with the [https://en.wikipedia.org/wiki/MapReduce MapReduce] programming model&lt;br /&gt;
&lt;br /&gt;
|mentors=&lt;br /&gt;
* Antoine Pietri (seirl on [[IRC]])&lt;br /&gt;
* Stefano Zacchiroli &amp;lt;zack@upsilon.cc&amp;gt; (zack on [[IRC]])&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
[[Category:Ongoing internship]]&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Mine_information_from_external_sources_(GSoC_task)&amp;diff=1556</id>
		<title>Mine information from external sources (GSoC task)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Mine_information_from_external_sources_(GSoC_task)&amp;diff=1556"/>
		<updated>2021-04-12T09:34:30Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: Removing &amp;quot;Possible metadata sources&amp;quot;; we want to review the list after the internship starts; and the wishlist should be in subtasks of https://forge.softwareheritage.org/T2202&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
In addition to archiving source code artifacts, Software Heritage is interested in&lt;br /&gt;
archiving metadata from external sources and correlating it with source code artifacts.&lt;br /&gt;
This will also enable semantic searches on the archive and scientific research.&lt;br /&gt;
&lt;br /&gt;
Collecting this extrinsic metadata is a&lt;br /&gt;
[https://forge.softwareheritage.org/T1739 work in progress], and you are welcome&lt;br /&gt;
to contribute to its implementation.&lt;br /&gt;
&lt;br /&gt;
== Task description ==&lt;br /&gt;
&lt;br /&gt;
You would contribute to the design of our metadata-fetching architecture.&lt;br /&gt;
This includes:&lt;br /&gt;
&lt;br /&gt;
* Reviewing what metadata we want to fetch&lt;br /&gt;
* Working out how to efficiently fetch it at regular intervals and store it&lt;br /&gt;
* Implementing metadata fetching from at least one source, in a way that can be generalized to other sources&lt;br /&gt;
&lt;br /&gt;
== Desirable skills ==&lt;br /&gt;
&lt;br /&gt;
* Python 3 and Git are a must to work on any Software Heritage project&lt;br /&gt;
* Prior experience in working with software metadata is a plus, but not required&lt;br /&gt;
&lt;br /&gt;
== Potential mentors ==&lt;br /&gt;
&lt;br /&gt;
* Stefano Zacchiroli &amp;lt;zack@upsilon.cc&amp;gt; (zack on [[IRC]])&lt;br /&gt;
* Valentin Lorentz (vlorentz on [[IRC]])&lt;br /&gt;
&lt;br /&gt;
[[Category:GSoC task]]&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Matrix&amp;diff=1555</id>
		<title>Matrix</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Matrix&amp;diff=1555"/>
		<updated>2021-04-12T08:50:22Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: /* IRC access list */ add command to op yourself&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== IRC channels ==&lt;br /&gt;
&lt;br /&gt;
The following channels have been registered on the [https://freenode.net/ Freenode] network for [[Software Heritage]] usage.&lt;br /&gt;
&lt;br /&gt;
* [https://app.element.io/#/room/#freenode_#swh-devel:matrix.org '''#swh-devel''']: public development discussions&lt;br /&gt;
* [https://app.element.io/#/room/#freenode_#swh-team:matrix.org '''#swh-team''']: private discussions of the core team&lt;br /&gt;
* [https://app.element.io/#/room/#freenode_#swh-sysadm:matrix.org '''#swh-sysadm''']: operations team discussions/bots&lt;br /&gt;
* [https://app.element.io/#/room/#freenode_#softwareheritage:matrix.org '''#softwareheritage''']: general discussions about the project (currently unused)&lt;br /&gt;
* [https://app.element.io/#/room/#freenode_#swh:matrix.org '''#swh''']: ditto, in case we end up preferring the short version&lt;br /&gt;
&lt;br /&gt;
If you use IRC, consider joining the channels.&lt;br /&gt;
&lt;br /&gt;
If you don't use IRC ''directly'', you can still join our chat channels from your web browser via a [https://matrix.org/ Matrix] bridge by clicking on the channel names in the list above. You will be asked to create a [https://element.io/ Element] account if you don't have one yet.&lt;br /&gt;
&lt;br /&gt;
== IRC authentication ==&lt;br /&gt;
&lt;br /&gt;
You should register your nick with NickServ using:&lt;br /&gt;
&lt;br /&gt;
 /nick &amp;lt;USERNAME&amp;gt;&lt;br /&gt;
 /msg nickserv register &amp;lt;PASSWORD&amp;gt; &amp;lt;EMAIL&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You will then receive an e-mail containing a link to activate your account. After doing so, you need to configure your client to auto-authenticate. The recommended way of doing that is using [https://freenode.net/kb/answer/sasl SASL authentication].&lt;br /&gt;
&lt;br /&gt;
For Weechat:&lt;br /&gt;
&lt;br /&gt;
 /set irc.server.freenode.sasl_username &amp;lt;USERNAME&amp;gt;&lt;br /&gt;
 /set irc.server.freenode.sasl_password &amp;lt;PASSWORD&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For Matrix, the relevant documentation is here: https://github.com/matrix-org/matrix-appservice-irc/wiki/End-user-FAQ#how-do-i-registeridentify-to-nickserv&lt;br /&gt;
&lt;br /&gt;
Freenode also supports authentication via [https://freenode.net/kb/answer/certfp TLS client certificates].&lt;br /&gt;
&lt;br /&gt;
=== Matrix bridge ===&lt;br /&gt;
&lt;br /&gt;
For the Matrix bridge ([https://github.com/matrix-org/matrix-appservice-irc/wiki/End-user-FAQ#how-do-i-registeridentify-to-nickserv relevant docs here]), you first need to choose your IRC nick by sending this to the bridge appservice:&lt;br /&gt;
&lt;br /&gt;
 /msg @appservice-irc:matrix.org !nick &amp;lt;USERNAME&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then send this command to NickServ to register your account:&lt;br /&gt;
&lt;br /&gt;
 /msg @freenode_NickServ:matrix.org register &amp;lt;PASSWORD&amp;gt; &amp;lt;EMAIL&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You will then receive an e-mail containing a link to activate your account. You can then identify using:&lt;br /&gt;
&lt;br /&gt;
 /msg @freenode_NickServ:matrix.org identify &amp;lt;PASSWORD&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then, ask the IRC bridge appservice to remember your password so that you get identified automatically:&lt;br /&gt;
&lt;br /&gt;
 /msg @appservice-irc:matrix.org !storepass &amp;lt;USERNAME&amp;gt;:&amp;lt;PASSWORD&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== IRC access list ==&lt;br /&gt;
&lt;br /&gt;
To auto-voice people with a registered nick, add them to the channel access list (only people with the +fA access flags can do this):&lt;br /&gt;
&lt;br /&gt;
 /msg chanserv access #swh-devel add zack +V&lt;br /&gt;
&lt;br /&gt;
If you already have the right (+o ChanServ flag), you can make yourself an operator with:&lt;br /&gt;
&lt;br /&gt;
 /msg chanserv OP #swh-devel&lt;br /&gt;
&lt;br /&gt;
[[Category:Infrastructure]]&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Code_review&amp;diff=1552</id>
		<title>Code review</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Code_review&amp;diff=1552"/>
		<updated>2021-03-31T12:44:49Z</updated>

		<summary type="html">&lt;p&gt;Vlorentz: Undo revision 1550 by Vlorentz (talk)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;#REDIRECT [[swhdocs:devel/contributing/code-review.html#code-review]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Software development]]&lt;/div&gt;</summary>
		<author><name>Vlorentz</name></author>
	</entry>
</feed>