<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.softwareheritage.org/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=KShivendu</id>
	<title>Software Heritage Wiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.softwareheritage.org/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=KShivendu"/>
	<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/wiki/Special:Contributions/KShivendu"/>
	<updated>2026-04-21T03:39:04Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.39.10</generator>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Add_sources_to_the_project_search_engine_(GSoC_task)&amp;diff=1710</id>
		<title>Add sources to the project search engine (GSoC task)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Add_sources_to_the_project_search_engine_(GSoC_task)&amp;diff=1710"/>
		<updated>2022-04-02T06:22:47Z</updated>

		<summary type="html">&lt;p&gt;KShivendu: Add expected duration section&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
The [https://archive.softwareheritage.org/ homepage of the Software Heritage archive]&lt;br /&gt;
features a small search engine that searches project URLs and project metadata.&lt;br /&gt;
Project metadata includes name, description, authors, etc.&lt;br /&gt;
&lt;br /&gt;
This is implemented by a Python service backed by an Elasticsearch database,&lt;br /&gt;
which contains one document for each project; each document contains metadata&lt;br /&gt;
mined from the project itself.&lt;br /&gt;
&lt;br /&gt;
== Task description ==&lt;br /&gt;
&lt;br /&gt;
We would like to add more data sources to the Elasticsearch database;&lt;br /&gt;
typically sources that are not authoritative, but provide metadata of usually&lt;br /&gt;
good quality.&lt;br /&gt;
&lt;br /&gt;
This comes with the following challenges:&lt;br /&gt;
&lt;br /&gt;
# there are multiple sources, and their contents must work together&lt;br /&gt;
# sources differ in reliability, which should be taken into account when ranking search results&lt;br /&gt;
&lt;br /&gt;
Therefore, this task will require making a plan to address these challenges,&lt;br /&gt;
defining a data model, and finally implementing it in a backend.&lt;br /&gt;
It may involve some frontend work if necessary, to provide an interface for&lt;br /&gt;
these.&lt;br /&gt;
&lt;br /&gt;
== Expected duration ==&lt;br /&gt;
175 or 350 hours, at your option (longer duration means you can handle more data sources). Difficulty: medium&lt;br /&gt;
&lt;br /&gt;
== Desirable skills ==&lt;br /&gt;
&lt;br /&gt;
* Python 3 and Git are a must to work on any Software Heritage project&lt;br /&gt;
* Elasticsearch&lt;br /&gt;
* Experience with cross-referenced data mining would be appreciated&lt;br /&gt;
&lt;br /&gt;
== Potential mentors ==&lt;br /&gt;
&lt;br /&gt;
* Kumar Shivendu (KShivendu on [[IRC]])&lt;br /&gt;
* Valentin Lorentz (vlorentz on [[IRC]])&lt;br /&gt;
* Vincent Sellier (vsellier on [[IRC]])&lt;br /&gt;
&lt;br /&gt;
== Other relevant (but independent) tasks ==&lt;br /&gt;
&lt;br /&gt;
This task is only about adding data we already collected to the existing Elasticsearch database;&lt;br /&gt;
you may also be interested in [[Mine information from archived content (GSoC task)]]&lt;br /&gt;
and [[Mine information from external sources (GSoC task)]] to fill this&lt;br /&gt;
database; but those are completely independent tasks.&lt;br /&gt;
&lt;br /&gt;
This database only contains project URLs and metadata, not source code.&lt;br /&gt;
Source code search is more complex, but is available as an&lt;br /&gt;
[[Source code search engine prototype (internship)|internship topic]].&lt;br /&gt;
&lt;br /&gt;
[[Category:GSoC task]]&lt;br /&gt;
[[Category:Available GSoC task]]&lt;/div&gt;</summary>
		<author><name>KShivendu</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Mine_information_from_external_sources_(GSoC_task)&amp;diff=1705</id>
		<title>Mine information from external sources (GSoC task)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Mine_information_from_external_sources_(GSoC_task)&amp;diff=1705"/>
		<updated>2022-03-11T06:42:33Z</updated>

		<summary type="html">&lt;p&gt;KShivendu: Add KShivendu&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
In addition to archiving source code artifacts, Software Heritage is interested in&lt;br /&gt;
archiving metadata from external sources and correlating it with source code artifacts.&lt;br /&gt;
This is also to enable semantic searches on the archive and scientific research.&lt;br /&gt;
&lt;br /&gt;
Collecting this extrinsic metadata is a&lt;br /&gt;
[https://forge.softwareheritage.org/T1739 work in progress], and you are welcome&lt;br /&gt;
to contribute to its implementation.&lt;br /&gt;
&lt;br /&gt;
== Task description ==&lt;br /&gt;
&lt;br /&gt;
You would contribute to the design of our metadata-fetching architecture.&lt;br /&gt;
This includes:&lt;br /&gt;
&lt;br /&gt;
* Review what metadata we want to fetch&lt;br /&gt;
* Work out how to efficiently fetch it at regular intervals and store it&lt;br /&gt;
* Implement metadata fetching from at least one source, in a way that can be generalized to other sources&lt;br /&gt;
&lt;br /&gt;
=== Expected duration ===&lt;br /&gt;
&lt;br /&gt;
175 or 350 hours, at your option (longer duration means you can tackle more formats).&lt;br /&gt;
Difficulty: hard&lt;br /&gt;
&lt;br /&gt;
== Desirable skills ==&lt;br /&gt;
&lt;br /&gt;
* Python 3 and Git are a must to work on any Software Heritage project&lt;br /&gt;
* Prior experience in working with software metadata is a plus, but not required&lt;br /&gt;
&lt;br /&gt;
== Potential mentors ==&lt;br /&gt;
&lt;br /&gt;
* Stefano Zacchiroli &amp;lt;zack@upsilon.cc&amp;gt; (zack on [[IRC]])&lt;br /&gt;
* Valentin Lorentz (vlorentz on [[IRC]])&lt;br /&gt;
* Kumar Shivendu (KShivendu on [[IRC]])&lt;br /&gt;
&lt;br /&gt;
[[Category:Available GSoC task]]&lt;br /&gt;
[[Category:GSoC task]]&lt;/div&gt;</summary>
		<author><name>KShivendu</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Mine_information_from_archived_content_(GSoC_task)&amp;diff=1704</id>
		<title>Mine information from archived content (GSoC task)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Mine_information_from_archived_content_(GSoC_task)&amp;diff=1704"/>
		<updated>2022-03-11T06:42:08Z</updated>

		<summary type="html">&lt;p&gt;KShivendu: Add KShivendu&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
In addition to archival, Software Heritage indexes the retrieved source code&lt;br /&gt;
artifacts, to enable semantic searches on the archive and scientific research.&lt;br /&gt;
&lt;br /&gt;
Indexing can happen at the individual file level (e.g., detecting the programming&lt;br /&gt;
language the file is written in or the license declared in its header), or at&lt;br /&gt;
a coarser granularity (e.g., what metadata are declared for the most&lt;br /&gt;
recently archived version of a given project).&lt;br /&gt;
&lt;br /&gt;
A number of indexes are [https://forge.softwareheritage.org/source/swh-indexer/ currently supported],&lt;br /&gt;
such as:&lt;br /&gt;
&lt;br /&gt;
* file level mining:&lt;br /&gt;
** MIME type detection (using libmagic)&lt;br /&gt;
** license detection (using FOSSology/nomossa)&lt;br /&gt;
** language detection (using Pygments)&lt;br /&gt;
** ctags extraction (using universal-ctags)&lt;br /&gt;
* project level mining:&lt;br /&gt;
** Ruby gemspec metadata&lt;br /&gt;
** Python PKG-INFO metadata&lt;br /&gt;
** Maven pom.xml metadata&lt;br /&gt;
** NPM package.json metadata&lt;br /&gt;
&lt;br /&gt;
== Task description ==&lt;br /&gt;
&lt;br /&gt;
Writing additional indexers that extract more information from archived source&lt;br /&gt;
code is welcome and would constitute a suitable GSoC project.&lt;br /&gt;
&lt;br /&gt;
Name the kind of data mining you want to do!&lt;br /&gt;
&lt;br /&gt;
For inspiration you can have a look at [https://libraries.io Libraries.io], as&lt;br /&gt;
most package formats/package managers support dedicated ways of expressing&lt;br /&gt;
metadata and we only support a small number of them so far. But do not&lt;br /&gt;
restrict your ambition to those: any kind of data extraction/mining you want to&lt;br /&gt;
do on the archive could work.&lt;br /&gt;
&lt;br /&gt;
You may also add support for multiple formats at once, using an external tool,&lt;br /&gt;
such as [https://github.com/datacite/bolognese Bolognese] or&lt;br /&gt;
[https://github.com/librariesio/bibliothecary/ bibliothecary].&lt;br /&gt;
&lt;br /&gt;
=== Expected duration ===&lt;br /&gt;
&lt;br /&gt;
175 or 350 hours, at your option (longer duration means you can tackle more formats). Difficulty: medium&lt;br /&gt;
&lt;br /&gt;
== Desirable skills ==&lt;br /&gt;
&lt;br /&gt;
* Python 3 and Git are a must to work on any Software Heritage project&lt;br /&gt;
* Prior experience in working with (source code) metadata is a plus, but not required&lt;br /&gt;
&lt;br /&gt;
== Potential mentors ==&lt;br /&gt;
&lt;br /&gt;
* Stefano Zacchiroli &amp;lt;zack@upsilon.cc&amp;gt; (zack on [[IRC]])&lt;br /&gt;
* Valentin Lorentz (vlorentz on [[IRC]])&lt;br /&gt;
* Kumar Shivendu (KShivendu on [[IRC]])&lt;br /&gt;
&lt;br /&gt;
[[Category:Available GSoC task]]&lt;br /&gt;
[[Category:GSoC task]]&lt;/div&gt;</summary>
		<author><name>KShivendu</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Dashboard_UI_for_the_Code_Scanner_(GSoC_task)&amp;diff=1703</id>
		<title>Dashboard UI for the Code Scanner (GSoC task)</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Dashboard_UI_for_the_Code_Scanner_(GSoC_task)&amp;diff=1703"/>
		<updated>2022-03-11T06:41:55Z</updated>

		<summary type="html">&lt;p&gt;KShivendu: Add KShivendu&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
The Software Heritage archive is the most comprehensive open data knowledge base about source code that has been published openly.&lt;br /&gt;
&lt;br /&gt;
As such, it can be used to scan local source code bases to detect which parts of them come from public code, including Free and Open Source Software.&lt;br /&gt;
&lt;br /&gt;
The Software Heritage scanner (&amp;lt;code&amp;gt;swh-scanner&amp;lt;/code&amp;gt;) ([https://docs.softwareheritage.org/devel/swh-scanner/ documentation], [https://forge.softwareheritage.org/source/swh-scanner/ code], [https://upsilon.cc/~zack/talks/2021/2021-04-07-llw.pdf slides of a 2021 presentation about swh-scanner]) is a command line tool that enables doing that.&lt;br /&gt;
&lt;br /&gt;
== Task description ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;swh-scanner&amp;lt;/code&amp;gt; is currently an experimental tool, which works well in practice, but needs a real '''dashboard user interface''' to be useful.&lt;br /&gt;
Several output options are currently available when invoking the &amp;lt;code&amp;gt;swh scanner scan&amp;lt;/code&amp;gt; command, in particular batch output in textual and JSON format, and an interactive&lt;br /&gt;
dashboard (with the &amp;lt;code&amp;gt;-i/--interactive&amp;lt;/code&amp;gt; option).&lt;br /&gt;
&lt;br /&gt;
The interactive view currently works by producing a local HTML file and opening it using the local browser.&lt;br /&gt;
The goal of this project is to '''improve the interactive view''', making it a serious dashboard-style UI to peruse scanning results.&lt;br /&gt;
&lt;br /&gt;
The following improvements are suggested, although more can be proposed (and even more could be discovered during the project work):&lt;br /&gt;
&lt;br /&gt;
* Technology: generating a local HTML file is not necessarily the best way to render results; alternative solutions should be explored, including a self-hosted web app, rendering results with state-of-the-art frontend web frameworks (CSS/HTML/JavaScript)&lt;br /&gt;
* Scalability: rendering currently doesn't work when scanning large code bases such as the Linux kernel; it should be made lazy, loading data only when it needs to be shown&lt;br /&gt;
* Functionality: dashboard rendering should be integrated with the possibility of opening the local source code files that have been scanned, e.g., users will want to be able to open in-browser files that have been detected as known/unknown, in order to figure out why&lt;br /&gt;
* Functionality: in the future, additional information will be added to scanning results, including license and provenance information. While this is not yet available due to backend limitations, the proposed UI should plan ahead for how and where to display such information&lt;br /&gt;
* Paper cuts: [https://forge.softwareheritage.org/tag/code_scanner/ various issues] affect the usability of swh-scanner; fixing them would be welcome as part of this project&lt;br /&gt;
&lt;br /&gt;
=== Expected duration ===&lt;br /&gt;
&lt;br /&gt;
350 hours. Difficulty: medium&lt;br /&gt;
&lt;br /&gt;
== Desirable skills ==&lt;br /&gt;
&lt;br /&gt;
* Python 3 and Git are a must to work on any Software Heritage project&lt;br /&gt;
* Basic understanding of the Software Heritage [https://docs.softwareheritage.org/devel/swh-model/data-model.html data model] and of [https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html SWHID identifiers]&lt;br /&gt;
* HTML/CSS/JavaScript and web development in general&lt;br /&gt;
* Working knowledge of UI/UX design principles&lt;br /&gt;
&lt;br /&gt;
== Potential mentors ==&lt;br /&gt;
&lt;br /&gt;
* Stefano Zacchiroli &amp;lt;zack@upsilon.cc&amp;gt; (zack on [[IRC]])&lt;br /&gt;
* Kumar Shivendu (KShivendu on [[IRC]])&lt;br /&gt;
&lt;br /&gt;
[[Category:Available GSoC task]]&lt;br /&gt;
[[Category:GSoC task]]&lt;/div&gt;</summary>
		<author><name>KShivendu</name></author>
	</entry>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Google_Summer_of_Code_2022&amp;diff=1697</id>
		<title>Google Summer of Code 2022</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Google_Summer_of_Code_2022&amp;diff=1697"/>
		<updated>2022-03-09T06:20:02Z</updated>

		<summary type="html">&lt;p&gt;KShivendu: Make it clear that we are selected for GSoC 2022&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:GSoCLogo.png|512px]]&lt;br /&gt;
&lt;br /&gt;
== General information ==&lt;br /&gt;
&lt;br /&gt;
This page is the central point of information for [[Software Heritage]] participation in the [https://summerofcode.withgoogle.com/ Google Summer of Code] program in 2022.&lt;br /&gt;
&lt;br /&gt;
Google Summer of Code is a program where Google pays students stipends to work over the (northern hemisphere) summer on free software projects such as Software Heritage. Each student works with mentors from the community to complete a software project.&lt;br /&gt;
&lt;br /&gt;
== I want to participate as a student ==&lt;br /&gt;
&lt;br /&gt;
Great! We are very glad for your interest in contributing to Software Heritage and we are looking forward to working together.&lt;br /&gt;
&lt;br /&gt;
=== Prerequisites ===&lt;br /&gt;
&lt;br /&gt;
The following prerequisites apply to all Software Heritage GSoC projects:&lt;br /&gt;
&lt;br /&gt;
* [https://www.python.org Python 3] is our language of choice; you should be fluent in it to apply&lt;br /&gt;
* [https://git-scm.com Git] is our version control system of choice; you should be familiar with it to apply&lt;br /&gt;
* basic knowledge of using a command-line interface (CLI)&lt;br /&gt;
* additional prerequisites depend on the project you will work on; check project descriptions for details&lt;br /&gt;
&lt;br /&gt;
=== Before you apply ===&lt;br /&gt;
&lt;br /&gt;
Here are the steps you should follow before applying, to make sure you have a good grasp of what we are doing at Software Heritage and how we do it:&lt;br /&gt;
&lt;br /&gt;
# Follow our [https://docs.softwareheritage.org/devel/developer-setup.html developer setup tutorial]: it will make sure you have the source code of our software stack locally available and that you can run unit tests&lt;br /&gt;
# Create an account on our [https://forge.softwareheritage.org development forge]&lt;br /&gt;
# Familiarize yourself with our [[Code review in Phabricator|code review workflow]]&lt;br /&gt;
# Make at least one simple change to any one of our [https://docs.softwareheritage.org/devel/ software components] and submit it as a [https://forge.softwareheritage.org/differential/ diff] for code review, following the above workflow. [[Easy hacks]] and [https://forge.softwareheritage.org/project/view/20/ Web UI] issues are good options for what to fix, but feel free to submit any patch you think might be useful.&lt;br /&gt;
&lt;br /&gt;
=== What to include in your application ===&lt;br /&gt;
&lt;br /&gt;
Make sure that your application includes the following information:&lt;br /&gt;
&lt;br /&gt;
* Describe the '''specific project''' you want to work on. What do you want to achieve? Why is it important? Why is it useful for Software Heritage? The project might be one of the project ideas that we have prepared below, or something else entirely that you want to contribute to Software Heritage. Your source code archival pet peeve? Surprise us!&lt;br /&gt;
* Detail your '''work plan''': a brief description of how you plan to go about your project, including a list of ''deliverables'' and a ''timeline'' of when you expect them to be available.&lt;br /&gt;
* Include a reference to '''the diff''' you submitted before applying (see the &amp;quot;Before you apply&amp;quot; section above).&lt;br /&gt;
&lt;br /&gt;
== Ideas list ==&lt;br /&gt;
&lt;br /&gt;
Below you can find a list of project ideas that are good options for a reasonably-sized GSoC project (check individual idea pages for expected duration and difficulty of each task):&lt;br /&gt;
&amp;lt;DynamicPageList&amp;gt;&lt;br /&gt;
category = Available GSoC task&lt;br /&gt;
ordermethod = sortkey&lt;br /&gt;
order = ascending&lt;br /&gt;
&amp;lt;/DynamicPageList&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We also have a [[Internships|list of internship topics]], which you can use for ideas when applying to GSoC with us.&lt;br /&gt;
Expect each internship topic to require 350 hours and to be harder than GSoC-specific tasks.&lt;br /&gt;
&lt;br /&gt;
All project ideas above are just suggestions; don't feel obliged to pick one of them if there is nothing that fits your taste and abilities.&lt;br /&gt;
Feel free to propose something else that you are excited about and that contributes to improving the Software Heritage archive: we will be happy to consider it!&lt;br /&gt;
&lt;br /&gt;
== Contact ==&lt;br /&gt;
&lt;br /&gt;
GSoC students are encouraged to get in touch with the Software Heritage community using the standard development communication channels, and in particular our [[IRC]] channel (#swh-devel on [https://libera.chat/ Libera Chat]) and mailing list (swh-devel).&lt;br /&gt;
&lt;br /&gt;
See our [https://www.softwareheritage.org/community/developers/ development information page] for details.&lt;br /&gt;
&lt;br /&gt;
== Timeline ==&lt;br /&gt;
&lt;br /&gt;
See the official [https://developers.google.com/open-source/gsoc/timeline Google Summer of Code timeline].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Google Summer of Code]]&lt;br /&gt;
[[Category:Google Summer of Code 2022]]&lt;/div&gt;</summary>
		<author><name>KShivendu</name></author>
	</entry>
</feed>