Difference between revisions of "Google Summer of Code 2020"

From Software Heritage Wiki
(→‎Contact: freenode -> libera)
 
(13 intermediate revisions by 2 users not shown)
Line 1: Line 1:
[[File:GSoCLogo.png|1024px]]
+
<div style="text-align: center; font-size: 1.2em; border: solid 1px black; padding: 1em;">
 +
Software Heritage is not participating in the Google Summer of Code 2020; this page is only kept as an archive. Thank you for your interest in Software Heritage, and you are welcome to apply for an [[internship]] instead.
 +
</div>
 +
 
  
 
== General information ==
 
== General information ==
Line 9: Line 12:
 
== Accepted projects ==
 
== Accepted projects ==
  
 
+
None yet, it's too soon for this!
  
 
== I want to participate as a student ==
 
== I want to participate as a student ==
Line 43: Line 46:
  
 
Below you can find a list of project ideas that are good options for a
 
Below you can find a list of project ideas that are good options for a
reasonably sized GSoC project.  They are just suggestions, though; don't feel
+
reasonably-sized GSoC project:
obliged to pick one of them if there is nothing that fits your taste and
 
abilities.  Feel free to propose something else that you are excited about and
 
that contributes to improving the Software Heritage archive: we will be happy to
 
consider it!
 
 
 
=== Mine information from archived content ===
 
 
 
In addition to archival, Software Heritage indexes the retrieved source code
 
artifacts, to enable semantic searches on the archive and scientific research.
 
 
 
Indexing can happen at the individual file-level (e.g., detect the programming
 
language the file is written in or the license declared in its header), or at
 
a coarser granularity (e.g., what metadata are declared for the most
 
recently archived version of a given project).
 
 
 
A number of indexes are [https://forge.softwareheritage.org/source/swh-indexer/ currently supported],
 
such as:
 
 
 
* file level mining:
 
** MIME type detection (using libmagic)
 
** license detection (using FOSSology/nomossa)
 
** language detection (using Pygments)
 
** ctags extraction (using universal-ctags)
 
* project level mining:
 
** Ruby gemspec metadata
 
** Python PKG-INFO metadata
 
** Maven pom.xml metadata
 
** NPM package.json metadata
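
As a rough illustration of the project-level mining listed above, here is a minimal sketch that pulls a few fields out of an NPM package.json and a Python PKG-INFO file using only the standard library. It is not the swh-indexer code: the function names, the selection of fields, and the sample inputs are all assumptions made for this example.

```python
import json
from email.parser import Parser
from typing import Any, Dict


def extract_npm_metadata(raw: str) -> Dict[str, Any]:
    """Return a few common fields from a package.json document.

    Hypothetical helper: the field selection is an assumption.
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        return {}
    wanted = ("name", "version", "description", "license")
    return {key: data[key] for key in wanted if key in data}


def extract_pkg_info_metadata(raw: str) -> Dict[str, Any]:
    """Return a few fields from a PKG-INFO file.

    PKG-INFO uses RFC 822-style headers, so the standard email parser
    can read it. Hypothetical helper, not an existing indexer.
    """
    message = Parser().parsestr(raw)
    return {
        "name": message["Name"],
        "version": message["Version"],
        # A header may be repeated; get_all() collects every occurrence.
        "classifiers": message.get_all("Classifier", []),
    }


# Made-up sample inputs, for illustration only.
NPM_SAMPLE = '{"name": "demo", "version": "1.0.0", "license": "MIT", "scripts": {}}'
PKG_INFO_SAMPLE = (
    "Metadata-Version: 2.1\n"
    "Name: demo\n"
    "Version: 1.0.0\n"
    "Classifier: Programming Language :: Python :: 3\n"
    "Classifier: License :: OSI Approved :: MIT License\n"
)

npm_meta = extract_npm_metadata(NPM_SAMPLE)
pkg_meta = extract_pkg_info_metadata(PKG_INFO_SAMPLE)
```

A real indexer would also need to cope with malformed files and map the extracted fields onto a common vocabulary; this sketch leaves both out.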
 
 
 
Writing additional indexers that extract more information from archived source
 
code is welcome and would constitute a suitable GSoC project.
 
 
 
Name the kind of data mining you want to do!
 
 
 
For inspiration you can have a look at [https://libraries.io Libraries.io], as
 
most package formats/package managers support dedicated ways of expressing
 
metadata and we only support a small number of them so far. But do not
 
restrict your ambition to those: any kind of data extraction/mining you want to
 
do on the archive could work.
 
 
 
You may also add support for multiple formats at once, using an external tool,
 
such as [https://github.com/datacite/bolognese Bolognese] or
 
[https://github.com/librariesio/bibliothecary/ bibliothecary].
 
 
 
=== Mine information from external sources ===
 
 
 
In addition to archiving source code artifacts, Software Heritage is interested in
 
archiving metadata from external sources and correlating it with source code artifacts.
 
This is also to enable semantic searches on the archive and scientific research.
 
 
 
Collecting this extrinsic metadata is a
 
[https://forge.softwareheritage.org/T1739 work in progress], and you are welcome
 
to contribute to its implementation.
 
 
 
 
 
=== Improve and extend the archive Web UI ===
 
 
 
As you probably know already, the Software Heritage archive can be
 
[https://archive.softwareheritage.org browsed on the Web]. The
 
[https://forge.softwareheritage.org/source/swh-web/ code] powering that
 
interface is a Django application that also implements a
 
[https://archive.softwareheritage.org/api/ Web API].
 
 
 
Several improvements are possible on the archive Web interface and would make
 
great GSoC projects. Some ideas to whet your appetite:
 
 
 
* add new source code search criteria and improve the search interface
 
* add developer-oriented features, e.g., source file history, blame/praise interface, in-browser edit (with patch download), ... (note that this will also require backend design and implementation)
 
* improve [https://www.w3.org/WAI/ accessibility]
 
* help us design and implement our [https://forge.softwareheritage.org/T1805 next API version]
 
 
 
=== Improve the Vault ===
 
 
 
The Software Heritage archive allows retrieval of archived objects of different formats.
 
Once an object has been chosen for retrieval, it can be "cooked" using the [https://docs.softwareheritage.org/devel/swh-vault/index.html Software Heritage Vault].
 
 
 
Right now the Vault has several limitations: it only handles two kinds of objects (revisions and directories), it requires recursively querying the database to get the full subgraph of an object, and it generates revisions in an impractical format (git fast-import).
 
 
 
Several improvements are possible:
 
 
 
* add coverage for new kinds of objects (releases, snapshots and even origins?)
 
* use our in-memory graph database [https://docs.softwareheritage.org/devel/swh-graph/index.html swh-graph] to speed up fetching the necessary subgraphs.
 
* write cookers to output new formats (e.g., git tarballs/git bundles or even other VCS?)
 
* improve end-to-end testing
 
* other general code improvements (better progress/error reporting in the frontend, etc.)
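
To make the subgraph-fetching idea above more concrete, here is a minimal sketch of collecting everything reachable from one object with a breadth-first traversal over an in-memory adjacency map. It does not use the swh-graph API: the toy graph, the identifiers, and the function name are hypothetical and only illustrate why an in-memory graph avoids one database round trip per node.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical toy graph: node identifier -> identifiers of its children.
# These identifiers are made up and are not real archive identifiers.
TOY_GRAPH: Dict[str, List[str]] = {
    "rev:1": ["dir:root"],
    "dir:root": ["dir:src", "cnt:README"],
    "dir:src": ["cnt:main.py", "cnt:README"],
    "cnt:README": [],
    "cnt:main.py": [],
}


def collect_subgraph(graph: Dict[str, List[str]], root: str) -> Set[str]:
    """Return every node reachable from root, including root itself."""
    seen: Set[str] = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            # Shared objects (here cnt:README) are visited only once.
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


subgraph = collect_subgraph(TOY_GRAPH, "rev:1")
```

With the whole graph in memory, each step is a dictionary lookup rather than a database query, which is the kind of speed-up the swh-graph item hints at.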
 
  
=== Internship topics ===
+
* [[Improve and extend the archive Web UI (GSoC task)]]
 +
* [[Improve the Vault (GSoC task)]]
 +
* [[Mine information from archived content (GSoC task)]]
 +
* [[Mine information from external sources (GSoC task)]]
  
 
Independently from GSoC, we also maintain a separate list of academic [[Internships]].
 
Independently from GSoC, we also maintain a separate list of academic [[Internships]].
 
They are usually offered to university students, but during GSoC they are also available as GSoC projects.
 
They are usually offered to university students, but during GSoC they are also available as GSoC projects.
 +
The currently available (GSoC) internship topics are:
  
The currently available (GSoC) internship topics are:
+
* [[Archive search query language (internship)]]
<DynamicPageList>
+
* [[Expand metadata search coverage (internship)]]
category = Available internship
+
* [[Fine-grained tracking of source code provenance (internship)]]
</DynamicPageList>
+
* [[Graph query language for the archive (internship)]]
 +
* [[Ingest all Debian derivatives (internship)]]
 +
* [[Ingest Wikidata software origins (internship)]]
 +
* [[Integrate Software Heritage and GHTorrent (internship)]]
 +
* [[Large-scale license text recognition (internship)]]
 +
* [[Source code search engine prototype (internship)]]
 +
 
 +
Both GSoC tasks and internship topics are just suggestions, though; don't feel
 +
obliged to pick one of them if there is nothing that fits your taste and
 +
abilities.  Feel free to propose something else that you are excited about and
 +
that contributes to improving the Software Heritage archive: we will be happy to
 +
consider it!
  
 
== Contact ==
 
== Contact ==
  
GSoC students are encouraged to get in touch with the Software Heritage community using the standard development communication channels, i.e.:
+
GSoC students are encouraged to get in touch with the Software Heritage community using the standard development communication channels, and in particular our IRC channel (#swh-devel on [https://libera.chat/ Libera Chat]) and mailing list (swh-devel).
 
 
* the #swh-devel IRC channel on [https://freenode.net Freenode]
 
* the [https://sympa.inria.fr/sympa/info/swh-devel swh-devel mailing list]
 
  
See our [https://www.softwareheritage.org/community/developers/ development information page] for more details.
+
See our [https://www.softwareheritage.org/community/developers/ development information page] for details.
  
 
== Timeline ==
 
== Timeline ==

Latest revision as of 07:30, 15 June 2021

Software Heritage is not participating in the Google Summer of Code 2020; this page is only kept as an archive. Thank you for your interest in Software Heritage, and you are welcome to apply for an internship instead.


General information

This page is the central point of information for Software Heritage participation in the Google Summer of Code program in 2020.

Google Summer of Code is a program where Google pays students stipends to work over the (northern hemisphere) summer on free software projects such as Software Heritage. Each student works with mentors from the community to complete a software project.

Accepted projects

None yet, it's too soon for this!

I want to participate as a student

Great! We are very glad about your interest in contributing to Software Heritage and we are looking forward to working together.

Prerequisites

The following prerequisites apply to Software Heritage GSoC projects:

  • Python 3 is our language of choice; you should be fluent in it to apply
  • Git is our version control system of choice; you should be familiar with it to apply
  • additional prerequisites depend on the project you will work on; check project descriptions for details

Before you apply

Here are the steps you should follow before applying, to make sure you have a good grasp of what we are doing at Software Heritage and how we do it:

  1. Follow our getting started guide: it will make sure you can locally run a (small) copy of the archive and ingest source code into it
  2. Create an account on our development forge
  3. Familiarize yourself with our code review workflow
  4. Make at least one simple change to any one of our software components and submit it as a diff for code review, following the above workflow. Easy hacks and Web UI issues are good options for what to fix, but feel free to submit any patch you think might be useful.

What to include in your application

Make sure that your application includes the following information:

  • Describe the specific project you want to work on. What do you want to achieve? Why is it important? Why is it useful for Software Heritage? The project might be one of the project ideas that we have prepared below, or something else entirely that you want to contribute to Software Heritage. Do you have a source code archival pet peeve? Surprise us!
  • Detail your work plan: a brief description of how you plan to go about your project, including a list of deliverables and a timeline of when you expect them to be available.
  • Include a reference to the diff you submitted before applying (see the "Before you apply" section above).

Ideas list

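Below you can find a list of project ideas that are good options for a reasonably sized GSoC project:

  • Improve and extend the archive Web UI (GSoC task)
  • Improve the Vault (GSoC task)
  • Mine information from archived content (GSoC task)
  • Mine information from external sources (GSoC task)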

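Independently of GSoC, we also maintain a separate list of academic Internships. They are usually offered to university students, but during GSoC they are also available as GSoC projects. The currently available (GSoC) internship topics are:

  • Archive search query language (internship)
  • Expand metadata search coverage (internship)
  • Fine-grained tracking of source code provenance (internship)
  • Graph query language for the archive (internship)
  • Ingest all Debian derivatives (internship)
  • Ingest Wikidata software origins (internship)
  • Integrate Software Heritage and GHTorrent (internship)
  • Large-scale license text recognition (internship)
  • Source code search engine prototype (internship)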

Both GSoC tasks and internship topics are just suggestions, though; don't feel obliged to pick one of them if there is nothing that fits your taste and abilities. Feel free to propose something else that you are excited about and that contributes to improving the Software Heritage archive: we will be happy to consider it!

Contact

GSoC students are encouraged to get in touch with the Software Heritage community using the standard development communication channels, and in particular our IRC channel (#swh-devel on Libera Chat) and mailing list (swh-devel).

See our development information page for details.

Timeline

See the official Google Summer of Code timeline.