Difference between revisions of "Google Summer of Code 2019"

From Software Heritage Wiki
(→‎Ideas list: rereading)
(→‎Contact: freenode -> libera)
 
(23 intermediate revisions by 4 users not shown)
Line 3: Line 3:
 
== General information ==
 
== General information ==
  
This page is the central point of information for [[Software Heritage]] participation into the [https://summerofcode.withgoogle.com/ Google Summer of Code] program.
+
This page is the central point of information for [[Software Heritage]] participation into the [https://summerofcode.withgoogle.com/ Google Summer of Code] program in 2019.
  
 
Google Summer of Code is a program where Google pays students stipends to work over the (northern hemisphere) summer on free software projects such as Software Heritage. Each student works with mentors from the community to complete a software project.
 
Google Summer of Code is a program where Google pays students stipends to work over the (northern hemisphere) summer on free software projects such as Software Heritage. Each student works with mentors from the community to complete a software project.
 +
 +
== Accepted projects ==
 +
 +
* [[Google Summer of Code 2019/Graph compression|Graph compression on the development history of software]] - Thibault Allançon
 +
* [[Google Summer of Code 2019/Web UI improvements|Web UI improvements]] - Kalpit Kothari
 +
* [[Google Summer of Code 2019/Increase archive coverage|Increase archive coverage]] - Archit Agrawal
  
 
== I want to participate as a student ==
 
== I want to participate as a student ==
Line 24: Line 30:
  
 
# Follow our [https://docs.softwareheritage.org/devel/getting-started.html getting started guide]: it will make sure you can locally run a (small) copy of the archive and ingest source code into it
 
# Follow our [https://docs.softwareheritage.org/devel/getting-started.html getting started guide]: it will make sure you can locally run a (small) copy of the archive and ingest source code into it
# Create an account our [https://forge.softwareheritage.org development forge]
+
# Create an account on our [https://forge.softwareheritage.org development forge]
 
# Familiarize yourself with our [[Code review in Phabricator|code review workflow]]
 
# Familiarize yourself with our [[Code review in Phabricator|code review workflow]]
 
# Make a simple change to any one of our [https://docs.softwareheritage.org/devel/ software components] and submit it as a [https://forge.softwareheritage.org/differential/ diff] for code review, following the above workflow. [[Easy hacks]] and [https://forge.softwareheritage.org/project/view/20/ Web UI] issues are good options for what to fix, but feel free to submit any patch you think it might be useful.
 
# Make a simple change to any one of our [https://docs.softwareheritage.org/devel/ software components] and submit it as a [https://forge.softwareheritage.org/differential/ diff] for code review, following the above workflow. [[Easy hacks]] and [https://forge.softwareheritage.org/project/view/20/ Web UI] issues are good options for what to fix, but feel free to submit any patch you think it might be useful.
Line 32: Line 38:
 
Make sure that your application includes the following information:
 
Make sure that your application includes the following information:
  
* Describe the '''specific project''' you want to work on. What do you want to achieve? Why is it important? Why is it useful for Software Heritage? The project might be one of the project ideas that we have prepared, or something else entirely that you want to contribute to Software Heritage. Your source code archival pet peeve, surprise us!
+
* Describe the '''specific project''' you want to work on. What do you want to achieve? Why is it important? Why is it useful for Software Heritage? The project might be one of the project ideas that we have prepared below, or something else entirely that you want to contribute to Software Heritage. Your source code archival pet peeve, surprise us!
 
* Detail your '''work plan''': a brief description of how you plan to go about your project, including a list of  ''deliverables'' and a ''timeline'' of when do you expect them to be available.
 
* Detail your '''work plan''': a brief description of how you plan to go about your project, including a list of  ''deliverables'' and a ''timeline'' of when do you expect them to be available.
 
* Include a reference to '''the diff''' you submitted before applying (see the "Before you apply" section above).
 
* Include a reference to '''the diff''' you submitted before applying (see the "Before you apply" section above).
Line 53: Line 59:
 
As Software Heritage grows, we're incrementally increasing archive coverage by
 
As Software Heritage grows, we're incrementally increasing archive coverage by
 
expanding the sources from which we archive software; a list of currently
 
expanding the sources from which we archive software; a list of currently
crawled sources is listed on the [https://archive.softwareheritage.org main
+
crawled sources is listed on the
archive page]. As you can see there, we have already built ways of archiving
+
[https://archive.softwareheritage.org main archive page]. As you can see there,
Mercurial repositories, Debian packages, PyPI bundles, etc.
+
we have already built ways of archiving Mercurial repositories, Debian packages,
 +
PyPI bundles, and more.
  
 
Further expansions of archive coverage are very suitable GSoC project.
 
Further expansions of archive coverage are very suitable GSoC project.
  
 
Practically, to expand archive coverage two kinds of software components need
 
Practically, to expand archive coverage two kinds of software components need
to be implemented:
+
to be implemented: listers and loaders.
 +
 
 +
'''Listers''' are components that crawl the APIs of software
 +
[https://en.wikipedia.org/wiki/Forge_(software) forges] (e.g., Bitbucket,
 +
Gitorious, Sourceforge, ...) or package managers (a larges list is maintained
 +
by [https://libraries.io Libraries.io]) and return a list of the software
 +
available in it. See the official
 +
[https://docs.softwareheritage.org/devel/swh-lister/ listers documentation] for
 +
more details.
  
# Origin '''listers''': pieces of code that crawl the APIs of software [https://en.wikipedia.org/wiki/Forge_(software) forges] (e.g., Bitbucket, Gitorious, Sourceforge, ...) or package managers (a larges list is maintained by [https://libraries.io Libraries.io]) and return a list of the software available in it. The documentation on listers is here: https://docs.softwareheritage.org/devel/swh-lister/index.html
+
'''Loaders''' take a bundle of software (tarball, Git repository, Python
# Create '''loaders'''. Loaders take a bundle of software (tarball, git repository, Python package, ...) and load it into Software Heritage, by adapting it so that it matches our uniform data model[https://docs.softwareheritage.org/devel/swh-model/data-model.html].
+
package, ...) and load it into Software Heritage, by adapting it so that it
 +
matches the archive
 +
[https://docs.softwareheritage.org/devel/swh-model/data-model.html data model].
  
In a few words, a lister can be a way of asking "what are all the repositories available on npm.org?", while a loader would be "how do I load the NPM package I downloaded into Software Heritage?".
+
While listers answer questions like "what are all the repositories available on
 +
npm.org?", a loader addresses the "how do I load the NPM package I downloaded
 +
into Software Heritage?" problem.
  
Writing a lister or a loader is a great way to contribute to Software Heritage by expanding its coverage! We have a list of software sources we would like to archive here[https://wiki.softwareheritage.org/wiki/Suggestion_box:_source_code_to_add], but you're free to suggest more.
+
Writing a missing lister or a loader is a great way to contribute to expand the
 +
coverage of the Software Heritage archive! Feel free to propose the
 +
implementation of one (or several!) listers or loaders that are currently
 +
missing. For inspiration you can check out our [[Suggestion box]] for code to
 +
archive, or propose your favorite missing forge or package repository.
  
 
=== Mine information from archived content ===
 
=== Mine information from archived content ===
  
'''TODO'''
+
In addition to archival, Software Heritage indexes the retrieved source code
 +
artifacts, to enable semantic searches on the archive and scientific research.
 +
 
 +
Indexing can happen at the individual file-level (e.g., detect the programming
 +
language the file is written in or the license declared in its header), or at
 +
more coarse grained granularity (e.g., what metadata are declared for the most
 +
recently archived version of a given project).
 +
 
 +
A number of indexes are [https://forge.softwareheritage.org/source/swh-indexer/ currently supported],
 +
such as:
 +
 
 +
* file level mining:
 +
** MIME type detection (using libmagic)
 +
** license detection (using FOSSology/nomossa)
 +
** language detection (using Pygments)
 +
** ctags extraction (using universal-ctags)
 +
* project level mining:
 +
** Ruby gemspec metadata
 +
** Python PKG-INFO metadata
 +
** Maven pom.xml metadata
 +
** NPM package.json metadata
 +
 
 +
Writing additional indexers that extract more information from archived source
 +
code is welcome and would constitute a suitable GSoC project.
 +
 
 +
Name the kind of data mining you want to do!
 +
 
 +
For inspiration you can have a look at [https://libraries.io Libraries.io], as
 +
most package formats/package managers support dedicated ways of expressing
 +
metadata and we only support a small number of them up-to-now. But do not
 +
restrict your ambition to those, any kind of data extraction/mining you want to
 +
do on the archive could work.
  
 
=== Improve and extend the archive Web UI ===
 
=== Improve and extend the archive Web UI ===
  
In order to easily navigate into the archive content, a [https://forge.softwareheritage.org/source/swh-web/ web application] is currently developed.
+
As you probably know already, The Software Heritage archive can be
So far it offers the following main features:
+
[https://archive.softwareheritage.org browsed on the Web]. The
* programmatic access to the content of the archive via the [https://archive.softwareheritage.org/api/ Software Heritage API]
+
[https://forge.softwareheritage.org/source/swh-web/ code] powering that
* in-browser navigation of the content of the archive via the [https://archive.softwareheritage.org/browse/ Software Heritage browse UI]
+
interface is a Django application that also implements a
 +
[https://archive.softwareheritage.org/api/ Web API].
 +
 
 +
Several improvements are possible on the archive Web interface and would make
 +
great GSoC projects, some ideas to whet your appetite:
  
There are still numerous improvements and new features to add to that web application, for instance:
+
* improve navigation on mobile devices and browsers
* add new API endpoints
 
* improve overall design
 
* improve navigation for mobile browsers
 
 
* add new source code search criteria and improve the search interface
 
* add new source code search criteria and improve the search interface
* implement new developer oriented features: source file history, blame interface, ...
+
* add developer-oriented features, e.g., source file history, blame/praise interface, in-browser edit (with patch download), ...
* improve web application [https://www.w3.org/WAI/ accessibility]
+
* improve [https://www.w3.org/WAI/ accessibility]
 +
* add missing API endpoints (name your pet peeves!)
 +
* add end to end tests using [https://www.seleniumhq.org/ Selenium]
 +
 
 +
=== Research internships ===
  
If you are interested in web development and want to contribute to the application
+
For the more research-inclined students, we also maintain a separate list of [[Internships]].
enabling users to navigate in the biggest public source code archive collected so far,
+
Any topic there is also a viable GSoC project.
feel free to apply.
 
  
 
== Contact ==
 
== Contact ==
Line 96: Line 154:
 
GSoC students are encouraged to get in touch with the Software Heritage community using the standard development communication channels, i.e.:
 
GSoC students are encouraged to get in touch with the Software Heritage community using the standard development communication channels, i.e.:
  
* the #swh-devel IRC channel on [https://freenode.net Freenode]
+
* the #swh-devel IRC channel on [https://libera.chat/ Libera Chat]
 
* the [https://sympa.inria.fr/sympa/info/swh-devel swh-devel mailing list]
 
* the [https://sympa.inria.fr/sympa/info/swh-devel swh-devel mailing list]
  
Line 104: Line 162:
  
 
See the official [https://developers.google.com/open-source/gsoc/timeline Google Summer of Code timeline].
 
See the official [https://developers.google.com/open-source/gsoc/timeline Google Summer of Code timeline].
 +
 +
 +
[[Category:Google Summer of Code]]
 +
[[Category:Google Summer of Code 2019]]

Latest revision as of 07:29, 15 June 2021


General information

This page is the central point of information for Software Heritage's participation in the Google Summer of Code program in 2019.

Google Summer of Code is a program where Google pays students stipends to work over the (northern hemisphere) summer on free software projects such as Software Heritage. Each student works with mentors from the community to complete a software project.

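Accepted projects

  • Graph compression on the development history of software - Thibault Allançon
  • Web UI improvements - Kalpit Kothari
  • Increase archive coverage - Archit Agrawal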

I want to participate as a student

Great! We are very glad for your interest in contributing to Software Heritage and we look forward to working together.

Prerequisites

The following prerequisites apply to Software Heritage GSoC projects:

  • Python 3 is our language of choice; you should be fluent in it to apply
  • Git is our version control system of choice; you should be familiar with it to apply
  • additional prerequisites depend on the project you will work on; check project descriptions for details

Before you apply

Here are the steps you should follow before applying, to make sure you have a good grasp of what we are doing at Software Heritage and how we do it:

  1. Follow our getting started guide: it will make sure you can locally run a (small) copy of the archive and ingest source code into it
  2. Create an account on our development forge
  3. Familiarize yourself with our code review workflow
  4. Make a simple change to any one of our software components and submit it as a diff for code review, following the above workflow. Easy hacks and Web UI issues are good options for what to fix, but feel free to submit any patch you think might be useful.

What to include in your application

Make sure that your application includes the following information:

  • Describe the specific project you want to work on. What do you want to achieve? Why is it important? Why is it useful for Software Heritage? The project might be one of the project ideas that we have prepared below, or something else entirely that you want to contribute to Software Heritage. If you have a source code archival pet peeve, surprise us!
  • Detail your work plan: a brief description of how you plan to go about your project, including a list of deliverables and a timeline of when you expect them to be available.
  • Include a reference to the diff you submitted before applying (see the "Before you apply" section above).

Ideas list

Below you can find a list of project ideas that are good options for a reasonably sized GSoC project. They are just suggestions, though: don't feel obliged to pick one of them if nothing fits your taste and abilities. Feel free to propose something else that you are excited about and that helps improve the Software Heritage archive: we will be happy to consider it!

Increase archive coverage

Software Heritage aims to archive all publicly available software source code. We naturally started with the place where most of the software is easily available today: git repositories on GitHub.

As Software Heritage grows, we're incrementally increasing archive coverage by expanding the sources from which we archive software; the currently crawled sources are listed on the main archive page. As you can see there, we have already built ways of archiving Mercurial repositories, Debian packages, PyPI bundles, and more.

Further expansions of archive coverage are very suitable GSoC projects.

Practically, to expand archive coverage two kinds of software components need to be implemented: listers and loaders.

Listers are components that crawl the APIs of software forges (e.g., Bitbucket, Gitorious, Sourceforge, ...) or package managers (a large list is maintained by Libraries.io) and return a list of the software available in them. See the official listers documentation for more details.

Loaders take a bundle of software (tarball, Git repository, Python package, ...) and load it into Software Heritage, by adapting it so that it matches the archive data model.

While listers answer questions like "what are all the repositories available on npm.org?", a loader addresses the "how do I load the NPM package I downloaded into Software Heritage?" problem.
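
The division of labour between the two components can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names (Origin, ToyForgeLister, ToyLoader, ToyArchive), the forge.example URLs and the in-memory "API pages" are all hypothetical and are not the interfaces of the actual Software Heritage lister or loader packages.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass(frozen=True)
class Origin:
    """A place where source code can be found (hypothetical, simplified)."""
    url: str
    kind: str  # e.g. "git", "tar", "npm"


class ToyForgeLister:
    """Hypothetical lister: enumerates the projects hosted by a forge.

    A real lister would page through the forge's HTTP API; here the
    "API" is an in-memory list of pages so the sketch runs offline.
    """

    def __init__(self, pages: List[List[str]]) -> None:
        self._pages = pages

    def list_origins(self) -> Iterator[Origin]:
        for page in self._pages:  # one iteration per API page
            for project_url in page:
                yield Origin(url=project_url, kind="git")


@dataclass
class ToyArchive:
    """Stand-in for the archive: maps an origin URL to its file paths."""
    snapshots: Dict[str, List[str]] = field(default_factory=dict)


class ToyLoader:
    """Hypothetical loader: adapts one bundle to the archive's model."""

    def __init__(self, archive: ToyArchive) -> None:
        self._archive = archive

    def load(self, origin: Origin, files: Dict[str, bytes]) -> None:
        # "Adapting" is reduced here to storing the sorted file paths.
        self._archive.snapshots[origin.url] = sorted(files)


def run_demo() -> ToyArchive:
    """List two made-up origins, then load a fake bundle for each."""
    lister = ToyForgeLister(
        [["https://forge.example/a"], ["https://forge.example/b"]]
    )
    archive = ToyArchive()
    loader = ToyLoader(archive)
    for origin in lister.list_origins():
        loader.load(origin, {"README": b"hello", "LICENSE": b"..."})
    return archive
```

The point of the split is that the lister only discovers what exists, while the loader only knows how to ingest one kind of bundle, so a new forge that hosts Git repositories would need a new lister but could reuse an existing loader.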

Writing a missing lister or loader is a great way to help expand the coverage of the Software Heritage archive! Feel free to propose the implementation of one (or several!) listers or loaders that are currently missing. For inspiration you can check out our Suggestion box for code to archive, or propose your favorite missing forge or package repository.

Mine information from archived content

In addition to archival, Software Heritage indexes the retrieved source code artifacts, to enable semantic searches on the archive and scientific research.

Indexing can happen at the individual file level (e.g., detecting the programming language a file is written in or the license declared in its header), or at a coarser granularity (e.g., extracting the metadata declared for the most recently archived version of a given project).

A number of indexes are currently supported, such as:

  • file level mining:
    • MIME type detection (using libmagic)
    • license detection (using FOSSology/nomossa)
    • language detection (using Pygments)
    • ctags extraction (using universal-ctags)
  • project level mining:
    • Ruby gemspec metadata
    • Python PKG-INFO metadata
    • Maven pom.xml metadata
    • NPM package.json metadata
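
To give a feel for what project level mining involves, here is a minimal sketch of a metadata extractor for NPM package.json files. It is not the code of the existing indexer: the function name extract_npm_metadata and the keys of the returned record are assumptions made for illustration, and only the Python standard library is used.

```python
import json
from typing import Any, Dict


def extract_npm_metadata(raw: bytes) -> Dict[str, Any]:
    """Return a small, uniform metadata record from a package.json file.

    The output keys are hypothetical; they are not the schema used by
    the Software Heritage indexer.
    """
    try:
        doc = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return {}  # unreadable file: nothing to index
    if not isinstance(doc, dict):
        return {}  # valid JSON, but not a package description

    # "license" is usually a string; some older packages use {"type": ...}.
    license_field = doc.get("license")
    if isinstance(license_field, dict):
        license_field = license_field.get("type")

    # Ignore malformed "keywords" values instead of guessing their meaning.
    keywords = doc.get("keywords")
    if not isinstance(keywords, list):
        keywords = []

    return {
        "name": doc.get("name"),
        "version": doc.get("version"),
        "license": license_field,
        "keywords": keywords,
    }
```

An indexer working on archived content has to tolerate broken input, which is why the sketch returns an empty record rather than raising when the file is not valid JSON.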

Writing additional indexers that extract more information from archived source code is welcome and would constitute a suitable GSoC project.

Name the kind of data mining you want to do!

For inspiration you can have a look at Libraries.io: most package formats and package managers have dedicated ways of expressing metadata, and we only support a small number of them so far. But do not restrict your ambition to those; any kind of data extraction or mining you want to do on the archive could work.

Improve and extend the archive Web UI

As you probably know already, the Software Heritage archive can be browsed on the Web. The code powering that interface is a Django application that also implements a Web API.

Several improvements are possible on the archive Web interface and would make great GSoC projects. Some ideas to whet your appetite:

  • improve navigation on mobile devices and browsers
  • add new source code search criteria and improve the search interface
  • add developer-oriented features, e.g., source file history, blame/praise interface, in-browser edit (with patch download), ...
  • improve accessibility
  • add missing API endpoints (name your pet peeves!)
  • add end-to-end tests using Selenium

Research internships

For the more research-inclined students, we also maintain a separate list of Internships. Any topic there is also a viable GSoC project.

Contact

GSoC students are encouraged to get in touch with the Software Heritage community using the standard development communication channels:

  • the #swh-devel IRC channel on Libera Chat
  • the swh-devel mailing list

See our development information page for more details.

Timeline

See the official Google Summer of Code timeline.