Improving the scheduler (GSoC task)

From Software Heritage Wiki

Latest revision as of 10:33, 2 March 2021

Introduction

Software Heritage periodically crawls over 150 million software origins (see https://archive.softwareheritage.org/). This is a challenge because our loading infrastructure is not infinite, and loading VCS history is a somewhat computationally expensive task.

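The swh-scheduler component (see the code at https://forge.softwareheritage.org/source/swh-scheduler/ and the documentation at https://docs.softwareheritage.org/devel/swh-scheduler/index.html) is in charge of scheduling loading tasks regularly. It tries to be smart about the intervals to avoid unnecessary work, for example by visiting active origins more often than inactive ones.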
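
To make the idea of "smart" intervals concrete, here is a minimal sketch of one possible rule: shorten the delay before the next visit when a visit found new content, and lengthen it otherwise. This is not taken from swh-scheduler; the function name, the halving/doubling factors, and the bounds are all assumptions chosen for illustration.

```python
from datetime import timedelta

# Hypothetical bounds, chosen for illustration only (not swh-scheduler values).
MIN_INTERVAL = timedelta(hours=12)
MAX_INTERVAL = timedelta(days=64)


def next_interval(current: timedelta, had_changes: bool) -> timedelta:
    """Halve the interval after an eventful visit, double it otherwise.

    The result is clamped to [MIN_INTERVAL, MAX_INTERVAL] so that active
    origins are not visited too aggressively and inactive ones are never
    forgotten entirely.
    """
    proposed = current / 2 if had_changes else current * 2
    return max(MIN_INTERVAL, min(MAX_INTERVAL, proposed))
```

Under this rule, an origin that keeps changing converges towards the minimum interval, while an origin that never changes drifts towards the maximum one.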

Task description

You will improve the task scheduler to make smarter choices about which origins to visit and when.

We recently redesigned the scheduler to allow "pluggable" scheduling policies, so you can write new ones and see how they compare to the simple ones we already wrote.

We also wrote a simulator to try out the scheduler with tasks on virtual origins. You can also work on making it more realistic, as that would be very useful for seeing how each policy performs.
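
As a rough illustration of what comparing policies on virtual origins could look like, the toy sketch below runs a policy against origins that change with a fixed probability per tick and counts how many visits actually found something new. It is not the swh-scheduler simulator or its policy interface: the `Policy` signature, the two example policies, and the "useful visits" metric are all assumptions made up for this example.

```python
import random
from typing import Callable, Dict, List

# Hypothetical policy interface: given the tick of each origin's last visit
# and a per-tick visit budget, return the origins to visit now.
Policy = Callable[[Dict[str, int], int], List[str]]


def oldest_first(last_visit: Dict[str, int], budget: int) -> List[str]:
    """Visit the origins that have waited longest since their last visit."""
    return sorted(last_visit, key=lambda origin: last_visit[origin])[:budget]


def make_random_policy(seed: int) -> Policy:
    """Build a baseline policy that picks origins uniformly at random."""
    rng = random.Random(seed)

    def policy(last_visit: Dict[str, int], budget: int) -> List[str]:
        return rng.sample(sorted(last_visit), budget)

    return policy


def simulate(
    policy: Policy,
    change_rates: Dict[str, float],
    ticks: int,
    budget: int,
    seed: int = 0,
) -> int:
    """Return the number of visits that found new changes.

    change_rates maps each virtual origin to the probability that it
    receives new content during one tick.
    """
    rng = random.Random(seed)
    last_visit = {origin: 0 for origin in change_rates}
    pending = {origin: False for origin in change_rates}
    useful_visits = 0
    for tick in range(1, ticks + 1):
        # Virtual origins receive new content at random.
        for origin, rate in change_rates.items():
            if rng.random() < rate:
                pending[origin] = True
        # The policy spends its visit budget for this tick.
        for origin in policy(last_visit, budget):
            if pending[origin]:
                useful_visits += 1
                pending[origin] = False
            last_visit[origin] = tick
    return useful_visits
```

Running `simulate` with the same `change_rates`, `ticks`, `budget`, and `seed` for several policies gives a first, very crude basis for comparison. A more realistic simulator would also model visit duration, failures, and origins appearing or disappearing over time.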

Desirable skills

  • Python 3 and Git are a must to work on any Software Heritage project
  • PostgreSQL, as the scheduler heavily relies on it to be efficient
  • Experience with scientific simulation is a plus, but not required

Potential mentors

  • Nicolas Dandrimont (olasd on IRC)
  • Valentin Lorentz (vlorentz on IRC)