Difference between revisions of "Google Summer of Code 2019/Increase archive coverage"

From Software Heritage Wiki
 
(4 intermediate revisions by 2 users not shown)
Line 3: Line 3:
  
 
=== Description:===
 
=== Description:===
As Software Heritage works on archiving and sharing source code, one of the major tasks is to ingest the latest source code available in the database from time to time and from all the possible sources where you can fetch the source code using listers and ingest them using loaders. [https://docs.softwareheritage.org/devel/swh-lister/index.html#swh-lister Listers] are components that crawl the APIs of software forges (e.g., Bitbucket, GitHub, Sourceforge, ...) and return a list of the software available in it whereas [Loaders take a bundle of software (tarball, Git repository ...) and load it into Software Heritage, by adapting it so that it matches the archive data model. The goal of this project is to increase the archive coverage by making listers and loaders for different websites that which stores source code, so that Software Heritage can fetch as much source code as possible and store it in the database to preserve it for future generations.
+
 
 +
 
 +
The goal of this project is to increase the archive coverage by making listers and loaders for different forges.  
 +
[https://docs.softwareheritage.org/devel/swh-lister/index.html#swh-lister Listers] are components that crawl the APIs of software forges (e.g., Bitbucket, GitHub, SourceForge, ...) and return a list of the software available on them. Loaders take a bundle of software (tarball, Git repository, ...) and load it into Software Heritage, adapting it so that it matches the archive data model.
  
 
===Student: ===  
 
===Student: ===  
 
Archit Agrawal
 
Archit Agrawal
 
* [https://forge.softwareheritage.org/p/nahimilega/ Forge activity]
 
* [https://forge.softwareheritage.org/p/nahimilega/ Forge activity]
 +
* [https://wiki.softwareheritage.org/wiki/Google_Summer_of_Code_2019/Increase_archive_coverage/Commit_list List Of Commits]
  
 
=== Mentors:===
 
=== Mentors:===
Line 27: Line 31:
 
*** [https://forge.softwareheritage.org/T1724 Maven Lister]
 
*** [https://forge.softwareheritage.org/T1724 Maven Lister]
 
**  [https://forge.softwareheritage.org/rDLS08ade29e6de0616a3964360454ab52b58c082b75 Add tests to PyPI Lister]
 
**  [https://forge.softwareheritage.org/rDLS08ade29e6de0616a3964360454ab52b58c082b75 Add tests to PyPI Lister]
** [https://forge.softwareheritage.org/rDLSf424f07c7e628eb7a19d25f4fdb749682d97a21f Refractor base tests for listers]
+
** [https://forge.softwareheritage.org/rDLSf424f07c7e628eb7a19d25f4fdb749682d97a21f Refactor base tests for listers]
 
**  [https://forge.softwareheritage.org/D1441 Add documentation on *How to run a new lister*]
 
**  [https://forge.softwareheritage.org/D1441 Add documentation on *How to run a new lister*]
 
* '''Loaders:'''
 
* '''Loaders:'''
Line 33: Line 37:
 
*** Ingesting source code follows a broadly similar process for all package managers. This calls for a common base implementation for loading content from package managers into the archive. I worked on this idea, analysing the steps required to build a loader and the implementation of the existing package manager loaders. I came up with a plan to implement the base loader and made a pass at it ([https://forge.softwareheritage.org/D1694 D1694], [https://forge.softwareheritage.org/D1810 D1810], [https://forge.softwareheritage.org/D1811 D1811], [https://forge.softwareheritage.org/D1812 D1812], [https://forge.softwareheritage.org/D1813 D1813], [https://forge.softwareheritage.org/D1814 D1814], [https://forge.softwareheritage.org/D1744 D1744]). However, following my mentor's recommendation, we changed the approach: instead of building the whole base loader in one go, we decided to break it into three steps and proceed incrementally.
 
*** Ingesting source code follows a broadly similar process for all package managers. This calls for a common base implementation for loading content from package managers into the archive. I worked on this idea, analysing the steps required to build a loader and the implementation of the existing package manager loaders. I came up with a plan to implement the base loader and made a pass at it ([https://forge.softwareheritage.org/D1694 D1694], [https://forge.softwareheritage.org/D1810 D1810], [https://forge.softwareheritage.org/D1811 D1811], [https://forge.softwareheritage.org/D1812 D1812], [https://forge.softwareheritage.org/D1813 D1813], [https://forge.softwareheritage.org/D1814 D1814], [https://forge.softwareheritage.org/D1744 D1744]). However, following my mentor's recommendation, we changed the approach: instead of building the whole base loader in one go, we decided to break it into three steps and proceed incrementally.
 
**'''[https://forge.softwareheritage.org/D1824 GNU Loader]'''
 
**'''[https://forge.softwareheritage.org/D1824 GNU Loader]'''
*** As part of the first step towards the implementation of Base Loader, GNU Loader was implemented.  
+
*** As part of the first step towards the implementation of Base Loader, GNU Loader was implemented.
  
 
===TO-DO:===
 
===TO-DO:===
Line 41: Line 45:
  
 
=== Learnings: ===
 
=== Learnings: ===
Working in Software Heritage was a wholesome experience. I got to learn a new thing almost every day. It would me injustice id I say I can account all my learnings in a section of a blog, however here are a list of few of most prominent once:  
+
Working at Software Heritage was a wholesome experience. I got to learn a new thing almost every day. Here are a few of the most prominent ones:
*Working on a huge codebase
+
*Work on a huge codebase
 
*Plan and design before jumping to code
 
*Plan and design before jumping to code
*Writing clean and well-commented code
+
*Write clean and well-commented code
*Difference between doing projects in college and in the industry(Spoiler Alert: '''A lot''')
+
*Learn the difference between doing projects in college and in the industry (Spoiler Alert: '''A lot''')
 
*Multiple language integration in a Python library (used in the CRAN Lister)
 
*Multiple language integration in a Python library (used in the CRAN Lister)
 
*Different programming methodologies explained to me by my mentors (e.g., [https://en.wikipedia.org/wiki/Test-driven_development TDD])
 
*Different programming methodologies explained to me by my mentors (e.g., [https://en.wikipedia.org/wiki/Test-driven_development TDD])
*Working with git and forge
+
*Work with tools: DVCS (Git), issue tracker (Phabricator forge), containerization/virtualization (Docker)
*Working with Docker
 
  
 
=== Activity reports:===
 
=== Activity reports:===

Latest revision as of 21:46, 28 August 2019

Title:

Increase archive coverage

Description:

The goal of this project is to increase the archive coverage by making listers and loaders for different forges. Listers are components that crawl the APIs of software forges (e.g., Bitbucket, GitHub, SourceForge, ...) and return a list of the software available on them. Loaders take a bundle of software (tarball, Git repository, ...) and load it into Software Heritage, adapting it so that it matches the archive data model.
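
To make the split between the two components concrete, here is a minimal toy sketch in Python. It does not use the actual swh-lister or swh-loader code: the names Origin, Archive, list_origins, load_tarball and make_tarball, the forge.example URLs, the shape of the "forge API" pages, and the choice of SHA-1 as the content key are all assumptions made for illustration only.

```python
import hashlib
import io
import tarfile
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Origin:
    """A place where source code can be fetched from (hypothetical model)."""
    url: str
    kind: str  # e.g. "tar" or "git"


def list_origins(forge_pages):
    """Toy lister: walk paginated forge API responses and yield origins."""
    for page in forge_pages:
        for repo in page["repositories"]:
            yield Origin(url=repo["url"], kind=repo["type"])


@dataclass
class Archive:
    """Toy archive: file contents deduplicated by the SHA-1 of their bytes."""
    contents: dict = field(default_factory=dict)   # sha1 hex -> bytes
    snapshots: dict = field(default_factory=dict)  # origin url -> {path: sha1}


def load_tarball(archive, origin, tarball_bytes):
    """Toy loader: unpack a tarball and map each file onto the archive model."""
    tree = {}
    with tarfile.open(fileobj=io.BytesIO(tarball_bytes)) as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue  # skip directories, links, etc. in this sketch
            data = tar.extractfile(member).read()
            digest = hashlib.sha1(data).hexdigest()
            archive.contents.setdefault(digest, data)  # store each blob once
            tree[member.name] = digest
    archive.snapshots[origin.url] = tree
    return tree


def make_tarball(files):
    """Build an in-memory tarball from {path: bytes}; used for the demo only."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for path, data in files.items():
            info = tarfile.TarInfo(name=path)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


# Demo: two pages of a made-up forge API, each origin served the same bundle.
pages = [
    {"repositories": [{"url": "https://forge.example/a.tar", "type": "tar"}]},
    {"repositories": [{"url": "https://forge.example/b.tar", "type": "tar"}]},
]
archive = Archive()
for origin in list_origins(pages):
    bundle = make_tarball({"README": b"hello\n", "src/main.py": b"print('hi')\n"})
    load_tarball(archive, origin, bundle)
```

In this sketch the lister only discovers where software lives, while the loader is the part that knows how to turn one kind of bundle into archive objects. Keeping them separate means a new forge only needs a new lister as long as an existing loader already understands the bundle type it serves.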

Student:

Archit Agrawal

Mentors:

  • Nicolas Dandrimont
  • Antoine R. Dumont

Work Done:

TO-DO:

  • Implement the listers for Launchpad and RubyGems, using the research done and the implementation plans made.
  • Find workarounds for the challenges in building the Maven and NuGet (.NET) listers.
  • Work on the remaining steps to complete the base package manager loader.

Learnings:

Working at Software Heritage was a wholesome experience. I got to learn a new thing almost every day. Here are a few of the most prominent ones:

  • Work on a huge codebase
  • Plan and design before jumping to code
  • Write clean and well-commented code
  • Learn the difference between doing projects in college and in the industry (Spoiler Alert: A lot)
  • Multiple language integration in a Python library (used in the CRAN Lister)
  • Different programming methodologies explained to me by my mentors (e.g., TDD)
  • Work with tools: DVCS (Git), issue tracker (Phabricator forge), containerization/virtualization (Docker)

Activity reports:

Links