Crawling project metadata (internship): Difference between revisions

Latest revision as of 13:34, 26 September 2018

Building the semantic web of free software projects

(English description follows)

Context: Software Heritage, a large-scale research project whose goal is to collect, preserve over the very long term, and share all publicly accessible Free Software in source code form.

Description: There are millions of free software projects, hosted on hundreds of different platforms and often duplicated. To navigate this graph of software projects, it is important to have relevant metadata, and several efforts exist, built around Semantic Web technologies such as DOAP or schema.org. The goal of this internship is to collect the existing metadata, harmonize it, and integrate it into one of the largest collections of free software in the world.

Desired knowledge for this internship:

information retrieval
knowledge modeling and representation
manipulation of semi-structured data (HTML, XML, etc.)
some notions of automatic classification and machine learning could be useful, but are not required

Host institution: Inria Paris

Environment: you will be fully immersed in the team building the Software Heritage archive, and you will be able to observe at close range the construction of a project of worldwide scope.

Supervisors:

Roberto Di Cosmo <roberto@dicosmo.org>
Stefano Zacchiroli <zack@upsilon.cc>

Building the Semantic Web of FOSS projects

Context: Software Heritage is an ambitious research project whose goal is to collect, preserve over the very long term, and share all publicly accessible Free/Open Source Software (FOSS) in source code form.

Description: FOSS today makes up a significant part of humanity's software production: there are millions of FOSS projects, developed on hundreds of different hosting platforms. Popular projects are often cloned or forked, and many projects change their primary hosting platform during their lifetime, leading to a tangled web that is not easy to explore.

To navigate and search this maze of FOSS projects, one needs quick and easy access to the relevant project metadata, which is available in various formats and ontologies, such as DOAP, ADMS.sw or schema.org. The goal of this internship is to collect existing metadata, reconcile it to some extent, and integrate it into the largest FOSS source code archive in the world.
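
As a rough illustration of what reconciling metadata could mean in practice, the sketch below maps two made-up project descriptions, one using schema.org `SoftwareSourceCode` property names and one using DOAP property names, onto a single common record. The common field names, the two mapping tables, the sample records, and the `normalize` and `reconcile` helpers are all assumptions made for this example; they do not describe the actual Software Heritage implementation.

```python
# Hypothetical mapping tables: source property name -> common field name.
# The property names on the left exist in schema.org and DOAP; the common
# field names on the right are invented for this sketch.
SCHEMA_ORG_MAP = {
    "name": "name",
    "description": "description",
    "codeRepository": "repository",
    "programmingLanguage": "language",
    "license": "license",
}
DOAP_MAP = {
    "doap:name": "name",
    "doap:description": "description",
    "doap:repository": "repository",
    "doap:programming-language": "language",
    "doap:license": "license",
}


def normalize(record, mapping):
    """Rename known properties to the common field names; drop the rest."""
    return {mapping[key]: value for key, value in record.items() if key in mapping}


def reconcile(*normalized_records):
    """Merge normalized records; the first non-empty value for a field wins."""
    merged = {}
    for record in normalized_records:
        for field, value in record.items():
            if value and field not in merged:
                merged[field] = value
    return merged


# Made-up sample metadata, already parsed into plain dictionaries.
schema_org_record = {
    "@type": "SoftwareSourceCode",
    "name": "example-project",
    "codeRepository": "https://example.org/example-project.git",
    "programmingLanguage": "Python",
}
doap_record = {
    "doap:name": "example-project",
    "doap:description": "An example project used for illustration.",
    "doap:license": "https://spdx.org/licenses/MIT",
}

common = reconcile(
    normalize(schema_org_record, SCHEMA_ORG_MAP),
    normalize(doap_record, DOAP_MAP),
)
print(common)
```

A real crawler would first have to parse RDF/XML or JSON-LD with a proper library and would need an explicit policy for conflicting values; "first non-empty value wins" is only the simplest possible choice.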

Desirable skills for this internship:

information retrieval
knowledge modeling and representation
markup languages and manipulation of semi-structured data (HTML, XML, etc.)
working knowledge of automatic classification and machine learning would be a plus

Workplace: Inria Paris

Environment: you will work shoulder to shoulder with all members of the Software Heritage team, and you will have a chance to witness from within the construction of the ultimate source code archive.

Internship mentors:

Roberto Di Cosmo <roberto@dicosmo.org>
Stefano Zacchiroli <zack@upsilon.cc>

@@ Line 1: / Line 1: @@
 == Construire le web sémantique des projets logiciels libres ==
-'''Contexte''': projet de recherche de grande envergure ayant comme but la
+(english description follows)
-récupération, l'organisation, et l'archivage à très long terme (siècles) de la
-totalité du logiciel libre publiquement accessible via Internet.
+'''Contexte''': [https://www.softwareheritage.org/ Software Heritage], projet
+de recherche de grande envergure ayant comme but la récupération, l'archivage
+à très long terme, et le partage de la totalité du Logiciel Libre publiquement
+accessible en format code source.
 '''Description''':
@@ Line 18: / Line 21: @@
 * modélisation et représentation des connaissances
 * manipulation de données semi-structurées (HTML, XML, etc.)
+* des notions de classification automatique et machine learning pourraient être utiles, mais ne sont pas indispensables
 '''Établissement d'accueil''': Inria Paris
+'''Environnement''': vous serez en immersion totale avec l'équipe qui construit l'archive de Software Heritage, et vous aurez la possibilité d'observer de près la construction d'un projet d'envergure mondiale.
 '''Encadrants''':
@@ Line 26: / Line 32: @@
+== Building the Semantic Web of FOSS projects ==
+'''Context''': [https://www.softwareheritage.org/ Software Heritage] is an
+ambitious research project whose goal is to collect, preserve in the very long
+term, and share the whole publicly accessible Free/Open Source Software
+(FOSS) in source code form.
+'''Description''':
+FOSS is today a significant part of the software production of mankind:  there are millions
+of FOSS projects, developed on hundreds of different hosting platforms. Popular projects
+are often cloned or forked, and many projects change their reference hoster platform during
+their lifetime, leading to a tangled web which is not easy to explore.
+To navigate and search through this maze of FOSS projects, one needs quick and easy access
+to the relevant project metadata, which are available in various formats and
+ontologies, such as DOAP, ADMS.sw or schema.org. The goal of this internship is to collect
+existing metadata, reconciliate them to some extent, and integrate them into the
+biggest FOSS source code archive of the world.
+'''Desirable skills''' to obtain this internship:
+* information retrieval
+* knowledge modeling and representation
+* markup languages and manipulation of semi-structured data (HTML, XML, etc.)
+* working knowledge of automatic classification and machine learning would be a plus
+'''Workplace''': Inria Paris
+'''Environnement''': you will work shoulder to shoulder with all members of the
+Software Heritage team, and you will have a chance to witness from within the
+construction of the ultimate source code archive.
+'''Internship mentors''':
+* Roberto Di Cosmo <roberto@dicosmo.org>
+* Stefano Zacchiroli <zack@upsilon.cc>
+[[Category:Completed internship]]
 [[Category:Internship]]
 [[Category:Lang:French]]
+[[Category:Lang:English]]
