<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.softwareheritage.org/index.php?action=history&amp;feed=atom&amp;title=Graph_Dataset_on_Amazon_Athena</id>
	<title>Graph Dataset on Amazon Athena - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.softwareheritage.org/index.php?action=history&amp;feed=atom&amp;title=Graph_Dataset_on_Amazon_Athena"/>
	<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Graph_Dataset_on_Amazon_Athena&amp;action=history"/>
	<updated>2026-04-20T13:59:04Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.39.10</generator>
	<entry>
		<id>https://wiki.softwareheritage.org/index.php?title=Graph_Dataset_on_Amazon_Athena&amp;diff=975&amp;oldid=prev</id>
		<title>Seirl: Created page with &quot;== Using Amazon Athena ==  The ''Software Heritage graph dataset'' is available as a public dataset in [https://aws.amazon.com/athena/ Amazon Athena].  === Setup ===  In order...&quot;</title>
		<link rel="alternate" type="text/html" href="https://wiki.softwareheritage.org/index.php?title=Graph_Dataset_on_Amazon_Athena&amp;diff=975&amp;oldid=prev"/>
		<updated>2019-03-11T18:09:57Z</updated>

		<summary type="html">&lt;p&gt;Created page with &amp;quot;== Using Amazon Athena ==  The &amp;#039;&amp;#039;Software Heritage graph dataset&amp;#039;&amp;#039; is available as a public dataset in [https://aws.amazon.com/athena/ Amazon Athena].  === Setup ===  In order...&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;== Using Amazon Athena ==&lt;br /&gt;
&lt;br /&gt;
The ''Software Heritage graph dataset'' is available as a public dataset in [https://aws.amazon.com/athena/ Amazon Athena].&lt;br /&gt;
&lt;br /&gt;
=== Setup ===&lt;br /&gt;
&lt;br /&gt;
To query the dataset using Athena, you first need to [https://aws.amazon.com/premiumsupport/knowledge-center/create-and-activate-aws-account/ create an AWS account and set up billing].&lt;br /&gt;
&lt;br /&gt;
Once your AWS account is ready, you will need to install a few dependencies on your machine:&lt;br /&gt;
&lt;br /&gt;
* Python 3&lt;br /&gt;
* The [https://docs.aws.amazon.com/cli/index.html aws cli]&lt;br /&gt;
* The [https://boto3.amazonaws.com/v1/documentation/api/latest/index.html boto3 Python package]&lt;br /&gt;
&lt;br /&gt;
On Debian, the dependencies can be installed with the following command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;sudo apt install python3 python3-boto3 awscli&amp;lt;/pre&amp;gt;&lt;br /&gt;
Once the dependencies are installed, run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;aws configure&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and enter your AWS Access Key ID and your AWS Secret Access Key when prompted, so that the Python scripts can access your AWS account.&lt;br /&gt;
&lt;br /&gt;
=== Create the tables ===&lt;br /&gt;
&lt;br /&gt;
To import the schema of the dataset into your account, download the scripts from the [https://annex.softwareheritage.org/public/dataset/swh-graph-2019-01-28/athena/ athena/] folder, then run the following command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;./gen_schema.py&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This will create the required tables in your AWS account. You can check that the tables were successfully created by going to the [https://console.aws.amazon.com/athena/home Amazon Athena console] and selecting the &amp;quot;swh&amp;quot; database.&lt;br /&gt;
&lt;br /&gt;
=== Run queries ===&lt;br /&gt;
&lt;br /&gt;
From the console, once you have selected the &amp;quot;swh&amp;quot; database, you can directly run queries from the Query Editor.&lt;br /&gt;
&lt;br /&gt;
Here is an example query that computes the most frequent file name in the archive:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;SELECT FROM_UTF8(name, '?') AS name,&lt;br /&gt;
  COUNT(DISTINCT target) AS cnt&lt;br /&gt;
  FROM directory_entry_file&lt;br /&gt;
  GROUP BY name&lt;br /&gt;
  ORDER BY cnt DESC&lt;br /&gt;
  LIMIT 1;&amp;lt;/pre&amp;gt;&lt;br /&gt;
More documentation is available in the [https://docs.aws.amazon.com/athena/index.html Amazon Athena documentation].&lt;/div&gt;</summary>
		<author><name>Seirl</name></author>
	</entry>
</feed>