Graph Dataset on Amazon Athena

From Software Heritage Wiki
Revision as of 18:09, 11 March 2019 by Seirl

Using Amazon Athena

The Software Heritage graph dataset is available as a public dataset in Amazon Athena.

Setup

To query the dataset using Athena, you will first need to create an AWS account and set up billing.

Once your AWS account is ready, you will need to install a few dependencies on your machine: Python 3, the boto3 library, and the AWS command-line interface. On Debian, they can be installed with the following command:

sudo apt install python3 python3-boto3 awscli

Once the dependencies are installed, run:

aws configure

When prompted, enter your AWS Access Key ID and your AWS Secret Access Key. This gives the Python scripts access to your AWS account.
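To confirm that Python can see the configured credentials, a minimal sketch along the following lines may help. It assumes boto3 is installed and relies on the STS `get_caller_identity` call. The names `describe_identity` and `main` are illustrative and are not part of the Software Heritage scripts.

```python
def describe_identity(sts_client):
    """Return a one-line summary of the AWS identity behind the credentials."""
    identity = sts_client.get_caller_identity()
    return f"account {identity['Account']} ({identity['Arn']})"


def main():
    # boto3 is imported here so the helper above can be exercised without it.
    import boto3

    # boto3 reads the keys that `aws configure` stored in your AWS config files.
    print(describe_identity(boto3.client("sts")))
```

Calling `main()` after `aws configure` should print your account number and user ARN. A credentials error at this point suggests that the configuration step did not complete.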

Create the tables

To import the schema of the dataset into your account, download the scripts from the athena/ folder, then run the following command:

./gen_schema.py

This will create the required tables in your AWS account. You can check that the tables were successfully created by going to the Amazon Athena console and selecting the "swh" database.
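The same check can also be scripted. The sketch below is one possible approach, not part of the Software Heritage scripts: it assumes the tables live in the default Athena data catalog (`AwsDataCatalog`) and that a default region was set during `aws configure`. The function name `list_tables` is hypothetical.

```python
def list_tables(athena_client, database="swh", catalog="AwsDataCatalog"):
    """Return the sorted table names of `database`, following pagination."""
    names = []
    kwargs = {"CatalogName": catalog, "DatabaseName": database}
    while True:
        page = athena_client.list_table_metadata(**kwargs)
        names.extend(table["Name"] for table in page.get("TableMetadataList", []))
        token = page.get("NextToken")
        if not token:
            return sorted(names)
        # Ask for the next page of results on the following iteration.
        kwargs["NextToken"] = token
```

With boto3 installed, `list_tables(boto3.client("athena"))` should return the tables created by the schema script, including `directory_entry_file`.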

Run queries

From the console, once you have selected the "swh" database, you can directly run queries from the Query Editor.

Here is an example query that computes the most frequent file names in the archive:

SELECT FROM_UTF8(name, '?') AS name,
       COUNT(DISTINCT target) AS cnt
FROM directory_entry_file
GROUP BY name
ORDER BY cnt DESC
LIMIT 10;
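Queries can also be submitted from Python instead of the console. The following is a minimal sketch using the boto3 Athena client, under a few assumptions: `output_location` is an S3 URI in a bucket you own (Athena writes query results there), a default region is configured, and the result fits in the first page returned by `get_query_results`. The function name `run_query` is hypothetical.

```python
import time


def run_query(athena_client, sql, output_location, database="swh", poll_seconds=2.0):
    """Run `sql` in Athena and return the first page of result rows as lists of strings."""
    started = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Athena runs queries asynchronously, so poll until a final state is reached.
    while True:
        execution = athena_client.get_query_execution(QueryExecutionId=query_id)
        status = execution["QueryExecution"]["Status"]
        if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if status["State"] != "SUCCEEDED":
        reason = status.get("StateChangeReason", "no reason given")
        raise RuntimeError(f"query {query_id} ended as {status['State']}: {reason}")

    rows = athena_client.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    # For SELECT queries the first row holds the column names, so skip it.
    return [[cell.get("VarCharValue") for cell in row["Data"]] for row in rows[1:]]
```

To use it, pass the query above as a string together with `boto3.client("athena")` and your own S3 output location. Each returned row is a list of column values as strings, here the file name followed by its count.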

More documentation is available in the official Amazon Athena documentation.