Graph Dataset on Amazon Athena
Using Amazon Athena
The Software Heritage graph dataset is available as a public dataset in Amazon Athena.
Setup
To query the dataset using Athena, you will first need to create an AWS account and set up billing.
Once your AWS account is ready, you will need to install a few dependencies on your machine:
- Python 3
- The AWS CLI
- The boto3 Python package
On Debian, the dependencies can be installed with the following command:
sudo apt install python3 python3-boto3 awscli
Once the dependencies are installed, run:
aws configure
and enter your AWS Access Key ID and your AWS Secret Access Key when prompted, to give Python access to your AWS account.
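To confirm that the credentials saved by aws configure are visible to Python, one option is to ask AWS STS which identity they belong to. The sketch below is only an illustration and is not part of the dataset's scripts: the helper name describe_caller is hypothetical, and it expects an STS client created with boto3.

```python
def describe_caller(sts_client):
    """Return a short description of the identity behind the given STS client.

    `sts_client` is expected to be a boto3 STS client; the helper name is
    hypothetical and only used for this illustration.
    """
    # get_caller_identity() needs no special IAM permission and returns the
    # UserId, Account and Arn of the credentials in use.
    identity = sts_client.get_caller_identity()
    return f"account {identity['Account']} as {identity['Arn']}"
```

With working credentials, calling print(describe_caller(boto3.client("sts"))) after import boto3 should print your account number and user ARN; an error at this point usually means the credentials were not saved correctly.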
Create the tables
To import the schema of the dataset into your account, download the scripts from the athena/ folder, then run the following command:
./gen_schema.py
This will create the required tables in your AWS account. You can check that the tables were successfully created by going to the Amazon Athena console and selecting the "swh" database.
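The same check can also be done from Python instead of the console. The following is a minimal sketch, assuming the tables were registered in Athena's default data catalog (named AwsDataCatalog) under the "swh" database; the helper name list_tables is hypothetical.

```python
def list_tables(athena_client, database="swh", catalog="AwsDataCatalog"):
    """Return the sorted names of all tables Athena sees in `database`.

    `athena_client` is expected to be a boto3 Athena client. The catalog
    default is an assumption: it is the name of Athena's built-in catalog.
    """
    names = []
    kwargs = {"CatalogName": catalog, "DatabaseName": database}
    while True:
        page = athena_client.list_table_metadata(**kwargs)
        names.extend(table["Name"] for table in page.get("TableMetadataList", []))
        # Results are paginated; keep going while a continuation token is returned.
        token = page.get("NextToken")
        if not token:
            return sorted(names)
        kwargs["NextToken"] = token
```

Calling list_tables(boto3.client("athena")) should return a list that includes directory_entry_file, the table used in the example query of this page.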
Run queries
From the console, once you have selected the "swh" database, you can directly run queries from the Query Editor.
Here is an example query that finds the most frequent file name in the archive; increase the LIMIT value to list more names:
SELECT FROM_UTF8(name, '?') AS name, COUNT(DISTINCT target) AS cnt FROM directory_entry_file GROUP BY name ORDER BY cnt DESC LIMIT 1;
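Since boto3 is already among the dependencies, the query above can also be submitted from a script rather than from the console. The sketch below shows one possible way to do it and is not taken from the dataset's scripts: the helper name run_query is hypothetical, and output_location must be an S3 location in a bucket you own, which Athena requires for storing results.

```python
import time

# The example query from this page, split over several lines for readability.
QUERY = """
SELECT FROM_UTF8(name, '?') AS name,
       COUNT(DISTINCT target) AS cnt
FROM directory_entry_file
GROUP BY name
ORDER BY cnt DESC
LIMIT 1
"""


def run_query(athena_client, sql, output_location, database="swh", poll_seconds=2.0):
    """Submit `sql` to Athena, wait for completion, and return rows as dicts.

    `athena_client` is expected to be a boto3 Athena client. Only the first
    page of results is read, which is enough for small LIMIT values.
    """
    started = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Poll until Athena reports a terminal state.
    while True:
        status = athena_client.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            break
        if state in ("FAILED", "CANCELLED"):
            reason = status["QueryExecution"]["Status"].get("StateChangeReason", "")
            raise RuntimeError(f"query {query_id} ended as {state}: {reason}")
        time.sleep(poll_seconds)

    # For SELECT queries, the first row of the first page holds the column names.
    result = athena_client.get_query_results(QueryExecutionId=query_id)
    rows = [
        [cell.get("VarCharValue") for cell in row["Data"]]
        for row in result["ResultSet"]["Rows"]
    ]
    header, data = rows[0], rows[1:]
    return [dict(zip(header, values)) for values in data]
```

A call could look like run_query(boto3.client("athena"), QUERY, "s3://your-bucket/athena-results/"), where the bucket name is a placeholder. All values come back as strings, so cnt needs to be converted with int() before doing arithmetic on it. Keep in mind that Athena bills by the amount of data scanned, whether the query is run from the console or from a script.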
More documentation on Amazon Athena is available in the official Amazon Athena documentation.