Commit f2f09eb

anjakefala and cpcloud authored
docs: add blogpost for Athena backend (#10796)
Co-authored-by: Phillip Cloud <[email protected]>
1 parent c34ab76 commit f2f09eb

2 files changed: +198 -0 lines changed

docs/_freeze/posts/ibis-athena/index/execute-results/html.json

Lines changed: 16 additions & 0 deletions

docs/posts/ibis-athena/index.qmd

Lines changed: 182 additions & 0 deletions
@@ -0,0 +1,182 @@
---
title: "Querying Amazon Athena from the comfort of your Python interpreter"
author: "Anja Boskovic"
error: false
date: "2025-02-04"
categories:
  - blog
  - athena
---

Have you ever wanted to harness the power of AWS Athena, but found yourself
tangled up in Presto SQL syntax? Good news! Ibis now supports [Amazon
Athena](https://aws.amazon.com/athena/) as its [newest
backend](https://ibis-project.org/backends/athena), bringing you the familiar
comfort of DataFrame operations while tapping into AWS's robust data lake
architecture.

## Why?

There's a lot to love about this integration. Athena's pay-per-query pricing
model means that you are billed for every query you run, based on the amount
of data it scans. Because Ibis optimises queries before executing them, you can
potentially reduce costs without needing to agonise over query efficiency.
Plus, since Athena can query data directly from S3, this new backend lets you
analyse your data lake contents with beloved Python libraries like PyArrow and
pandas without the hassle of downloading or moving massive datasets.

## Installation Prerequisites

Make sure you have an [IAM
account](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-prereqs.html)
and that your [credentials are in an expected location in your local
environment](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html).

Additionally, in the same account and region that you use for Athena, you will
need to [create an S3
bucket](https://docs.aws.amazon.com/AmazonS3/latest/user-guide/create-bucket.html)
where Athena can store query results. You will pass this bucket's URI as
`s3_staging_dir` in the connection call to the Athena backend.

::: {.callout-note}
If you are not able to query Athena through the AWS CLI, your queries will not
work through Ibis either. Please note that AWS charges will apply for the
Athena queries executed while following this tutorial.
:::

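If you are unsure whether your credentials are "in an expected location", a quick local check can help before you spend time debugging a failed connection. The sketch below only looks at the most common places (environment variables and the shared files under `~/.aws`); it does not cover every mechanism AWS supports, such as instance roles, and the helper name `find_aws_credentials` is made up for this post.

```python
import os
from pathlib import Path


def find_aws_credentials(env=None, home=None):
    """Describe where AWS credentials appear to live, or return None.

    Hypothetical helper: it checks only a few common locations.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    # 1. Static credentials exported as environment variables.
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "environment variables"

    # 2. Shared credentials file (its location can be overridden).
    creds_file = Path(
        env.get("AWS_SHARED_CREDENTIALS_FILE", home / ".aws" / "credentials")
    )
    if creds_file.is_file():
        return f"shared credentials file: {creds_file}"

    # 3. Shared config file, which may hold SSO or role-based profiles.
    config_file = Path(env.get("AWS_CONFIG_FILE", home / ".aws" / "config"))
    if config_file.is_file():
        return f"shared config file: {config_file}"

    return None


print(find_aws_credentials() or "No AWS credentials found in the usual places")
```

Finding a file here does not prove the credentials are valid or allowed to use Athena; running a query through the AWS CLI is still the more reliable test.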
## Installation

Install Ibis with the dependencies needed to work with AWS Athena:

```bash
$ pip install 'ibis-framework[athena]'
```

## Data

We are going to be creating some sample ecological data about ibis behaviour.
The data will contain one column each for observation date, species, location,
group size, behaviour, weather, and temperature.

```{python}
import pandas as pd
import numpy as np


def create_observations(n: int, seed: int = 42) -> pd.DataFrame:
    ibis_species = ["Sacred Ibis", "Scarlet Ibis", "Glossy Ibis", "White Ibis"]
    locations = ["Wetland", "Grassland", "Coastal"]
    behaviors = ["Feeding", "Nesting", "Flying"]
    weather_conditions = ["Sunny", "Rainy"]

    np.random.seed(seed)  # For reproducibility

    return pd.DataFrame(
        {
            "observation_date": np.full(n, np.datetime64("2024-01-01"))
            + np.random.randint(0, 365, size=n).astype("timedelta64[D]"),
            "species": np.random.choice(ibis_species, size=n),
            "location": np.random.choice(locations, size=n),
            "group_size": np.random.randint(1, 20, size=n),
            "behavior": np.random.choice(behaviors, size=n),
            "weather": np.random.choice(weather_conditions, size=n),
            "temperature_c": np.random.normal(25, 5, size=n)  # Mean 25°C, std 5°C
        }
    )


ibis_observations = create_observations(1000)
```

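The `observation_date` column is the least obvious part of that function: it starts from an array filled with a fixed date and adds a random whole number of days to each element. Here is the same trick in isolation, as a small sketch with five values instead of `n`.

```python
import numpy as np

np.random.seed(42)  # same seed as above, purely for reproducibility

start = np.datetime64("2024-01-01")

# Random integers in [0, 365) reinterpreted as day-granularity timedeltas.
offsets = np.random.randint(0, 365, size=5).astype("timedelta64[D]")

# datetime64[D] + timedelta64[D] yields datetime64[D].
dates = np.full(5, start) + offsets

print(dates.dtype)
print(dates.min() >= start, dates.max() <= np.datetime64("2024-12-30"))
```

Because the upper bound of `randint` is exclusive and 2024 is a leap year, the latest date this can produce is 2024-12-30.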
## Demo

Let's start by opening a connection to AWS Athena with Ibis, using the S3
bucket we created to store query results.

```{python}
from ibis.interactive import *

con = ibis.athena.connect(
    s3_staging_dir="s3://aws-athena-query-results-ibis-testing",
    region_name="us-east-2",
)
```
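If you would rather not hardcode the bucket and region, one option is to assemble the connection arguments separately and sanity-check them first. This is only a sketch: `athena_connect_kwargs` is a hypothetical helper written for this post, not part of Ibis, and falling back to the `AWS_REGION` and `AWS_DEFAULT_REGION` environment variables is an assumption about how your environment is set up.

```python
import os
from urllib.parse import urlparse


def athena_connect_kwargs(staging_dir, region=None, env=None):
    """Validate and assemble keyword arguments for `ibis.athena.connect`."""
    env = os.environ if env is None else env

    # The staging directory must look like s3://bucket or s3://bucket/prefix.
    parsed = urlparse(staging_dir)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"expected an s3://bucket[/prefix] URI, got {staging_dir!r}")

    # Prefer an explicit region, then fall back to the environment.
    region = region or env.get("AWS_REGION") or env.get("AWS_DEFAULT_REGION")
    if not region:
        raise ValueError(
            "no region given and neither AWS_REGION nor AWS_DEFAULT_REGION is set"
        )

    return {"s3_staging_dir": staging_dir, "region_name": region}


kwargs = athena_connect_kwargs(
    "s3://aws-athena-query-results-ibis-testing", region="us-east-2"
)
print(kwargs)
# con = ibis.athena.connect(**kwargs)  # requires working AWS credentials
```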

Let's create a database and load our `ibis_observations` pandas DataFrame into
it as a table.

```{python}
con.create_database("mydatabase", force=True)
# Drop any leftover table from a previous run in the same database
con.drop_table("ibis_observations", database="mydatabase", force=True)
con.create_table("ibis_observations", obj=ibis_observations, database="mydatabase")
con.list_tables(database="mydatabase")
```

And we can grab information about table schemas to help us out with our
queries:

```{python}
con.get_schema("ibis_observations", database="mydatabase")
```

Now we can grab the table and run some Ibis queries. For example, what is the
average group size by species?

```{python}
t = con.table("ibis_observations", database="mydatabase")

# Average group size by species
t.group_by("species").aggregate(avg_group=t.group_size.mean())
```

And Ibis does all the work of generating the SQL that Athena understands.

How about the most common behaviour during rainy weather?

```{python}
(
    t.filter(t.weather == "Rainy")
    .group_by("behavior")
    .aggregate(count=lambda t: t.count())
    .order_by(ibis.desc("count"))
)
```

What about temperature effects on behaviour?

```{python}
t.group_by("behavior").aggregate(avg_temp=t.temperature_c.mean()).order_by("avg_temp")
```
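
If you want to convince yourself that a query does what you expect, you can run the equivalent aggregation locally with pandas on a handful of rows. The tiny DataFrame below is made up for illustration and is not the data generated earlier.

```python
import pandas as pd

# Four hand-written rows, small enough to check by eye.
sample = pd.DataFrame(
    {
        "behavior": ["Feeding", "Feeding", "Nesting", "Flying"],
        "temperature_c": [24.0, 26.0, 30.0, 20.0],
    }
)

# Same shape as the Ibis query: group, average, then sort ascending.
avg_temp = (
    sample.groupby("behavior", as_index=False)
    .agg(avg_temp=("temperature_c", "mean"))
    .sort_values("avg_temp")
    .reset_index(drop=True)
)
print(avg_temp)
```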

Now that we're nearing the end of this demo, I want to show you that you can
also delete tables and databases using Ibis:

```{python}
con.drop_table("ibis_observations", database="mydatabase")
con.drop_database("mydatabase")
con.disconnect()
```

You didn't need to fiddle with Athena's SDK at all!

## How does this all work?

Under the hood, AWS Athena runs on a version of Trino (formerly known as
PrestoSQL). Instead of writing a completely new SQL compiler for Athena, we
were able to leverage Ibis' existing Trino compiler with some careful
adjustments.

This keeps the code small: the Athena backend implementation required only
about 40 lines of unique code.
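
To give a feel for why reuse keeps things so small, here is a toy illustration of the pattern. It is not Ibis's actual code: both class names, the `functions` table, and the `newer_feature` placeholder are invented for this post. The idea is that a subclass inherits everything from the existing compiler and only states where it differs.

```python
class TrinoLikeCompiler:
    """Toy stand-in for an existing SQL compiler (hypothetical)."""

    dialect = "trino"
    # Operation name -> SQL function name.
    functions = {"mean": "AVG", "count": "COUNT", "newer_feature": "NEWER_FEATURE"}
    unsupported = frozenset()

    def compile_call(self, op, column):
        if op in self.unsupported:
            raise NotImplementedError(f"{op!r} is not supported on {self.dialect}")
        return f'{self.functions[op]}("{column}")'


class AthenaLikeCompiler(TrinoLikeCompiler):
    """Inherits all behaviour and only declares the differences."""

    dialect = "athena"
    # Placeholder for a feature the older engine lacks (illustrative only).
    unsupported = frozenset({"newer_feature"})


compiler = AthenaLikeCompiler()
print(compiler.compile_call("mean", "group_size"))

try:
    compiler.compile_call("newer_feature", "group_size")
except NotImplementedError as exc:
    print(exc)
```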

There are some nuances to note: since Athena runs on an older version of Trino,
not all of Trino's newest features are available. For a detailed comparison of
supported features across different backends, please check out the [Ibis
backend support matrix](https://ibis-project.org/backends/support/matrix).

If you're new here, welcome. Here are some resources to learn more about Ibis:

- [Ibis Docs](https://ibis-project.org/)
- [Ibis GitHub](https://github.com/ibis-project/ibis)

Chat with us on Zulip:

- [Ibis Zulip Chat](https://ibis-project.zulipchat.com/)
