|
| 1 | +--- |
| 2 | +title: "Querying Amazon Athena from the comfort of your Python interpreter" |
| 3 | +author: "Anja Boskovic" |
| 4 | +error: false |
| 5 | +date: "2025-02-04" |
| 6 | +categories: |
| 7 | + - blog |
| 8 | + - athena |
| 9 | +--- |
| 10 | + |
| 11 | +Have you ever wanted to harness the power of AWS Athena, but found yourself |
| 12 | +tangled up in Presto SQL syntax? Good news! Ibis now supports [Amazon |
| 13 | +Athena](https://aws.amazon.com/athena/) as its [newest |
| 14 | +backend](https://ibis-project.org/backends/athena), bringing you the familiar |
| 15 | +comfort of DataFrame operations while tapping into AWS's robust data lake |
| 16 | +architecture. |
| 17 | + |
| 18 | +## Why? |
| 19 | + |
| 20 | +There's even more to love about this integration. Athena's pay-per-query |
| 21 | +pricing model bills each query by the amount of data it scans. With Ibis' query |
| 22 | +optimisation before execution, you can potentially reduce costs without needing |
| 23 | +to agonise over query efficiency. Plus, since Athena can query data directly |
| 24 | +from S3, this new backend lets you analyse your data lake contents with beloved |
| 25 | +Python libraries like PyArrow and pandas without the hassle of downloading or |
| 26 | +moving massive datasets. |
| 27 | + |
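To make the pricing point concrete, here is a rough back-of-the-envelope sketch. The price of USD 5 per TB scanned, the 10 MB per-query minimum, and the use of binary units are all assumptions made for illustration, and `estimate_query_cost` is a hypothetical helper rather than part of Ibis or any AWS SDK. Check the Athena pricing page for the figures that apply to your region.

```python
# Assumed figures for illustration only; confirm them on the Athena pricing page.
PRICE_PER_TB_USD = 5.0                 # assumed list price per TB scanned
BYTES_PER_TB = 1024**4                 # assumption: binary terabytes
MIN_BILLABLE_BYTES = 10 * 1024**2      # assumed 10 MB minimum charge per query


def estimate_query_cost(bytes_scanned: int, price_per_tb: float = PRICE_PER_TB_USD) -> float:
    """Estimate the cost in USD of one query from the bytes it scans."""
    billable = max(bytes_scanned, MIN_BILLABLE_BYTES)
    return billable / BYTES_PER_TB * price_per_tb


# Scanning a whole 200 GiB dataset versus only the 20 GiB a query really needs.
full_scan = estimate_query_cost(200 * 1024**3)
pruned_scan = estimate_query_cost(20 * 1024**3)
print(f"full scan:   ${full_scan:.4f}")
print(f"pruned scan: ${pruned_scan:.4f}")
```

Under these assumptions the cost scales linearly with the data scanned, which is why selecting fewer columns or filtering on partitions tends to matter more than the number of queries you run.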
| 28 | +## Installation Prerequisites |
| 29 | + |
| 30 | +Make sure you have an [IAM |
| 31 | +account](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-prereqs.html) |
| 32 | +and that your [credentials are in an expected location in your local |
| 33 | +environment](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html). |
| 34 | + |
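If you are unsure whether your credentials are discoverable, a quick check along the following lines may help. This is a minimal sketch that only looks at a few common locations (the standard `AWS_ACCESS_KEY_ID`/`AWS_SECRET_ACCESS_KEY` variables, `AWS_PROFILE`, and the shared credentials file); `find_aws_credentials` is a hypothetical helper, and it does not cover every mechanism AWS supports, such as SSO or instance roles.

```python
import os
from pathlib import Path


def find_aws_credentials(env=None, home=None) -> list[str]:
    """Return a list of places where AWS credentials appear to be configured."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    found = []

    # Static keys exported directly in the environment.
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        found.append("environment variables")

    # A named profile selected through the environment.
    if env.get("AWS_PROFILE"):
        found.append(f"profile {env['AWS_PROFILE']!r}")

    # The shared credentials file, at its default or overridden location.
    creds_file = Path(env.get("AWS_SHARED_CREDENTIALS_FILE", home / ".aws" / "credentials"))
    if creds_file.is_file():
        found.append(str(creds_file))

    return found


sources = find_aws_credentials()
print(sources or "No credentials found in the locations checked")
```

An empty result does not prove that credentials are missing, only that they are not in the places this sketch inspects.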
| 35 | +Additionally, using the same account and region that you are using for Athena, |
| 36 | +you will need to [create an S3 |
| 37 | +bucket](https://docs.aws.amazon.com/AmazonS3/latest/user-guide/create-bucket.html) |
| 38 | +where Athena can dump query results. This bucket will be set to |
| 39 | +`s3_staging_dir` in the connection call to the Athena backend. |
| 40 | + |
| 41 | +::: {.callout-note} |
| 42 | +If you are not able to query Athena through awscli, your queries will similarly |
| 43 | +not work on Ibis. Please note that AWS charges will apply for queries to Athena |
| 44 | +executed in following this tutorial. |
| 45 | +::: |
| 46 | + |
| 47 | +## Installation |
| 48 | + |
| 49 | +Install Ibis with the dependencies needed to work with AWS Athena: |
| 50 | + |
| 51 | +```bash |
| 52 | +$ pip install 'ibis-framework[athena]' |
| 53 | +``` |
| 54 | + |
| 55 | +## Data |
| 56 | + |
| 57 | +We are going to be creating some sample ecological data about ibis behaviour. |
| 58 | +The data will contain columns for observation date, species, location, |
| 59 | +weather, group size, behaviour, and temperature. |
| 60 | + |
| 61 | +```{python} |
| 62 | +import pandas as pd |
| 63 | +import numpy as np |
| 64 | +
|
| 65 | +
|
| 66 | +def create_observations(n: int, seed: int = 42) -> pd.DataFrame: |
| 67 | + ibis_species = ["Sacred Ibis", "Scarlet Ibis", "Glossy Ibis", "White Ibis"] |
| 68 | + locations = ["Wetland", "Grassland", "Coastal"] |
| 69 | + behaviors = ["Feeding", "Nesting", "Flying"] |
| 70 | + weather_conditions = ["Sunny", "Rainy"] |
| 71 | +
|
| 72 | + np.random.seed(seed) # For reproducibility |
| 73 | +
|
| 74 | + return pd.DataFrame( |
| 75 | + { |
| 76 | + "observation_date": np.full(n, np.datetime64("2024-01-01")) |
| 77 | + + np.random.randint(0, 365, size=n).astype("timedelta64[D]"), |
| 78 | + "species": np.random.choice(ibis_species, size=n), |
| 79 | + "location": np.random.choice(locations, size=n), |
| 80 | + "group_size": np.random.randint(1, 20, size=n), |
| 81 | + "behavior": np.random.choice(behaviors, size=n), |
| 82 | + "weather": np.random.choice(weather_conditions, size=n), |
| 83 | + "temperature_c": np.random.normal(25, 5, size=n) # Mean 25°C, std 5°C |
| 84 | + } |
| 85 | + ) |
| 86 | +
|
| 87 | +
|
| 88 | +ibis_observations = create_observations(1000) |
| 89 | +``` |
| 90 | + |
| 91 | +## Demo |
| 92 | + |
| 93 | +Let's start by opening a connection to AWS Athena with Ibis, using the S3 |
| 94 | +bucket we created to store query results. |
| 95 | + |
| 96 | +```{python} |
| 97 | +from ibis.interactive import * |
| 98 | +
|
| 99 | +con = ibis.athena.connect( |
| 100 | + s3_staging_dir="s3://aws-athena-query-results-ibis-testing", |
| 101 | + region_name="us-east-2", |
| 102 | +) |
| 103 | +``` |
| 104 | + |
| 105 | +Let's create a table from our `ibis_observations` pandas DataFrame. |
| 106 | + |
| 107 | +```{python} |
| 108 | +con.create_database("mydatabase", force=True) |
| 109 | +con.drop_table("ibis_observations", database="mydatabase", force=True) |
| 110 | +con.create_table("ibis_observations", obj=ibis_observations, database="mydatabase") |
| 111 | +con.list_tables(database="mydatabase") |
| 112 | +``` |
| 113 | +And we can grab information about table schemas to help us out with our |
| 114 | +queries: |
| 115 | + |
| 116 | +```{python} |
| 117 | +con.get_schema("ibis_observations", database="mydatabase") |
| 118 | +``` |
| 119 | + |
| 120 | +Now we can grab the table and run some Ibis queries. For example, what is |
| 121 | +the average group size by species? |
| 122 | + |
| 123 | +```{python} |
| 124 | +t = con.table("ibis_observations", database="mydatabase") |
| 125 | +
|
| 126 | +# Average group size by species |
| 127 | +t.group_by("species").aggregate(avg_group=t.group_size.mean()) |
| 128 | +``` |
| 129 | + |
| 130 | +Ibis does all the work of generating the Presto SQL that Athena can |
| 131 | +understand. |
| 132 | + |
| 133 | +How about the most common behaviour during rainy weather? |
| 134 | + |
| 135 | +```{python} |
| 136 | +( |
| 137 | + t.filter(t.weather == "Rainy") |
| 138 | + .group_by("behavior") |
| 139 | + .aggregate(count=lambda t: t.count()) |
| 140 | + .order_by(ibis.desc("count")) |
| 141 | +) |
| 142 | +``` |
| 143 | + |
| 144 | +Temperature effects on behaviour? |
| 145 | + |
| 146 | +```{python} |
| 147 | +t.group_by("behavior").aggregate(avg_temp=t.temperature_c.mean()).order_by("avg_temp") |
| 148 | +``` |
| 149 | + |
| 150 | +Now that we're nearing the end of this demo, I wanted to show you that you can |
| 151 | +also delete tables and databases using Ibis: |
| 152 | + |
| 153 | +```{python} |
| 154 | +con.drop_table("ibis_observations", database="mydatabase") |
| 155 | +con.drop_database("mydatabase") |
| 156 | +con.disconnect() |
| 157 | +``` |
| 158 | + |
| 159 | +You don't need to fiddle with Athena's SDK! |
| 160 | + |
| 161 | +## How does this all work? |
| 162 | + |
| 163 | +Under the hood, AWS Athena runs on a version of Trino (formerly known as Presto |
| 164 | +SQL). Instead of writing a completely new SQL compiler for Athena, we were able |
| 165 | +to leverage Ibis' existing Trino compiler with some careful adjustments. |
| 166 | + |
| 167 | +This provides significant benefits in code efficiency: the Athena backend |
| 168 | +implementation required only about 40 lines of unique code. |
| 169 | + |
| 170 | +There are some nuances to note: since Athena runs on an older version of Trino, |
| 171 | +not all of Trino's newest features are available. For a detailed comparison of |
| 172 | +supported features across different backends, please check out the [Ibis |
| 173 | +backend support matrix](https://ibis-project.org/backends/support/matrix). |
| 174 | + |
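The reuse pattern described above can be pictured with a toy example. This is not Ibis's actual compiler code: both classes, the `compile_call` method, and the `hypothetical_new_func` name are invented purely to illustrate how a derived compiler can inherit almost everything and declare only what differs.

```python
class BaseCompiler:
    """Toy stand-in for an existing SQL compiler (hypothetical, not Ibis code)."""

    dialect = "trino"
    unsupported: frozenset = frozenset()

    def compile_call(self, func: str, column: str) -> str:
        """Render a function call on a quoted column, or refuse if unsupported."""
        if func in self.unsupported:
            raise NotImplementedError(f"{func} is not available in {self.dialect}")
        return f'{func.upper()}("{column}")'


class DerivedCompiler(BaseCompiler):
    """Reuses all of the base behaviour and only records the differences."""

    dialect = "athena"
    # Hypothetical feature that the older engine does not provide.
    unsupported = frozenset({"hypothetical_new_func"})


compiler = DerivedCompiler()
print(compiler.compile_call("avg", "group_size"))

try:
    compiler.compile_call("hypothetical_new_func", "group_size")
except NotImplementedError as exc:
    print(exc)
```

The derived class is only a few lines long because everything it does not override comes from the base class, which is the same reason a backend built this way can stay small.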
| 175 | +If you're new here, welcome. Here are some resources to learn more about Ibis: |
| 176 | + |
| 177 | +- [Ibis Docs](https://ibis-project.org/) |
| 178 | +- [Ibis GitHub](https://github.com/ibis-project/ibis) |
| 179 | + |
| 180 | +Chat with us on Zulip: |
| 181 | + |
| 182 | +- [Ibis Zulip Chat](https://ibis-project.zulipchat.com/) |