AWS Athena is a serverless query service that you can use to query or analyze data from various sources like S3, Hive, Redshift, and many more. You don’t have to maintain your server to run the queries on existing data. In this article, we will discuss below AWS Athena concepts:

Data Sources for AWS Athena:

AWS Athena supports multiple data sources and is integrated with the AWS Glue catalog out of the box. The connectors help communicate with different Data Sources and return the SQL query result to AWS Athena. Connectors are deployed as a serverless app and use AWS Lambda for processing. The support Data source types are:

S3- AWS Glue Data Catalog

  1. It allows querying the data from S3.
  2. The database and tables (views) can be created using Glue or AWS Athena Create External table statement.
  3. It will enable the use of the Glue catalog present in the same AWS account or different AWS account.
  4. You can create the tables using two other options in the Glue catalog is part of the same AWS account:
    1. Create a crawler in AWS Glue
    2. Create a table using the AWS Athena table page from the S3 bucket data
  5. When you select “AWS Glue Data Catalog in another account,” you will need to provide the Glue Catalog ID belonging to the different AWS account.

S3 – Apache Hive Store

  1. It will enable querying the data from S3.
  2. You can run the queries against Hive external tables created pointing to the S3 location.

Amazon CloudWatch Logs

  1. It allows querying data from CloudWatch Logs.
  2. Connector: https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-cloudwatch

Amazon CloudWatch Metrics

  1. It allows querying data from CloudWatch Metrics.
  2. Connector: https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-cloudwatch-metrics

Amazon DocumentDB

  1. It allows querying data from DocumentDB.
  2. Connector: https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-docdb

Amazon DynamoDB

  1. It allows querying data from DynamoDB.
  2. Connector: https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-dynamodb

Amazon Redshift

  1. It allows querying data from Redshift.
  2. Connector: https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-jdbc

Apache HBase

  1. It allows querying data from Apache HBase.
  2. Connector: https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-hbase

MySQL

  1. It allows querying data from MySQL.
  2. Connector: https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-jdbc

PostgreSQL

  1. It allows querying data from PostgreSQL.
  2. Connector: https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-jdbc

Redis

  1. It allows querying data from Redis.
  2. Connector: https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-redis

Workgroups

You can use the Athena workgroups to separate teams, users, and queries. You can set the limit on data accessed per query or workgroup and set alerts when you reach the threshold. While creating the workgroup, you will need to specify:

  1. Workgroup name
  2. Workgroup Description
  3. Upgrade Query Engine (Automatic or manual)
  4. Query result location and encryption:
    1. Each workgroup requires the S3 location to save query results.
    2. The query results stored on the S3 location can also be encrypted.
    3. You can select below encryption mechanisms:
      1. SSE_S3
      2. SSE_KMS
      3. CSE_KMS
  5. Settings:
    1. Metrics: When you enable it, Athena collects various query metrics like run time, scanned data, and sent to AWS Cloudwatch metrics.
    2. Override: All the client-side settings will be overwritten with WorkGroup settings when selecting this option.
    3. Requester pays: The query requestor pays for the query execution.
  6. Per Query Data Usage Control
    1. The maximum amount of data query is allowed to scan.
  7. Workgroup[ Data Usage alerts
    1. You can set the Data threshold (Maximum limit 7EB) per period. 
    2. Once the threshold is reached, AWS will send the email alert to administrators through an SNS-Email subscription.
  8. Tags

Query Editor:

Athena Query allows you to:

  1. View Data Source
  2. View Database
  3. View Tables/Views and schema associated with it.
  4. Type and execute the query.
  5. See the query result.
  6. Execute the query result against the particular workgroup.

Alternatively, you also will be able to run the queries using Boto3 SDK.

Athena Best Practices

  1. Partition your data and restrict the partitions using WHERE clause in SQL query.
  2. Bucket the data and add bucketing information while creating the Athena/Glue tables.
  3. Use compression to store the underlying data files. The supported compression types are GZIP, bzip2, lzo and snappy.
  4. Avoid having many small files. Having one large file instead of multiple small files improves the query performance.
  5. Use file types that allow columnar storage, e.g., parquet or ORC.
  6. Optimize the query.
  7. Use approximate functions wherever possible.

Athena-Boto3 Operations:

How to create a Boto3 Athena client?

import boto3

athena_client = boto3.client("athena", region_name="us-east-1")

How to create a data catalog using the Athena Boto3 client?

from typing import List

import boto3


athena_client = boto3.client("athena", region_name="us-east-1")


def create_data_catalog(
    catalog_name: str,
    catalog_description: str,
    parameters: dict = None,
    tags: List[dict] = None,
) -> dict:
    """Create Glue Data Catalog

    :param catalog_name: Name of Glue catalog to be created
    :param catalog_description: Glue catalog Description
    :param parameters: Glue catalog parameters
    :param tags: Glue catalog Tags
    :return: Dict
    """

    if parameters is None:
        current_aws_account = boto3.client("sts").get_caller_identity().get("Account")
        parameters = {"catalog-id": current_aws_account}

    if tags is None:
        tags = [
            {"Key": "tag_key", "Value": "tag_value"},
        ]

    response = athena_client.create_data_catalog(
        Name=catalog_name,
        Type="GLUE",
        Description=catalog_description,
        Parameters=parameters,
        Tags=tags,
    )
    return response

Sample Response: {'ResponseMetadata': {'RequestId': 'dcaad8eb-5f82-48d1-aba2-3278e6b014d9', 'HTTPStatusCode': 200, 'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1', 'date': 'Sat, 15 Jan 2022 13:41:31 GMT', 'x-amzn-requestid': 'dcaad8eb-5f82-48d1-aba2-3278e6b014d9', 'content-length': '2', 'connection': 'keep-alive'}, 'RetryAttempts': 0}}

How to create a workgroup using the Athena Boto3 client?

from typing import List

import boto3


athena_client = boto3.client("athena", region_name="us-east-1")


def create_work_group(
    workgroup_name: str,
    workgroup_description: str,
    output_location: str,
    encryption_setting: dict = None,
    enforce_config: bool = False,
    publish_cw_metrics: bool = False,
    requester_pays: bool = False,
    tags: List[dict] = None,
) -> dict:
    """Creates Athena workgroup

    :param workgroup_name: Name of workgroup to be created.
    :param workgroup_description: Workgroup description.
    :param output_location: S3 Output location to save query result.
    :param encryption_setting: Encrypt the Query results.
    :param enforce_config: Boolean flag indicating if workgroup query will override the settings.
    :param publish_cw_metrics: Boolean flag to publish cloudwatch metrics.
    :param requester_pays: Boolean flag indicating if requester pays for the query.
    :param tags: Tags
    :return: Dict
    """
    if encryption_setting is None:
        encryption_setting = {"EncryptionOption": "SSE_S3"}

    if tags is None:
        tags = [
            {"Key": "tag_key", "Value": "tag_value"},
        ]

    response = athena_client.create_work_group(
        Name=workgroup_name,
        Configuration={
            "ResultConfiguration": {
                "OutputLocation": output_location,
                "EncryptionConfiguration": encryption_setting,
            },
            "EnforceWorkGroupConfiguration": enforce_config,
            "PublishCloudWatchMetricsEnabled": publish_cw_metrics,
            "RequesterPaysEnabled": requester_pays,
        },
        Description=workgroup_description,
        Tags=tags,
    )
    return response

Sample response:
{'ResponseMetadata': {'RequestId': '98b4fc59-0c2a-4e29-b384-465ececac670', 'HTTPStatusCode': 200, 'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1', 'date': 'Sat, 15 Jan 2022 13:41:31 GMT', 'x-amzn-requestid': '98b4fc59-0c2a-4e29-b384-465ececac670', 'content-length': '2', 'connection': 'keep-alive'}, 'RetryAttempts': 0}}

How to list all the data catalogs using the Athena Boto3 client?

from typing import List

import boto3


athena_client = boto3.client("athena", region_name="us-east-1")


def list_data_catalogs(max_results: int = 50) -> dict:
    """Lists all data catalogs

    :param max_results: max results to return
    :return: dict containing data catalogs.
    """
    response = athena_client.list_data_catalogs(MaxResults=max_results)
    return response

Sample Response
{'DataCatalogsSummary': [{'CatalogName': 'AwsDataCatalog', 'Type': 'GLUE'}, {'CatalogName': 'test_catalog', 'Type': 'GLUE'}], 'ResponseMetadata': {'RequestId': 'f508f556-4428-4e8b-85ee-7097ea85e274', 'HTTPStatusCode': 200, 'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1', 'date': 'Sat, 15 Jan 2022 13:41:32 GMT', 'x-amzn-requestid': 'f508f556-4428-4e8b-85ee-7097ea85e274', 'content-length': '117', 'connection': 'keep-alive'}, 'RetryAttempts': 0}}

How to list all the databases using the Athena Boto3 client?

from typing import List

import boto3


athena_client = boto3.client("athena", region_name="us-east-1")

def list_databases(catalog_name: str, max_results: int = 50) -> dict:
    """Lists the first 50 databases in Glue catalog.

    :param catalog_name: Catalog name.
    :param max_results: Max number of databases
    :return: dict
    """

    response = athena_client.list_databases(
        CatalogName=catalog_name, MaxResults=max_results
    )
    return response

Sample Response
{'DatabaseList': [{'Name': 'default'}, {'Name': 'test_database'}], 'ResponseMetadata': {'RequestId': '4b9d37d9-c007-42c5-b5db-e37b80008d89', 'HTTPStatusCode': 200, 'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1', 'date': 'Sat, 15 Jan 2022 13:41:32 GMT', 'x-amzn-requestid': '4b9d37d9-c007-42c5-b5db-e37b80008d89', 'content-length': '62', 'connection': 'keep-alive'}, 'RetryAttempts': 0}}

How to list the workgroups using the Athena Boto3 client?

from typing import List

import boto3


athena_client = boto3.client("athena", region_name="us-east-1")


def list_work_groups(max_results: int = 50) -> dict:
    """Lists first 50 Athena workgroups

    :param max_results: max results to return
    :return: dict containing Athena workgroup.
    """

    response = athena_client.list_work_groups(MaxResults=max_results)
    return response

Sample response:
{'WorkGroups': [{'Name': 'primary', 'State': 'ENABLED', 'Description': '', 'CreationTime': datetime.datetime(2020, 5, 15, 1, 46, 30, 873000, tzinfo=tzlocal()), 'EngineVersion': {'SelectedEngineVersion': 'AUTO', 'EffectiveEngineVersion': 'Athena engine version 2'}}, {'Name': 'test_workgroup', 'State': 'ENABLED', 'Description': 'test_descrption', 'CreationTime': datetime.datetime(2022, 1, 15, 19, 11, 32, 427000, tzinfo=tzlocal()), 'EngineVersion': {'SelectedEngineVersion': 'AUTO', 'EffectiveEngineVersion': 'Athena engine version 2'}}], 'ResponseMetadata': {'RequestId': '49bf2267-842c-4ad7-9ffb-0d29703bbe48', 'HTTPStatusCode': 200, 'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1', 'date': 'Sat, 15 Jan 2022 13:41:33 GMT', 'x-amzn-requestid': '49bf2267-842c-4ad7-9ffb-0d29703bbe48', 'content-length': '410', 'connection': 'keep-alive'}, 'RetryAttempts': 0}}

How to get the data catalog details using the Athena Boto3 client?

from typing import List

import boto3


athena_client = boto3.client("athena", region_name="us-east-1")


def get_data_catalog(catalog_id: str) -> dict:
    """Get the data catalog details.

    :param catalog_id: Name of data catalog
    :return: Data catalog object
    """
    response = athena_client.get_data_catalog(Name=catalog_id)
    return response

Sample Response:
{'DataCatalog': {'Name': 'test_catalog', 'Description': 'test_description', 'Type': 'GLUE', 'Parameters': {'catalog-id': '0123456789012'}}, 'ResponseMetadata': {'RequestId': '1cbcb4b6-fc8a-47f5-9d88-4a4880909128', 'HTTPStatusCode': 200, 'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1', 'date': 'Sat, 15 Jan 2022 13:41:33 GMT', 'x-amzn-requestid': '1cbcb4b6-fc8a-47f5-9d88-4a4880909128', 'content-length': '129', 'connection': 'keep-alive'}, 'RetryAttempts': 0}}

How to get the database details using the Athena Boto3 client?

from typing import List

import boto3


athena_client = boto3.client("athena", region_name="us-east-1")


def get_database(catalog_name: str, database_name: str) -> dict:
    """Get the database details.

    :param catalog_name: The catalog containing the databse.
    :param database_name: The databse name
    :return: The databse details.
    """
    response = athena_client.get_database(
        CatalogName=catalog_name, DatabaseName=database_name
    )
    return response

Sample Response
{'Database': {'Name': 'test_database'}, 'ResponseMetadata': {'RequestId': '58e8c164-9b75-4f09-b90a-ef7be02990c3', 'HTTPStatusCode': 200, 'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1', 'date': 'Sat, 15 Jan 2022 13:41:34 GMT', 'x-amzn-requestid': '58e8c164-9b75-4f09-b90a-ef7be02990c3', 'content-length': '37', 'connection': 'keep-alive'}, 'RetryAttempts': 0}}

How to get the workgroup details using the Athena Boto3 client?

from typing import List

import boto3


athena_client = boto3.client("athena", region_name="us-east-1")


def get_work_group(workgroup_name: str) -> dict:
    """Get the work group details.

    :param workgroup_name: Workgroup name to fetch details for.
    :return: Work group details.
    """
    response = athena_client.get_work_group(WorkGroup=workgroup_name)
    return response

Sample Response:
{'WorkGroup': {'Name': 'test_workgroup', 'State': 'ENABLED', 'Configuration': {'ResultConfiguration': {'OutputLocation': 's3://testbucketforathenademo', 'EncryptionConfiguration': {'EncryptionOption': 'SSE_S3'}}, 'EnforceWorkGroupConfiguration': False, 'PublishCloudWatchMetricsEnabled': False, 'RequesterPaysEnabled': False, 'EngineVersion': {'SelectedEngineVersion': 'AUTO', 'EffectiveEngineVersion': 'Athena engine version 2'}}, 'Description': 'test_descrption', 'CreationTime': datetime.datetime(2022, 1, 15, 19, 11, 32, 427000, tzinfo=tzlocal())}, 'ResponseMetadata': {'RequestId': 'aebc405a-0562-48c3-9d61-5856e1740ce7', 'HTTPStatusCode': 200, 'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1', 'date': 'Sat, 15 Jan 2022 13:41:35 GMT', 'x-amzn-requestid': 'aebc405a-0562-48c3-9d61-5856e1740ce7', 'content-length': '506', 'connection': 'keep-alive'}, 'RetryAttempts': 0}}

How to start Athena query using the Athena Boto3 client?

from typing import List

import boto3


athena_client = boto3.client("athena", region_name="us-east-1")


def start_query_execution(query: str, database: str, work_group: str) -> dict:
    """Starts Athena query execution.

    :param query: Athena SQL query
    :param database: The database to run query against.
    :param work_group: The workgroup to use for query execution
    :return: dict
    """
    response = athena_client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        WorkGroup=work_group,
    )
    return response

Sample Response
{'QueryExecutionId': 'b27b7259-9606-4a4c-bf58-d1e8f76f3bdc', 'ResponseMetadata': {'RequestId': 'a8bdeaa6-46b0-46f5-8555-c236131f79aa', 'HTTPStatusCode': 200, 'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1', 'date': 'Sat, 15 Jan 2022 13:41:35 GMT', 'x-amzn-requestid': 'a8bdeaa6-46b0-46f5-8555-c236131f79aa', 'content-length': '59', 'connection': 'keep-alive'}, 'RetryAttempts': 0}}

How to check Athena query status using the Athena Boto3 client?

from typing import List

import boto3


athena_client = boto3.client("athena", region_name="us-east-1")


def get_query_execution(query_id: str) -> dict:
    """Get Athena query status.

    :param query_id: Query ID to get status for.
    :return: dict
    """
    response = athena_client.get_query_execution(QueryExecutionId=query_id)
    return response

Sample Response
{'QueryExecution': {'QueryExecutionId': 'b27b7259-9606-4a4c-bf58-d1e8f76f3bdc', 'Query': 'select 1', 'StatementType': 'DML', 'ResultConfiguration': {'OutputLocation': 's3://testbucketforathenademo/b27b7259-9606-4a4c-bf58-d1e8f76f3bdc.csv', 'EncryptionConfiguration': {'EncryptionOption': 'SSE_S3'}}, 'QueryExecutionContext': {'Database': 'test_database'}, 'Status': {'State': 'SUCCEEDED', 'SubmissionDateTime': datetime.datetime(2022, 1, 15, 19, 11, 35, 844000, tzinfo=tzlocal()), 'CompletionDateTime': datetime.datetime(2022, 1, 15, 19, 11, 36, 128000, tzinfo=tzlocal())}, 'Statistics': {'EngineExecutionTimeInMillis': 117, 'DataScannedInBytes': 0, 'TotalExecutionTimeInMillis': 284, 'QueryQueueTimeInMillis': 124, 'ServiceProcessingTimeInMillis': 43}, 'WorkGroup': 'test_workgroup', 'EngineVersion': {'SelectedEngineVersion': 'AUTO', 'EffectiveEngineVersion': 'Athena engine version 2'}}, 'ResponseMetadata': {'RequestId': 'aed7cd86-47d4-4b41-ad85-892d9b2c4fce', 'HTTPStatusCode': 200, 'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1', 'date': 'Sat, 15 Jan 2022 13:41:35 GMT', 'x-amzn-requestid': 'aed7cd86-47d4-4b41-ad85-892d9b2c4fce', 'content-length': '1423', 'connection': 'keep-alive'}, 'RetryAttempts': 0}}

How to get Athena query results using the Athena Boto3 client?

from typing import List

import boto3


athena_client = boto3.client("athena", region_name="us-east-1")


def get_query_results(query_id: str) -> dict:
    """Get Athena query result.

    :param query_id: Query ID to get result for.
    :return: dict
    """
    response = athena_client.get_query_results(QueryExecutionId=query_id, MaxResults=10)
    return response

Sample Response
{'UpdateCount': 0, 'ResultSet': {'Rows': [{'Data': [{'VarCharValue': '_col0'}]}, {'Data': [{'VarCharValue': '1'}]}], 'ResultSetMetadata': {'ColumnInfo': [{'CatalogName': 'hive', 'SchemaName': '', 'TableName': '', 'Name': '_col0', 'Label': '_col0', 'Type': 'integer', 'Precision': 10, 'Scale': 0, 'Nullable': 'UNKNOWN', 'CaseSensitive': False}]}}, 'ResponseMetadata': {'RequestId': '08770880-8dda-414d-8f9a-22e075d73f40', 'HTTPStatusCode': 200, 'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1', 'date': 'Sat, 15 Jan 2022 13:41:36 GMT', 'x-amzn-requestid': '08770880-8dda-414d-8f9a-22e075d73f40', 'content-length': '550', 'connection': 'keep-alive'}, 'RetryAttempts': 0}}

How to stop Athena query execution?

from typing import List

import boto3


athena_client = boto3.client("athena", region_name="us-east-1")


def stop_query_execution(query_id: str) -> dict:
    """Stops athena query execution.

    :param query_id: Query ID to stop execution.
    :return: dict
    """
    response = athena_client.stop_query_execution(QueryExecutionId=query_id)
    return response

Sample Response
{'ResponseMetadata': {'RequestId': '2aec07e3-bf2f-4e43-8a2b-f8b39c0d6014', 'HTTPStatusCode': 200, 'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1', 'date': 'Sat, 15 Jan 2022 13:41:37 GMT', 'x-amzn-requestid': '2aec07e3-bf2f-4e43-8a2b-f8b39c0d6014', 'content-length': '2', 'connection': 'keep-alive'}, 'RetryAttempts': 0}}

How to delete the data catalog using the Athena Boto3 client?

from typing import List

import boto3


athena_client = boto3.client("athena", region_name="us-east-1")


def delete_data_catalog(catalog_name: str) -> dict:
    """Deletes the Athena data catalog

    :param catalog_name: The catalog name to be deleted
    :return: Dict
    """
    response = athena_client.delete_data_catalog(Name=catalog_name)
    return response

Sample Response
{'ResponseMetadata': {'RequestId': 'ad82ea69-8376-4652-b0f9-a50c19cabd62', 'HTTPStatusCode': 200, 'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1', 'date': 'Sat, 15 Jan 2022 13:41:37 GMT', 'x-amzn-requestid': 'ad82ea69-8376-4652-b0f9-a50c19cabd62', 'content-length': '2', 'connection': 'keep-alive'}, 'RetryAttempts': 0}} 

How to delete the workgroup using the Athena Boto3 client?

from typing import List

import boto3


athena_client = boto3.client("athena", region_name="us-east-1")


def delete_work_group(work_group_name: str, recursive_delete: bool = False) -> dict:
    """Deletes the Athena work group

    :param work_group_name: The workgroup to be deleted
    :param recursive_delete: Recursively delete the workgroup.
    :return: Dict
    """
    response = athena_client.delete_work_group(
        WorkGroup=work_group_name, RecursiveDeleteOption=recursive_delete
    )
    return response

Sample Response
{'ResponseMetadata': {'RequestId': '18ce9c47-0d5d-4752-8f6b-72478ee5346b', 'HTTPStatusCode': 200, 'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1', 'date': 'Sat, 15 Jan 2022 13:41:37 GMT', 'x-amzn-requestid': '18ce9c47-0d5d-4752-8f6b-72478ee5346b', 'content-length': '2', 'connection': 'keep-alive'}, 'RetryAttempts': 0}}

Conclusion:

This article discussed various Athena data sources, workgroups, and query editors. We also saw how to use Boto3 to perform various Athena operations.

Related Articles