dbt runs fail after ~350s using Glue 5.0 with IllegalSessionStateException #576

@ddafransim

Describe the Bug

All my dbt model runs crash around 5 minutes and 30 seconds into execution with the following error:

Error in GlueCursor execute: An error occurred (IllegalSessionStateException) when calling the GetStatement operation: Session unavailable, fail to call ReplServer.

The run writes data files to the S3 bucket; however, the table never gets registered in the Glue Catalog.

The source tables are registered in the Glue Catalog and configured via sources.yml. Example for the conta table:

- name: conta
  description: "Raw CDC data from DMS for conta"
  meta:
    external_location: s3://zap-ingestion/zapweb/conta/
  external:
    location: "s3://zap-ingestion/zapweb/conta/"
    file_format: "parquet"

These tables are built on top of compacted Parquet files generated by AWS DMS and processed with https://github.com/awslabs/aws-glue-blueprint-libs/tree/master/samples/compaction. The goal is to build a CDC process; the first run of each table may process anywhere from 10 MB to 100 GB of data.


Steps to Reproduce

dbt Model (brz_conta.sql)

{{ config(
    tags=["brz"],
    materialized='incremental',
    incremental_strategy='append',
    pre_hook="SET spark.sql.parquet.datetimeRebaseModeInRead = LEGACY"
) }}

WITH source_data AS (
    SELECT 
        CAST(dms_commit_ts AS TIMESTAMP) AS loaded_at,
        CAST(idconsumo AS STRING) AS idconsumo,
        CAST(idsubscricao AS STRING) AS idsubscricao,
        CAST(idprodutoinstantaneo AS STRING) AS idprodutoinstantaneo,
        CAST(datainicio AS STRING) AS datainicio,
        CAST(datafim AS STRING) AS datafim,
        CAST(valor AS STRING) AS valor,
        CAST(estado AS STRING) AS estado,
        CAST(nagra AS STRING) AS nagra,
        CAST(incarencia AS STRING) AS incarencia,
        CAST(ideventoconsumo AS STRING) AS ideventoconsumo,
        CAST(pontos AS STRING) AS pontos,
        CAST(quota AS STRING) AS quota,
        CAST(idstb AS STRING) AS idstb,
        CAST(idequipamento AS STRING) AS idequipamento,
        CAST(idcontaservico AS STRING) AS idcontaservico,
        CAST(datahorainicio AS STRING) AS datahorainicio,
        CAST(datahorafim AS STRING) AS datahorafim,
        CAST(datacriacao AS STRING) AS datacriacao
    FROM {{ source('zap_ingestion', 'consumo') }}
    {% if is_incremental() %}
        WHERE loaded_at > (SELECT MAX(loaded_at) FROM {{ this }})
    {% endif %}
)

SELECT * FROM source_data
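
For reference, the intended incremental behavior is a simple high-water-mark append: on incremental runs, only source rows with `loaded_at` greater than the target's current maximum are inserted. A minimal Python sketch of that logic (illustrative only; the `loaded_at` values are made up, and this is not the adapter's implementation):

```python
def incremental_append(target_rows, source_rows, is_incremental=True):
    """Append only source rows newer than the target's max loaded_at.

    Mirrors the is_incremental() filter in the model above:
    WHERE loaded_at > (SELECT MAX(loaded_at) FROM {{ this }}).
    """
    if is_incremental and target_rows:
        watermark = max(r["loaded_at"] for r in target_rows)
        new_rows = [r for r in source_rows if r["loaded_at"] > watermark]
    else:
        # First run (or full refresh): load everything.
        new_rows = list(source_rows)
    return target_rows + new_rows

target = [{"loaded_at": 1}, {"loaded_at": 2}]
source = [{"loaded_at": 2}, {"loaded_at": 3}]
# Only the loaded_at=3 row is appended; loaded_at=2 already exists.
print(incremental_append(target, source))
```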

dbt profiles.yml

prod:
  type: glue
  query_tag: zap-deltalake
  role_arn: ROLE
  region: af-south-1
  workers: 2
  worker_type: G.1X
  session_provisioning_timeout_in_seconds: 300
  idle_timeout: 15
  create_new_session: true
  default_arguments: "--enable-auto-scaling=true, --enable-metrics=true, --enable-continuous-cloudwatch-log=true, --enable-continuous-log-filter=true, --enable-spark-ui=true, --spark-event-logs-path=s3://BUCKET/prodLogDir/"
  glue_version: "5.0"
  schema: ice_lake
  catalog_id: ID
  location: s3://BRZ_BUCKET
  datalake_formats: iceberg
  custom_iceberg_catalog_namespace: ""
  spark_conf:
    spark.sql.extensions: org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
    spark.sql.catalog.glue_catalog: org.apache.iceberg.spark.SparkCatalog
    spark.sql.catalog.glue_catalog.warehouse: s3://BRZ_BUCKET/ice_lake/
    spark.sql.warehouse: s3://BRZ_BUCKET/ice_lake/
    spark.sql.catalog.glue_catalog.catalog-impl: org.apache.iceberg.aws.glue.GlueCatalog
    spark.sql.catalog.glue_catalog.io-impl: org.apache.iceberg.aws.s3.S3FileIO
    spark.serializer: org.apache.spark.serializer.KryoSerializer
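
Note that `default_arguments` is passed as a single comma-separated string of `--key=value` pairs. A minimal sketch of how such a string splits into a key/value map (an illustration of the format only, not the adapter's actual parser; it assumes values contain no commas):

```python
def parse_default_arguments(raw: str) -> dict:
    """Split a comma-separated '--key=value' string into a dict.

    Assumes each entry has exactly one '=' and values contain no commas.
    """
    args = {}
    for entry in raw.split(","):
        entry = entry.strip()
        if not entry:
            continue
        key, _, value = entry.partition("=")
        args[key] = value
    return args

raw = ("--enable-auto-scaling=true, --enable-metrics=true, "
       "--spark-event-logs-path=s3://BUCKET/prodLogDir/")
print(parse_default_arguments(raw))
```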

Expected Behavior

The model brz_conta should:

  • Create an Iceberg table named brz_conta in the ice_lake schema.
  • Register it in the Glue Catalog.
  • Store data files in the specified S3 path.

Screenshots and Log Output

dbt Output

Glue adapter: Error in GlueCursor (session_id=dbt-glue__service-role/AWSGlueServiceRole__e0ea176d-fa40-4e95-9709-d4ca72d5508a) execute:
An error occurred (IllegalSessionStateException) when calling the GetStatement operation:
Session dbt-glue__service-role/... unavailable, fail to call ReplServer

Runtime Error in model brz_conta (models/bronze/brz_conta.sql):
module 'dbt.exceptions' has no attribute 'ExecutableError'

AWS Glue Interactive Sessions Logs

There are several successful downloads before the process fails:

25/07/10 13:17:56 ERROR AsyncFileDownloader: TID: 246 - Download failed for ParquetFileChunk(path=s3://zap-ingestion/zapweb/conta/20250527-194357672.parquet, downloadSize=30)
java.lang.NullPointerException: Cannot invoke "org.apache.hadoop.fs.FileSystem.hasPathCapability(org.apache.hadoop.fs.Path, String)" because "this.fileSystem" is null
	at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.hasPathCapability(EmrFileSystem.java:377) ~[emrfs-hadoop-assembly-2.69.0.jar:?]

ERROR	2025-07-10T13:17:56,872	311398	com.amazonaws.glue.is.LivyServerLauncher	[main]	86	Got interrupted
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.InterruptedException: Session did not reach healthy state

Spark Output Sample

{
  "Event": "SparkListenerJobEnd",
  "Job ID": 3,
  "Completion Time": 1752153476899,
  "Job Result": {
    "Result": "JobFailed",
    "Exception": {
      "Message": "Job 3 cancelled because SparkContext was shut down"
    }
  },
  "Metrics Summary": {
    "Rows Written": 6143958,
    "Bytes Written": 259720476,
    "Input Records Read": 6143958,
    "Bytes Read": 617032200
  }
}
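
A quick sanity check over those metrics: the job read and wrote the same record count, and the bytes written are roughly 42% of the bytes read (plausibly just Parquet re-encoding after the casts), which suggests output was being produced normally right up until the SparkContext was shut down. Illustrative arithmetic only:

```python
# Arithmetic over the SparkListenerJobEnd metrics summary above.
rows = 6_143_958
bytes_written = 259_720_476
bytes_read = 617_032_200

ratio = bytes_written / bytes_read       # output size vs. input size
written_per_row = bytes_written / rows   # avg bytes per written row
read_per_row = bytes_read / rows         # avg bytes per read row

print(f"write/read ratio: {ratio:.2f}")  # ~0.42
print(f"bytes/row written: {written_per_row:.1f}, read: {read_per_row:.1f}")
```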

System Information

The output of dbt --version:

dbt-core==1.9.6
dbt-glue==1.9.4

The operating system you're using:

Docker image: ghcr.io/dbt-labs/dbt-spark:latest
