[Bug]: Severe performance degradation after upgrading from 1.1.4 to 1.2.0 #3953

@StijnvanderLippe

Description

What happened?

After upgrading to version 1.2.0 of the Python package, I've noticed severely increased runtimes for loading data from Delta tables in Azure Blob Storage. Version 1.1.4 works as expected.

I've run a profiler on the code. The profiler output shows sharply increased runtimes for

  • {method 'dataset_partitions' of 'deltalake._internal.RawDeltaTable' objects}
  • {method 'get_add_file_sizes' of 'deltalake._internal.RawDeltaTable' objects}

Looking through the commit history, I've narrowed it down to changes made in one of these commits:

Expected behavior

Loading data from Delta tables should not become significantly slower between versions without a clear reason for the change.

Operating System

Linux

Binding

Python

Bindings Version

1.2.0

Steps to reproduce

The data in question is partitioned by NAME. Additionally, I filter on the timestamp of the datapoints in the TS column. I have several partitions to fetch, which I manually loop over. I know you can fetch all the desired partitions in one go, but this manual looping is necessary since not all the data fits in memory at the same time on the system running this code. Hence, I need to fetch one partition at a time, do some post-processing, and store the processed results before fetching the next partition.

I've extracted the core parts of the data loading into the script below. Certain parts are missing but these have no effect on the impacted methods.

from azure.identity import DefaultAzureCredential
from deltalake import DeltaTable
from pyarrow.compute import field

credential = DefaultAzureCredential()

# container_name, storage_account, path, and version are defined elsewhere
table_uri = f"abfss://{container_name}@{storage_account}.dfs.core.windows.net{path}"
delta_token = credential.get_token("https://storage.azure.com/").token
table = DeltaTable(table_uri, storage_options={"token": delta_token}, version=version)

# Manual loop over the partitions (partitions_to_fetch, columns, start_date,
# and end_date are defined elsewhere)
partitions_to_fetch = [...]
for partition in partitions_to_fetch:
    partitions = [("NAME", "in", partition)]
    filters = (field("TS") >= start_date) & (field("TS") <= end_date)
    df = table.to_pandas(partitions=partitions, columns=columns, filters=filters)
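For context, the full job around the script above follows this fetch/process/store shape. This is only a sketch of the control flow: fetch_partition, process, and store are hypothetical stand-ins for the table.to_pandas call, the post-processing, and the write-out, not the real code.

```python
def run_pipeline(partitions_to_fetch, fetch_partition, process, store):
    """Fetch one partition at a time so only one partition is in memory at once."""
    for partition in partitions_to_fetch:
        df = fetch_partition(partition)  # stand-in for table.to_pandas(...)
        result = process(df)             # post-processing on a single partition
        store(partition, result)         # persist before fetching the next one


# Tiny illustration with in-memory stand-ins:
data = {"A": [1, 2, 3], "B": [4, 5]}
results = {}
run_pipeline(
    partitions_to_fetch=["A", "B"],
    fetch_partition=lambda name: data[name],
    process=lambda rows: sum(rows),
    store=lambda name, value: results.__setitem__(name, value),
)
print(results)  # {'A': 6, 'B': 9}
```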

Commit 8ebe4dd:

         2481571 function calls (2424416 primitive calls) in 136.858 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        9   76.127    8.459   76.127    8.459 {method 'get_add_file_sizes' of 'deltalake._internal.RawDeltaTable' objects}
        9   29.083    3.231   29.436    3.271 {method 'dataset_partitions' of 'deltalake._internal.RawDeltaTable' objects}
        2   17.537    8.769   17.537    8.769 table.py:138(__init__)
        9    4.803    0.534  111.266   12.363 table.py:960(to_pyarrow_table)

Commit 7b9f2f8:

         2482108 function calls (2424944 primitive calls) in 38.173 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2   19.285    9.643   19.285    9.643 table.py:138(__init__)
        9    4.611    0.512   11.245    1.249 table.py:910(to_pyarrow_table)
        9    3.002    0.334    3.356    0.373 {method 'dataset_partitions' of 'deltalake._internal.RawDeltaTable' objects}
        9    2.349    0.261    2.349    0.261 {method 'get_add_file_sizes' of 'deltalake._internal.RawDeltaTable' objects}
        9    0.801    0.089    0.801    0.089 {built-in method from_table}

Note the significant differences between the two runs on the {method 'dataset_partitions' of 'deltalake._internal.RawDeltaTable' objects} and {method 'get_add_file_sizes' of 'deltalake._internal.RawDeltaTable' objects} lines: both are roughly 10-30x slower under commit 8ebe4dd.
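The tables above are standard cProfile output sorted by internal time. A minimal sketch of how such a report can be collected, with load_partition as a hypothetical stand-in for the real table.to_pandas call:

```python
import cProfile
import io
import pstats


def load_partition():
    # Stand-in for the real per-partition table.to_pandas(...) call.
    return sum(i * i for i in range(100_000))


profiler = cProfile.Profile()
profiler.enable()
for _ in range(9):  # mirrors the ncalls=9 seen in the reports above
    load_partition()
profiler.disable()

# Sort by internal time, matching the "Ordered by: internal time" reports.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("tottime").print_stats(5)
report = stream.getvalue()
print(report)
```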

Relevant logs
