[Bug]: Severe performance degradation after upgrading from 1.1.4 to 1.2.0 #3953
Description
What happened?
After upgrading to version 1.2.0 of the Python package, I've noticed severely increased runtimes for loading data from Delta tables in Azure Blob Storage. Version 1.1.4 works as expected.
I've run a profiler on the code. The profile shows increased runtimes for
{method 'dataset_partitions' of 'deltalake._internal.RawDeltaTable' objects}
{method 'get_add_file_sizes' of 'deltalake._internal.RawDeltaTable' objects}
Looking through the commit history, I've narrowed it down to changes made in one of these commits:
Expected behavior
Loading data from Delta tables should not be significantly slower between versions without clear indicators why this would be the case.
Operating System
Linux
Binding
Python
Bindings Version
1.2.0
Steps to reproduce
The data in question is partitioned by NAME. Additionally, I filter on the timestamp of the datapoints in the TS column. I have several partitions to fetch, which I loop over manually. I know all the desired partitions could be fetched in one go, but the manual looping is necessary because not all the data fits in memory at once on the system running this code. I therefore fetch one partition at a time, do some post-processing, and store the processed results before fetching the next partition.
I've extracted the core parts of the data loading into the script below. Certain parts are missing but these have no effect on the impacted methods.
from azure.identity import DefaultAzureCredential
from deltalake import DeltaTable
from pyarrow.compute import field

credential = DefaultAzureCredential()
table_uri = f"abfss://{container_name}@{storage_account}.dfs.core.windows.net{path}"
delta_token = credential.get_token("https://storage.azure.com/").token
delta_table = DeltaTable(table_uri, storage_options={"token": delta_token}, version=version)

# Manual loop over the partitions
partitions_to_fetch = [...]
for partition in partitions_to_fetch:
    partitions = [("NAME", "in", partition)]
    filters = (field("TS") >= start_date) & (field("TS") <= end_date)
    df = delta_table.to_pandas(partitions=partitions, columns=columns, filters=filters)

Commit 8ebe4dd:
2481571 function calls (2424416 primitive calls) in 136.858 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
9 76.127 8.459 76.127 8.459 {method 'get_add_file_sizes' of 'deltalake._internal.RawDeltaTable' objects}
9 29.083 3.231 29.436 3.271 {method 'dataset_partitions' of 'deltalake._internal.RawDeltaTable' objects}
2 17.537 8.769 17.537 8.769 table.py:138(__init__)
9 4.803 0.534 111.266 12.363 table.py:960(to_pyarrow_table)
Commit 7b9f2f8:
2482108 function calls (2424944 primitive calls) in 38.173 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
2 19.285 9.643 19.285 9.643 table.py:138(__init__)
9 4.611 0.512 11.245 1.249 table.py:910(to_pyarrow_table)
9 3.002 0.334 3.356 0.373 {method 'dataset_partitions' of 'deltalake._internal.RawDeltaTable' objects}
9 2.349 0.261 2.349 0.261 {method 'get_add_file_sizes' of 'deltalake._internal.RawDeltaTable' objects}
9 0.801 0.089 0.801 0.089 {built-in method from_table}
Note the significant differences in the {method 'dataset_partitions' of 'deltalake._internal.RawDeltaTable' objects} and {method 'get_add_file_sizes' of 'deltalake._internal.RawDeltaTable' objects} lines: roughly 29s vs 3s and 76s vs 2.3s, respectively.
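For anyone wanting to reproduce this comparison, stats in the format above can be collected with the standard-library cProfile/pstats modules. A hedged sketch, where load_partition is a placeholder for the real per-partition DeltaTable.to_pandas call:

```python
import cProfile
import io
import pstats

def load_partition():
    # Placeholder workload; substitute the real per-partition load here
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
for _ in range(3):
    load_partition()
profiler.disable()

# Print the top entries ordered by internal time, matching the reports above
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("tottime").print_stats(5)
print(stream.getvalue())
```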
Relevant logs