Skip to content

feat: add file size-based chunking to JsonOutput (WIP) #650

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

inishchith
Copy link
Member

Changelog

  • Add file size-based chunking to JsonOutput to prevent DAPR gRPC message size limit issues - Implement estimate_dataframe_json_size() function to estimate JSON size of DataFrames - Add byte-level buffer tracking alongside existing record count tracking
  • Automatically flush buffer when approaching DAPR maximum message size (90% threshold)

Additional context (e.g. screenshots, logs, links)

  • to be added

Checklist

  • Additional tests added
  • All CI checks passed
  • Relevant documentation updated

Copyleft License Compliance

  • Have you used any code that is subject to a Copyleft license (e.g., GPL, AGPL, LGPL)?
  • If yes, have you modified the code in the context of this project? please share additional details.

Copy link

snyk-io-us bot commented Aug 13, 2025

🎉 Snyk checks have passed. No issues have been found so far.

security/snyk check is complete. No issues have been found. (View Details)

license/snyk check is complete. No issues have been found. (View Details)

code/snyk check is complete. No issues have been found. (View Details)

Copy link

github-actions bot commented Aug 13, 2025

📜 Docstring Coverage Report

RESULT: PASSED (minimum: 30.0%, actual: 72.9%)

Detailed Coverage Report
======= Coverage for /home/runner/work/application-sdk/application-sdk/ ========
----------------------------------- Summary ------------------------------------
| Name                                                                              | Total | Miss | Cover | Cover% |
|-----------------------------------------------------------------------------------|-------|------|-------|--------|
| application_sdk/__init__.py                                                       |     1 |    0 |     1 |   100% |
| application_sdk/constants.py                                                      |     1 |    0 |     1 |   100% |
| application_sdk/version.py                                                        |     1 |    0 |     1 |   100% |
| application_sdk/worker.py                                                         |     4 |    0 |     4 |   100% |
| application_sdk/activities/__init__.py                                            |     9 |    0 |     9 |   100% |
| application_sdk/activities/common/__init__.py                                     |     1 |    1 |     0 |     0% |
| application_sdk/activities/common/models.py                                       |     2 |    0 |     2 |   100% |
| application_sdk/activities/common/utils.py                                        |     7 |    1 |     6 |    86% |
| application_sdk/activities/metadata_extraction/__init__.py                        |     1 |    1 |     0 |     0% |
| application_sdk/activities/metadata_extraction/rest.py                            |     1 |    1 |     0 |     0% |
| application_sdk/activities/metadata_extraction/sql.py                             |    16 |    1 |    15 |    94% |
| application_sdk/activities/query_extraction/__init__.py                           |     1 |    1 |     0 |     0% |
| application_sdk/activities/query_extraction/sql.py                                |    13 |    1 |    12 |    92% |
| application_sdk/application/__init__.py                                           |    10 |    3 |     7 |    70% |
| application_sdk/application/metadata_extraction/sql.py                            |     7 |    1 |     6 |    86% |
| application_sdk/clients/__init__.py                                               |     4 |    0 |     4 |   100% |
| application_sdk/clients/atlan_auth.py                                             |    10 |    0 |    10 |   100% |
| application_sdk/clients/atlan_client.py                                           |     3 |    2 |     1 |    33% |
| application_sdk/clients/sql.py                                                    |    15 |    0 |    15 |   100% |
| application_sdk/clients/temporal.py                                               |    20 |    1 |    19 |    95% |
| application_sdk/clients/utils.py                                                  |     2 |    1 |     1 |    50% |
| application_sdk/clients/workflow.py                                               |     9 |    2 |     7 |    78% |
| application_sdk/common/__init__.py                                                |     1 |    1 |     0 |     0% |
| application_sdk/common/aws_utils.py                                               |     4 |    1 |     3 |    75% |
| application_sdk/common/credential_utils.py                                        |     4 |    0 |     4 |   100% |
| application_sdk/common/dapr_utils.py                                              |     2 |    1 |     1 |    50% |
| application_sdk/common/dataframe_utils.py                                         |     2 |    1 |     1 |    50% |
| application_sdk/common/error_codes.py                                             |    14 |    2 |    12 |    86% |
| application_sdk/common/utils.py                                                   |    12 |    2 |    10 |    83% |
| application_sdk/docgen/__init__.py                                                |     5 |    2 |     3 |    60% |
| application_sdk/docgen/exporters/__init__.py                                      |     1 |    1 |     0 |     0% |
| application_sdk/docgen/exporters/mkdocs.py                                        |     7 |    3 |     4 |    57% |
| application_sdk/docgen/models/__init__.py                                         |     1 |    1 |     0 |     0% |
| application_sdk/docgen/models/export/__init__.py                                  |     1 |    1 |     0 |     0% |
| application_sdk/docgen/models/export/page.py                                      |     2 |    1 |     1 |    50% |
| application_sdk/docgen/models/manifest/__init__.py                                |     2 |    1 |     1 |    50% |
| application_sdk/docgen/models/manifest/customer.py                                |     3 |    1 |     2 |    67% |
| application_sdk/docgen/models/manifest/internal.py                                |     2 |    1 |     1 |    50% |
| application_sdk/docgen/models/manifest/metadata.py                                |     2 |    1 |     1 |    50% |
| application_sdk/docgen/models/manifest/page.py                                    |     2 |    1 |     1 |    50% |
| application_sdk/docgen/models/manifest/section.py                                 |     2 |    1 |     1 |    50% |
| application_sdk/docgen/parsers/__init__.py                                        |     1 |    1 |     0 |     0% |
| application_sdk/docgen/parsers/directory.py                                       |    13 |    2 |    11 |    85% |
| application_sdk/docgen/parsers/manifest.py                                        |     6 |    1 |     5 |    83% |
| application_sdk/handlers/__init__.py                                              |     6 |    1 |     5 |    83% |
| application_sdk/handlers/sql.py                                                   |    16 |    3 |    13 |    81% |
| application_sdk/inputs/__init__.py                                                |     6 |    1 |     5 |    83% |
| application_sdk/inputs/iceberg.py                                                 |     7 |    3 |     4 |    57% |
| application_sdk/inputs/json.py                                                    |     8 |    2 |     6 |    75% |
| application_sdk/inputs/objectstore.py                                             |     7 |    1 |     6 |    86% |
| application_sdk/inputs/parquet.py                                                 |     8 |    1 |     7 |    88% |
| application_sdk/inputs/secretstore.py                                             |     6 |    1 |     5 |    83% |
| application_sdk/inputs/sql_query.py                                               |    11 |    1 |    10 |    91% |
| application_sdk/inputs/statestore.py                                              |     6 |    3 |     3 |    50% |
| application_sdk/observability/__init__.py                                         |     1 |    1 |     0 |     0% |
| application_sdk/observability/logger_adaptor.py                                   |    31 |    3 |    28 |    90% |
| application_sdk/observability/metrics_adaptor.py                                  |    16 |    2 |    14 |    88% |
| application_sdk/observability/observability.py                                    |    24 |    1 |    23 |    96% |
| application_sdk/observability/traces_adaptor.py                                   |    16 |    1 |    15 |    94% |
| application_sdk/observability/utils.py                                            |     4 |    1 |     3 |    75% |
| application_sdk/observability/decorators/observability_decorator.py               |     7 |    4 |     3 |    43% |
| application_sdk/outputs/__init__.py                                               |     9 |    0 |     9 |   100% |
| application_sdk/outputs/atlan_storage.py                                          |     5 |    0 |     5 |   100% |
| application_sdk/outputs/eventstore.py                                             |    14 |    9 |     5 |    36% |
| application_sdk/outputs/iceberg.py                                                |     5 |    1 |     4 |    80% |
| application_sdk/outputs/json.py                                                   |     9 |    1 |     8 |    89% |
| application_sdk/outputs/objectstore.py                                            |     5 |    1 |     4 |    80% |
| application_sdk/outputs/parquet.py                                                |     8 |    1 |     7 |    88% |
| application_sdk/outputs/secretstore.py                                            |     3 |    1 |     2 |    67% |
| application_sdk/outputs/statestore.py                                             |     4 |    1 |     3 |    75% |
| application_sdk/server/__init__.py                                                |     4 |    0 |     4 |   100% |
| application_sdk/server/fastapi/__init__.py                                        |    22 |    4 |    18 |    82% |
| application_sdk/server/fastapi/models.py                                          |    25 |   25 |     0 |     0% |
| application_sdk/server/fastapi/utils.py                                           |     2 |    0 |     2 |   100% |
| application_sdk/server/fastapi/middleware/logmiddleware.py                        |     4 |    4 |     0 |     0% |
| application_sdk/server/fastapi/middleware/metrics.py                              |     3 |    3 |     0 |     0% |
| application_sdk/server/fastapi/routers/__init__.py                                |     1 |    1 |     0 |     0% |
| application_sdk/server/fastapi/routers/server.py                                  |     8 |    2 |     6 |    75% |
| application_sdk/test_utils/__init__.py                                            |     1 |    1 |     0 |     0% |
| application_sdk/test_utils/workflow_monitoring.py                                 |     3 |    0 |     3 |   100% |
| application_sdk/test_utils/e2e/__init__.py                                        |    14 |    2 |    12 |    86% |
| application_sdk/test_utils/e2e/base.py                                            |    16 |    2 |    14 |    88% |
| application_sdk/test_utils/e2e/client.py                                          |    10 |    2 |     8 |    80% |
| application_sdk/test_utils/e2e/conftest.py                                        |     1 |    1 |     0 |     0% |
| application_sdk/test_utils/e2e/utils.py                                           |     3 |    1 |     2 |    67% |
| application_sdk/test_utils/hypothesis/__init__.py                                 |     1 |    1 |     0 |     0% |
| application_sdk/test_utils/hypothesis/strategies/__init__.py                      |     1 |    1 |     0 |     0% |
| application_sdk/test_utils/hypothesis/strategies/sql_client.py                    |     1 |    1 |     0 |     0% |
| application_sdk/test_utils/hypothesis/strategies/temporal.py                      |     6 |    1 |     5 |    83% |
| application_sdk/test_utils/hypothesis/strategies/clients/__init__.py              |     1 |    1 |     0 |     0% |
| application_sdk/test_utils/hypothesis/strategies/clients/sql.py                   |     1 |    1 |     0 |     0% |
| application_sdk/test_utils/hypothesis/strategies/common/__init__.py               |     1 |    1 |     0 |     0% |
| application_sdk/test_utils/hypothesis/strategies/common/logger.py                 |     3 |    0 |     3 |   100% |
| application_sdk/test_utils/hypothesis/strategies/handlers/__init__.py             |     1 |    1 |     0 |     0% |
| application_sdk/test_utils/hypothesis/strategies/handlers/sql/__init__.py         |     1 |    1 |     0 |     0% |
| application_sdk/test_utils/hypothesis/strategies/handlers/sql/sql_metadata.py     |     1 |    1 |     0 |     0% |
| application_sdk/test_utils/hypothesis/strategies/handlers/sql/sql_preflight.py    |     1 |    1 |     0 |     0% |
| application_sdk/test_utils/hypothesis/strategies/inputs/__init__.py               |     1 |    1 |     0 |     0% |
| application_sdk/test_utils/hypothesis/strategies/inputs/json_input.py             |     1 |    1 |     0 |     0% |
| application_sdk/test_utils/hypothesis/strategies/outputs/__init__.py              |     1 |    1 |     0 |     0% |
| application_sdk/test_utils/hypothesis/strategies/outputs/json_output.py           |     2 |    1 |     1 |    50% |
| application_sdk/test_utils/hypothesis/strategies/outputs/statestore.py            |     3 |    1 |     2 |    67% |
| application_sdk/test_utils/hypothesis/strategies/server/__init__.py               |     1 |    1 |     0 |     0% |
| application_sdk/test_utils/hypothesis/strategies/server/fastapi/__init__.py       |     1 |    1 |     0 |     0% |
| application_sdk/test_utils/scale_data_generator/__init__.py                       |     1 |    1 |     0 |     0% |
| application_sdk/test_utils/scale_data_generator/config_loader.py                  |    10 |    4 |     6 |    60% |
| application_sdk/test_utils/scale_data_generator/data_generator.py                 |    10 |    3 |     7 |    70% |
| application_sdk/test_utils/scale_data_generator/driver.py                         |     3 |    3 |     0 |     0% |
| application_sdk/test_utils/scale_data_generator/output_handler/__init__.py        |     1 |    1 |     0 |     0% |
| application_sdk/test_utils/scale_data_generator/output_handler/base.py            |     7 |    3 |     4 |    57% |
| application_sdk/test_utils/scale_data_generator/output_handler/csv_handler.py     |     5 |    5 |     0 |     0% |
| application_sdk/test_utils/scale_data_generator/output_handler/json_handler.py    |     5 |    5 |     0 |     0% |
| application_sdk/test_utils/scale_data_generator/output_handler/parquet_handler.py |     6 |    6 |     0 |     0% |
| application_sdk/transformers/__init__.py                                          |     3 |    1 |     2 |    67% |
| application_sdk/transformers/atlas/__init__.py                                    |     6 |    1 |     5 |    83% |
| application_sdk/transformers/atlas/sql.py                                         |    25 |    4 |    21 |    84% |
| application_sdk/transformers/common/__init__.py                                   |     1 |    1 |     0 |     0% |
| application_sdk/transformers/common/utils.py                                      |     6 |    0 |     6 |   100% |
| application_sdk/transformers/query/__init__.py                                    |    11 |    2 |     9 |    82% |
| application_sdk/workflows/__init__.py                                             |     4 |    0 |     4 |   100% |
| application_sdk/workflows/metadata_extraction/__init__.py                         |     2 |    2 |     0 |     0% |
| application_sdk/workflows/metadata_extraction/sql.py                              |     8 |    0 |     8 |   100% |
| application_sdk/workflows/query_extraction/__init__.py                            |     2 |    2 |     0 |     0% |
| application_sdk/workflows/query_extraction/sql.py                                 |     4 |    0 |     4 |   100% |
| examples/application_custom_fastapi.py                                            |    14 |   14 |     0 |     0% |
| examples/application_fastapi.py                                                   |     9 |    9 |     0 |     0% |
| examples/application_hello_world.py                                               |     7 |    7 |     0 |     0% |
| examples/application_sql.py                                                       |     5 |    4 |     1 |    20% |
| examples/application_sql_miner.py                                                 |     5 |    4 |     1 |    20% |
| examples/application_sql_with_custom_pyatlan_transformer.py                       |    11 |    9 |     2 |    18% |
| examples/application_sql_with_custom_transformer.py                               |     9 |    8 |     1 |    11% |
| examples/run_examples.py                                                          |     2 |    1 |     1 |    50% |
| tests/__init__.py                                                                 |     1 |    1 |     0 |     0% |
| tests/conftest.py                                                                 |     2 |    0 |     2 |   100% |
| tests/unit/__init__.py                                                            |     1 |    1 |     0 |     0% |
| tests/unit/test_worker.py                                                         |    10 |    5 |     5 |    50% |
| tests/unit/activities/__init__.py                                                 |     1 |    1 |     0 |     0% |
| tests/unit/activities/test_activities.py                                          |    34 |    0 |    34 |   100% |
| tests/unit/activities/common/__init__.py                                          |     1 |    1 |     0 |     0% |
| tests/unit/activities/common/test_utils.py                                        |    22 |   10 |    12 |    55% |
| tests/unit/activities/metadata_extraction/__init__.py                             |     1 |    1 |     0 |     0% |
| tests/unit/activities/metadata_extraction/test_sql.py                             |    35 |   34 |     1 |     3% |
| tests/unit/activities/query_extraction/__init__.py                                |     1 |    1 |     0 |     0% |
| tests/unit/application/__init__.py                                                |     1 |    1 |     0 |     0% |
| tests/unit/application/test_application.py                                        |    28 |    2 |    26 |    93% |
| tests/unit/application/metadata_extraction/test_sql.py                            |    30 |    6 |    24 |    80% |
| tests/unit/clients/__init__.py                                                    |     1 |    1 |     0 |     0% |
| tests/unit/clients/test_async_sql_client.py                                       |    13 |   13 |     0 |     0% |
| tests/unit/clients/test_atlan_auth.py                                             |    10 |    0 |    10 |   100% |
| tests/unit/clients/test_atlan_client.py                                           |     4 |    4 |     0 |     0% |
| tests/unit/clients/test_atlanauth.py                                              |    10 |    0 |    10 |   100% |
| tests/unit/clients/test_sql_client.py                                             |    28 |    6 |    22 |    79% |
| tests/unit/clients/test_temporal_client.py                                        |    16 |    3 |    13 |    81% |
| tests/unit/common/test_aws_utils.py                                               |    11 |    1 |    10 |    91% |
| tests/unit/common/test_credential_utils.py                                        |    10 |    1 |     9 |    90% |
| tests/unit/common/test_utils.py                                                   |    74 |    6 |    68 |    92% |
| tests/unit/docgen/parsers/test_directory_parser.py                                |    14 |    3 |    11 |    79% |
| tests/unit/docgen/parsers/test_manifest_parser.py                                 |    12 |   12 |     0 |     0% |
| tests/unit/handlers/__init__.py                                                   |     1 |    1 |     0 |     0% |
| tests/unit/handlers/sql/test_auth.py                                              |    10 |    4 |     6 |    60% |
| tests/unit/handlers/sql/test_check_schemas_and_databases.py                       |    14 |    4 |    10 |    71% |
| tests/unit/handlers/sql/test_extract_allowed_schemas.py                           |    11 |    3 |     8 |    73% |
| tests/unit/handlers/sql/test_metadata.py                                          |    27 |   10 |    17 |    63% |
| tests/unit/handlers/sql/test_preflight_check.py                                   |    16 |   15 |     1 |     6% |
| tests/unit/handlers/sql/test_prepare_metadata.py                                  |    14 |    4 |    10 |    71% |
| tests/unit/handlers/sql/test_tables_check.py                                      |     9 |    6 |     3 |    33% |
| tests/unit/handlers/sql/test_validate_filters.py                                  |    12 |    4 |     8 |    67% |
| tests/unit/inputs/test_json.py                                                    |    21 |   12 |     9 |    43% |
| tests/unit/observability/__init__.py                                              |     1 |    1 |     0 |     0% |
| tests/unit/observability/test_logger_adaptor.py                                   |    20 |    2 |    18 |    90% |
| tests/unit/observability/test_metrics_adaptor.py                                  |    14 |    1 |    13 |    93% |
| tests/unit/observability/test_traces_adaptor.py                                   |    10 |    1 |     9 |    90% |
| tests/unit/outputs/test_atlanstorage.py                                           |    10 |    0 |    10 |   100% |
| tests/unit/outputs/test_iceberg.py                                                |    11 |    4 |     7 |    64% |
| tests/unit/outputs/test_json_output.py                                            |     7 |    6 |     1 |    14% |
| tests/unit/outputs/test_objectstore.py                                            |     9 |    8 |     1 |    11% |
| tests/unit/outputs/test_output.py                                                 |    19 |    6 |    13 |    68% |
| tests/unit/outputs/test_statestore.py                                             |     6 |    6 |     0 |     0% |
| tests/unit/server/__init__.py                                                     |     1 |    1 |     0 |     0% |
| tests/unit/server/fastapi/test_fastapi.py                                         |     8 |    3 |     5 |    62% |
| tests/unit/server/fastapi/routers/__init__.py                                     |     1 |    1 |     0 |     0% |
| tests/unit/server/fastapi/routers/server.py                                       |     1 |    1 |     0 |     0% |
| tests/unit/transformers/__init__.py                                               |     1 |    1 |     0 |     0% |
| tests/unit/transformers/atlas/__init__.py                                         |     1 |    1 |     0 |     0% |
| tests/unit/transformers/atlas/test_column.py                                      |    17 |    6 |    11 |    65% |
| tests/unit/transformers/atlas/test_database.py                                    |     8 |    6 |     2 |    25% |
| tests/unit/transformers/atlas/test_function.py                                    |     9 |    5 |     4 |    44% |
| tests/unit/transformers/atlas/test_procedure.py                                   |     7 |    6 |     1 |    14% |
| tests/unit/transformers/atlas/test_schema.py                                      |     8 |    6 |     2 |    25% |
| tests/unit/transformers/atlas/test_table.py                                       |    13 |    6 |     7 |    54% |
| tests/unit/transformers/query/test_sql_transformer.py                             |    14 |    4 |    10 |    71% |
| tests/unit/transformers/query/test_sql_transformer_output_validation.py           |     5 |    2 |     3 |    60% |
| tests/unit/workflows/metadata_extraction/test_sql.py                              |     9 |    4 |     5 |    56% |
| tests/unit/workflows/query_extraction/__init__.py                                 |     1 |    1 |     0 |     0% |
| tests/unit/workflows/query_extraction/test_sql.py                                 |     8 |    3 |     5 |    62% |
|-----------------------------------------------------------------------------------|-------|------|-------|--------|
| TOTAL                                                                             |  1522 |  524 |   998 |  65.6% |
---------------- RESULT: PASSED (minimum: 30.0%, actual: 65.6%) ----------------

@inishchith inishchith force-pushed the feat/add-file-size-based-chunking branch from 3ea6815 to 8987cb7 Compare August 13, 2025 02:41
@inishchith inishchith changed the title fix: remove extra whitespace in estimate_dataframe_json_size feat: add file size-based chunking to JsonOutput (WIP) Aug 13, 2025
Copy link

github-actions bot commented Aug 13, 2025

📦 Trivy Vulnerability Scan Results

Schema Version Created At Artifact Type
2 2025-08-13T02:42:30.723522681Z . filesystem

Report Summary

Target Type Vulnerabilities . filesystem ✅ None found

Scan Result Details

✅ No vulnerabilities found during the scan for ..

Copy link

github-actions bot commented Aug 13, 2025

📦 Trivy Secret Scan Results

Schema Version Created At Artifact Type
2 2025-08-13T02:42:41.755858479Z . filesystem

Report Summary

Target Type Secrets . filesystem ✅ None found

Scan Result Details

✅ No secrets found during the scan for ..

@atlan-ci
Copy link
Collaborator

atlan-ci commented Aug 13, 2025

☂️ Python Coverage

current status: ✅

Overall Coverage

Lines Covered Coverage Threshold Status
5325 3585 67% 0% 🟢

New Files

No new covered files...

Modified Files

No covered modified files...

updated for commit: 8987cb7 by action🐍

@atlan-ci
Copy link
Collaborator

☂️ Python Coverage

current status: ✅

Overall Coverage

Lines Covered Coverage Threshold Status
5325 3588 67% 0% 🟢

New Files

No new covered files...

Modified Files

File Coverage Status
application_sdk/outputs/json.py 43% 🟢
TOTAL 43% 🟢

updated for commit: 3ea6815 by action🐍

Copy link

github-actions bot commented Aug 13, 2025

@atlan-ci
Copy link
Collaborator

☂️ Python Coverage

current status: ✅

Overall Coverage

Lines Covered Coverage Threshold Status
5325 3588 67% 0% 🟢

New Files

No new covered files...

Modified Files

File Coverage Status
application_sdk/outputs/json.py 43% 🟢
TOTAL 43% 🟢

updated for commit: 3ea6815 by action🐍

Copy link

github-actions bot commented Aug 13, 2025

🛠 Full Test Coverage Report: https://k.atlan.dev/coverage/application-sdk/pr/650

1 similar comment
Copy link

🛠 Full Test Coverage Report: https://k.atlan.dev/coverage/application-sdk/pr/650

self.buffer.append(chunk)
self.current_buffer_size += len(chunk)
self.current_buffer_size_bytes += chunk_size_bytes
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: JsonOutput Buffer Flush Fails for Oversized Initial Chunks

The JsonOutput's size-based flush condition (current_buffer_size > 0) prevents flushing when the buffer is empty, even if the first chunk's estimated size exceeds max_file_size_bytes. This results in oversized files being written, which then fail to upload via ObjectStoreOutput.push_file_to_object_store due to DAPR gRPC message size limits. The current logic does not split an oversized single chunk.

Fix in Cursor Fix in Web

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants