chore: add script to generate _read_gbq_colab BigQuery benchmark tables #1846
base: main
Conversation
This script creates 10 BigQuery tables with varying schemas and data volumes based on predefined statistics. Key features:
- Dynamically generates table schemas to match target average row sizes, maximizing data type diversity.
- Generates random data for each table, respecting BigQuery data types.
- Includes placeholders for GCP project and dataset IDs.
- Handles very large table data generation by capping row counts for in-memory processing and printing warnings (actual BQ loads for huge tables would require GCS load jobs).
- Adds a specific requirements file for this script: `scripts/requirements-create_tables.txt`.
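As an illustration of the schema-sizing idea described above, here is a minimal, hypothetical sketch of picking column types to hit a target average row size. The type-size table, function name, and the `(column_name, type, extra_length)` triple format are assumptions for illustration only, not the script's actual `get_bq_schema` implementation.

```python
# Hypothetical fixed BigQuery type sizes in bytes; illustrative only.
FIXED_TYPE_SIZES = {
    "BOOL": 1,
    "INT64": 8,
    "FLOAT64": 8,
    "DATE": 8,
    "DATETIME": 8,
    "TIMESTAMP": 8,
}


def sketch_schema_for_row_size(target_bytes: int) -> list[tuple[str, str, int]]:
    """Return (column_name, type, extra_length) triples roughly filling target_bytes."""
    schema = []
    remaining = target_bytes
    # Greedily add one column of each fixed-size type while it still fits,
    # maximizing type diversity.
    for idx, (type_name, size) in enumerate(FIXED_TYPE_SIZES.items()):
        if remaining >= size:
            schema.append((f"col_{idx}_{type_name.lower()}", type_name, 0))
            remaining -= size
    # Spend any leftover bytes on a variable-length STRING column.
    if remaining > 0:
        schema.append((f"col_{len(schema)}_string", "STRING", remaining))
    return schema
```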
Vectorized the `generate_random_data` function in `scripts/create_read_gbq_colab_benchmark_tables.py`. Changes include:
- Using NumPy's vectorized operations (`size` parameter in random functions, `np.vectorize`) to generate arrays of random values for most data types at once.
- Employing list comprehensions for transformations on these arrays (e.g., formatting dates, generating strings from character arrays).
- Retaining loops for types where full vectorization is overly complex or offers little benefit (e.g., precise byte-length JSON strings, BYTES generation via `rng.bytes`).
- Assembling the final list of row dictionaries from the generated columnar data.

This should improve performance for data generation, especially for tables with a large number of rows.
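For context, a minimal sketch of the vectorized pattern described above, using NumPy's `Generator` API. The column names, value bounds, and string length are illustrative assumptions, not the script's actual values.

```python
import numpy as np

rng = np.random.default_rng(42)
batch_size = 1_000

# Fixed-size types come straight out of vectorized generators.
int_col = rng.integers(-(2**31), 2**31, size=batch_size, dtype=np.int64)
float_col = rng.random(size=batch_size)

# Variable-length strings: draw a character matrix once, then join each row
# with a list comprehension.
alphabet = np.array(list("abcdefghijklmnopqrstuvwxyz0123456789"))
chars = rng.choice(alphabet, size=(batch_size, 16))
string_col = ["".join(row) for row in chars]
```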
Refactored the script to process data in batches, significantly improving memory efficiency for large tables. Changes include:
1. `generate_random_data` function:
   - Modified to be a generator, yielding data in chunks of a specified `batch_size`.
   - The core vectorized logic for creating column data within each batch is retained.
2. `create_and_load_table` function:
   - Updated to consume data from the `generate_random_data` generator; it no longer accepts a full list of data rows.
   - For actual BigQuery loads, it iterates through generated batches and further sub-batches them (if necessary) for optimal `client.insert_rows_json` calls.
   - Simulation mode now reflects this batched processing by showing details of the first generated batch and the estimated total number of batches.
3. `main` function:
   - Removed pre-generation of the entire dataset or a capped sample.
   - The call to `create_and_load_table` now passes the parameters required for it to invoke and manage the data generator (total `num_rows`, the `rng` object, and `DATA_GENERATION_BATCH_SIZE`).
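A rough sketch of the batched-generator pattern described above. The function signatures and the all-integer column data are simplifying assumptions; only `client.insert_rows_json` is a real `google-cloud-bigquery` call, and the PR's actual functions may differ.

```python
from typing import Iterator

import numpy as np


def generate_random_data(
    schema, num_rows: int, rng: np.random.Generator, batch_size: int
) -> Iterator[list[dict]]:
    """Yield batches of row dicts so the full dataset never lives in memory."""
    for start in range(0, num_rows, batch_size):
        current_batch_size = min(batch_size, num_rows - start)
        # Simplification: every column gets small integers; the real script
        # generates values per BigQuery type.
        columns = {
            name: rng.integers(0, 100, size=current_batch_size).tolist()
            for name, _type, _extra in schema
        }
        yield [
            {name: values[i] for name, values in columns.items()}
            for i in range(current_batch_size)
        ]


def load_batches(client, table_id: str, batches) -> None:
    """Stream generated batches into BigQuery via insert_rows_json."""
    for batch in batches:
        errors = client.insert_rows_json(table_id, batch)
        if errors:
            print(f"Errors while inserting into {table_id}: {errors}")
```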
Refactored the `generate_random_data` function to use `numpy.datetime_as_string` for converting `numpy.datetime64` arrays to ISO-formatted strings for DATETIME and TIMESTAMP columns.
- For DATETIME:
  - Python `datetime.datetime` objects are created in a list first (to ensure date component validity), then converted to `numpy.datetime64[us]`.
  - `numpy.datetime_as_string` is used, and the output 'T' separator is replaced with a space.
- For TIMESTAMP:
  - `numpy.datetime64[us]` arrays are constructed directly from epoch seconds and microsecond offsets.
  - `numpy.datetime_as_string` is used with `timezone='UTC'` to produce a 'Z'-suffixed UTC string.

This change improves performance and code clarity for generating these timestamp string formats.
Implemented command-line arguments for specifying the Google Cloud project ID and BigQuery dataset ID, replacing hardcoded global constants. Changes:
- Imported the `argparse` module.
- Added optional `--project_id` (`-p`) and `--dataset_id` (`-d`) arguments to `main()`.
- If `project_id` or `dataset_id` is not provided, the script defaults to simulation mode.
- `create_and_load_table` now checks for the presence of both IDs to determine whether it should attempt actual BigQuery operations or run in simulation mode.
- Error handling in `create_and_load_table` for BQ operations was adjusted to log errors per table and continue processing remaining tables, rather than halting the script.
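A hedged sketch of the argparse wiring described above; the flag names match the description, but the help text, defaults, and surrounding logic are assumptions.

```python
import argparse


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Create benchmark tables for _read_gbq_colab."
    )
    parser.add_argument("--project_id", "-p", default=None, help="GCP project ID.")
    parser.add_argument("--dataset_id", "-d", default=None, help="BigQuery dataset ID.")
    args = parser.parse_args()

    if not (args.project_id and args.dataset_id):
        # Without both IDs the script only simulates table creation.
        print("Project or dataset not provided; running in simulation mode.")
    # ... create_and_load_table(...) would be called per table spec here.


if __name__ == "__main__":
    main()
```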
Added unit tests for the `get_bq_schema` and `generate_random_data` functions in `create_read_gbq_colab_benchmark_tables.py`.
- Created `scripts/create_read_gbq_colab_benchmark_tables_test.py`.
- Implemented pytest-style tests covering various scenarios:
  - For `get_bq_schema`:
    - Zero and small target byte sizes.
    - Exact fits with fixed-size types.
    - Inclusion and expansion of flexible types.
    - Generation of all fixed types where possible.
    - Uniqueness of column names.
    - Helper function `_calculate_row_size` used for validation.
  - For `generate_random_data`:
    - Zero-rows case.
    - Basic schema and batching logic (single batch, multiple full batches, partial last batches).
    - Generation of all supported data types, checking Python types, string formats (using regex and `fromisoformat`), lengths for string/bytes, and JSON validity.
- Added `pytest` and `pandas` (for pytest compatibility in the current project environment) to `scripts/requirements-create_tables.txt`.
- All tests pass.
Testing manually with
So far, so good.
years = rng.integers(1, 10000, size=current_batch_size)
months = rng.integers(1, 13, size=current_batch_size)
days_val = rng.integers(
    1, 29, size=current_batch_size
)  # Simplified day generation
hours = rng.integers(0, 24, size=current_batch_size)
minutes = rng.integers(0, 60, size=current_batch_size)
seconds = rng.integers(0, 60, size=current_batch_size)
microseconds = rng.integers(0, 1000000, size=current_batch_size)

# Construct Python datetime objects then convert to numpy.datetime64 for string conversion
py_datetimes = []
for i in range(current_batch_size):
    try:
        py_datetimes.append(
            datetime.datetime(
                years[i],
                months[i],
                days_val[i],
                hours[i],
                minutes[i],
                seconds[i],
                microseconds[i],
            )
        )
    except ValueError:  # Fallback for invalid date component combinations
        py_datetimes.append(
            datetime.datetime(
                2000,
                1,
                1,
                hours[i],
                minutes[i],
                seconds[i],
                microseconds[i],
            )
        )

np_datetimes = np.array(py_datetimes, dtype="datetime64[us]")
# np.datetime_as_string produces 'YYYY-MM-DDTHH:MM:SS.ffffff'
# BQ DATETIME typically uses a space separator: 'YYYY-MM-DD HH:MM:SS.ffffff'
datetime_strings = np.datetime_as_string(np_datetimes, unit="us")
columns_data_batch[col_name] = np.array(
    [s.replace("T", " ") for s in datetime_strings]
)
style nit: It seems this chunk of logic is complicated enough to merit its own helper function. It may also make the entire hosting function more readable.
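One possible shape for the extracted helper the reviewer suggests, based on the diff above. The function name and signature are illustrative, and the original's `ValueError` fallback is omitted since days are capped at 28.

```python
import datetime

import numpy as np


def _generate_datetime_strings(
    rng: np.random.Generator, current_batch_size: int
) -> np.ndarray:
    """Return BQ DATETIME strings ('YYYY-MM-DD HH:MM:SS.ffffff') for one batch."""
    years = rng.integers(1, 10000, size=current_batch_size)
    months = rng.integers(1, 13, size=current_batch_size)
    days_val = rng.integers(1, 29, size=current_batch_size)  # always a valid day
    hours = rng.integers(0, 24, size=current_batch_size)
    minutes = rng.integers(0, 60, size=current_batch_size)
    seconds = rng.integers(0, 60, size=current_batch_size)
    microseconds = rng.integers(0, 1000000, size=current_batch_size)

    # Build Python datetimes first, then convert to datetime64[us] for fast
    # vectorized string formatting.
    py_datetimes = [
        datetime.datetime(
            years[i], months[i], days_val[i],
            hours[i], minutes[i], seconds[i], microseconds[i],
        )
        for i in range(current_batch_size)
    ]
    np_datetimes = np.array(py_datetimes, dtype="datetime64[us]")
    datetime_strings = np.datetime_as_string(np_datetimes, unit="us")
    # BigQuery DATETIME uses a space separator instead of 'T'.
    return np.array([s.replace("T", " ") for s in datetime_strings])
```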
# Generate seconds from a broad range (e.g., year 1 to 9999)
# Note: Python's datetime.timestamp() might be limited by system's C mktime.
# For broader range with np.datetime64, it's usually fine.
# Let's generate epoch seconds relative to Unix epoch for np.datetime64 compatibility
min_epoch_seconds = int(
    datetime.datetime(
        1, 1, 1, 0, 0, 0, tzinfo=datetime.timezone.utc
    ).timestamp()
)
# Max for datetime64[s] is far out, but let's bound it reasonably for BQ.
max_epoch_seconds = int(
    datetime.datetime(
        9999, 12, 28, 23, 59, 59, tzinfo=datetime.timezone.utc
    ).timestamp()
)

epoch_seconds = rng.integers(
    min_epoch_seconds,
    max_epoch_seconds + 1,
    size=current_batch_size,
    dtype=np.int64,
)
microseconds_offset = rng.integers(
    0, 1000000, size=current_batch_size, dtype=np.int64
)

# Create datetime64[s] from epoch seconds and add microseconds as timedelta64[us]
np_timestamps_s = epoch_seconds.astype("datetime64[s]")
np_microseconds_td = microseconds_offset.astype("timedelta64[us]")
np_timestamps_us = np_timestamps_s + np_microseconds_td

# Convert to string with UTC timezone indicator
# np.datetime_as_string with timezone='UTC' produces 'YYYY-MM-DDTHH:MM:SS.ffffffZ'
# BigQuery generally accepts this for TIMESTAMP.
columns_data_batch[col_name] = np.datetime_as_string(
    np_timestamps_us, unit="us", timezone="UTC"
)
Perhaps a helper function too?
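Similarly, a possible helper for the TIMESTAMP branch above; the function name, module-level constants, and signature are illustrative, not part of the PR.

```python
import datetime

import numpy as np

# Epoch-second bounds roughly covering years 1..9999, as in the diff above.
_MIN_EPOCH_SECONDS = int(
    datetime.datetime(1, 1, 1, tzinfo=datetime.timezone.utc).timestamp()
)
_MAX_EPOCH_SECONDS = int(
    datetime.datetime(
        9999, 12, 28, 23, 59, 59, tzinfo=datetime.timezone.utc
    ).timestamp()
)


def _generate_timestamp_strings(
    rng: np.random.Generator, current_batch_size: int
) -> np.ndarray:
    """Return 'Z'-suffixed UTC TIMESTAMP strings for one batch."""
    epoch_seconds = rng.integers(
        _MIN_EPOCH_SECONDS,
        _MAX_EPOCH_SECONDS + 1,
        size=current_batch_size,
        dtype=np.int64,
    )
    microseconds_offset = rng.integers(
        0, 1000000, size=current_batch_size, dtype=np.int64
    )
    # Combine whole seconds and microsecond offsets into datetime64[us].
    np_timestamps_us = epoch_seconds.astype("datetime64[s]") + microseconds_offset.astype(
        "timedelta64[us]"
    )
    return np.datetime_as_string(np_timestamps_us, unit="us", timezone="UTC")
```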
def test_get_bq_schema_one_byte():
    schema = get_bq_schema(1)
    assert len(schema) == 1
super nit: it's recommended to place an empty line before the assertions in each test function to demarcate arrange/act/assert blocks. It makes it more straightforward what action is being tested.
go/unit-testing-practices?polyglot=python#structure.
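Applied to the test above, the suggestion would look like this, with a blank line separating the act and assert blocks:

```python
def test_get_bq_schema_one_byte():
    schema = get_bq_schema(1)

    assert len(schema) == 1
```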
string_cols = [s for s in schema if s[1] == "STRING"]
bytes_cols = [s for s in schema if s[1] == "BYTES"]
json_cols = [s for s in schema if s[1] == "JSON"]
assert len(string_cols) == 1
assert len(bytes_cols) == 1
assert len(json_cols) == 1
assert string_cols[0][2] == 0
assert bytes_cols[0][2] == 0
assert json_cols[0][2] == 1
nit: maybe we can re-organize the code into this:
string_cols = ...
assert len(string_cols) == 1
assert string_cols[0][2] == 0
bytes_cols = ...
assert len(byte_cols)...
...
This is from Java best practice: go/java-style#s4.8.2.2-variables-limited-scope but I often find it useful for other languages too.
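Applied to the snippet above, the regrouped version would read:

```python
string_cols = [s for s in schema if s[1] == "STRING"]
assert len(string_cols) == 1
assert string_cols[0][2] == 0

bytes_cols = [s for s in schema if s[1] == "BYTES"]
assert len(bytes_cols) == 1
assert bytes_cols[0][2] == 0

json_cols = [s for s in schema if s[1] == "JSON"]
assert len(json_cols) == 1
assert json_cols[0][2] == 1
```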
Edit: Sorry, this is a draft. I'll ping you when it's ready for review.

Ready!
Towards internal issue b/420984164 🦕