chore: add script to generate _read_gbq_colab BigQuery benchmark tables #1846

Open: tswast wants to merge 16 commits into main

Conversation

tswast (Collaborator) commented on Jun 24, 2025

Edit: Sorry, this is a draft. I'll ping you when it's ready for review. Ready!

This script creates 10 BigQuery tables with varying schemas and data volumes based on predefined statistics.

Key features:

  • Dynamically generates table schemas to match target average row sizes, maximizing data type diversity (see the sketch below this list).
  • Generates random data for each table, respecting BigQuery data types.
  • Includes placeholders for GCP project and dataset IDs.
  • Handles very large table data generation by capping row numbers for in-memory processing and printing warnings (actual BQ load for huge tables would require GCS load jobs).
  • Adds a specific requirements file for this script: scripts/requirements-create_tables.txt.
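
A minimal sketch of the schema-sizing idea from the first bullet. This is not the script's actual code; the function name and type-size table are illustrative, based on BigQuery's documented logical byte sizes:

```python
# Illustrative only: greedily pack fixed-size BigQuery types up to a target
# average row size, then spend any remainder on a variable-width STRING column.
FIXED_TYPE_SIZES = {  # BigQuery logical sizes in bytes
    "BOOL": 1,
    "INT64": 8,
    "FLOAT64": 8,
    "DATE": 8,
    "DATETIME": 8,
    "TIMESTAMP": 8,
}


def sketch_schema_for_row_size(target_bytes: int) -> list[tuple[str, str, int]]:
    """Return (column_name, bq_type, flexible_length) tuples roughly summing to target_bytes."""
    schema = []
    remaining = target_bytes
    for i, (bq_type, size) in enumerate(FIXED_TYPE_SIZES.items()):
        if remaining < size:
            break
        schema.append((f"col_{i}_{bq_type.lower()}", bq_type, 0))
        remaining -= size
    if remaining > 0:
        # A STRING costs roughly 2 bytes plus the UTF-8 length of its value.
        schema.append(("col_flex_string", "STRING", max(remaining - 2, 0)))
    return schema
```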

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Towards internal issue b/420984164 🦕

@tswast requested review from a team as code owners on June 24, 2025 14:59
@tswast requested a review from sycai on June 24, 2025 14:59
product-auto-label bot added the "size: l" (Pull request size is large) label on Jun 24, 2025
product-auto-label bot added the "api: bigquery" (Issues related to the googleapis/python-bigquery-dataframes API) label on Jun 24, 2025
@tswast closed this on Jun 24, 2025
@tswast reopened this on Jun 24, 2025
@tswast added the "do not merge" (Indicates a pull request not ready for merge, due to either quality or timing) label on Jun 24, 2025
@tswast closed this on Jun 24, 2025
Vectorized the `generate_random_data` function in
`scripts/create_read_gbq_colab_benchmark_tables.py`.

Changes include:
- Using NumPy's vectorized operations (`size` parameter in random
  functions, `np.vectorize`) to generate arrays of random values for
  most data types at once.
- Employing list comprehensions for transformations on these arrays (e.g.,
  formatting dates, generating strings from character arrays).
- Retaining loops for types where full vectorization is overly complex
  or offers little benefit (e.g., precise byte-length JSON strings, BYTES
  generation via `rng.bytes`).
- Assembling the final list of row dictionaries from the generated
  columnar data.

This should improve performance for data generation, especially for
tables with a large number of rows.
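
A condensed illustration of the vectorization pattern described above (not the PR's exact code; the column names, sizes, and alphabet are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n = 1_000  # rows in this example batch

# One vectorized call per numeric column instead of a Python loop per row.
int64_col = rng.integers(-(2**31), 2**31, size=n, dtype=np.int64)
float64_col = rng.random(size=n)

# Fixed-length strings: draw an (n, k) grid of letter indices at once,
# then join each row with a list comprehension.
k = 10
letters = np.array(list("abcdefghijklmnopqrstuvwxyz"))
char_grid = letters[rng.integers(0, len(letters), size=(n, k))]
string_col = ["".join(row) for row in char_grid]
```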
@tswast reopened this on Jun 24, 2025
@tswast removed the request for review from sycai on June 24, 2025 15:18
@tswast marked this pull request as draft on June 24, 2025 15:18
google-labs-jules bot and others added 8 commits June 24, 2025 15:28
Refactored the script to process data in batches, significantly
improving memory efficiency for large tables.

Changes include:

1.  `generate_random_data` function:
    *   Modified to be a generator, yielding data in chunks of a
        specified `batch_size`.
    *   The core vectorized logic for creating column data within each
        batch is retained.

2.  `create_and_load_table` function:
    *   Updated to consume data from the `generate_random_data` generator.
    *   No longer accepts a full list of data rows.
    *   For actual BigQuery loads, it iterates through generated batches
        and further sub-batches them (if necessary) for optimal
        `client.insert_rows_json` calls.
    *   Simulation mode now reflects this batched processing by showing
        details of the first generated batch and estimated total batches.

3.  `main` function:
    *   Removed pre-generation of the entire dataset or a capped sample.
    *   The call to `create_and_load_table` now passes parameters required
        for it to invoke and manage the data generator (total `num_rows`,
        `rng` object, and `DATA_GENERATION_BATCH_SIZE`).
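
Schematically, the generator/consumer split looks like this. This is a minimal sketch with a hypothetical helper name and made-up columns; the real generator builds every column in the schema per batch:

```python
from typing import Iterator

import numpy as np


def generate_rows_in_batches(
    num_rows: int, batch_size: int, rng: np.random.Generator
) -> Iterator[list[dict]]:
    """Yield lists of row dicts, at most batch_size rows per yield."""
    for start in range(0, num_rows, batch_size):
        current = min(batch_size, num_rows - start)
        ids = rng.integers(0, 1_000_000, size=current)
        values = rng.random(size=current)
        yield [{"id": int(i), "value": float(v)} for i, v in zip(ids, values)]


# Consumer side: stream each batch into BigQuery instead of materializing the table.
# for batch in generate_rows_in_batches(10_000_000, 10_000, np.random.default_rng()):
#     errors = client.insert_rows_json(table_ref, batch)
```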
Refactored the `generate_random_data` function to use
`numpy.datetime_as_string` for converting `numpy.datetime64` arrays
to ISO-formatted strings for DATETIME and TIMESTAMP columns.

- For DATETIME:
    - Python `datetime.datetime` objects are created in a list first
      (to ensure date component validity) then converted to
      `numpy.datetime64[us]`.
    - `numpy.datetime_as_string` is used, and the output 'T' separator
      is replaced with a space.
- For TIMESTAMP:
    - `numpy.datetime64[us]` arrays are constructed directly from epoch
      seconds and microsecond offsets.
    - `numpy.datetime_as_string` is used with `timezone='UTC'` to
      produce a 'Z'-suffixed UTC string.

This change improves performance and code clarity for generating these
timestamp string formats.
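
The two string conversions boil down to the following (illustrative values; the full diff appears in the review thread below):

```python
import numpy as np

np_dt = np.array(["2024-03-15T01:02:03.000004"], dtype="datetime64[us]")

# DATETIME: swap the 'T' separator for the space BigQuery's DATETIME literal uses.
datetime_str = np.datetime_as_string(np_dt, unit="us")[0].replace("T", " ")

# TIMESTAMP: timezone="UTC" appends 'Z', e.g. '2024-03-15T01:02:03.000004Z'.
timestamp_str = np.datetime_as_string(np_dt, unit="us", timezone="UTC")[0]
```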
Implemented command-line arguments for specifying Google Cloud Project ID
and BigQuery Dataset ID, replacing hardcoded global constants.

Changes:
- Imported `argparse` module.
- Added optional `--project_id` (-p) and `--dataset_id` (-d) arguments
  to `main()`.
- If `project_id` or `dataset_id` are not provided, the script defaults
  to simulation mode.
- `create_and_load_table` now checks for the presence of both IDs to
  determine if it should attempt actual BigQuery operations or run in
  simulation.
- Error handling in `create_and_load_table` for BQ operations was
  adjusted to log errors per table and continue processing remaining
  tables, rather than halting the script.
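
A sketch of the resulting CLI surface (flag names taken from the notes above; help text and the wrapper function are illustrative):

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(
        description="Create _read_gbq_colab BigQuery benchmark tables."
    )
    parser.add_argument("--project_id", "-p", help="GCP project ID; omit to run in simulation mode.")
    parser.add_argument("--dataset_id", "-d", help="BigQuery dataset ID; omit to run in simulation mode.")
    return parser.parse_args(argv)


args = parse_args(["--project_id", "my-project", "--dataset_id", "my_dataset"])
simulate = not (args.project_id and args.dataset_id)  # both IDs are required for real loads
```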
Added unit tests for `get_bq_schema` and `generate_random_data`
functions in `create_read_gbq_colab_benchmark_tables.py`.

- Created `scripts/create_read_gbq_colab_benchmark_tables_test.py`.
- Implemented pytest-style tests covering various scenarios:
    - For `get_bq_schema`:
        - Zero and small target byte sizes.
        - Exact fits with fixed-size types.
        - Inclusion and expansion of flexible types.
        - Generation of all fixed types where possible.
        - Uniqueness of column names.
        - Helper function `_calculate_row_size` used for validation.
    - For `generate_random_data`:
        - Zero rows case.
        - Basic schema and batching logic (single batch, multiple full
          batches, partial last batches).
        - Generation of all supported data types, checking Python types,
          string formats (using regex and `fromisoformat`),
          lengths for string/bytes, and JSON validity.
- Added `pytest` and `pandas` (for pytest compatibility in the current project environment) to `scripts/requirements-create_tables.txt`.
- All tests pass.
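
For flavor, one of the schema tests might look roughly like this. It assumes schema entries are (name, type, length) tuples, as the assertions later in the review thread suggest, and that the script directory is importable; the actual tests live in `scripts/create_read_gbq_colab_benchmark_tables_test.py`:

```python
import pytest

from create_read_gbq_colab_benchmark_tables import get_bq_schema


@pytest.mark.parametrize("target_bytes", [1, 8, 64])
def test_get_bq_schema_column_names_are_unique(target_bytes):
    schema = get_bq_schema(target_bytes)

    names = [col[0] for col in schema]
    assert len(names) == len(set(names))
```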
@tswast removed the "do not merge" (Indicates a pull request not ready for merge, due to either quality or timing) label on Jun 24, 2025
tswast (Collaborator, Author) commented on Jun 24, 2025

Testing manually with

python scripts/create_read_gbq_colab_benchmark_tables.py --project_id=swast-scratch --dataset_id=read_gbq_colab_benchmark

So far, so good.

@tswast marked this pull request as ready for review on June 24, 2025 19:10
@tswast requested a review from sycai on June 24, 2025 19:10
@tswast assigned sycai and unassigned TrevorBergeron on Jun 24, 2025
Comment on lines +218 to +262
years = rng.integers(1, 10000, size=current_batch_size)
months = rng.integers(1, 13, size=current_batch_size)
days_val = rng.integers(
    1, 29, size=current_batch_size
)  # Simplified day generation
hours = rng.integers(0, 24, size=current_batch_size)
minutes = rng.integers(0, 60, size=current_batch_size)
seconds = rng.integers(0, 60, size=current_batch_size)
microseconds = rng.integers(0, 1000000, size=current_batch_size)

# Construct Python datetime objects then convert to numpy.datetime64 for string conversion
py_datetimes = []
for i in range(current_batch_size):
    try:
        py_datetimes.append(
            datetime.datetime(
                years[i],
                months[i],
                days_val[i],
                hours[i],
                minutes[i],
                seconds[i],
                microseconds[i],
            )
        )
    except ValueError:  # Fallback for invalid date component combinations
        py_datetimes.append(
            datetime.datetime(
                2000,
                1,
                1,
                hours[i],
                minutes[i],
                seconds[i],
                microseconds[i],
            )
        )

np_datetimes = np.array(py_datetimes, dtype="datetime64[us]")
# np.datetime_as_string produces 'YYYY-MM-DDTHH:MM:SS.ffffff'
# BQ DATETIME typically uses a space separator: 'YYYY-MM-DD HH:MM:SS.ffffff'
datetime_strings = np.datetime_as_string(np_datetimes, unit="us")
columns_data_batch[col_name] = np.array(
    [s.replace("T", " ") for s in datetime_strings]
)
Contributor

style nit: This chunk of logic seems complicated enough to merit its own helper function. That would also make the enclosing function more readable.
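
One possible shape for that helper (illustrative only; the name is not from the PR, and the invalid-date fallback is dropped because days are capped at 28, so every year/month combination is valid):

```python
import datetime

import numpy as np


def _random_datetime_strings(rng: np.random.Generator, batch_size: int) -> np.ndarray:
    """Generate 'YYYY-MM-DD HH:MM:SS.ffffff' strings for a BQ DATETIME column."""
    years = rng.integers(1, 10000, size=batch_size)
    months = rng.integers(1, 13, size=batch_size)
    days = rng.integers(1, 29, size=batch_size)  # capped at 28 so every month is valid
    hours = rng.integers(0, 24, size=batch_size)
    minutes = rng.integers(0, 60, size=batch_size)
    seconds = rng.integers(0, 60, size=batch_size)
    micros = rng.integers(0, 1_000_000, size=batch_size)

    py_datetimes = [
        datetime.datetime(y, mo, d, h, mi, s, us)
        for y, mo, d, h, mi, s, us in zip(years, months, days, hours, minutes, seconds, micros)
    ]
    strings = np.datetime_as_string(np.array(py_datetimes, dtype="datetime64[us]"), unit="us")
    return np.char.replace(strings, "T", " ")  # BQ DATETIME uses a space separator
```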

Comment on lines +265 to +301
# Generate seconds from a broad range (e.g., year 1 to 9999)
# Note: Python's datetime.timestamp() might be limited by system's C mktime.
# For broader range with np.datetime64, it's usually fine.
# Let's generate epoch seconds relative to Unix epoch for np.datetime64 compatibility
min_epoch_seconds = int(
    datetime.datetime(
        1, 1, 1, 0, 0, 0, tzinfo=datetime.timezone.utc
    ).timestamp()
)
# Max for datetime64[s] is far out, but let's bound it reasonably for BQ.
max_epoch_seconds = int(
    datetime.datetime(
        9999, 12, 28, 23, 59, 59, tzinfo=datetime.timezone.utc
    ).timestamp()
)

epoch_seconds = rng.integers(
    min_epoch_seconds,
    max_epoch_seconds + 1,
    size=current_batch_size,
    dtype=np.int64,
)
microseconds_offset = rng.integers(
    0, 1000000, size=current_batch_size, dtype=np.int64
)

# Create datetime64[s] from epoch seconds and add microseconds as timedelta64[us]
np_timestamps_s = epoch_seconds.astype("datetime64[s]")
np_microseconds_td = microseconds_offset.astype("timedelta64[us]")
np_timestamps_us = np_timestamps_s + np_microseconds_td

# Convert to string with UTC timezone indicator
# np.datetime_as_string with timezone='UTC' produces 'YYYY-MM-DDTHH:MM:SS.ffffffZ'
# BigQuery generally accepts this for TIMESTAMP.
columns_data_batch[col_name] = np.datetime_as_string(
    np_timestamps_us, unit="us", timezone="UTC"
)
Contributor

Perhaps a helper function too?
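
Likewise, a possible helper for the block above (illustrative name; the logic follows the diff):

```python
import datetime

import numpy as np


def _random_timestamp_strings(rng: np.random.Generator, batch_size: int) -> np.ndarray:
    """Generate 'Z'-suffixed UTC strings for a BQ TIMESTAMP column."""
    min_s = int(datetime.datetime(1, 1, 1, tzinfo=datetime.timezone.utc).timestamp())
    max_s = int(datetime.datetime(9999, 12, 28, 23, 59, 59, tzinfo=datetime.timezone.utc).timestamp())

    epoch_seconds = rng.integers(min_s, max_s + 1, size=batch_size, dtype=np.int64)
    micros = rng.integers(0, 1_000_000, size=batch_size, dtype=np.int64)

    # datetime64[s] from epoch seconds, plus microseconds as timedelta64[us]
    np_timestamps = epoch_seconds.astype("datetime64[s]") + micros.astype("timedelta64[us]")
    return np.datetime_as_string(np_timestamps, unit="us", timezone="UTC")
```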


def test_get_bq_schema_one_byte():
    schema = get_bq_schema(1)
    assert len(schema) == 1
Contributor

super nit: it's recommended to place an empty line before the assertions in each test function to demarcate the arrange/act/assert blocks. It makes it clearer what action is being tested.

go/unit-testing-practices?polyglot=python#structure.
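
Applied to the test above, that amounts to:

```python
def test_get_bq_schema_one_byte():
    schema = get_bq_schema(1)

    assert len(schema) == 1
```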

Comment on lines +84 to +92
string_cols = [s for s in schema if s[1] == "STRING"]
bytes_cols = [s for s in schema if s[1] == "BYTES"]
json_cols = [s for s in schema if s[1] == "JSON"]
assert len(string_cols) == 1
assert len(bytes_cols) == 1
assert len(json_cols) == 1
assert string_cols[0][2] == 0
assert bytes_cols[0][2] == 0
assert json_cols[0][2] == 1
Contributor

nit: maybe we can re-organize the code into this:

string_cols = ...
assert len(string_cols) == 1
assert string_cols[0][2] == 0

bytes_cols = ...
assert len(bytes_cols)...
...

This is from Java best practice: go/java-style#s4.8.2.2-variables-limited-scope but I often find it useful for other languages too.
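
Spelled out against the assertions in the snippet above, the suggested regrouping would read:

```python
string_cols = [s for s in schema if s[1] == "STRING"]
assert len(string_cols) == 1
assert string_cols[0][2] == 0

bytes_cols = [s for s in schema if s[1] == "BYTES"]
assert len(bytes_cols) == 1
assert bytes_cols[0][2] == 0

json_cols = [s for s in schema if s[1] == "JSON"]
assert len(json_cols) == 1
assert json_cols[0][2] == 1
```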
