chore: add script to generate _read_gbq_colab BigQuery benchmark tables #1846
base: main
Conversation
This script creates 10 BigQuery tables with varying schemas and data volumes based on predefined statistics. Key features:
- Dynamically generates table schemas to match target average row sizes, maximizing data type diversity.
- Generates random data for each table, respecting BigQuery data types.
- Includes placeholders for GCP project and dataset IDs.
- Handles very large table data generation by capping row counts for in-memory processing and printing warnings (actual BQ loads for huge tables would require GCS load jobs).
- Adds a specific requirements file for this script: `scripts/requirements-create_tables.txt`.
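As an illustration of the schema-sizing idea described above, here is a minimal, hypothetical sketch of picking column types to hit a target average row size. The type-size table, function name, and the `(column_name, type, extra_length)` triple format are assumptions for illustration only, not the script's actual `get_bq_schema` implementation.

```python
# Hypothetical fixed BigQuery type sizes in bytes; illustrative only.
FIXED_TYPE_SIZES = {
    "BOOL": 1,
    "INT64": 8,
    "FLOAT64": 8,
    "DATE": 8,
    "DATETIME": 8,
    "TIMESTAMP": 8,
}


def sketch_schema_for_row_size(target_bytes: int) -> list[tuple[str, str, int]]:
    """Return (column_name, type, extra_length) triples roughly filling target_bytes."""
    schema = []
    remaining = target_bytes
    # Greedily add one column of each fixed-size type while it still fits,
    # maximizing type diversity.
    for idx, (type_name, size) in enumerate(FIXED_TYPE_SIZES.items()):
        if remaining >= size:
            schema.append((f"col_{idx}_{type_name.lower()}", type_name, 0))
            remaining -= size
    # Spend any leftover bytes on a variable-length STRING column.
    if remaining > 0:
        schema.append((f"col_{len(schema)}_string", "STRING", remaining))
    return schema
```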
Vectorized the `generate_random_data` function in `scripts/create_read_gbq_colab_benchmark_tables.py`. Changes include:
- Using NumPy's vectorized operations (`size` parameter in random functions, `np.vectorize`) to generate arrays of random values for most data types at once.
- Employing list comprehensions for transformations on these arrays (e.g., formatting dates, generating strings from character arrays).
- Retaining loops for types where full vectorization is overly complex or offers little benefit (e.g., precise byte-length JSON strings, BYTES generation via `rng.bytes`).
- Assembling the final list of row dictionaries from the generated columnar data.

This should improve performance for data generation, especially for tables with a large number of rows.
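For context, a minimal sketch of the vectorized pattern described above, using NumPy's `Generator` API. The column names, value bounds, and string length are illustrative assumptions, not the script's actual values.

```python
import numpy as np

rng = np.random.default_rng(42)
batch_size = 1_000

# Fixed-size types come straight out of vectorized generators.
int_col = rng.integers(-(2**31), 2**31, size=batch_size, dtype=np.int64)
float_col = rng.random(size=batch_size)

# Variable-length strings: draw a character matrix once, then join each row
# with a list comprehension.
alphabet = np.array(list("abcdefghijklmnopqrstuvwxyz0123456789"))
chars = rng.choice(alphabet, size=(batch_size, 16))
string_col = ["".join(row) for row in chars]
```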
Refactored the script to process data in batches, significantly improving memory efficiency for large tables. Changes include:
1. `generate_random_data` function:
   - Modified to be a generator, yielding data in chunks of a specified `batch_size`.
   - The core vectorized logic for creating column data within each batch is retained.
2. `create_and_load_table` function:
   - Updated to consume data from the `generate_random_data` generator; it no longer accepts a full list of data rows.
   - For actual BigQuery loads, it iterates through generated batches and further sub-batches them (if necessary) for optimal `client.insert_rows_json` calls.
   - Simulation mode now reflects this batched processing by showing details of the first generated batch and the estimated total number of batches.
3. `main` function:
   - Removed pre-generation of the entire dataset or a capped sample.
   - The call to `create_and_load_table` now passes the parameters required for it to invoke and manage the data generator (total `num_rows`, the `rng` object, and `DATA_GENERATION_BATCH_SIZE`).
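A rough sketch of the batched-generator pattern described above. The function signatures and the all-integer column data are simplifying assumptions; only `client.insert_rows_json` is a real `google-cloud-bigquery` call, and the PR's actual functions may differ.

```python
from typing import Iterator

import numpy as np


def generate_random_data(
    schema, num_rows: int, rng: np.random.Generator, batch_size: int
) -> Iterator[list[dict]]:
    """Yield batches of row dicts so the full dataset never lives in memory."""
    for start in range(0, num_rows, batch_size):
        current_batch_size = min(batch_size, num_rows - start)
        # Simplification: every column gets small integers; the real script
        # generates values per BigQuery type.
        columns = {
            name: rng.integers(0, 100, size=current_batch_size).tolist()
            for name, _type, _extra in schema
        }
        yield [
            {name: values[i] for name, values in columns.items()}
            for i in range(current_batch_size)
        ]


def load_batches(client, table_id: str, batches) -> None:
    """Stream generated batches into BigQuery via insert_rows_json."""
    for batch in batches:
        errors = client.insert_rows_json(table_id, batch)
        if errors:
            print(f"Errors while inserting into {table_id}: {errors}")
```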
Refactored the `generate_random_data` function to use `numpy.datetime_as_string` for converting `numpy.datetime64` arrays to ISO-formatted strings for DATETIME and TIMESTAMP columns.
- For DATETIME:
  - Python `datetime.datetime` objects are created in a list first (to ensure date component validity), then converted to `numpy.datetime64[us]`.
  - `numpy.datetime_as_string` is used, and the output 'T' separator is replaced with a space.
- For TIMESTAMP:
  - `numpy.datetime64[us]` arrays are constructed directly from epoch seconds and microsecond offsets.
  - `numpy.datetime_as_string` is used with `timezone='UTC'` to produce a 'Z'-suffixed UTC string.

This change improves performance and code clarity for generating these timestamp string formats.
Implemented command-line arguments for specifying the Google Cloud project ID and BigQuery dataset ID, replacing hardcoded global constants. Changes:
- Imported the `argparse` module.
- Added optional `--project_id` (`-p`) and `--dataset_id` (`-d`) arguments to `main()`.
- If `project_id` or `dataset_id` is not provided, the script defaults to simulation mode.
- `create_and_load_table` now checks for the presence of both IDs to determine whether it should attempt actual BigQuery operations or run in simulation mode.
- Error handling in `create_and_load_table` for BQ operations was adjusted to log errors per table and continue processing remaining tables, rather than halting the script.
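A hedged sketch of the argparse wiring described above; the flag names match the description, but the help text, defaults, and surrounding logic are assumptions.

```python
import argparse


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Create benchmark tables for _read_gbq_colab."
    )
    parser.add_argument("--project_id", "-p", default=None, help="GCP project ID.")
    parser.add_argument("--dataset_id", "-d", default=None, help="BigQuery dataset ID.")
    args = parser.parse_args()

    if not (args.project_id and args.dataset_id):
        # Without both IDs the script only simulates table creation.
        print("Project or dataset not provided; running in simulation mode.")
    # ... create_and_load_table(...) would be called per table spec here.


if __name__ == "__main__":
    main()
```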
Added unit tests for the `get_bq_schema` and `generate_random_data` functions in `create_read_gbq_colab_benchmark_tables.py`.
- Created `scripts/create_read_gbq_colab_benchmark_tables_test.py`.
- Implemented pytest-style tests covering various scenarios:
  - For `get_bq_schema`:
    - Zero and small target byte sizes.
    - Exact fits with fixed-size types.
    - Inclusion and expansion of flexible types.
    - Generation of all fixed types where possible.
    - Uniqueness of column names.
    - Helper function `_calculate_row_size` used for validation.
  - For `generate_random_data`:
    - Zero-rows case.
    - Basic schema and batching logic (single batch, multiple full batches, partial last batches).
    - Generation of all supported data types, checking Python types, string formats (using regex and `fromisoformat`), lengths for string/bytes, and JSON validity.
- Added `pytest` and `pandas` (for pytest compatibility in the current project environment) to `scripts/requirements-create_tables.txt`.
- All tests pass.
Testing manually with
So far, so good.
years = rng.integers(1, 10000, size=current_batch_size)
months = rng.integers(1, 13, size=current_batch_size)
days_val = rng.integers(
    1, 29, size=current_batch_size
)  # Simplified day generation
hours = rng.integers(0, 24, size=current_batch_size)
minutes = rng.integers(0, 60, size=current_batch_size)
seconds = rng.integers(0, 60, size=current_batch_size)
microseconds = rng.integers(0, 1000000, size=current_batch_size)

# Construct Python datetime objects then convert to numpy.datetime64 for string conversion
py_datetimes = []
for i in range(current_batch_size):
    try:
        py_datetimes.append(
            datetime.datetime(
                years[i],
                months[i],
                days_val[i],
                hours[i],
                minutes[i],
                seconds[i],
                microseconds[i],
            )
        )
    except ValueError:  # Fallback for invalid date component combinations
        py_datetimes.append(
            datetime.datetime(
                2000,
                1,
                1,
                hours[i],
                minutes[i],
                seconds[i],
                microseconds[i],
            )
        )

np_datetimes = np.array(py_datetimes, dtype="datetime64[us]")
# np.datetime_as_string produces 'YYYY-MM-DDTHH:MM:SS.ffffff'
# BQ DATETIME typically uses a space separator: 'YYYY-MM-DD HH:MM:SS.ffffff'
datetime_strings = np.datetime_as_string(np_datetimes, unit="us")
columns_data_batch[col_name] = np.array(
    [s.replace("T", " ") for s in datetime_strings]
)
style nit: It seems this chunk of logic is complicated enough to merit its own helper function. It may also make the entire hosting function more readable.
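One possible shape for the extracted helper the reviewer suggests, based on the diff above. The function name and signature are illustrative, and the original's `ValueError` fallback is omitted since days are capped at 28.

```python
import datetime

import numpy as np


def _generate_datetime_strings(
    rng: np.random.Generator, current_batch_size: int
) -> np.ndarray:
    """Return BQ DATETIME strings ('YYYY-MM-DD HH:MM:SS.ffffff') for one batch."""
    years = rng.integers(1, 10000, size=current_batch_size)
    months = rng.integers(1, 13, size=current_batch_size)
    days_val = rng.integers(1, 29, size=current_batch_size)  # always a valid day
    hours = rng.integers(0, 24, size=current_batch_size)
    minutes = rng.integers(0, 60, size=current_batch_size)
    seconds = rng.integers(0, 60, size=current_batch_size)
    microseconds = rng.integers(0, 1000000, size=current_batch_size)

    # Build Python datetimes first, then convert to datetime64[us] for fast
    # vectorized string formatting.
    py_datetimes = [
        datetime.datetime(
            years[i], months[i], days_val[i],
            hours[i], minutes[i], seconds[i], microseconds[i],
        )
        for i in range(current_batch_size)
    ]
    np_datetimes = np.array(py_datetimes, dtype="datetime64[us]")
    datetime_strings = np.datetime_as_string(np_datetimes, unit="us")
    # BigQuery DATETIME uses a space separator instead of 'T'.
    return np.array([s.replace("T", " ") for s in datetime_strings])
```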
# Generate seconds from a broad range (e.g., year 1 to 9999)
# Note: Python's datetime.timestamp() might be limited by system's C mktime.
# For broader range with np.datetime64, it's usually fine.
# Let's generate epoch seconds relative to Unix epoch for np.datetime64 compatibility
min_epoch_seconds = int(
    datetime.datetime(
        1, 1, 1, 0, 0, 0, tzinfo=datetime.timezone.utc
    ).timestamp()
)
# Max for datetime64[s] is far out, but let's bound it reasonably for BQ.
max_epoch_seconds = int(
    datetime.datetime(
        9999, 12, 28, 23, 59, 59, tzinfo=datetime.timezone.utc
    ).timestamp()
)

epoch_seconds = rng.integers(
    min_epoch_seconds,
    max_epoch_seconds + 1,
    size=current_batch_size,
    dtype=np.int64,
)
microseconds_offset = rng.integers(
    0, 1000000, size=current_batch_size, dtype=np.int64
)

# Create datetime64[s] from epoch seconds and add microseconds as timedelta64[us]
np_timestamps_s = epoch_seconds.astype("datetime64[s]")
np_microseconds_td = microseconds_offset.astype("timedelta64[us]")
np_timestamps_us = np_timestamps_s + np_microseconds_td

# Convert to string with UTC timezone indicator
# np.datetime_as_string with timezone='UTC' produces 'YYYY-MM-DDTHH:MM:SS.ffffffZ'
# BigQuery generally accepts this for TIMESTAMP.
columns_data_batch[col_name] = np.datetime_as_string(
    np_timestamps_us, unit="us", timezone="UTC"
)
Perhaps a helper function too?
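Similarly, a possible helper for the TIMESTAMP branch above; the function name, module-level constants, and signature are illustrative, not part of the PR.

```python
import datetime

import numpy as np

# Epoch-second bounds roughly covering years 1..9999, as in the diff above.
_MIN_EPOCH_SECONDS = int(
    datetime.datetime(1, 1, 1, tzinfo=datetime.timezone.utc).timestamp()
)
_MAX_EPOCH_SECONDS = int(
    datetime.datetime(
        9999, 12, 28, 23, 59, 59, tzinfo=datetime.timezone.utc
    ).timestamp()
)


def _generate_timestamp_strings(
    rng: np.random.Generator, current_batch_size: int
) -> np.ndarray:
    """Return 'Z'-suffixed UTC TIMESTAMP strings for one batch."""
    epoch_seconds = rng.integers(
        _MIN_EPOCH_SECONDS,
        _MAX_EPOCH_SECONDS + 1,
        size=current_batch_size,
        dtype=np.int64,
    )
    microseconds_offset = rng.integers(
        0, 1000000, size=current_batch_size, dtype=np.int64
    )
    # Combine whole seconds and microsecond offsets into datetime64[us].
    np_timestamps_us = epoch_seconds.astype("datetime64[s]") + microseconds_offset.astype(
        "timedelta64[us]"
    )
    return np.datetime_as_string(np_timestamps_us, unit="us", timezone="UTC")
```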
def test_get_bq_schema_one_byte():
    schema = get_bq_schema(1)
    assert len(schema) == 1
super nit: it's recommended to place an empty line before the assertions in each test function to demarcate arrange/act/assert blocks. It makes it more straightforward what action is being tested.
go/unit-testing-practices?polyglot=python#structure.
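Applied to the test above, the suggestion would look like this, with a blank line separating the act and assert blocks:

```python
def test_get_bq_schema_one_byte():
    schema = get_bq_schema(1)

    assert len(schema) == 1
```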
string_cols = [s for s in schema if s[1] == "STRING"]
bytes_cols = [s for s in schema if s[1] == "BYTES"]
json_cols = [s for s in schema if s[1] == "JSON"]
assert len(string_cols) == 1
assert len(bytes_cols) == 1
assert len(json_cols) == 1
assert string_cols[0][2] == 0
assert bytes_cols[0][2] == 0
assert json_cols[0][2] == 1
nit: maybe we can re-organize the code into this:
string_cols = ...
assert len(string_cols) == 1
assert string_cols[0][2] == 0
bytes_cols = ...
assert len(byte_cols)...
...
This is from Java best practice: go/java-style#s4.8.2.2-variables-limited-scope but I often find it useful for other languages too.
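Applied to the snippet above, the regrouped version would read:

```python
string_cols = [s for s in schema if s[1] == "STRING"]
assert len(string_cols) == 1
assert string_cols[0][2] == 0

bytes_cols = [s for s in schema if s[1] == "BYTES"]
assert len(bytes_cols) == 1
assert bytes_cols[0][2] == 0

json_cols = [s for s in schema if s[1] == "JSON"]
assert len(json_cols) == 1
assert json_cols[0][2] == 1
```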
Edit: Sorry, this is a draft. I'll ping you when it's ready for review.

Ready!
Towards internal issue b/420984164 🦕