feat(bigquery): add job_id_prefix to functions that produce bigquery loadjobs #11265

dtran-im · 2025-05-29T18:41:31Z

Description of changes

I realized that in my changes in #11233 and #11237 I focused entirely on directly created query jobs and overlooked functions that create BigQuery load jobs and other queries, mainly in the process of uploading data. Granted, the directly created query jobs make up the bulk of what I am concerned with, but it seemed proper to include other types of queries in this feature.

However, this has proved a bit stickier than anticipated. The main situation I wasn't quite sure how to address was where one function call potentially creates several bigquery jobs, such as a job to truncate or drop a table if it already exists, a job to upload the data to a (temporary?) table in bq, and a job to select everything from that table and insert it into the target table. I chose to address this with more specificity rather than less, creating kwargs such as drop_job_id_prefix and insert_job_id_prefix, and have attempted to be consistent about applying that, but I could be convinced that this is overkill and we should just pass along one job_id_prefix for any jobs stemming from a certain ibis function call. (This would be ok for my purposes at least.)

I believe I have added the kwargs to all relevant methods consistently, but it is worth a close review. One notable exception: I did not add the kwargs to the _run_pre_execute_hooks call from create_view, since I assume that creating a view should not create a load job to upload data, but that assumption could use verification.

It's notable that this required essentially copying the parent class' insert, register_in_memory_tables, and other methods in order to pass down a job_id_prefix variable.

In addition, since these methods do not return anything - there are no "results" from a drop table query that you would generally desire to return - I'm not immediately seeing a straightforward way to write unit tests for these. But I have yet to explore some options.

Issues closed

Resolves feat: Bigquery - custom job IDs #11229

dtran-im · 2025-05-29T20:40:26Z

@cpcloud Do you have any insights on how I might write tests for this? I fiddled a bit with patch and unittest.mock, but I haven't done much with those tools in the past, and every potential solution seemed exceedingly tortured.

dtran-im · 2025-05-30T15:20:33Z

@cpcloud I added one test that I think may work, but I realized it's in the /system/ tests which don't seem to get kicked off automatically. Would you be able to kick those off?

cpcloud · 2025-06-01T15:47:27Z

Thanks for the PR!

Is there a way we can centralize the API here?

I worry that every time there's a new need to customize some aspect of jobs, we'll have to modify N methods/functions' keyword arguments.

How are you producing the prefixes? Can we have some kind of callable that generates these that can be passed to ibis.bigquery.connect instead? Then, those callables would be internally passed around. That doesn't eliminate the problem of making sure to pass them wherever jobs are created, but it does centralize the customization of the IDs to a single call site.
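
A minimal sketch of that idea, using a hypothetical stand-in class rather than the real ibis BigQuery backend. The names `FakeBackend`, `job_id_prefix_factory`, `_submit`, and `make_prefix` are all assumptions made for illustration; nothing here is the actual ibis API.

```python
import uuid
from typing import Callable, Optional


class FakeBackend:
    """Hypothetical stand-in for a backend; NOT the ibis BigQuery backend."""

    def __init__(self, job_id_prefix_factory: Optional[Callable[[], str]] = None):
        # The factory is supplied once, at connection time.
        self._job_id_prefix_factory = job_id_prefix_factory
        # Record (prefix, sql) pairs instead of talking to BigQuery.
        self.submitted = []

    def _submit(self, sql: str) -> None:
        # Every job-creating code path goes through this single helper.
        factory = self._job_id_prefix_factory
        prefix = factory() if factory is not None else None
        self.submitted.append((prefix, sql))

    def insert(self, table: str, staging: str) -> None:
        # One user-facing call may create several jobs.
        self._submit(f"TRUNCATE TABLE {table}")
        self._submit(f"INSERT INTO {table} SELECT * FROM {staging}")


def make_prefix() -> str:
    # Hypothetical generator: a short random prefix that could be logged.
    return f"ibis_{uuid.uuid4().hex[:8]}_"


con = FakeBackend(job_id_prefix_factory=make_prefix)
con.insert("target", "staging")
```

The method signatures stay unchanged no matter how the prefixes are generated; only the factory passed at connection time varies.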

dtran-im · 2025-06-02T13:14:30Z

> Can we have some kind of callable that generates these that can be passed to ibis.bigquery.connect instead? Then, those callables would be internally passed around. That doesn't eliminate the problem of making sure to pass them wherever jobs are created, but it does centralize the customization of the IDs to a single call site.

Hmm, that's a good idea. In our current system we wrap the bq Client.query call with a function that reads the sql from a file, generates the job ID based on the sql filename and does some other stuff... I will need to think about this a bit though.

dtran-im · 2025-06-03T21:16:59Z

@cpcloud I've implemented the solution you suggested, though it feels a bit funny overwriting a class method inside a class method. If you think there's a better way of doing this, let me know.

The way I implemented it, we lose the ability to specify a particular job_id_prefix for a certain query in a more identifying way ("insert_items_into_table_query_" etc.), but I'm the one who requested this feature and just being able to generate and log a random uuid before the job kicks off suits my purposes.

dlstadther · 2025-06-09T13:44:19Z

@cpcloud Do you have additional feedback for this PR?

cpcloud · 2025-06-10T14:49:04Z

ibis/backends/bigquery/tests/system/test_client.py

+    con3.client.load_table_from_file = load_table_from_file
+
+    orig_query = con3.client.query
+    con3.client._query_num_calls = 0


Is there an issue with using mocker here? You can add it as a fixture input to the test function, e.g., test_read_csv_with_custom_load_job_prefix(con3, mocker):, and then use it like this:

query_spy = mocker.spy(con3.client, "query")
# do stuff that is supposed to invoke the `query` method
query_spy.assert_called_once()  # or whatever assertion method suits your testing use case

I'll try that. I haven't used mocking much in unit tests before and wasn't quite sure how to achieve this.

I attempted to implement the mocker.spy and added a similar test for the insert method. Let me know if there's anything more to do here.
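
For readers without pytest-mock installed, a similar spy can be sketched with the standard library's `unittest.mock.patch.object` and `wraps=`. This is an illustration under assumptions, not the test from this PR: `FakeClient` and `run_insert` are hypothetical stand-ins for the BigQuery client and the code under test.

```python
from unittest.mock import patch


class FakeClient:
    """Hypothetical stand-in for a BigQuery client."""

    def query(self, sql, job_id_prefix=None):
        return f"{job_id_prefix or ''}job"


def run_insert(client, prefix):
    # Code under test: expected to forward the prefix to `client.query`.
    return client.query("INSERT INTO t SELECT 1", job_id_prefix=prefix)


client = FakeClient()

# wraps= keeps the real behavior while recording every call.
with patch.object(client, "query", wraps=client.query) as query_spy:
    result = run_insert(client, "my_prefix_")

# The spy keeps its call records after the patch is undone.
query_spy.assert_called_once()
assert query_spy.call_args.kwargs["job_id_prefix"] == "my_prefix_"
```

Because `call_args` holds the most recent call, asserting on it is enough when the test only cares about the last job submitted.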

cpcloud · 2025-06-10T18:32:48Z

LGTM, I'll kick off the test suite now.

dtran-im · 2025-06-10T20:56:36Z

Ok, so I see a timeout, a memtable cleanup error for exasol, "feature not exposed in athena" errors, some 500 server errors, and some unknown operational errors among the failures. I don't think these are relevant to my PR.

For the BQ failures though, I see a lot of tests failing due to ValueError: Length mismatch: Expected axis has 1 elements, new values have 0 elements. TL;DR if my changes have broken things, I can't figure out how they possibly did.

This seems to stem from this statement in the execute function: df.columns = schema.names
The zero-element schema comes from earlier in the execute function:

table_expr = expr.as_table()
schema = table_expr.schema() - ibis.schema({"_TABLE_SUFFIX": "string"})

I confirmed that this is the culprit, by imitating the test code and what execute runs, where dtran.tmp is a table that already exists in bq and contains data in the project specified:

>>> con = bigquery.connect(project_id="****")
>>> table = "dtran.tmp"
>>> limit = "LIMIT 10"
>>> expr = con.sql(f"SELECT * FROM {table} {limit}")
>>> expr.as_table()
SQLQueryResult
  query:
    SELECT * FROM dana_tran.tmp LIMIT 10
  schema:
    <empty schema>
>>> expr.as_table().schema()
ibis.Schema {
}

That's about as far as I've been able to get with my troubleshooting - I don't see any recent changes to .as_table() or .schema(). The bigquery backend method get_schema seems to work as intended:

>>> con.get_schema("tmp", database="dtran")
ibis.Schema {
  column_1  int64
  column_2  string
}

Is there a way to verify whether my changes are the culprit here? (Do these tests pass on main currently?) Thanks.

cpcloud · 2025-06-11T10:58:59Z

ibis/backends/bigquery/__init__.py

@@ -670,16 +749,14 @@ def _make_session(self) -> tuple[str, str]:
        return None

    def _get_schema_using_query(self, query: str) -> sch.Schema:
-        job = self.client.query(
+        job = self._client_query(


The issue is here. _get_schema_using_query needs access to a bq.QueryJob, but _client_query always returns a RowIterator.

I'll fix this up!

Ahh thank you, I missed that!
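
To make the mismatch concrete: in google-cloud-bigquery, `Client.query` returns a `QueryJob`, and calling `.result()` on that job returns a `RowIterator`. The sketch below uses simplified, hypothetical stand-in classes (they do not mirror the attributes of the real classes) to show why a wrapper that always returns rows cannot serve a caller that needs the job object.

```python
class FakeRowIterator:
    """Simplified stand-in for the rows produced by a finished job."""

    def __init__(self, rows):
        self._rows = list(rows)

    def __iter__(self):
        return iter(self._rows)


class FakeQueryJob:
    """Simplified stand-in for a submitted query job."""

    def __init__(self, rows):
        self._rows = rows

    def result(self):
        # Waiting on the job yields the rows, not the job itself.
        return FakeRowIterator(self._rows)


class FakeClient:
    def query(self, sql):
        return FakeQueryJob(rows=[(1, "a")])


def query_rows(client, sql):
    """Hypothetical wrapper that always returns rows."""
    return client.query(sql).result()


def query_job(client, sql):
    """Hypothetical wrapper that hands back the job object."""
    return client.query(sql)


client = FakeClient()
rows = query_rows(client, "SELECT 1")  # rows only; the job is out of reach
job = query_job(client, "SELECT 1")    # the caller can still inspect the job
```

A caller that needs the job has to go through something like `query_job`; once only the rows are returned, the job handle is lost.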

cpcloud · 2025-06-11T12:47:23Z

I will run the BigQuery tests locally to avoid thrashing CI, and I will post the results here.

cpcloud · 2025-06-11T14:08:07Z

Fixed up a couple issues in the tests:

- I don't think it was necessary to track the call count difference, since we're just testing that the most recent call (which is what call_args tracks) contains the job_id_prefix that we set.
- The use of mktempd was incorrect. When mktempd is used as a context manager, the temporary directory is removed when the context manager exits, so you can't use any files created in that directory after the context manager exits, because the files won't be there. The solution is to use the tmpdir fixture, which ships with a standard pytest install. This tmpdir is bound to the lifetime of the test run, i.e., when the test exits for any reason, tmpdir is removed. tmpdir is never removed before the test finishes, so it's safe to rely on its contents existing during the test's execution.
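
The lifetime problem can be reproduced with the standard library alone. This sketch assumes a `tempfile.TemporaryDirectory`-style context manager; the exact helper used in the original test is not shown in this thread, and the file name is hypothetical.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.csv")
    with open(path, "w") as f:
        f.write("a,b\n1,2\n")
    # Inside the context manager the file is present.
    exists_inside = os.path.exists(path)

# After the context manager exits, the directory and its files are gone,
# so anything that reads `path` from here on will fail.
exists_after = os.path.exists(path)
```

With a pytest-provided temporary directory fixture, the directory stays in place for the whole test body, so code that reads the file later in the test still finds it.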

cpcloud · 2025-06-11T14:11:04Z

This is now passing, so I'll merge. Thanks for your work @dtran-im!

cloud in 🌐 falcon in …/ibis on  bigquery-custom-job-id-prefix-for-loadjobs is 📦 v10.5.0 via 🐍 v3.13.3 via ❄️  impure (ibis-3.13-env) took 4m59s
❯ pytest -m bigquery -n auto --dist loadgroup --snapshot-update -q
bringing up nodes...
x....x.x.............x.......x..x..x..................x.......xxx.........x..x.....x...........x....x........x..x................x.....sssssssssssssssssssssssssssssssssssssssssssssssssssssss [  8%]
ssssssssssssssssssssssssssssssssssssssssssss....xx........x....sssssssssssssssssssss....x.x.xx................x..............xxx....................x.x.....x.x..............x..x............. [ 16%]
.........................x.x...x.....x.x....x...xx.....x.....x..x.......x.................x.x.........x.....X.xx.x.x....x......xX..x..x......X.xx.x.x....xxx.......X............x...x...xx..x. [ 24%]
.x.....x.x........xx...x...x....x....x.............x..x.....x.x................x..x...x.x....x...xx..xx.xx....x.........xxxx..xx...x..x.....xxxx...x...x.............xx..x...........xx....x.. [ 32%]
...xx..x............x..x.xx..x............x.x.xx...sxxx.......x.x...xx..................xxxx.....x............xx.....s..................x.....................s.......x..........s............ [ 41%]
.....................................x....xx...x.....x.............................s....x...x...............xxx.xx....xxx...x..x..sx.....xxx.xxxxxxxxxx.sxxxxxxxx.x.xxxxxxxxxxxxxxxxxxxxxxxxxx [ 49%]
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx..xxx..x.x.xx.x.xxxxxxxxx.x.....xxxx..xxxxxxxxxxxx...s.xx.x...x..x....xx.......xx.........................xxxxx...................x................. [ 57%]
.........................x............................................x...............x........x........x......x..........................x.....x...x..x..x..x...x......x..x...x..........x... [ 65%]
.........x.......xx.x.x.x...x...xx.x...x.....x.......x....x..xx.....x..x.x..x......xxx...x.......x..x...xx..xxxx.........................................................x.................... [ 74%]
..x....x........................................x..........................x.................................................................................................................. [ 82%]
..........................................................s.......s............s...............s.............................................................................................. [ 90%]
.......................................................................................................x.........x.................x.......................................................... [ 98%]
..........................                                                                                                                                                                     [100%]
1812 passed, 132 skipped, 358 xfailed, 4 xpassed in 311.66s (0:05:11)

github-actions bot added the bigquery The BigQuery backend label May 29, 2025

github-actions bot added the tests Issues or PRs related to tests label May 30, 2025

dlstadther reviewed Jun 3, 2025

cpcloud reviewed Jun 10, 2025

cpcloud reviewed Jun 10, 2025

cpcloud mentioned this pull request Jun 10, 2025

feat(bigquery): add QueryJobConfig properties to bigquery backend specified at query time #11255

Merged

cpcloud added the ci-run-cloud Run BigQuery, Snowflake, Databricks, and Athena backend tests label Jun 10, 2025

ibis-docs-bot bot removed the ci-run-cloud Run BigQuery, Snowflake, Databricks, and Athena backend tests label Jun 10, 2025

cpcloud reviewed Jun 11, 2025

cpcloud force-pushed the bigquery-custom-job-id-prefix-for-loadjobs branch 2 times, most recently from 481e8c6 to 7c87faf Compare June 11, 2025 11:11

dtran-im and others added 10 commits June 11, 2025 10:00

feat(bigquery): add job_id_prefix to functions that produce bigquery loadjobs

60196ac

feat(bigquery): add job_id_prefix to functions that produce other types of jobs

626c0b4

feat(bigquery): add test for job id prefixes

88d8d8e

feat(bigquery): adjust how job_id_prefixes are specified

5aa5e90

feat(bigquery): clean up formatting

c61b237

feat(bigquery): clean up formatting

27f25c0

feat(bigquery): respond to feedback

b066efe

feat(bigquery): modify read_csv test and add insert test

4b2c8c5

chore(bigquery): ensure job is accessible when calling `_get_schema_using_query`

5c618b8

test(bigquery): use builtin pytest tmpdir fixture to ensure directory lifetime is at least as long as the test

795acac

style: inline some kwargs munging

ca548f5

cpcloud force-pushed the bigquery-custom-job-id-prefix-for-loadjobs branch from 7c87faf to ca548f5 Compare June 11, 2025 14:01

cpcloud merged commit 4006d68 into ibis-project:main Jun 11, 2025
108 of 110 checks passed
