Skip to content

Conversation

dtran-im
Copy link
Contributor

@dtran-im dtran-im commented May 29, 2025

Description of changes

I realized that my changes in #11233 and #11237 I completely focused on directly-created query jobs and completely overlooked functions which create bigquery load jobs and other queries, mainly in the process of uploading data. Granted, the directly-created query jobs comprise the bulk of what I am concerned with, but it seemed proper to include other types of queries in this feature.

However, this has proved a bit stickier than anticipated. The main situation I wasn't quite sure how to address was where one function call potentially creates several bigquery jobs, such as a job to truncate or drop a table if it already exists, a job to upload the data to a (temporary?) table in bq, and a job to select everything from that table and insert it into the target table. I chose to address this with more specificity rather than less, creating kwargs such as drop_job_id_prefix and insert_job_id_prefix, and have attempted to be consistent about applying that, but I could be convinced that this is overkill and we should just pass along one job_id_prefix for any jobs stemming from a certain ibis function call. (This would be ok for my purposes at least.)

I believe I have added the kwargs to all relevant methods consistently, but it is worth a close review. A notable exception is I did not add the kwargs to the _run_pre_execute_hooks call from create_view, since I assume that should not be creating a load job to upload data since it's creating a view, but that assumption could use verification.

It's notable that this required essentially copying the parent class' insert, register_in_memory_tables, and other methods in order to pass down a job_id_prefix variable.

In addition, since these methods do not return anything - there are no "results" from a drop table query that you would generally desire to return - I'm not immediately seeing a straightforward way to write unit tests for these. But I have yet to explore some options.

Issues closed

@github-actions github-actions bot added the bigquery The BigQuery backend label May 29, 2025
@dtran-im
Copy link
Contributor Author

@cpcloud Do you have any insights on how I might make tests for this? I fiddled a bit with patch & unittest.mock, but I haven't done much with those tools the past and any potential solution seemed exceedingly tortured.

@github-actions github-actions bot added the tests Issues or PRs related to tests label May 30, 2025
@dtran-im
Copy link
Contributor Author

@cpcloud I added one test that I think may work, but I realized it's in the /system/ tests which don't seem to get kicked off automatically. Would you be able to kick those off?

@cpcloud
Copy link
Member

cpcloud commented Jun 1, 2025

Thanks for the PR!

Is there a way we can centralize the API here?

I worry that every time there's a new need to customize some aspect of jobs, we'll have to modify N methods/functions' keyword arguments.

How are you producing the prefixes? Can we have some kind of callable that generates these that can be passed to ibis.bigquery.connect instead? Then, those callables would be internally passed around. That doesn't eliminate the problem of making sure to pass them wherever jobs are created, but it does central the customization of the IDs to a single call site.

@dtran-im
Copy link
Contributor Author

dtran-im commented Jun 2, 2025

Can we have some kind of callable that generates these that can be passed to ibis.bigquery.connect instead? Then, those callables would be internally passed around. That doesn't eliminate the problem of making sure to pass them wherever jobs are created, but it does central the customization of the IDs to a single call site.

Hmm, that's a good idea. In our current system we wrap the bq Client.query call with a function that reads the sql from a file, generates the job ID based on the sql filename and does some other stuff... I will need to think about this a bit though.

@dtran-im
Copy link
Contributor Author

dtran-im commented Jun 3, 2025

@cpcloud I've implemented the solution you suggested, though it feels a bit funny overwriting a class method inside a class method, if you think there's a better way of doing this let me know.

The way I implemented it, we lose the ability to specify a particular job_id_prefix for a certain query in a more identifying way ("insert_items_into_table_query_" etc.), but I'm the one who requested this feature and just being able to generate and log a random uuid before the job kicks off suits my purposes.

@dlstadther
Copy link
Contributor

@cpcloud Do you have additional feedback for this PR?

con3.client.load_table_from_file = load_table_from_file

orig_query = con3.client.query
con3.client._query_num_calls = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an issue with using mocker here? You can add it as a fixture input to the test function, e.g., test_read_csv_with_custom_load_job_prefix(con3, mocker):, and then use it like this:

query_spy = mocker.spy(con3.client, "query")
# do stuff that is supposed to invoke the `query` method
query_spy.assert_called_once() # or whatever assertion method suits your testing use case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try that - I haven't used mocking much in unit tests before and wasn't quite sure how to achieve this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I attempted to implement the mocker.spy and added a similar test for the insert method. Let me know if there's anything more to do here.

@cpcloud
Copy link
Member

cpcloud commented Jun 10, 2025

LGTM, I'll kick off the test suite now.

@cpcloud cpcloud added the ci-run-cloud Run BigQuery, Snowflake, Databricks, and Athena backend tests label Jun 10, 2025
@ibis-docs-bot ibis-docs-bot bot removed the ci-run-cloud Run BigQuery, Snowflake, Databricks, and Athena backend tests label Jun 10, 2025
@dtran-im
Copy link
Contributor Author

dtran-im commented Jun 10, 2025

Ok, so I see a timeout, a memtable cleanup error for exasol, "feature not exposed in athena" errors, some 500 server errors, and some unknown operational errors among the failures. I don't think these are relevant to my PR.

For the BQ failures though, I see a lot of tests failing due to ValueError: Length mismatch: Expected axis has 1 elements, new values have 0 elements. TL;DR if my changes have broken things, I can't figure out how they possibly did.

This seems to stem from this statement in the execute function: df.columns = schema.names
Which the schema with zero elements comes from earlier in the execute function:

table_expr = expr.as_table()
schema = table_expr.schema() - ibis.schema({"_TABLE_SUFFIX": "string"})

I confirmed that this is the culprit, by imitating the test code and what execute runs, where dtran.tmp is a table that already exists in bq and contains data in the project specified:

>>> con = bigquery.connect(project_id="****")
>>> table = "dtran.tmp"
>>> limit = "LIMIT 10"
>>> expr = con.sql(f"SELECT * FROM {table} {limit}")
>>> expr.as_table()
SQLQueryResult
  query:
    SELECT * FROM dana_tran.tmp LIMIT 10
  schema:
    <empty schema>
>>> expr.as_table().schema()
ibis.Schema {
}

That's about as far as I've been able to get with my troubleshooting - I don't see any recent changes to .as_table() or .schema(). The bigquery backend method get_schema seems to work as intended:

>>> con.get_schema("tmp", database="dtran")
ibis.Schema {
  column_1  int64
  column_2  string
}

Is there a way to verify whether my changes are the culprit here? (Do these tests pass on main currently?) Thanks.

@@ -670,16 +749,14 @@ def _make_session(self) -> tuple[str, str]:
return None

def _get_schema_using_query(self, query: str) -> sch.Schema:
job = self.client.query(
job = self._client_query(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is here. _get_schema_using_query needs access to a bq.QueryJob, but _client_query always returns a RowIterator.

I'll fix this up!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh thank you, I missed that!

@cpcloud cpcloud force-pushed the bigquery-custom-job-id-prefix-for-loadjobs branch 2 times, most recently from 481e8c6 to 7c87faf Compare June 11, 2025 11:11
@cpcloud
Copy link
Member

cpcloud commented Jun 11, 2025

I will run the BigQuery tests locally to avoid thrashing CI, and I will post the results here.

@cpcloud cpcloud force-pushed the bigquery-custom-job-id-prefix-for-loadjobs branch from 7c87faf to ca548f5 Compare June 11, 2025 14:01
@cpcloud
Copy link
Member

cpcloud commented Jun 11, 2025

Fixed up a couple issues in the tests:

  1. I don't think it was necessary to track the call count difference, since we're just testing that the most recent call (which is what call_args tracks) contains the job_id_prefix that we set.
  2. The use of mktempd was incorrect. When mktempd is used as a context manager, the temporary directory is removed when the context manager exits, so you can't use any files created in that directory after the context manager exits, because the files won't be there. The solution is to use the tmpdir fixture, which ships with a standard pytest install. This tmpdir is bound to the lifetime of the test run, i.e., when the test exits for any reason, tmpdir is removed. tmpdir is never removed before the test finishes, so it's safe to rely on its contents existing during the test's execution.

@cpcloud
Copy link
Member

cpcloud commented Jun 11, 2025

This is now passing, so I'll merge. Thanks for your work @dtran-im!

cloud in 🌐 falcon in …/ibis on  bigquery-custom-job-id-prefix-for-loadjobs is 📦 v10.5.0 via 🐍 v3.13.3 via ❄️  impure (ibis-3.13-env) took 4m59s
❯ pytest -m bigquery -n auto --dist loadgroup --snapshot-update -q
bringing up nodes...
x....x.x.............x.......x..x..x..................x.......xxx.........x..x.....x...........x....x........x..x................x.....sssssssssssssssssssssssssssssssssssssssssssssssssssssss [  8%]
ssssssssssssssssssssssssssssssssssssssssssss....xx........x....sssssssssssssssssssss....x.x.xx................x..............xxx....................x.x.....x.x..............x..x............. [ 16%]
.........................x.x...x.....x.x....x...xx.....x.....x..x.......x.................x.x.........x.....X.xx.x.x....x......xX..x..x......X.xx.x.x....xxx.......X............x...x...xx..x. [ 24%]
.x.....x.x........xx...x...x....x....x.............x..x.....x.x................x..x...x.x....x...xx..xx.xx....x.........xxxx..xx...x..x.....xxxx...x...x.............xx..x...........xx....x.. [ 32%]
...xx..x............x..x.xx..x............x.x.xx...sxxx.......x.x...xx..................xxxx.....x............xx.....s..................x.....................s.......x..........s............ [ 41%]
.....................................x....xx...x.....x.............................s....x...x...............xxx.xx....xxx...x..x..sx.....xxx.xxxxxxxxxx.sxxxxxxxx.x.xxxxxxxxxxxxxxxxxxxxxxxxxx [ 49%]
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx..xxx..x.x.xx.x.xxxxxxxxx.x.....xxxx..xxxxxxxxxxxx...s.xx.x...x..x....xx.......xx.........................xxxxx...................x................. [ 57%]
.........................x............................................x...............x........x........x......x..........................x.....x...x..x..x..x...x......x..x...x..........x... [ 65%]
.........x.......xx.x.x.x...x...xx.x...x.....x.......x....x..xx.....x..x.x..x......xxx...x.......x..x...xx..xxxx.........................................................x.................... [ 74%]
..x....x........................................x..........................x.................................................................................................................. [ 82%]
..........................................................s.......s............s...............s.............................................................................................. [ 90%]
.......................................................................................................x.........x.................x.......................................................... [ 98%]
..........................                                                                                                                                                                     [100%]
1812 passed, 132 skipped, 358 xfailed, 4 xpassed in 311.66s (0:05:11)

@cpcloud cpcloud merged commit 4006d68 into ibis-project:main Jun 11, 2025
108 of 110 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bigquery The BigQuery backend tests Issues or PRs related to tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat: Bigquery - custom job IDs
3 participants