feat(bigquery): add job_id_prefix to functions that produce bigquery loadjobs #11265
Conversation
@cpcloud Do you have any insights on how I might make tests for this? I fiddled a bit with patch & unittest.mock, but I haven't done much with those tools in the past, and any potential solution seemed exceedingly tortured.
@cpcloud I added one test that I think may work, but I realized it's in the …
Thanks for the PR! Is there a way we can centralize the API here? I worry that every time there's a new need to customize some aspect of jobs, we'll have to modify N methods/functions' keyword arguments. How are you producing the prefixes? Can we have some kind of callable that generates these that can be passed to …
Hmm, that's a good idea. In our current system we wrap the bq …
@cpcloud I've implemented the solution you suggested, though it feels a bit funny overwriting a class method inside a class method; if you think there's a better way of doing this, let me know. The way I implemented it, we lose the ability to specify a particular job_id_prefix for a certain query in a more identifying way ("insert_items_into_table_query_" etc.), but I'm the one who requested this feature, and just being able to generate and log a random uuid before the job kicks off suits my purposes.
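A minimal sketch of the generate-and-log idea described above (the function and parameter names here are hypothetical illustrations, not the actual ibis API): a zero-argument callable produces a fresh, loggable prefix each time a job is about to start.

```python
import logging
import uuid


def make_job_id_prefix(tag: str = "ibis_job_") -> str:
    """Generate and log a unique job ID prefix before a job kicks off."""
    prefix = f"{tag}{uuid.uuid4().hex}_"
    logging.info("starting BigQuery job with prefix %s", prefix)
    return prefix


def submit_job(sql: str, job_id_prefix_fn=make_job_id_prefix) -> str:
    """Stand-in for a backend method that accepts a prefix-generating
    callable; a real backend would pass the prefix to client.query()."""
    prefix = job_id_prefix_fn()
    # client.query(sql, job_id_prefix=prefix) would go here
    return prefix
```

Because the callable is invoked once per job, every job gets a distinct, correlatable prefix without each method needing its own keyword argument.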
@cpcloud Do you have additional feedback for this PR?
con3.client.load_table_from_file = load_table_from_file
orig_query = con3.client.query
con3.client._query_num_calls = 0
Is there an issue with using mocker here? You can add it as a fixture input to the test function, e.g., test_read_csv_with_custom_load_job_prefix(con3, mocker), and then use it like this:

    query_spy = mocker.spy(con3.client, "query")
    # do stuff that is supposed to invoke the `query` method
    query_spy.assert_called_once()  # or whatever assertion method suits your testing use case
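For readers unfamiliar with it, `mocker.spy` (from pytest-mock) is roughly equivalent to replacing the attribute with a `MagicMock` that wraps the real method, so calls pass through while counts and arguments are recorded. A standalone sketch of that pattern, using a toy client in place of the real `con3.client`:

```python
from unittest import mock


class FakeClient:
    """Toy stand-in for the BigQuery client used in the test."""

    def query(self, sql):
        return f"result of {sql}"


client = FakeClient()

# Equivalent of `mocker.spy(client, "query")`: wrap the real bound method
# so the original behavior runs while the mock records the call.
query_spy = mock.MagicMock(wraps=client.query)
client.query = query_spy

result = client.query("SELECT 1")  # the real method still executes

query_spy.assert_called_once()
query_spy.assert_called_with("SELECT 1")
assert result == "result of SELECT 1"
```

The advantage over manually counting calls (as in the `_query_num_calls` snippet above) is that the spy tracks arguments and return values too, and no cleanup bookkeeping is needed when mocker handles the patching.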
I'll try that. I haven't used mocking much in unit tests before and wasn't quite sure how to achieve this.
I attempted to implement the mocker.spy approach and added a similar test for the insert method. Let me know if there's anything more to do here.
LGTM, I'll kick off the test suite now.
Ok, so I see a timeout, a memtable cleanup error for exasol, "feature not exposed in athena" errors, some 500 server errors, and some unknown operational errors among the failures. I don't think these are relevant to my PR. For the BQ failures, though, I see a lot of tests failing due to … This seems to stem from this statement in the …
I confirmed that this is the culprit by imitating the test code and what …
That's about as far as I've been able to get with my troubleshooting; I don't see any recent changes to …
Is there a way to verify whether my changes are the culprit here? (Do these tests pass on …?)
ibis/backends/bigquery/__init__.py

    @@ -670,16 +749,14 @@ def _make_session(self) -> tuple[str, str]:
             return None

         def _get_schema_using_query(self, query: str) -> sch.Schema:
    -        job = self.client.query(
    +        job = self._client_query(
The issue is here. _get_schema_using_query needs access to a bq.QueryJob, but _client_query always returns a RowIterator.
I'll fix this up!
Ahh thank you, I missed that!
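To make the mismatch above concrete with stand-in classes (the real types are google-cloud-bigquery's `QueryJob` and `RowIterator`; the stubs and the `_client_query`-style wrapper here are only illustrative):

```python
class FakeRowIterator:
    """Stand-in for bq RowIterator: just the rows, no job metadata."""

    def __init__(self, rows):
        self._rows = rows

    def __iter__(self):
        return iter(self._rows)


class FakeQueryJob:
    """Stand-in for bq.QueryJob: carries metadata such as the schema."""

    def __init__(self, rows, schema):
        self._rows = rows
        self.schema = schema

    def result(self):
        return FakeRowIterator(self._rows)


def client_query_returning_iterator(job):
    # A wrapper that eagerly calls .result() hands back only the rows;
    # callers that need job-level metadata (like the schema) lose it.
    return job.result()


job = FakeQueryJob(rows=[(1,)], schema=["x:INT64"])
rows = client_query_returning_iterator(job)
assert not hasattr(rows, "schema")  # metadata gone -- the bug in a nutshell
assert job.schema == ["x:INT64"]    # it only lives on the job object
```

This is why a schema-reading caller must receive the job itself (or the wrapper must return both), rather than the already-resolved row iterator.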
Force-pushed from 481e8c6 to 7c87faf
I will run the BigQuery tests locally to avoid thrashing CI, and I will post the results here.
…ry lifetime is at least as long as the test
Force-pushed from 7c87faf to ca548f5
Fixed up a couple of issues in the tests: …
This is now passing, so I'll merge. Thanks for your work @dtran-im!
Description of changes
In my changes in #11233 and #11237, I realized I focused entirely on directly-created query jobs and completely overlooked functions which create BigQuery load jobs and other queries, mainly in the process of uploading data. Granted, the directly-created query jobs comprise the bulk of what I am concerned with, but it seemed proper to include the other kinds of jobs in this feature.

However, this has proved a bit stickier than anticipated. The main situation I wasn't quite sure how to address was where one function call potentially creates several BigQuery jobs, such as a job to truncate or drop a table if it already exists, a job to upload the data to a (temporary?) table in BQ, and a job to select everything from that table and insert it into the target table. I chose to address this with more specificity rather than less, creating kwargs such as drop_job_id_prefix and insert_job_id_prefix, and have attempted to be consistent about applying that, but I could be convinced that this is overkill and we should just pass along one job_id_prefix for any jobs stemming from a certain ibis function call. (This would be ok for my purposes, at least.)

I believe I have added the kwargs to all relevant methods consistently, but it is worth a close review. A notable exception is that I did not add the kwargs to the _run_pre_execute_hooks call from create_view, since I assume that should not be creating a load job to upload data because it's creating a view, but that assumption could use verification.

It's notable that this required essentially copying the parent class' insert, register_in_memory_tables, and other methods in order to pass down a job_id_prefix variable.

In addition, since these methods do not return anything (there are no "results" from a drop-table query that you would generally want to return), I'm not immediately seeing a straightforward way to write unit tests for these. But I have yet to explore some options.
Issues closed