feat: add allow_large_results option #1428

Merged
merged 20 commits into from
Feb 28, 2025
Conversation

Genesis929
Collaborator

@Genesis929 Genesis929 commented Feb 25, 2025

Modified behavior:

  1. New global option: bigframes.options.bigquery.allow_large_results
  2. to_gbq keeps its existing way of creating the temp table.
  3. Any export to GCS (to_csv, to_json, to_parquet) always uses an explicit destination table.
  4. Other IO methods, as well as to_csv, to_json, and to_parquet when saving locally, use an explicit destination table when bigframes.options.bigquery.allow_large_results is True.
  5. The bigframes.options.bigquery.allow_large_results option is overridden by the allow_large_results argument passed to an individual method.
  6. If allow_large_results=True, we read the table size (the logical size, since we can't access the physical size) and warn the user that in bigframes 2.0 a result larger than 10 GB may cause issues and they will need to set allow_large_results=True.
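
The precedence described in items 3 to 5 can be sketched as follows. This is a minimal illustration, not the code from this PR: `resolve_allow_large_results` and `uses_explicit_destination` are hypothetical helper names, and the `global_option` argument stands in for the value of bigframes.options.bigquery.allow_large_results.

```python
from typing import Optional


def resolve_allow_large_results(local: Optional[bool], global_option: bool) -> bool:
    """Return the effective setting: a per-call value wins over the global option."""
    # None means the caller did not pass allow_large_results, so fall back
    # to the global option (hypothetical stand-in for the bigframes option).
    return global_option if local is None else local


def uses_explicit_destination(
    to_gcs: bool, local: Optional[bool], global_option: bool
) -> bool:
    """Decide whether a query writes to an explicit destination table."""
    # Exports to GCS always use a destination table (item 3).
    if to_gcs:
        return True
    # Everything else follows the resolved option (items 4 and 5).
    return resolve_allow_large_results(local, global_option)
```

to_gbq is deliberately left out of the sketch, since item 2 says its temp table handling does not change.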

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

@product-auto-label product-auto-label bot added size: m Pull request size is medium. api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. labels Feb 25, 2025
@product-auto-label product-auto-label bot added size: l Pull request size is large. and removed size: m Pull request size is medium. labels Feb 25, 2025
@@ -243,11 +244,13 @@ def execute(
*,
ordered: bool = True,
col_id_overrides: Mapping[str, str] = {},
- use_explicit_destination: bool = False,
+ use_explicit_destination: Optional[bool] = False,
Contributor

We should probably add a check that only one of ordered and use_explicit_destination is allowed at a time.

Collaborator Author

@Genesis929 Genesis929 Feb 26, 2025

I ran some tests, and this seems to conflict with the current to_pandas_batches logic, which sets both to True.

Collaborator Author

As discussed offline, the two may work together, so I'm keeping it as is for now.

@@ -333,11 +336,13 @@ def export_gcs(
uri: str,
format: Literal["json", "csv", "parquet"],
export_options: Mapping[str, Union[bool, str]],
allow_large_results: Optional[bool] = None,
Contributor

I think export jobs should just assume a large result. Can't remember why we don't just export as a single job anyways?

Collaborator Author

Updated; it now always uses a destination table.

@Genesis929 Genesis929 marked this pull request as ready for review February 27, 2025 00:52
@Genesis929 Genesis929 requested review from a team as code owners February 27, 2025 00:52
@Genesis929 Genesis929 added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Feb 27, 2025
@bigframes-bot bigframes-bot removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Feb 27, 2025
Collaborator

@tswast tswast left a comment

One nit (repeated throughout): I'd like to make sure we only expose allow_large_results as a keyword argument, not allowing positional access. That'll make sure we prevent breakages if folks are coming from pandas.


def to_latex(
    self,
    buf=None,
    columns: Sequence | None = None,
    header: bool | Sequence[str] = True,
    index: bool = True,
    allow_large_results=None,
    **kwargs,
) -> str | None:
Collaborator

Aside: I'm a bit surprised this works. I guess we are type checking on Python 3.10 where this syntax was added (https://peps.python.org/pep-0604/), not 3.9?
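
For context on the aside above: the `X | Y` union syntax from PEP 604 is evaluated when the function is defined, so on Python 3.9 it raises a TypeError unless annotation evaluation is deferred (for example with `from __future__ import annotations`). One spelling that works on 3.9 without that import is sketched below, using typing.Optional and typing.Union. The function name `to_latex_compat` and its stub body are hypothetical, and this is not necessarily how the PR resolved the comment.

```python
from typing import Optional, Sequence, Union, get_type_hints


def to_latex_compat(
    buf=None,
    columns: Optional[Sequence] = None,
    header: Union[bool, Sequence[str]] = True,
    index: bool = True,
    allow_large_results=None,
    **kwargs,
) -> Optional[str]:
    """Stub with the same parameters, annotated without the `X | Y` syntax."""
    return None


# The annotations resolve at runtime on Python 3.9 and later.
hints = get_type_hints(to_latex_compat)
```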

Collaborator Author

Updated.

columns=None,
header=True,
index=True,
allow_large_results=None,
Collaborator

Nit: we don't want people to access allow_large_results positionally.

Suggested change
- allow_large_results=None,
+ *,
+ allow_large_results=None,

Collaborator Author

Updated.

return self.to_pandas().to_list()
def tolist(
self,
allow_large_results: Optional[bool] = None,
Collaborator

Nit: we don't want people to access allow_large_results positionally.

Suggested change
- allow_large_results: Optional[bool] = None,
+ *,
+ allow_large_results: Optional[bool] = None,

Collaborator Author

Updated.

@@ -1809,14 +1841,17 @@ def to_markdown(
buf: typing.IO[str] | None = None,
mode: str = "wt",
index: bool = True,
allow_large_results=None,
Collaborator

Nit: we don't want people to access allow_large_results positionally.

Suggested change
- allow_large_results=None,
+ *,
+ allow_large_results=None,

Collaborator Author

Updated.

@@ -490,17 +490,28 @@ def __getitem__(self, key: int) -> typing.Any:
else:
raise NotImplementedError(f"Index key not supported {key}")

- def to_pandas(self) -> pandas.Index:
+ def to_pandas(self, allow_large_results: Optional[bool] = None) -> pandas.Index:
Collaborator

It's less necessary in this context, since we aren't trying to mimic pandas, but I'd still like to avoid using this parameter positionally.

Suggested change
- def to_pandas(self, allow_large_results: Optional[bool] = None) -> pandas.Index:
+ def to_pandas(self, *, allow_large_results: Optional[bool] = None) -> pandas.Index:

Collaborator Author

Updated.


- def to_numpy(self, dtype=None, **kwargs) -> np.ndarray:
-     return self.to_pandas().to_numpy(dtype, **kwargs)
+ def to_numpy(self, dtype=None, allow_large_results=None, **kwargs) -> np.ndarray:
Collaborator

In this case, we are trying to mimic pandas (https://pandas.pydata.org/pandas-docs/version/2.1.2/reference/api/pandas.Index.to_numpy.html), so it is very important to restrict use positionally.

Suggested change
- def to_numpy(self, dtype=None, allow_large_results=None, **kwargs) -> np.ndarray:
+ def to_numpy(self, dtype=None, *, allow_large_results=None, **kwargs) -> np.ndarray:

Otherwise, someone might have some pandas code that does index.to_numpy("int64", True) where in pandas that "True" means copy=True, but here it means something else.
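
The hazard described above can be reproduced with two stand-alone functions. This is an illustrative sketch only: `to_numpy_positional` and `to_numpy_keyword_only` are hypothetical stand-ins that return their arguments instead of converting data, and they are not the bigframes methods.

```python
def to_numpy_positional(dtype=None, allow_large_results=None, **kwargs):
    """Signature before the suggestion: the second positional slot is ours."""
    return {"dtype": dtype, "allow_large_results": allow_large_results, **kwargs}


def to_numpy_keyword_only(dtype=None, *, allow_large_results=None, **kwargs):
    """Signature after the suggestion: the bare `*` ends positional parameters."""
    return {"dtype": dtype, "allow_large_results": allow_large_results, **kwargs}


# pandas-style call where the caller meant copy=True: it is silently
# reinterpreted as allow_large_results=True.
silently_misread = to_numpy_positional("int64", True)

# With the keyword-only marker, the same call fails loudly instead.
try:
    to_numpy_keyword_only("int64", True)
    rejected = False
except TypeError:
    rejected = True

# Passing copy by keyword still flows through **kwargs untouched.
explicit = to_numpy_keyword_only("int64", copy=True)
```

Failing with a TypeError at the call site is the point of the suggestion: a ported pandas call is caught immediately rather than changing query behavior without notice.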

Collaborator Author

Updated.

@Genesis929 Genesis929 requested a review from tswast February 27, 2025 19:43
Collaborator

@tswast tswast left a comment

Thanks!

@tswast tswast merged commit dd2f488 into main Feb 28, 2025
22 of 23 checks passed
@tswast tswast deleted the query_size_option_huanc branch February 28, 2025 17:43
Labels
api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. size: l Pull request size is large.

5 participants