feat: add allow_large_results option #1428

Genesis929 · 2025-02-25T22:28:28Z

Modified behavior:

New global option: bigframes.options.bigquery.allow_large_results.
to_gbq keeps its existing way of creating the temp table.
Anything written to GCS (to_csv, to_json, to_parquet) always uses an explicit destination table.
Other IO methods, and to_csv, to_json, and to_parquet when saving locally, use an explicit destination table when bigframes.options.bigquery.allow_large_results is true.
The bigframes.options.bigquery.allow_large_results option is overridden by the local allow_large_results argument.
If allow_large_results=True, we read the table size (the logical size, since we can't access the physical size) and warn the user that in bigframes 2.0 a result larger than 10 GB may cause issues and that they will need to set allow_large_results=True.
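
The rules above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `BigQueryOptions`, `resolve_allow_large_results`, `needs_explicit_destination`, `warn_if_large`, the byte threshold, and the warning text are all hypothetical names and values chosen for the example.

```python
import warnings
from dataclasses import dataclass
from typing import Optional

# Assumed threshold for the "> 10g" warning; the exact unit is not stated in the PR.
LARGE_RESULT_BYTES = 10 * 1024**3


@dataclass
class BigQueryOptions:
    """Hypothetical stand-in for bigframes.options.bigquery."""

    allow_large_results: bool


def resolve_allow_large_results(
    options: BigQueryOptions, local: Optional[bool] = None
) -> bool:
    """The per-call argument wins when given; otherwise use the global option."""
    return options.allow_large_results if local is None else local


def needs_explicit_destination(
    options: BigQueryOptions,
    *,
    local: Optional[bool] = None,
    to_gcs: bool = False,
) -> bool:
    """Exports to GCS always use a destination table; other paths follow the option."""
    return True if to_gcs else resolve_allow_large_results(options, local)


def warn_if_large(logical_bytes: int) -> bool:
    """Warn when the table's logical size exceeds the assumed threshold."""
    if logical_bytes > LARGE_RESULT_BYTES:
        # The message wording and warning category are assumptions.
        warnings.warn("Result is larger than 10 GB; set allow_large_results=True.")
        return True
    return False
```

With `BigQueryOptions(allow_large_results=False)`, a call that passes `local=True` still resolves to `True`, and a GCS export resolves to `True` regardless of either setting.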

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
Ensure the tests and linter pass
Code coverage does not decrease (if any source code was changed)
Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

TrevorBergeron · 2025-02-26T00:19:16Z

bigframes/session/executor.py

@@ -243,11 +244,13 @@ def execute(
        *,
        ordered: bool = True,
        col_id_overrides: Mapping[str, str] = {},
-        use_explicit_destination: bool = False,
+        use_explicit_destination: Optional[bool] = False,


We should probably add a check that only one of ordered and use_explicit_destination is allowed at a time.

I ran some tests, and this seems to conflict with the current to_pandas_batches logic, which sets both to True.

As discussed offline, the two may work together, so I'm keeping it as is for now.

TrevorBergeron · 2025-02-26T00:20:14Z

bigframes/session/executor.py

@@ -333,11 +336,13 @@ def export_gcs(
        uri: str,
        format: Literal["json", "csv", "parquet"],
        export_options: Mapping[str, Union[bool, str]],
+        allow_large_results: Optional[bool] = None,


I think export jobs should just assume a large result. I can't remember why we don't just export as a single job anyway.

Updated; it now always uses a destination table.

tswast

One nit (repeated throughout): I'd like to make sure we only expose allow_large_results as a keyword argument, not allowing positional access. That'll make sure we prevent breakages if folks are coming from pandas.

tswast · 2025-02-27T17:22:22Z

bigframes/dataframe.py


    def to_latex(
        self,
        buf=None,
        columns: Sequence | None = None,
        header: bool | Sequence[str] = True,
        index: bool = True,
+        allow_large_results=None,
        **kwargs,
    ) -> str | None:


Aside: I'm a bit surprised this works. I guess we are type checking on Python 3.10 where this syntax was added (https://peps.python.org/pep-0604/), not 3.9?

tswast · 2025-02-27T17:23:54Z

bigframes/series.py

+        columns=None,
+        header=True,
+        index=True,
+        allow_large_results=None,


Nit: we don't want people to access allow_large_results positionally.

Suggested change

-        allow_large_results=None,
+        *,
+        allow_large_results=None,

tswast · 2025-02-27T17:24:20Z

bigframes/series.py

-        return self.to_pandas().to_list()
+    def tolist(
+        self,
+        allow_large_results: Optional[bool] = None,


Nit: we don't want people to access allow_large_results positionally.

Suggested change

-        allow_large_results: Optional[bool] = None,
+        *,
+        allow_large_results: Optional[bool] = None,

tswast · 2025-02-27T17:24:38Z

bigframes/series.py

@@ -1809,14 +1841,17 @@ def to_markdown(
        buf: typing.IO[str] | None = None,
        mode: str = "wt",
        index: bool = True,
+        allow_large_results=None,


Nit: we don't want people to access allow_large_results positionally.

Suggested change

-        allow_large_results=None,
+        *,
+        allow_large_results=None,

tswast · 2025-02-27T17:25:50Z

bigframes/core/indexes/base.py

@@ -490,17 +490,28 @@ def __getitem__(self, key: int) -> typing.Any:
        else:
            raise NotImplementedError(f"Index key not supported {key}")

-    def to_pandas(self) -> pandas.Index:
+    def to_pandas(self, allow_large_results: Optional[bool] = None) -> pandas.Index:


It's less necessary in this context, since we aren't trying to mimic pandas, but I'd still like to avoid using this parameter positionally.

Suggested change

-    def to_pandas(self, allow_large_results: Optional[bool] = None) -> pandas.Index:
+    def to_pandas(self, *, allow_large_results: Optional[bool] = None) -> pandas.Index:

tswast · 2025-02-27T17:27:46Z

bigframes/core/indexes/base.py


-    def to_numpy(self, dtype=None, **kwargs) -> np.ndarray:
-        return self.to_pandas().to_numpy(dtype, **kwargs)
+    def to_numpy(self, dtype=None, allow_large_results=None, **kwargs) -> np.ndarray:


In this case, we are trying to mimic pandas (https://pandas.pydata.org/pandas-docs/version/2.1.2/reference/api/pandas.Index.to_numpy.html), so it is very important to restrict use positionally.

Suggested change

-    def to_numpy(self, dtype=None, allow_large_results=None, **kwargs) -> np.ndarray:
+    def to_numpy(self, dtype=None, *, allow_large_results=None, **kwargs) -> np.ndarray:

Otherwise, someone might have some pandas code that does index.to_numpy("int64", True) where in pandas that "True" means copy=True, but here it means something else.
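
To make the hazard concrete, here is a small self-contained sketch. The two functions are hypothetical stand-ins that only echo their arguments; they are not the bigframes implementation.

```python
def to_numpy_positional(dtype=None, allow_large_results=None, **kwargs):
    """Signature before the suggestion: the second positional slot is taken."""
    return {"dtype": dtype, "allow_large_results": allow_large_results, **kwargs}


def to_numpy_keyword_only(dtype=None, *, allow_large_results=None, **kwargs):
    """Signature after the suggestion: the bare * makes the option keyword-only."""
    return {"dtype": dtype, "allow_large_results": allow_large_results, **kwargs}


# pandas-style call where the caller meant copy=True.
# It is silently accepted and reinterpreted as allow_large_results=True.
silently_accepted = to_numpy_positional("int64", True)

# The keyword-only signature rejects the same call instead of guessing.
error = None
try:
    to_numpy_keyword_only("int64", True)
except TypeError as exc:
    error = exc

# Passing the option by name still works.
explicit = to_numpy_keyword_only("int64", allow_large_results=True)
```

Raising a `TypeError` on the ambiguous call is the point of the change: code ported from pandas fails loudly rather than running with a different meaning.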

tswast

Thanks!

feat: add allow_large_results option

1f39765

product-auto-label bot added the size: m and api: bigquery labels Feb 25, 2025

Genesis929 added 2 commits February 25, 2025 22:37

add to_arrow

f0b632e

add the ones that only uses to_pandas()

9c1e9db

product-auto-label bot added the size: l label and removed the size: m label Feb 25, 2025

add to_csv/json/parquet

50abb66

TrevorBergeron reviewed Feb 26, 2025

mypy fix

6ba3d12

TrevorBergeron reviewed Feb 26, 2025

Genesis929 and others added 10 commits February 26, 2025 00:42

gcs logic update and execute logic update.

63a422c

Merge branch 'main' into query_size_option_huanc

d204c56

add to_pandas_batches and to_pandas large test.

8585ab8

add to_pandas_batches and to_pandas large test.

6bbeeea

add unit tests

c9e67b0

add to_pandas and to_arrow override test

1277613

add to_pandas and to_arrow override test

bb6c9bf

add copyright

8ced1b6

modify index to_pandas to match behavior of series and df, add tests.

475958c

update warning message

584d27d

Genesis929 requested review from tswast and GarrettWu February 27, 2025 00:52

Merge branch 'main' into query_size_option_huanc

e7977fd

Genesis929 marked this pull request as ready for review February 27, 2025 00:52

Genesis929 requested review from a team as code owners February 27, 2025 00:52

blunderbuss-gcf bot assigned jiaxunwu Feb 27, 2025

Genesis929 added 2 commits February 27, 2025 01:05

update warning message

df078c3

update warning message test

3c75a46

Genesis929 requested a review from TrevorBergeron February 27, 2025 01:07

Genesis929 added the kokoro:force-run label Feb 27, 2025

bigframes-bot removed the kokoro:force-run label Feb 27, 2025

tswast requested changes Feb 27, 2025

Genesis929 added 2 commits February 27, 2025 19:13

update parameters

1bc2eb1

test fix

b7d6592

Genesis929 requested a review from tswast February 27, 2025 19:43

tswast approved these changes Feb 28, 2025

tswast merged commit dd2f488 into main Feb 28, 2025
22 of 23 checks passed

tswast deleted the query_size_option_huanc branch February 28, 2025 17:43

release-please bot mentioned this pull request Feb 28, 2025

chore(main): release 1.39.0 #1421

Merged

