Skip to content

Conversation

@cosmicBboy
Copy link
Collaborator

@cosmicBboy cosmicBboy commented Jun 25, 2025

hey @deepyaman, here's a potential implementation to handle dataframe-level checks to prevent pivoting failure cases to long-form data, as discussed here: #2041 (comment)

It just operates on the failure cases converted into a pandas dataframe.

A related question: are you concerned that the failure cases dataframe won't fit into memory? What is the risk of this?

@cosmicBboy cosmicBboy requested a review from deepyaman June 25, 2025 13:48
Copy link
Collaborator

@deepyaman deepyaman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Minor question:

    schema_context        column         check  check_number      failure_case  index
0  DataFrameSchema  failure_case  custom_check             0  {'a': 0, 'b': 1}      0
1  DataFrameSchema  failure_case  custom_check             0  {'a': 1, 'b': 0}      1
2  DataFrameSchema  failure_case  custom_check             0  {'a': 1, 'b': 1}      2
3  DataFrameSchema  failure_case      <lambda>             1  {'a': 0, 'b': 1}      0
4  DataFrameSchema  failure_case      <lambda>             1  {'a': 1, 'b': 0}      1
5  DataFrameSchema  failure_case      <lambda>             1  {'a': 1, 'b': 1}      2

What does index mean here? Is it basically like an index for the result, or do people expect to be able to use it to access data in the original dataframe? (Which wouldn't make sense for Ibis)

A related question: are you concerned that the failure cases dataframe won't fit into memory? What is the risk of this?

Do any of the other backends have some capability for limiting output here? It sounds useful even if for pandas or Polars, if you have 1000 failures cases it wouldn't show them all. If so, could .limit() the result from Ibis in the same place.

@cosmicBboy
Copy link
Collaborator Author

What does index mean here? Is it basically like an index for the result, or do people expect to be able to use it to access data in the original dataframe? (Which wouldn't make sense for Ibis)

Good call out! this is just the pandas index created from calling to_pandas. If it's confusing we can remove it

@cosmicBboy
Copy link
Collaborator Author

re: limiting failure cases, Checks already has an n_failure_cases parameter:

n_failure_cases: Optional[int] = None,
, which defaults to none. Maybe this can be configured at a global level or backend level so that we have some reasonable default (1000?)

@deepyaman
Copy link
Collaborator

What does index mean here? Is it basically like an index for the result, or do people expect to be able to use it to access data in the original dataframe? (Which wouldn't make sense for Ibis)

Good call out! this is just the pandas index created from calling to_pandas. If it's confusing we can remove it

I guess it makes sense to remove it.

@deepyaman
Copy link
Collaborator

re: limiting failure cases, Checks already has an n_failure_cases parameter:

n_failure_cases: Optional[int] = None,

, which defaults to none. Maybe this can be configured at a global level or backend level so that we have some reasonable default (1000?)

I think that would be fine. But probably more importantly, given the option is there, I don't see it implemented on Ibis (or Polars)? I'll have to check, but if it's not, I can try and put up a PR.

@cosmicBboy
Copy link
Collaborator Author

I don't see it implemented on Ibis (or Polars)

Yeah, now that I revisit the code, I think this only touches column-level checks in the pandas API. Lemme take a stab implementing them across the 3 backends, I'll tap you if I have any q's about ibis.

@cosmicBboy
Copy link
Collaborator Author

will merge this PR now.

@cosmicBboy cosmicBboy merged commit 7a81965 into main Jun 25, 2025
200 checks passed
@cosmicBboy cosmicBboy deleted the nielsb/ibis-failure-cases branch June 26, 2025 00:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants