Skip to content

Conversation

jcrist
Copy link
Member

@jcrist jcrist commented Mar 19, 2024

This was motivated to work around pandas not repr'ing frozendict elements properly (seen in #8687), but while poking at that I found:

  • The existing code was unnecessarily nested (the actual dict was boxed in a MappingProxyType which was boxed in a FrozenDict - we can do better by just using storing the data in the FrozenDict itself).
  • There was a bug in the hash implementation where the order mattered, meaning that hash(frozendict(a=1, b=2)) != hash(frozendict(b=2, a=1)). This has since been fixed.

Fixes #8687.

@jcrist jcrist force-pushed the frozendict-is-actual-dict branch from b4435fd to dad9170 Compare March 19, 2024 16:22

__slots__ = ("__view__", "__precomputed_hash__")
__view__: MappingProxyType
class FrozenDict(dict, Mapping[K, V], Hashable):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, can we avoid the subclass?

If someone wrote

def f(mapping: dict): ...

Then it would be acceptable to pass an instance of FrozenDict, which cannot be mutated, despite being a dict. That's extremely surprising to me, and I suspect it would be to others.

Copy link
Member Author

@jcrist jcrist Mar 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for this to fix the related issues no. Both are caused by explicit isinstance(x, dict) checks when they probably want Mapping instead.

We could hide the dict base class from mypy so mypy at least would error on that case, but for runtime checking this would still be a dict subclass. Not sure if that's worth it.

FWIW I think frozendict is unlikely to creep into user-facing code. We might still want to coerce them to actual dict instances on .execute() to avoid this (as nick attempted to do in his PR). But for internal-use only this fix seems like a definite win, both in performance and in correctness.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both are caused by explicit isinstance(x, dict)

You are saying you found somewhere in pandas code that is incorrect? If you point this to me I can file a PR upstream.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if it's the complete or right fix, but https://github.com/pandas-dev/pandas/blob/aa3e949e2a2b72588186cb1936edb535713aefa0/pandas/io/formats/printing.py#L223 is one part of the repr issue. The execution issue is (likely) in our code, but is also fixed by the cleanup to make frozendicts actual dicts so I didn't bother delving further.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both in performance and in correctness

Could you add a quick benchmark for this?

Copy link
Member

@kszucs kszucs Mar 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @cpcloud's argument, but I am in favour of this change if benchmarks show a clear performance improvement (I expect it to be better without the custom methods). The correctness is related to hashing which can also be fixed while keeping the dictionary proxy view.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quickly ran the graph traversal benchmarks which indeed improved:

---------------------------------------------------------------------------- benchmark 'test_bfs': 2 tests -----------------------------------------------------------------------------
Name (time in ms)              Min               Max              Mean            StdDev            Median               IQR            Outliers       OPS            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_bfs (0680_5041894)     2.2367 (1.11)     2.9631 (1.14)     2.3231 (1.08)     0.1120 (1.06)     2.2752 (1.06)     0.0923 (1.0)         25;14  430.4501 (0.93)        171           1
test_bfs (NOW)              2.0154 (1.0)      2.6029 (1.0)      2.1532 (1.0)      0.1060 (1.0)      2.1370 (1.0)      0.1417 (1.54)         79;6  464.4347 (1.0)         239           1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------- benchmark 'test_dfs': 2 tests -----------------------------------------------------------------------------
Name (time in ms)              Min               Max              Mean            StdDev            Median               IQR            Outliers       OPS            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_dfs (0680_5041894)     2.2076 (1.11)     2.6657 (1.18)     2.2614 (1.09)     0.0681 (1.0)      2.2378 (1.10)     0.0683 (1.0)         22;13  442.2093 (0.92)        189           1
test_dfs (NOW)              1.9952 (1.0)      2.2626 (1.0)      2.0778 (1.0)      0.0777 (1.14)     2.0388 (1.0)      0.1393 (2.04)         13;0  481.2894 (1.0)          46           1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------------------------- benchmark 'test_replace_mapping': 2 tests ------------------------------------------------------------------------------
Name (time in ms)                          Min                Max               Mean            StdDev            Median               IQR            Outliers       OPS            Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_replace_mapping (0680_5041894)     9.3224 (1.12)     28.3315 (1.04)     10.2140 (1.14)     3.1599 (1.62)     9.6292 (1.12)     0.2436 (1.0)           1;3   97.9049 (0.87)         35           1
test_replace_mapping (NOW)              8.3443 (1.0)      27.1707 (1.0)       8.9253 (1.0)      1.9461 (1.0)      8.5792 (1.0)      0.4105 (1.69)          1;1  112.0406 (1.0)          92           1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------ benchmark 'test_replace_pattern': 2 tests ------------------------------------------------------------------------------
Name (time in ms)                           Min                Max               Mean            StdDev             Median               IQR            Outliers      OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_replace_pattern (0680_5041894)     12.4369 (1.08)     32.3992 (1.07)     13.5555 (1.11)     3.2492 (1.52)     12.7563 (1.08)     0.7327 (1.60)          2;2  73.7707 (0.90)         62           1
test_replace_pattern (NOW)              11.5567 (1.0)      30.1479 (1.0)      12.1997 (1.0)      2.1385 (1.0)      11.8208 (1.0)      0.4588 (1.0)           1;2  81.9690 (1.0)          74           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

NickCrews added a commit to NickCrews/pandas that referenced this pull request Mar 19, 2024
NickCrews added a commit to NickCrews/pandas that referenced this pull request Mar 19, 2024
NickCrews added a commit to NickCrews/pandas that referenced this pull request Mar 19, 2024
mroeschke pushed a commit to pandas-dev/pandas that referenced this pull request Mar 20, 2024
@kszucs kszucs changed the title refactor: make FrozenDict a subclass of dict refactor(common): make FrozenDict a subclass of dict Mar 20, 2024
Co-authored-by: Krisztián Szűcs <[email protected]>
@cpcloud cpcloud added bug Incorrect behavior inside of ibis refactor Issues or PRs related to refactoring the codebase labels Mar 21, 2024
@cpcloud cpcloud added this to the 9.0 milestone Mar 21, 2024
@kszucs kszucs enabled auto-merge (squash) March 21, 2024 14:40
@kszucs kszucs merged commit 32b7514 into ibis-project:main Mar 21, 2024
pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request May 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior inside of ibis refactor Issues or PRs related to refactoring the codebase
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug: execution of struct literals in the pandas and dask backends returns the wrong value
4 participants