Skip to content

Conversation

mzeitlin11
Copy link
Member

Long overdue followup to #38744. Only half the diff is relevant since the changes are duplicated over the if rank_t is object... else with no_gil... structure.

@mzeitlin11 mzeitlin11 added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Clean labels Mar 23, 2021
@WillAyd
Copy link
Member

WillAyd commented Mar 23, 2021

Can you post ASVs?

@mzeitlin11
Copy link
Member Author

mzeitlin11 commented Mar 23, 2021

Added ASVs here:

before           after         ratio
     [ced764da]       [f6a04b79]
     <master>         <cln/rank_1d>
          429±8μs          430±2μs     1.00  groupby.GroupByMethods.time_dtype_as_field('datetime', 'rank', 'direct')
          438±4μs          432±6μs     0.99  groupby.GroupByMethods.time_dtype_as_field('datetime', 'rank', 'transformation')
          490±7μs         469±20μs     0.96  groupby.GroupByMethods.time_dtype_as_field('float', 'rank', 'direct')
          487±3μs          487±3μs     1.00  groupby.GroupByMethods.time_dtype_as_field('float', 'rank', 'transformation')
         480±40μs         483±20μs     1.01  groupby.GroupByMethods.time_dtype_as_field('int', 'rank', 'direct')
          475±2μs         467±40μs     0.98  groupby.GroupByMethods.time_dtype_as_field('int', 'rank', 'transformation')
      1.49±0.08ms      1.55±0.01ms     1.03  groupby.GroupByMethods.time_dtype_as_field('object', 'rank', 'direct')
      1.55±0.03ms      1.57±0.01ms     1.01  groupby.GroupByMethods.time_dtype_as_field('object', 'rank', 'transformation')
          474±3μs          472±3μs     1.00  groupby.GroupByMethods.time_dtype_as_group('datetime', 'rank', 'direct')
          477±2μs          473±9μs     0.99  groupby.GroupByMethods.time_dtype_as_group('datetime', 'rank', 'transformation')
          468±9μs          474±4μs     1.01  groupby.GroupByMethods.time_dtype_as_group('float', 'rank', 'direct')
          473±3μs          474±7μs     1.00  groupby.GroupByMethods.time_dtype_as_group('float', 'rank', 'transformation')
          475±1μs          483±2μs     1.02  groupby.GroupByMethods.time_dtype_as_group('int', 'rank', 'direct')
         477±10μs         499±10μs     1.05  groupby.GroupByMethods.time_dtype_as_group('int', 'rank', 'transformation')
         367±10μs          374±2μs     1.02  groupby.GroupByMethods.time_dtype_as_group('object', 'rank', 'direct')
         373±20μs          378±6μs     1.01  groupby.GroupByMethods.time_dtype_as_group('object', 'rank', 'transformation')
       1.13±0.4ms      1.13±0.02ms     1.00  groupby.RankWithTies.time_rank_ties('datetime64', 'average')
      1.12±0.04ms      1.12±0.03ms     0.99  groupby.RankWithTies.time_rank_ties('datetime64', 'dense')
       1.14±0.1ms      1.12±0.02ms     0.98  groupby.RankWithTies.time_rank_ties('datetime64', 'first')
      1.09±0.08ms      1.13±0.01ms     1.03  groupby.RankWithTies.time_rank_ties('datetime64', 'max')
      1.11±0.07ms      1.12±0.01ms     1.01  groupby.RankWithTies.time_rank_ties('datetime64', 'min')
      1.05±0.05ms      1.07±0.01ms     1.02  groupby.RankWithTies.time_rank_ties('float32', 'average')
         942±70μs       1.05±0.1ms    ~1.11  groupby.RankWithTies.time_rank_ties('float32', 'dense')
      1.05±0.03ms         984±60μs     0.94  groupby.RankWithTies.time_rank_ties('float32', 'first')
      1.08±0.01ms      1.02±0.07ms     0.95  groupby.RankWithTies.time_rank_ties('float32', 'max')
      1.06±0.02ms         987±70μs     0.93  groupby.RankWithTies.time_rank_ties('float32', 'min')
         996±60μs      1.05±0.01ms     1.06  groupby.RankWithTies.time_rank_ties('float64', 'average')
      1.03±0.06ms         1.04±0ms     1.01  groupby.RankWithTies.time_rank_ties('float64', 'dense')
      1.06±0.09ms      1.05±0.01ms     0.99  groupby.RankWithTies.time_rank_ties('float64', 'first')
      1.05±0.06ms      1.04±0.08ms     0.99  groupby.RankWithTies.time_rank_ties('float64', 'max')
         997±50μs      1.03±0.07ms     1.03  groupby.RankWithTies.time_rank_ties('float64', 'min')
      1.08±0.07ms      1.05±0.03ms     0.98  groupby.RankWithTies.time_rank_ties('int64', 'average')
       1.54±0.6ms      1.06±0.02ms    ~0.69  groupby.RankWithTies.time_rank_ties('int64', 'dense')
      1.08±0.03ms      1.06±0.05ms     0.98  groupby.RankWithTies.time_rank_ties('int64', 'first')
       1.07±0.1ms      1.06±0.08ms     0.99  groupby.RankWithTies.time_rank_ties('int64', 'max')
       1.12±0.2ms      1.06±0.01ms     0.95  groupby.RankWithTies.time_rank_ties('int64', 'min')
      10.8±0.06ms       10.8±0.4ms     1.00  series_methods.Rank.time_rank('float')
      8.07±0.04ms      8.23±0.08ms     1.02  series_methods.Rank.time_rank('int')
        60.5±30ms       55.3±0.4ms     0.91  series_methods.Rank.time_rank('object')
      7.97±0.05ms       8.57±0.3ms     1.07  series_methods.Rank.time_rank('uint')

@jreback
Copy link
Contributor

jreback commented Mar 23, 2021

this is a pure refactor right?

@jreback jreback added this to the 1.3 milestone Mar 23, 2021
@jreback jreback requested a review from jbrockmendel March 23, 2021 16:21
@mzeitlin11
Copy link
Member Author

this is a pure refactor right?

Yeah, no behavior change

@jreback
Copy link
Contributor

jreback commented Mar 23, 2021

kk lgtm. cc @jbrockmendel

out[lexsort_indexer[j]] = j + 1 - grp_start
elif tiebreak == TIEBREAK_FIRST_DESCENDING:
for j in range(i - dups + 1, i + 1):
out[lexsort_indexer[j]] = 2 * i - j - dups + 2 - grp_start
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it be obvious where 2 * i - j - dups + 2 - grp_start comes from? looks like equivalent to (i - dups + 1) + (i - j + 1) - grp_start

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, good opportunity to clarify here, have added some commentary

@jbrockmendel
Copy link
Member

Potential clarification, otherwise LGTM

@jreback jreback merged commit bc62e76 into pandas-dev:master Mar 23, 2021
@jreback
Copy link
Contributor

jreback commented Mar 23, 2021

thanks @mzeitlin11

@mzeitlin11 mzeitlin11 deleted the cln/rank_1d branch March 23, 2021 20:32
vladu pushed a commit to vladu/pandas that referenced this pull request Apr 5, 2021
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Clean
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants