BUG: pd.cut() sometimes puts NaNs into bins. #44075

Open
@akdor1154

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

bins = pd.IntervalIndex.from_breaks(
    range(0, 102, 1),
    closed='left', dtype='interval[int64]'
)

print(
  pd.DataFrame(dict(x=[1.2, np.nan, 10.2]))
  .pipe(lambda _: _.assign(
    bin=pd.cut(_.x, bins)
  ))
)

#      x       bin
# 0   1.2    [1, 2)
# 1   NaN  [50, 51)
# 2  10.2  [10, 11)

Issue Description

With certain bin setups, pd.cut assigns NaN values to the middle bin instead of leaving the result as NaN.
Oddly, it only seems to happen with certain bin setups:

bins = pd.IntervalIndex.from_breaks(
    range(0, 102, 1),  # as per repro: NaN gets binned (bug)
    # range(0, 100, 1),  # works
    # range(0, 101, 1),  # works
    # range(0, 202, 1),  # NaN gets binned (bug)
    closed='left', dtype='interval[int64]'
)

It also seems to go away if I use pd.NA instead of np.nan or math.nan.

This seems different from #31586, as searchsorted appears to work correctly. I tested with:

def mybin(val: pd.Series, bins: pd.IntervalIndex):
    # Search the left edges of the bins directly, bypassing pd.cut
    bins_np = bins.values.left
    return bins_np.searchsorted(val.to_numpy())

... .assign(mybin=mybin(_.x, bins))
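For anyone who wants to use the searchsorted route as a stand-in for pd.cut, here is a minimal sketch. It assumes the bins are sorted, contiguous and left-closed, as in the repro. The helper name `bin_positions` and the -1 "no bin" marker are made up for illustration and are not pandas API. NaN has to be masked explicitly, because searchsorted places it after every number.

```python
import numpy as np
import pandas as pd


def bin_positions(values: pd.Series, bins: pd.IntervalIndex) -> np.ndarray:
    """Position of the left-closed bin containing each value, or -1 if none.

    Hypothetical helper; assumes sorted, contiguous, left-closed bins.
    """
    vals = values.to_numpy(dtype="float64")
    left = bins.left.to_numpy()
    right = bins.right.to_numpy()
    # side="right" so a value equal to a left edge lands in that bin.
    pos = left.searchsorted(vals, side="right") - 1
    # Clip only for safe indexing into `right`; validity is checked below.
    safe = np.clip(pos, 0, len(bins) - 1)
    # Reject NaN, values below the first edge, and values at or past the last edge.
    valid = ~np.isnan(vals) & (pos >= 0) & (vals < right[safe])
    return np.where(valid, pos, -1)


# Same bin layout as the repro, built without the explicit dtype argument.
bins = pd.IntervalIndex.from_breaks(np.arange(0, 102), closed="left")
positions = bin_positions(pd.Series([1.2, np.nan, 10.2, 101.0]), bins)
print(positions)
```

Positions of -1 can then be treated as missing when mapping back to `bins`.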

Expected Behavior

I would expect the output of the repro to be something like:

#      x       bin
# 0   1.2    [1, 2)
# 1   NaN  pd.NA
# 2  10.2  [10, 11)
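In the meantime, one way to get the expected output regardless of how pd.cut treats NaN is to mask the result with the missingness of the input. This is only a workaround sketch, not a fix for the underlying behaviour. It builds the bins without the explicit dtype argument, which I am assuming gives the same layout as the repro.

```python
import numpy as np
import pandas as pd

# Same bin layout as the repro (0..101 in unit steps, left-closed).
bins = pd.IntervalIndex.from_breaks(np.arange(0, 102), closed="left")
df = pd.DataFrame({"x": [1.2, np.nan, 10.2]})

# Bin as usual, then blank out every row whose input was missing,
# so the result does not depend on pd.cut's own NaN handling.
df["bin"] = pd.cut(df["x"], bins).where(df["x"].notna())
print(df)
```

Rows where x is NaN end up with a missing bin, and the other rows keep whatever interval pd.cut assigned.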

Installed Versions

INSTALLED VERSIONS
------------------
commit           : 945c9ed766a61c7d2c0a7cbb251b6edebf9cb7d5
python           : 3.9.5.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.11.0-37-generic
Version          : #41-Ubuntu SMP Mon Sep 20 16:39:20 UTC 2021
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_AU.UTF-8
LOCALE           : en_AU.UTF-8

pandas           : 1.3.4
numpy            : 1.21.2
pytz             : 2021.3
dateutil         : 2.8.2
pip              : 21.0.1
setuptools       : 56.0.0
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 3.0.2
IPython          : 7.28.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.4.3
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 3.0.0
pyxlsb           : None
s3fs             : None
scipy            : 1.7.1
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None

Metadata

    Labels

    Bug; Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate); cut (cut, qcut)
