1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.21.1.txt
@@ -91,6 +91,7 @@ I/O
 - Bug in :meth:`DataFrame.to_msgpack` when serializing data of the numpy.bool_ datatype (:issue:`18390`)
 - Bug in :func:`read_json` not decoding when reading line-delimited JSON from S3 (:issue:`17200`)
 - Bug in :func:`pandas.io.json.json_normalize` to avoid modification of ``meta`` (:issue:`18610`)
+- Bug when storing NaN-only categorical columns in hdf5 store (:issue:`18413`)
Contributor

Bug when reading ......in a :class:`HDFStore`

Contributor Author

ok, done.


Plotting
^^^^^^^^
15 changes: 11 additions & 4 deletions pandas/io/pytables.py
@@ -2137,10 +2137,17 @@ def convert(self, values, nan_rep, encoding):
             # if we have stored a NaN in the categories
             # then strip it; in theory we could have BOTH
             # -1s in the codes and nulls :<
-            mask = isna(categories)
-            if mask.any():
-                categories = categories[~mask]
-                codes[codes != -1] -= mask.astype(int).cumsum().values
+            if categories is None:
+                # Handle case of NaN-only categorical columns in which case
+                # the categories are an empty array; when this is stored,
+                # pytables cannot write a zero-len array, so on readback
+                # the categories would be None and `read_hdf()` would fail.
+                categories = Index([], dtype=np.float64)
+            else:
+                mask = isna(categories)
+                if mask.any():
+                    categories = categories[~mask]
+                    codes[codes != -1] -= mask.astype(int).cumsum().values

             self.data = Categorical.from_codes(codes,
                                                categories=categories,
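
The new `categories is None` branch can be illustrated without an HDF5 file. The following is a minimal sketch, assuming that the codes read back for a NaN-only column are all `-1` and that the stored categories come back as `None`; the variable names are illustrative and this is not the code path in `pandas/io/pytables.py` itself.

```python
import numpy as np
import pandas as pd

# Assumed readback state for a NaN-only categorical column:
# every code is -1 (missing) and no categories array was stored.
codes = np.array([-1, -1, -1, -1], dtype="int8")
categories = None

# Same fallback as in the diff: substitute an empty float64 index.
if categories is None:
    categories = pd.Index([], dtype=np.float64)

# With an empty categories index, all -1 codes are still valid input.
restored = pd.Categorical.from_codes(codes, categories=categories, ordered=False)
print(restored)
```

Because `-1` is the sentinel for a missing value, `Categorical.from_codes` accepts it even when there are no categories to index into, which is why the empty index is enough to rebuild the column.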
17 changes: 17 additions & 0 deletions pandas/tests/io/test_pytables.py
@@ -4928,6 +4928,23 @@ def test_categorical_conversion(self):
             result = read_hdf(path, 'df', where='obsids=B')
             tm.assert_frame_equal(result, expected)

+    def test_categorical_nan_only_columns(self):
+        # GH18413
+        # Check that read_hdf with categorical columns with NaN-only values can
+        # be read back.
+        df = pd.DataFrame({
+            'a': ['a', 'b', 'c', np.nan],
+            'b': [np.nan, np.nan, np.nan, np.nan],
Contributor

can you add another column with an array like pd.Series([None]* 3, dtype=object). this might fail your test because the original array was an all-null object type (and no float). but let's see.

Contributor Author

done, seems to work fine.

+            'c': [1, 2, 3, 4]
+        })
+        df['a'] = df.a.astype('category')
+        df['b'] = df.b.astype('category')
+        expected = df
+        with ensure_clean_path(self.path) as path:
+            df.to_hdf(path, 'df', format='table', data_columns=True)
+            result = read_hdf(path, 'df')
+            tm.assert_frame_equal(result, expected)
+
     def test_duplicate_column_name(self):
         df = DataFrame(columns=["a", "a"], data=[[0, 0]])
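
For readers who want to try the round trip outside the pandas test suite, here is a reduced sketch of the scenario in `test_categorical_nan_only_columns`. It assumes PyTables (the `tables` package) is installed and skips the round trip otherwise; `make_frame`, `roundtrip`, and the temporary file name are hypothetical helpers for illustration, not part of the PR, and the frame is cut down to the NaN-only column plus one integer column.

```python
import importlib.util
import os
import tempfile

import numpy as np
import pandas as pd


def make_frame():
    """Build a small frame whose column 'b' is a NaN-only categorical."""
    df = pd.DataFrame({
        "b": [np.nan, np.nan, np.nan, np.nan],
        "c": [1, 2, 3, 4],
    })
    df["b"] = df["b"].astype("category")
    return df


def roundtrip(df):
    """Write df to a temporary HDF5 file in table format and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "store.h5")  # hypothetical file name
        df.to_hdf(path, key="df", format="table", data_columns=True)
        return pd.read_hdf(path, "df")


# Only attempt the round trip when PyTables is available.
if importlib.util.find_spec("tables") is not None:
    restored = roundtrip(make_frame())
    print(restored["b"].isna().all())
```

The sketch checks only that the NaN-only column survives readback; the test in the diff is stricter and compares the whole frame with `tm.assert_frame_equal`.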
