Skip to content

string dtype fixes #3170

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Jun 27, 2025
Merged

string dtype fixes #3170

merged 9 commits into from
Jun 27, 2025

Conversation

d-v-b
Copy link
Contributor

@d-v-b d-v-b commented Jun 25, 2025

this PR fixes the issues reported in #3167.

There were two problems driving that issue:

  • zarr python was not using the string "str" to denote a variable length string data type. Instead, in main we use NumPy's behavior and map "str" to a fixed-length UTF-32 string, which is not a useful data type for most people.

    In this PR, when a user requests dtype=str or dtype="str" or dtype="string" they get a VariableLengthUTF8 dtype instead of the nearly-useless U dtype.

  • zarr python could create data types for length-0 scalars. Since these data types cannot be associated with arrays that contain any values, they are useless. See disallow 0-length fixed-size data types #3168. This is a quirk of NumPy that we don't need to support. This PR contains logic to ensure that we don't create data types for length-0 scalars.

note that while debugging this, I found that zarr-python 2.x could create arrays with where the dtype in metadata indicates length 0, but the actual array uses a dtype with size 1. We will need some support for this, but I don't think we need that in this PR.

@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Jun 25, 2025
@d-v-b d-v-b requested a review from rabernat June 25, 2025 17:26
@d-v-b
Copy link
Contributor Author

d-v-b commented Jun 25, 2025

another important change: I altered the zarr v3 name of the VariableLengthUTF8 dtype to be "string" instead of "variable_length_utf8". This should resolve compatibility issues with zarr-python 3.0.8.

@github-actions github-actions bot removed the needs release notes Automatically applied to PRs which haven't added release notes label Jun 25, 2025
scalar_v3_params = (
(FixedLengthUTF32(length=0), ""),
(FixedLengthUTF32(length=1), ""),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line leaves me a little confused on what the intended behavior for FIxedLengthString with an empty fill value is. or in general does the fill vlaue of a fixed length string need to be the same length?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going with numpy's semantics, which assigns the empty string to a sequence of null bytes:

>>> np.array([''], dtype="U1").tobytes()
b'\x00\x00\x00\x00'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a situation where the array data type and the scalar data type (the thing you get when you index that array) are not the same:

>>> np.dtype('U0').type('a').dtype
dtype('<U1')
>>> np.dtype('U0').type('').dtype 
dtype('<U')
>>> np.array([''], dtype="U0").dtype   
dtype('<U1')
>>> np.array([''], dtype="U0")[0].dtype
dtype('<U')

Copy link
Contributor

@ianhi ianhi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good from me! Seems to fix the issue I was seeing with Xarray.

@d-v-b
Copy link
Contributor Author

d-v-b commented Jun 26, 2025

for posterity, numpy does permit the creation of arrays with a "V0" data type, i.e. raw bytes where each element has length 0:

>>> np.array([b''], dtype="V0").dtype
dtype('V')
>>> np.array([b''], dtype="V0").tobytes()
b''
>>> np.array([b'',b'',b'',b''], dtype="V0").tobytes()
b''

Arrays with this data type are all the empty byte string b"". I don't think this is meaningful for zarr.

The changes in this PR will make this data type unrepresentable in zarr python. I hope we are OK with this.

@d-v-b d-v-b requested a review from jhamman June 26, 2025 14:59
@d-v-b d-v-b merged commit bbdefac into zarr-developers:main Jun 27, 2025
55 of 70 checks passed
@d-v-b d-v-b added this to the 3.1.0 milestone Jun 30, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants