string dtype fixes #3170
another important change: I altered the zarr v3 name of the
scalar_v3_params = (
    (FixedLengthUTF32(length=0), ""),
    (FixedLengthUTF32(length=1), ""),
This line leaves me a little confused about what the intended behavior is for FixedLengthString with an empty fill value. Or, in general, does the fill value of a fixed-length string need to be the same length?
I'm going with numpy's semantics, which assigns the empty string to a sequence of null bytes:
>>> np.array([''], dtype="U1").tobytes()
b'\x00\x00\x00\x00'
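To make the padding rule concrete, here is a minimal sketch using only NumPy (nothing zarr-specific). It assumes little-endian `<U` dtypes so the byte layout is the same on every platform; the variable names are just for illustration.

```python
import numpy as np

# "<U1" holds one UTF-32 code point, so every element occupies 4 bytes.
one_char = np.dtype("<U1")

# The empty string is stored as 4 null bytes, a full string as its UTF-32 encoding.
empty_bytes = np.array([""], dtype=one_char).tobytes()
full_bytes = np.array(["a"], dtype=one_char).tobytes()

# Reading the null bytes back yields the empty string again.
decoded = np.frombuffer(empty_bytes, dtype=one_char)

# Longer fixed-length dtypes pad a short value the same way, with trailing nulls.
padded_bytes = np.array([""], dtype="<U3").tobytes()

print(empty_bytes, full_bytes, repr(str(decoded[0])), len(padded_bytes))
```

So a fill value shorter than the declared length is fine under these semantics: NumPy pads it with null code points on the way in and strips them on the way out.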
this is a situation where the array data type and the scalar data type (the thing you get when you index that array) are not the same:
>>> np.dtype('U0').type('a').dtype
dtype('<U1')
>>> np.dtype('U0').type('').dtype
dtype('<U')
>>> np.array([''], dtype="U0").dtype
dtype('<U1')
>>> np.array([''], dtype="U0")[0].dtype
dtype('<U')
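The same mismatch can be checked programmatically by comparing item sizes instead of reading the reprs. This is only a sketch of the NumPy behavior shown above; the variable names are hypothetical.

```python
import numpy as np

# Requesting "U0" for an array: NumPy sizes the dtype from the data,
# and the smallest string array dtype it will use has room for one character.
arr = np.array([""], dtype="U0")
scalar = arr[0]

# Bytes per element according to the array's dtype.
array_itemsize = arr.dtype.itemsize
# Bytes according to the dtype reported by the scalar pulled out of that array.
scalar_itemsize = scalar.dtype.itemsize

# The scalar class is the same regardless of the declared length.
scalar_class = np.dtype("U0").type

print(array_itemsize, scalar_itemsize, scalar_class is np.str_)
```

The practical consequence is that a scalar's dtype cannot be trusted to tell you the dtype of the array it came from.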
Good from me! Seems to fix the issue I was seeing with Xarray.
for posterity, numpy does permit the creation of arrays with a
>>> np.array([b''], dtype="V0").dtype
dtype('V')
>>> np.array([b''], dtype="V0").tobytes()
b''
>>> np.array([b'',b'',b'',b''], dtype="V0").tobytes()
b''
Arrays with this data type are all the empty byte string b"". I don't think this is meaningful for zarr. The changes in this PR will make this data type unrepresentable in zarr python. I hope we are OK with this.
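A short sketch of why this is not meaningful to store: with a zero-byte item size, arrays of different shapes serialize to the same empty buffer, so the stored bytes carry no information. This uses only NumPy and the names are illustrative.

```python
import numpy as np

one = np.array([b""], dtype="V0")
four = np.array([b"", b"", b"", b""], dtype="V0")

# Each element occupies zero bytes, so the whole array does too.
itemsize = four.dtype.itemsize
total_bytes = four.nbytes

# Arrays of different shapes produce identical (empty) serialized bytes.
same_payload = one.tobytes() == four.tobytes()

print(itemsize, total_bytes, same_payload)
```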
this PR fixes the issues reported in #3167.
There were two problems driving that issue:
1. zarr python was not using the string "str" to denote a variable length string data type. Instead, in main we use NumPy's behavior and map "str" to a fixed-length UTF-32 string, which is not a useful data type for most people. In this PR, when a user requests dtype=str or dtype="str" or dtype="string" they get a VariableLengthUTF8 dtype instead of the nearly-useless U dtype.
2. zarr python could create data types for length-0 scalars. Since these data types cannot be associated with arrays that contain any values, they are useless. See disallow 0-length fixed-size data types #3168. This is a quirk of NumPy that we don't need to support. This PR contains logic to ensure that we don't create data types for length-0 scalars.
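The two rules above can be sketched roughly as follows. This is not the code in this PR: `resolve_dtype_request` and the `"variable_length_utf8"` label are hypothetical stand-ins, and the sketch uses plain NumPy dtypes rather than zarr-python's own data type classes.

```python
import numpy as np

# Hypothetical label standing in for a variable-length UTF-8 data type.
VARIABLE_LENGTH_UTF8 = "variable_length_utf8"


def resolve_dtype_request(requested):
    """Illustrative resolver, not zarr-python's actual implementation."""
    # Rule 1: str, "str" and "string" mean variable-length strings,
    # not NumPy's fixed-length "U" dtype.
    if requested is str or (
        isinstance(requested, str) and requested in ("str", "string")
    ):
        return VARIABLE_LENGTH_UTF8

    dtype = np.dtype(requested)

    # Rule 2: refuse fixed-size bytes / unicode / void dtypes of length 0.
    if dtype.kind in ("S", "U", "V") and dtype.itemsize == 0:
        raise ValueError(f"length-0 data type {dtype!r} is not supported")

    return dtype


print(resolve_dtype_request(str))
print(resolve_dtype_request("U4"))
```

Checking for the `str` aliases before calling `np.dtype` matters, because `np.dtype(str)` would otherwise succeed and return the unsized `U` dtype.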
note that while debugging this, I found that zarr-python 2.x could create arrays where the dtype in metadata indicates length 0, but the actual array uses a dtype with size 1. We will need some support for this, but I don't think we need that in this PR.