Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed ArrayV2Metadata parameter names #2270

Open
wants to merge 1 commit into
base: v3
Choose a base branch
from

Conversation

TomAugspurger
Copy link
Contributor

For things like dataclasses.repalce to work, we need the parameter names to match the attribute names.

All the name change between what we have in memory and what the specs requires should happen during serialization / deserialization.

Closes #2269

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

For things like dataclasses.repalce to work, we need the parameter names
to match the attribute names.

All the name change between what we have in memory and what the specs
requires should happen during serialization / deserialization.

Closes zarr-developers#2269
@d-v-b
Copy link
Contributor

d-v-b commented Sep 30, 2024

Can you show the problem that this PR solves? From my POV it's a good thing that the metadata classes have the same structure as the underlying zarr metadata document(s).

@TomAugspurger
Copy link
Contributor Author

#2269 has the issue. Doing something like a.attrs.put({"key": 0}) or anything that uses dataclasses.replace will fail with

TypeError: ArrayV2Metadata.__init__() got an unexpected keyword argument 'chunk_grid'

From my POV it's a good thing that the metadata classes have the same structure as the underlying zarr metadata document

Would that mean changing the dataclass fields on ArrayV2Metadata to chunks and dtype? I'm guessing we have it this way currently is to support writing code that's generic over ArrayV2Metadata and ArrayMetadata. IMO, that's more important than having the Python data model exactly match the on-disk representation, since the translation should be isolated to just to_dict and from_dict.

@TomAugspurger TomAugspurger marked this pull request as ready for review September 30, 2024 11:40
@d-v-b
Copy link
Contributor

d-v-b commented Sep 30, 2024

Would that mean changing the dataclass fields on ArrayV2Metadata to chunks and dtype? I'm guessing we have it this way currently is to support writing code that's generic over ArrayV2Metadata and ArrayMetadata. IMO, that's more important than having the Python data model exactly match the on-disk representation, since the translation should be isolated to just to_dict and from_dict.

oof, thanks for reminding me that ArrayV2Metadata doesn't model the .zarray document completely. TBH I think we should fix that, i.e. change data_type to dtype and chunk_grid to chunks. I will open a PR shortly to do this.

My preference would be that ArrayV3Metadata and ArrayV2Metadata exist for one purpose: to model the metadata documents. It is the job of classes (like AsyncArray) that wrap those metadata documents to provide the generic array-like API. For this reason I'm not a fan of to_dict or from_dict doing any attribute name remapping. Also, remapping the v2 .zarray dtype key to data_type doesn't help the bigger semantic friction between compressor / filters in v2 and codecs in v3. So IMO we should normalize the v2 and v3 data models completely, in one place, and that's exactly what AsyncArray is for.

@d-v-b
Copy link
Contributor

d-v-b commented Sep 30, 2024

I think a design goal for ArrayV2Metadata and ArrayV3Metadata is that they should be as independent as possible from whatever array API we layer on top of them. This ensures that zarr-python can evolve that array API in just one place (the array classes), and also ensures that zarr-python is more useful to libraries that want to work with the Zarr data model, but don't want to use the zarr-python array runtime classes. IMO both of these are weakened if our in-memory array abstraction leaks onto the metadata classes.

@TomAugspurger
Copy link
Contributor Author

I think a design goal for ArrayV2Metadata and ArrayV3Metadata is that they should be as independent as possible from whatever array API we layer on top of them

Mmm, I'm not sure where I stand on that :) I think I agree that it's nice to have something that directly models the stored representation of the data, or is at least trivially convertible to it (and I'd count renaming the key as trivial). But IMO the most important thing is the ability to write code that's generic over v2 and v3. I don't have a strong preference for whether that's achieved by having subtypes with mostly the same interface (ArrayV2Metadata and ArrayV3Metadata using the same field names) or some wrapper class.

I'll leave this on hold for now.

@d-v-b
Copy link
Contributor

d-v-b commented Sep 30, 2024

But IMO the most important thing is the ability to write code that's generic over v2 and v3.

I'm hoping we can have both close models of the metadata documents and achieve this goal. The basic problem is that v2 and v3 are fundamentally different data models, largely due to the filters / compressor vs codecs divide. Because there is no unambiguous mapping from codecs to filters / compressor, I think we can never have a metadata model that offer a truly generic API. But that's OK -- we only have 2 cases to handle, so if metadata is v2 do v2 stuff, else do v3 stuff in the array class seems fine to me. If we had 100 different zarr versions to handle then it would be a different story.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants