Define a data logging/ingestion format and spec #41

Open
Tracked by #1
bruno-f-cruz opened this issue Mar 19, 2024 · 4 comments
Labels
proposal Request for a new feature

Comments

@bruno-f-cruz
Member

bruno-f-cruz commented Mar 19, 2024

Summary

One of the goals of the harp-ecosystem is to define a data format and specification that allow users to log their data in a stable and shareable way.

Current Implementations

At the Allen

The current implementation at the Allen follows this pattern: https://allenneuraldynamics.github.io/Bonsai.AllenNeuralDynamics/articles/core-logging.html#harp-data

Essentially, all messages from a single device are grouped by register, and each group is saved to its own binary file. The name of the binary file currently follows the convention <DeviceName__RegisterName.bin>.
e.g.:

├───Behavior.harp
│       Register__AnalogData.bin
│       Register__AssemblyVersion.bin
│       Register__Camera0Frame.bin
│       Register__Camera0Frequency.bin
│       Register__Camera1Frame.bin
│       Register__Camera1Frequency.bin
│       Register__ClockConfiguration.bin
│       Register__CoreVersionHigh.bin
│       Register__CoreVersionLow.bin
│       Register__DeviceName.bin
│       .....
├───ClockGenerator.harp
│       Register__AssemblyVersion.bin
│       Register__Battery.bin
│       Register__BatteryCalibration0.bin
│       Register__BatteryCalibration1.bin
│       Register__BatteryRate.bin
│       Register__BatteryThresholdHigh.bin
│       Register__BatteryThresholdLow.bin
│       Register__ClockConfiguration.bin
│       Register__Config.bin
│       Register__CoreVersionHigh.bin
│       Register__CoreVersionLow.bin
│       ....
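
For reference, here is a minimal Python sketch of how a reader could index this layout. It assumes each device folder is named <DeviceName>.harp and each file Register__<RegisterName>.bin, as in the listing above. The function name index_current_layout is hypothetical and not part of any Harp package.

```python
from pathlib import Path


def index_current_layout(root: Path) -> dict[str, dict[str, Path]]:
    """Map device name -> register name -> file path for the layout above."""
    index: dict[str, dict[str, Path]] = {}
    for device_dir in sorted(root.glob("*.harp")):
        if not device_dir.is_dir():
            continue
        registers: dict[str, Path] = {}
        for file in sorted(device_dir.glob("Register__*.bin")):
            # "Register__AnalogData.bin" -> "AnalogData"
            register_name = file.stem.split("__", 1)[1]
            registers[register_name] = file
        # "Behavior.harp" -> "Behavior"
        index[device_dir.stem] = registers
    return index
```

Note that the register is only recoverable by name here, so converting to register numbers needs the device metadata.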

This has a few problems:

  1. It does not split by event/read/write, which might be a problem given the latest discussions in "Require timestamp sequence to be monotonic when device is synchronized" (#37).
  2. It does not work with the current spec of the harp-python package.
  3. It does not include the yml metadata file, making it difficult to recover the metadata associated with the device offline.

Possible solutions

  • Add a way to add the package metadata to the logging folder
  • Decide on the data logging spec format
  • Should we have a <FolderName.harp> as the root and split all files inside by <MessageType.Register>?
  • Should we label registers as numbers or names?
  • Adopt the following folder structure:
    - <UserGivenName>.harp / <DeviceName>_<RegisterNumber>.bin
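
A minimal sketch of this naming scheme in Python, assuming the <UserGivenName>.harp / <DeviceName>_<RegisterNumber>.bin structure proposed above. The helper names register_file and parse_register_file are hypothetical, and the register number in the usage note is only illustrative.

```python
from pathlib import Path


def register_file(root: Path, user_name: str, device_name: str, register: int) -> Path:
    """Build <root>/<UserGivenName>.harp/<DeviceName>_<RegisterNumber>.bin."""
    return root / f"{user_name}.harp" / f"{device_name}_{register}.bin"


def parse_register_file(path: Path) -> tuple[str, int]:
    """Recover (device name, register number) from a file name."""
    # Split on the last underscore so device names containing "_" still parse.
    device_name, register = path.stem.rsplit("_", 1)
    return device_name, int(register)
```

For example, register_file(Path("data"), "Session1", "Behavior", 32) gives data/Session1.harp/Behavior_32.bin, and parse_register_file on that path returns ("Behavior", 32).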
@bruno-f-cruz bruno-f-cruz added the proposal Request for a new feature label Mar 19, 2024
@bruno-f-cruz
Member Author

One thing that came to mind is why we use <DeviceName> in <UserGivenName>.harp / <DeviceName>_<RegisterNumber>.bin at all. It seems to introduce an extra dependency that is not necessary. Maybe a more general name, like Register, would be better? @glopesdev

@glopesdev
Collaborator

@bruno-f-cruz This makes it easier to search for chunks of the same device across epoch folders, as happens in the Aeon data formats. I want to keep pushing for this, as I think it is an important use case to keep compatibility for, even though it may not be used in 90% of cases.

@bruno-f-cruz
Member Author

bruno-f-cruz commented Mar 24, 2024

I guess my question is whether it should be part of the spec or not. From the Python interface point of view it doesn't appear to add much. I wonder if we can find a way for the interface to work as long as the pattern is '*_', or if there is an advantage to introducing this dependency and locking the spec to it. To be clear: I am not against folding it in, I just wonder if we really need to add it!
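
A rough sketch of what a device-agnostic lookup could look like in Python. It reads the '*_' pattern as '*_<RegisterNumber>.bin', which is an assumption, and find_register_file is a hypothetical helper rather than anything in harp-python.

```python
from pathlib import Path


def find_register_file(folder: Path, register: int) -> Path:
    """Find the single file for a register, whatever prefix precedes the '_'."""
    matches = sorted(folder.glob(f"*_{register}.bin"))
    if not matches:
        raise FileNotFoundError(f"no file for register {register} in {folder}")
    if len(matches) > 1:
        # Several prefixes share the folder, so the prefix is needed after all.
        raise ValueError(f"ambiguous files for register {register}: {matches}")
    return matches[0]
```

With this approach the prefix only matters when more than one device logs into the same folder.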

@glopesdev
Collaborator

glopesdev commented Nov 1, 2024

@bruno-f-cruz Picking the outstanding issues from this spec:

  1. It does not split by event/read/write, which might be a problem given the latest discussions in "Require timestamp sequence to be monotonic when device is synchronized" (#37).

Do we still need this now that harp-python explicitly exposes a message type column (harp-tech/harp-python#11)?
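
To illustrate why a message type column can stand in for separate files, here is a minimal Python sketch that recovers the type of each message from one raw register file. It is not the harp-python implementation. It assumes the Harp binary framing (MessageType, Length, Address, Port, PayloadType, payload, checksum), that Length counts the bytes following the Length field, and the type codes 1 = Read, 2 = Write, 3 = Event with 0x08 as the error flag. It does not validate checksums or handle any extended-length messages.

```python
from typing import Iterator

# Assumed type codes from the Harp binary protocol.
MESSAGE_TYPES = {1: "Read", 2: "Write", 3: "Event"}
ERROR_FLAG = 0x08


def iter_message_types(data: bytes) -> Iterator[tuple[str, bool]]:
    """Yield (message type name, is_error) for each message in a raw buffer."""
    offset = 0
    while offset + 2 <= len(data):
        message_type = data[offset]
        length = data[offset + 1]  # bytes remaining after the Length field
        name = MESSAGE_TYPES.get(message_type & ~ERROR_FLAG, "Unknown")
        yield name, bool(message_type & ERROR_FLAG)
        # Skip the two header bytes plus the rest of this message.
        offset += 2 + length
```

Filtering on the yielded type gives the same split as separate event/read/write files, without changing the on-disk layout.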

  2. It does not work with the current spec of the harp-python package

If we agree to change the spec to use only register numbers, then it should be fully compatible.

  3. It does not include the yml metadata file, making it difficult to recover the metadata associated with the device offline

If we adopt the proposal in harp-tech/device.behavior#21 then we will have a trivial way to store the metadata at acquisition time.

Proposed solutions

  • Add a way to add the package metadata to the logging folder (see "Include device metadata file as embedded resource", harp-tech/device.behavior#21)
  • Should we have a <FolderName.harp> as the root and split all files inside by <MessageType.Register>? (NO)
  • Should we label registers as numbers or names? (numbers)
  • Adopt a standard folder structure (<DataFolderName> / <DeviceName>_<RegisterNumber>.bin)

I think this last default is fine. There are questions of compatibility for projects like Aeon that want to go for multi-chunking of data and may have slightly different naming conventions for file layouts. This is fine because the standard folder structure is optional, i.e. it is always possible for projects and APIs to pass the data file path directly, so I don't think we need to worry too much about this as long as there is a reasonable way forward.
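
As a sketch of how an optional standard layout could coexist with direct paths, assuming the <DataFolderName> / <DeviceName>_<RegisterNumber>.bin default above (resolve_register_file and its parameters are hypothetical, not an existing API):

```python
from pathlib import Path
from typing import Optional


def resolve_register_file(
    register: int,
    data_folder: Optional[Path] = None,
    device_name: Optional[str] = None,
    explicit_path: Optional[Path] = None,
) -> Path:
    """Prefer an explicit file path; otherwise fall back to the standard layout."""
    if explicit_path is not None:
        # Projects with their own conventions pass the file directly.
        return Path(explicit_path)
    if data_folder is None or device_name is None:
        raise ValueError("need explicit_path, or both data_folder and device_name")
    return Path(data_folder) / f"{device_name}_{register}.bin"
```

A project with its own chunked file names would always pass explicit_path, while everyone else relies on the default.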

Assuming there is nothing else missing I think we are close to having a complete proposal for the data logging spec format that we could port into the issue description above, and discuss in the next Harp club meeting.
