
Optimization of database storage #396

Open
ttngu207 opened this issue Aug 21, 2024 · 17 comments

@ttngu207 (Contributor)

The data stored in the MySQL server (on aeon-db2) is approaching 400 GB.
This will soon hit the storage limit of the aeon-db2 server, at which point we will need to request a disk size increase.

I'd like to discuss potential solutions to better engineer/optimize our storage use at the next data architecture meeting.

The bulk of the storage is in the streams schema, where we extract and store the data from the relevant streams. Within that schema, most of the space is taken by UndergroundFeederEncoder (the wheel data, ~100 GB) and SpinnakerVideoSourceVideo (~80 GB).

@ttngu207 (Contributor, Author) commented Aug 21, 2024

I'm sharing here the current size (in GB) of each schema in the aeon-db2 database:

{'aeon_streams': 186.279919616,
 'aeon_archived_exp02_acquisition': 42.648174592,
 'aeon_worker': 0.092962816,
 'aeon_tracking': 32.597950464,
 'aeon_archived_exp01_analysis': 0.010092544,
 'aeon_analysis': 0.00188416,
 'aeon_archived_exp02_subject': 0.000835584,
 'aeon_archived_exp02_worker': 0.000131072,
 'aeon_report': 4.9152e-05,
 'aeon_block_analysis': 23.938367488,
 'aeon_archived_exp02_report': 0.002752512,
 'aeon_device': 3.2768e-05,
 'u_thinh_aeonfix': 0.00475136,
 'aeon_subject': 0.000638976,
 'aeon_archived_exp02_streams': 0.000507904,
 'aeon_archived_exp01_acquisition': 0.10944512,
 'aeon_archived_exp02_qc': 24.861868032,
 'aeon_archived_exp02_analysis': 4.933648384,
 'aeon_lab': 0.000442368,
 'aeon_archived_exp01_lab': 0.000425984,
 'aeon_archived_exp01_subject': 0.000294912,
 'aeon_archived_exp01_report': 0.041025536,
 'aeon_archived_exp01_qc': 26.919780352,
 'aeon_archived_exp02_lab': 0.000442368,
 'aeon_acquisition': 0.121913344,
 'aeon_archived_exp02_tracking': 9.951330304,
 'aeon_qc': 4.260446208,
 'aeon_archived_exp01_tracking': 1.720942592}
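
For reference, numbers like these can be pulled from MySQL's information_schema; a minimal sketch, assuming a DataJoint connection already configured for aeon-db2 (not necessarily the exact query used here):

import datajoint as dj

# Per-schema size report from information_schema (standard MySQL);
# assumes dj.config already points at the aeon-db2 server.
conn = dj.conn()
cursor = conn.query(
    """
    SELECT table_schema, SUM(data_length + index_length) / 1e9 AS size_gb
    FROM information_schema.tables
    GROUP BY table_schema
    ORDER BY size_gb DESC
    """
)
schema_sizes = {schema: float(size_gb) for schema, size_gb in cursor}
print(schema_sizes)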

@jkbhagatio (Member)

Ah yes, I'm happy to downsample the wheel data to 50 Hz as something quick we can do immediately.

@lochhh (Contributor) commented Aug 28, 2024

@ttngu207 proposed two possible ways to downsample the wheel data:

  • downsample as part of ingestion and store the result in processed (but this is inconsistent with the current "read-only" approach)
  • downsample and store in processed before ingestion

@ttngu207 feel free to expand/correct my comment here

@ttngu207 (Contributor, Author)

Hi @jkbhagatio,
We discussed this issue a bit more with @lochhh and @anayapouget today.

Downsampling the wheel data to 50 Hz is a good immediate step we can take to reduce storage. I think there are two ways to accomplish this:

  1. Add downsampling logic at the DataJoint ingestion step.
    • Pros: simple, easy
    • Cons: mismatch between how the data are stored in the files and how they are stored in DataJoint. For all of the tables in streams we have maintained this consistency, so breaking it here is something to think about.
  2. Create another set of downsampled wheel-data files in the processed folder - these are what would get ingested.
    • Pros: zero change needed to the current ingestion routine, and the consistency between files and ingested data is maintained.
    • Cons: an additional routine is needed to do the downsampling and store the downsampled files in processed (perhaps together with, or after, the data copy step).

@ttngu207 (Contributor, Author) commented Aug 28, 2024

Technically, at the streams level, we want to maintain consistency between the ingested data and the "read" data - i.e. the dataframe returned by the StreamReader class (not strictly the data in the raw files).
So perhaps a 3rd option to consider is to update the StreamReader for the wheel data (io.reader.Encoder) to take a downsampling keyword argument (or downsampled_rate), and use that option for ingestion.

This approach would also require zero change to the current ingestion flow - only an update to the reader in aeon.schemas.schema - and it would still maintain consistency between the data returned by the reader and the data ingested.

@ttngu207 (Contributor, Author)

@jkbhagatio @glopesdev
Let us know your thoughts on this; if it sounds like a reasonable approach, I'll start the implementation.

Currently, data ingestion for social0.4 is paused because of this storage problem.

@glopesdev (Contributor)

@ttngu207 for the wheel data, option 3 sounds preferable for now, especially since it avoids data duplication.

For the video CSV it's a bit harder to imagine a compression scheme, especially since the numbers suggest the stored data is already in binary form. The only other option would be to store only timestamps and table indices, and pull the data on demand post-query if needed.

I can push to implement the downsampling parameter ASAP for wheels since it seems this would unblock ingestion the fastest.

@ttngu207 (Contributor, Author)

@glopesdev agreed, I also like option 3 best. The video CSV is definitely trickier; we can deal with that later.
It would be great if we could prioritize the downsampling parameter in io.reader.Encoder, as that would allow ingestion to resume.

@jerlich (Contributor) commented Aug 29, 2024

What is the underlying database engine? InnoDB? Is page compression enabled? You can also get savings depending on the underlying file system.
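
For context, checking the engine and enabling InnoDB page compression on a large table could look roughly like this (a rough sketch only: the table name is a placeholder, and transparent page compression also requires innodb_file_per_table plus a filesystem with hole-punching support):

import datajoint as dj

conn = dj.conn()

# Check the engine and row format of the largest tables in the streams schema.
cursor = conn.query(
    """
    SELECT table_name, engine, row_format
    FROM information_schema.tables
    WHERE table_schema = 'aeon_streams'
    ORDER BY data_length + index_length DESC
    LIMIT 5
    """
)
print(list(cursor))

# Placeholder table name; rebuilding the table applies transparent page
# compression (InnoDB, MySQL 5.7+).
table = "some_large_table"  # substitute the actual largest table
conn.query(f"ALTER TABLE `aeon_streams`.`{table}` COMPRESSION='zlib'")
conn.query(f"OPTIMIZE TABLE `aeon_streams`.`{table}`")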

@jkbhagatio (Member)

@ttngu207 @glopesdev, I'm also happy with option 3 mentioned above.

@glopesdev (Contributor)

@jkbhagatio from our meeting today we were converging on a downsampling strategy and API for the wheel data. Since the encoder reports absolute wheel position, we can simply decimate the data to 50 Hz by sampling the raw stream at fixed 20 ms intervals inside each 1 h chunk (no filtering needed).

This would be implemented as a property at the level of the reader.Encoder class so it can be toggled in the schema (or we can set the default to True).
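
For illustration, a hand-rolled sketch of that decimation with pandas (not the actual reader implementation; data is assumed to be the time-indexed dataframe returned by the encoder reader):

import pandas as pd

def decimate_to_50hz(data: pd.DataFrame) -> pd.DataFrame:
    """Keep one raw sample per fixed 20 ms bin (50 Hz); no filtering is needed
    because the encoder reports absolute position rather than a rate."""
    return data.resample("20ms").first().dropna(how="all")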

For the video metadata, one possibility we discussed to gain some space immediately is to drop the hw_counter column from the ingested data. We should be able to reconstruct dropped frames at both the TTL and software levels by comparing the camera timestamp directly with the harp timestamp, and it would give us back around a third of the space.

If this sounds good I can submit a PR soon.

@jkbhagatio (Member)

@glopesdev this sounds good regarding the changes to Encoder; happy for that to go ahead immediately. Regarding the video metadata, can we discuss the hw_counter in a bit more detail?

@glopesdev (Contributor) commented Sep 6, 2024

@ttngu207 @jkbhagatio @lochhh I have a working implementation of the downsampling operation at 3340df2.

However, I am now conflicted as to whether having downsample as a property of the reader / stream is a good idea. This came up because of testing: the few tests we have of the low-level API were actually using encoder data from a chunk with non-monotonic timestamps, and after resampling the behavior changes.

This makes sense, but at the same time it made me more aware that some of the more subtle timestamping issues were only found by carefully inspecting this low-level raw data, and I would like to retain the ability to easily query the raw data if necessary.

The problem is that if we implement downsample as a property at the level of the schema stream, then whoever calls aeon.load with that stream has no way to see the raw data, since the flag is essentially frozen at schema creation time.

Therefore I would like to provide a way to override reader behavior at aeon.load time as well, when necessary. A general way to support this would be to have readers accept optional **kwargs arguments, and have aeon.load forward any kwargs it receives in its own call to the corresponding reader (which is itself passed in as a function argument).

This would solve my dilemma with the encoder testing and future debugging, and more generally gives us a straightforward way to add reader options that can be toggled at aeon.load time, rather than only by configuring the reader at schema creation time.

As an illustration, this would give us the following signatures and call (downsampling by default):

Reader

    def read(self, file, downsample=True):
        """Reads encoder data from the specified Harp binary file, and optionally downsamples
        the frequency to 50Hz.
        """

Load

# if we want to look at raw data we can optionally clear the downsample flag
patch_wheel = aeon.load(root, exp02.Patch2.Encoder, epoch=behavior_epoch, downsample=False)

Any thoughts on this? I've opened a PR with a proposed implementation.
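
For concreteness, the forwarding inside load could look roughly like this (a simplified sketch, not the code in the PR; the glob over reader.pattern stands in for the real chunk-discovery logic):

from pathlib import Path
import pandas as pd

def load(root, reader, **kwargs):
    """Sketch: find the chunk files matching the reader's pattern and forward any
    extra keyword arguments (e.g. downsample=False) to the reader's read method."""
    files = sorted(Path(root).glob(f"**/{reader.pattern}*.bin"))  # simplified chunk discovery
    return pd.concat([reader.read(file, **kwargs) for file in files])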

@ttngu207 (Contributor, Author) commented Sep 6, 2024

@glopesdev Thanks, the implementation looks good to me.

I'm fine with downsample being an additional argument for load, but it feels a bit one-off and may detract from the elegance of the current implementation/usage of .load() (or not).

How about downsample remaining a property of the reader, with a get/set method added for this property so it can be changed on the fly before calling load() (rather than being frozen at schema creation time)?
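
Something along these lines, as a sketch of that alternative (an illustrative stand-in, not the real io.reader.Encoder):

class Encoder:
    """Illustrative reader stand-in with a downsample flag that can be toggled on the fly."""

    def __init__(self, pattern, downsample=True):
        self.pattern = pattern
        self._downsample = downsample

    @property
    def downsample(self):
        """Whether read() decimates the stream to 50 Hz."""
        return self._downsample

    @downsample.setter
    def downsample(self, value):
        self._downsample = value

# e.g. toggle on the schema's reader before loading, instead of passing a flag through load():
# exp02.Patch2.Encoder.downsample = False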

@ttngu207 (Contributor, Author) commented Sep 6, 2024

I've reviewed the PR; allowing additional kwargs to be passed through to load() is also very clean and potentially very useful!

@jkbhagatio (Member) commented Sep 6, 2024

Hi @glopesdev, this makes sense to me conceptually, but I might suggest using a specific arg name (that takes a dict) instead of kwargs, to differentiate args passed to load itself from args passed to read via load.

I think this also implies that the base class Reader's read should accept this new dict param, which I guess can default to None or an empty dict?
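
As a sketch of the shape that suggestion could take (again illustrative, not a concrete patch for the PR):

from pathlib import Path
import pandas as pd

def load(root, reader, read_kwargs=None):
    """Sketch: a single dict parameter carries reader-specific options, keeping
    arguments meant for load itself separate from those forwarded to read."""
    read_kwargs = read_kwargs or {}
    files = sorted(Path(root).glob(f"**/{reader.pattern}*.bin"))  # simplified chunk discovery
    return pd.concat([reader.read(file, **read_kwargs) for file in files])

# e.g. patch_wheel = load(root, exp02.Patch2.Encoder, read_kwargs={"downsample": False})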

@glopesdev (Contributor)

@jkbhagatio can you review directly in the PR? It's just easier to iterate on the code implementation there.
