
Optimization of database storage #396

Open
ttngu207 opened this issue Aug 21, 2024 · 17 comments

@ttngu207 (Contributor)

The data stored in the MySQL server (on aeon-db2) is approaching 400 GB.
This will soon hit the storage limit of the aeon-db2 server, at which point we will need to request a disk size increase.

I'd like to discuss potential solutions to better engineer/optimize our storage use at the next data architecture meeting.

The bulk of the storage is in the streams schema, where we extract and store the data from the relevant streams. Within that schema, most of the space is taken by UndergroundFeederEncoder (the wheel data, ~100 GB) and SpinnakerVideoSourceVideo (~80 GB).

@ttngu207 (Contributor, Author) commented Aug 21, 2024

I'm sharing here the current size (in GB) of each schema in the aeon-db2 database:

{'aeon_streams': 186.279919616,
 'aeon_archived_exp02_acquisition': 42.648174592,
 'aeon_worker': 0.092962816,
 'aeon_tracking': 32.597950464,
 'aeon_archived_exp01_analysis': 0.010092544,
 'aeon_analysis': 0.00188416,
 'aeon_archived_exp02_subject': 0.000835584,
 'aeon_archived_exp02_worker': 0.000131072,
 'aeon_report': 4.9152e-05,
 'aeon_block_analysis': 23.938367488,
 'aeon_archived_exp02_report': 0.002752512,
 'aeon_device': 3.2768e-05,
 'u_thinh_aeonfix': 0.00475136,
 'aeon_subject': 0.000638976,
 'aeon_archived_exp02_streams': 0.000507904,
 'aeon_archived_exp01_acquisition': 0.10944512,
 'aeon_archived_exp02_qc': 24.861868032,
 'aeon_archived_exp02_analysis': 4.933648384,
 'aeon_lab': 0.000442368,
 'aeon_archived_exp01_lab': 0.000425984,
 'aeon_archived_exp01_subject': 0.000294912,
 'aeon_archived_exp01_report': 0.041025536,
 'aeon_archived_exp01_qc': 26.919780352,
 'aeon_archived_exp02_lab': 0.000442368,
 'aeon_acquisition': 0.121913344,
 'aeon_archived_exp02_tracking': 9.951330304,
 'aeon_qc': 4.260446208,
 'aeon_archived_exp01_tracking': 1.720942592}
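
For reference, numbers like these can be pulled from MySQL's information_schema; a minimal sketch, assuming a DataJoint connection already configured for aeon-db2 (not necessarily the exact query used here):

import datajoint as dj

# Per-schema size report from information_schema (standard MySQL);
# assumes dj.config already points at the aeon-db2 server.
conn = dj.conn()
cursor = conn.query(
    """
    SELECT table_schema, SUM(data_length + index_length) / 1e9 AS size_gb
    FROM information_schema.tables
    GROUP BY table_schema
    ORDER BY size_gb DESC
    """
)
schema_sizes = {schema: float(size_gb) for schema, size_gb in cursor}
print(schema_sizes)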

@jkbhagatio (Member)

Ah yes, I'm happy to downsample the wheel data to 50 Hz as something quick we can do immediately.

@lochhh (Contributor) commented Aug 28, 2024

@ttngu207 proposed two possible ways to downsample the wheel data:

  • downsample as part of ingestion and store the result in processed (but this is inconsistent with the current "read-only" approach)
  • downsample and store in processed before ingestion

@ttngu207 feel free to expand/correct my comment here

@ttngu207 (Contributor, Author)

Hi @jkbhagatio,
We discussed this issue a bit more with @lochhh and @anayapouget today.

Downsampling the wheel data to 50 Hz is a good immediate step we can take to reduce storage. I think there are two ways to accomplish this:

  1. Add downsampling logic at the DataJoint ingestion step.
    • Pros: simple, easy
    • Cons: mismatch between how the data are stored in the files and how they are stored in DataJoint. For all of the tables in streams we have maintained this consistency, so breaking it here is something to think about.
  2. Create another set of downsampled wheel-data files in the processed folder - these are what would get ingested.
    • Pros: zero change needed to the current ingestion routine, and the consistency between files and ingested data is maintained.
    • Cons: an additional routine is needed to do the downsampling and store the downsampled files in processed (perhaps together with, or after, the data copy step).

@ttngu207 (Contributor, Author) commented Aug 28, 2024

Technically, at the streams level, we want to maintain consistency between the ingested data and the "read" data - i.e. the dataframe returned by the StreamReader class (not strictly the data in the raw files).
So perhaps a 3rd option to consider is to update the StreamReader for the wheel data (io.reader.Encoder) to take a downsampling keyword argument (or downsampled_rate), and use that option for ingestion.

This approach would also require zero change to the current ingestion flow - only an update to the reader in aeon.schemas.schema - and it would still maintain consistency between the data returned by the reader and the data ingested.

@ttngu207 (Contributor, Author)

@jkbhagatio @glopesdev
Let us know your thoughts on this; if it sounds like a reasonable approach, I'll start the implementation.

Currently, data ingestion for social0.4 is paused because of this storage problem.

@glopesdev (Contributor)

@ttngu207 for the wheel data, option 3 sounds preferable for now, especially since it avoids data duplication.

For the video CSV it's a bit harder to imagine a compression scheme, especially since the numbers suggest the stored data is already in binary form. The only other option would be to store only timestamps and table indices, and pull the data on demand post-query if needed.

I can push to implement the downsampling parameter ASAP for wheels since it seems this would unblock ingestion the fastest.

@ttngu207 (Contributor, Author)

@glopesdev agreed, I also like option 3 best. The video CSV is definitely trickier; we can deal with that later.
It would be great if we could prioritize the downsampling parameter in io.reader.Encoder, as that would allow ingestion to resume.

@jerlich (Contributor) commented Aug 29, 2024

What is the underlying database engine? InnoDB? Is page compression enabled? You can also get savings depending on the underlying file system.
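
For context, checking the engine and enabling InnoDB page compression on a large table could look roughly like this (a rough sketch only: the table name is a placeholder, and transparent page compression also requires innodb_file_per_table plus a filesystem with hole-punching support):

import datajoint as dj

conn = dj.conn()

# Check the engine and row format of the largest tables in the streams schema.
cursor = conn.query(
    """
    SELECT table_name, engine, row_format
    FROM information_schema.tables
    WHERE table_schema = 'aeon_streams'
    ORDER BY data_length + index_length DESC
    LIMIT 5
    """
)
print(list(cursor))

# Placeholder table name; rebuilding the table applies transparent page
# compression (InnoDB, MySQL 5.7+).
table = "some_large_table"  # substitute the actual largest table
conn.query(f"ALTER TABLE `aeon_streams`.`{table}` COMPRESSION='zlib'")
conn.query(f"OPTIMIZE TABLE `aeon_streams`.`{table}`")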

@jkbhagatio (Member)

@ttngu207 @glopesdev, I'm also happy with option 3 mentioned above.

@glopesdev (Contributor)

@jkbhagatio from our meeting today we were converging on a downsampling strategy and API for the wheel data. Since the encoder reports absolute wheel position, we can simply decimate the data to 50 Hz by sampling the raw stream at fixed 20 ms intervals inside each 1 h chunk (no filtering needed).

This would be implemented as a property at the level of the reader.Encoder class so it can be toggled in the schema (or we can set the default to True).
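
For illustration, a hand-rolled sketch of that decimation with pandas (not the actual reader implementation; data is assumed to be the time-indexed dataframe returned by the encoder reader):

import pandas as pd

def decimate_to_50hz(data: pd.DataFrame) -> pd.DataFrame:
    """Keep one raw sample per fixed 20 ms bin (50 Hz); no filtering is needed
    because the encoder reports absolute position rather than a rate."""
    return data.resample("20ms").first().dropna(how="all")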

For the video metadata, one possibility we discussed to gain some space immediately is to drop the hw_counter column from the ingested data. We should be able to reconstruct dropped frames at both the TTL and software levels by comparing the camera timestamp directly with the harp timestamp, and it would give us back around a third of the space.

If this sounds good I can submit a PR soon.

@jkbhagatio (Member)

@glopesdev this sounds good regarding the changes to Encoder; happy for that to go ahead immediately. Regarding the video metadata, can we discuss the hw_counter in a bit more detail?

@glopesdev (Contributor) commented Sep 6, 2024

@ttngu207 @jkbhagatio @lochhh I have a working implementation of the downsampling operation at 3340df2.

However, I am now conflicted as to whether having downsample as a property of the reader / stream is a good idea. This came up because of testing: the few tests we have of the low-level API were actually using encoder data from a chunk with non-monotonic timestamps, and after resampling the behavior changes.

This makes sense, but at the same time it made me more aware that some of the more subtle timestamping issues were only found by carefully inspecting this low-level raw data, and I would like to retain the ability to easily query the raw data if necessary.

The problem is that if we implement downsample as a property at the level of the schema stream, then whoever calls aeon.load with that stream has no way to see the raw data, since the flag is essentially frozen at schema creation time.

Therefore I would like to provide a way to override reader behavior at aeon.load time as well, when necessary. A general way to support this would be to have readers accept optional **kwargs arguments, and have aeon.load forward any kwargs it receives in its own call to the corresponding reader (which is itself passed in as a function argument).

This would solve my dilemma with the encoder testing and future debugging, and more generally gives us a straightforward way to add reader options that can be toggled at aeon.load time, rather than only by configuring the reader at schema creation time.

As an illustration, this would give us the following signatures and call (downsampling by default):

Reader

    def read(self, file, downsample=True):
        """Reads encoder data from the specified Harp binary file, and optionally downsamples
        the frequency to 50Hz.
        """

Load

# if we want to look at raw data we can optionally clear the downsample flag
patch_wheel = aeon.load(root, exp02.Patch2.Encoder, epoch=behavior_epoch, downsample=False)

Any thoughts on this? I've opened a PR with a proposed implementation.
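
For concreteness, the forwarding inside load could look roughly like this (a simplified sketch, not the code in the PR; the glob over reader.pattern stands in for the real chunk-discovery logic):

from pathlib import Path
import pandas as pd

def load(root, reader, **kwargs):
    """Sketch: find the chunk files matching the reader's pattern and forward any
    extra keyword arguments (e.g. downsample=False) to the reader's read method."""
    files = sorted(Path(root).glob(f"**/{reader.pattern}*.bin"))  # simplified chunk discovery
    return pd.concat([reader.read(file, **kwargs) for file in files])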

@ttngu207 (Contributor, Author) commented Sep 6, 2024

@glopesdev Thanks, the implementation looks good to me.

I'm fine with downsample being an additional argument for load, but it feels a bit one-off and may detract from the elegance of the current implementation/usage of .load() (or not).

How about downsample remaining a property of the reader, with a get/set method added for this property so it can be changed on the fly before calling load() (rather than being frozen at schema creation time)?
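
Something along these lines, as a sketch of that alternative (an illustrative stand-in, not the real io.reader.Encoder):

class Encoder:
    """Illustrative reader stand-in with a downsample flag that can be toggled on the fly."""

    def __init__(self, pattern, downsample=True):
        self.pattern = pattern
        self._downsample = downsample

    @property
    def downsample(self):
        """Whether read() decimates the stream to 50 Hz."""
        return self._downsample

    @downsample.setter
    def downsample(self, value):
        self._downsample = value

# e.g. toggle on the schema's reader before loading, instead of passing a flag through load():
# exp02.Patch2.Encoder.downsample = False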

@ttngu207 (Contributor, Author) commented Sep 6, 2024

I've reviewed the PR; allowing additional kwargs to be passed through to load() is also very clean and potentially very useful!

@jkbhagatio (Member) commented Sep 6, 2024

Hi @glopesdev, this makes sense to me conceptually, but I might suggest using a specific arg name (that takes a dict) instead of kwargs, to differentiate args passed to load itself from args passed to read via load.

I think this also implies that the base class Reader's read should accept this new dict param, which I guess can default to None or an empty dict?
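
As a sketch of the shape that suggestion could take (again illustrative, not a concrete patch for the PR):

from pathlib import Path
import pandas as pd

def load(root, reader, read_kwargs=None):
    """Sketch: a single dict parameter carries reader-specific options, keeping
    arguments meant for load itself separate from those forwarded to read."""
    read_kwargs = read_kwargs or {}
    files = sorted(Path(root).glob(f"**/{reader.pattern}*.bin"))  # simplified chunk discovery
    return pd.concat([reader.read(file, **read_kwargs) for file in files])

# e.g. patch_wheel = load(root, exp02.Patch2.Encoder, read_kwargs={"downsample": False})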

@glopesdev (Contributor)

@jkbhagatio can you review directly in the PR? It's just easier to iterate on the code implementation there.
