Optimization of database storage #396
Comments
I'm sharing here the current size (in GB) of each schema in the database.
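(For reference, per-schema sizes like these can be computed from `information_schema`; below is a minimal sketch assuming a pymysql connection, with host and credentials as placeholders rather than the actual server configuration.)

```python
# Minimal sketch: report per-schema size in GB from information_schema.
# Host/user/password are placeholders, not the real aeon-db2 credentials.
import pymysql

conn = pymysql.connect(host="aeon-db2", user="reader", password="...")
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT table_schema,
               ROUND(SUM(data_length + index_length) / POW(1024, 3), 2) AS size_gb
        FROM information_schema.tables
        GROUP BY table_schema
        ORDER BY size_gb DESC
        """
    )
    for schema, size_gb in cur.fetchall():
        print(f"{schema}: {size_gb} GB")
```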
Ah yes, I'm happy to downsample the wheel data to 50 Hz as something quick we can do immediately.
Hi @jkbhagatio, downsampling the wheel data to 50 Hz is a good immediate step we can take to optimize the storage. On this, I think there are a couple of ways to accomplish it.
Technically, this approach would also require zero change in the current ingestion flow; we would only need to update the reader.
@jkbhagatio @glopesdev Currently, data ingestion for social0.4 is paused due to this storage problem.
@ttngu207 for the wheel data, option 3 sounds preferable for now, especially since it avoids data duplication. For the video CSV it's a bit harder to imagine a compressed scheme, especially since it seems from the numbers that the stored data is already in binary form. The only other option would be to save only timestamps and table indices, and pull the data on demand post-query if needed. I can push to implement the downsampling parameter ASAP for the wheels, since it seems this would unblock ingestion the fastest.
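(For illustration only, the "timestamps and table indices" idea could look something like the sketch below; the helper, column handling, and file name are hypothetical, not part of the existing pipeline.)

```python
# Hypothetical sketch: keep only timestamps/row indices in the database and
# pull the full video metadata rows from the raw chunk files post-query.
import pandas as pd

def fetch_video_metadata(chunk_file, row_indices):
    """Read only the requested rows of a chunk's video metadata file."""
    frames = pd.read_csv(chunk_file)   # raw per-chunk metadata on disk
    return frames.iloc[row_indices]    # materialize just the queried rows

# usage: after querying timestamps + indices from the database
# rows = fetch_video_metadata("CameraTop_2024-02-25T12-00-00.csv", [0, 100, 200])
```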
@glopesdev agreed, I also like option 3 best. The video CSV is definitely trickier; we can deal with that later.
What is the underlying database engine? InnoDB? Is page compression enabled? You can also get savings depending on the underlying file system.
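(For reference, checking which engine and row format the tables currently use could look like the sketch below, again with pymysql and placeholder credentials; the schema name here is illustrative.)

```python
# Sketch: inspect which storage engine and row format the streams tables use,
# to see whether InnoDB page/row compression is already in play.
import pymysql

conn = pymysql.connect(host="aeon-db2", user="reader", password="...")
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT table_name, engine, row_format, create_options
        FROM information_schema.tables
        WHERE table_schema = 'aeon_streams'  -- illustrative schema name
        """
    )
    for name, engine, row_format, options in cur.fetchall():
        print(name, engine, row_format, options)
```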
@ttngu207 @glopesdev, I'm also happy with option 3 mentioned above.
@jkbhagatio from our meeting today we were converging on a downsampling strategy and API for the wheel data. Since the encoder reports absolute wheel position, we can simply decimate the data to 50 Hz by sampling the raw stream at fixed 20 ms intervals inside each 1 h chunk (no need for filtering). This would be implemented as a property at the level of the reader.

For the video metadata, one possibility we discussed to immediately gain some space is to drop one of the stored columns.

If this sounds good I can submit a PR soon.
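(For illustration, the decimation described above amounts to something like the following pandas sketch; this is not the actual implementation, just a minimal example over a time-indexed series of absolute wheel angle.)

```python
# Sketch of the proposed decimation: because the encoder reports absolute wheel
# position, we can keep one raw sample per fixed 20 ms bin (50 Hz), no filtering.
import pandas as pd

def decimate_to_50hz(angle: pd.Series) -> pd.Series:
    """Keep the first raw sample falling in each 20 ms interval of the chunk.

    `angle` is assumed to be indexed by timestamp (DatetimeIndex).
    """
    return angle.resample("20ms").first().dropna()
```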
@glopesdev this sounds good regarding the proposed changes.
@ttngu207 @jkbhagatio @lochhh I have a working implementation of the downsampling operation at 3340df2. However, I am now conflicted as to whether downsampling should be the reader's only behaviour. This makes sense, but at the same time it made me more aware that some of the more subtle issues with timestamping came from carefully inspecting this low-level raw data, and I would like to retain the ability to easily query the raw data if necessary. The problem is that if we downsample unconditionally in the reader, we lose that easy access to the raw stream.

Therefore I would like to provide a way to override reader behaviour at load time. This would solve my dilemma with the encoder testing and future debugging, but also in general gives us a straightforward way to have options in readers which can be toggled at load time.

As an illustration, this would give us the following signatures and call (downsampling by default):

Reader

```python
def read(self, file, downsample=True):
    """Reads encoder data from the specified Harp binary file, and optionally downsamples
    the frequency to 50Hz.
    """
```

Load

```python
# if we want to look at raw data we can optionally clear the downsample flag
patch_wheel = aeon.load(root, exp02.Patch2.Encoder, epoch=behavior_epoch, downsample=False)
```

Any thoughts on this? I've opened a PR with a proposed implementation.
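(For illustration, threading such a flag from the top-level load call down to the reader amounts to something like the sketch below; this is not the real load function, whose signature may differ.)

```python
# Sketch only: any extra keyword arguments given at load time are forwarded
# unchanged to reader.read, so per-reader options like downsample can be
# toggled per call without touching the default ingestion path.
import pandas as pd

def load(reader, chunk_files, **kwargs):
    return pd.concat([reader.read(file, **kwargs) for file in chunk_files])

# e.g. raw wheel data on demand:
# patch_wheel = load(exp02.Patch2.Encoder, files, downsample=False)
```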
@glopesdev Thanks, the implementation looks good to me, and I'm fine with the proposed approach.
I've reviewed the PR; allowing any additional keyword arguments to be passed through to the reader seems reasonable.
Hi @glopesdev, this makes sense to me conceptually, but I might suggest using a specific arg name (that will be a dict) instead of kwargs, to differentiate args that will be passed to the reader from the other load arguments. I think this also implies the base reader class would need updating to accept it.
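(For comparison, the dict-argument alternative might look something like the sketch below; the argument name `reader_kwargs` is just a placeholder, not an agreed API.)

```python
# Illustrative sketch of the suggested alternative: a dedicated dict argument
# (here called reader_kwargs) instead of open-ended **kwargs, so arguments
# meant for the reader are clearly separated from load's own arguments.
import pandas as pd

def load(reader, chunk_files, reader_kwargs=None):
    reader_kwargs = reader_kwargs or {}
    return pd.concat([reader.read(file, **reader_kwargs) for file in chunk_files])

# patch_wheel = load(exp02.Patch2.Encoder, files, reader_kwargs={"downsample": False})
```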
@jkbhagatio can you review directly in the PR? It's just easier to iterate on the code implementation there.
Data size stored in the MySQL server (on aeon-db2) is approaching 400 GB. This will hit the storage size limit of the aeon-db2 server, at which point we will need to request a disk size increase. I'd like to discuss potential solutions to better engineer/optimize for storage space in our next data architecture meeting.

The bulk of the storage is in the streams schema, where we extract and store the data from the relevant streams. Among these, the majority of the storage is in UndergroundFeederEncoder (the wheel data, ~100 GB) and SpinnakerVideoSourceVideo (~80 GB).