BWM to NWB conversion #841

Open · 9 tasks

oliche opened this issue Sep 13, 2024 · 0 comments

- [ ] Conversion should work without streaming, with local data on SDSC
- [ ] Perform a simple read-after-write check (number of samples, number of trials, etc.); see the sketch after this list
- [ ] Perform an online read-after-write check on DANDI
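
A minimal sketch of such a read-after-write check, assuming pynwb, with the expected counts coming from the source session (the function name and `expected_*` arguments are illustrative):

```python
from pynwb import NWBHDF5IO

def check_nwb_roundtrip(nwb_path, expected_n_trials, expected_n_units):
    """Read the converted file back and compare basic counts against the source session."""
    with NWBHDF5IO(nwb_path, mode="r", load_namespaces=True) as io:
        nwbfile = io.read()
        n_trials = len(nwbfile.trials)  # row count of the trials table
        n_units = len(nwbfile.units)    # number of sorted units
    assert n_trials == expected_n_trials, f"trials: {n_trials} != {expected_n_trials}"
    assert n_units == expected_n_units, f"units: {n_units} != {expected_n_units}"
```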

https://github.com/int-brain-lab/IBL-to-nwb

Rationale

We have an instance of the BWM dataset on DANDI here: https://dandiarchive.org/dandiset/000409?search=409&pos=1
And the conversion was done by Catalyst Neuro with this script: https://github.com/catalystneuro/IBL-to-nwb/tree/main/ibl_to_nwb

We have the following requirements:

- [ ] sessions from the late 2023 release of the BWM are uploaded
- [ ] the new spike sorting revision is uploaded
- [ ] Allen coordinates and regions are accurate and match the ones in ONE
- [ ] the raw electrophysiology is synchronized to the behaviour in NWB format (the spiking data should already be)
- [ ] accessing the behaviour and spikes only takes a reasonable amount of time (on the order of tens of seconds) and does not require an inordinate amount of disk space (i.e. we do not want to be streaming a few trials from 120 GB blank files); see the streaming sketch after this list
- [ ] the COSYNE tutorial works with the newer data
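
As a sanity check for the access-time requirement, something along these lines could time a partial remote read of one converted asset, assuming the `dandi`, `remfile`, `h5py` and `pynwb` packages (the asset path below is a placeholder):

```python
import time
import h5py
import remfile
from dandi.dandiapi import DandiAPIClient
from pynwb import NWBHDF5IO

# resolve a streaming URL for one asset of dandiset 000409 (asset path is a placeholder)
with DandiAPIClient() as client:
    asset = client.get_dandiset("000409", "draft").get_asset_by_path(
        "sub-XXX/sub-XXX_ses-XXX_behavior+ecephys.nwb"  # placeholder, not a real asset
    )
    url = asset.get_content_url(follow_redirects=1, strip_query=True)

# time how long it takes to stream just the trials table
t0 = time.time()
with NWBHDF5IO(file=h5py.File(remfile.File(url), "r"), load_namespaces=True) as io:
    nwbfile = io.read()
    trials = nwbfile.trials.to_dataframe()
print(f"loaded {len(trials)} trials in {time.time() - t0:.1f} s")
```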

Since the BWM paper response to reviewers is due at the end of September, we want to revise this Dandiset by then.

How

Loop over sessions, per session:
1. fetch everything
2. write into .nwb
3. store
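
A minimal sketch of that loop, assuming ONE for local data access on SDSC; the `convert_session` entry point is illustrative, the actual converter lives in the IBL-to-nwb repository linked above:

```python
from pathlib import Path
from one.api import ONE

from ibl_to_nwb import convert_session  # illustrative import: check the repo for the real entry point

one = ONE()  # configured for local data on SDSC, no streaming
output_dir = Path("/path/to/nwb_out")  # placeholder output location

for eid in one.search(project="brainwide"):  # hypothetical filter for BWM sessions
    # 1. fetch everything + 2. write into .nwb
    nwb_path = convert_session(one=one, eid=eid, output_dir=output_dir)
    # 3. store, e.g. stage the file for the DANDI upload
    print(f"{eid} -> {nwb_path}")
```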

We will run this from the San Diego Supercomputer Center (SDSC) or from AWS.

Discussions about splitting files and metadata and decisions to be made

Briefly, what I was discussing with Ryan centered on how to logically group data and metadata, and on the flexibility the NWB format allows here. I was arguing that, for me, the most intuitive structure would be one with three levels based on the needs of the person accessing the data: "acquisition related", "raw", and "processed".

- The person that wants to analyze the data should not need to worry about anything acquisition-related or unprocessed, and should have easy access to spikes, aligned behaviour, extracted info, etc.
- The person that wants to reanalyze the raw data, for example with a newer algorithm, needs access one level deeper, but does not necessarily care about acquisition details such as amplifier settings and hardware information.
- The person that ultimately wants to replicate the experiment needs knowledge of the devices and all the contextual metadata that went into creating the experiment.

I guess alternative logical groupings could be device-centered, where for example all raw data comes from devices (which carry all their metadata), and all processed data then sits in the hierarchy under the raw data, etc.

The data I was working with (the draft from DANDI) mostly follows the analysis-centric approach I described above, but not entirely. I was wondering what you think of this, what the discussion within the IBL has been, and what I should aim for in the conversion process. Basically, where do I start? :)

Converting our data to DANDI is really an outreach effort, and the purpose is to reach users. As such, the user-centric way to organize the data you propose (analysts, methods engineers) makes sense.
In practice there are two main difficulties: data size and metadata complexity.
For data size, the post-processed data represents 4% of our current data footprint, and this is what most neuroscientists are interested in. The remaining method weirdos (of which I am a part) will have a much bigger data footprint, looking at raw videos and recordings. Here I suggest splitting the NWB files in three to address this:

- one neuroscientist package with groomed spikes, behaviour and brain regions
- raw electrophysiology data (AP + LF)
- raw video data

I am open to splitting the raw data more or less finely: we could have AP, LF and video separately, for instance; it all depends on how easy it is to access only part of the files. For example, it would be frustrating to have to download all of the AP data only to look at the LF (which represents 1/13 of the raw ephys data size), which many people will be interested in.
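
For illustration, a hypothetical per-session split could look like this (the naming is a sketch, not the final DANDI layout):

```python
# hypothetical per-session file split; names are a sketch, not the final DANDI layout
session = "sub-XXX_ses-XXX"  # placeholder session label
packages = {
    "processed": f"{session}_desc-processed_behavior+ecephys.nwb",  # groomed spikes, behaviour, regions
    "raw_ecephys": f"{session}_desc-raw_ecephys.nwb",               # AP + LF bands
    "raw_video": f"{session}_desc-raw_video.nwb",                   # camera streams
}
```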

For the metadata, it is small but very complex and time-consuming. Here I would stick with what Catalyst did and make sure we link the protocols and documentation we have written and published.
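
A minimal pynwb sketch of carrying those links in the file-level metadata; `protocol` and `related_publications` are existing NWBFile fields, and the DOIs below are placeholders:

```python
from datetime import datetime, timezone
from pynwb import NWBFile

nwbfile = NWBFile(
    session_description="IBL brain-wide map session",
    identifier="placeholder-session-uuid",
    session_start_time=datetime(2024, 1, 1, tzinfo=timezone.utc),
    protocol="https://doi.org/10.xxxx/placeholder-protocol",          # placeholder protocol DOI
    related_publications=["https://doi.org/10.xxxx/placeholder-bwm"], # placeholder publication DOI
)
```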
