BWM to NWB conversion #841

Open · 9 tasks

oliche opened this issue Sep 13, 2024 · 0 comments

- [ ] Conversion should work without streaming, with local data on SDSC
- [ ] Perform a simple read-after-write check (number of samples, number of trials, etc.); see the sketch after this list
- [ ] Perform an online read-after-write check on DANDI
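
A minimal sketch of such a read-after-write check, assuming pynwb, with the expected counts coming from the source session (the function name and `expected_*` arguments are illustrative):

```python
from pynwb import NWBHDF5IO

def check_nwb_roundtrip(nwb_path, expected_n_trials, expected_n_units):
    """Read the converted file back and compare basic counts against the source session."""
    with NWBHDF5IO(nwb_path, mode="r", load_namespaces=True) as io:
        nwbfile = io.read()
        n_trials = len(nwbfile.trials)  # row count of the trials table
        n_units = len(nwbfile.units)    # number of sorted units
    assert n_trials == expected_n_trials, f"trials: {n_trials} != {expected_n_trials}"
    assert n_units == expected_n_units, f"units: {n_units} != {expected_n_units}"
```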

https://github.com/int-brain-lab/IBL-to-nwb

Rationale

We have an instance of the BWM dataset on DANDI here: https://dandiarchive.org/dandiset/000409?search=409&pos=1
And the conversion was done by Catalyst Neuro with this script: https://github.com/catalystneuro/IBL-to-nwb/tree/main/ibl_to_nwb

We have the following requirements:

- [ ] sessions from the late 2023 release of the BWM are uploaded
- [ ] the new spike sorting revision is uploaded
- [ ] Allen coordinates and regions are accurate and match the ones in ONE
- [ ] the raw electrophysiology is synchronized to the behaviour in NWB format (the spiking data should already be)
- [ ] accessing the behaviour and spikes only takes a reasonable amount of time (on the order of tens of seconds) and does not require an inordinate amount of disk space (i.e. we do not want to be streaming a few trials from 120 GB blank files); see the streaming sketch after this list
- [ ] the COSYNE tutorial works with the newer data
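
As a sanity check for the access-time requirement, something along these lines could time a partial remote read of one converted asset, assuming the `dandi`, `remfile`, `h5py` and `pynwb` packages (the asset path below is a placeholder):

```python
import time
import h5py
import remfile
from dandi.dandiapi import DandiAPIClient
from pynwb import NWBHDF5IO

# resolve a streaming URL for one asset of dandiset 000409 (asset path is a placeholder)
with DandiAPIClient() as client:
    asset = client.get_dandiset("000409", "draft").get_asset_by_path(
        "sub-XXX/sub-XXX_ses-XXX_behavior+ecephys.nwb"  # placeholder, not a real asset
    )
    url = asset.get_content_url(follow_redirects=1, strip_query=True)

# time how long it takes to stream just the trials table
t0 = time.time()
with NWBHDF5IO(file=h5py.File(remfile.File(url), "r"), load_namespaces=True) as io:
    nwbfile = io.read()
    trials = nwbfile.trials.to_dataframe()
print(f"loaded {len(trials)} trials in {time.time() - t0:.1f} s")
```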

Since the BWM paper response to reviewers is due at the end of September, we want to revise this Dandiset by then.

How

Loop over sessions, per session:
1. fetch everything
2. write into .nwb
3. store
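
A minimal sketch of that loop, assuming ONE for local data access on SDSC; the `convert_session` entry point is illustrative, the actual converter lives in the IBL-to-nwb repository linked above:

```python
from pathlib import Path
from one.api import ONE

from ibl_to_nwb import convert_session  # illustrative import: check the repo for the real entry point

one = ONE()  # configured for local data on SDSC, no streaming
output_dir = Path("/path/to/nwb_out")  # placeholder output location

for eid in one.search(project="brainwide"):  # hypothetical filter for BWM sessions
    # 1. fetch everything + 2. write into .nwb
    nwb_path = convert_session(one=one, eid=eid, output_dir=output_dir)
    # 3. store, e.g. stage the file for the DANDI upload
    print(f"{eid} -> {nwb_path}")
```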

We will run this from the San Diego Supercomputer Center (SDSC) or from AWS.

Discussions about splitting files and metadata and decisions to be made

Briefly, what I was discussing with Ryan centered on how to logically group data and metadata, and on the flexibility the NWB format allows here. I was arguing that, for me, the most intuitive structure would be one with three levels based on the needs of the person accessing the data: "acquisition related", "raw", and "processed".

- The person that wants to analyze the data should not need to worry about anything acquisition-related or unprocessed, and should have easy access to spikes, aligned behaviour, extracted info, etc.
- The person that wants to reanalyze the raw data, for example with a newer algorithm, needs access one level deeper, but does not necessarily care about acquisition details such as amplifier settings and hardware information.
- The person that ultimately wants to replicate the experiment needs knowledge of the devices and all the contextual metadata that went into creating the experiment.

I guess alternative logical groupings could be device-centered, where for example all raw data comes from devices (which carry all their metadata), and all processed data then sits in the hierarchy under the raw data, etc.

The data I was working with (the draft from DANDI) mostly follows the analysis-centric approach I described above, but not entirely. I was wondering what you think of this, what the discussion within the IBL has been, and what I should aim for in the conversion process. Basically, where do I start? :)

Converting our data to DANDI is really an outreach effort, and the purpose is to reach users. As such, the user-centric way to organize the data you propose (analysts, methods engineers) makes sense.
In practice there are two main difficulties: data size and metadata complexity.
For data size, the post-processed data represents 4% of our current data footprint, and this is what most neuroscientists are interested in. The remaining method weirdos (of which I am a part) will have a much bigger data footprint, looking at raw videos and recordings. Here I suggest splitting the NWB files in three to address this:

- one neuroscientist package with groomed spikes, behaviour and brain regions
- raw electrophysiology data (AP + LF)
- raw video data

I am open to splitting the raw data more or less finely: we could have AP, LF and video separately, for instance; it all depends on how easy it is to access only part of the files. For example, it would be frustrating to have to download all of the AP data only to look at the LF (which represents 1/13 of the raw ephys data size), which many people will be interested in.
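
For illustration, a hypothetical per-session split could look like this (the naming is a sketch, not the final DANDI layout):

```python
# hypothetical per-session file split; names are a sketch, not the final DANDI layout
session = "sub-XXX_ses-XXX"  # placeholder session label
packages = {
    "processed": f"{session}_desc-processed_behavior+ecephys.nwb",  # groomed spikes, behaviour, regions
    "raw_ecephys": f"{session}_desc-raw_ecephys.nwb",               # AP + LF bands
    "raw_video": f"{session}_desc-raw_video.nwb",                   # camera streams
}
```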

For the metadata, it is small but very complex and time-consuming. Here I would stick with what Catalyst did and make sure we link the protocols and documentation we have written and published.
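
A minimal pynwb sketch of carrying those links in the file-level metadata; `protocol` and `related_publications` are existing NWBFile fields, and the DOIs below are placeholders:

```python
from datetime import datetime, timezone
from pynwb import NWBFile

nwbfile = NWBFile(
    session_description="IBL brain-wide map session",
    identifier="placeholder-session-uuid",
    session_start_time=datetime(2024, 1, 1, tzinfo=timezone.utc),
    protocol="https://doi.org/10.xxxx/placeholder-protocol",          # placeholder protocol DOI
    related_publications=["https://doi.org/10.xxxx/placeholder-bwm"], # placeholder publication DOI
)
```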
