
Review preprocessing. #135

Open

JoeZiminski wants to merge 3 commits into reviewed_code from rev_preprocessing
Conversation


@JoeZiminski JoeZiminski commented Nov 16, 2023

This PR contains the code for running preprocessing on the loaded spikeinterface recording objects.

On data load, the sessions / runs for a single subject are loaded into the data_class.preprocessing.PreprocessingData UserDict in a nested dictionary e.g.:

{
    "ses1-name": {
        "run1-name": {"0-raw": Recording},
        "run2-name": {"0-raw": Recording},
    },
    "ses2-name": {
        ...
    },
}
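The nested layout above could be initialised with something like the following. This is only a minimal sketch of `data_class.preprocessing.PreprocessingData` (the class name suffix, constructor signature and string stand-in for `Recording` are illustrative, not the real API):

```python
from collections import UserDict


class PreprocessingDataSketch(UserDict):
    """Minimal stand-in for data_class.preprocessing.PreprocessingData:
    a two-layer dict of sessions -> runs -> {step-key: recording}."""

    def __init__(self, sessions_and_runs):
        super().__init__()
        for ses_name, run_names in sessions_and_runs.items():
            for run_name in run_names:
                # Every run starts with only the raw recording, keyed "0-raw".
                self.data.setdefault(ses_name, {})[run_name] = {
                    "0-raw": f"Recording({run_name})"
                }


data = PreprocessingDataSketch({"ses1-name": ["run1-name", "run2-name"]})
```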

Now, in the preprocessing step, SpikeInterface functions as defined in the config yaml are applied to the data and stored on the object in a specific numbered format. e.g. if phase_shift and bandpass_filter are applied, the dictionary becomes:

{
    "ses1-name": {
        "run1-name": {
            "0-raw": Recording,
            "1-raw-phase_shift": Recording,
            "2-raw-phase_shift-bandpass_filter": Recording,
        },
        ...
    },
    ...
}

This is handled in the function _fill_run_data_with_preprocessed_recording.
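The numbered keys above follow a simple pattern, which could be generated along these lines. This is a guess at the naming logic in `_fill_run_data_with_preprocessed_recording`, not the function itself (`make_preprocessed_key` is a hypothetical helper name):

```python
def make_preprocessed_key(previous_key: str, step_num: int, step_name: str) -> str:
    """Build the next numbered dict key from the previous one,
    e.g. "0-raw" + "phase_shift" -> "1-raw-phase_shift"."""
    # Drop the old step number, keep the chain of applied step names.
    _, *applied = previous_key.split("-")
    return "-".join([str(step_num), *applied, step_name])


key1 = make_preprocessed_key("0-raw", 1, "phase_shift")
key2 = make_preprocessed_key(key1, 2, "bandpass_filter")
```

Note this relies on step names themselves containing no hyphens (e.g. `phase_shift`, `bandpass_filter` use underscores).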

After this, the data is saved to disk by calling a method on the preprocessing class itself.

In pipeline/preprocess.py, the functions in the section "Helpers for preprocessing steps dictionary" check the steps and run the actual preprocessing in SpikeInterface.
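The overall flow of applying config-defined steps could be sketched like this. The toy step functions here stand in for the real SpikeInterface preprocessing calls, and the config shape is an assumption based on the numbered-key format described above:

```python
def run_pp_steps(recording, pp_steps, step_functions):
    """Apply numbered preprocessing steps in order.

    pp_steps maps step numbers to (step_name, kwargs), mimicking
    the structure of the preprocessing config yaml.
    """
    for num in sorted(pp_steps, key=int):
        name, kwargs = pp_steps[num]
        recording = step_functions[name](recording, **kwargs)
    return recording


# Toy stand-ins operating on a label string instead of a Recording.
steps = {
    "phase_shift": lambda rec: f"{rec}+phase_shift",
    "bandpass_filter": lambda rec, freq_min, freq_max: f"{rec}+bp({freq_min},{freq_max})",
}
config = {
    "1": ("phase_shift", {}),
    "2": ("bandpass_filter", {"freq_min": 300, "freq_max": 6000}),
}
out = run_pp_steps("raw", config, steps)
```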

In utils, there is a set of functions for handling the PreprocessingData UserDict. Most of these are called later during sorting / postprocessing, but we may want to move some things around, so let me know if anything looks out of place; maybe some of these functions should live on base.py.
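For example, `update_two_layer_dict` (used in `__post_init__`, see the diff below) probably behaves roughly like this. This is a guess at its behaviour from the call sites, not the real implementation:

```python
def update_two_layer_dict(dict_, ses_name, run_name, value):
    """Set `value` at dict_[ses_name][run_name], creating the
    session-level dict if it does not exist yet."""
    dict_.setdefault(ses_name, {})[run_name] = value


d = {}
update_two_layer_dict(d, "ses1-name", "run1-name", {"0-raw": None})
```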

Running the script

The script example_preprocessing.py is added in examples. The path to the data used in the example is on /ceph and should be accessible from your user account (let me know if not). To test, clone the repository to ceph, install spikewrap and run example_preprocessing.py; this should run the pipeline (assuming the data is accessible).

Some General Notes / Questions

  1. Currently preprocess_data first contains the raw data and is then updated in-place with the preprocessed data. So it is in fact unpreprocessed at first, and only filled with preprocessed data afterwards. It is called preprocess_data at all stages, but maybe the naming should reflect the change; I'm not sure how confusing this currently is.

  2. PreprocessingData is a UserDict (i.e. dictionary with user-defined functions). In the docstrings I can't decide whether to refer to it as a dict, class, object, userdict 🤷‍♂️

These two points are not specifically related to this PR:

  1. The wrapping logic for SLURM is now much clearer (although removed from this PR). Next I tried avoiding use of locals() because it is not robust, but it leads to very long function signatures, which also duplicate arguments, which is buggy in its own right (e.g. may forget to update all instances when adding a new argument). Based on user feedback it is useful to print all passed arguments, in which case locals() also comes in handy for this. However it might be worth just making run_full_pipeline a class and storing all its kwargs as attributes. I was thinking to avoid a class entry point if possible but maybe now is the time for it.

e.g.

full_pipeline = RunFullPipeline(kwargs...)
full_pipeline.run()
full_pipeline.get_passed_arguments()
preprocessed_data = full_pipeline.get_preprocess_data()
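A minimal sketch of what that class might look like, storing the passed kwargs as attributes so they can be reported without locals() (the class and method names follow the pseudocode above; everything else here is illustrative):

```python
class RunFullPipeline:
    """Hypothetical class entry point: record every passed kwarg so
    the full argument set can be printed back to the user."""

    def __init__(self, **kwargs):
        self._passed_arguments = dict(kwargs)
        # Also expose each kwarg as an attribute for convenient access.
        for name, value in kwargs.items():
            setattr(self, name, value)

    def get_passed_arguments(self):
        return dict(self._passed_arguments)


full_pipeline = RunFullPipeline(base_path="path/to/rawdata", pp_steps="default")
```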
  2. After some experimenting yesterday I definitely agree HPC would also be good as a config only. I couldn't remember any of the arguments when trying to update some with a Dict, so kept having to refer to the config anyway. Also, it is likely there will only be a few standard settings that people will set up once and reuse, e.g. hpc="gpu-fast", hpc="cpu-large-mem".

Thanks and please let me know if you have any questions!

@JoeZiminski JoeZiminski changed the title Rev preprocessing Review preprocessing. Nov 16, 2023
@JoeZiminski JoeZiminski force-pushed the rev_preprocessing branch 5 times, most recently from 152ee9b to 84d5f79 Compare November 17, 2023 00:01
@JoeZiminski JoeZiminski changed the base branch from reviewed_code to fixes_from_load_data_review November 17, 2023 00:01
@JoeZiminski JoeZiminski force-pushed the rev_preprocessing branch 3 times, most recently from 0e743a9 to 38c3865 Compare November 17, 2023 00:14
Base automatically changed from fixes_from_load_data_review to reviewed_code November 17, 2023 09:09
paths to rawdata. The pp_steps attribute is set on
this class during execution of this function.

pp_steps: The name of valid preprocessing .yaml file (without the yaml extension).
@JoeZiminski JoeZiminski Nov 17, 2023

Actually, this can also take a path. In general this pattern is not generalisable, because when users pip install spikewrap, the configs will be hidden wherever pip installed spikewrap. Better to set up a config path at the start, possibly just a function spikewrap.set_config_path() that will store the path to use for configs in ~/.spikewrap.
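A rough sketch of what such a `set_config_path()` could do, assuming the chosen path is persisted as plain text in a settings file (`~/.spikewrap` by default; the `settings_file` parameter and `get_config_path` counterpart are hypothetical additions for illustration):

```python
from pathlib import Path


def set_config_path(config_path, settings_file=None):
    """Persist the user's configs directory so yaml names resolve
    there rather than inside the pip-installed package."""
    settings_file = settings_file or Path.home() / ".spikewrap"
    Path(settings_file).write_text(str(Path(config_path)))


def get_config_path(settings_file=None):
    """Read back the previously stored configs directory."""
    settings_file = settings_file or Path.home() / ".spikewrap"
    return Path(Path(settings_file).read_text())
```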

@@ -44,6 +49,19 @@ def __post_init__(self) -> None:
self.update_two_layer_dict(self, ses_name, run_name, {"0-raw": None})
self.update_two_layer_dict(self.sync, ses_name, run_name, None)

def set_pp_steps(self, pp_steps: Dict) -> None:

Maybe a docstring is not even required for this function.

available_files = glob.glob((config_dir / "*.yaml").as_posix())
available_files = [Path(path_).stem for path_ in available_files]

if name not in available_files: # then assume it is a full path
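Wrapped up, the lookup above amounts to something like the following sketch (function name is illustrative; the glob/stem logic mirrors the snippet under review):

```python
import glob
from pathlib import Path


def resolve_pp_steps(name, config_dir):
    """Treat `name` as a config name if a matching yaml exists in
    config_dir; otherwise assume it is a full path to a yaml file."""
    available_files = glob.glob((Path(config_dir) / "*.yaml").as_posix())
    available_files = [Path(path_).stem for path_ in available_files]

    if name in available_files:
        return Path(config_dir) / f"{name}.yaml"
    return Path(name)
```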
Change this, see comment below on setting a config directory.

recording object will be stored. The name of the dict entry will be
a concatenation of all preprocessing steps that were performed.

e.g. "0-raw", "0-raw_1-phase_shift_2-bandpass_filter"
Suggested change
e.g. "0-raw", "0-raw_1-phase_shift_2-bandpass_filter"
e.g. "0-raw", "2-raw-phase_shift-bandpass_filter"
