
Review preprocessing. #135

Open

JoeZiminski wants to merge 3 commits into reviewed_code from rev_preprocessing
Conversation


@JoeZiminski JoeZiminski commented Nov 16, 2023

This PR contains the code for running preprocessing on the loaded spikeinterface recording objects.

On data load, the sessions / runs for a single subject are loaded into the data_class.preprocessing.PreprocessingData UserDict in a nested dictionary e.g.:

{
    "ses1-name": {
        "run1-name": {"0-raw": Recording},
        "run2-name": {"0-raw": Recording},
    },
    "ses2-name": {
        ...
    },
}
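The nested layout above could be initialised with something like the following. This is only a minimal sketch of `data_class.preprocessing.PreprocessingData` (the class name suffix, constructor signature and string stand-in for `Recording` are illustrative, not the real API):

```python
from collections import UserDict


class PreprocessingDataSketch(UserDict):
    """Minimal stand-in for data_class.preprocessing.PreprocessingData:
    a two-layer dict of sessions -> runs -> {step-key: recording}."""

    def __init__(self, sessions_and_runs):
        super().__init__()
        for ses_name, run_names in sessions_and_runs.items():
            for run_name in run_names:
                # Every run starts with only the raw recording, keyed "0-raw".
                self.data.setdefault(ses_name, {})[run_name] = {
                    "0-raw": f"Recording({run_name})"
                }


data = PreprocessingDataSketch({"ses1-name": ["run1-name", "run2-name"]})
```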

Now, in the preprocessing step, SpikeInterface functions as defined in the config yaml are applied to the data and stored on the object in a specific numbered format. e.g. if phase_shift and bandpass_filter are applied, the dictionary becomes:

{
    "ses1-name": {
        "run1-name": {
            "0-raw": Recording,
            "1-raw-phase_shift": Recording,
            "2-raw-phase_shift-bandpass_filter": Recording,
        },
        ...
    },
    ...
}

This is handled in the function _fill_run_data_with_preprocessed_recording.
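The numbered keys above follow a simple pattern, which could be generated along these lines. This is a guess at the naming logic in `_fill_run_data_with_preprocessed_recording`, not the function itself (`make_preprocessed_key` is a hypothetical helper name):

```python
def make_preprocessed_key(previous_key: str, step_num: int, step_name: str) -> str:
    """Build the next numbered dict key from the previous one,
    e.g. "0-raw" + "phase_shift" -> "1-raw-phase_shift"."""
    # Drop the old step number, keep the chain of applied step names.
    _, *applied = previous_key.split("-")
    return "-".join([str(step_num), *applied, step_name])


key1 = make_preprocessed_key("0-raw", 1, "phase_shift")
key2 = make_preprocessed_key(key1, 2, "bandpass_filter")
```

Note this relies on step names themselves containing no hyphens (e.g. `phase_shift`, `bandpass_filter` use underscores).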

After this, the data is saved to disk by calling a method on the preprocessing class itself.

In pipeline/preprocess.py, the functions in the section "Helpers for preprocessing steps dictionary" check the steps and run the actual preprocessing in SpikeInterface.
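The overall flow of applying config-defined steps could be sketched like this. The toy step functions here stand in for the real SpikeInterface preprocessing calls, and the config shape is an assumption based on the numbered-key format described above:

```python
def run_pp_steps(recording, pp_steps, step_functions):
    """Apply numbered preprocessing steps in order.

    pp_steps maps step numbers to (step_name, kwargs), mimicking
    the structure of the preprocessing config yaml.
    """
    for num in sorted(pp_steps, key=int):
        name, kwargs = pp_steps[num]
        recording = step_functions[name](recording, **kwargs)
    return recording


# Toy stand-ins operating on a label string instead of a Recording.
steps = {
    "phase_shift": lambda rec: f"{rec}+phase_shift",
    "bandpass_filter": lambda rec, freq_min, freq_max: f"{rec}+bp({freq_min},{freq_max})",
}
config = {
    "1": ("phase_shift", {}),
    "2": ("bandpass_filter", {"freq_min": 300, "freq_max": 6000}),
}
out = run_pp_steps("raw", config, steps)
```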

In utils, there is a set of functions for handling the PreprocessingData UserDict. Most of these are called later during sorting / postprocessing, but we may want to move some things around, so let me know if anything looks out of place; maybe some of these functions should live on base.py.
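For example, `update_two_layer_dict` (used in `__post_init__`, see the diff below) probably behaves roughly like this. This is a guess at its behaviour from the call sites, not the real implementation:

```python
def update_two_layer_dict(dict_, ses_name, run_name, value):
    """Set `value` at dict_[ses_name][run_name], creating the
    session-level dict if it does not exist yet."""
    dict_.setdefault(ses_name, {})[run_name] = value


d = {}
update_two_layer_dict(d, "ses1-name", "run1-name", {"0-raw": None})
```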

Running the script

The script example_preprocessing.py is added in examples. The path to the data used in the example is on /ceph and should be accessible from your user account (let me know if not). To test, clone the repository to ceph, install spikewrap and run example_preprocessing.py; this should run the pipeline (assuming the data is accessible).

Some General Notes / Questions

  1. Currently preprocess_data first contains the raw data and is then updated in-place with the preprocessed data. So it is in fact unpreprocessed at first, and only filled with preprocessed data afterwards. It is called preprocess_data at all stages, but maybe the naming should reflect the change; I'm not sure how confusing this currently is.

  2. PreprocessingData is a UserDict (i.e. dictionary with user-defined functions). In the docstrings I can't decide whether to refer to it as a dict, class, object, userdict 🤷‍♂️

These two points are not specifically related to this PR:

  1. The wrapping logic for SLURM is now much clearer (although removed from this PR). Next I tried avoiding use of locals() because it is not robust, but it leads to very long function signatures, which also duplicate arguments, which is buggy in its own right (e.g. may forget to update all instances when adding a new argument). Based on user feedback it is useful to print all passed arguments, in which case locals() also comes in handy for this. However it might be worth just making run_full_pipeline a class and storing all its kwargs as attributes. I was thinking to avoid a class entry point if possible but maybe now is the time for it.

e.g.

full_pipeline = RunFullPipeline(kwargs...)
full_pipeline.run()
full_pipeline.get_passed_arguments()
preprocessed_data = full_pipeline.get_preprocess_data()
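A minimal sketch of what that class might look like, storing the passed kwargs as attributes so they can be reported without locals() (the class and method names follow the pseudocode above; everything else here is illustrative):

```python
class RunFullPipeline:
    """Hypothetical class entry point: record every passed kwarg so
    the full argument set can be printed back to the user."""

    def __init__(self, **kwargs):
        self._passed_arguments = dict(kwargs)
        # Also expose each kwarg as an attribute for convenient access.
        for name, value in kwargs.items():
            setattr(self, name, value)

    def get_passed_arguments(self):
        return dict(self._passed_arguments)


full_pipeline = RunFullPipeline(base_path="path/to/rawdata", pp_steps="default")
```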
  2. After some experimenting yesterday I definitely agree HPC would also be good as a config only. I couldn't remember any of the arguments when trying to update some with a Dict, so kept having to refer to the config anyway. Also, it is likely there will only be a few standard settings that people will set up once and reuse, e.g. hpc="gpu-fast", hpc="cpu-large-mem".

Thanks and please let me know if you have any questions!

@JoeZiminski JoeZiminski changed the title Rev preprocessing Review preprocessing. Nov 16, 2023
@JoeZiminski JoeZiminski force-pushed the rev_preprocessing branch 5 times, most recently from 152ee9b to 84d5f79 Compare November 17, 2023 00:01
@JoeZiminski JoeZiminski changed the base branch from reviewed_code to fixes_from_load_data_review November 17, 2023 00:01
@JoeZiminski JoeZiminski force-pushed the rev_preprocessing branch 3 times, most recently from 0e743a9 to 38c3865 Compare November 17, 2023 00:14
Base automatically changed from fixes_from_load_data_review to reviewed_code November 17, 2023 09:09
paths to rawdata. The pp_steps attribute is set on
this class during execution of this function.

pp_steps: The name of valid preprocessing .yaml file (without the yaml extension).
@JoeZiminski JoeZiminski Nov 17, 2023

Actually, this can also take a path. In general this pattern is not generalisable, because when users pip install spikewrap, the configs will be hidden wherever pip installed spikewrap. Better to set up a config path at the start, possibly just a function spikewrap.set_config_path() that will store the path to use for configs in ~/.spikewrap.
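A rough sketch of what such a `set_config_path()` could do, assuming the chosen path is persisted as plain text in a settings file (`~/.spikewrap` by default; the `settings_file` parameter and `get_config_path` counterpart are hypothetical additions for illustration):

```python
from pathlib import Path


def set_config_path(config_path, settings_file=None):
    """Persist the user's configs directory so yaml names resolve
    there rather than inside the pip-installed package."""
    settings_file = settings_file or Path.home() / ".spikewrap"
    Path(settings_file).write_text(str(Path(config_path)))


def get_config_path(settings_file=None):
    """Read back the previously stored configs directory."""
    settings_file = settings_file or Path.home() / ".spikewrap"
    return Path(Path(settings_file).read_text())
```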

@@ -44,6 +49,19 @@ def __post_init__(self) -> None:
self.update_two_layer_dict(self, ses_name, run_name, {"0-raw": None})
self.update_two_layer_dict(self.sync, ses_name, run_name, None)

def set_pp_steps(self, pp_steps: Dict) -> None:

Maybe a docstring is not even required for this function.

available_files = glob.glob((config_dir / "*.yaml").as_posix())
available_files = [Path(path_).stem for path_ in available_files]

if name not in available_files: # then assume it is a full path
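Wrapped up, the lookup above amounts to something like the following sketch (function name is illustrative; the glob/stem logic mirrors the snippet under review):

```python
import glob
from pathlib import Path


def resolve_pp_steps(name, config_dir):
    """Treat `name` as a config name if a matching yaml exists in
    config_dir; otherwise assume it is a full path to a yaml file."""
    available_files = glob.glob((Path(config_dir) / "*.yaml").as_posix())
    available_files = [Path(path_).stem for path_ in available_files]

    if name in available_files:
        return Path(config_dir) / f"{name}.yaml"
    return Path(name)
```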
Change this, see comment below on setting a config directory.

recording object will be stored. The name of the dict entry will be
a concatenation of all preprocessing steps that were performed.

e.g. "0-raw", "0-raw_1-phase_shift_2-bandpass_filter"
Suggested change
e.g. "0-raw", "0-raw_1-phase_shift_2-bandpass_filter"
e.g. "0-raw", "2-raw-phase_shift-bandpass_filter"
