Code to run GLM-HMM analysis on dmdm data, building on Zoe Ashwood's work and using SWC's HPC. The code is heavily adapted from Ashwood's repo.
The code was originally written to reproduce figures in "Mice alternate between discrete strategies during perceptual decision-making" by Ashwood, Roy, Stone, IBL, Urai, Churchland, Pouget and Pillow (2020). Note: while that code reproduces the figures/analyses for their paper, the easiest way to get started applying the GLM-HMM to your own data is with this notebook.
If you are new to SWC's HPC (like I was), this page is very helpful.
To retrieve the code by cloning the repository:
git clone https://github.com/sumiya-kuroda/glm-hmm.git
Create a conda environment from the environment yaml file by running
conda env create -f environment.yml
Note: this step may take a while, as ibllib relies on OpenCV, which is slow to install. ibllib is only needed for the analysis of the IBL dataset, so installing it is completely optional.
We use version 0.0.1 of the Bayesian State Space Modeling framework from Scott Linderman's lab to perform GLM-HMM inference. Within the glmhmm environment, install the forked version of the ssm package available here. This is a lightly modified version of the master branch of the ssm package available at https://github.com/lindermanlab/ssm. It is modified to handle violation trials, as described in Section 4 of the Ashwood paper, as well as GLM weights for the dmdm dataset. To install this version of ssm, follow the instructions provided there, namely:
git clone https://github.com/sumiya-kuroda/ssm.git
cd ssm
git checkout sumiya-hpc # Change the branch
pip install numpy cython
pip install -e .
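After installation, you can quickly confirm that the glmhmm environment picks up the editable install of the fork. This is just a sanity check I find useful, not part of the original instructions; run it from a Python session inside the environment:
import ssm
print(ssm.__file__)  # should point into the cloned ssm/ directory, not site-packages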
When you ssh to the HPC, you will first enter the login node. No computation-heavy analysis should be done on this node. Here is what I usually do: I run a shell script that contains something like
srun --job-name=jupyter -p gpu --gres=gpu:1 --mem=32G -n 4 --pty bash -l
This will log you in to a compute node, which basically becomes my "server node". I personally use VS Code, so I attach VS Code to this server node. You can always check the status of all the nodes you are using with squeue -u ${swc_user_name}.
Then, activate the conda environment with the commands below. (This will also load MATLAB, which is needed for some preprocessing.)
module load miniconda
module load matlab
conda activate glmhmm
We use Jupyter and dask to take advantage of the HPC (Best Practices), but you can also just play around with it. VS Code can automatically identify the conda environment, but if this is your first time, you have to tell it where the conda environment is; follow the steps described here. We only use Jupyter when we run GLMs and GLM-HMMs on preprocessed data. The first preprocessing step does not require Jupyter, and you can simply run it on a decent compute node. (Again, not the login node!)
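If you have not used dask on the HPC before, here is a minimal sketch of the kind of setup I mean: a local Dask cluster started inside the compute node you allocated with srun, so the notebook can parallelize work across the cores you requested. The worker counts and memory limit below are only illustrative; match them to your srun allocation.
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=4, threads_per_worker=1, memory_limit="8GB")  # illustrative sizes
client = Client(cluster)
print(client.dashboard_link)  # open this link in a browser to monitor the workers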
The code is ordered so that the dataset is first preprocessed into the desired format and fitting parameters are then initialized efficiently.
This creates the data folder. Within this directory, run the scripts in the order indicated by the number at the beginning of the file name. For the dmdm dataset, run 0_convert_behavior_dataset.sh first. This script converts raw behavior data saved in .mat format to Ashwood's GLM-HMM-friendly .npy format. Be aware that, even within the dmdm dataset, animals' behaviors are stored slightly differently between projects. The shell script therefore requires both a path to the .mat file and the format that the .mat file uses. Example usage:
. ./0_convert_behavior_dataset.sh -i ${path_to_raw_data} -f ${file_format}
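For intuition, here is a rough Python illustration of what this conversion step amounts to (the shell script is the supported route; the file and variable names below are hypothetical and depend on which project's .mat format you have):
import numpy as np
import scipy.io as sio

mat = sio.loadmat("raw_behavior.mat")      # works for pre-v7.3 .mat files
np.save("behavior.npy", mat["behavior"])   # "behavior" is a hypothetical variable name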
Then run 1_begin_processing.py and 2_create_design_mat.py to obtain the design matrix used as input for all of the models. If you want to reproduce Ashwood's results on the IBL dataset, follow the steps in her repo. If you want to customize the design matrix for the GLMs, edit design_mat_utils.py. Example usage:
python 1_begin_processing.py ${filename}
python 2_create_design_mat.py ${filename} -r ${min_number_of_sessions}
This creates the results folder. With the processed dataset, you can fit the GLM, lapse, and GLM-HMM models discussed in the paper using the code contained in this folder. Code for the dmdm dataset is organized under dmdm/, separately from Ashwood's original code for IBL. As discussed in the paper, the GLM should be run first, as the GLM fit is used to initialize the global GLM-HMM (the model that is fit with data from all animals). The dmdm dataset has two observations of the animal's behavior: outcomes and reaction times. We therefore fit these observations independently with two separate GLMs. Also, because the dmdm task has more discrete outcomes than two binary choices, the GLM needs to account for this complexity of the task when fitting and optimizing weights. We therefore use the random utility maximization nested logit (RUMNL) model, which introduces a nested structure over the outcomes and has the GLM compute conditional probabilities for each outcome. Reaction times are fitted with a Gaussian GLM.
After GLM fitting, you should be able to run the GLM-HMM. The global GLM-HMM should be run next, as it explores the parameter space and finds candidate parameters for fitting the GLM-HMM to individual animals. As Ashwood describes in her paper, we use the EM algorithm here, which takes quite a lot of computational resources. The EM algorithm is not guaranteed to find the global maximum, so we fit from several random initializations (which further increases the required computational resources). HPC usage is therefore highly recommended.
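To make the computational cost concrete, the global fit boils down to running EM from many random initializations and keeping the best one by log-likelihood, roughly as sketched below. This sketch uses the upstream ssm API with toy data; the actual fitting scripts in this repo (and the forked ssm, e.g. its handling of violation-trial masks) differ in the details, so treat the argument names as an approximation.
import numpy as np
import ssm

K, D, M, C = 3, 1, 4, 2                       # states, obs dim, inputs, outcome classes
T = 500                                       # toy number of trials
inputs = [np.random.randn(T, M)]              # toy design matrix
datas = [np.random.randint(C, size=(T, D))]   # toy discrete outcomes

best_ll, best_hmm = -np.inf, None
for seed in range(20):                        # e.g. 20 random initializations
    np.random.seed(seed)
    hmm = ssm.HMM(K, D, M, observations="input_driven_obs",
                  observation_kwargs=dict(C=C), transitions="standard")
    lls = hmm.fit(datas, inputs=inputs, method="em", num_iters=100, tolerance=1e-4)
    if lls[-1] > best_ll:
        best_ll, best_hmm = lls[-1], hmm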
Finally, GLM-HMMs can be fit to the data from individual animals. We also use maximum a posteriori (MAP) estimation here, as the number of trials available for each animal is very limited. (See here for more explanation.) This introduces two additional hyperparameters, alpha and sigma, and we use k-fold cross-validation to find the best pair of them. The Jupyter notebooks for the GLM-HMM also generate some basic figures that make it easy to interpret the results.
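The hyperparameter search itself is just a grid search wrapped in k-fold cross-validation over sessions, along the lines of the sketch below. score_fold is a hypothetical stand-in for the repo's MAP fitting and held-out evaluation; only the looping pattern is the point here.
import itertools
import numpy as np

def score_fold(train_sessions, test_sessions, alpha, sigma):
    # Stand-in: the real code fits the GLM-HMM with priors controlled by
    # alpha (transitions) and sigma (weights) on the training sessions and
    # returns the log-likelihood of the held-out sessions.
    return -((alpha - 2.0) ** 2 + (sigma - 1.0) ** 2)

sessions = np.arange(30)                                       # toy session ids
folds = np.array_split(np.random.permutation(sessions), 5)     # 5-fold CV over sessions

scores = {}
for alpha, sigma in itertools.product([1.0, 2.0, 4.0], [0.5, 1.0, 2.0, 4.0]):
    fold_scores = [score_fold(np.concatenate([f for j, f in enumerate(folds) if j != i]),
                              test, alpha, sigma)
                   for i, test in enumerate(folds)]
    scores[(alpha, sigma)] = np.mean(fold_scores)

best_alpha, best_sigma = max(scores, key=scores.get)           # best pair on average held-out score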
Fitting the lapse model to the dataset is completely optional. While it is not used for any initialization, it allows model comparison with the global and individual GLM-HMMs. For now, the lapse model does not support the dmdm dataset.
This creates the figures folder. Assuming that you have downloaded and preprocessed the datasets, and that you have fit all models on these datasets, you can start some exploratory analysis and reproduce some figures. 3_make_figures/dmdm/write_results.py saves the outputs so that you can run further analyses on the trials characterized by the GLM-HMM. Enjoy!
- To create a symbolic link:
ln -s ../ssm ssm
- Want to improve your dask user experience? Check out this JupyterLab extension and this VS Code extension.
- Is your .mat file in v7.3 format? You can use this Python module to load it, but I found it extremely slow for .mat files of the size we have.
- Python is here. VS Code will ask you to specify the path to it.
- When using the HPC, you need to allocate computing resources for your analysis. This is not an easy problem to solve, as there is not an infinite amount of resources on the HPC, and your jobs can stay pending if you request a lot of resources during a busy period. Additionally, when I assigned 1 process / 2 CPU cores to each worker, I started to receive
distributed.utils_perf - WARNING - full garbage collections took 11% CPU time recently (threshold: 10%)
whose cause I still need to explore at some point (perhaps my code is not optimized for this type of usage?).
- Alternatively, you can launch the Jupyter server on a different compute node. This is quite useful, especially if your notebook uses a lot of resources and you are not using Dask to parallelize it. Here is how to do it: log in to a separate compute node with
srun
and launch Jupyter as
export XDG_RUNTIME_DIR="" # See https://swc-neuro.slack.com/archives/CHLGTQVLL/p1560422985006700
jupyter lab --port=2025
Following this procedure, you will be able to connect to the Jupyter notebook on the HPC at http://127.0.0.1:2025/lab using your web browser. I personally use VS Code, so I simply run the command below from another HPC terminal to forward the port (ref).
ssh -q -N ${jupyter_host_name} -L 2025:localhost:2025
I then follow this step and simply copy and paste the Jupyter URL from within the HPC.