glm-hmm

Code to run GLM-HMM analysis on dmdm data on SWC's HPC, building on Zoe Ashwood's work. The code is heavily adapted from Ashwood's repo.

The code was originally written to reproduce figures in "Mice alternate between discrete strategies during perceptual decision-making" by Ashwood, Roy, Stone, IBL, Urai, Churchland, Pouget and Pillow (2020). Note: while this code reproduces the figures/analyses for their paper, the easiest way to get started applying the GLM-HMM to your own data is with this notebook.

Installation

If you are new to SWC's HPC (like I was), this page is very helpful.

To retrieve the code by cloning the repository:

git clone https://github.com/sumiya-kuroda/glm-hmm.git

Create a conda environment from the environment.yml file by running:

conda env create -f environment.yml

Note: this may take a while, as ibllib depends on OpenCV, which is slow to install. ibllib is only needed for analyzing the IBL dataset, so installing it is completely optional.

We use version 0.0.1 of the Bayesian State Space Modeling framework from Scott Linderman's lab to perform GLM-HMM inference. Within the glmhmm environment, install the forked version of the ssm package available here. This is a lightly modified version of the master branch of the ssm package available at https://github.com/lindermanlab/ssm. It is modified to handle violation trials, as described in Section 4 of Ashwood's paper, as well as GLM weights for the dmdm dataset. To install this version of ssm, follow the instructions provided there, namely:

git clone https://github.com/sumiya-kuroda/ssm.git

cd ssm
git checkout sumiya-hpc # Change the branch
pip install numpy cython
pip install -e .

Getting started

When you ssh to the HPC, you will first enter the login node. No computation-heavy analysis should be done on this node. Here is what I usually do: I run a shell script that contains something like

srun --job-name=jupyter -p gpu --gres=gpu:1 --mem=32G -n 4 --pty bash -l

This will log you in to a compute node, which basically becomes my "server node". I personally use VSCode, so I attach VSCode to this server node. You can always check the status of all the nodes you are using with squeue -u ${swc_user_name}. Then activate the conda environment with the commands below. (This will also load MATLAB, which is needed for some preprocessing.)

module load miniconda
module load matlab

conda activate glmhmm

We use Jupyter and Dask to take advantage of the HPC (Best Practices), but you can also just play around with it. VSCode can automatically identify the conda environment, but if this is your first time, you have to tell it where the conda environment is; follow the steps described here. We only use Jupyter when we run GLMs and GLM-HMMs on preprocessed data. The first preprocessing step does not require Jupyter, and you can simply run it on a decent compute node. (Again, not the login node!)
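
As a rough illustration of how Dask can be pointed at SLURM from a notebook, here is a minimal sketch using dask-jobqueue; the partition name and resource numbers are placeholders, not values this repo prescribes:

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Each worker runs as its own SLURM job; partition and resources below are
# placeholders -- adjust them to what SWC's HPC actually offers.
cluster = SLURMCluster(
    queue="cpu",           # SLURM partition (assumption)
    cores=4,               # cores per worker job
    memory="16GB",         # memory per worker job
    walltime="02:00:00",
)
cluster.scale(jobs=4)      # submit 4 worker jobs

client = Client(cluster)   # notebook code submits work through this client
print(client.dashboard_link)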

Run GLM-HMM

The code is organized so that the dataset is first preprocessed into the desired format and the fitting parameters are then efficiently initialized.

1_preprocess_data

This creates the data folder. Within this directory, run the scripts in the order indicated by the number at the beginning of the file name. For the dmdm dataset, run 0_convert_behavior_dataset.sh first. This script converts raw behavior data saved in .mat to Ashwood's GLM-HMM-friendly .npy format. Be aware that, even within the dmdm dataset, animals' behaviors are stored slightly differently between projects. This shell script therefore requires both a path to the .mat file and the format that the .mat file uses. Example usage:

. ./0_convert_behavior_dataset.sh -i ${path_to_raw_data} -f ${file_format}
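
Under the hood, the conversion boils down to loading the MATLAB structs and saving the relevant arrays in .npy format. A minimal sketch of that idea (the file name and field name below are hypothetical, not the ones the script actually uses):

import numpy as np
from scipy.io import loadmat

# Hypothetical example: load a (pre-v7.3) .mat file and save one field as .npy.
mat = loadmat("session_behavior.mat", squeeze_me=True)  # placeholder file name
choices = np.asarray(mat["choice"])                     # placeholder field name
np.save("session_choices.npy", choices)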

Then run 1_begin_processing.py and 2_create_design_mat.py to obtain the design matrix used as input for all of the models. If you want to reproduce Ashwood's results on the IBL dataset, follow the steps in her repo. If you want to customize the design matrix for the GLMs, edit design_mat_utils.py. Example usage:

python 1_begin_processing.py ${filename}
python 2_create_design_mat.py ${filename} -r ${min_number_of_sessions}
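
For intuition, a design matrix simply stacks one regressor per column across trials. A toy sketch (the regressors here are chosen for illustration only and are not necessarily the ones 2_create_design_mat.py builds):

import numpy as np

# Toy example: 5 trials, 3 regressors (stimulus, previous choice, bias).
stimulus    = np.array([0.2, -0.5, 1.0, 0.0, -1.0])
prev_choice = np.array([1, 1, -1, 1, -1])           # previous choice coded as +/-1
bias        = np.ones(5)                            # constant offset column

X = np.column_stack([stimulus, prev_choice, bias])  # shape: (num_trials, num_regressors)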

2_fit_models

This creates the results folder. With the processed dataset, you can fit the GLM, lapse and GLM-HMM models discussed in the paper using the code contained in this folder. Code for the dmdm dataset is organized under dmdm/, separately from Ashwood's original code for IBL. As discussed in the paper, the GLM should be run first, as the GLM fit is used to initialize the global GLM-HMM (the model that is fit with data from all animals).

The dmdm dataset has two observations of the animal's behavior: outcomes and reaction times. We therefore fit these observations independently using two separate GLMs. Also, because the dmdm task has more discrete outcomes than two binary choices, the GLM needs to account for that complexity of the task when fitting and optimizing weights. We therefore use a random utility maximization nested logit (RUMNL) model, which in essence has the GLM compute conditional probabilities for each outcome by introducing a nested structure over the outcomes. Reaction times are fitted with a Gaussian GLM.
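
The Gaussian GLM for reaction times amounts to a linear regression of reaction time on the design-matrix columns with Gaussian noise. A minimal sketch of that idea on synthetic data (variable names are illustrative, not taken from this repo):

import numpy as np

# Synthetic data: X is a (num_trials, num_regressors) design matrix, rt the reaction times.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([0.5, -0.2, 1.0])
rt = X @ true_w + 0.1 * rng.normal(size=100)

# For a Gaussian GLM with identity link, the maximum-likelihood weights are the
# least-squares solution; the residual spread estimates the noise scale.
w_hat, *_ = np.linalg.lstsq(X, rt, rcond=None)
sigma_hat = np.std(rt - X @ w_hat)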

After GLM fitting, you should be able to run the GLM-HMM. The global GLM-HMM should be run next, as it explores the parameter space and finds candidate parameters for fitting the GLM-HMM to individual animals. As Ashwood described in her paper, we use the EM algorithm here, which takes quite a lot of computational resources. EM does not guarantee finding the global maximum, so we need to fit with several initializations (which further increases the required computational resources). HPC usage is therefore highly recommended.
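
For orientation, here is a minimal sketch of what a single global GLM-HMM fit looks like with the ssm package on synthetic binary-choice data. The actual scripts in this folder handle GLM-based initialization, multiple restarts, violation trials and the dmdm-specific observations; the dimensions below are illustrative only:

import numpy as np
import ssm

num_states = 3        # number of discrete states (K)
obs_dim = 1           # one choice per trial
num_categories = 2    # binary choice here; dmdm has more outcome categories
input_dim = 4         # number of design-matrix columns

# GLM (input-driven) observations with standard transitions, as in Ashwood et al.
glmhmm = ssm.HMM(num_states, obs_dim, input_dim,
                 observations="input_driven_obs",
                 observation_kwargs=dict(C=num_categories),
                 transitions="standard")

# datas: list of (T, 1) integer choice arrays; inputs: list of (T, input_dim) arrays.
rng = np.random.default_rng(0)
inputs = [rng.normal(size=(500, input_dim))]
datas = [rng.integers(0, num_categories, size=(500, obs_dim))]

lls = glmhmm.fit(datas, inputs=inputs, method="em",
                 num_iters=200, tolerance=1e-4)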

Finally, GLM-HMMs can be fit to the data from individual animals. We use maximum a posteriori (MAP) estimation here, as the number of trials available for each animal is very limited. (See here for more explanation.) This introduces two additional hyperparameters, alpha and sigma, and we use k-fold cross-validation to find the best pair. The Jupyter notebooks for the GLM-HMM also generate some basic figures that make it easier to interpret the results.
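
Conceptually, the hyperparameter search is a grid over (sigma, alpha) scored by cross-validated held-out log-likelihood. A schematic sketch (fit_and_score is a hypothetical stand-in for the per-animal MAP fit done in the notebooks, and the grids are illustrative):

import itertools
import numpy as np

sigmas = [0.5, 1.0, 2.0, 4.0]   # illustrative grid for the prior scale on GLM weights
alphas = [1.0, 2.0]             # illustrative grid for the transition prior

def fit_and_score(train_idx, test_idx, sigma, alpha):
    """Hypothetical stand-in: fit a MAP GLM-HMM on the training trials and
    return the log-likelihood on the held-out trials."""
    return np.random.default_rng(0).normal()   # placeholder score

num_sessions = 20
num_folds = 5
folds = np.array_split(np.random.default_rng(0).permutation(num_sessions), num_folds)

scores = {}
for sigma, alpha in itertools.product(sigmas, alphas):
    fold_scores = []
    for k in range(num_folds):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(num_folds) if j != k])
        fold_scores.append(fit_and_score(train_idx, test_idx, sigma, alpha))
    scores[(sigma, alpha)] = np.mean(fold_scores)

best_sigma, best_alpha = max(scores, key=scores.get)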

It is completely optional to fit the lapse model to the dataset. While it is not used for any initialization purposes, it allows model comparison with the global and individual GLM-HMMs. For now, the lapse model does not support the dmdm dataset.

3_make_figures

This creates the figures folder. Assuming that you have downloaded and preprocessed the datasets, and that you have fit all models on them, you can start exploratory analysis and reproduce some figures. 3_make_figures/dmdm/write_results.py saves the outputs so that you can run other analyses on trials characterized by the GLM-HMM. Enjoy!

Troubleshooting

  • To create a symbolic link to the ssm package: ln -s ../ssm ssm
  • Want to improve your Dask user experience? Check out this JupyterLab extension and this VSCode extension.
  • Is your .mat file saved in v7.3? You can use this Python module to load it, but I found it extremely slow for .mat files of the size we have.
  • Python is located here; VSCode will ask you to specify the path to it.
  • When using the HPC, you need to allocate computing resources for your analysis. This is not an easy problem to solve, as there is not an infinite amount of resources on the HPC, and your jobs can stay pending if you request a lot of resources during a busy period. Additionally, when I assigned 1 process / 2 CPU cores to each worker, I started to receive distributed.utils_perf - WARNING - full garbage collections took 11% CPU time recently (threshold: 10%), and I still need to explore the cause at some point. (Perhaps my code is not optimized for this type of usage?)
  • Alternatively, you can launch the Jupyter server on a different compute node. This is especially useful when your notebook uses a lot of resources and does not use Dask to parallelize the work. Here is how: log in to a separate compute node with srun and launch Jupyter as
export XDG_RUNTIME_DIR="" # See https://swc-neuro.slack.com/archives/CHLGTQVLL/p1560422985006700
jupyter lab --port=2025

Following this procedure will let you connect to the Jupyter notebook on the HPC at http://127.0.0.1:2025/lab using your web browser. I personally use VSCode, so I simply run the command below from another HPC terminal to set up port forwarding (ref).

ssh -q -N ${jupyter_host_name} -L 2025:localhost:2025

I then follow this step and simply copy and paste the Jupyter URL from the HPC.
