Run Inference on cluster #189

Status: Open
Wants to merge 60 commits into base: main
Changes shown from 46 of the 60 commits.
4d1383a
adding config file, load from checkpoint
nikk-nikaznan Jun 17, 2024
94761b8
adding inference to toml
nikk-nikaznan Jun 17, 2024
e4f1bac
adding bash script
nikk-nikaznan Jun 18, 2024
0b3ddd9
change variable
nikk-nikaznan Jun 18, 2024
892914e
change variable
nikk-nikaznan Jun 18, 2024
66c22be
naming error
nikk-nikaznan Jun 18, 2024
3fab713
naming error
nikk-nikaznan Jun 18, 2024
2b0d273
fixed import
nikk-nikaznan Jun 18, 2024
85452af
cleaned up sort
nikk-nikaznan Jun 18, 2024
f056b41
add app_wrapper
nikk-nikaznan Jun 18, 2024
8780c36
changed accelerator
nikk-nikaznan Jun 18, 2024
56b74ff
bugs
nikk-nikaznan Jun 18, 2024
a30b0dc
removed accelerator
nikk-nikaznan Jun 18, 2024
918674d
removed accelerator
nikk-nikaznan Jun 18, 2024
2d6da1e
wrong path
nikk-nikaznan Jun 18, 2024
e458c6d
edit path
nikk-nikaznan Jun 19, 2024
29cfea6
adding batches
nikk-nikaznan Jun 19, 2024
ec6886a
debugging oom
nikk-nikaznan Jun 19, 2024
83ed342
save video to false
nikk-nikaznan Jun 19, 2024
d3942ff
save video to false
nikk-nikaznan Jun 19, 2024
2900a9e
adding device
nikk-nikaznan Jun 19, 2024
500d274
revert the batch out
nikk-nikaznan Jun 20, 2024
7260ca8
modify bash script
nikk-nikaznan Jun 20, 2024
def687a
add guide
nikk-nikaznan Jun 21, 2024
1a5d853
debugging
nikk-nikaznan Jun 21, 2024
8ca41c3
fixed codec
nikk-nikaznan Jun 21, 2024
be6cff9
cleaned up
nikk-nikaznan Jun 21, 2024
7117511
adding gt_dir
nikk-nikaznan Jun 21, 2024
45cd8bd
codev revert
nikk-nikaznan Jun 21, 2024
1c56dfc
Merge branch 'main' into nikkna/inference_cluster
nikk-nikaznan Jun 21, 2024
6077a7e
adding some logging
nikk-nikaznan Jun 21, 2024
e5d362f
Merge branch 'main' of github.com:SainsburyWellcomeCentre/crabs-explo…
nikk-nikaznan Jun 28, 2024
a114200
cleaned up rebase
nikk-nikaznan Jun 28, 2024
17146ad
some changes based on the new modules
nikk-nikaznan Jun 28, 2024
1e250b0
Merge branch 'main' into nikkna/inference_cluster
nikk-nikaznan Jul 4, 2024
3ccc258
Merge branch 'main' into nikkna/inference_cluster
nikk-nikaznan Jul 4, 2024
6d22c4f
Merge branch 'main' into nikkna/inference_cluster
nikk-nikaznan Jul 8, 2024
bfd97bd
Merge branch 'main' into nikkna/inference_cluster
nikk-nikaznan Jul 9, 2024
8284157
adding bash script for running all escape events
nikk-nikaznan Jul 9, 2024
cf04af3
small changes on the bash script
nikk-nikaznan Jul 9, 2024
8d4c5a2
changed to the correct video example
nikk-nikaznan Jul 9, 2024
2ffce7a
changes of guide
nikk-nikaznan Jul 9, 2024
8663563
removed device, already set in code
nikk-nikaznan Jul 9, 2024
9af60ee
check cuda status
nikk-nikaznan Jul 9, 2024
86a309b
modified some path
nikk-nikaznan Jul 9, 2024
3d33730
changes branch to main
nikk-nikaznan Jul 9, 2024
b72b4b3
add args to handle run on directory on the cluster
nikk-nikaznan Jul 10, 2024
2b9973e
add args to handle run on directory on the cluster
nikk-nikaznan Jul 10, 2024
feace52
cleaned up
nikk-nikaznan Jul 10, 2024
7977b48
cleaned up
nikk-nikaznan Jul 10, 2024
bff7606
forgot the args
nikk-nikaznan Jul 10, 2024
9c0a560
Merge branch 'main' into nikkna/inference_cluster
nikk-nikaznan Jul 22, 2024
c5bd870
Update guides/TrackingModelHPC.md
nikk-nikaznan Jul 22, 2024
cd497d7
extension, check dir
nikk-nikaznan Jul 22, 2024
f87814c
Merge branch 'nikkna/inference_cluster' of github.com:SainsburyWellco…
nikk-nikaznan Jul 22, 2024
586d412
Update bash_scripts/run_tracking.sh
nikk-nikaznan Jul 22, 2024
742ee1a
debug
nikk-nikaznan Jul 29, 2024
b96d4fb
debug
nikk-nikaznan Jul 29, 2024
e8d77f0
add log
nikk-nikaznan Jul 29, 2024
5121e45
add log
nikk-nikaznan Jul 29, 2024
101 changes: 101 additions & 0 deletions bash_scripts/run_tracking.sh
@@ -0,0 +1,101 @@
#!/bin/bash

#SBATCH -p gpu # a100 # partition
#SBATCH --gres=gpu:1
#SBATCH -N 1 # number of nodes
#SBATCH --ntasks-per-node 8 # 2 # max number of tasks per node
#SBATCH --mem 64G # memory pool for all cores
#SBATCH -t 3-00:00 # time (D-HH:MM)
#SBATCH -o slurm.%A.%N.out
#SBATCH -e slurm.%A.%N.err
#SBATCH --mail-type=ALL
#SBATCH [email protected]

# ---------------------
# Source bashrc
# ----------------------
# Otherwise `which python` points to the miniconda module's Python
source ~/.bashrc

# memory
# see https://pytorch.org/docs/stable/notes/cuda.html#environment-variables
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True # export so the Python process sees it

# -----------------------------
# Error settings for bash
# -----------------------------
# see https://wizardzines.com/comics/bash-errors/
set -e # do not continue after errors
set -u # throw error if variable is unset
set -o pipefail # make the pipe fail if any part of it fails

# ---------------------
# Define variables
# ----------------------

# video and inference config
VIDEO_PATH=/ceph/zoo/users/sminano/crabs_tracks_label/04.09.2023-04-Right_RE_test/04.09.2023-04-Right_RE_test.mp4
CONFIG_FILE=/ceph/zoo/users/sminano/cluster_tracking_config.yaml

# checkpoint
TRAINED_MODEL_PATH=/ceph/zoo/users/sminano/ml-runs-all/ml_runs-nikkna-copy/243676951438603508/8dbe61069f17453a87c27b4f61f6e681/checkpoints/last.ckpt


# output directory
OUTPUT_DIR=/ceph/zoo/users/sminano/crabs_track_output

# ground truth is available
GT_PATH=/ceph/zoo/users/sminano/crabs_tracks_label/04.09.2023-04-Right_RE_test/04.09.2023-04-Right_RE_test_corrected_ST_csv.csv

# version of the codebase
GIT_BRANCH=main

# -----------------------------
# Create virtual environment
# -----------------------------
module load miniconda

# Define an environment for each job in the
# temporary directory of the compute node
ENV_NAME=crabs-dev-$SLURM_JOB_ID
ENV_PREFIX=$TMPDIR/$ENV_NAME

# create environment
conda create \
--prefix $ENV_PREFIX \
-y \
python=3.10

# activate environment
conda activate $ENV_PREFIX

# install crabs package in virtual env
python -m pip install git+https://github.com/SainsburyWellcomeCentre/crabs-exploration.git@$GIT_BRANCH


# log pip and python locations
echo $ENV_PREFIX
which python
which pip

# print the version of crabs package (last number is the commit hash)
echo "Git branch: $GIT_BRANCH"
conda list crabs
echo "-----"

# ------------------------------------
# GPU specs
# ------------------------------------
echo "Memory used per GPU before running inference"
echo $(nvidia-smi --query-gpu=name,memory.total,memory.free,memory.used --format=csv) #noheader
echo "-----"

# -------------------
# Run tracking script
# -------------------
detect-and-track-video \
--trained_model_path $TRAINED_MODEL_PATH \
--video_path $VIDEO_PATH \
--config_file $CONFIG_FILE \
--output_dir $OUTPUT_DIR \
--gt_path $GT_PATH
104 changes: 104 additions & 0 deletions bash_scripts/run_tracking_all_escape_events.sh
Reviewer comment:
I like this one! ✨

Maybe going forwards I can combine them to read a dir or a single video, but this is a great starting point

@@ -0,0 +1,104 @@
#!/bin/bash

#SBATCH -p gpu # a100 # partition
#SBATCH --gres=gpu:1
#SBATCH -N 1 # number of nodes
#SBATCH --ntasks-per-node 8 # 2 # max number of tasks per node
#SBATCH --mem 64G # memory pool for all cores
#SBATCH -t 3-00:00 # time (D-HH:MM)
#SBATCH -o slurm.%A.%N.out
#SBATCH -e slurm.%A.%N.err
#SBATCH --mail-type=ALL
#SBATCH [email protected]

# ---------------------
# Source bashrc
# ----------------------
# Otherwise `which python` points to the miniconda module's Python
source ~/.bashrc

# memory
# see https://pytorch.org/docs/stable/notes/cuda.html#environment-variables
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True # export so the Python process sees it

# -----------------------------
# Error settings for bash
# -----------------------------
# see https://wizardzines.com/comics/bash-errors/
set -e # do not continue after errors
set -u # throw error if variable is unset
set -o pipefail # make the pipe fail if any part of it fails

# ---------------------
# Define variables
# ----------------------

# video and inference config
VIDEO_DIR=/ceph/zoo/raw/CrabField/ramalhete_2023/Escapes
PATTERN="*.mov"
CONFIG_FILE=/ceph/zoo/users/sminano/cluster_tracking_config.yaml

# checkpoint
TRAINED_MODEL_PATH=/ceph/zoo/users/sminano/ml-runs-all/ml_runs-nikkna-copy/243676951438603508/8dbe61069f17453a87c27b4f61f6e681/checkpoints/last.ckpt

# output directory
OUTPUT_DIR=/ceph/zoo/users/sminano/crabs_track_output

# version of the codebase
GIT_BRANCH=main

# Exit early if VIDEO_DIR is not a directory
if [ ! -d "$VIDEO_DIR" ]; then
    echo "ERROR: $VIDEO_DIR is not a directory" >&2
    exit 1
fi

# -----------------------------
# Create virtual environment
# -----------------------------
module load miniconda

# Define an environment for each job in the
# temporary directory of the compute node
ENV_NAME=crabs-dev-$SLURM_JOB_ID
ENV_PREFIX=$TMPDIR/$ENV_NAME

# create environment
conda create \
--prefix $ENV_PREFIX \
-y \
python=3.10

# activate environment
conda activate $ENV_PREFIX

# install crabs package in virtual env
python -m pip install git+https://github.com/SainsburyWellcomeCentre/crabs-exploration.git@$GIT_BRANCH

# log pip and python locations
echo $ENV_PREFIX
which python
which pip

# print the version of crabs package (last number is the commit hash)
echo "Git branch: $GIT_BRANCH"
conda list crabs
echo "-----"

# ------------------------------------
# GPU specs
# ------------------------------------
echo "Memory used per GPU before running inference"
echo $(nvidia-smi --query-gpu=name,memory.total,memory.free,memory.used --format=csv) #noheader
echo "-----"

# -------------------
# Run tracking script for each .mov file in VIDEO_DIR
# -------------------
for VIDEO_PATH in "$VIDEO_DIR"/*.mov; do
echo "Processing video: $VIDEO_PATH"
detect-and-track-video \
--trained_model_path $TRAINED_MODEL_PATH \
--video_path $VIDEO_PATH \
--config_file $CONFIG_FILE \
--output_dir $OUTPUT_DIR
done
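Note that the loop above hardcodes `*.mov` even though `PATTERN` is defined earlier in the script. A variant that actually uses the variable, and skips cleanly when nothing matches, might look like this (a sketch, assuming bash's `nullglob` option is acceptable here; `process_videos` is a hypothetical helper name):

```shell
#!/bin/bash
# Process every video matching $pattern inside $video_dir.
# With nullglob set, an unmatched glob expands to nothing, so the loop
# simply runs zero times instead of iterating over the literal pattern.
process_videos() {
    local video_dir=$1 pattern=$2
    shopt -s nullglob
    for video_path in "$video_dir"/$pattern; do
        echo "Processing video: $video_path"
        # detect-and-track-video --video_path "$video_path" ...  (as in the script above)
    done
    shopt -u nullglob
}
```

Keeping the pattern in a single variable means switching from `*.mov` to, say, `*.avi` only requires editing one line.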
5 changes: 5 additions & 0 deletions crabs/tracker/track_video.py
@@ -63,6 +63,11 @@ def setup(self):
"""
Load tracking config, trained model and input video path.
"""
# Check for CUDA availability
if self.device == "cuda" and not torch.cuda.is_available():
print("CUDA is not available. Falling back to CPU.")
self.device = "cpu"

with open(self.config_file, "r") as f:
self.config = yaml.safe_load(f)

165 changes: 165 additions & 0 deletions guides/TrackingModelHPC.md
@@ -0,0 +1,165 @@
# Run tracking inference on the cluster

1. **Preparatory steps**

- If you are not connected to the SWC network: connect to the SWC VPN.

1. **Connect to the SWC HPC cluster**

```
ssh <SWC-USERNAME>@ssh.swc.ucl.ac.uk
ssh hpc-gw1
```

It may ask for your password twice. To set up SSH keys for the SWC cluster, see [this guide](https://howto.neuroinformatics.dev/programming/SSH-SWC-cluster.html#ssh-keys).

1. **Download the tracking script from the 🦀 repository**

   To do so, run one of the following commands. Each downloads a bash script for tracking (`run_tracking.sh` or `run_tracking_all_escape_events.sh`) from the `main` branch of the [🦀 repository](https://github.com/SainsburyWellcomeCentre/crabs-exploration) to the current working directory.

- To run video tracking on a specific video: download the `run_tracking.sh` file

```
curl https://raw.githubusercontent.com/SainsburyWellcomeCentre/crabs-exploration/main/bash_scripts/run_tracking.sh > run_tracking.sh
```

- To run video tracking on all escape events (or on a directory): download the `run_tracking_all_escape_events.sh` file

```
curl https://raw.githubusercontent.com/SainsburyWellcomeCentre/crabs-exploration/main/bash_scripts/run_tracking_all_escape_events.sh > run_tracking_all_escape_events.sh
```

These bash scripts launch a SLURM job that:

- gets the 🦀 package from git,
- installs it in the compute node,
- and runs video tracking on the specified video(s).

> [!TIP]
> To retrieve a version of these files that is different from the files at the tip of `main`, edit the remote file path in the curl command:
>
> - For example, to download the version of the file at the tip of a branch called `<BRANCH-NAME>`, edit the path above to replace `main` with `<BRANCH-NAME>`:
> ```
> https://raw.githubusercontent.com/SainsburyWellcomeCentre/crabs-exploration/<BRANCH-NAME>/bash_scripts/run_tracking.sh
> ```
> - To download the version of the file at a specific commit, replace `main` with the commit hash `<COMMIT-HASH>` (raw URLs take the commit hash directly, without `blob/`):
> ```
> https://raw.githubusercontent.com/SainsburyWellcomeCentre/crabs-exploration/<COMMIT-HASH>/bash_scripts/run_tracking.sh
> ```
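For instance, to fetch `run_tracking.sh` from a hypothetical branch named `my-feature` (the branch name below is illustrative), the curl command becomes:

```shell
BRANCH=my-feature   # hypothetical branch name -- substitute your own
URL="https://raw.githubusercontent.com/SainsburyWellcomeCentre/crabs-exploration/${BRANCH}/bash_scripts/run_tracking.sh"
echo "$URL"
# curl "$URL" > run_tracking.sh   # uncomment to actually download
```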

4. **Edit the bash script!**

   To run the tracker, we need to ensure the correct trained model is used. All the parameters used in a training run are logged to `mlflow`.

   We can see the performance of each training session by inspecting the `metrics` tab in the `mlflow` UI, where the `training loss`, `validation precision` and `validation recall` are plotted. The trained model checkpoint path is logged in the `parameters` section under the `overview` tab.

   When launching a tracking job, we may want to edit in the bash script:

- The `TRAINED_MODEL_PATH`
- The `OUTPUT_DIR`
- The `VIDEO_PATH` (for `run_tracking.sh`) or `VIDEO_DIR` (for `run_tracking_all_escape_events.sh`)

Less frequently, one may need to edit:

- the `CONFIG_FILE`: usually we point to the same config file used to train the model, at `/ceph/zoo/users/sminano/cluster_tracking_config.yaml`, which we can edit.
- the `GIT_BRANCH`, if we want to use a specific version of the 🦀 package. Usually we run the version of the 🦀 package in `main`.
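For reference, the variable block these edits target sits near the top of `run_tracking.sh`. A minimal sketch with placeholder values (the `<...>` segments are illustrative, not real locations):

```shell
#!/bin/bash
# Variables typically edited per tracking job -- all values below are
# illustrative placeholders, not real paths.
TRAINED_MODEL_PATH="/ceph/zoo/users/<USER>/ml-runs/<EXPERIMENT-ID>/<RUN-ID>/checkpoints/last.ckpt"
OUTPUT_DIR="/ceph/zoo/users/<USER>/crabs_track_output"
VIDEO_PATH="/ceph/zoo/users/<USER>/videos/example.mp4"   # or VIDEO_DIR for the all-events script

# Less frequently edited
CONFIG_FILE="/ceph/zoo/users/sminano/cluster_tracking_config.yaml"
GIT_BRANCH=main

echo "Tracking ${VIDEO_PATH} with model ${TRAINED_MODEL_PATH}"
```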

5. **Other inference options**

   By default, inference saves the tracking output to a CSV file. Other options can be enabled via CLI arguments:

   - `save_video`: saves a video with the tracked bounding boxes drawn on every frame.
   - `save_frames`: saves the individual frames corresponding to the CSV output. This is needed if we want to correct the tracking labels.

   Additionally, if we have ground truth for the video, we can pass it in to compute a tracking evaluation:

   - `GT_PATH`

We can add all these arguments in the bash script, for example:

```
detect-and-track-video \
 --trained_model_path $TRAINED_MODEL_PATH \
 --video_path $VIDEO_PATH \
 --config_file $CONFIG_FILE \
 --gt_path $GT_PATH \
 --device $DEVICE \
 --save_video \
 --save_frames
```

6. **Run the inference job using the SLURM scheduler**

To launch a job, use the `sbatch` command with the relevant inference bash script:

```
sbatch <path-to-inference-bash-script>
```

7. **Check the status of the inference job**

To do this, we can:

- Check the SLURM logs: these should be created automatically in the directory from which the `sbatch` command is run.
- Run supporting SLURM commands (see [below](#some-useful-slurm-commands)).
- Check the MLFlow logs. To do this, first create or activate an existing conda environment with `mlflow` installed, and then run the `mlflow` command from the login node.

- Create and activate a conda environment.
```
module load miniconda
conda create -n mlflow-env python=3.10 mlflow -y
conda activate mlflow-env
```
- Run `mlflow` to visualise the results logged to the `ml-runs` folder.

- If using the "scratch" folder:

```
mlflow ui --backend-store-uri file:///ceph/zoo/users/sminano/ml-runs-all/ml-runs-scratch
```

- If using the selected runs folder:

```
mlflow ui --backend-store-uri file:///ceph/zoo/users/sminano/ml-runs-all/ml-runs
```

### Some useful SLURM commands

To check the status of your jobs in the queue

```
squeue -u <username>
```

To show details of the latest jobs (including completed or cancelled jobs)

```
sacct -X -u <username>
```

To specify columns to display use `--format` (e.g., `Elapsed`)

```
sacct -X --format="JobID, JobName, Partition, Account, State, Elapsed" -u <username>
```

To check specific jobs by ID

```
sacct -X -j 3813494,3813184
```

To check the time limit of the jobs submitted by a user (for example, `sminano`)

```
squeue -u sminano --format="%i %P %j %u %T %l %C %S"
```

To cancel a job

```
scancel <jobID>
```