Bash script for running inference on clips #251

Merged
merged 23 commits on Nov 20, 2024
Changes from all commits (23 commits)
db0f8f8
Add draft bash script for running inference on clips
sfmig Nov 14, 2024
c99fa57
Add tracking config
sfmig Nov 14, 2024
462e5cf
Add paths to data and copy config
sfmig Nov 14, 2024
2281de5
Adjust number of jobs in array
sfmig Nov 14, 2024
ce88024
Fix config copy
sfmig Nov 14, 2024
b40ea35
Remove groundtruth annotations!
sfmig Nov 14, 2024
2727e1c
Point to branch with option to remove timestamp in output directory
sfmig Nov 15, 2024
0700e52
Replace parameter expansion for if clauses (bc not working as expected)
sfmig Nov 15, 2024
fce51ab
Fix copying tracking config
sfmig Nov 15, 2024
f0dca32
Save config using video name
sfmig Nov 15, 2024
f3c13d6
Fix copied config path
sfmig Nov 15, 2024
53ae541
Change path to all escape clips and change array job setting to 234 v…
sfmig Nov 15, 2024
780d2d2
Add removal of environment at the end of the script
sfmig Nov 15, 2024
07f849d
Fix precommit
sfmig Nov 15, 2024
323ee82
Add conda deactivate
sfmig Nov 15, 2024
4cbb67d
Add deletion of virtual environment to other bash scripts
sfmig Nov 19, 2024
f65829c
Get Nik's guide from PR 189 to update
sfmig Nov 19, 2024
69343c4
Update guide
sfmig Nov 19, 2024
26cde4b
Small fix to evaluate guide
sfmig Nov 19, 2024
e964410
Remove old version of guide for inference
sfmig Nov 19, 2024
92a10bf
Update git branch to main (should work after PR 253 is merged)
sfmig Nov 19, 2024
a742e5d
Remove source bash step from all bash scripts
sfmig Nov 20, 2024
acb9ca6
Clarify array job syntax
sfmig Nov 20, 2024
172 changes: 172 additions & 0 deletions bash_scripts/run_detect_and_track_array.sh
@@ -0,0 +1,172 @@
#!/bin/bash

#SBATCH -p gpu # partition
#SBATCH --gres=gpu:1 # For any GPU: --gres=gpu:1. For a specific one: --gres=gpu:rtx5000
#SBATCH -N 1 # number of nodes
#SBATCH --ntasks-per-node 8 # max number of tasks per node
#SBATCH --mem 32G # memory pool for all cores
#SBATCH -t 3-00:00 # time (D-HH:MM)
#SBATCH -o slurm_array.%A-%a.%N.out
#SBATCH -e slurm_array.%A-%a.%N.err
#SBATCH --mail-type=ALL
#SBATCH [email protected]
#SBATCH --array=0-233%25


# NOTE on SBATCH command for array jobs
# with "SBATCH --array=0-n%m" ---> runs n separate jobs, but not more than m at a time.
# the number of array jobs should match the number of input files


# memory
# see https://pytorch.org/docs/stable/notes/cuda.html#environment-variables
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# -----------------------------
# Error settings for bash
# -----------------------------
# see https://wizardzines.com/comics/bash-errors/
set -e # do not continue after errors
set -u # throw error if variable is unset
set -o pipefail # make the pipe fail if any part of it fails

# ---------------------
# Define variables
# ----------------------

# Path to the trained model
CKPT_PATH="/ceph/zoo/users/sminano/ml-runs-all/ml-runs/317777717624044570/40b1688a76d94bd08175cb380d0a6e0e/checkpoints/last.ckpt"

# Path to the tracking config file
TRACKING_CONFIG_FILE="/ceph/zoo/users/sminano/cluster_tracking_config.yaml"

# List of videos to run inference on: define VIDEOS_DIR and VIDEO_FILENAME
# NOTE: keep both variables quoted (here and where they are passed to `find` below),
# so that paths with spaces are handled correctly and the wildcard in VIDEO_FILENAME
# is expanded by `find` (against VIDEOS_DIR) rather than by the shell (against the current directory).
# e.g.: VIDEO_FILENAME="*.mov"
# e.g.: VIDEO_FILENAME="*escape*.mov" (to select a subset of videos)
VIDEOS_DIR="/ceph/zoo/users/sminano/escape_clips_all"
VIDEO_FILENAME="*.mov"
mapfile -t LIST_VIDEOS < <(find "$VIDEOS_DIR" -type f -name "$VIDEO_FILENAME")


# Set output directory name
# by default under current working directory
OUTPUT_DIR_NAME="tracking_output_slurm_$SLURM_ARRAY_JOB_ID"

# Select optional output
SAVE_VIDEO=true
SAVE_FRAMES=false


# version of the codebase
GIT_BRANCH=main

# --------------------
# Check inputs
# --------------------
# Check that the number of input videos matches SLURM_ARRAY_TASK_COUNT
# if not, exit
if [[ $SLURM_ARRAY_TASK_COUNT -ne ${#LIST_VIDEOS[@]} ]]; then
echo "The number of array tasks does not match the number of input videos"
exit 1
fi

# -----------------------------
# Create virtual environment
# -----------------------------
module load miniconda

# Define an environment for each job in the
# temporary directory of the compute node
ENV_NAME=crabs-dev-$SLURM_ARRAY_JOB_ID-$SLURM_ARRAY_TASK_ID
ENV_PREFIX=$TMPDIR/$ENV_NAME

# create environment
conda create \
--prefix $ENV_PREFIX \
-y \
python=3.10

# activate environment
source activate $ENV_PREFIX

# install crabs package in virtual env
python -m pip install git+https://github.com/SainsburyWellcomeCentre/crabs-exploration.git@$GIT_BRANCH

# log pip and python locations
echo $ENV_PREFIX
which python
which pip

# print the version of crabs package (last number is the commit hash)
echo "Git branch: $GIT_BRANCH"
conda list crabs
echo "-----"

# ------------------------------------
# GPU specs
# ------------------------------------
echo "Memory used per GPU before training"
echo $(nvidia-smi --query-gpu=name,memory.total,memory.free,memory.used --format=csv) #noheader
echo "-----"


# -------------------------
# Run detect+track script
# -------------------------
# video used in this job
INPUT_VIDEO=${LIST_VIDEOS[${SLURM_ARRAY_TASK_ID}]}

echo "Running inference on $INPUT_VIDEO using trained model at $CKPT_PATH"

# Set flags based on boolean variables
if [ "$SAVE_FRAMES" = "true" ]; then
SAVE_FRAMES_FLAG="--save_frames"
else
SAVE_FRAMES_FLAG=""
fi

if [ "$SAVE_VIDEO" = "true" ]; then
SAVE_VIDEO_FLAG="--save_video"
else
SAVE_VIDEO_FLAG=""
fi

# run detect-and-track command
# - to save all results from the array job in the same output directory
# we use --output_dir_no_timestamp
# - the output directory is created under SLURM_SUBMIT_DIR by default
detect-and-track-video \
--trained_model_path $CKPT_PATH \
--video_path $INPUT_VIDEO \
--config_file $TRACKING_CONFIG_FILE \
--output_dir $OUTPUT_DIR_NAME \
--output_dir_no_timestamp \
--accelerator gpu \
$SAVE_FRAMES_FLAG \
$SAVE_VIDEO_FLAG



# copy tracking config to output directory
shopt -s extglob # Enable extended globbing

# get input video filename without extension
INPUT_VIDEO_NO_EXT="${INPUT_VIDEO##*/}"
INPUT_VIDEO_NO_EXT="${INPUT_VIDEO_NO_EXT%.*}"

cp "$TRACKING_CONFIG_FILE" "$SLURM_SUBMIT_DIR"/"$OUTPUT_DIR_NAME"/"$INPUT_VIDEO_NO_EXT"_config.yaml


echo "Copied $TRACKING_CONFIG_FILE to $OUTPUT_DIR_NAME"


# -----------------------------
# Delete virtual environment
# ----------------------------
conda deactivate
conda remove \
--prefix $ENV_PREFIX \
--all \
-y
16 changes: 10 additions & 6 deletions bash_scripts/run_evaluation_array.sh
@@ -17,12 +17,6 @@
# with "SBATCH --array=0-n%m" ---> runs n separate jobs, but not more than m at a time.
# the number of array jobs should match the number of input files

# ---------------------
# Source bashrc
# ----------------------
# Otherwise `which python` points to the miniconda module's Python
# source ~/.bashrc


# memory
# see https://pytorch.org/docs/stable/notes/cuda.html#environment-variables
@@ -144,3 +138,13 @@ evaluate-detector \
--mlflow_folder $MLFLOW_FOLDER \
$USE_TEST_SET_FLAG
echo "-----"


# -----------------------------
# Delete virtual environment
# ----------------------------
conda deactivate
conda remove \
--prefix $ENV_PREFIX \
--all \
-y
15 changes: 9 additions & 6 deletions bash_scripts/run_training_array.sh
@@ -17,12 +17,6 @@
# with "SBATCH --array=0-n%m" ---> runs n separate jobs, but not more than m at a time.
# the number of array jobs should match the number of input files

# ---------------------
# Source bashrc
# ----------------------
# Otherwise `which python` points to the miniconda module's Python
# source ~/.bashrc


# memory
# see https://pytorch.org/docs/stable/notes/cuda.html#environment-variables
@@ -115,3 +109,12 @@ train-detector \
--experiment_name $EXPERIMENT_NAME \
--seed_n $SPLIT_SEED \
--mlflow_folder $MLFLOW_FOLDER \

# -----------------------------
# Delete virtual environment
# ----------------------------
conda deactivate
conda remove \
--prefix $ENV_PREFIX \
--all \
-y
16 changes: 9 additions & 7 deletions bash_scripts/run_training_single.sh
@@ -12,13 +12,6 @@
#SBATCH [email protected]


# ---------------------
# Source bashrc
# ----------------------
# Otherwise `which python` points to the miniconda module's Python
# source ~/.bashrc


# memory
# see https://pytorch.org/docs/stable/notes/cuda.html#environment-variables
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
@@ -101,3 +94,12 @@ train-detector \
--experiment_name $EXPERIMENT_NAME \
--seed_n $SPLIT_SEED \
--mlflow_folder $MLFLOW_FOLDER \

# -----------------------------
# Delete virtual environment
# ----------------------------
conda deactivate
conda remove \
--prefix $ENV_PREFIX \
--all \
-y
114 changes: 114 additions & 0 deletions guides/DetectAndTrackHPC.md
@@ -0,0 +1,114 @@
# Run detection and tracking over a set of videos in the cluster

1. **Preparatory steps**

- If you are not connected to the SWC network: connect to the SWC VPN.

2. **Connect to the SWC HPC cluster**

```
ssh <SWC-USERNAME>@ssh.swc.ucl.ac.uk
ssh hpc-gw1
```

It may ask for your password twice. To set up SSH keys for the SWC cluster, see [this guide](https://howto.neuroinformatics.dev/programming/SSH-SWC-cluster.html#ssh-keys).

3. **Download the detect+track script from the 🦀 repository**

To do so, run the following command, which will download a bash script called `run_detect_and_track_array.sh` to the current working directory.
```
curl https://raw.githubusercontent.com/SainsburyWellcomeCentre/crabs-exploration/main/bash_scripts/run_detect_and_track_array.sh > run_detect_and_track_array.sh
```

This bash script launches a SLURM array job that runs detection and tracking on an array of videos. The version of the bash script downloaded is the one at the tip of the `main` branch in the [🦀 repository](https://github.com/SainsburyWellcomeCentre/crabs-exploration).


> [!TIP]
> To retrieve a version of the file that is different from the file at the tip of `main`, edit the remote file path in the `curl` command:
>
> - For example, to download the version of the file at the tip of a branch called `<BRANCH-NAME>`, edit the path above to replace `main` with `<BRANCH-NAME>`:
> ```
> https://raw.githubusercontent.com/SainsburyWellcomeCentre/crabs-exploration/<BRANCH-NAME>/bash_scripts/run_detect_and_track_array.sh
> ```
> - To download the version of the file at a specific commit, replace `main` with the commit hash `<COMMIT-HASH>`:
> ```
> https://raw.githubusercontent.com/SainsburyWellcomeCentre/crabs-exploration/<COMMIT-HASH>/bash_scripts/run_detect_and_track_array.sh
> ```

4. **Edit the bash script if required**

Ideally, we won't make major edits to the bash scripts. If we find we do, then we may want to consider moving the relevant parameters to the config file, or making them a CLI argument.

When launching an array job, we may want to edit the following variables in the detect+track bash script:
- The `CKPT_PATH` variable, which is the path to the trained detector model.
- The `VIDEOS_DIR` variable, which defines the path to the videos directory.
- The `VIDEO_FILENAME` variable, which allows us to define a wildcard expression to select a subset of videos in the directory. See the examples in the bash script comments for the syntax.
- Remember that the number of videos to run inference on needs to match the number of jobs in the array. To change the number of jobs in the array job, edit the line that starts with `#SBATCH --array=0-n%m` and set `n` to the total number of jobs minus 1. The variable `m` refers to the number of jobs that can be run at a time. See the example below.
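
For example, a hypothetical edit for a directory containing 10 `.mov` clips might look like the following sketch (the checkpoint and video paths are illustrative):

```
# Path to the trained model (illustrative)
CKPT_PATH="/ceph/zoo/users/<USERNAME>/ml-runs-all/ml-runs/<EXPERIMENT-ID>/<RUN-ID>/checkpoints/last.ckpt"

# Videos to run inference on (illustrative)
VIDEOS_DIR="/ceph/zoo/users/<USERNAME>/escape_clips_subset"
VIDEO_FILENAME="*.mov"
```

and, since there are 10 input videos, the array line would be set to run 10 tasks (at most 5 at a time):

```
#SBATCH --array=0-9%5
```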


Less frequently, one may need to edit:
- the `TRACKING_CONFIG_FILE`, which is the path to the tracking config to use. Usually we point to the file at `/ceph/zoo/users/sminano/cluster_tracking_config.yaml`, which we can edit.
- the `OUTPUT_DIR_NAME`, the name of the output directory in which to save the results. By default it is created under the current working directory and named `tracking_output_slurm_<SLURM_ARRAY_JOB_ID>` (with `SLURM_ARRAY_JOB_ID` being the job ID of the array job).
- the `SAVE_VIDEO` variable, which can be `true` or `false` depending on whether we want to save the tracked videos or not. Usually set to `true`.
- the `SAVE_FRAMES` variable, which can be `true` or `false` depending on whether we want to save the full set of (untracked) frames per video or not. Usually set to `false`.
- the `GIT_BRANCH`, if we want to use a specific version of the 🦀 package. Usually we will run the version of the 🦀 package in `main`.

Currently, there is no option to pass a list of ground truth annotations that matches the set of videos analysed.

> [!CAUTION]
>
> If we launch a job and then modify the config file _before_ the job has read it, the job may use an undesired version of the config! To avoid this, it is best not to edit the config file until you have verified that the running job picked up the expected parameters (and only then edit the file and launch a new job if needed).
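
One way to do this check, once the job has started, is to inspect the copy of the tracking config that the script saves alongside the results, for example:

```
cat tracking_output_slurm_<SLURM-ARRAY-JOB-ID>/<VIDEO-NAME>_config.yaml
```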


5. **Run the job using the SLURM scheduler**

To launch a job, use the `sbatch` command with the detect+track bash script:

```
sbatch <path-to-detect-and-track-bash-script>
```
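
For example, if the script was downloaded to the current working directory as in step 3:

```
sbatch run_detect_and_track_array.sh
```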

6. **Check the status of the job**

To do this, we can:

- Check the SLURM logs: these should be created automatically in the directory from which the `sbatch` command is run (see the example after this list).
- Run supporting SLURM commands (see [below](#some-useful-slurm-commands)).
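
For example, given the log file pattern set in the bash script (`slurm_array.%A-%a.%N.out` and the equivalent `.err`), the output log of a specific array task can be inspected with:

```
cat slurm_array.<JOB-ID>-<TASK-ID>.<NODE>.out
```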

### Some useful SLURM commands

To check the status of your jobs in the queue

```
squeue -u <username>
```

To show details of the latest jobs (including completed or cancelled jobs)

```
sacct -X -u <username>
```

To specify columns to display use `--format` (e.g., `Elapsed`)

```
sacct -X --format="JobID, JobName, Partition, Account, State, Elapsed" -u <username>
```

To check specific jobs by ID

```
sacct -X -j 3813494,3813184
```

To check the time limit of the jobs submitted by a user (for example, `sminano`)

```
squeue -u sminano --format="%i %P %j %u %T %l %C %S"
```

To cancel a job

```
scancel <jobID>
```
4 changes: 2 additions & 2 deletions guides/EvaluatingModelsHPC.md
@@ -24,7 +24,7 @@


> [!TIP]
> To retrieve a version of these files that is different from the files at the tip of `main`, edit the remote file path in the curl command:
> To retrieve a version of the file that is different from the file at the tip of `main`, edit the remote file path in the `curl` command:
>
> - For example, to download the version of the file at the tip of a branch called `<BRANCH-NAME>`, edit the path above to replace `main` with `<BRANCH-NAME>`:
> ```
@@ -42,7 +42,7 @@
When launching an array job, we may want to edit the following variables in the bash script:

- The `MLFLOW_CKPTS_FOLDER` and `CKPT_FILENAME` variables, which define the trained models we would like to evaluate. See the examples in the bash script comments for the syntax.
- The number of trained models to evaluate needs to match the number of jobs in the array. To change the number of jobs in the array job, edit the line that start with `#SBATCH --array=0-n%m`. That command specifies to run `n` separate jobs, but not more than `m` at a time.
- The number of trained models to evaluate needs to match the number of jobs in the array. To change the number of jobs in the array job, edit the line that starts with `#SBATCH --array=0-n%m` and set `n` to the total number of jobs minus 1. The variable `m` refers to the number of jobs that can be run at a time.
- The `MLFLOW_FOLDER`. By default, we point to the "scratch" folder at `/ceph/zoo/users/sminano/ml-runs-all/ml-runs-scratch` . This folder holds runs that we don't need to keep. For runs we would like to keep, we will instead point to the folder at `/ceph/zoo/users/sminano/ml-runs-all/ml-runs`.

Less frequently, one may need to edit:
2 changes: 1 addition & 1 deletion guides/TrainingModelsHPC.md
@@ -65,7 +65,7 @@

Additionally for an array job, one may want to edit the number of jobs in the array (by default set to 3):

- this would mean editing the line that start with `#SBATCH --array=0-n%m` in the `run_training_array.sh` script. That command specifies to run `n` separate jobs, but not more than `m` at a time.
- this would mean editing the line that starts with `#SBATCH --array=0-n%m` in the `run_training_array.sh` script. You will need to set `n` to the total number of jobs minus 1. The variable `m` refers to the number of jobs that can be run at a time.
- if the number of jobs in the array is edited, the variable `LIST_SEEDS` needs to be modified accordingly, otherwise we will get an error when launching the job.
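
For instance, a hypothetical edit to run 5 array jobs (at most 3 at a time) over 5 seeds might pair the following two lines; the exact `LIST_SEEDS` syntax may differ slightly in the script:

```
#SBATCH --array=0-4%3
LIST_SEEDS=(42 43 44 45 46)
```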

1. **Edit the config YAML file if required**