Add MLflow logs to evaluate job #220

Open · wants to merge 17 commits into base: main
8 changes: 6 additions & 2 deletions README.md
@@ -118,12 +118,16 @@ evaluate-detector --trained_model_path <path-to-ckpt-file>

This command assumes the trained detector model (a `.ckpt` checkpoint file) is saved in an MLflow database structure. That is, the checkpoint is assumed to be under a `checkpoints` directory, which in turn should be under a `<mlflow-experiment-hash>/<mlflow-run-hash>` directory. This will be the case if the model has been trained using the `train-detector` command.
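
For reference, a sketch of the expected layout is shown below (the hash names are placeholders, and the checkpoint filename will vary):

```
<path-to-ml-runs>/
└── <mlflow-experiment-hash>/
    └── <mlflow-run-hash>/
        └── checkpoints/
            └── <checkpoint-name>.ckpt
```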

The `evaluate-detector` command will print to screen the average precision and average recall of the detector on the test set. It will also log those metrics to the MLflow database, along with the hyperparameters of the evaluation job. To visualise the MLflow summary of the evaluation job, run:
The `evaluate-detector` command will print to screen the average precision and average recall of the detector on the validation set by default. To evaluate the model on the test set instead, use the `--use_test_set` flag.
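
For example, to evaluate a trained model on the test split (using the same checkpoint placeholder as above):
```
evaluate-detector --trained_model_path <path-to-ckpt-file> --use_test_set
```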

The command will also log those performance metrics to the MLflow database, along with the hyperparameters of the evaluation job. To visualise the MLflow summary of the evaluation job, run:
```
mlflow ui --backend-store-uri file:///<path-to-ml-runs>
```
where `<path-to-ml-runs>` is the path to the directory containing the MLflow output.

The evaluated samples can be inspected visually by exporting them with the `--save_frames` flag. In this case, the frames with the predicted and ground-truth bounding boxes are saved in a directory called `evaluation_output_<timestamp>` under the current working directory.
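
For example, to export the evaluated frames to a custom directory (`--frames_output_dir` is optional; if it is omitted, the default directory described above is used):
```
evaluate-detector --trained_model_path <path-to-ckpt-file> --save_frames --frames_output_dir <path-to-output-dir>
```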

To see the full list of possible arguments to the `evaluate-detector` command, run it with the `--help` flag.

### Run detector+tracking on a video
@@ -134,7 +138,7 @@ To track crabs in a new video, using a trained detector and a tracker, run the f
detect-and-track-video --trained_model_path <path-to-ckpt-file> --video_path <path-to-input-video>
```

This will produce a `tracking_output_<timestamp>` directory with the output from tracking.
This will produce a `tracking_output_<timestamp>` directory under the current working directory, containing the output from tracking.

The tracking output consists of:
- a .csv file named `<video-name>_tracks.csv`, with the tracked bounding boxes data;
116 changes: 116 additions & 0 deletions bash_scripts/run_evaluation_array.sh
@@ -0,0 +1,116 @@
#!/bin/bash

#SBATCH -p gpu # a100 # partition
#SBATCH --gres=gpu:1 # gpu:a100_2g.10gb # For any GPU: --gres=gpu:1. For a specific one: --gres=gpu:rtx5000
#SBATCH -N 1 # number of nodes
#SBATCH --ntasks-per-node 8 # 2 # max number of tasks per node
#SBATCH --mem 32G # memory pool for all cores
#SBATCH -t 3-00:00 # time (D-HH:MM)
#SBATCH -o slurm_array.%A-%a.%N.out
#SBATCH -e slurm_array.%A-%a.%N.err
#SBATCH --mail-type=ALL
#SBATCH [email protected]
#SBATCH --array=0-2%3


# NOTE on the SBATCH directive for array jobs:
# "SBATCH --array=0-n%m" runs n+1 separate jobs (task IDs 0 to n), but not more than m at a time.
# The number of array tasks should match the number of input files.
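# e.g. "--array=0-2%3" launches 3 tasks (IDs 0, 1 and 2), with at most 3 running at a time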

# ---------------------
# Source bashrc
# ----------------------
# Otherwise `which python` points to the miniconda module's Python
source ~/.bashrc


# PyTorch CUDA memory allocator settings
# see https://pytorch.org/docs/stable/notes/cuda.html#environment-variables
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# -----------------------------
# Error settings for bash
# -----------------------------
# see https://wizardzines.com/comics/bash-errors/
set -e # do not continue after errors
set -u # throw error if variable is unset
set -o pipefail # make the pipe fail if any part of it fails

# ---------------------
# Define variables
# ----------------------

# List of models to evaluate
MLFLOW_CKPTS_FOLDER=/ceph/zoo/users/sminano/ml-runs-all/ml-runs/317777717624044570/fe9a6c2f491a4496aade5034c75316cc/checkpoints
LIST_CKPT_FILES=("$MLFLOW_CKPTS_FOLDER"/*.ckpt)

# selected model
CKPT_PATH=${LIST_CKPT_FILES[${SLURM_ARRAY_TASK_ID}]}

# destination mlflow folder
# EXPERIMENT_NAME="Sept2023" ----> get from training job
MLFLOW_FOLDER=/ceph/zoo/users/sminano/ml-runs-all/ml-runs

# version of the codebase
GIT_BRANCH=main

# --------------------
# Check inputs
# --------------------
# Check that the number of input .ckpt files matches SLURM_ARRAY_TASK_COUNT;
# if not, exit
if [[ $SLURM_ARRAY_TASK_COUNT -ne ${#LIST_CKPT_FILES[@]} ]]; then
echo "The number of array tasks does not match the number of .ckpt files"
exit 1
fi

# -----------------------------
# Create virtual environment
# -----------------------------
module load miniconda

# Define an environment for each array task in the
# temporary directory of the compute node
ENV_NAME=crabs-dev-$SLURM_ARRAY_JOB_ID-$SLURM_ARRAY_TASK_ID
ENV_PREFIX=$TMPDIR/$ENV_NAME

# create environment
conda create \
--prefix $ENV_PREFIX \
-y \
python=3.10

# activate environment
conda activate $ENV_PREFIX

# install crabs package in virtual env
python -m pip install git+https://github.com/SainsburyWellcomeCentre/crabs-exploration.git@$GIT_BRANCH


# log pip and python locations
echo $ENV_PREFIX
which python
which pip

# print the version of crabs package (last number is the commit hash)
echo "Git branch: $GIT_BRANCH"
conda list crabs
echo "-----"

# ------------------------------------
# GPU specs
# ------------------------------------
echo "Memory used per GPU before training"
echo $(nvidia-smi --query-gpu=name,memory.total,memory.free,memory.used --format=csv) #noheader
echo "-----"


# -------------------
# Run evaluation script
# -------------------
echo "Evaluating trained model at $CKPT_PATH: "
evaluate-detector \
--trained_model_path $CKPT_PATH \
--accelerator gpu \
--mlflow_folder $MLFLOW_FOLDER
echo "-----"
144 changes: 105 additions & 39 deletions crabs/detector/evaluate_model.py
@@ -4,6 +4,7 @@
import logging
import os
import sys
from pathlib import Path

import lightning
import torch
@@ -20,9 +21,13 @@
get_cli_arg_from_ckpt,
get_config_from_ckpt,
get_img_directories_from_ckpt,
get_mlflow_experiment_name_from_ckpt,
get_mlflow_parameters_from_ckpt,
)
from crabs.detector.utils.visualization import save_images_with_boxes

logging.getLogger().setLevel(logging.INFO)


class DetectorEvaluate:
"""Interface for evaluating an object detector.
@@ -39,10 +44,17 @@ def __init__(self, args: argparse.Namespace) -> None:
# CLI inputs
self.args = args

# trained model
# trained model data
self.trained_model_path = args.trained_model_path
trained_model_params = get_mlflow_parameters_from_ckpt(
self.trained_model_path
)
self.trained_model_run_name = trained_model_params["run_name"]
self.trained_model_expt_name = trained_model_params[
"cli_args/experiment_name"
]

# config: retreieve from ckpt if not passed as CLI argument
# config: retrieve from ckpt if not passed as CLI argument
self.config_file = args.config_file
self.config = get_config_from_ckpt(
config_file=self.config_file,
@@ -61,28 +73,38 @@ def __init__(self, args: argparse.Namespace) -> None:
cli_arg_str="seed_n",
trained_model_path=self.trained_model_path,
)
self.evaluation_split = "test" if self.args.use_test_set else "val"

# Hardware
self.accelerator = args.accelerator

# MLflow
self.experiment_name = args.experiment_name
# MLflow experiment name and run name
self.experiment_name = get_mlflow_experiment_name_from_ckpt(
args=self.args, trained_model_path=self.trained_model_path
)
self.run_name = set_mlflow_run_name()
self.mlflow_folder = args.mlflow_folder

# Debugging
# Debugging settings
self.fast_dev_run = args.fast_dev_run
self.limit_test_batches = args.limit_test_batches

# Log dataset information to screen
logging.info("Dataset")
logging.info(f"Images directories: {self.images_dirs}")
logging.info(f"Annotation files: {self.annotation_files}")
logging.info(f"Seed: {self.seed_n}")
logging.info("---------------------------------")

# Log MLflow information to screen
logging.info("MLflow logs for current job")
logging.info(f"Experiment name: {self.experiment_name}")
logging.info(f"Run name: {self.run_name}")
logging.info(f"Folder: {Path(self.mlflow_folder).resolve()}")
logging.info("---------------------------------")

def setup_trainer(self):
"""Set up trainer object with logging for testing."""
# Assign run name
self.run_name = set_mlflow_run_name()

# Setup logger
mlf_logger = setup_mlflow_logger(
experiment_name=self.experiment_name,
@@ -91,6 +113,25 @@ def setup_trainer(self):
cli_args=self.args,
)

# Add trained model section to MLflow hyperparameters
mlf_logger.log_hyperparams(
{
"trained_model/experiment_name": self.trained_model_expt_name,
"trained_model/run_name": self.trained_model_run_name,
"trained_model/ckpt_file": Path(self.trained_model_path).name,
}
)

# Add dataset section to MLflow hyperparameters
mlf_logger.log_hyperparams(
{
"dataset/images_dir": self.images_dirs,
"dataset/annotation_files": self.annotation_files,
"dataset/seed": self.seed_n,
"dataset/evaluation_split": self.evaluation_split,
}
)

# Return trainer linked to logger
return lightning.Trainer(
accelerator=self.accelerator,
@@ -107,26 +148,42 @@ def evaluate_model(self) -> None:
list_annotation_files=self.annotation_files,
split_seed=self.seed_n,
config=self.config,
no_data_augmentation=True,
)

# Get trained model
trained_model = FasterRCNN.load_from_checkpoint(
self.trained_model_path, config=self.config
)

# Run testing
# Evaluate model on either the validation or the test split
trainer = self.setup_trainer()
trainer.test(
trained_model,
data_module,
)
if self.args.use_test_set:
trainer.test(
trained_model,
data_module,
)
else:
trainer.validate(
trained_model,
data_module,
)

# Save images if required
# Save images with bounding boxes if required
if self.args.save_frames:
# get relevant dataloader
if self.args.use_test_set:
eval_dataloader = data_module.test_dataloader()
else:
eval_dataloader = data_module.val_dataloader()

save_images_with_boxes(
test_dataloader=data_module.test_dataloader(),
dataloader=eval_dataloader,
trained_model=trained_model,
output_dir=self.args.frames_output_dir,
output_dir=str(
Path(self.args.frames_output_dir)
/ f"evaluation_output_{self.evaluation_split}"
),
score_threshold=self.args.frames_score_threshold,
)

@@ -205,7 +262,14 @@ def evaluate_parse_args(args):
"the trained model is used."
),
)

parser.add_argument(
"--use_test_set",
action="store_true",
help=(
"Evaluate the model on the test split, rather than on the default "
"validation split."
),
)
parser.add_argument(
"--accelerator",
type=str,
@@ -220,35 +284,20 @@
parser.add_argument(
"--experiment_name",
type=str,
default="Sept2023_evaluation",
help=(
"Name of the experiment in MLflow, under which the current run "
"will be logged. "
"For example, the name of the dataset could be used, to group "
"runs using the same data. "
"Default: Sept2023_evaluation"
),
)
parser.add_argument(
"--fast_dev_run",
action="store_true",
help="Debugging option to run training for one batch and one epoch",
)
parser.add_argument(
"--limit_test_batches",
type=float,
default=1.0,
help=(
"Debugging option to run training on a fraction of "
"the training set."
"Default: 1.0 (all the training set)"
"By default: <trained_model_mlflow_experiment_name>_evaluation."
),
)
parser.add_argument(
"--mlflow_folder",
type=str,
default="./ml-runs",
help=("Path to MLflow directory. Default: ./ml-runs"),
help=(
"Path to MLflow directory where to log the evaluation data. "
"Default: ./ml-runs"
),
)
parser.add_argument(
"--save_frames",
Expand All @@ -269,12 +318,29 @@ def evaluate_parse_args(args):
type=str,
default="",
help=(
"Output directory for the exported frames. "
"Output directory for the evaluated frames, with bounding boxes. "
"Predicted boxes are plotted in red, and ground-truth boxes in "
"green. "
"By default, the frames are saved in a "
"`results_<timestamp> folder "
"`evaluation_output_<timestamp> folder "
"under the current working directory."
),
)
parser.add_argument(
"--fast_dev_run",
action="store_true",
help="Debugging option to run training for one batch and one epoch",
)
parser.add_argument(
"--limit_test_batches",
type=float,
default=1.0,
help=(
"Debugging option to run training on a fraction of "
"the training set."
"Default: 1.0 (all the training set)"
),
)
return parser.parse_args(args)

