Hyperparameter optimization for MLJ machine learning models.
Note: This component of the MLJ stack applies to MLJ versions 0.8.0 and higher. Prior to 0.8.0, tuning algorithms resided in MLJ.
This repository is not intended to be directly imported by the general MLJ user. Rather, MLJTuning is a dependency of the MLJ machine learning platform, which allows MLJ users to perform a variety of hyperparameter optimization tasks from there.
MLJTuning is the place for developers to integrate hyperparameter optimization algorithms (here called tuning strategies) into MLJ, either by adding code to /src/strategies, or by importing MLJTuning into a third-party package and implementing MLJTuning's tuning strategy interface.
MLJTuning is a component of the MLJ stack which does not have MLJModels as a dependency (no ability to search and load registered MLJ models). It does however depend on MLJBase and, in particular, on the resampling functionality currently residing there.
This repository contains:
- a tuning wrapper called `TunedModel` for transforming arbitrary MLJ models into "self-tuning" ones - that is, into models which, when fit, automatically optimize a specified subset of the original hyperparameters (using cross-validation or other resampling strategy) before training the optimal model on all supplied data

- an abstract tuning strategy interface to allow developers to conveniently implement common hyperparameter optimization strategies, such as:

  - search a list of explicitly specified models `list = [model1, model2, ...]`
  - grid search
  - Latin hypercubes
  - random search
  - simulated annealing
  - Bayesian optimization using Gaussian processes
  - structured tree Parzen estimators
  - multi-objective (Pareto) optimization
  - genetic algorithms
  - AD-powered gradient descent methods

- a selection of implementations of the tuning strategy interface, currently all those accessible from MLJ itself

- the code defining the MLJ functions `learning_curves!` and `learning_curve`, as these are essentially one-dimensional grid searches
This document assumes familiarity with the Evaluating Model Performance and Performance Measures sections of the MLJ manual. Tuning itself, from the user's perspective, is described in Tuning Models.
What follows is an overview of tuning in MLJ. After the overview is an elaboration on those terms given in italics.
All tuning in MLJ is conceptualized as an iterative procedure, each iteration corresponding to a performance evaluation of a single model. Each such model is a mutated clone of a fixed prototype. In the general case, this prototype is a composite model, i.e., a model with other models as hyperparameters, and while the type of the prototype is fixed under mutation, the types of its sub-models are allowed to vary.
When all iterations of the algorithm are complete, the optimal model is selected based entirely on a history generated according to the specified tuning strategy. Iterations are generally performed in batches, which are evaluated in parallel (sequential tuning strategies degenerating into semi-sequential strategies, unless the batch size is one). At the beginning of each batch, both the history and an internal state object are consulted, and, on the basis of the tuning strategy, a new batch of models to be evaluated is generated. On the basis of these evaluations, and the strategy, the history and internal state are updated.
The tuning algorithm initializes the state object before iterations begin, on the basis of the specific strategy and a user-specified range object.
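To make the flow concrete, here is a hedged sketch of the loop just described (illustrative pseudocode only; the helpers `evaluate_in_parallel` and `update_history!` are hypothetical and not part of the MLJTuning API):

```julia
# Illustrative pseudocode for the tuning loop described above.
# `evaluate_in_parallel` and `update_history!` are hypothetical helpers.
function tune(prototype, tuning, range, n, verbosity)
    history = nothing
    state = setup(tuning, prototype, range, verbosity)    # strategy-specific initialization
    while length(something(history, [])) < n
        batch = models!(tuning, prototype, history, state, verbosity)
        (batch === nothing || isempty(batch)) && break    # supply of models exhausted
        evaluations = evaluate_in_parallel(batch)         # resampling-based evaluations
        history = update_history!(history, tuning, batch, evaluations)
    end
    return best(tuning, history), tuning_report(tuning, history, state)
end
```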
- Recall that in MLJ a *model* is an object storing the hyperparameters of some learning algorithm indicated by the name of the model type (e.g., `DecisionTreeRegressor`). Models do not store learned parameters.

- An *evaluation* is the value returned by some call to the `evaluate!` method, when passed the resampling strategy (e.g., `CV(nfolds=9)`) and the performance measures specified by the user when specifying the tuning task (e.g., `cross_entropy`, `accuracy`). Recall that such a value is a named tuple of vectors with keys `measure`, `measurement`, `per_fold`, and `per_observation` (see the sketch after this list). See Evaluating Model Performance for details. Recall also that some measures in MLJ (e.g., `cross_entropy`) report a loss (or score) for each provided observation, while others (e.g., `auc`) report only an aggregated value (the `per_observation` entries being recorded as `missing`). This and other behavior can be inspected using trait functions. Do `info(rms)` to view the trait values for the `rms` loss, for example, and see Performance Measures for details.

- The *history* is a vector of tuples generated by the tuning algorithm - one tuple per iteration - used to determine the optimal model, and which also records other user-inspectable statistics that may be of interest - for example, evaluations of a measure (loss or score) different from the one being explicitly optimized. Each tuple is of the form `(m, r)`, where `m` is a model instance and `r` is information about `m` extracted from an evaluation.

- A *tuning strategy* is an instance of some subtype `S <: TuningStrategy`, the name `S` (e.g., `Grid`) indicating the tuning algorithm to be applied. The fields of the tuning strategy - called *tuning hyperparameters* - are those tuning parameters specific to the strategy that do not refer to specific models or specific model hyperparameters. So, for example, a default resolution to be used in a grid search is a hyperparameter of `Grid`, but the resolution to be applied to a specific hyperparameter (such as the maximum depth of a decision tree) is not. This latter parameter would be part of the user-specified range object.

- A *range* is any object whose specification completes the specification of the tuning task, after the prototype, tuning strategy, resampling strategy, performance measure(s), and total iteration count are given - and is essentially the space of models to be searched. This definition is intentionally broad and the interface places no restriction on the allowed types of this object. For the range objects supported by the `Grid` strategy, see below.
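For orientation, here is a hedged sketch of what such an evaluation might look like for a probabilistic classifier (the call and the reported values are illustrative only, and the numbers are made up):

```julia
using MLJBase

# Illustrative only; `model`, `X` and `y` are assumed to be defined elsewhere:
# e = evaluate(model, X, y,
#              resampling=CV(nfolds=9),
#              measure=[cross_entropy, accuracy])
#
# e.measure          # [cross_entropy, accuracy]
# e.measurement      # aggregated measurement for each measure, e.g., [0.42, 0.87]
# e.per_fold         # one vector of per-fold measurements per measure
# e.per_observation  # per-observation losses, or `missing` for aggregate-only measures
```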
Recall, for context, that in MLJ tuning is implemented as a model wrapper. A model is tuned by fitting the wrapped model to data (which also trains the optimal model on all available data). To use the optimal model one predicts using the wrapped model. For more detail, see the Tuning Models section of the MLJ manual.
In setting up a tuning task, the user constructs an instance of the `TunedModel` wrapper type, which has these principal fields:

- `model`: the prototype model instance mutated during tuning (the model being wrapped)

- `tuning`: the tuning strategy, an instance of a concrete `TuningStrategy` subtype, such as `Grid`

- `resampling`: the resampling strategy used for performance evaluations, which must be an instance of a concrete `ResamplingStrategy` subtype, such as `Holdout` or `CV`

- `measure`: a measure (loss or score) or vector of measures available to the tuning algorithm, the first of which is optimized in the common case of single-objective tuning strategies

- `range`: as defined above - roughly, the space of models to be searched

- `n`: the number of iterations (number of distinct models to be evaluated)

- `acceleration`: the computational resources to be applied (e.g., `CPUProcesses()` for distributed computing and `CPUThreads()` for multi-threaded processing)

- `acceleration_resampling`: the computational resources to be applied at the level of resampling (e.g., in cross-validation)
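To make this concrete, here is a minimal sketch of constructing such a wrapper (hedged: `tree` stands for any MLJ supervised model instance, and the particular hyperparameter, bounds and measure are illustrative):

```julia
using MLJ

# `tree` is assumed to be some MLJ model instance defined elsewhere, e.g. a
# DecisionTreeRegressor; the hyperparameter and bounds below are illustrative.
r = range(tree, :max_depth, lower=1, upper=10)

self_tuning_tree = TunedModel(model=tree,
                              tuning=Grid(resolution=10),
                              resampling=CV(nfolds=6),
                              measure=rms,
                              range=r)
```

Fitting a machine bound to `self_tuning_tree` then carries out the tuning and retrains the optimal model on all supplied data, as described above.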
Several functions are part of the tuning strategy API:
- `setup`: for initialization of state (compulsory)

- `result`: for building each element of the history

- `models!`: for generating batches of new models and updating the state (compulsory)

- `best`: for extracting the entry in the history corresponding to the optimal model from the full history

- `tuning_report`: for selecting what to report to the user apart from details on the optimal model

- `default_n`: to specify the number of models to be evaluated when `n` is not specified by the user
**Important note on the history.** The initialization and update of the history is carried out internally, i.e., is not the responsibility of the tuning strategy implementation. The history is always initialized to `nothing`, rather than an empty vector.
The above functions are discussed further below, after discussing types.
Each tuning algorithm must define a subtype of `TuningStrategy` whose
fields are the hyperparameters controlling the strategy that do not
directly refer to models or model hyperparameters. These would
include, for example, the default resolution of a grid search, or the
initial temperature in simulated annealing.
The algorithm implementation must include a keyword constructor with defaults. Here's an example:
```julia
mutable struct Grid <: TuningStrategy
    goal::Union{Nothing,Int}
    resolution::Int
    shuffle::Bool
    rng::Random.AbstractRNG
end

# Constructor with keywords
Grid(; goal=nothing, resolution=10, shuffle=true,
     rng=Random.GLOBAL_RNG) =
    Grid(goal, resolution, MLJBase.shuffle_and_rng(shuffle, rng)...)
```
Generally, new types are defined for each class of range object a tuning strategy would like to handle, and the tuning strategy functions to be implemented are dispatched on these types. Here are the range objects supported by `Grid`:

- one-dimensional `NumericRange` or `NominalRange` objects (these types are provided by MLJBase)

- a tuple `(p, r)`, where `p` is one of the above range objects and `r` is a resolution to override the default `resolution` of the strategy

- vectors of objects of the above form, e.g., `[r1, (r2, 5), r3]`, where `r1` and `r2` are `NumericRange` objects and `r3` is a `NominalRange` object.
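For example, assuming a model instance `model` with hyperparameters named `lambda`, `gamma` and `kernel` (illustrative names only), such ranges might be constructed as follows:

```julia
using MLJBase

# Illustrative hyperparameter names; `model` is assumed defined elsewhere:
r1 = range(model, :lambda, lower=0.001, upper=10, scale=:log)  # NumericRange
r2 = range(model, :gamma,  lower=0.1,   upper=1.0)             # NumericRange
r3 = range(model, :kernel, values=[:linear, :rbf])             # NominalRange

# A composite range acceptable to the Grid strategy, overriding the default
# resolution for r2 with a resolution of 5:
grid_range = [r1, (r2, 5), r3]
```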
Recall that `NominalRange` has a `values` field, while `NumericRange` has the fields `upper`, `lower`, `scale`, `unit` and `origin`. The `unit` field specifies a preferred length scale, while `origin` specifies a preferred "central value". These default to `(upper - lower)/2` and `(upper + lower)/2`, respectively, in the bounded case (neither `upper = Inf` nor `lower = -Inf`). The fields `origin` and `unit` are used in generating grids for unbounded ranges (and could be used in other strategies - for fitting two-parameter probability distributions, for example).
A `ParamRange` object is always associated with the name of a hyperparameter (a field of the prototype in the context of tuning), which is recorded in its `field` attribute, but for composite models this might be a "nested name", such as `:(atom.max_depth)`.
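For instance, when tuning an ensemble wrapped around an atomic model (a hedged sketch: `EnsembleModel` with an `atom` field is as used in MLJ's homogeneous ensembling, while the particular bounds are illustrative):

```julia
using MLJ

# `tree` is assumed to be some atomic model instance defined elsewhere:
forest = EnsembleModel(atom=tree, n=100)

# The range's `field` attribute is then the nested name :(atom.max_depth):
r = range(forest, :(atom.max_depth), lower=1, upper=10)
```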
```julia
MLJTuning.result(tuning::MyTuningStrategy, history, e)
```

This method is for extracting from an evaluation `e` of some model `m` the value of `r` to be recorded in the corresponding tuple `(m, r)` of the history. The value of `r` is also allowed to depend on previous events in the history. The fallback is:

```julia
MLJTuning.result(tuning, history, e) = (measure=e.measure, measurement=e.measurement)
```
Note in this case that the result is always a named tuple of vectors, since multiple measures can be specified (and singleton measures provided by the user are promoted to vectors with a single element).
The history must contain everything needed for the `best` method to determine the optimal model, and everything needed by the `tuning_report` method, which generates a report on tuning for the user (for use in visualization, for example). These methods are detailed below.
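As an illustration, a strategy that wants per-fold results available in the history as well might overload the fallback along these lines (a hedged sketch, not part of the built-in fallbacks):

```julia
# Hedged sketch: also record per-fold measurements, so that `best` or
# `tuning_report` can make use of them later.
MLJTuning.result(tuning::MyTuningStrategy, history, e) =
    (measure=e.measure, measurement=e.measurement, per_fold=e.per_fold)
```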
```julia
state = setup(tuning::MyTuningStrategy, model, range, verbosity)
```
The `setup` function is for initializing the `state` of the tuning algorithm (needed by the algorithm's `models!` method; see below). Be sure to make this object mutable if it needs to be updated by the `models!` method. The `state` generally stores, at the least, the range or some processed version thereof. In momentum-based gradient descent, for example, the state would include the previous hyperparameter gradients, while in GP Bayesian optimization, it would store the (evolving) Gaussian processes.

If a variable is to be reported as part of the user-inspectable history, then it should be written to the history instead of stored in the state. An example of this might be the temperature in simulated annealing.
The `verbosity` is an integer indicating the level of logging: `0` means logging should be restricted to warnings, while `-1` means completely silent.
The fallback for `setup` is:

```julia
setup(tuning::TuningStrategy, model, range, verbosity) = range
```
However, a tuning strategy will generally want to implement a `setup` method for each range type it is going to support:

```julia
MLJTuning.setup(tuning::MyTuningStrategy, model, range::RangeType1, verbosity) = ...
MLJTuning.setup(tuning::MyTuningStrategy, model, range::RangeType2, verbosity) = ...
```

etc.
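For instance, a strategy whose state is simply a shuffled copy of a vector-of-models range might define (a hedged sketch; `MyTuningStrategy` and its `rng` field are hypothetical):

```julia
import Random

# Hedged sketch: the state is a shuffled copy of an explicit list of models.
MLJTuning.setup(tuning::MyTuningStrategy, model, range::AbstractVector, verbosity) =
    Random.shuffle(tuning.rng, collect(range))
```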
```julia
MLJTuning.models!(tuning::MyTuningStrategy, model, history, state, verbosity)
```
This is the core method of a new implementation. Given the existing `history` and `state`, it must return a vector ("batch") of new model instances to be evaluated. Any number of models can be returned (including an empty vector or `nothing`, if models have been exhausted) and the evaluations will be performed in parallel (using the mode of parallelization defined by the `acceleration` field of the `TunedModel` instance). An update of the history, performed automatically under the hood, only occurs after these evaluations. Most sequential tuning strategies will want to include the batch size as a hyperparameter, which we suggest they call `batch_size`, but this field is not part of the tuning interface. In tuning, whatever models are returned by `models!` get evaluated in parallel.
In the `Grid` tuning strategy, for example, `models!` returns a random selection of `n - length(history)` models from the grid, so that `models!` is called only once (in each call to `MLJBase.fit(::TunedModel, ...)` or `MLJBase.update(::TunedModel, ...)`). In a bona fide sequential method which generates models non-deterministically (such as simulated annealing), `models!` might return a single model, or return a small batch of models to make use of parallelization (the method becoming "semi-sequential" in that case). In sequential methods that generate new models deterministically (such as those choosing models that optimize the expected improvement of a surrogate statistical model), `models!` would return a single model.
If the tuning algorithm exhausts its supply of new models (because, for example, there is only a finite supply), then `models!` should return an empty vector. Under the hood, there is no fixed "batch-size" parameter, and the tuning algorithm is happy to receive any number of models.
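As a sketch, a semi-sequential strategy with a hypothetical `batch_size` field might implement `models!` along these lines (`propose_model` is a hypothetical helper that consults the history and the mutable state):

```julia
# Hedged sketch: return a small batch of newly proposed models for parallel
# evaluation; `propose_model` is a hypothetical helper.
function MLJTuning.models!(tuning::MyTuningStrategy, model, history, state, verbosity)
    return [propose_model(tuning, model, history, state) for _ in 1:tuning.batch_size]
end
```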
```julia
MLJTuning.best(tuning::MyTuningStrategy, history)
```
Returns the entry `(best_model, r)` from the history corresponding to the optimal model `best_model`.

A fallback, whose definition is given below, may be used, provided the fallback for `result` detailed above has not been overloaded. In this fallback for `best`, the best model is the one optimizing performance estimates for the first measure in the `TunedModel` field `measure`:
```julia
function best(tuning::TuningStrategy, history)
    measurements = [h[2].measurement[1] for h in history]
    measure = first(history)[2].measure[1]
    if orientation(measure) == :score
        measurements = -measurements
    end
    best_index = argmin(measurements)
    return history[best_index]
end
```
As with any model, fitting a `TunedModel` instance generates a user-accessible report. In the case of tuning, the report is constructed with this code:

```julia
report = merge((best_model=best_model, best_result=best_result, best_report=best_report,),
               tuning_report(tuning, history, state))
```
where:
- `best_model` is the optimal model instance

- `best_result` is the corresponding "result" entry in the history (e.g., performance evaluation)

- `best_report` is the report generated by fitting the optimal model

- `tuning_report(::MyTuningStrategy, ...)` is a method the implementer may overload. It should return a named tuple. The fallback is to return the raw history:
```julia
MLJTuning.tuning_report(tuning, history, state) = (history=history,)
```
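For instance, a strategy could expose additional diagnostics alongside the history (a hedged sketch; the `temperatures` field of the state is hypothetical):

```julia
# Hedged sketch: report the history together with strategy-specific diagnostics.
MLJTuning.tuning_report(tuning::MyTuningStrategy, history, state) =
    (history=history, temperatures=state.temperatures)
```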
```julia
MLJTuning.default_n(tuning::MyTuningStrategy, range)
```
The `models!` method (which is allowed to return multiple models) is called until a history of length `n` has been built, or `models!` returns an empty list or `nothing`. If the user does not specify a value for `n` when constructing her `TunedModel` object, then `n` is set to `default_n(tuning, range)` at construction, where `range` is the user-specified range.
The fallback is:

```julia
MLJTuning.default_n(::TuningStrategy, range) = 10
```
The most rudimentary tuning strategy just evaluates every model in a specified list of models sharing a common type, such lists constituting the only kind of supported range. (In this special case `range` is an arbitrary iterator of models, which are `Probabilistic` or `Deterministic`, according to the type of the prototype `model`, which is otherwise ignored.) The fallback implementations for `setup`, `result`, `best` and `tuning_report` suffice. In particular, there is no distinction between `range` and `state` in this case.
Here's the complete implementation:
```julia
import MLJBase

mutable struct Explicit <: TuningStrategy end

# models! returns all available models in the range at once:
MLJTuning.models!(tuning::Explicit, model, history::Nothing,
                  state, verbosity) = state
MLJTuning.models!(tuning::Explicit, model, history,
                  state, verbosity) = state[length(history) + 1:end]

function MLJTuning.default_n(tuning::Explicit, range)
    try
        length(range)
    catch
        10  # `range` is an iterator with no `length` method
    end
end
```
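For context, the strategy above might be used along these lines (a hedged usage sketch; `model1`, `model2`, `X` and `y` are assumed to be defined elsewhere):

```julia
# Hedged usage sketch: tune over an explicit list of models.
tuned = TunedModel(model=model1,
                   tuning=Explicit(),
                   range=[model1, model2],
                   resampling=CV(nfolds=6),
                   measure=rms)
mach = machine(tuned, X, y)
fit!(mach)
```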
For a slightly less trivial example, see /src/strategies/grid.jl.