Hyperparameter optimization for MLJ machine learning models.
See Tuning Models · MLJ for usage examples.
- Who is this repo for?
- What's provided here?
- How do I implement a new tuning strategy?
- How do I implement a new selection heuristic?
Note: This component of the MLJ stack applies to MLJ versions 0.8.0 and higher. Prior to 0.8.0, tuning algorithms resided in MLJ.
This repository is not intended to be directly imported by the general MLJ user. Rather, MLJTuning is a dependency of the MLJ machine learning platform, which allows MLJ users to perform a variety of hyperparameter optimization tasks from there.
MLJTuning is the place for developers to integrate hyperparameter optimization algorithms (here called tuning strategies) into MLJ, either by adding code to /src/strategies, or by importing MLJTuning into a third-party package and implementing MLJTuning's tuning strategy interface.
MLJTuning is a component of the MLJ stack which does not have MLJModels as a dependency (no ability to search and load registered MLJ models). It does however depend on MLJBase and, in particular, on the resampling functionality currently residing there.
This repository contains:
-
a tuning wrapper called
TunedModel
for transforming arbitrary MLJ models into "self-tuning" ones - that is, into models which, when fit, automatically optimize a specified subset of the original hyperparameters (using cross-validation or other resampling strategy) before training the optimal model on all supplied data -
an abstract tuning strategy interface to allow developers to conveniently implement common hyperparameter optimization strategies, such as:
-
search models generated by an arbitrary iterator, eg
models = [model1, model2, ...]
(built-inExplicit
strategy) -
grid search (built-in
Grid
strategy) -
Latin hypercubes (built-in
LatinHypercube
strategy, interfacing the LatinHypercubeSampling package) -
random search (built-in
RandomSearch
strategy) -
particle swarm optimization (MLJParticleOptimization.jl)
-
bandit
-
simulated annealing
-
Bayesian optimization using Gaussian processes
-
structured tree Parzen estimators (
MLJTreeParzenTuning
from TreeParzen.jl) -
multi-objective (Pareto) optimization
-
genetic algorithms
-
AD-powered gradient descent methods
-
-
a selection of implementations of the tuning strategy interface, currently all those accessible from MLJ itself.
-
the code defining the MLJ functions
learning_curves!
andlearning_curve
as these are essentially one-dimensional grid searches
This document assumes familiarity with the Evaluating Model Performance and Performance Measures sections of the MLJ manual. Tuning itself, from the user's perspective, is described in Tuning Models.
What follows is an overview of tuning in MLJ. After the overview is an elaboration on those terms given in italics.
All tuning in MLJ is conceptualized as an iterative procedure, each iteration corresponding to a performance evaluation of a single model. Each such model is a mutated clone of a fixed prototype. In the general case, this prototype is a composite model, i.e., a model with other models as hyperparameters, and while the type of the prototype mutations is fixed, the types of the sub-models are allowed to vary.
When all iterations of the algorithm are complete, the optimal model is selected by applying a selection heuristic to a history generated according to the specified tuning strategy. Iterations are generally performed in batches, which are evaluated in parallel (sequential tuning strategies degenerating into semi-sequential strategies, unless the batch size is one). At the beginning of each batch, both the history and an internal state object are consulted, and, on the basis of the tuning strategy, a new batch of models to be evaluated is generated. On the basis of these evaluations, and the strategy, the history and internal state are updated.
The tuning algorithm initializes the state object before iterations begin, on the basis of the specific strategy and a user-specified range object.
-
Recall that in MLJ a model is an object storing the hyperparameters of some learning algorithm indicated by the name of the model type (e.g.,
DecisionTreeRegressor
). Models do not store learned parameters. -
An evaluation is an object
E
returned by some call to theevaluate!
method, when passed the resampling strategy (e.g.,resampling=CV(nfolds=9)
) and a battery of user-specified performance measures (e.g.,measures=[cross_entropy, accuracy]
). An evaluation objectE
contains a list of measuresE.measure
and a list of corresponding measurementsE.measurement
, each of which is the aggregrate of measurements for each resampling of the data, which are stored inE.per_fold
(a vector of vectors). In the case of a measure that reports a value for each individual observation (to obtain the per-fold measurement, by aggregation) the per-observation values can be retrieved fromE.per_observation
. This last object includesmissing
entries for measures that do not report per-observation values (reports_per_observation(measure) = false
) such asauc
. See Evaluating Model Performance for details. There is a trait for measures calledorientation
which is:loss
for measures you ordinarily want to minimize, and:score
for those you want to maximize. See Performance measures for further details. -
A tuning strategy is an instance of some subtype
S <: TuningStrategy
, the nameS
(e.g.,Grid
) indicating the tuning (optimization) algorithm to be applied. The fields of the tuning strategy - called tuning hyperparameters - are those tuning parameters specific to the strategy that do not refer to specific models or specific model hyperparameters. So, for example, a default resolution to be used in a grid search is a hyperparameter ofGrid
, but the resolution to be applied to a specific hyperparameter (such as the maximum depth of a decision tree) is not. This latter parameter would be part of the user-specified range object. A multi-objective tuning strategy is one that consults the measurements of allmeasures
specified by the user; otherwise only the first measure is consulted, although measurements for all measures are nevertheless reported. -
A selection heuristic is a rule describing how the outcomes of the model evaluations will be used to select the best (optimal) model, after all iterations of the optimizer have concluded. For example, the default
NaiveSelection()
heuristic simply selects the model whose evaluationE
has the smallest or largestE.measurement[1]
value, according to whether the metricE.measure[1]
is a:loss
or:score
. Most heuristics are generic in the sense they will apply no matter what tuning strategy is applied. A selection heuristic supported by a multi-objective tuning strategy must select some "best" model (e.g., a random Pareto optimal solution). -
The history is a vector of identically-keyed named tuples, one tuple per iteration. The tuple keys include:
-
model
: for the MLJ model instance that has been evaluated -
measure
,measurement
,per_fold
: for storing the values ofE.measure
,E.measurement
andE.per_fold
, whereE
is the corresponding evaluation object. -
metadata
: for any tuning strategy-specific information required to be recorded in the history but not intended to be reported to the user (for example an implementation-specific representation ofmodel
).
There may be additional keys for tuning-specific information that is to be reported to the user (such as temperature in simulated annhealing).
-
-
A range is any object whose specification completes the specification of the tuning task, after the prototype, tuning strategy, resampling strategy, performance measure(s), selection heuristic, and total iteration count are given. This definition is intentionally broad and the interface places no restriction on the allowed types of this object, although all strategies should support the one-dimensional range objects defined in
MLJBase
(see below). Generally, a range may be understood as the "space" of models being searched plus strategy-specific data explaining how models from that space are actually to be generated (e.g., grid resolutions or probability distributions specific to particular hyper-parameters). For more on range types see Range types below.
Recall, for context, that in MLJ tuning is implemented as a model wrapper. A model is tuned by fitting the wrapped model to data (which also trains the optimal model on all available data). This process determines the optimal model, as defined by the selection heuristic (see above). To use the optimal model one predicts using the wrapped model. For more detail, see the Tuning Models section of the MLJ manual.
In setting up a tuning task, the user constructs an instance of the
TunedModel
wrapper type, which has these principal fields:
-
model
: the prototype model instance mutated during tuning (the model being wrapped) -
tuning
: the tuning strategy, an instance of a concreteTuningStrategy
subtype, such asGrid
-
resampling
: the resampling strategy used for performance evaluations, which must be an instance of a concreteResamplingStrategy
subtype, such asHoldout
orCV
-
measure
: a measure (loss or score) or vector of measures available to the tuning algorithm, the first of which is optimized in the common case of single-objective tuning strategies -
selection_heuristic
: some instance ofSelectionHeuristic
, such asNaiveSelection()
(default) -
range
: as defined above - roughly, the space of models to be searched -
n
: the number of iterations, which is the number of distinct model evaluations that will be added to the history, unless the tuning strategy's supply of models is exhausted (e.g.,Grid
). This is not to be confused with an iteration count specific to the tuning strategy (e.g., Particle Swarm Optimization). -
acceleration
: the computational resources to be applied (e.g.,CPUProcesses()
for distributed computing andCPUThreads()
for multi-threaded processing) -
acceleration_resampling
: the computational resources to be applied at the level of resampling (e.g., in cross-validation)
As sample implementations, see /src/strategies/
Several functions are part of the tuning strategy API:
-
clean!
: for validating and resetting invalid fields in tuning strategy (optional) -
setup
: for initialization of state (compulsory) -
extras
: for declaring and formatting additional user-inspectable information going into the history -
tuning_report
: for declaring any other strategy-specific information to report to the user (optional) -
models
: for generating batches of new models and updating the state (compulsory) -
default_n
: to specify the total number of models to be evaluated whenn
is not specified by the user -
supports_heuristic
: a trait used to encode which selection heuristics are supported by the tuning strategy (only needed if you define a strategy-specific heuristic)
Important note on the history. The initialization and update of the
history is carried out internally, i.e., is not the responsibility of
the tuning strategy implementation. The history is always initialized to
nothing
, rather than an empty vector.
The above functions are discussed further below, after discussing types.
Each tuning algorithm must define a subtype of TuningStrategy
whose
fields are the hyperparameters controlling the strategy that do not
directly refer to models or model hyperparameters. These would
include, for example, the default resolution of a grid search, or the
initial temperature in simulated annealing.
The algorithm implementation must include a keyword constructor with defaults. Here's an example:
mutable struct Grid <: TuningStrategy
goal::Union{Nothing,Int}
resolution::Int
shuffle::Bool
rng::Random.AbstractRNG
end
# Constructor with keywords
Grid(; goal=nothing, resolution=10, shuffle=true,
rng=Random.GLOBAL_RNG) =
Grid(goal, resolution, MLJBase.shuffle_and_rng(shuffle, rng)...)
Generally new types are defined for each class of range object a tuning strategy should like to handle, and the tuning strategy functions to be implemented are dispatched on these types. It is recommended that every tuning strategy support at least these types:
-
one-dimensional ranges
r
, wherer
is aMLJBase.ParamRange
instance -
(optional) pairs of the form
(r, data)
, wheredata
is extra hyper-parameter-specific information, such as a resolution in a grid search, or a distribution in a random search -
abstract vectors whose elements are of the above form
Recall that ParamRange
has two concrete subtypes NumericRange
and
NominalRange
, whose instances are constructed with the MLJBase
extension to the range
function.
Note in particular that a NominalRange
has a values
field, while
NumericRange
has the fields upper
, lower
, scale
, unit
and
origin
. The unit
field specifies a preferred length scale, while
origin
a preferred "central value". These default to (upper - lower)/2
and (upper + lower)/2
, respectively, in the bounded case
(neither upper = Inf
nor lower = -Inf
). The fields origin
and
unit
are used in generating grids or fitting probability
distributions to unbounded ranges.
A ParamRange
object is always associated with the name of a
hyperparameter (a field of the prototype in the context of tuning)
which is recorded in its field
attribute, a Symbol
, but for
composite models this might be a be an Expr
, such as
:(atom.max_depth)
.
Use the iterator
and sampler
methods to convert ranges into
one-dimensional grids or for random sampling, respectively. See the
tuning
section
of the MLJ manual or doc-strings for more on these methods and the
Grid
and RandomSearch
implementations.
MLJTuning.clean!(tuning::MyTuningStrategy)
Some tuning strategies have mutable fields that only support specific set of
values: a particle swarm strategy, for instance, should have at least three
agents for the algorithm to work. As such, it is recommended to implement the
clean!
method to warn the user and correct invalid tuning hyperparameters.
The method should return a string message if some fields have been reset or an
empty string otherwise, and will be called internally whenever a TunedModel
machine is fit!
.
The default fallback for clean!
returns an empty string.
state = setup(tuning::MyTuningStrategy, model, range, n, verbosity)
The setup
function is for initializing the state
of the tuning
algorithm (available to the models
method). Be sure to make this
object mutable if it needs to be updated by the models
method. The
arguments model
and n
are what the user has specified in their
TunedModel
instance; recall model
is the prototype to be cloned
and mutated, and n
the total number of mutations to be generated.
The state
is a place to record the outcomes of any necessary
intialization of the tuning algorithm (performed by setup
) and a
place for the models
method to save and read transient information
that does not need to be recorded in the history.
The setup
function is called once only, when a TunedModel
machine
is fit!
the first time, and not on subsequent calls (unless
force=true
). (Specifically, MLJBase.fit(::TunedModel, ...)
calls
setup
but MLJBase.update(::TunedModel, ...)
does not.)
The verbosity
is an integer indicating the level of logging: 0
means logging should be restricted to warnings, -1
, means completely
silent.
The fallback for setup
is:
setup(tuning::TuningStrategy, model, range, n, verbosity) = range
However, a tuning strategy will generally want to implement a setup
method for each range type it is going to support:
MLJTuning.setup(tuning::MyTuningStrategy, model, range::RangeType1, n, verbosity) = ...
MLJTuning.setup(tuning::MyTuningStrategy, model, range::RangeType2, n, verbosity) = ...
etc.
MLJTuning.extras(tuning::MyTuningStrategy, history, state, E) -> named_tuple
This method should return any user-inspectable information to be
included in a new history entry, that is in addition to the model
,
measures
, measurement
and per_fold
data.
This method must return a named tuple,
human readable if possible. Each key of the
returned named tuple becomes a key of the new history entry.
Here E
is the full evalutation object for model
and history
the
current history (before adding the new entry).
The fallback for extras
returns an empty named tuple.
MLJTuning.models(tuning::MyTuningStrategy, model, history, state, n_remaining, verbosity)
-> vector_of_models, new_state
This is the core method of a new implementation. Given the existing
history
and state
, it must return a vector ("batch") of new
model instances vector_of_models
to be evaluated, and the updated
state
. Any number of models may be returned (and this includes an
empty vector or nothing
) and the evaluations will be performed in
parallel (using the mode of parallelization defined by the
acceleration
field of the TunedModel
instance).
Important note. Parallelization means the order in which the
history
gets extended after models(...)
returns its list of new
candidates is generally not the same order in which the candidates
are returned. Some implementations may therefore need to attach extra
"labeling" metadata to each model, as explained below, so that the
existing history can be suitably interpreted.
If more models are returned than needed (because including them would
create a history whose length exceeds the user-specified number of
iterations tuned_model.n
) then the surplus models are saved, for use
in a "warm
restart"
of tuning, when the user increases tuned_model.n
. The remaining
models are then evaluated and these evaluations are added to the
history. In any warm restart, no new call to models
will be made
until all saved models have been evaluated, and these evaluations
added to the history.
If the tuning algorithm exhausts it's supply of new models (because,
for example, there is only a finite supply, as in a Grid
search)
then vector_of_models
should be an empty vector or nothing
. The
interface has no fixed "batch-size" parameter, and the tuning
algorithm is happy to receive any number of models; a surplus is
handled as explained above, a shortfall will trigger an early stop
(so that the final history has length less than tuned_model.n
).
If needed, extra metadata may be attached to each model returned; see below.
Sequential tuning strategies generating models non-deterministically
(e.g., simulated annealing) might choose to include a batch size
hyperparameter, and arrange that models
returns batches of the
specified size (to be evaluated in parallel when acceleration
is set
appropriately). However, the evaluations and history updates do not
occur until after the models
call, so it may be complicated or
impossible to preserve the original (strictly) sequential algorithm in
that case, which should be clearly documented.
Some simple tuning strategies, such as RandomSearch
, will want to
return as many models as possible in one hit. To this end, the
variable n_remaining
is passed to the models
call; this is the
difference between the current length of the history and
tuned_model.n
.
If a tuning strategy implementation needs to record additional
metadata in the history, for each model generated, then instead of
model instances, vector_of_models
should be vector of tuples of the
form (m, metadata)
, where m
is a model instance, and metadata
the associated data. To access the metadata for the j
th element of
the existing history, use history[j].metadata
.
As with any model, fitting a TunedModel
instance generates a
user-accessible report. Note that the fallback report already includes
additions to the history created by the extras
method mentioned
above. To add more strategy-specific information to the report, one
overloads tuning_report
.
Specically, the report generated by fitting a TunedModel
is
constructed with this code:
report1 = (best_model = best_model,
best_history_entry = best_user_history_entry,
history = user_history,
best_report = best_report)
report = merge(report1, tuning_report(tuning, history, state))
where:
-
best_model
is the best model instance (as selected according to the user-specifiedselection heuristic
). -
best_user_history
is the corresponding entry in the history withmetadata
removed. -
best_report
is the report generated when fitting thebest_model
on all available data. -
user_history
is the full history withmetadata
entries removed. -
tuning_report(::MyTuningStrategy, ...)
is a method the implementer may overload that ***must return a named tuple, preferably human readable
The fallback for tuning_report
returns an empty named-tuple.
MLJTuning.default_n(tuning::MyTuningStrategy, range)
The models
method, which is allowed to return multiple models in
it's first return value vector_of_models
, is called until one of the
following occurs:
-
The length of the history matches the number of iterations specified by the user, namely
tuned_model.n
wheretuned_model
is the user'sTunedModel
instance. Iftuned_model.n
isnothing
(because the user has not specified a value) thendefault_n(tuning, range)
is used instead. -
vector_of_models
is empty ornothing
.
The fallback is
default_n(tuning::TuningStrategy, range) = DEFAULT_N
where DEFAULT_N
is a global constant. Do using MLJTuning; MLJTuning.DEFAULT_N
to see check the current value.
If you define a selection heuristic SpecialHeuristic
(see
below) and that
heuristic is specific to a tuning strategy TuningStrategy
then you
must define
MLJTuning.supports_heuristic(::TuningStrategy, ::SpecialHeuristic) = true
A number of built-in tuning strategy implementations of the MLJTuning API can be found at /src/strategies.
The simplest Explicit
strategy is the simplest but is also an odd
case as the range
is just an iterator of MLJ models. These models
need not share a common type and model
is never cloned.
For slightly less trivial example, see
the Grid
search code
Recall that a selection heuristic is a rule which decides on the
"best model" given the model evaluations in the tuning history. New
heuristics are introduced by defining a new struct SomeHeuristic
subtyping
SelectionHeuristic
and implementing a method
MLJTuning.best(heuristic::SomeHeuristic, history) -> history_entry
where history_entry
is the entry in the history corresponding to the model deemed "best".
Below is a simplified version of code
defining the default heuristic NaiveSelection()
which simply chooses the model with the lowest (or highest) aggregated
performance estimate, based on the first measure specified by the user
in his TunedModel
construction (she may specify more than one).
struct NaiveSelection <: MLJTuning.SelectionHeuristic end
function best(heuristic::NaiveSelection, history)
measurements = [h.measurement[1] for h in history]
measure = first(history).measure[1]
if orientation(measure) == :score
measurements = -measurements
end
best_index = argmin(measurements)
return history[best_index]
end
Because this selection heuristic is generic (applies to all tuning strategies) we additionally define
MLJTuning.supports_heuristic(strategy, heuristic::NaiveSelection) = true
For strategy-specific selection heuristics, see above on how to set this trait.