A package providing composite models wrapping class imbalance algorithms from Imbalance.jl with classifiers from MLJ.
import Pkg;
Pkg.add("MLJBalancing")
This package allows chaining of resampling methods from Imbalance.jl with classification models from MLJ. Simply construct a BalancedModel object, specifying the model (classifier) and an arbitrary number of resamplers (also called balancers; typically oversamplers and/or undersamplers).
using MLJ, MLJBalancing

# load the resamplers and the classifier
SMOTENC = @load SMOTENC pkg=Imbalance verbosity=0
TomekUndersampler = @load TomekUndersampler pkg=Imbalance verbosity=0
LogisticClassifier = @load LogisticClassifier pkg=MLJLinearModels verbosity=0

oversampler = SMOTENC(k=5, ratios=1.0, rng=42)
undersampler = TomekUndersampler(min_ratios=0.5, rng=42)
logistic_model = LogisticClassifier()
balanced_model = BalancedModel(model=logistic_model, balancer1=oversampler, balancer2=undersampler)
Here, training data will be passed to balancer1, then balancer2, whose output is used to train the classifier model. When balanced_model is used for prediction, the resamplers balancer1 and balancer2 are bypassed.
In general, any number of balancers can be passed to BalancedModel, and the user can give the balancers arbitrary names while passing them, as in the sketch below.
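For instance, here is a minimal sketch passing the same balancers under made-up keyword names (my_oversampler and my_undersampler are arbitrary labels chosen for this example, not special keywords; named_balanced_model is likewise a hypothetical variable name):

named_balanced_model = BalancedModel(
    model=logistic_model,
    my_oversampler=oversampler,     # applied to the training data first
    my_undersampler=undersampler,   # then this one, before fitting the classifier
)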
You can fit, predict, cross-validate and hyperparameter-tune it like any other MLJ model. Here is an example of hyperparameter tuning:
# tune over the oversampler's k and the undersampler's min_ratios jointly
r1 = range(balanced_model, :(balancer1.k), lower=3, upper=10)
r2 = range(balanced_model, :(balancer2.min_ratios), lower=0.1, upper=0.9)
tuned_balanced_model = TunedModel(
    model=balanced_model,
    tuning=Grid(goal=4),
    resampling=CV(nfolds=4),
    range=[r1, r2],
    measure=cross_entropy,
);
mach = machine(tuned_balanced_model, X, y);
fit!(mach, verbosity=0);
fitted_params(mach).best_model
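Cross-validation works the same way; here is a minimal sketch using MLJ's evaluate, assuming X and y are as above (nfolds=5 is an arbitrary choice for illustration):

evaluate(balanced_model, X, y, resampling=CV(nfolds=5), measure=cross_entropy)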
The package also offers an implementation of bagging over probabilistic classifiers, where the majority class is repeatedly undersampled T times, down to the size of the minority class. This undersampling scheme was proposed in the EasyEnsemble algorithm from the paper "Exploratory Undersampling for Class-Imbalance Learning" by Xu-Ying Liu, Jianxin Wu, and Zhi-Hua Zhou, where an AdaBoost model was used and the output scores were averaged.
To use it, you must specify a probabilistic model, and may optionally specify the number of bags T and the random number generator rng. If T is not specified, it is set to the ratio between the majority and minority counts. If rng isn't specified, default_rng() is used.
import Random

LogisticClassifier = @load LogisticClassifier pkg=MLJLinearModels verbosity=0
logistic_model = LogisticClassifier()
bagging_model = BalancedBaggingClassifier(model=logistic_model, T=10, rng=Random.Xoshiro(42))
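If you are happy with the defaults described above (T inferred from the majority/minority class counts, and rng = default_rng()), a sketch relying on them would be:

bagging_model = BalancedBaggingClassifier(model=logistic_model)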
You can fit, predict, cross-validate and hyperparameter-tune it like any other probabilistic MLJ model, where X must be a table input (e.g., a dataframe).
mach = machine(bagging_model, X, y)
fit!(mach)
pred = predict(mach, X)
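Because the model is probabilistic, predict returns distributions over the classes. To get point predictions instead, you can use MLJ's predict_mode; a minimal sketch:

pred_mode = predict_mode(mach, X)  # most probable class for each row of X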