Ensemble Learning
Ensemble learning is a popular technique that combines multiple weak learners into a single strong learner.
Let's understand it with an example:
When we want to purchase an iPad, we don't simply walk into a store or go online and buy the first one we see. The common practice is to compare different models on their features, specifications, prices and online reviews. We also take advice from friends or colleagues and only then come to a conclusion.
From this example, you can infer that we make better decisions by considering inputs from different sources. The same is true in machine learning: a diverse set of models usually outperforms a single model, and that is exactly what ensemble learning achieves. This approach produces better predictive performance than any single model, which is why ensemble methods have placed first in many prestigious machine learning competitions, such as the Netflix Competition, KDD 2009 and Kaggle.
Bias-Variance trade-off:
In machine learning, the choice of model is extremely important for getting good results. That choice depends on various factors such as the problem scope, data distribution, outliers, data quantity and feature dimensionality.
Low bias and low variance are the most desirable properties of a model, but in practice there is a trade-off between them: the two very often move in opposite directions, giving either high bias with low variance or low bias with high variance. A model with high bias pays very little attention to the training data and oversimplifies the problem, which leads to high error on both the training and test data. A model with high variance pays too much attention to the training data and does not generalize to the test data. In any modeling exercise there will always be a trade-off between bias and variance, and when we build models we try to achieve the best balance.
Fig: Bias-Variance trade-off
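To make the trade-off concrete, here is a minimal sketch (the synthetic dataset and the specific tree depths are illustrative assumptions, not from this article) comparing a very shallow decision tree (high bias) with an unrestricted one (high variance) on train and test accuracy:

# Illustrative sketch of the bias-variance trade-off with decision trees.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# depth=1: underfits (high bias); depth=None: overfits (high variance)
for depth in (1, 3, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_train, y_train)
    print(f"max_depth={depth}: train acc={tree.score(X_train, y_train):.2f}, "
          f"test acc={tree.score(X_test, y_test):.2f}")

A large gap between train and test accuracy signals high variance, while low accuracy on both signals high bias.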
In ensemble learning we combine several base models, a.k.a. weak learners, to resolve the underlying complexity of the data. Most of the time these basic models do not perform well in isolation because of high bias or high variance. The beauty of ensemble learning is that it can balance the bias-variance trade-off and create a strong learner that achieves better performance.
Simple Ensemble Techniques:
1. Voting Classifier
This technique is used in classification problems where the target outcome is a discrete value. A set of base learners such as KNN, random forest, SVM and decision tree is fitted on the training set. The predictions of the individual learners are aggregated and the majority vote is chosen as the final prediction.
Fig: Voting
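As a hedged sketch (scikit-learn's VotingClassifier and its built-in Iris dataset are assumptions chosen for illustration), a hard-voting ensemble over three different base learners could look like this:

# Sketch of majority (hard) voting over heterogeneous base learners.
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

voter = VotingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier()),
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="hard",  # majority vote on the predicted class labels
)
print("voting accuracy:", cross_val_score(voter, X, y, cv=5).mean())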
2. Averaging
Averaging is used for regression problems, such as house price prediction or loan amount prediction, where the target outcome is a continuous value. The final prediction is made by averaging the outcomes of the different algorithms.
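A minimal sketch (the regressors and the synthetic dataset below are illustrative assumptions) of averaging the predictions of two regression models:

# Sketch of simple averaging of regression predictions.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pred_lr = LinearRegression().fit(X_train, y_train).predict(X_test)
pred_dt = DecisionTreeRegressor(random_state=0).fit(X_train, y_train).predict(X_test)

final_pred = (pred_lr + pred_dt) / 2  # simple average of the two models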
3. Weighted Averaging
This is the same as averaging, except that different weights are assigned to the models according to their importance before computing the final prediction.
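Continuing the sketch above, a weighted average (the 0.7/0.3 weights are purely hypothetical, chosen only for illustration) would be:

# Sketch of weighted averaging: the stronger model gets a larger weight.
final_pred = 0.7 * pred_lr + 0.3 * pred_dt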
Advanced Ensemble Techniques:
1. Bagging
Bagging is a homogeneous ensemble technique in which copies of the same base learner are trained in parallel on different random subsets of the training set, which helps to get better predictions. Bootstrapping is used to create these random subsets of the training data, with or without replacement. Sampling with replacement means samples may repeat within a subset, while sampling without replacement ensures unique samples in each subset. This bootstrapping brings diversity (lower correlation) into the base learners, so the ensemble can generalize better.
Once training is done, the ensemble makes a prediction for a test pattern by aggregating the predicted values of all the trained base learners. This aggregation mainly reduces variance compared to each individual base learner, which on its own typically has low bias but high variance.
e.g. Random Forest
Fig: Bagging
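A minimal sketch of bagging (scikit-learn's BaggingClassifier with decision trees as the base learner and a synthetic dataset, all illustrative choices):

# Sketch of bagging: many decision trees trained on bootstrap samples.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

single_tree = DecisionTreeClassifier(random_state=42)
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=100,
    bootstrap=True,  # sample the training set with replacement
    random_state=42,
)

print("single tree :", cross_val_score(single_tree, X, y, cv=5).mean())
print("bagged trees:", cross_val_score(bagged_trees, X, y, cv=5).mean())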
2. Boosting
Boosting trains homogeneous weak learners sequentially, in an adaptive way: each subsequent model attempts to fix the errors of its predecessor. Boosting mainly decreases the bias error and produces a strong predictive model. It can be viewed as a model averaging method and can be used for classification as well as regression.
e.g. AdaBoost, Gradient Boosting Machine, XGBoost, LightGBM.
Fig: Boosting
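A minimal sketch of boosting (scikit-learn's AdaBoostClassifier; the dataset and hyperparameters are illustrative assumptions):

# Sketch of boosting: weak learners (decision stumps) trained sequentially,
# each one focusing on the samples its predecessors got wrong.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

# By default AdaBoost uses depth-1 decision trees (stumps) as weak learners.
booster = AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=42)
print("AdaBoost accuracy:", cross_val_score(booster, X, y, cv=5).mean())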
3. Stacking
Stacking, also known as Stacked Generalization, is an ensemble technique that combines multiple classification or regression models via a meta-classifier or meta-regressor. The base-level models are trained on the complete training set, then the meta-model is trained on features that are the outputs of the base-level models. The base level often consists of different learning algorithms, so stacking ensembles are often heterogeneous.
The predictions made by the base models on out-of-fold data are used to train the meta-model. We can understand the stacking process with the following steps:
1. Split the training data into folds (say 4).
2. Base models are trained on each training fold and predict on the out-of-fold (OOF) data.
3. The OOF predictions are given as input to the meta-learner.
4. The meta-learner is trained on these OOF predictions and is then run on the test set for the final predictions.
Fig: Stacking
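As a hedged sketch (scikit-learn's StackingClassifier handles the out-of-fold predictions internally via cross-validation; the dataset and base learners below are illustrative assumptions):

# Sketch of stacking: heterogeneous base learners plus a logistic-regression
# meta-learner trained on their out-of-fold predictions (cv=4 folds).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("knn", KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # meta-learner
    cv=4,  # 4 folds, matching the steps above
)
print("stacking accuracy:", cross_val_score(stack, X, y, cv=5).mean())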
4. Blending
Blending is very similar to stacking. It holds out part of the training data (say an 80/20 split: 80% train, 20% validation). Train the base models on the 80% part and predict on the 20% part as well as on the test set. Train your meta-learner with the predictions on the 20% set as features, then run the meta-learner on the test set for your final submission predictions.
Fig: Blending
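A minimal sketch of blending (the synthetic dataset, the specific base models and the logistic-regression meta-learner are all illustrative assumptions):

# Sketch of blending: base models trained on 80% of the training data, and a
# meta-learner trained on their predictions for the held-out 20% validation set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

base_models = [RandomForestClassifier(n_estimators=100, random_state=42),
               KNeighborsClassifier()]
for model in base_models:
    model.fit(X_tr, y_tr)

# Base-model predicted probabilities on the validation and test sets become meta-features.
val_meta = np.column_stack([m.predict_proba(X_val)[:, 1] for m in base_models])
test_meta = np.column_stack([m.predict_proba(X_test)[:, 1] for m in base_models])

meta_learner = LogisticRegression().fit(val_meta, y_val)
print("blending accuracy:", meta_learner.score(test_meta, y_test))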
Takeaway
1. In the pattern recognition field, there is no guarantee that a specific classifier will achieve the best performance in every situation. However, better predictive performance can often be achieved by combining classifiers, which is the kernel idea of ensemble learning and has been widely applied in machine learning and pattern recognition.
2. Ensemble methods work best with less correlated base learners.
3. Excellent generalization performance of ensemble models depends on diversity and accuracy. Diversity can be obtained through bootstrapping, using different algorithms, etc.
4. Two key questions when designing an ensemble are:
i. How to generate the classifier ensemble?
ii. How to search for the optimal fusion of the base classifiers?
5. There is no killer classifier for every problem. The choice of ensemble learning scheme depends on several factors such as problem complexity, data imbalance, amount of data, noise in the data and data quality. Sometimes, for a simple problem with a small dataset, a single base learner is enough.
6. In practice it may be impractical to use ensemble techniques such as stacking on large datasets, since they are very time consuming. Even if the stacked model performs better, deploying such a model into production may be infeasible.