Saturday, October 3, 2020

 

Ensemble Learning

Ensemble learning is a very popular method that combines multiple learners, turning a collection of weak learners into a single strong learner.

Let’s understand it with an example:

When we want to purchase an iPad, we do not simply walk into a store or go online and buy one. The common practice is to compare different models on features, specifications, prices and reviews on the internet. We also take advice from friends or colleagues and only then come to a conclusion.

From this example you can infer that we make better decisions by weighing inputs from different sources. The same holds in machine learning: a diverse set of models often outperforms any single model. That is exactly what we achieve with the ensemble learning technique. This approach usually produces better predictive performance than a single model, which is why ensemble methods have placed first in many prestigious machine learning competitions, such as the Netflix Prize, KDD Cup 2009, and many Kaggle competitions.

 

Bias-Variance trade-off:

In machine learning the choice of model is extremely important for getting good results. That choice depends on various factors such as problem scope, data distribution, outliers, data quantity, feature dimensionality, etc.

Low bias and low variance are the two properties we most want in a model, but they usually trade off against each other: very often they move in opposite directions, giving high bias with low variance or low bias with high variance. A model with high bias pays too little attention to the training data and oversimplifies the problem, leading to high error on both the training and test data (underfitting). A model with high variance pays too much attention to the training data and fails to generalize to the test data (overfitting). In any modeling exercise there will always be a trade-off between bias and variance, and when we build models we try to strike the best balance.

 

 

Fig: Bias-Variance trade-off

 

 


 

In ensemble learning we combine several base models, a.k.a. weak learners, to capture the underlying complexity of the data. Used in isolation, these base models often cannot perform well because of high bias or high variance. The beauty of ensemble learning is that combining them can reduce both bias and variance, creating a strong learner that achieves better performance.

 

Simple Ensemble Techniques:

1. Voting Classifier:

This technique is used in classification problems, where the target outcome is a discrete value. A set of base learners such as kNN, random forest, SVM and decision tree is fitted on the training set. Each learner makes a prediction, and the majority vote is chosen as the final prediction. A minimal sketch follows the figure below.

 

Fig: Voting
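For illustration, here is a minimal sketch of hard (majority) voting using scikit-learn's VotingClassifier. The dataset, base learners and hyperparameters are only placeholders; any classifiers could be substituted.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Hard voting: each base learner predicts a class label and the majority wins.
voting_clf = VotingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier()),
        ("rf", RandomForestClassifier(random_state=42)),
        ("svm", SVC(random_state=42)),
        ("dt", DecisionTreeClassifier(random_state=42)),
    ],
    voting="hard",
)
voting_clf.fit(X_train, y_train)
print(voting_clf.score(X_test, y_test))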

 

2. Averaging

Averaging is used for regression problems such as house price prediction or loan amount prediction, where the target outcome is a continuous value. The final prediction is obtained by averaging the outputs of the different algorithms, as in the sketch below.
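A minimal sketch of simple averaging for regression; the synthetic dataset and the particular regressors are illustrative only.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = [LinearRegression(), DecisionTreeRegressor(random_state=42), KNeighborsRegressor()]
preds = [m.fit(X_train, y_train).predict(X_test) for m in models]

# Final prediction: plain mean of the individual model outputs.
final_pred = np.mean(preds, axis=0)
print(final_pred[:5])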

 

3. Weighted Averaging

This is the same as averaging, except that each model is assigned a different weight according to its importance before the final prediction is computed.
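A minimal sketch of weighted averaging; the prediction values are made up and the weights are illustrative (in practice they would come from validation performance).

import numpy as np

# Predictions of three hypothetical regressors for three houses (values are made up).
preds = np.array([
    [220_000.0, 310_000.0, 150_000.0],  # model A
    [200_000.0, 330_000.0, 160_000.0],  # model B
    [210_000.0, 300_000.0, 155_000.0],  # model C
])
weights = [0.5, 0.3, 0.2]  # heavier weight for the model we trust more

# Weighted average across models (axis 0).
final_pred = np.average(preds, axis=0, weights=weights)
print(final_pred)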

 

Advanced Ensemble Techniques:

1.     Bagging

Bagging (bootstrap aggregating) is a homogeneous ensemble technique in which copies of the same base learner are trained in parallel on different random subsets of the training set, which helps to get better predictions. Bootstrapping is used to create these random subsets, sampling with or without replacement: sampling with replacement allows a sample to appear more than once in a subset, while sampling without replacement guarantees unique samples in each subset. This bootstrapping gives the base learners diversity (lower correlation), which helps the ensemble generalize.

Once training is done, the ensemble makes a prediction for a test pattern by aggregating the predictions of all trained base learners. This aggregation mainly reduces variance compared to each individual base learner, which typically has low bias but high variance. A sketch follows the figure below.

e.g., Random Forest.

Fig: Bagging
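A minimal sketch of bagging with scikit-learn's BaggingClassifier, using decision trees as the repeated base learner; the dataset and settings are illustrative.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 100 decision trees, each trained on a bootstrap sample (drawn with replacement).
bag = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,
    bootstrap=True,
    random_state=42,
)
bag.fit(X_train, y_train)
print(bag.score(X_test, y_test))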

 

 

2.     Boosting

Boosting is a homogeneous ensemble technique in which weak learners are trained sequentially, in an adaptive way: each subsequent model attempts to fix the errors of its predecessor. Boosting mainly decreases the bias error and produces a strong predictive model. It can be viewed as a form of weighted model averaging and can be used for classification as well as regression. A sketch follows the figure below.

e.g., AdaBoost, Gradient Boosting Machine, XGBoost, LightGBM.



Fig: Boosting
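A minimal sketch of boosting with scikit-learn's AdaBoostClassifier, using decision stumps as the weak learners; the dataset and hyperparameters are illustrative.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Stumps are trained one after another; each new stump focuses on the
# samples the previous ones misclassified.
ada = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=200,
    learning_rate=0.5,
    random_state=42,
)
ada.fit(X_train, y_train)
print(ada.score(X_test, y_test))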

 

 

3.     Stacking

Stacking, also known as stacked generalization, is an ensemble technique that combines multiple classification or regression models via a meta-classifier or meta-regressor. The base-level models are trained on the complete training set, and the meta-model is then trained on features that are the outputs of the base-level models. Because the base level often consists of different learning algorithms, stacking ensembles are usually heterogeneous.

The predictions that the base models make on out-of-fold data are used to train the meta-model. We can understand the stacking process with the following steps (a sketch follows the figure below):

Stacking:

1. Split the training data into folds (say 4).

2. Train the base models on each combination of training folds and predict on the held-out fold (out-of-fold, OOF).

3. The OOF predictions are given as input features to the meta-learner.

4. The meta-learner is trained on these OOF predictions and is then run on the test set (using the base models' test-set predictions) for the final predictions.

 

 

Fig: Stacking
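A minimal sketch of stacking with scikit-learn's StackingClassifier, which generates the out-of-fold predictions internally via cross-validation; the dataset, base learners and meta-learner are illustrative choices.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=42)),
        ("svm", SVC(random_state=42)),
        ("knn", KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # the meta-learner
    cv=4,  # 4 folds, matching step 1 above
)
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))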

 

 

4.     Blending

Blending is very similar to stacking, but instead of out-of-fold predictions it holds out part of the training data (say an 80/20 split: 80% train, 20% validation). The base models are trained on the 80% part and predict on the 20% part as well as on the test set. The meta-learner is then trained with the 20% predictions as features and run on the test set for the final submission predictions. A sketch follows the figure below.

 

Fig: Blending
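A minimal sketch of blending under the assumptions above; the dataset, base models and meta-learner are illustrative only.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Hold out 20% of the training data for the meta-learner.
X_base, X_hold, y_base, y_hold = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42
)

base_models = [RandomForestClassifier(random_state=42), KNeighborsClassifier()]
hold_preds, test_preds = [], []
for model in base_models:
    model.fit(X_base, y_base)                             # train on the 80% part
    hold_preds.append(model.predict_proba(X_hold)[:, 1])  # predict on the 20% part
    test_preds.append(model.predict_proba(X_test)[:, 1])  # predict on the test set

# Meta-learner: trained on hold-out predictions, applied to the test-set predictions.
meta = LogisticRegression()
meta.fit(np.column_stack(hold_preds), y_hold)
print(meta.score(np.column_stack(test_preds), y_test))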

 

Takeaways

1. In the pattern recognition field, there is no guarantee that a specific classifier will achieve the best performance in every situation. Combining several learners to obtain better predictive performance is the core idea of ensemble learning, and it has been widely applied in machine learning and pattern recognition.

2. Ensemble methods work best with weakly correlated base learners.

3. The generalization performance of ensemble models depends on both the diversity and the accuracy of the base learners. Diversity can be obtained through bootstrapping, using different algorithms, etc.

4. There are mainly two challenges in ensemble learning:

i. How to generate a new classifier ensemble?

ii. How to search for the optimal fusion of the base classifiers?

5. There is no single killer classifier for every problem. The choice of ensemble learning scheme depends on several factors such as problem complexity, data imbalance, amount of data, noise in the data and data quality. Sometimes, for a simple problem with a small dataset, a single base learner is enough.

6. In practice it may be impractical to use ensemble techniques such as stacking on large datasets, since they are very time-consuming. Even if a stacked model performs better, deploying such a model to production may be infeasible.