Machine Learning is the Future!

Saturday, October 3, 2020

Ensemble Learning

Ensemble learning is a very popular method that combines multiple learners so that a set of weak learners becomes a strong learner.

Let’s understand it by example:

When we want to buy an iPad, we do not go straight to a store or a website and buy one. Usually we compare different models by features, specifications, prices and online reviews. We also ask friends and colleagues for advice, and only then come to a conclusion.

From this example you can infer that we make better decisions when we consider opinions from different sources. The same holds in machine learning: a diverse set of models often does better than a single model. That is exactly what the Ensemble Learning technique achieves, and it often yields better predictive performance than a single model. That is why ensemble methods have placed first in many prestigious machine learning competitions, such as the Netflix Competition, KDD 2009, and Kaggle.

Bias-Variance trade-off:

In machine learning the choice of model is extremely important for getting good results. The choice of model depends on various factors such as problem scope, data distribution, outliers, data quantity, feature dimensionality, etc.

Ideally a model has both low bias and low variance. In practice, however, the two usually move in opposite directions: high bias with low variance, or low bias with high variance. This is the bias-variance trade-off. A model with high bias pays very little attention to the training data and oversimplifies the problem (underfitting), which typically leads to high error on both training and test data. A model with high variance pays too much attention to the training data and does not generalize to the test data (overfitting). In any modeling task there is a trade-off between bias and variance, and when we build models we try to achieve the best balance.

Fig: Bias-Variance trade-off

In Ensemble learning we combine several base models, a.k.a. weak learners, to handle the underlying complexity of the data. Used in isolation, these base models often do not perform well because of high bias or high variance. The beauty of Ensemble learning is that it can reduce bias and/or variance, creating a strong learner that achieves better performance.

Simple Ensemble Techniques:

1. Voting Classifier

This technique is used in classification problems, where the target outcome is a discrete value. A set of base learners such as k-NN, random forest, SVM and decision tree is fitted on the training set. The predictions of all learners are aggregated, and the majority class is chosen as the final prediction.

Fig: Voting
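A minimal sketch of the majority vote described above, in plain Python. The labels and the four "base learner" predictions are made up for illustration; in practice they would come from fitted models.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the largest number of base learners."""
    # most_common(1) gives [(label, count)] for the most frequent label.
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions of four base learners for one test sample.
votes = ["spam", "ham", "spam", "spam"]
final_label = majority_vote(votes)
print(final_label)
```

On a tie, this sketch returns whichever of the tied labels was seen first; a real voting classifier would need an explicit tie-breaking rule.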

2. Averaging

Averaging is used for regression problems such as house price prediction or loan amount prediction, where the target outcome is a continuous value. The final prediction is the average of the outputs of the different algorithms.

3. Weighted Averaging

This is the same as averaging, except that each model is assigned a weight according to its importance, and the final prediction is the weighted average.
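Averaging and weighted averaging can be sketched in a few lines. The three predictions and the weights below are invented values, not results from real models.

```python
def average(predictions):
    """Plain average of the base models' predictions."""
    return sum(predictions) / len(predictions)

def weighted_average(predictions, weights):
    """Average in which each prediction counts in proportion to its weight."""
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per prediction")
    return sum(p * w for p, w in zip(predictions, weights)) / sum(weights)

# Hypothetical house-price predictions (in thousands) from three models.
preds = [200.0, 220.0, 240.0]
weights = [1.0, 1.0, 2.0]  # assumption: the third model is trusted twice as much

print(average(preds))                    # 220.0
print(weighted_average(preds, weights))  # 225.0
```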

Advanced Ensemble techniques

1. Bagging

Bagging is a homogeneous ensemble technique in which copies of the same base learner are trained in parallel on different random subsets of the training set. The random subsets can be drawn with or without replacement. With replacement (bootstrapping, which gives bagging its name: bootstrap aggregating), a sample may appear more than once in a subset. Without replacement, every sample in a subset is unique. This resampling makes the base learners more diverse and less correlated, which helps generalization.

Once the training is done, the ensemble predicts a test pattern by aggregating the predictions of all trained base learners. This aggregation mainly reduces variance compared to an individual base learner.

e.g. Random Forest
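The bootstrap-then-aggregate idea can be sketched as follows. To keep the example short, the "base learner" here is deliberately trivial (it just predicts the mean of its bootstrap sample); the function names and data are hypothetical.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return rng.choices(data, k=len(data))

def bagged_mean(data, n_learners=10, seed=0):
    """Fit a trivial mean predictor on each bootstrap sample, then average them."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_learners):
        sample = bootstrap_sample(data, rng)
        estimates.append(sum(sample) / len(sample))  # "train" one base learner
    return sum(estimates) / len(estimates)           # aggregate the learners

targets = [3.0, 5.0, 7.0, 9.0, 11.0]
print(bagged_mean(targets))
```

A real bagging ensemble would replace the mean predictor with a proper model such as a decision tree, which is essentially what Random Forest does (with additional feature subsampling).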

Fig: Bagging

2. Boosting

Boosting is a homogeneous ensemble technique in which weak learners are trained sequentially in an adaptive way: each subsequent model attempts to fix the errors of its predecessor. Boosting mainly decreases the bias error and produces a strong predictive model. It can be viewed as a weighted model-averaging method, and it can be used for classification as well as regression.

e.g. AdaBoost, Gradient Boosting Machine, XGBoost, LightGBM.
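To make "each model fixes the errors of its predecessor" concrete, here is a small sketch of boosting for 1-D regression, where each round fits a one-split stump to the current residuals. This is a simplified illustration under assumed settings (20 rounds, learning rate 0.5, toy data), not the exact algorithm of any of the libraries named above.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals by least squares.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep at least one point on the right
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)

def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to what is still wrong."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # errors of the model so far
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps

def boosted_predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)

# Toy data: a step from about 1 to about 3.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = boost(xs, ys)
print([round(boosted_predict(model, x), 2) for x in xs])
```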

Fig: Boosting

3. Stacking

Stacking, also known as Stacked Generalization, is an ensemble technique that combines multiple classification or regression models via a meta-classifier or a meta-regressor. The base-level models are trained on the training set, and the meta-model is then trained on the outputs of the base-level models as features. The base level often consists of different learning algorithms, so stacking ensembles are often heterogeneous.

The predictions made by the base models on out-of-fold data are used to train the meta-model. We can understand the stacking process with the following steps:

Stacking:

1. Split the training data into folds (say 4).

2. For each fold, train the base models on the other folds and predict on the held-out fold. These are the out-of-fold (OOF) predictions.

3. The OOF predictions are given as input to the meta-learner.

4. The meta-learner is trained on these OOF predictions and is then run on the base models' predictions for the test set to get the final predictions.
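Steps 1 to 3 can be sketched as below. The two base models (a mean predictor and a 1-nearest-neighbour predictor), the helper names and the toy data are all assumptions made for illustration; training the meta-learner on the resulting OOF matrix (step 4) is left out.

```python
def k_fold_indices(n_samples, n_folds):
    """Split range(n_samples) into n_folds contiguous, roughly equal folds."""
    sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
             for i in range(n_folds)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def fit_mean(xs, ys):
    """Base model 1: always predict the mean of the training targets."""
    mean = sum(ys) / len(ys)
    return lambda x: mean

def fit_nearest(xs, ys):
    """Base model 2: predict the target of the nearest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def oof_predictions(xs, ys, fitters, n_folds=4):
    """Return one row per sample holding each base model's OOF prediction."""
    oof = [[None] * len(fitters) for _ in xs]
    for held_out in k_fold_indices(len(xs), n_folds):
        held = set(held_out)
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        for j, fit in enumerate(fitters):
            model = fit(train_x, train_y)   # never sees the held-out fold
            for i in held_out:
                oof[i][j] = model(xs[i])    # prediction on unseen data
    return oof

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [2.0 * x for x in xs]
meta_features = oof_predictions(xs, ys, [fit_mean, fit_nearest])
print(meta_features[0])
```

Each row of meta_features becomes one training example for the meta-learner, with the original target as its label.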

Fig: Stacking

4. Blending

Blending is very similar to Stacking. It holds out part of the training data (say an 80/20 split: 80% train, 20% validation). Train the base models on the 80% part, then predict on the 20% part as well as on the test set. Train your meta-learner with the predictions on the 20% part as features, then run it on the test-set predictions for your final submission.

Fig: Blending

Takeaway

1. In the pattern recognition field, there is no guarantee that a specific classifier achieves the best performance in every situation. Combining several classifiers can give better predictive performance; this is the core idea of ensemble learning, which has been widely applied in machine learning and pattern recognition.

2. Ensemble methods work best with less correlated base learners.

3. The generalization performance of an ensemble depends on the diversity and accuracy of its base learners. Diversity can be obtained using bootstrapping, different algorithms, etc.

4. There are mainly two challenges in ensemble learning:

i. How to generate the base classifiers of the ensemble?

ii. How to find the optimal fusion of the base classifiers?

5. There is no killer classifier for everything. The choice of ensemble learning scheme depends on several factors such as problem complexity, data imbalance, and the amount, noise and quality of the data. Sometimes, for a simple problem or a small dataset, a single base learner is enough.

6. In practice it may be impractical to use ensemble methods such as stacking on large datasets, since they are very time-consuming. Even if we get a better stacked model, deploying it into production may be infeasible.

Friday, January 25, 2019

Standard Deviation

The term standard deviation was first introduced by Karl Pearson in 1893.

Let’s first understand the significance of standard deviation:

When we look at the data for a population, often the first thing we do is look at the mean. But even if we know that the distribution is perfectly normal, the mean alone is not enough to describe the population. We also need to know how the data is spread out around the mean, that is, how wide the bell curve is. The basic measure of this spread is the standard deviation.

Standard deviation is a widely used measure of variability or measure of dispersion. It shows how much variation or "dispersion" exists from the mean or expected value.

A low standard deviation means that most of the numbers are very close to the mean. A high standard deviation means that the numbers are spread out.

One can also say a smaller standard deviation means the variation is small in the data and a large standard deviation means the variation is large in the data.

Fig. Standard Deviation

A step-by-step method for calculating the standard deviation:

(1) Find out the mean of the data set.

(2) Subtract this mean from each data point to find out deviation from the mean. It could be either positive or negative.

(3) Square up these deviations to find out squared deviation. Naturally, squared deviations will be all positive.

(4) Find out the mean of squared deviations. This is called variance.

(5) Find out the square root of the variance. That's the standard deviation.
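The five steps above translate directly into code. Dividing by the number of points in step 4 gives the population standard deviation; for a sample estimate one would divide by n - 1 instead. The data set below is made up for illustration.

```python
import math
import statistics

def standard_deviation(values):
    """Population standard deviation, following the five steps one by one."""
    mean = sum(values) / len(values)          # (1) mean of the data set
    deviations = [v - mean for v in values]   # (2) deviation of each point
    squared = [d ** 2 for d in deviations]    # (3) squared deviations
    variance = sum(squared) / len(squared)    # (4) mean of squares = variance
    return math.sqrt(variance)                # (5) square root of the variance

scores = [2, 4, 4, 4, 5, 5, 7, 9]
print(standard_deviation(scores))   # 2.0
print(statistics.pstdev(scores))    # same value from the standard library
```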

Different examples to understand the standard deviation in an easy manner:

Some examples in which standard deviation might help to understand the value of the data:

1. A class of students took a math test. Their teacher found that the mean score on the test was 85%. The teacher then calculated the standard deviation of the test scores and found a very small standard deviation, which suggested that most students scored very close to 85%.

2. A market researcher is analyzing the results of a recent customer survey. He wants to have some measure of the reliability of the answers received in the survey in order to predict how a larger group of people might answer the same questions. A low standard deviation shows that the answers are very projectable to a larger group of people.

3. An employer wants to determine if the salaries in one department seem fair for all employees, or if there is a great disparity. He finds the average of the salaries in that department and then calculates the variance, and then the standard deviation. The employer finds that the standard deviation is slightly higher than he expected, so he examines the data further and finds that while most employees fall within a similar pay bracket, three loyal employees who have been in the department for 20 years or more, far longer than the others, are making far more due to their longevity with the company. Doing the analysis helped the employer to understand the range of salaries of the people in the department.

How to standardize data?

Data is standardized by subtracting the mean and then dividing by the standard deviation, which ensures that every variable has a mean of zero and a variance (and standard deviation) of 1. Standardization is important because it lets us compare variables on a similar scale.
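A short sketch of standardization (z-scoring) for a single variable, using the population standard deviation. The function name and data are illustrative assumptions.

```python
import statistics

def standardize(values):
    """Subtract the mean and divide by the standard deviation (z-scores)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(v - mean) / std for v in values]

z_scores = standardize([2, 4, 4, 4, 5, 5, 7, 9])
print(z_scores)
print(statistics.fmean(z_scores), statistics.pstdev(z_scores))
```

After the transformation the variable has mean 0 and standard deviation 1, up to floating-point rounding.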

I hope you enjoyed this post. The tutorial is very helpful to get the overall idea of standard deviation. The tutorial also highlights how to standardize data. Good Luck!

Thursday, September 20, 2018

Linear Discriminant Analysis (LDA)

LDA is a way to reduce 'dimensionality' while at the same time preserving as much of the class discrimination information as possible.

How does it work?

Basically, LDA helps you to find the 'boundaries' around clusters of classes. It projects your data points on a line so that your clusters 'are as separated as possible', with each cluster having a relative (close) distance to a centroid.

What is it actually doing?

1. Calculates the mean vectors of the data in all dimensions.

2. Calculates the scatter of the whole group (to determine separability).

3. Calculates the scatter among representatives of the same class, using the whole-group scatter as a normalizer.

4. Groups the projected points around the K class centroids.

Let us say you have data represented by 100-dimensional feature vectors, and you have 100,000 data points. You know that these data points belong to three different classes, but you are not sure which combination of features most affects their separation. The data is too large to process in a reasonable time, so you want to reduce the 100-dimensional feature vectors to, say, 50 dimensions so that you can learn from the data more efficiently.

Performing Principal Component Analysis (PCA) to reduce the number of features (dimensions) gives you the directions of largest variance in your data, found from the leading eigenvalues of the covariance matrix. But you are not satisfied: although you have obtained 50 new features, they do not distinguish the 3 classes as well as the original data did.

You want to preserve the differences between the classes while reducing the dimensions. You look for a better alternative, and it leads you to Linear Discriminant Analysis, which reduces the number of features while also taking the separation between classes into account.

LDA reduces the number of dimensions of the input feature vector while preserving the inter-class separation present in the original feature vector. Note that with K classes, LDA can produce at most K - 1 discriminant dimensions.

Figure 1: Three Class Feature Data

In Figure 1, a 3-dimensional input feature vector is reduced to a 1-dimensional feature vector while preserving the differences among the classes.

Let's talk about linear regression first. You may know that linear regression analysis tries to fit a line through the data points in an n-dimensional space such that the distances between the points and the line are minimized.

Discriminant Analysis can loosely be seen as the opposite of linear regression. Here, the task is to maximize the distance between the discriminating line (the decision boundary) and the data points on either side of it, while minimizing the distances between the points of the same class.

We know that the hypothesis equation is h(x) = w^T x + c, where w is the weight vector and c is the offset.

The discriminant analysis tries to find the optimum w and c, such that the above-explained theory holds true.
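One standard way to obtain w and c for two classes is Fisher's rule: w is the inverse of the within-class scatter matrix times the difference of the class means. The sketch below does this for 2-D points in plain Python. Placing the threshold c at the midpoint of the two class means is an assumption (it corresponds to equal class priors), and the data points are made up.

```python
def mean_vector(points):
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

def scatter_matrix(points, mean):
    """2x2 scatter of the points around their class mean."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for x, y in points:
        dx, dy = x - mean[0], y - mean[1]
        s[0][0] += dx * dx
        s[0][1] += dx * dy
        s[1][0] += dy * dx
        s[1][1] += dy * dy
    return s

def fisher_lda(class_a, class_b):
    """Return (w, c) so that h(x) = w.x + c is > 0 for class A, < 0 for class B."""
    m_a, m_b = mean_vector(class_a), mean_vector(class_b)
    s_a, s_b = scatter_matrix(class_a, m_a), scatter_matrix(class_b, m_b)
    sw = [[s_a[i][j] + s_b[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m_a[0] - m_b[0], m_a[1] - m_b[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    # Assumption: put the boundary through the midpoint of the two means.
    mid = [(m_a[0] + m_b[0]) / 2, (m_a[1] + m_b[1]) / 2]
    c = -(w[0] * mid[0] + w[1] * mid[1])
    return w, c

def h(w, c, point):
    return w[0] * point[0] + w[1] * point[1] + c

class_a = [(1, 2), (2, 3), (3, 3), (2, 1)]
class_b = [(6, 5), (7, 8), (8, 6), (7, 7)]
w, c = fisher_lda(class_a, class_b)
print([h(w, c, p) > 0 for p in class_a + class_b])
```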

Linear discriminant analysis is a statistical method for classifying an observation with p components into one of two groups. It was developed by Fisher. It gives two regions separated by a line, which helps in classifying the given data. The regions and the separating line are defined by the linear discriminant function. For more details, see any reference book on multivariate analysis.

Logistic regression is a classification algorithm traditionally limited to only two-class classification problems.

If you have more than two classes then Linear Discriminant Analysis is the preferred linear classification technique.

Limitations of Logistic Regression

Logistic regression is a simple and powerful linear classification algorithm. It also has limitations that suggest the need for alternate linear classification algorithms.

· Two-Class Problems: Logistic regression is intended for two-class or binary classification problems. It can be extended for multi-class classification but is rarely used for this purpose.

· Unstable With Well Separated Classes: Logistic regression can become unstable when the classes are well separated.

· Unstable With Few Examples: Logistic regression can become unstable when there are few examples from which to estimate the parameters.

Figure 2: 2D mapping of Features

Linear Discriminant Analysis does address each of these points and is the go-to linear method for multi-class classification problems. Even with binary-classification problems, it is a good idea to try both logistic regression and linear discriminant analysis.

The LDA model consists of statistical properties of your data, calculated for each class. For a single input variable (x), these are the mean and the variance of the variable for each class. For multiple variables, these are the same properties calculated over the multivariate Gaussian, namely the means and the covariance matrix.

These statistical properties are estimated from your data and plugged into the LDA equation to make predictions. These are the model values that you would save to file for your model.

LDA makes predictions by estimating the probability that a new set of inputs belongs to each class. The class that gets the highest probability is the output class and a prediction is made.

How to Prepare Data for LDA

This section lists some suggestions you may consider when preparing your data for use with LDA.

· Classification Problems: This might go without saying, but LDA is intended for classification problems where the output variable is categorical. LDA supports both binary and multi-class classification.

· Gaussian Distribution: The standard implementation of the model assumes a Gaussian distribution of the input variables.

· Remove Outliers: Consider removing outliers from your data. These can skew the basic statistics used to separate classes in LDA, such as the mean and the standard deviation.

· Same Variance: LDA assumes that each input variable has the same variance. It is almost always a good idea to standardize your data before using LDA so that it has a mean of 0 and a standard deviation of 1.

Remember one thing: LDA works well as a classifier only when your classes are (close to) linearly separable.

Difference between LDA and PCA

LDA is a method of dimensionality reduction. Another well-known one is Principal Component Analysis (PCA).


The difference is that PCA does not take into account the class information.

For the two clusters in Figure 2 above, PCA will try to find the direction that maximizes the variance and project the data onto that direction, which is along the y-axis in this case. This is clearly not ideal: we actually lose information, because the projections of the two clusters are no longer separable.

Without any math, LDA needs to accomplish two things after the projection: maximize the variance between the two clusters, and minimize the variance of the points within each cluster. This results in two projected clusters that are clearly separated. Note that in this case we are actually using the fact that there are two clusters, i.e., the class information.

Mathematically, the two goals can be formulated into two covariance matrices. You can read in more detail...
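For reference, the two goals are commonly combined into the two-class Fisher criterion below. The notation is an assumption, not taken from the post: w is the projection direction, m_1 and m_2 are the class means, C_k is the set of points in class k, S_B is the between-class scatter matrix and S_W the within-class scatter matrix.

```latex
J(\mathbf{w}) = \frac{\mathbf{w}^{T} S_B \, \mathbf{w}}{\mathbf{w}^{T} S_W \, \mathbf{w}},
\qquad
S_B = (\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^{T},
\qquad
S_W = \sum_{k=1}^{2} \sum_{\mathbf{x} \in C_k} (\mathbf{x} - \mathbf{m}_k)(\mathbf{x} - \mathbf{m}_k)^{T}
```

Maximizing J(w) gives a direction proportional to S_W^{-1}(m_1 - m_2), provided S_W is invertible.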

Which method to use depends on your dataset. If the dataset has high variance, reduce the number of features and add more data, then use a non-linear method for classification.

If the dataset is small or has low variance, use a linear model.

Go Further!

I hope you enjoyed this post. The tutorial is very helpful to get the overall idea of Linear Discriminant Analysis. The difference between LDA and PCA is also highlighted at the end of the tutorial. Good Luck!

Tuesday, July 31, 2018

Principal Component Analysis

Principal Component Analysis (PCA) is a simple yet popular and useful linear transformation technique. PCA transforms the data from one feature space to another feature space of lower dimension. The transformed feature space should explain most of the variance of the original data set with fewer variables. The new variables are called principal components.

Let’s understand it in a simple way.

Consider the Pattern Classification problem.

The pattern classification problem is divided into two phases, namely, training and testing.

In the training phase, the input data is first preprocessed. Then the features are extracted from the processed data and fed to the classifier, which is trained on them. Once the model is trained, it stores the knowledge in terms of weights. In the testing phase, the trained model is used to predict the class of the test input.

Fig.1 Pattern Classification system (Source: Wikipedia)

Feature extraction aims at representing the signals by an ideally small number of relevant values that describe the task-relevant information contained in the signals. The classifier then learns from the data which class corresponds to which input features, so the feature extraction technique plays a very critical role.

Not all extracted features are useful for classification, so feature extraction is followed by feature dimension reduction. We are interested in the discriminative features. For example, suppose there are two classes, truck and bike. If the features we consider, such as color or number of wheels, take the same values for both classes, we cannot predict the class (bike/truck) from them. Hence we are interested in discriminative features, e.g. height. That’s what we want. We need to transform the original feature space into a new feature space having maximum variance; the dimension reduction then takes place in the transformed feature space.

Two common techniques for dimension reduction are:

1. Linear Discriminant Analysis

2. Principal Component Analysis

In this tutorial you will learn about PCA!

The basic difference between the two is that LDA uses class information to find new features that maximize class separability, while PCA uses only the variance of the data to find its new features. In this sense, LDA can be considered a supervised algorithm and PCA an unsupervised algorithm.

In this tutorial you will learn about

1. PCA Working

2. Linear Transformation

3. Key points of PCA

Principal Component Analysis (PCA):

PCA projects the entire feature space into a different feature space with reduction in dimensionality.

But remember: PCA does not select a subset of features and discard the others. It infers new features, which best describe the variance of the data.

PCA Working :

Let’s start with the working of PCA.

Fig. 2 depicts the flow of data

Fig.2 PCA Working (Source: Wikipedia)

(Hey! I am not going into the mathematics. Here you will find a detailed conceptual explanation along with the significance of PCA.)

Let's start..

Consider the 2D Plot of the data (as shown in Fig.3)

Fig.3 2-D data plot (Source: Wikipedia)

1. Subtraction of the mean from the data:

As we can see, subtracting the mean translates the data so that it now has zero mean.

2. Covariance matrix

The covariance of two random variables measures the degree of variation from their respective means with respect to each other. The sign of the covariance provides us with information about the relation between them:

· If the covariance is positive, then the two variables increase and decrease together,

· If the covariance is negative, then when one variable increases the other decreases and vice versa.
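The sign behaviour described above is easy to check numerically. This sketch computes the sample covariance (dividing by n - 1) of made-up variables.

```python
def covariance(xs, ys):
    """Sample covariance of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

hours = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # increases together with hours
falling = [10, 8, 6, 4, 2]   # decreases as hours increases

print(covariance(hours, rising))   # 5.0  (positive: move together)
print(covariance(hours, falling))  # -5.0 (negative: move in opposite directions)
```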

3. Eigenvectors and Eigenvalues

Eigenvectors of a linear transformation are those vectors whose direction remains unchanged when the transformation is applied to them. Their length, however, may change: the result of the transformation is the vector multiplied by a scalar. This scalar is called the eigenvalue, and each eigenvector has one associated with it.

The number of eigenvectors or components that we can calculate for each data set is equal to the dimension of the data set. In this case, we have a 2-dimensional data set, so the number of eigenvectors will be 2. Fig. 4 depicts the eigenvectors.

Fig.4 Eigenvectors (Source: Wikipedia)

Since they are calculated from the covariance matrix described before, eigenvectors represent the directions in which the data have more variance. On the other hand, their respective eigenvalues determine the amount of variance that the data set has in that direction.

4. Principal components

Among all the available eigenvectors that have been calculated in the previous step, we must select those ones onto which we are going to project the data. The selected eigenvectors will be called principal components.

Now the question is: which eigenvector should we choose?

In order to establish a criterion to select the eigenvectors, we must first define the relative variance of each eigenvector and the total variance of a data set. The relative variance of an eigenvector measures how much information can be attributed to it. The total variance of a data set is the sum of the variance of all the variables.

In this example, eigenvector 1 and eigenvector 2 account for around 85% and 15% of the variance, respectively.

A common way to select the components is to establish the amount of information that we want the final data set to explain. If this amount of information decreases, the number of principal components that we select will decrease as well. In this case, as we want to reduce the 2-dimensional data set to a 1-dimensional data set, we select just the first eigenvector as the principal component. As a consequence, the final reduced data set will explain around 85% of the variance of the original one.

5. Reduction of data dimension

Once we have selected the principal components, the data must be projected onto them. The next image shows the result of this projection for our example.

Fig.5 Principal Component (Source: Wikipedia)

Although this projection can explain most of the variance of the original data, we have lost the information about the variance along the second component. In general, this process is irreversible, which means that we cannot recover the original data from the projection.
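The five steps can be put together in a small sketch for 2-D data. It uses the closed-form eigen-decomposition of a symmetric 2x2 matrix so that no external library is needed. The data points are made up, so the explained-variance ratio printed here will not match the 85% of the figures above.

```python
import math

def pca_2d(points):
    """Reduce 2-D points to 1-D along the first principal component."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Step 1: subtract the mean.
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    # Step 2: sample covariance matrix [[a, b], [b, d]].
    a = sum(x * x for x, _ in centered) / (n - 1)
    b = sum(x * y for x, y in centered) / (n - 1)
    d = sum(y * y for _, y in centered) / (n - 1)
    # Step 3: eigenvalues of a symmetric 2x2 matrix in closed form.
    half_trace = (a + d) / 2
    radius = math.hypot((a - d) / 2, b)
    lam1, lam2 = half_trace + radius, half_trace - radius
    if lam1 + lam2 == 0:
        raise ValueError("data has no variance")
    # Eigenvector belonging to the largest eigenvalue lam1.
    if b != 0:
        v = (lam1 - d, b)
    else:
        v = (1.0, 0.0) if a >= d else (0.0, 1.0)
    norm = math.hypot(*v)
    v = (v[0] / norm, v[1] / norm)
    # Step 4: relative variance explained by the first component.
    explained = lam1 / (lam1 + lam2)
    # Step 5: project the centered data onto the principal component.
    projected = [x * v[0] + y * v[1] for x, y in centered]
    return v, explained, projected

points = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
          (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]
direction, explained, projected = pca_2d(points)
print(direction, round(explained, 3))
```

The projected list is the reduced 1-dimensional data set; the variance along the discarded second component is what is lost.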

Linear Transformation:

PCA performs a linear transformation from one feature space to a new feature space.

Let’s understand the linear transformation with the help of a matrix example.

Matrices are useful because you can do things with them like add and multiply. If you multiply a vector v by a matrix A, you get another vector b, and you could say that the matrix performed a linear transformation on the input vector.

Av = b

So A turned v into b.

In Fig. 6 we see how the matrix mapped the short, low line v to the long, high one, b.

Fig.6 Data mapping to Vectors (Source: Wikipedia)

Imagine that all the input vectors v live in a normal grid, like in Fig. 7:

Fig.7 Data mapping to Grid (Source: Wikipedia)

And the matrix projects them all into a new space like the one below, which holds the output vectors b:

Fig.8 Linear Transformation (Source: Wikipedia)

The eigenvector tells you the direction the matrix is blowing in.

Fig.9 Linear Transformation applied on Image (Source: Wikipedia)

So out of all the vectors affected by a matrix blowing through one space, which one is the eigenvector? It’s the one that changes length but not direction; that is, the eigenvector is already pointing in the same direction that the matrix is pushing all vectors toward. The blue line is the eigenvector.
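A tiny numeric check of Av = b and of the "changes length but not direction" property. The matrix A and the vectors are arbitrary examples chosen for illustration.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector: returns b in Av = b."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]

def keeps_direction(v, b, tol=1e-12):
    """True if the 2-D vectors v and b are parallel (zero cross product)."""
    return abs(v[0] * b[1] - v[1] * b[0]) <= tol

A = [[2, 1],
     [1, 2]]

print(mat_vec(A, [1, 0]))  # [2, 1]: direction changed, not an eigenvector
print(mat_vec(A, [1, 1]))  # [3, 3]: same direction, 3 times longer
print(keeps_direction([1, 1], mat_vec(A, [1, 1])))  # True
```

Here [1, 1] is an eigenvector of A with eigenvalue 3, while [1, 0] is not an eigenvector.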

Key Points to remember:

1. The assumptions underlying PCA are linear, and the interpretation is only valid if those assumptions hold. Of course, you can still run PCA on nonlinear data, but the results may be misleading.

2. Does PCA always lose information? No.

Does it sometimes lose information? Yes.

If you keep all the components, you can reconstruct the original data exactly; information is lost only when you drop components to reduce the dimension of your data. PCA is useful because what is dropped is often not important: the discarded components carry the least variance, which is often noise.

3. PCA is not a classification method. Never use PCA by itself to do classification, but you can use it to improve the performance of a classifier.

4. When you apply PCA to your data, the resulting features are guaranteed to be linearly uncorrelated. Many classification algorithms benefit from this.

5. Last but not least, always remember: PCA is a feature engineering method. It does not select a subset of features and discard the others; it infers new features, which best describe the variance of the data.

Go Further!

I hope you enjoyed this post. The tutorial is very helpful to get the overall idea of Principal Component Analysis. A few key points are highlighted at the end of the tutorial. Good Luck!
