Machine Learning is the Future!: 2018

Thursday, September 20, 2018

Linear Discriminant Analysis (LDA)

LDA is a way to reduce 'dimensionality' while at the same time preserving as much of the class discrimination information as possible.

How does it work?

Basically, LDA helps you to find the 'boundaries' around clusters of classes. It projects your data points on a line so that your clusters 'are as separated as possible', with each cluster having a relative (close) distance to a centroid.

What its actually doing?

1. Calculating mean vectors of the data in all dimensions.

2. Calculates scatter from the whole group (to determine separability)

3. Calculates scatter from representatives of the same class, using the whole group scatter as a normalizer.

4. Magical grouping around K centroids.

Let us say you have data that is represented by 100-dimensional feature vectors and you have 100000 data points. You know that these data points belong to three different classes but you are not sure which combination of features are mostly affecting their separation. The data you have is too large to perform any reasonable computation in the reasonable time. So you want to reduce these 100-dimensional feature vector to say 50-dimensional feature vector to allow you to learn the data more efficiently.

Performing Principal Component Analysis (PCA) to reduce the number of features(dimensions) would have given you which all features affected your data by computing their leading eigenvalues. But you are not satisfied, though you have obtained the 50 new features, they do not correctly distinguish the 3 classes as they were in the original data.

You want to preserve as the difference between the classes as well while reducing the dimensions. You look for a better alternative, and it leads you to Linear Discriminant Analysis which reduces the number of features by also considering the inter-class separation between the classes.

LDA just reduces the number of dimensions of the input feature vector by preserving the inter-class separation as present in the original feature vector.

Figure 1: Three Class Feature Data

In Figure 1, a 3-dimensional input feature vector is reduced to 1-dimensional feature vector in the meantime preserving the differences among the classes.

Let's talk about linear regression first. You may know that linear regression analysis tries to fit a line through the data points in an n-dimensional plane,

such that the distances between the points and the line are minimized.

Discriminant Analysis is the opposite of linear regression. Here, the task is to maximize the distance between the discrimination boundary or the discriminating line to the data points on either side of the line and minimize the distances between the points themselves.

we know that the hypothesis equation is h(x) = w(t).x + c

The discriminant analysis tries to find the optimum w and c, such that the above-explained theory holds true.

The linear discriminant analysis is the statistical method of classifying an observation having p component in one of the two groups... It is developed by Fisher... Actually, it gives two regions separated by a line so that helps in classifying the given data... The region & separate line is the defined by linear discriminant function... For more details may go through any reference book on Multivariate Analysis...

Logistic regression is a classification algorithm traditionally limited to only two-class classification problems.

If you have more than two classes then Linear Discriminant Analysis is the preferred linear classification technique.

Limitations of Logistic Regression

Logistic regression is a simple and powerful linear classification algorithm. It also has limitations that suggest at the need for alternate linear classification algorithms.

·Two-Class Problems: Logistic regression is intended for two-class or binary classification problems. It can be extended for multi-class classification but is rarely used for this purpose.

·Unstable With Well Separated Classes: Logistic regression can become unstable when the classes are well separated.

·Unstable With Few Examples: Logistic regression can become unstable when there are few examples from which to estimate the parameters.

Figure 2: 2D mapping of Features

Linear Discriminant Analysis does address each of these points and is the go-to linear method for multi-class classification problems. Even with binary-classification problems, it is a good idea to try both logistic regression and linear discriminant analysis.

It consists of statistical properties of your data, calculated for each class. For a single input variable (x) this is the mean and the variance of the variable for each class. For multiple variables, this is the same properties calculated over the multivariate Gaussian, namely the means and the covariance matrix.

These statistical properties are estimated from your data and plug into the LDA equation to make predictions. These are the model values that you would save to file for your model.

LDA makes predictions by estimating the probability that a new set of inputs belongs to each class. The class that gets the highest probability is the output class and a prediction is made.

How to Prepare Data for LDA

This section lists some suggestions you may consider when preparing your data for use with LDA.

·Classification Problems: This might go without saying, but LDA is intended for classification problems where the output variable is categorical. LDA supports both binary and multi-class classification.

·Gaussian Distribution: The standard implementation of the model assumes a Gaussian distribution of the input variables.

·Remove Outliers: Consider removing outliers from your data. These can skew the basic statistics used to separate classes in LDA such the mean and the standard deviation.

·Same Variance: LDA assumes that each input variable has the same variance. It is almost always a good idea to standardize your data before using LDA so that it has a mean of 0 and a standard deviation of 1.

Remember one thing if your dataset is linearly separable, then only apply LDA as a classifier, you will get great results.

Difference between LDA and PCA

LDA is a method of dimensionality reduction. Another well-known one is Principal Component Analysis (PCA).

If you want to know more about PCA please click here

The difference is that PCA does not take into account the class information.

For the two clusters in the above Figure 2, PCA will try to find the direction that maximizes the variance and projects the data onto that direction, which is along the y-axis in this case.This is clearly not ideal. We actually lost information because the projections of the two clusters are no longer separable.

Without any math, LDA needs to accomplish two things: maximize the variance between the two clusters, and minimize the variance of the points within each cluster, after the projection. This results in two projected clusters that are clearly separated. Note that in this case, we’re actually using the fact that there are two clusters, i.e, the class information.

Mathematically, the two goals can be formulated into two covariance matrices. You can read in more detail...

It is based on your dataset. If the dataset has high variance, you need to reduce the number of features and add more dataset. After that use non-linear method for classification.

If the dataset with low variance, use a linear model.

If the dataset is small... having less variance .. use linear model.. otherwise use nonlinear model...

Go Further!

I hope you enjoyed this post. The tutorial is very helpful to get the overall idea of Linear Discriminant Analysis. The difference between LDA and PCA is also highlighted at the end of the tutorial. Good Luck!

Tuesday, July 31, 2018

Principal Component Analysis

Principal Component Analysis (PCA) is a simple yet popular and useful linear transformation technique. PCA transforms the data from one feature space to other feature space of low dimension. The transformed feature space should be able to explain most of the variance of the original data set by making a variable reduction. The final variables will be named as principal component.

Let’s understand it in simplistic way.

Consider the Pattern Classification problem.

The pattern classification problem is divided into two phases, namely, training and testing.

In the training phase, first the input data is preprocessed. Then the features are extracted from the processed data. These features are fed to the classifier. The classifier is learned through the features. Once the model is trained it stores the knowledge in the terms of weights. In the testing phase, the trained model is used to predict the class of test input.

Fig.1 Pattern Classification system (Source: Wikipedia)

The feature extraction aims at representing the signals by an ideally small number of relevant values, which describe the task-relevant information contained in the signals. However, classifiers are able to learn from data which class corresponds to which input features. So the feature extraction technique plays very critical role.

All the extracted features are not useful for classification purpose. After feature extraction feature dimension reduction takes place. We are interested in the discriminate features. e.g there are two classes truck and bike. If we consider the feature as color or number of wheels we are not able to predict the class (bike /truck) because color may be same and number of wheels are four. Hence we are interested in discriminate features e.g height. Yes.. That’s what we want . We need to transform original feature space into new feature space having maximum variance, hence the dimension reduction takes place in the transformed new feature space.

There are two techniques for dimension reduction

1. Linear Discriminant Analysis

2. Principal Component Analysis

In this tutorial you will learn about PCA!

The basic difference between these two is that LDA uses information of classes to find new features in order to maximize its separability while PCA uses the variance of each feature to do the same. In this context, LDA can be consider a supervised algorithm and PCA an unsupervised algorithm.

In this tutorial you will learn about

1. PCA Working

2. Linear Transformation

3. Key points of PCA

Principal Component Analysis (PCA):

PCA projects the entire feature space into a different feature space with reduction in dimensionality.

But remember PCA does not select a set of features and discard other features, but it infers some new features, which best describe the type of a class.

PCA Working :

Let’s start with the working of PCA.

Fig. 2 depicts the flow of data

Fig.2 PCA Working (Source: Wikipedia)

(Hey! I am not going in mathematics...Here you will find out detail theoretical explanation with significance of PCA)

Let's start..

Consider the 2D Plot of the data (as shown in Fig.3)

Fig.3 2-D data plot (Source: Wikipedia)

1. Subtraction of the mean from the data:

As we can see, the subtraction of the mean results in a translation of the data which have now zero mean.

2. Covariance matrix

The covariance of two random variables measures the degree of variation from their respective means with respect to each other. The sign of the covariance provides us with information about the relation between them:

· If the covariance is positive, then the two variables increase and decrease together,

· If the covariance is negative, then when one variable increases the other decreases and vice versa.

3. Eigenvectors and Eigenvalues

Eigenvectors are defined as those vectors whose directions remain unchanged after any linear transformation has been applied to them. However, their length could not remain the same after the transformation, i.e., the result of this transformation is the vector multiplied by a scalar. This scalar is called eigenvalue and each eigenvector has one associated to it.

The number of eigenvectors or components that we can calculate for each data set is equal to the dimension of the data set. In this case, we have a 2-dimensionalal data set so the number of eigenvectors will be 2. Fig. 4 depicts the eigenvectors.

Fig.4 Eigenvectors (Source: Wikipedia)

Since they are calculated from the covariance matrix described before, eigenvectors represent the directions in which the data have more variance. On the other hand, their respective eigenvalues determine the amount of variance that the data set has in that direction.

4. Principal components

Among all the available eigenvectors that have been calculated in the previous step, we must select those ones onto which we are going to project the data. The selected eigenvectors will be called principal components.

Now question is to which eigen vector to choose?

In order to establish a criterion to select the eigenvectors, we must first define the relative variance of each eigenvector and the total variance of a data set. The relative variance of an eigenvector measures how much information can be attributed to it. The total variance of a data set is the sum of the variance of all the variables.

Here we will find out eigenvector-1 and eigenvector-2 is having around 85% and 15% relative variance respectively.

A common way to select the variables is establish the amount of information that we want the final data set to explain. If this amount of information decreases, the number of principal components that we will select will decrease as well. In this case, as we want to reduce the 2-dimensional data set into a 1-dimensional data set, we will select just the first eigenvector as principal component. As a consequence, the final reduced data set will explain around 85% of the variance of the original one.

5. Reduction of data dimension

Once we have selected the principal components, the data must be projected onto them. The next image shows the result of this projection for our example.

Fig.5 Principal Component (Source: Wikipedia)

Although this projection can explain most of the variance of the original data, we have lost the information about the variance along the second component. In general, this process is irreversible, which means that we cannot recover the original data from the projection.

Linear Transformation:

PCA does the linear transformation from one feature space to new feature space.

Let’s understand the linear transformation with the help of matrix example.

Matrices are useful because you can do things with them like add and multiply. If you multiply a vector v by a matrix A, you get another vector b, and you could say that the matrix performed a linear transformation on the input vector.

Av = b

So A turned v into b.

In the Fig. 6 we see how the matrix mapped the short, low line v, to the long, high one, b.

Fig.6 Data mapping to Vectors (Source: Wikipedia)

Imagine that all the input vectors v live in a normal grid, like in Fig. 7:

Fig.7 Data mapping to Grid (Source: Wikipedia)

And the matrix projects them all into a new space like the one below, which holds the output vectors b:

Fig.8 Linear Transformation (Source: Wikipedia)

The eigenvector tells you the direction the matrix is blowing in.

Fig.9 Linear Transformation applied on Image (Source: Wikipedia)

So out of all the vectors affected by a matrix blowing through one space, which one is the eigenvector? It’s the one that that changes length but not direction; that is, the eigenvector is already pointing in the same direction that the matrix is pushing all vectors toward. The blue line is eigenvector.

Key Points to remember:

1. It is simply that the assumptions underlying PCA are linear - and the interpretation is only valid if those assumptions are true. OF course, you can still do a PCA computation on nonlinear data - but the results will be meaningless.

2. Does PCA always lose information? No.

Does it sometimes lose information? Yes.

You can reconstruct the original data from components. If it always lost information then this would not be possible.It is useful because it often does not lose important information when you use it to reduce dimension of your data. When you lose data it is often the higher frequency data and often that is less important.

3. PCA is not a classification method. Remember never use PCA to do classification, but you can use it to imrove performance of classifier.

4. When you apply PCA to your data you are guaranteeing there will be no correlation between the resulting features. Many classification algorithms benefit from it

5. Last but not least.. Always remember...PCA is a feature engineering method. It does not select a set of features and discard other features, but it infers some new features, which best describe the type of a class.

Go Further!

I hope you enjoyed this post. The tutorial is very helpful to get the overall idea of Principal Component Analysis. Few key points are heighlighted at the end of the tutorial. Good Luck!

Further reading!

Are you interested in Deep Learning- Convolutional Neural Network!

1. Document Classification using Deep Learning- Click here

2. Improving Performance of Convolutional Neural Network! Click here

Are you interested in Correlation -Statistical Analysis! Click here

Wednesday, July 18, 2018

Correlation - Statistical Analysis!

The most important step in computer vision or machine learning is to understand data well and use that knowledge to make the best design choice.

The open question is

How to understand data well?

The answer is by applying statistical techniques...

Hence the red theme of this tutorial is to understand most important statistical technique i.e Correlation.

The word correlation is used in everyday life to denote some form of association. It is a statistical technique that can show whether and how strongly pairs of variables are related. We might say that we have noticed correlation between student attendance and marks obtained. However, in statistical terms we use correlation to denote association between two quantitative variables. When one variable increases as the other increases the correlation is positive; when one decreases as the other increases it is negative. Fig. 1 depicts the positive, negative and no correlation.

Fig.1 Types of Correlation (Source:Wikipedia)

What is the significance of correlation?

In machine learning before applying any classifier we must find out correlation of intra-intent and inter-intent patterns. The correlation of intra-intent patterns is obviously higher than inter-intent patterns. For example patterns of the same class have more association binding than patterns of different classes. It is good practice to decide hypothesis of your problem statement w.r.t. correlation and check whether really the problem is of pattern classification.

In this tutorial you will learn about

1. Correlation Coefficient

2. Pearson vs. Spearman correlation technique

3. Views from Applied Perspective

1. Correlation coefficient

The degree of association is measured by correlation coefficient. A correlation coefficient is a way to put a value to the relationship. Correlation coefficients have a value of between -1 and 1. A “0” means there is no relationship between the variables at all, while -1 or 1 means that there is a perfect negative or positive correlation. The Table 1 describes the strength of relationship.

Table 1. Strength of Relationship

How to get this r value..

There are different correlation techniques to get this r

Let’s start with interesting stuff...

2. Pearson vs. Spearman correlation technique

Pearson correlation is parametric whilst Spearman correlation is nonparametric test.

First understand the difference in parametric Vs. Nonparametric test.

Pearson r correlation: Pearson r correlation is the most widely used correlation statistic to measure the degree of the relationship between linearly related variables. For example, in the stock market, if we want to measure how two stocks are related to each other, Pearson r correlation is used to measure the degree of relationship between the two. The following formula is used to calculate the Pearson r correlation:

r = Pearson r correlation coefficient
N = number of observations
∑xy = sum of the products of paired scores
∑x = sum of x scores
∑y = sum of y scores
∑x2= sum of squared x scores
∑y2= sum of squared y scores

Key Points :

1. In general, when the data is normally distributed we are using Pearson correlation. The normal distribution is always symmetrical about mean which looks like bell curve.

2. Linearity is not the assumption of Pearson correlation. Pearson correlation determines the degree to which relation is linear. The relation is linear if variables increase or decrease at constant speed.

Python script to compute Pearson correlation coefficient

>>> import matplotlib.pyplot as plt

>>> from scipy import stats

>>> np.random.seed(12345678)

>>> x = np.random.random(10)

>>> y = np.random.random(10)

>>> slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)

where,

slope- It is the slope of the regression line.

intercept- It is the intercept of the regression line.

r-value – It is the pearson correlation value. The r value is in between -1 to 1.

P-value- The P-value is a critical value depends on the probability you are allowing for a Type-I error. It is also called as hypothesis test since it tells whether to accept or reject Null hypothesis. (Null hypothesis is a hypothesis that says there is no statistical significance between two variables. It is a hypothesis a resercher will try to disprove)

In general if p<0.05(critical value) reject null hypothesis else accept it.

Std_err- it is the standard error of the estimate.

More explanation about r-value and p-value

r-value tells about the variation within data.

P-value tells about significance of model(i.e model fits the data well)

Let’s understand the Four possibilities:

1. r-value(low) and p-value(low) – Model doesn’t explain much about variation, but is significant. (Better than nothing)

2. r-value(low) and p-value(high) – Model doesn’t explain much about variation and not significant (Worst model)

3. r-value(high) and p-value(low) – Model tells much about variation and significant. (Best model)

4. r-value(high) and p-value(high) – Model explains well about variation but not significant. (Worthless)

Spearman rank correlation: Spearman rank correlation is a non-parametric test that is used to measure the degree of association between two variables. The Spearman rank correlation test does not carry any assumptions about the distribution of the data and is the appropriate correlation analysis when the variables are measured on a scale that is at least ordinal.

The following formula is used to calculate the Spearman rank correlation:

ρ= Spearman rank correlation

di= the difference between the ranks of corresponding variables
n= number of observations

Key Points :

1. In Spearman Rank correlation information loss is there since it is working on ranks.

2. In general, for monotonic relationship between variables Spearman rank correlation is used. In Monotonic relation the variables tend to move in the same direction but not at constant speed.

3. If the data has outliers i.e few values are far away from others use Spearman rank correlation coefficient.

Python script to compute Spearman Rank correlation coefficient.

>>> from scipy import stats

>>> np.random.seed(12345678)

>>> x = np.random.random(10)

>>> y = np.random.random(10)

>>> scipy.stats.stats.spearmanr(x1,y1)

3. Views from Applied Perspective

Here are few suggestions from applied perspective:

1. Before taking decision whether to apply Pearson or Spearman rank correlation it is good practice to look at the scatterplot

>>> python script to plot scatterplot

>>> import numpy as np

>>> import matplotlib.pyplot as plt

# Fixing random state for reproducibility

>>> np.random.seed(19680801)

>>> N = 50

>>> x = np.random.rand(N)

>>> y = np.random.rand(N)

>>> colors = np.random.rand(N)

>>> area = (30 * np.random.rand(N))**2 # 0 to 15 point radii

>>> plt.scatter(x, y, s=area, c=colors, alpha=0.5)

>>> plt.show()

2. For small sample I will advise to use Spearman rank correlation.

3. For the large sample use Pearson correlation.

Last One

I prefer Pearson correlation coefficient because

1. Pearson correlation is having more statistical power.

2. Pearson correlation enables more direct compatibility of finding across studies, because most of the studies report Pearson correlation.

3, In many cases there is minimal difference between Pearson and Spearman correlation coefficient.

4. Obviously it aligns with my theoretical interests.

Go Further!

I hope you enjoyed this post. The tutorial is good to start statistical analysis using Correlation. The post is very informative not only to get the knowledge of Pearson and Spearman rank correlation but also from the applicability perspective. Good Luck!

Worth reading!

Are you interested in Deep Learning- Convolutional Neural Network!

1. Document Classification using Deep Learning- Click here

2. Improving Performance of Convolutional Neural Network! Click here