Tuesday, July 31, 2018


Principal Component Analysis

Principal Component Analysis (PCA) is a simple yet popular and useful linear transformation technique. PCA transforms the data from one feature space to another feature space of lower dimension. The transformed feature space should explain most of the variance of the original data set while reducing the number of variables. The final variables are called principal components.

Let’s understand it in a simple way.

Consider the Pattern Classification problem.

The pattern classification problem is divided into two phases, namely, training and testing.

In the training phase, the input data is first preprocessed. Then features are extracted from the processed data and fed to the classifier, which learns from these features. Once the model is trained, it stores the knowledge in the form of weights. In the testing phase, the trained model is used to predict the class of the test input.

Fig.1 Pattern Classification system (Source: Wikipedia)


Feature extraction aims at representing the signals by an ideally small number of relevant values, which describe the task-relevant information contained in the signals. The classifier then learns from the data which class corresponds to which input features, so the feature extraction technique plays a very critical role.

Not all of the extracted features are useful for classification, so after feature extraction a dimension reduction step takes place. We are interested in the discriminative features. For example, suppose there are two classes, truck and bike. If we consider a feature such as color, we cannot predict the class (bike/truck) because the color may be the same for both. Hence we are interested in discriminative features, e.g. height. That is exactly what we want: we need to transform the original feature space into a new feature space that captures the maximum variance, and the dimension reduction then takes place in this transformed feature space.

There are two widely used techniques for dimension reduction:

1. Linear Discriminant Analysis 

2. Principal Component Analysis



In this tutorial you will learn about PCA!

The basic difference between the two is that LDA uses class information to find new features that maximize class separability, while PCA uses the variance of the features to do the same. In this sense, LDA can be considered a supervised algorithm and PCA an unsupervised one.

In this tutorial you will learn about:
1. PCA Working 

2. Linear Transformation

3. Key points of PCA

Principal Component Analysis (PCA):
PCA projects the data from the original feature space into a different feature space of reduced dimensionality.

But remember: PCA does not select a subset of the features and discard the others; instead, it infers new features that best describe the class.


PCA Working :
Let’s start with how PCA works.
Fig. 2 depicts the flow of data.

Fig.2 PCA Working (Source: Wikipedia)

(Hey! I am not going into the mathematics... here you will find a detailed theoretical explanation along with the significance of PCA.)


Let's start..

Consider the 2D Plot of the data (as shown in Fig.3)

Fig.3 2-D data plot (Source: Wikipedia)

1. Subtraction of the mean from the data:
As we can see, subtracting the mean translates the data so that it now has zero mean.
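Below is a minimal NumPy sketch of this step. The small 2-D data set is made up purely for illustration; it is not the data shown in the figures.

import numpy as np

# Hypothetical 2-D data set: rows are samples, columns are features.
X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0]])

# Step 1: subtract the per-feature mean so the data is centered at zero.
X_centered = X - X.mean(axis=0)
print(X_centered.mean(axis=0))   # approximately [0, 0]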
2. Covariance matrix
The covariance of two random variables measures the degree of variation from their respective means with respect to each other. The sign of the covariance provides us with information about the relation between them:
·     If the covariance is positive, then the two variables increase and decrease together,
·     If the covariance is negative, then when one variable increases the other decreases and vice versa.
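Continuing the same sketch, the covariance matrix of the centered data can be computed directly with NumPy:

# Step 2: covariance matrix of the centered data.
# rowvar=False tells NumPy that columns (not rows) are the variables.
cov = np.cov(X_centered, rowvar=False)
print(cov)
# A positive off-diagonal entry means the two features tend to increase and
# decrease together; a negative entry means they move in opposite directions.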

3. Eigenvectors and Eigenvalues
Eigenvectors are defined as those vectors whose directions remain unchanged after a linear transformation has been applied to them. However, their length may not remain the same after the transformation, i.e., the result of the transformation is the vector multiplied by a scalar. This scalar is called the eigenvalue, and each eigenvector has one associated with it.

The number of eigenvectors or components that we can calculate for a data set is equal to its dimension. In this case, we have a 2-dimensional data set, so the number of eigenvectors will be 2. Fig. 4 depicts the eigenvectors.

Fig.4 Eigenvectors (Source: Wikipedia)



Since they are calculated from the covariance matrix described before, the eigenvectors represent the directions in which the data has the most variance. Their respective eigenvalues determine the amount of variance that the data set has in each of those directions.
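Continuing the sketch, the eigenvectors and eigenvalues of the covariance matrix can be obtained with NumPy; eigh is used here because a covariance matrix is symmetric.

# Step 3: eigen-decomposition of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(cov)

# Sort from largest to smallest eigenvalue (most to least variance).
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]

print(eigvals)   # variance of the data along each eigenvector
print(eigvecs)   # each column is one eigenvector (a direction)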

 4. Principal components
Among all the eigenvectors calculated in the previous step, we must select the ones onto which we are going to project the data. The selected eigenvectors are called principal components.
Now the question is: which eigenvectors should we choose?

In order to establish a criterion to select the eigenvectors, we must first define the relative variance of each eigenvector and the total variance of a data set. The relative variance of an eigenvector measures how much information can be attributed to it. The total variance of a data set is the sum of the variance of all the variables.
In our example, eigenvector-1 and eigenvector-2 have around 85% and 15% relative variance, respectively.

A common way to select the components is to establish the amount of information that we want the final data set to explain. The less information we require, the fewer principal components we will select. In this case, as we want to reduce the 2-dimensional data set to a 1-dimensional one, we select just the first eigenvector as the principal component. As a consequence, the reduced data set will explain around 85% of the variance of the original one.
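A short continuation of the sketch shows how the relative variance can be computed and used to choose the number of components. The exact percentages depend on the data, so the toy values will not match the 85%/15% split described above.

# Step 4: relative variance of each eigenvector.
relative_variance = eigvals / eigvals.sum()
print(relative_variance)            # fractions that sum to 1

# Keep the smallest number of components explaining at least 85% of the variance.
k = int(np.searchsorted(np.cumsum(relative_variance), 0.85) + 1)
principal_components = eigvecs[:, :k]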

5. Reduction of data dimension
Once we have selected the principal components, the data must be projected onto them. Fig. 5 shows the result of this projection for our example.


Fig.5 Principal Component (Source: Wikipedia)

Although this projection explains most of the variance of the original data, we have lost the information about the variance along the second component. In general, this process is irreversible, which means that we cannot exactly recover the original data from the projection.
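The projection itself is just a matrix product. Continuing the sketch, the snippet below also shows that only an approximate reconstruction is possible once components have been discarded.

# Step 5: project the centered data onto the selected principal components.
X_reduced = X_centered @ principal_components        # shape: (n_samples, k)

# Mapping back only approximates the original data: the variance along the
# discarded eigenvectors cannot be recovered.
X_approx = X_reduced @ principal_components.T + X.mean(axis=0)
print(X_approx - X)                                   # small but non-zero residuals

The same reduction (up to the sign of the components) can be obtained with scikit-learn's PCA(n_components=k).fit_transform(X), which performs the centering internally.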
  
Linear Transformation:
PCA performs a linear transformation from one feature space to a new feature space.
Let’s understand the linear transformation with the help of matrix example.

Matrices are useful because you can do things with them like add and multiply. If you multiply a vector v by a matrix A, you get another vector b, and you could say that the matrix performed a linear transformation on the input vector.
Av = b
So A turned v into b.
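A tiny NumPy example of this idea, using an illustrative matrix A and vector v (not the ones shown in the figures):

import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])    # illustrative 2x2 matrix
v = np.array([1.0, 1.0])

b = A @ v                      # the linear transformation Av = b
print(b)                       # [3. 3.], so A turned v into b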

In Fig. 6 we see how the matrix mapped the short, low vector v to the long, high one, b.


 Fig.6 Data mapping to Vectors (Source: Wikipedia)

Imagine that all the input vectors v live in a normal grid, like in Fig. 7:
 Fig.7 Data mapping to Grid (Source: Wikipedia)

And the matrix projects them all into a new space like the one below, which holds the output vectors b:
  
 Fig.8 Linear Transformation (Source: Wikipedia)
  
 The eigenvector tells you the direction the matrix is blowing in.
 Fig.9 Linear Transformation applied on Image (Source: Wikipedia)
So out of all the vectors affected by a matrix blowing through one space, which one is the eigenvector? It’s the one that changes length but not direction; that is, the eigenvector is already pointing in the same direction that the matrix is pushing all vectors toward. The blue line is the eigenvector.
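Continuing the same small example, this can be checked numerically: multiplying an eigenvector by A only rescales it by its eigenvalue, leaving its direction unchanged.

vals, vecs = np.linalg.eig(A)    # eigenvalues and eigenvectors of A
e = vecs[:, 0]                   # first eigenvector

print(A @ e)                     # same direction as e,
print(vals[0] * e)               # just scaled by the eigenvalue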


 Key Points to remember:

1. The assumptions underlying PCA are linear, and the interpretation is only valid if those assumptions hold. Of course, you can still run a PCA computation on nonlinear data, but the results may not be meaningful.

2. Does PCA always lose information? No.
    Does it sometimes lose information? Yes.
You can reconstruct the original data from all of the components; if PCA always lost information, this would not be possible. It is useful because it often does not lose important information when you use it to reduce the dimension of your data. When information is lost, it is often the higher-frequency detail, and that is often less important.

3. PCA is not a classification method. Never use PCA by itself to do classification, but you can use it to improve the performance of a classifier.

4. When you apply PCA to your data, you are guaranteeing that there will be no correlation between the resulting features. Many classification algorithms benefit from this (a small numerical check is given after this list).

5. Last but not least, always remember: PCA is a feature engineering method. It does not select a subset of the features and discard the others; it infers new features that best describe the class.
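To make point 4 concrete, here is one last continuation of the NumPy sketch from the PCA Working section: projecting the centered data onto all of the eigenvectors yields new features whose covariance matrix is (numerically) diagonal, i.e. the new features are uncorrelated.

# Project onto all eigenvectors (no dimension reduction) and inspect the covariance.
X_all = X_centered @ eigvecs
print(np.round(np.cov(X_all, rowvar=False), 6))   # off-diagonal entries are ~0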


Go Further!

I hope you enjoyed this post. This tutorial should give you an overall idea of Principal Component Analysis, and a few key points are highlighted at the end. Good luck!
Further reading!
Are you interested in Deep Learning and Convolutional Neural Networks?
1. Document Classification using Deep Learning- Click here
2. Improving Performance of Convolutional Neural Network!  Click here

Are you interested in Correlation and Statistical Analysis? Click here

