Principal Component Analysis
Principal Component Analysis (PCA) is a simple yet popular and useful linear transformation technique. PCA transforms the data from one feature space to another feature space of lower dimension.
The transformed feature space should explain most of the variance of the original data set while reducing the number of variables. The resulting variables are called principal components.
Let's understand it in a simple way.
Consider the pattern classification problem.
The pattern classification problem is divided into two phases, namely, training and testing.
In the training phase, the input data is first preprocessed. Then features are extracted from the processed data. These features are fed to the classifier, which learns from them. Once the model is trained, it stores the learned knowledge in the form of weights. In the testing phase, the trained model is used to predict the class of a test input.
Fig.1 Pattern Classification system (Source: Wikipedia)
Feature extraction aims at representing the signals by an ideally small number of relevant values, which describe the task-relevant information contained in the signals. The classifier then learns from the data which class corresponds to which input features. So the feature extraction technique plays a very critical role.
Not all of the extracted features are useful for classification, so feature dimension reduction takes place after feature extraction. We are interested in the discriminative features. For example, suppose there are two classes, truck and bike. If we consider color as the feature, we cannot predict the class (bike/truck), because both may have the same color. Hence we are interested in discriminative features, e.g., height. Yes, that's what we want!
We need to transform the original feature space into a new feature space having maximum variance; the dimension reduction then takes place in this transformed feature space.
Two popular techniques for dimension reduction are:
1. Linear Discriminant Analysis (LDA)
2. Principal Component Analysis (PCA)
In this tutorial you will learn about PCA!
The basic difference between the two is that LDA uses class information to find new features that maximize class separability, while PCA uses the variance of each feature to do the same. In this context, LDA can be considered a supervised algorithm and PCA an unsupervised one.
In this tutorial you will learn about:
1. PCA Working
2. Linear Transformation
3. Key Points of PCA
Principal Component Analysis (PCA):
PCA projects the entire feature space into a different feature space with a reduction in dimensionality.
But remember, PCA does not select a set of features and discard the others; instead, it infers some new features, which best describe the type of a class.
PCA Working:
Let's start with the working of PCA. Fig. 2 depicts the flow of data.
Fig.2 PCA Working (Source: Wikipedia)
(Hey! I am not going into the mathematics here... you will find a detailed theoretical explanation along with the significance of PCA.)
Let's start.
Consider the 2-D plot of the data (as shown in Fig. 3).
Fig.3 2-D data plot (Source: Wikipedia)
1. Subtraction of the mean from the data:
As we can see, subtracting the mean translates the data so that it now has zero mean.
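To make this concrete, here is a minimal sketch of mean subtraction using NumPy; the small 2-D data set below is purely illustrative (it is not taken from the figures).

```python
import numpy as np

# A small illustrative 2-D data set (5 samples, 2 features)
X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0]])

# Subtract the per-feature mean so the data is centered at zero
X_centered = X - X.mean(axis=0)
print(X_centered.mean(axis=0))   # approximately [0. 0.]
```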
2. Covariance matrix
The covariance of two random variables measures how they vary from their respective means with respect to each other. The sign of the covariance tells us about the relation between them:
· If the covariance is positive, the two variables increase and decrease together.
· If the covariance is negative, when one variable increases the other decreases, and vice versa.
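As a quick sketch (again using NumPy and the same illustrative data), the covariance matrix of the centered data can be computed as follows; each column is treated as one variable.

```python
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
X_centered = X - X.mean(axis=0)

# rowvar=False because each column of X_centered is a variable (feature)
cov_matrix = np.cov(X_centered, rowvar=False)
print(cov_matrix)   # the sign of the off-diagonal entry shows whether the two features move together
```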
3. Eigenvectors and Eigenvalues
Eigenvectors are those vectors whose direction remains unchanged when a given linear transformation is applied to them. However, their length may change after the transformation; i.e., the result of the transformation is the original vector multiplied by a scalar. This scalar is called the eigenvalue, and each eigenvector has one associated with it.
The number of eigenvectors or components that we can calculate for a data set is equal to the dimension of the data set. In this case, we have a 2-dimensional data set, so the number of eigenvectors will be 2. Fig. 4 depicts the eigenvectors.
Fig.4 Eigenvectors (Source: Wikipedia)
Since they are calculated from the covariance matrix described above, the eigenvectors represent the directions in which the data have the most variance. Their respective eigenvalues determine the amount of variance the data set has in that direction.
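Continuing the same illustrative sketch, the eigenvectors and eigenvalues of the covariance matrix can be obtained with NumPy; `eigh` is used here because the covariance matrix is symmetric.

```python
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
X_centered = X - X.mean(axis=0)
cov_matrix = np.cov(X_centered, rowvar=False)

# eigh works on symmetric matrices and returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
print(eigenvalues)    # variance of the data along each eigenvector
print(eigenvectors)   # each column is one eigenvector (a direction in feature space)
```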
4. Selection of principal components
Among all the eigenvectors calculated in the previous step, we must select the ones onto which we are going to project the data. The selected eigenvectors are called principal components.
Now the question is: which eigenvector should we choose?
In order to establish a criterion for selecting the eigenvectors, we must first define the relative variance of each eigenvector and the total variance of the data set. The relative variance of an eigenvector measures how much information can be attributed to it. The total variance of a data set is the sum of the variances of all its variables.
Here we find that eigenvector-1 and eigenvector-2 have around 85% and 15% relative variance, respectively.
A common way to select the components is to establish the amount of information that we want the final data set to explain. If this amount of information decreases, the number of principal components that we select decreases as well. In this case, as we want to reduce the 2-dimensional data set to a 1-dimensional data set, we select just the first eigenvector as the principal component. As a consequence, the final reduced data set will explain around 85% of the variance of the original one.
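A minimal sketch of this selection criterion, assuming the eigenvalues from the previous step: the relative variance of each eigenvector is simply its eigenvalue divided by the sum of all eigenvalues (the 85% threshold below is purely illustrative).

```python
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
X_centered = X - X.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_centered, rowvar=False))

# Relative variance: fraction of the total variance carried by each eigenvector
relative_variance = eigenvalues / eigenvalues.sum()

# Keep just enough components (sorted by eigenvalue, descending) to reach the desired threshold
order = np.argsort(eigenvalues)[::-1]
cumulative = np.cumsum(relative_variance[order])
n_components = int(np.searchsorted(cumulative, 0.85) + 1)   # illustrative 85% threshold
print(relative_variance[order], n_components)
```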
5. Reduction of data dimension
Once we have selected the principal components, the data must be projected onto them. Fig. 5 shows the result of this projection for our example.
Fig.5 Principal Component (Source: Wikipedia)
Although this projection explains most of the variance of the original data, we have lost the information about the variance along the second component. In general, this process is irreversible, which means that we cannot recover the original data from the projection.
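Here is how that projection could look in NumPy for the same illustrative data, keeping only the eigenvector with the largest eigenvalue as the single principal component.

```python
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
X_centered = X - X.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_centered, rowvar=False))

# The principal component is the eigenvector with the largest eigenvalue
principal_component = eigenvectors[:, np.argmax(eigenvalues)]

# Project the 2-D centered data onto the 1-D principal component
X_reduced = X_centered @ principal_component
print(X_reduced.shape)   # (5,) -- one value per sample instead of two
```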
Linear Transformation:
PCA performs a linear transformation from one feature space to a new feature space.
Let's understand the linear transformation with the help of a matrix example.
Matrices are useful because you can do things with them like add and multiply. If you multiply a vector v by a matrix A, you get another vector b, and you can say that the matrix performed a linear transformation on the input vector.
Av = b
So A turned v into b.
In Fig. 6 we see how the matrix mapped the short, low line v to the long, high one, b.
Imagine that all the input vectors v live in a normal grid, like in Fig. 7:
Fig.7 Data mapping to Grid (Source: Wikipedia)
And the matrix projects them all into a new space, like the one shown in Fig. 8, which holds the output vectors b:
Fig.8 Linear Transformation (Source: Wikipedia)
The eigenvector tells you the direction in which the matrix is "blowing".
Fig.9 Linear Transformation applied on Image (Source: Wikipedia)
So out of all the vectors affected by a matrix blowing through one space, which one is the eigenvector? It is the one that changes length but not direction; that is, the eigenvector is already pointing in the same direction that the matrix is pushing all vectors toward. The blue line in Fig. 9 is the eigenvector.
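A tiny NumPy sketch of this idea, with an illustrative matrix A (not taken from the figures): multiplying an eigenvector by A only stretches it (same direction, scaled by the eigenvalue), while other vectors also change direction.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])    # an illustrative matrix

v = np.array([1.0, 1.0])      # an eigenvector of A
print(A @ v)                  # [3. 3.] -- same direction as v, stretched by the eigenvalue 3

w = np.array([1.0, 0.0])      # not an eigenvector
print(A @ w)                  # [2. 1.] -- the direction has changed
```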
Key Points to remember:
1. The assumptions underlying PCA are linear, and the interpretation is only valid if those assumptions hold. Of course, you can still run a PCA computation on nonlinear data, but the results will be meaningless.
2. Does PCA always lose information? No. Does it sometimes lose information? Yes.
You can reconstruct the original data from all the components; if PCA always lost information, this would not be possible (see the short sketch after these key points). PCA is useful because it often does not lose important information when you use it to reduce the dimension of your data. The information you lose is often the higher-frequency information, and that is often less important.
3. PCA is not a classification method. Never use PCA to do classification, but you can use it to improve the performance of a classifier.
4. When you apply PCA to your data, you are guaranteed that there will be no correlation between the resulting features. Many classification algorithms benefit from this.
5. Last but not least... always remember: PCA is a feature engineering method. It does not select a set of features and discard the others; instead, it infers some new features, which best describe the type of a class.
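To back up key point 2, here is a short sketch (same illustrative data as before) showing that projecting onto all the components and then inverting the projection recovers the original data exactly; information is lost only when components are dropped.

```python
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
mean = X.mean(axis=0)
X_centered = X - mean
_, eigenvectors = np.linalg.eigh(np.cov(X_centered, rowvar=False))

# Project onto ALL the components, then invert the projection
scores = X_centered @ eigenvectors
X_reconstructed = scores @ eigenvectors.T + mean
print(np.allclose(X, X_reconstructed))   # True -- nothing was lost
```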
Go Further!
I hope you enjoyed this post. The tutorial is intended to give you an overall idea of Principal Component Analysis. A few key points are highlighted at the end of the tutorial. Good luck!
Further reading!
Are you interested in Deep Learning - Convolutional Neural Networks?
1. Document Classification using Deep Learning - Click here
2. Improving Performance of Convolutional Neural Network - Click here
Are you interested in Correlation - Statistical Analysis? Click here