Tuesday, July 31, 2018


Principal Component Analysis

Principal Component Analysis (PCA) is a simple yet popular and useful linear transformation technique. PCA transforms the data from the original feature space to a new feature space of lower dimension. The transformed feature space should explain most of the variance of the original data set while reducing the number of variables. The new variables are called principal components.

Let's understand it in a simple way.

Consider the Pattern Classification problem.

The pattern classification problem is divided into two phases, namely, training and testing.

In the training phase, the input data is first preprocessed. Features are then extracted from the processed data and fed to the classifier, which learns from these features. Once the model is trained, it stores the learned knowledge in the form of weights. In the testing phase, the trained model is used to predict the class of a test input.

Fig.1 Pattern Classification system (Source: Wikipedia)


Feature extraction aims at representing the signals by an ideally small number of relevant values which describe the task-relevant information contained in the signals. The classifier then learns from the data which class corresponds to which input features. The feature extraction technique therefore plays a very critical role.

Not all the extracted features are useful for classification, so feature extraction is followed by dimension reduction. We are interested in discriminative features. For example, suppose there are two classes, truck and bike. If we consider color as the feature, we cannot predict the class (bike/truck) because both may have the same color. A discriminative feature, such as height or the number of wheels, does separate them. Yes, that's what we want. We need to transform the original feature space into a new feature space that captures the maximum variance, and the dimension reduction takes place in this transformed feature space.

There are two techniques for dimension reduction

1. Linear Discriminant Analysis 

2. Principal Component Analysis



In this tutorial you will learn about PCA!

The basic difference between the two is that LDA uses class label information to find new features that maximize class separability, while PCA uses the variance of each feature to do the same. In this context, LDA can be considered a supervised algorithm and PCA an unsupervised one.

In this tutorial you will learn about
1. PCA Working 

2. Linear Transformation

3. Key points of PCA

Principal Component Analysis (PCA):
PCA projects the entire feature space into a different feature space with reduction in dimensionality.

But remember: PCA does not select some features and discard the others; it infers new features which best describe the data.


PCA Working :
Let’s start with the working of PCA.
Fig. 2 depicts the flow of data

Fig.2 PCA Working (Source: Wikipedia)

(Hey! I am not going deep into the mathematics... Here you will find a detailed theoretical explanation along with the significance of PCA.)


Let's start..

Consider the 2D Plot of the data (as shown in Fig.3)

Fig.3 2-D data plot (Source: Wikipedia)

1. Subtraction of the mean from the data:
As we can see, subtracting the mean results in a translation of the data, which now has zero mean.
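As a tiny illustration with NumPy, here is what mean subtraction looks like on a made-up 2-D data set (the numbers are arbitrary stand-ins for the data in Fig. 3):

>>> import numpy as np
>>> np.random.seed(0)
>>> X = np.random.randn(100, 2) * [2.0, 0.5] + [5.0, 3.0]   # toy 2-D data with a nonzero mean
>>> X_centered = X - X.mean(axis=0)                         # subtract the mean of each feature
>>> X_centered.mean(axis=0)                                 # now approximately [0, 0]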
2. Covariance matrix
The covariance of two random variables measures the degree of variation from their respective means with respect to each other. The sign of the covariance provides us with information about the relation between them:
·     If the covariance is positive, then the two variables increase and decrease together,
·     If the covariance is negative, then when one variable increases the other decreases and vice versa.
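Here is a small NumPy sketch of step 2; the toy variables below are made up so that y tends to increase with x:

>>> import numpy as np
>>> np.random.seed(0)
>>> x = np.random.randn(100)
>>> y = 0.8 * x + 0.3 * np.random.randn(100)       # y moves with x, so the covariance is positive
>>> X = np.column_stack([x, y])
>>> X_centered = X - X.mean(axis=0)
>>> cov = np.cov(X_centered, rowvar=False)         # 2x2 covariance matrix
>>> cov[0, 1] > 0                                  # True: the two variables increase together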

3. Eigenvectors and Eigenvalues
Eigenvectors are defined as those vectors whose directions remain unchanged after a linear transformation has been applied to them. Their length, however, may change: the result of the transformation is the vector multiplied by a scalar. This scalar is called the eigenvalue, and each eigenvector has one associated with it.

The number of eigenvectors or components that we can calculate for each data set is equal to the dimension of the data set. In this case, we have a 2-dimensional data set, so the number of eigenvectors will be 2. Fig. 4 depicts the eigenvectors.

Fig.4 Eigenvectors (Source: Wikipedia)



Since they are calculated from the covariance matrix described before, the eigenvectors represent the directions in which the data have the most variance. Their respective eigenvalues determine the amount of variance that the data set has in that direction.
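A short NumPy sketch of step 3 on the same kind of toy data: the eigenvectors of the covariance matrix give the directions of variance, and the eigenvalues give how much variance lies along each of them.

>>> import numpy as np
>>> np.random.seed(0)
>>> x = np.random.randn(200)
>>> y = 0.8 * x + 0.3 * np.random.randn(200)
>>> X = np.column_stack([x, y])
>>> X_centered = X - X.mean(axis=0)
>>> cov = np.cov(X_centered, rowvar=False)
>>> eigvals, eigvecs = np.linalg.eigh(cov)          # eigh: the covariance matrix is symmetric
>>> order = np.argsort(eigvals)[::-1]               # sort from largest to smallest eigenvalue
>>> eigvals, eigvecs = eigvals[order], eigvecs[:, order]
>>> eigvecs[:, 0]                                   # direction of maximum variance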

 4. Principal components
Among all the available eigenvectors that have been calculated in the previous step, we must select those ones onto which we are going to project the data. The selected eigenvectors will be called principal components.
Now the question is: which eigenvectors to choose?

In order to establish a criterion to select the eigenvectors, we must first define the relative variance of each eigenvector and the total variance of a data set. The relative variance of an eigenvector measures how much information can be attributed to it. The total variance of a data set is the sum of the variance of all the variables.
Here we find that eigenvector-1 and eigenvector-2 have around 85% and 15% relative variance, respectively.

A common way to select the components is to establish the amount of information (variance) that we want the final data set to explain. If this amount of information decreases, the number of principal components we select decreases as well. In this case, as we want to reduce the 2-dimensional data set into a 1-dimensional one, we select just the first eigenvector as the principal component. As a consequence, the final reduced data set will explain around 85% of the variance of the original one.
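The relative variance of each eigenvector follows directly from the eigenvalues; here is a tiny sketch (the eigenvalues below are made-up numbers chosen to match the 85%/15% split in the text):

>>> import numpy as np
>>> eigvals = np.array([1.45, 0.25])                # hypothetical eigenvalues, largest first
>>> relative_variance = eigvals / eigvals.sum()     # each eigenvalue's share of the total variance
>>> relative_variance                               # roughly [0.85, 0.15]
# Keeping only the first component therefore explains about 85% of the variance.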

5. Reduction of data dimension
Once we have selected the principal components, the data must be projected onto them. The next image shows the result of this projection for our example.


Fig.5 Principal Component (Source: Wikipedia)

Although this projection can explain most of the variance of the original data, we have lost the information about the variance along the second component. In general, this process is irreversible, which means that we cannot recover the original data from the projection.
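Putting steps 1-5 together, here is a compact NumPy sketch of the whole projection, with the equivalent scikit-learn call for comparison (the 2-D data is again a made-up example):

>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> np.random.seed(0)
>>> x = np.random.randn(200)
>>> y = 0.8 * x + 0.3 * np.random.randn(200)
>>> X = np.column_stack([x, y])
# Manual PCA: center, covariance, eigen-decomposition, projection onto the first component
>>> Xc = X - X.mean(axis=0)
>>> eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
>>> pc1 = eigvecs[:, np.argmax(eigvals)]            # eigenvector with the largest eigenvalue
>>> X_reduced = Xc @ pc1                            # shape (200,): the 1-D projection
# The same reduction with scikit-learn (the sign of the component may differ)
>>> X_reduced_sk = PCA(n_components=1).fit_transform(X)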
  
Linear Transformation:
PCA performs a linear transformation from the original feature space to a new feature space.
Let’s understand the linear transformation with the help of matrix example.

Matrices are useful because you can do things with them like add and multiply. If you multiply a vector v by a matrix A, you get another vector b, and you could say that the matrix performed a linear transformation on the input vector.
Av = b
So A turned v into b
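A quick NumPy illustration of Av = b (the matrix and the vector are arbitrary examples):

>>> import numpy as np
>>> A = np.array([[2.0, 1.0],
...               [1.0, 2.0]])       # an arbitrary linear transformation
>>> v = np.array([1.0, 0.5])
>>> b = A @ v                        # the matrix A turns v into b
>>> b                                # [2.5, 2.0]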

In Fig. 6 we see how the matrix mapped the short, low vector v to the long, high one, b.


 Fig.6 Data mapping to Vectors (Source: Wikipedia)

Imagine that all the input vectors v live in a normal grid, like in Fig. 7:
 Fig.7 Data mapping to Grid (Source: Wikipedia)

And the matrix projects them all into a new space like the one below, which holds the output vectors b:
  
 Fig.8 Linear Transformation (Source: Wikipedia)
  
 The eigenvector tells you the direction the matrix is blowing in.
 Fig.9 Linear Transformation applied on Image (Source: Wikipedia)
So out of all the vectors affected by a matrix blowing through one space, which one is the eigenvector? It is the one that changes length but not direction; that is, the eigenvector is already pointing in the same direction that the matrix is pushing all vectors toward. The blue line is the eigenvector.


 Key Points to remember:

1.  The assumptions underlying PCA are linear, and the interpretation is only valid if those assumptions hold. Of course, you can still do a PCA computation on nonlinear data, but the results will be meaningless.

2. Does PCA always lose information? No.
    Does it sometimes lose information? Yes.
If you keep all the components you can reconstruct the original data from them; if PCA always lost information, this would not be possible. It is useful because it often does not lose important information when you use it to reduce the dimension of your data. What is lost is often the higher-frequency, low-variance part of the data, and that is often less important.

3. PCA is not a classification method. Never use PCA itself to do classification, but you can use it to improve the performance of a classifier.

4. When you apply PCA to your data, you are guaranteeing that there will be no correlation between the resulting features. Many classification algorithms benefit from this.

5. Last but not least... always remember: PCA is a feature engineering method. It does not select some features and discard the others; it infers new features which best describe the data.


Go Further!

I hope you enjoyed this post. The tutorial should give you an overall idea of Principal Component Analysis, and a few key points are highlighted at the end. Good Luck!
Further reading!
Are you interested in Deep Learning- Convolutional Neural Network!
1. Document Classification using Deep Learning- Click here
2. Improving Performance of Convolutional Neural Network!  Click here

Are you interested in Correlation -Statistical Analysis! Click here


Wednesday, July 18, 2018


Correlation - Statistical Analysis!

The most important step in computer vision or machine learning is to understand data well and use that knowledge to make the best design choice.

The open question is
How to understand data well?

The answer is by applying statistical techniques...

Hence the central theme of this tutorial is to understand one of the most important statistical techniques, i.e., correlation.

The word correlation is used in everyday life to denote some form of association. It is a statistical technique that can show whether and how strongly pairs of variables are related. We might say that we have noticed correlation between student attendance and marks obtained. However, in statistical terms we use correlation to denote association between two quantitative variables. When one variable increases as the other increases the correlation is positive; when one decreases as the other increases it is negative. Fig. 1 depicts the positive, negative and no correlation.

                                              Fig.1 Types of Correlation (Source:Wikipedia)


What is the significance of correlation?

In machine learning, before applying any classifier, we should examine the correlation of intra-class and inter-class patterns. The correlation of intra-class patterns is usually higher than that of inter-class patterns; patterns of the same class have a stronger association than patterns of different classes. It is good practice to frame the hypothesis of your problem statement w.r.t. correlation and check whether the problem really is one of pattern classification.


In this tutorial you will learn about
1. Correlation Coefficient
2. Pearson vs. Spearman correlation technique
3. Views from Applied Perspective

1. Correlation coefficient

The degree of association is measured by the correlation coefficient. A correlation coefficient is a way to put a value to the relationship. Correlation coefficients have a value between -1 and 1. A value of 0 means there is no relationship between the variables at all, while -1 or 1 means that there is a perfect negative or positive correlation. Table 1 describes the strength of the relationship.

                                                Table 1. Strength of Relationship
 

How to get this r value..

There are different correlation techniques to get this r

Let’s start with interesting stuff...

2. Pearson vs. Spearman correlation technique

Pearson correlation is a parametric test, whilst Spearman correlation is a nonparametric test.

First, understand the difference between parametric and nonparametric tests.



Pearson r correlation: Pearson r correlation is the most widely used correlation statistic to measure the degree of the relationship between linearly related variables. For example, in the stock market, if we want to measure how two stocks are related to each other, Pearson r correlation is used to measure the degree of relationship between the two. The following formula is used to calculate the Pearson r correlation:


r = [N∑xy − (∑x)(∑y)] / √{[N∑x² − (∑x)²][N∑y² − (∑y)²]}

where:
r = Pearson r correlation coefficient
N = number of observations
∑xy = sum of the products of paired scores
∑x = sum of x scores
∑y = sum of y scores
∑x² = sum of squared x scores
∑y² = sum of squared y scores

Key Points :
1. In general, when the data is normally distributed we use Pearson correlation. The normal distribution is symmetrical about the mean and looks like a bell curve.

2. Pearson correlation does not assume linearity; rather, it measures the degree to which the relationship is linear. The relationship is linear if the variables increase or decrease at a constant rate.

Python script to compute Pearson correlation coefficient

>>> import numpy as np
>>> from scipy import stats
>>> np.random.seed(12345678)
>>> x = np.random.random(10)
>>> y = np.random.random(10)
>>> slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
where,
slope: the slope of the regression line.
intercept: the intercept of the regression line.
r_value: the Pearson correlation coefficient; its value lies between -1 and 1.
p_value: a critical value that depends on the probability you are allowing for a Type-I error. It is also used as a hypothesis test, since it tells you whether to reject the null hypothesis. (The null hypothesis is a hypothesis that says there is no statistical significance between two variables; it is the hypothesis a researcher tries to disprove.)
In general, if p < 0.05 (the critical value) we reject the null hypothesis; otherwise we do not reject it.
std_err: the standard error of the estimate.
More explanation about the r-value and the p-value:
The r-value tells us about the strength of the relationship (how much of the variation in the data is explained).
The p-value tells us about the significance of the model (i.e., whether the fit is statistically significant).
Let's understand the four possibilities:
1. Low r-value and low p-value – the model doesn't explain much of the variation, but it is significant. (Better than nothing)
2. Low r-value and high p-value – the model doesn't explain much of the variation and is not significant. (Worst case)
3. High r-value and low p-value – the model explains much of the variation and is significant. (Best case)
4. High r-value and high p-value – the model explains the variation well but is not significant. (Worthless)

Spearman rank correlation: Spearman rank correlation is a non-parametric test that is used to measure the degree of association between two variables. The Spearman rank correlation test does not carry any assumptions about the distribution of the data and is the appropriate correlation analysis when the variables are measured on a scale that is at least ordinal.
The following formula is used to calculate the Spearman rank correlation:

ρ = 1 − (6∑di²) / (n(n² − 1))

where:
ρ = Spearman rank correlation
di = the difference between the ranks of corresponding variables
n = number of observations
Key Points :
1. In Spearman rank correlation some information is lost, since it works on ranks rather than the raw values.

2. In general, Spearman rank correlation is used for a monotonic relationship between variables. In a monotonic relationship the variables tend to move in the same direction, but not necessarily at a constant rate.

3. If the data has outliers, i.e., a few values far away from the others, use the Spearman rank correlation coefficient.
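To see key point 3 in action, here is a small self-contained sketch comparing the two coefficients on made-up data containing one outlier:

>>> import numpy as np
>>> from scipy import stats
>>> x = np.arange(1, 11, dtype=float)
>>> y = 2.0 * x
>>> y[-1] = 100.0                       # one outlier far away from the others
>>> stats.pearsonr(x, y)[0]             # noticeably below 1, dragged by the outlier
>>> stats.spearmanr(x, y)[0]            # 1.0, since the ranks are still perfectly monotonic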


Python script to compute Spearman Rank correlation coefficient.

>>> import numpy as np
>>> from scipy import stats
>>> np.random.seed(12345678)
>>> x = np.random.random(10)
>>> y = np.random.random(10)
>>> rho, p_value = stats.spearmanr(x, y)

3. Views from Applied Perspective

Here are few suggestions from applied perspective:
1. Before deciding whether to apply Pearson or Spearman rank correlation, it is good practice to look at a scatter plot of the data.
Python script to plot a scatter plot:
>>> import numpy as np
>>> import matplotlib.pyplot as plt
# Fixing random state for reproducibility
>>> np.random.seed(19680801)
>>> N = 50
>>> x = np.random.rand(N)
>>> y = np.random.rand(N)
>>> colors = np.random.rand(N)
>>> area = (30 * np.random.rand(N))**2  # 0 to 15 point radii
>>> plt.scatter(x, y, s=area, c=colors, alpha=0.5)
>>> plt.show()
2. For a small sample, I would advise using the Spearman rank correlation.
3. For a large sample, use the Pearson correlation.

Last One
I prefer the Pearson correlation coefficient because:
1. Pearson correlation has more statistical power.
2. Pearson correlation enables more direct comparability of findings across studies, because most studies report the Pearson correlation.
3. In many cases there is minimal difference between the Pearson and Spearman correlation coefficients.
4. Obviously, it aligns with my theoretical interests.

Go Further!
I hope you enjoyed this post. The tutorial is a good starting point for statistical analysis using correlation. It covers not only the theory of Pearson and Spearman rank correlation but also the applied perspective. Good Luck!

Worth reading!
Are you interested in Deep Learning- Convolutional Neural Network!
1. Document Classification using Deep Learning- Click here
2. Improving Performance of Convolutional Neural Network!  Click here












Thursday, July 5, 2018

Improving Performance of Convolutional Neural Network!

The Convolutional Neural Network (CNN), a pillar algorithm of deep learning, has been one of the most influential innovations in the field of computer vision. CNNs have performed a lot better than traditional computer vision algorithms and have proven to be successful in many different real-life case studies and applications, like:
·     Image classification, object detection, segmentation, face recognition; 
·     Classification of crystal structure using a convolutional neural network;
·     Self-driving cars that leverage CNN-based vision systems;
·     And many more, of course!
Lots of articles are available on how to build a Convolutional Neural Network, so I am not going into detail regarding the implementation of a CNN. If you are interested in Document Classification using CNN, please click here.
The central theme of this tutorial is how to improve the performance of a CNN.
Let’s start ...
The common question is:
How can I get better performance from deep learning model?
It might be asked as:
How can I improve accuracy?
Oh God! My CNN is performing poorly...
Don't be stressed.
Here is the tutorial. It will give you some ideas to lift the performance of your CNN.
The list is divided into 4 topics
1. Tune Parameters
2. Image Data Augmentation
3. Deeper Network Topology
4. Handle the Overfitting and Underfitting problem

Oh! Cool... Let's start with the explanation.
1. Tune Parameters
To improve CNN model performance, we can tune parameters like the number of epochs, the learning rate, and so on. The number of epochs definitely affects performance: up to a point, more epochs improve results, but you need to run some experiments to decide on the number of epochs and the learning rate. When, after a certain number of epochs, there is no further reduction in training loss and no improvement in training accuracy, we can stop there. We can also use a dropout layer in the CNN model. Depending on the application, we need to choose a suitable optimizer when compiling the model; we can try various optimizers, e.g. SGD, RMSprop, etc., and tune the model with each. All these things affect the performance of a CNN.
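Here is a minimal sketch of these knobs, assuming the Keras API is available; the architecture, the hyper-parameters, and the x_train/y_train names are illustrative placeholders, not a recommendation:

# A minimal Keras-style sketch; layer sizes and hyper-parameters are placeholders.
>>> from tensorflow.keras import models, layers
>>> model = models.Sequential([
...     layers.Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3)),
...     layers.MaxPooling2D((2, 2)),
...     layers.Flatten(),
...     layers.Dense(64, activation='relu'),
...     layers.Dropout(0.5),                      # dropout layer to reduce overfitting
...     layers.Dense(10, activation='softmax'),
... ])
# Try different optimizers (e.g. 'sgd' vs. 'rmsprop') and learning rates
>>> model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
# The number of epochs is another parameter to tune by watching the training curves:
# history = model.fit(x_train, y_train, epochs=20, validation_split=0.2)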

2. Image Data Augmentation
"Deep learning is only relevant when you have a huge amount of data". It’s not wrong. CNN requires the ability to learn features automatically from the data, which is generally only possible when lots of training data is available.
If we have less training data available.. what to do?
Solution is here.. use Image Augmentation
Image augmentation parameters that are generally used to increase the data sample count are zoom, shear, rotation, preprocessing function and so on. Usage of these parameters results in generation of images having these attributes during training of Deep Learning model. Image samples generated using image augmentation, in general existing data samples increased by the rate of nearly 3x to 4x times.


Fig.1 Data Augmentation (source: wikipedia)

One more advantage of data augmentation: since a CNN is not rotation invariant, augmentation lets us add rotated versions of the images to the dataset. This can definitely increase the accuracy of the system.
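A minimal sketch of image augmentation, assuming the Keras ImageDataGenerator API; x_train and y_train are placeholders for your own training arrays:

>>> from tensorflow.keras.preprocessing.image import ImageDataGenerator
>>> datagen = ImageDataGenerator(
...     rotation_range=20,        # random rotations of up to 20 degrees
...     zoom_range=0.15,          # random zoom
...     shear_range=0.15,         # random shear
...     horizontal_flip=True)     # random horizontal flips
# flow() yields batches of randomly transformed images while the model trains:
# model.fit(datagen.flow(x_train, y_train, batch_size=32), epochs=20)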

3. Deeper Network Topology
Now let's talk about wide networks vs. deep networks!

A wide neural network can, in principle, be trained on every possible input value. Such networks are very good at memorization, but not so good at generalization. There are, however, a few difficulties with using an extremely wide, shallow network: although it could in theory accept every possible input value, in practical applications we will not have every possible value available for training.
Deeper networks capture the natural "hierarchy" that is present everywhere in nature. Take a convnet, for example: it captures low-level features in the first layer, slightly more complex but still low-level features in the next layer, and object parts and simple structures in the higher layers. The advantage of multiple layers is that they can learn features at various levels of abstraction.
So that explains why you might use a deep network rather than a very wide but shallow network.
But why not a very deep, very wide network?
The answer is that we want our network to be as small as possible while still producing good results. A wider network takes longer to train, and deep networks are computationally expensive to train. Hence, make the network wide and deep enough to work well, but no wider or deeper than that.

4. Handle the Overfitting and Underfitting problem
In order to talk about overfitting and underfitting, let's start with a simple concept: the model. What is a model? It is a system which maps input to output; e.g. we can build an image classification model which takes a test input image and predicts a class label for it. Interesting!
To build a model we divide the dataset into a training set and a testing set. We train our model, e.g. a CNN, on the training set and then use the trained model to predict the output for the test data.
Now what is Overfitting and Underfitting?
Overfitting refers to a model that models the training data too well. What does that mean? Let's simplify: an overfitted model gives very good accuracy on the training data but much lower accuracy on the test data. In other words, an overfitted model has good memorization ability but poor generalization ability; it does not generalize well from the training data to unseen data.

Underfitting refers to a model that is too simple to capture the structure of the training data: it gives poor accuracy on the training data itself (in some cases even lower than on the test data). That is dangerous too, isn't it?
In technical terms, a model that overfits has low bias and high variance, while a model that underfits has high bias and low variance. In any modeling there is always a tradeoff between bias and variance, and when we build models we try to achieve the best balance.
Now what is bias and variance?
Bias is the error on the training set. Variance is how much the model changes in response to changes in the training data; a high-variance model does not give good accuracy on the test data.


Fig.2 Underfitting Vs. Overfitting  (Source: Wikipedia)
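As a small, runnable illustration (using scikit-learn decision trees instead of a CNN, purely to keep the example short), comparing training and test accuracy is a quick way to spot both situations:

>>> from sklearn.datasets import make_classification
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.tree import DecisionTreeClassifier
>>> X, y = make_classification(n_samples=500, n_features=20, random_state=0)
>>> X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
>>> shallow = DecisionTreeClassifier(max_depth=1).fit(X_tr, y_tr)    # too simple
>>> deep = DecisionTreeClassifier(max_depth=None).fit(X_tr, y_tr)    # very flexible
>>> shallow.score(X_tr, y_tr), shallow.score(X_te, y_te)   # both limited: high bias (underfitting)
>>> deep.score(X_tr, y_tr), deep.score(X_te, y_te)         # near-perfect on train, lower on test: high variance (overfitting)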

How to Prevent Underfitting and Overfitting?
Let’s start with Underfitting:
An example of underfitting is a model that gives only 50% accuracy on the training data (and, in this case, 80% on the test data).
It's the worst kind of problem...
Why does it occur?
The answer is that underfitting occurs when a model is too simple (informed by too few features or regularized too much), which makes it inflexible in learning from the dataset.
Solution...
If there is underfitting, I would suggest focusing on the depth of the model. You may need to add layers, as they will give you more detailed features. As discussed above, you also need to tune the parameters to avoid underfitting.

Overfitting:
An example of overfitting is a model that gives 99% accuracy on the training data but only 60% accuracy on the test data.
Overfitting is a common problem in machine learning.
There are certain solutions to avoid overfitting:
1. Train with more data
2. Early stopping
3. Cross validation
Let's discuss each of them.
1. Train with more data:
Training with more data helps to increase the accuracy of the model, and a large training set may avoid the overfitting problem. For a CNN we can use data augmentation to increase the size of the training set.
2. Early stopping:
The model is trained over a number of iterations and improves with each new iteration. But wait: after a certain number of iterations the model starts to overfit the training data, and its generalization ability can weaken. So use early stopping. Early stopping refers to stopping the training process before the learner passes that point.


Fig.3 Early Stopping (Source: Wikipedia)
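A minimal sketch of early stopping, assuming the Keras callback API; the model and the x_train/y_train arrays are placeholders from your own training pipeline:

>>> from tensorflow.keras.callbacks import EarlyStopping
>>> early_stop = EarlyStopping(monitor='val_loss',   # watch the validation loss
...                            patience=3)           # stop after 3 epochs with no improvement
# history = model.fit(x_train, y_train, validation_split=0.2,
#                     epochs=100, callbacks=[early_stop])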

3. Cross validation:
Cross validation is a nice technique to avoid the overfitting problem.
What is cross validation?
Let's start with k-fold cross validation (where k is any integer).
Partition the original training data set into k equal subsets. Each subset is called a fold. Let the folds be named f1, f2, …, fk.
·     For i = 1 to k:
·     Keep fold fi as the validation set and keep all the remaining k-1 folds in the cross-validation training set.
·     Train your machine learning model on the cross-validation training set and calculate its accuracy by validating the predicted results against the validation set.
·     Estimate the accuracy of your machine learning model by averaging the accuracies obtained in all k cases of cross validation.


Fig.4 5-fold Cross Validation(Source: Wikipedia)

Fig. 4 depicts 5-fold cross validation, where the training dataset is divided into 5 equal sub-datasets. There are 5 iterations; in each iteration 4 sub-datasets are used for training whilst one sub-dataset is used for validation.
Cross-validation is definitely helpful to reduce overfitting problem.
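Here is a small, self-contained sketch of k-fold cross validation using scikit-learn; the classifier and the data are illustrative stand-ins (with a CNN you would train the network inside the same loop):

>>> import numpy as np
>>> from sklearn.datasets import make_classification
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.model_selection import KFold
>>> X, y = make_classification(n_samples=200, n_features=10, random_state=0)
>>> kf = KFold(n_splits=5, shuffle=True, random_state=0)
>>> scores = []
>>> for train_idx, val_idx in kf.split(X):
...     clf = LogisticRegression().fit(X[train_idx], y[train_idx])
...     scores.append(clf.score(X[val_idx], y[val_idx]))
>>> np.mean(scores)     # average accuracy over the 5 folds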


Go Further!

I hope you enjoyed this post. The tutorial should help you understand how to improve the performance of a CNN model. While these concepts may feel overwhelming at first, they will 'click into place' once you start seeing them in the context of real-world code and problems. If you were able to follow the post easily, or even with a little more effort, well done! Try doing some experiments... Good Luck!