Wednesday, July 18, 2018


Correlation - Statistical Analysis!

The most important step in computer vision or machine learning is to understand data well and use that knowledge to make the best design choice.

The open question is
How to understand data well?

The answer is by applying statistical techniques...

Hence the central theme of this tutorial is to understand one of the most important statistical techniques, i.e. correlation.

The word correlation is used in everyday life to denote some form of association. It is a statistical technique that can show whether and how strongly pairs of variables are related. We might say that we have noticed a correlation between student attendance and marks obtained. However, in statistical terms we use correlation to denote association between two quantitative variables. When one variable increases as the other increases, the correlation is positive; when one decreases as the other increases, it is negative. Fig. 1 depicts positive, negative and no correlation.

Fig. 1 Types of Correlation (Source: Wikipedia)
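
The three cases in Fig. 1 can be reproduced with synthetic data. The following is a minimal sketch, assuming NumPy is available; the slopes, noise level, seed and the helper name pearson_r are illustrative choices and not part of the original post.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
x = rng.normal(size=200)
noise = rng.normal(size=200)

y_pos = 2.0 * x + 0.5 * noise    # rises as x rises
y_neg = -2.0 * x + 0.5 * noise   # falls as x rises
y_none = rng.normal(size=200)    # generated independently of x


def pearson_r(a, b):
    """Return the Pearson correlation coefficient of two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


r_pos = pearson_r(x, y_pos)
r_neg = pearson_r(x, y_neg)
r_none = pearson_r(x, y_none)
print(f"positive: {r_pos:.2f}, negative: {r_neg:.2f}, none: {r_none:.2f}")
```

The first value should come out close to 1, the second close to -1, and the third close to 0.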


What is the significance of correlation?

In machine learning, before applying any classifier we should measure the correlation of intra-intent (within-class) and inter-intent (between-class) patterns. The correlation among intra-intent patterns is expected to be higher than among inter-intent patterns; that is, patterns of the same class are more strongly associated than patterns of different classes. It is good practice to state the hypothesis of your problem in terms of correlation and check whether the problem really is one of pattern classification.
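
One possible way to run this check is sketched below. It assumes each pattern is a numeric feature vector and uses made-up data (a random prototype per class plus noise); the variable names and the noise level are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_features, n_per_class = 50, 20

# Hypothetical data: each class is a random prototype plus small noise.
proto_a = rng.normal(size=n_features)
proto_b = rng.normal(size=n_features)
class_a = proto_a + 0.3 * rng.normal(size=(n_per_class, n_features))
class_b = proto_b + 0.3 * rng.normal(size=(n_per_class, n_features))

# np.corrcoef treats each row as one variable, so entry (i, j) is the
# Pearson correlation between pattern i and pattern j.
corr = np.corrcoef(np.vstack([class_a, class_b]))

same_a = corr[:n_per_class, :n_per_class]
same_b = corr[n_per_class:, n_per_class:]
cross = corr[:n_per_class, n_per_class:]

# Exclude the diagonal (each pattern with itself) from the within-class means.
off_diag = ~np.eye(n_per_class, dtype=bool)
intra = float(np.mean([same_a[off_diag].mean(), same_b[off_diag].mean()]))
inter = float(cross.mean())
print(f"mean intra-class r = {intra:.2f}, mean inter-class r = {inter:.2f}")
```

If the intra-class mean is not clearly higher than the inter-class mean on real data, the features may not separate the classes well.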


In this tutorial you will learn about
1. Correlation Coefficient
2. Pearson vs. Spearman correlation technique
3. Views from Applied Perspective

1. Correlation coefficient

The degree of association is measured by a correlation coefficient, which puts a value on the relationship. Correlation coefficients take values between -1 and 1. A value of 0 means there is no linear relationship between the variables, while -1 or 1 means there is a perfect negative or positive correlation. Table 1 describes the strength of the relationship.

Table 1. Strength of Relationship
 

How do we get this r value?

There are different correlation techniques for computing r.

Let’s start with interesting stuff...

2. Pearson vs. Spearman correlation technique

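Pearson correlation is a parametric test, whilst Spearman correlation is a nonparametric test.

First, understand the difference between parametric and nonparametric tests: a parametric test assumes the data follow a particular distribution (typically the normal distribution), while a nonparametric test makes no such assumption.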



Pearson r correlation: Pearson r correlation is the most widely used correlation statistic to measure the degree of the relationship between linearly related variables. For example, in the stock market, if we want to measure how two stocks are related to each other, Pearson r correlation is used to measure the degree of relationship between the two. The following formula is used to calculate the Pearson r correlation:


r = (N∑xy - ∑x∑y) / √([N∑x² - (∑x)²] [N∑y² - (∑y)²])

where
r = Pearson r correlation coefficient
N = number of observations
∑xy = sum of the products of paired scores
∑x = sum of x scores
∑y = sum of y scores
∑x² = sum of squared x scores
∑y² = sum of squared y scores
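
The sums defined above combine into the usual raw-score form of Pearson's r. The sketch below spells it out term by term and compares it with scipy.stats.pearsonr. The function name pearson_from_sums and the five data points are made up for illustration.

```python
import numpy as np
from scipy import stats


def pearson_from_sums(x, y):
    """Pearson r from N, sum(xy), sum(x), sum(y), sum(x^2) and sum(y^2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    sum_xy = np.sum(x * y)
    sum_x, sum_y = np.sum(x), np.sum(y)
    sum_x2, sum_y2 = np.sum(x ** 2), np.sum(y ** 2)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = np.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return float(numerator / denominator)


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
r_manual = pearson_from_sums(x, y)
r_scipy = stats.pearsonr(x, y)[0]
print(f"manual: {r_manual:.4f}, scipy: {r_scipy:.4f}")
```

For these points the numerator is 5*66 - 15*20 = 30 and the denominator is √(50 * 30) ≈ 38.73, so both values should be about 0.7746.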

Key Points :
1. In general, Pearson correlation is used when the data are normally distributed. The normal distribution is symmetric about its mean and looks like a bell curve.

2. Linearity is not an assumption of Pearson correlation; rather, the coefficient measures the degree to which a relationship is linear. A relationship is linear if one variable changes at a constant rate with respect to the other.
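
To illustrate the second point, here is a small sketch assuming SciPy is installed. It compares a straight-line relationship with an exponential one over the same x values; both functions are arbitrary examples.

```python
import numpy as np
from scipy import stats

x = np.linspace(0.0, 10.0, 50)
y_linear = 3.0 * x + 1.0   # changes at a constant rate
y_curved = np.exp(x)       # always increasing, but not at a constant rate

r_linear = stats.pearsonr(x, y_linear)[0]
r_curved = stats.pearsonr(x, y_curved)[0]
print(f"linear: r = {r_linear:.3f}, exponential: r = {r_curved:.3f}")
```

The linear case gives r = 1, while the exponential case gives a noticeably smaller r even though y never decreases as x grows.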

Python script to compute Pearson correlation coefficient

>>> import numpy as np
>>> from scipy import stats
>>> np.random.seed(12345678)
>>> x = np.random.random(10)
>>> y = np.random.random(10)
>>> slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
where,
slope - the slope of the regression line.
intercept - the intercept of the regression line.
r_value - the Pearson correlation coefficient. It lies between -1 and 1.
p_value - the probability of observing a slope at least as extreme as the fitted one if the null hypothesis were true. For linregress the null hypothesis is that the slope is zero, i.e. there is no linear relationship between the two variables; it is the hypothesis a researcher tries to disprove. Rejecting a null hypothesis that is actually true is a Type-I error, and the significance level (commonly 0.05) is the probability of that error you are willing to allow.
In general, if p < 0.05 (the significance level), reject the null hypothesis; otherwise you fail to reject it.
std_err - the standard error of the estimated slope.
More explanation about r-value and p-value
The r-value tells how much of the variation in the data the model explains: r² is the fraction of the variance in y accounted for by the linear fit.
The p-value tells about the significance of the model (i.e. whether the fitted relationship is unlikely to be due to chance alone).
Let's understand the four possibilities:
1. r-value (low) and p-value (low) - The model doesn't explain much of the variation, but is significant. (Better than nothing)
2. r-value (low) and p-value (high) - The model doesn't explain much of the variation and is not significant. (Worst model)
3. r-value (high) and p-value (low) - The model explains much of the variation and is significant. (Best model)
4. r-value (high) and p-value (high) - The model explains the variation well but is not significant. (Worthless)
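
Cases 1 and 4 may look surprising, so here is a sketch of how they can arise. It assumes SciPy's linregress result exposes rvalue and pvalue attributes (true for recent SciPy versions); the sample sizes and data are made up.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Case 1: large sample, weak relationship -> low r, low p.
x_big = rng.normal(size=5000)
y_big = 0.1 * x_big + rng.normal(size=5000)
big = stats.linregress(x_big, y_big)

# Case 4: only three points, nearly on a line -> high r, yet p stays above 0.05.
x_small = np.array([1.0, 2.0, 3.0])
y_small = np.array([1.0, 2.5, 2.8])
small = stats.linregress(x_small, y_small)

print(f"large sample: r = {big.rvalue:.3f}, p = {big.pvalue:.2g}")
print(f"small sample: r = {small.rvalue:.3f}, p = {small.pvalue:.2g}")
```

The sample size drives the p-value as much as the strength of the relationship does, which is why r and p should be read together.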

Spearman rank correlation: Spearman rank correlation is a non-parametric test that is used to measure the degree of association between two variables. The Spearman rank correlation test does not carry any assumptions about the distribution of the data and is the appropriate correlation analysis when the variables are measured on a scale that is at least ordinal.
The following formula is used to calculate the Spearman rank correlation:
ρ = 1 - (6∑di²) / (n(n² - 1))
where
ρ = Spearman rank correlation
di = the difference between the ranks of the i-th pair of observations
n = number of observations
This form is exact when there are no tied ranks.
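
The rank-difference form of Spearman's ρ, 1 - 6∑di² / (n(n² - 1)), can be checked by hand on a tiny data set. In this sketch the function name spearman_from_ranks and the data are hypothetical, and the shortcut formula is assumed to be applied only to data without ties.

```python
import numpy as np
from scipy import stats


def spearman_from_ranks(x, y):
    """Spearman rho via 1 - 6*sum(d_i^2) / (n*(n^2 - 1)); assumes no tied values."""
    rank_x = stats.rankdata(x)
    rank_y = stats.rankdata(y)
    d = rank_x - rank_y
    n = len(d)
    return float(1.0 - 6.0 * np.sum(d ** 2) / (n * (n ** 2 - 1)))


x = [10, 20, 30, 40, 50]
y = [1, 3, 2, 5, 4]
rho_manual = spearman_from_ranks(x, y)
rho_scipy = stats.spearmanr(x, y)[0]
print(f"manual: {rho_manual:.3f}, scipy: {rho_scipy:.3f}")
```

Here the rank differences are 0, -1, 1, -1, 1, so ∑di² = 4 and ρ = 1 - 24/120 = 0.8, which matches scipy.stats.spearmanr.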
Key Points :
1. Spearman rank correlation loses some information, since it works on ranks rather than on the raw values.

2. In general, Spearman rank correlation is used for monotonic relationships between variables. In a monotonic relationship the variables move in a consistent direction (both increasing, or one increasing while the other decreases), but not necessarily at a constant rate.

3. If the data has outliers, i.e. a few values that are far away from the others, use the Spearman rank correlation coefficient, since ranks are less sensitive to extreme values.
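
Points 2 and 3 can be seen in a small sketch. The data are invented: a cubic relationship for the monotonic case, and a near-linear trend with one hypothetical bad measurement for the outlier case.

```python
import numpy as np
from scipy import stats

# Monotonic but non-linear: y always rises with x, though not at a constant rate.
x = np.arange(1.0, 21.0)
y = x ** 3
r_mono = stats.pearsonr(x, y)[0]
rho_mono = stats.spearmanr(x, y)[0]

# A roughly linear trend (fixed, made-up noise) plus one extreme outlier.
x_out = np.arange(1.0, 11.0)
noise = np.array([0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.4, -0.3, 0.0, 0.2])
y_out = x_out + noise
y_out[-1] = -50.0  # hypothetical bad measurement
r_out = stats.pearsonr(x_out, y_out)[0]
rho_out = stats.spearmanr(x_out, y_out)[0]

print(f"monotonic: Pearson = {r_mono:.3f}, Spearman = {rho_mono:.3f}")
print(f"outlier:   Pearson = {r_out:.3f}, Spearman = {rho_out:.3f}")
```

In the monotonic case Spearman reaches 1 while Pearson stays below it. In the outlier case the single extreme value pulls Pearson negative, whereas Spearman remains positive because the outlier only changes one rank.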


Python script to compute Spearman Rank correlation coefficient.

>>> import numpy as np
>>> from scipy import stats
>>> np.random.seed(12345678)
>>> x = np.random.random(10)
>>> y = np.random.random(10)
>>> rho, p_value = stats.spearmanr(x, y)

3. Views from Applied Perspective

Here are a few suggestions from an applied perspective:
1. Before deciding whether to apply Pearson or Spearman rank correlation, it is good practice to look at the scatterplot.
Python script to draw a scatterplot:
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> # Fix the random state for reproducibility
>>> np.random.seed(19680801)
>>> N = 50
>>> x = np.random.rand(N)
>>> y = np.random.rand(N)
>>> colors = np.random.rand(N)
>>> area = (30 * np.random.rand(N))**2  # 0 to 15 point radii
>>> plt.scatter(x, y, s=area, c=colors, alpha=0.5)
>>> plt.show()
2. For a small sample, I advise using Spearman rank correlation.
3. For a large sample, use Pearson correlation.

Last One
I prefer the Pearson correlation coefficient because
1. Pearson correlation has more statistical power when the data are approximately normal.
2. Pearson correlation enables more direct comparability of findings across studies, because most studies report Pearson correlation.
3. In many cases there is minimal difference between the Pearson and Spearman correlation coefficients.
4. Obviously, it aligns with my theoretical interests.
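
As a rough check of point 3, the sketch below draws samples from a bivariate normal distribution and computes both coefficients. The true correlation of 0.6, the sample size and the seed are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
cov = [[1.0, 0.6], [0.6, 1.0]]  # assumed true correlation of 0.6
data = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=1000)
x, y = data[:, 0], data[:, 1]

r = stats.pearsonr(x, y)[0]
rho = stats.spearmanr(x, y)[0]
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

For well-behaved data like this the two values usually differ only in the second decimal place; the gap widens for skewed data or data with outliers.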

Go Further!
I hope you enjoyed this post. This tutorial is a good starting point for statistical analysis using correlation. It covers not only Pearson and Spearman rank correlation but also when to apply each in practice. Good luck!












