Correlation - Statistical Analysis!
The most important step in computer vision or machine
learning is to understand the data well and use that knowledge to make the best
design choices.
The open question is:
How do we understand data well?
The answer is by applying statistical techniques...
Hence the central theme of this tutorial is to understand one of the
most important statistical techniques, i.e., correlation.
The word correlation is used in everyday life to denote
some form of association. It is a statistical technique that can show whether
and how strongly pairs of variables are related. We might say that we have
noticed a correlation between student attendance and marks obtained. However, in
statistical terms we use correlation to denote association between two
quantitative variables. When one variable increases as the other increases, the
correlation is positive; when one decreases as the other increases, it is
negative. Fig. 1 depicts positive, negative, and no correlation.
Fig. 1 Types of Correlation (Source: Wikipedia)
What is the significance of correlation?
In machine learning, before applying any classifier we
must examine the correlation of intra-intent (within-class) and inter-intent
(between-class) patterns. The correlation of intra-intent patterns is obviously
higher than that of inter-intent patterns; for example, patterns of the same
class have a stronger association than patterns of different classes. It is good
practice to frame the hypothesis of your problem statement w.r.t. correlation
and to check whether the problem really is one of pattern classification.
In this tutorial you will learn about
1. Correlation Coefficient
2. Pearson vs. Spearman correlation technique
3. Views from Applied Perspective
1. Correlation coefficient
The degree of association is measured by a correlation
coefficient, which puts a value on the relationship.
Correlation coefficients range between -1 and 1. A value of 0 means there is
no relationship between the variables at all, while -1 or 1 means that
there is a perfect negative or positive correlation, respectively. Table 1
describes the strength of the relationship.
Table 1. Strength of Relationship
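To make this range concrete, here is a minimal sketch (my own illustration, not one of the tutorial's original scripts) of what r looks like for strongly positive, strongly negative, and unrelated data, using NumPy's np.corrcoef (which computes the Pearson coefficient discussed below):
>>> import numpy as np
>>> np.random.seed(0)
>>> x = np.random.random(100)
>>> noise = 0.1 * np.random.random(100)
>>> np.corrcoef(x, x + noise)[0, 1]               # close to +1: positive correlation
>>> np.corrcoef(x, -x + noise)[0, 1]              # close to -1: negative correlation
>>> np.corrcoef(x, np.random.random(100))[0, 1]   # close to 0: no correlation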
How do we get this r value?
There are different correlation techniques to compute it.
Let's start with the interesting stuff...
2. Pearson vs. Spearman correlation technique
Pearson correlation is a parametric test, whilst Spearman
correlation is a nonparametric test.
First, understand the difference between a parametric and a nonparametric test:
a parametric test makes assumptions about the distribution of the underlying
data (for example, that it is normally distributed), whereas a nonparametric
test makes no such assumptions.
Pearson r correlation: Pearson r correlation is the most widely used
correlation statistic to measure the degree of the relationship between
linearly related variables. For example, in the stock market, if we want to
measure how two stocks are related to each other, Pearson r correlation
is used to measure the degree of relationship between the two. The following
formula is used to calculate the Pearson r correlation:
r = [N∑xy − (∑x)(∑y)] / √([N∑x² − (∑x)²][N∑y² − (∑y)²])

where,
r = Pearson r correlation coefficient
N = number of observations
∑xy = sum of the products of paired scores
∑x = sum of x scores
∑y = sum of y scores
∑x² = sum of squared x scores
∑y² = sum of squared y scores
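As a sanity check on the formula, here is a minimal sketch (my own addition; the variable names are of my choosing) that computes r directly from the sums above and compares it with scipy.stats.pearsonr:
>>> import numpy as np
>>> from scipy import stats
>>> np.random.seed(0)
>>> x = np.random.random(10)
>>> y = np.random.random(10)
>>> N = len(x)
>>> num = N * np.sum(x * y) - np.sum(x) * np.sum(y)
>>> den = np.sqrt((N * np.sum(x**2) - np.sum(x)**2) * (N * np.sum(y**2) - np.sum(y)**2))
>>> num / den                  # r computed from the formula
>>> stats.pearsonr(x, y)[0]    # the same value from scipy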
Key Points:
1. In general, Pearson correlation is used when the data is normally
distributed. The normal distribution is symmetrical about the mean and looks
like a bell curve (a quick way to check this is sketched below).
2. Linearity is not an assumption of Pearson correlation; rather, Pearson
correlation measures the degree to which the relationship is linear. A
relationship is linear if the variables increase or decrease at a constant rate.
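For key point 1, one simple way to check normality (my own suggestion, not part of the original post) is the Shapiro-Wilk test available in scipy:
>>> import numpy as np
>>> from scipy import stats
>>> np.random.seed(0)
>>> data = np.random.normal(loc=0.0, scale=1.0, size=200)   # sample data, drawn from a normal distribution here
>>> stat, p = stats.shapiro(data)   # Shapiro-Wilk test of normality
>>> p   # a large p-value (> 0.05) gives no evidence against normality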
Python script to compute Pearson correlation coefficient
>>> import numpy as np
>>> from scipy import stats
>>> np.random.seed(12345678)
>>> x = np.random.random(10)
>>> y = np.random.random(10)
>>> slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
where,
slope - the slope of the fitted regression line.
intercept - the intercept of the fitted regression line.
r_value - the Pearson correlation coefficient; it lies between -1 and 1.
p_value - the p-value of the hypothesis test whose null hypothesis is that there
is no relationship between the two variables; how small it must be depends on
the probability you allow for a Type-I error. (The null hypothesis says there is
no statistically significant relationship between the two variables; it is the
hypothesis a researcher tries to disprove.) In general, if p < 0.05 (the
conventional critical value) reject the null hypothesis, otherwise fail to
reject it (see the sketch after these definitions).
std_err - the standard error of the estimated slope.
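Continuing the session above, here is a minimal sketch of how the p-value is typically used to decide about the null hypothesis (the 0.05 threshold is the conventional choice, not a hard rule):
>>> alpha = 0.05   # significance level allowed for a Type-I error
>>> print("r = %.3f, p = %.3f" % (r_value, p_value))
>>> if p_value < alpha:
...     print("Reject the null hypothesis: the correlation is statistically significant")
... else:
...     print("Fail to reject the null hypothesis: no significant correlation detected")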
More explanation about r-value and p-value
The r-value tells how much of the variation in the data the model explains
(strictly, it is r², the square of the correlation, that gives the fraction of
variance explained).
The p-value tells whether the model is statistically significant (i.e., whether
the apparent fit is unlikely to be due to chance).
Let's understand the four possibilities:
1. low r-value, low p-value – the model does not explain much of the variation,
but it is significant. (Better than nothing)
2. low r-value, high p-value – the model does not explain much of the variation
and is not significant. (Worst model)
3. high r-value, low p-value – the model explains much of the variation and is
significant. (Best model)
4. high r-value, high p-value – the model explains the variation well but is not
significant. (Worthless)
Spearman rank correlation: Spearman
rank correlation is a non-parametric test that is used to measure the degree of
association between two variables. The Spearman rank correlation test does not
carry any assumptions about the distribution of the data and is the appropriate
correlation analysis when the variables are measured on a scale that is at
least ordinal.
The following formula is used to calculate the Spearman
rank correlation:
ρ = 1 − (6∑di²) / (n(n² − 1))

where,
di = the difference between the ranks of corresponding values of the two variables
n = number of observations
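As with Pearson, here is a minimal sketch (my own addition) that computes the coefficient from the ranks using the formula above and compares it with scipy.stats.spearmanr; it assumes there are no tied values, since the simple formula only holds without ties:
>>> import numpy as np
>>> from scipy import stats
>>> np.random.seed(0)
>>> x = np.random.random(10)
>>> y = np.random.random(10)
>>> n = len(x)
>>> d = stats.rankdata(x) - stats.rankdata(y)    # rank differences di
>>> 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))      # rho from the formula
>>> stats.spearmanr(x, y)[0]                     # the same value from scipy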
Key Points:
1. Spearman rank correlation involves some loss of information, since it works
on ranks rather than the raw values.
2. In general, Spearman rank correlation is used when there is a monotonic
relationship between the variables. In a monotonic relationship the variables
tend to move in the same direction, but not necessarily at a constant rate.
3. If the data has outliers, i.e., a few values lie far away from the others,
use the Spearman rank correlation coefficient (see the sketch after these
points).
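To illustrate points 2 and 3, here is a minimal sketch (my own illustration, with made-up data) comparing the two coefficients on a monotonic but nonlinear relationship, and then on linear data contaminated by a single extreme outlier:
>>> import numpy as np
>>> from scipy import stats
>>> x = np.arange(1, 21, dtype=float)
>>> y = x**3                        # monotonic but not linear
>>> stats.pearsonr(x, y)[0]         # high, but clearly below 1
>>> stats.spearmanr(x, y)[0]        # exactly 1.0, since the ranks agree perfectly
>>> y2 = x.copy()
>>> y2[10] = 1000.0                 # linear data with one extreme outlier
>>> stats.pearsonr(x, y2)[0]        # collapses towards 0
>>> stats.spearmanr(x, y2)[0]       # stays high, about 0.93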
Python script to compute Spearman Rank correlation coefficient.
>>> import numpy as np
>>> from scipy import stats
>>> np.random.seed(12345678)
>>> x = np.random.random(10)
>>> y = np.random.random(10)
>>> stats.spearmanr(x, y)
3. Views from Applied Perspective
Here are a few suggestions from an applied perspective:
1. Before deciding whether to apply Pearson or Spearman rank correlation, it is
good practice to look at a scatter plot of the data: a roughly linear cloud
favours Pearson, while a curved but monotonic trend or visible outliers favour
Spearman.
Python script to plot a scatter plot
>>> import numpy as np
>>> import matplotlib.pyplot as plt
# Fixing random state for reproducibility
>>> np.random.seed(19680801)
>>> N = 50
>>> x = np.random.rand(N)
>>> y = np.random.rand(N)
>>> colors = np.random.rand(N)
>>> area = (30 * np.random.rand(N))**2 # 0 to 15 point radii
>>> plt.scatter(x, y, s=area, c=colors, alpha=0.5)
>>> plt.show()
2. For a small sample, I would advise using Spearman rank correlation.
3. For a large sample, use Pearson correlation.
Last One
I prefer the Pearson correlation coefficient because:
1. Pearson correlation has more statistical power.
2. Pearson correlation enables more direct comparability of findings across
studies, because most studies report Pearson correlation.
3. In many cases there is minimal difference between the Pearson and Spearman
correlation coefficients.
4. Obviously, it aligns with my theoretical interests.
Go Further!
I hope you enjoyed this post. This tutorial is a good starting point for
statistical analysis using correlation: it covers not only Pearson and Spearman
rank correlation but also their applicability in practice. Good luck!
Worth reading!
Are you interested in Deep Learning - Convolutional Neural Networks?
1. Document Classification using Deep Learning- Click here
2. Improving Performance of Convolutional Neural Network! Click here