2.1 Covariance
Covariance and correlation are very helpful in understanding the relationship between two continuous variables.
Covariance tells whether both variables vary in the same direction (positive covariance) or in the opposite direction (negative covariance).
Covariance(x, y)
$$ Cov(x, y) = \sum_{i=1}^{n} \frac{(x_i - \bar{x})(y_i - \bar{y})}{n-1} $$
Where:
- x̄ and ȳ are the sample means of x and y
- x_i and y_i are the values of x and y for the i-th record in the sample
- n is the number of records in the sample
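The formula above can be checked directly in code. A minimal sketch with hypothetical sample values, comparing the hand-rolled sum of deviation products against NumPy's built-in covariance matrix:

```python
import numpy as np

# Hypothetical sample data, for illustration only
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

n = len(x)
# Sample covariance: sum of products of deviations, divided by n - 1
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# Cross-check against NumPy's covariance matrix (off-diagonal entry)
assert np.isclose(cov_xy, np.cov(x, y)[0, 1])
```

Both variables increase together here, so the covariance comes out positive.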
Covariance vs Variance
Variance(x) can be represented as Covariance(x, x)
Variance (s²)
$$ s^2 = \sum_{i=1}^{n} \frac{(x_i - \bar{x})^2}{n-1}$$
OR
$$ Var(x) = \sum_{i=1}^{n} \frac{(x_i - \bar{x})^2}{n-1}$$
OR
$$ Cov(x, x) = \sum_{i=1}^{n} \frac{(x_i - \bar{x})(x_i - \bar{x})}{n-1} $$
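The identity Var(x) = Cov(x, x) can be verified numerically. A small sketch with an assumed sample:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])  # hypothetical sample

n = len(x)
var_x = np.sum((x - x.mean()) ** 2) / (n - 1)               # Var(x)
cov_xx = np.sum((x - x.mean()) * (x - x.mean())) / (n - 1)  # Cov(x, x)

assert np.isclose(var_x, cov_xx)             # identical by definition
assert np.isclose(var_x, np.var(x, ddof=1))  # matches NumPy's sample variance
```

Note `ddof=1`: NumPy's default `np.var` divides by n, while the sample variance above divides by n − 1.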
Advantage vs Disadvantage
- Advantage
  - Shows the direction of the relation between x & y (+ve or -ve value)
- Disadvantage
  - Has no specific limit on its value, so the magnitude is hard to interpret
2.2 Correlation
2.2.1 Pearson Correlation Coefficient
The disadvantage of covariance is removed by bounding the value between -1 and +1.
- The more the value towards +1 the more +ve correlated it is.
- The more the value towards -1 the more -ve correlated it is.
If the correlation value is 0, it means there is no linear relationship between the variables; however, another functional relationship may still exist.
$$ Correlation(x, y) = \frac{Cov(x, y)}{std_x \cdot std_y} \in [-1, 1] $$
OR
$$ \rho_{(x,y)} = \frac{Cov(x, y)}{\sigma_x \, \sigma_y} \in [-1, 1] $$
2.2.2 Spearman’s Rank Correlation
Spearman’s rank correlation measures the strength and direction of association between two ranked variables.
$$ r_s = \frac{Cov(R(x), R(y))}{\sigma_{R(x)} \, \sigma_{R(y)}} $$
Where:
- R(x) - rank of x
- R(y) - rank of y
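In other words, Spearman's correlation is just the Pearson formula applied to the ranks R(x) and R(y). A sketch with an assumed monotonic but non-linear sample (the `ranks` helper below is for illustration and ignores ties):

```python
import numpy as np

def ranks(a):
    # Rank 1 = smallest value (assumes no ties, as in this toy sample)
    order = np.argsort(a)
    r = np.empty(len(a))
    r[order] = np.arange(1, len(a) + 1)
    return r

# Monotonic but non-linear data: ranks agree exactly
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3

rx, ry = ranks(x), ranks(y)
n = len(x)
cov_r = np.sum((rx - rx.mean()) * (ry - ry.mean())) / (n - 1)
r_s = cov_r / (rx.std(ddof=1) * ry.std(ddof=1))

assert np.isclose(r_s, 1.0)  # perfect monotonic association
```

Because y = x³ is monotonic, the ranks match exactly and r_s = 1, even though the relationship is not linear.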
2.3 Symmetric Distribution
In statistics, a symmetric distribution is a distribution in which the left and right sides mirror each other.
The most well-known symmetric distribution is the normal distribution. If you were to draw a line down the center of the distribution, the left and right sides would perfectly mirror each other.
For symmetric distributions, the skewness is zero.
- The mean, median and mode all are perfectly at the center
- mean = median = mode
Left-Skew/Negative-Skew
- mean < median < mode
Right-Skew/Positive-Skew
- mean > median > mode
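The orderings above are easy to see numerically. A sketch with hypothetical samples, checking how a few large values pull the mean to the right of the median:

```python
import numpy as np

# Right-skewed sample: one large value drags the mean above the median
right_skewed = np.array([1, 2, 2, 3, 3, 3, 4, 20], dtype=float)
assert right_skewed.mean() > np.median(right_skewed)

# Symmetric sample: mean and median coincide
symmetric = np.array([1, 2, 3, 3, 3, 4, 5], dtype=float)
assert np.isclose(symmetric.mean(), np.median(symmetric))
```

Here the right-skewed sample has mean 4.75 but median 3, while the symmetric sample has both equal to 3.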
2.4 Histogram
A histogram is a plot that lets you discover, and show, the underlying frequency distribution (shape) of a set of continuous data.
The histogram is represented by a set of rectangles, adjacent to each other, where each bar represents one interval of the data.
To construct a histogram from a continuous variable, you first need to split the data into intervals, called bins.
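The binning step can be sketched with `np.histogram`, which splits the range into equal-width intervals and counts the values falling into each (sample values here are hypothetical):

```python
import numpy as np

data = np.array([1.2, 1.9, 2.4, 2.8, 3.1, 3.3, 4.7, 5.5, 5.9, 6.0])

# Split the data range into 5 equal-width bins and count values per bin
counts, edges = np.histogram(data, bins=5)

print(counts)  # frequency of each bin
print(edges)   # the 6 bin boundaries
assert counts.sum() == len(data)  # every value lands in exactly one bin
```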
Smoothing a Histogram
- By smoothing the histogram we get a probability density function (pdf)
- This is done by a Kernel Density Estimator (KDE)
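A kernel density estimate places a small Gaussian bump at each observation and averages them. A minimal NumPy-only sketch (the data and the bandwidth `h` are assumptions chosen for illustration):

```python
import numpy as np

data = np.array([1.0, 1.5, 2.0, 2.2, 3.0, 3.1, 4.5])  # hypothetical sample
h = 0.5  # bandwidth: controls how smooth the estimate is (assumed here)

def kde(t, sample, bandwidth):
    # Average of Gaussian kernels centred on each observation
    u = (t - sample[:, None]) / bandwidth
    k = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return k.mean(axis=0) / bandwidth

grid = np.linspace(-1.0, 7.0, 400)
density = kde(grid, data, h)

# A pdf should integrate to (approximately) 1 over its support
area = density.sum() * (grid[1] - grid[0])
assert abs(area - 1.0) < 1e-2
```

A larger bandwidth gives a smoother but flatter curve; a smaller one follows the individual observations more closely.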
What is the difference between a bar chart and a histogram?
The major difference is that a histogram is only used to plot the frequency
of score occurrences in a continuous data set that has been divided into classes, called bins.
Bar charts, on the other hand, can be used for a great deal of other types of variables including ordinal and nominal data sets.
Histograms are based on area, not height of bars
In a histogram, it is the area of the bar that indicates the frequency
of occurrences for each bin.
This means that the height of the bar does not necessarily indicate
how many occurrences of scores there were within each individual bin.
It is the product of height multiplied by the width of the bin that indicates the frequency of occurrences within that bin.
One reason the height of the bars is often incorrectly read as the frequency, rather than the area, is that many histograms use equally spaced bins, and under those circumstances the height of each bar is proportional to its frequency.
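The area-versus-height distinction shows up directly with unequal-width bins and `density=True` in NumPy (sample data and bin edges below are assumptions for illustration):

```python
import numpy as np

data = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 5.0, 9.0])

# Unequal-width bins: [0, 4) is 4 wide, [4, 10] is 6 wide
edges = np.array([0.0, 4.0, 10.0])
density, _ = np.histogram(data, bins=edges, density=True)

# With density=True, bar AREA (height x width) gives the fraction of data,
# not the bar height on its own
widths = np.diff(edges)
areas = density * widths

assert np.isclose(areas.sum(), 1.0)     # areas sum to 1, as for any pdf
assert np.isclose(areas[0], 5.0 / 7.0)  # 5 of 7 observations are in bin 1
```

The first bar is taller per unit width despite being narrower; only multiplying height by width recovers the true proportion of observations.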