02-Statistics

#ML-Stats #Stats #Maths

2.1 Covariance

Covariance and Correlation are very helpful in understanding the relationship between two continuous variables.

Covariance tells whether both variables vary in the same direction (positive covariance) or in the opposite direction (negative covariance).

$Cov(x, y) = \sum_{i=1}^{n} \frac{(x_i - \bar{x})(y_i - \bar{y})}{n-1}$

Where:

x̄ and ȳ are the sample means of x and y
x_i and y_i are the values of x and y for the ith record in the sample
n is the number of records in the sample
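
The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation; the function name `sample_covariance` and the `hours`/`score` data are made up for the example.

```python
from statistics import mean

def sample_covariance(x, y):
    """Sample covariance of x and y using the n - 1 denominator."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar, y_bar = mean(x), mean(y)
    products = ((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sum(products) / (len(x) - 1)

# Illustrative data (assumed, not from the notes)
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]

cov_same = sample_covariance(hours, score)                    # variables move together
cov_opposite = sample_covariance(hours, [-s for s in score])  # variables move apart
print(cov_same, cov_opposite)
```

The first value is positive because both variables tend to rise together; negating one variable flips the sign.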

Covariance vs Variance

Variance is a special case of covariance: Variance(x) is the covariance of x with itself, Covariance(x, x).

Variance ($s^2$)

$s^2 = \sum_{i=1}^{n} \frac{(x_i - \bar{x})^2}{n-1}$

$Var(x) = \sum_{i=1}^{n} \frac{(x_i - \bar{x})^2}{n-1}$

$Cov(x, x) = \sum_{i=1}^{n} \frac{(x_i - \bar{x})(x_i - \bar{x})}{n-1}$
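
As a quick check of this identity, the sketch below compares a hand-written covariance of x with itself against `statistics.variance` from the Python standard library. The helper `sample_covariance` and the data are assumptions made for illustration.

```python
from statistics import mean, variance

def sample_covariance(x, y):
    """Sample covariance of x and y using the n - 1 denominator."""
    x_bar, y_bar = mean(x), mean(y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)

x = [2, 4, 4, 5, 10]            # illustrative data
cov_xx = sample_covariance(x, x)
var_x = variance(x)             # sample variance (n - 1 denominator)
print(cov_xx, var_x)
```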

Advantage vs Disadvantage

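Advantage
- Shows the direction of the relationship between x and y
  - positive or negative value
Disadvantage
- Has no fixed bounds: the magnitude depends on the units of x and y, so it does not show how strong the relationship is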
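
To see the disadvantage concretely, the sketch below computes the covariance of the same measurements twice, once with heights in metres and once in centimetres. The data and the helper name `sample_covariance` are assumptions for illustration only.

```python
from statistics import mean

def sample_covariance(x, y):
    """Sample covariance of x and y using the n - 1 denominator."""
    x_bar, y_bar = mean(x), mean(y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)

heights_m = [1.5, 1.6, 1.7, 1.8]          # illustrative data
weights_kg = [50, 60, 65, 80]
heights_cm = [h * 100 for h in heights_m]  # same heights, different unit

cov_m = sample_covariance(heights_m, weights_kg)
cov_cm = sample_covariance(heights_cm, weights_kg)
print(cov_m, cov_cm)   # the second value is 100 times the first
```

The relationship between the variables is unchanged, yet the covariance grows by a factor of 100, which is why its magnitude alone cannot be read as strength.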

2.2 Correlation

2.2.1 Pearson Correlation Coefficient

Correlation removes the disadvantage of covariance by dividing it by the standard deviations of both variables, which limits the value to the range -1 to +1.

The closer the value is to +1, the more positively correlated the variables are.
The closer the value is to -1, the more negatively correlated the variables are.

If the correlation value is 0, there is no linear relationship between the variables; however, another functional relationship may still exist.

$Correlation(x, y) = \frac{Cov(x, y)}{std_x \cdot std_y} \in [-1, 1]$

$\rho_{(x,y)} = \frac{Cov(x,y)}{\sigma_x \sigma_y} \in [-1, 1]$
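
A minimal sketch of the Pearson coefficient, assuming sample statistics (n - 1 denominator) throughout. The function name `pearson` and the data are illustrative. The second example shows a perfect quadratic relationship whose correlation is still 0, because the relationship is not linear.

```python
from statistics import mean, stdev

def pearson(x, y):
    """Pearson correlation: sample covariance divided by both sample standard deviations."""
    x_bar, y_bar = mean(x), mean(y)
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

x = [-2, -1, 0, 1, 2]
r_linear = pearson(x, [3 * xi + 1 for xi in x])   # exact linear relationship
r_quadratic = pearson(x, [xi ** 2 for xi in x])   # y depends on x, but not linearly
print(r_linear, r_quadratic)
```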


2.2.2 Spearman’s Rank Correlation

Spearman’s rank correlation measures the strength and direction of association between two ranked variables.

$r_s = \frac{Cov(R(x), R(y))}{\sigma_{R(x)} \sigma_{R(y)}}$

Where:

R(x) - rank of x
R(y) - rank of y
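
The sketch below follows the formula directly: rank both variables, then compute the Pearson correlation of the ranks. Giving tied values the average of their positions is an assumption (a common convention that the notes do not specify), and the helper names are hypothetical.

```python
from statistics import mean, stdev

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    x_bar, y_bar = mean(x), mean(y)
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

x = [1, 2, 3, 4, 5]
y = [xi ** 3 for xi in x]             # monotonic but not linear
print(spearman(x, y), pearson(x, y))
```

For a relationship that is monotonic but not linear, the rank correlation is 1 while the Pearson coefficient stays below 1.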

2.3 Symmetric Distribution

In statistics, a symmetric distribution is a distribution in which the left and right sides mirror each other.

The most well-known symmetric distribution is the normal distribution.

If you were to draw a line down the center of the distribution, the left and right sides of the distribution would perfectly mirror each other.

For symmetric distributions, the skewness is zero.

For a symmetric distribution with a single peak, the mean, median and mode all sit exactly at the center:
mean = median = mode

Left-Skew/Negative-Skew

The longer tail is on the left, and typically:
mean < median < mode

Right-Skew/Positive-Skew

The longer tail is on the right, and typically:
mean > median > mode
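
These orderings can be checked on small samples. The sketch below uses the moment-based skewness coefficient (third central moment divided by the second central moment to the power 1.5), which is one common definition and an assumption here, since the notes do not name a formula. All data sets are illustrative.

```python
from statistics import mean, median, mode

def skewness(data):
    """Moment-based skewness: m3 / m2**1.5 using population central moments."""
    n = len(data)
    m = mean(data)
    m2 = sum((v - m) ** 2 for v in data) / n
    m3 = sum((v - m) ** 3 for v in data) / n
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 3, 3, 4, 5]
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 14]   # long tail on the right
left_skewed = [-v for v in right_skewed]       # mirror image, long tail on the left

for name, data in [("symmetric", symmetric),
                   ("right", right_skewed),
                   ("left", left_skewed)]:
    print(name, mean(data), median(data), mode(data), skewness(data))
```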

2.4 Histogram

A histogram is a plot that lets you discover, and show, the underlying frequency distribution (shape) of a set of continuous data. It is drawn as a set of adjacent rectangles, where each bar represents one interval of the data.

To construct a histogram from a continuous variable you first need to split the data into intervals, called bins.
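
A minimal sketch of the binning step, assuming equal-width bins that span the range of the data and a last bin that includes the maximum value. The function name `histogram_counts` and the data are illustrative.

```python
def histogram_counts(data, n_bins):
    """Split the range of data into n_bins equal-width bins and count values per bin."""
    lo, hi = min(data), max(data)
    if hi == lo:
        raise ValueError("data must contain at least two distinct values")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in data:
        # The maximum value would fall just past the last bin, so clamp it.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + k * width for k in range(n_bins + 1)]
    return edges, counts

data = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 9.0]   # illustrative data
edges, counts = histogram_counts(data, 4)
print(edges)    # bin boundaries
print(counts)   # number of values in each bin
```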

Smoothing Histogram

Smoothing the histogram gives an estimate of the probability density function (PDF).
This is done with a Kernel Density Estimator (KDE).
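
As a rough sketch of the idea, a kernel density estimate places a small bump (kernel) on every data point and averages them. The Gaussian kernel and the fixed bandwidth of 0.8 below are assumptions chosen for illustration; the notes do not specify either, and `kde_estimate` is a hypothetical helper name.

```python
import math

def kde_estimate(data, bandwidth):
    """Return a function giving the Gaussian kernel density estimate at a point x."""
    n = len(data)
    norm = n * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data) / norm

    return density

data = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 9.0]   # illustrative data
density = kde_estimate(data, bandwidth=0.8)         # bandwidth chosen by hand

# A density should integrate to about 1; check with a simple Riemann sum.
step = 0.01
area = sum(density(-5 + k * step) for k in range(2001)) * step
print(density(2.5), density(6.5), area)
```

The estimate is high where the data are dense and low in the gap between 4 and 9. A smaller bandwidth gives a bumpier curve, a larger one a smoother curve.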


What is the difference between a bar chart and a histogram?

The major difference is that a histogram is only used to plot the frequency of values in a continuous data set that has been divided into classes, called bins.

Bar charts, on the other hand, can be used for many other types of variables, including ordinal and nominal data.

Histograms are based on area, not height of bars

In a histogram, it is the area of the bar that indicates the frequency of occurrences for each bin. This means that the height of the bar does not necessarily indicate how many occurrences of scores there were within each individual bin.

It is the product of height multiplied by the width of the bin that indicates the frequency of occurrences within that bin.

The height of the bars is often mistaken for the frequency because many histograms use equally wide bins, and in that case the height of each bar does reflect the frequency.
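
The sketch below illustrates the point with unequal bins: the bar height is taken as count divided by bin width, so that height times width recovers the count. The bin edges, the data, and the helper name `frequency_density` are assumptions for illustration.

```python
def frequency_density(data, edges):
    """Count values per bin and return bar heights such that height * width == count."""
    n_bins = len(edges) - 1
    counts = [0] * n_bins
    for v in data:
        for k in range(n_bins):
            is_last = k == n_bins - 1
            # Bins are [left, right), except the last bin, which includes its right edge.
            if edges[k] <= v < edges[k + 1] or (is_last and v == edges[-1]):
                counts[k] += 1
                break
    heights = [c / (edges[k + 1] - edges[k]) for k, c in enumerate(counts)]
    return counts, heights

data = [2, 5, 8, 12, 15, 18, 25, 30, 35, 40, 45, 48]   # illustrative data
edges = [0, 10, 20, 50]                                 # last bin is three times wider
counts, heights = frequency_density(data, edges)
print(counts)    # occurrences per bin
print(heights)   # bar heights (count per unit of width)
```

The widest bin holds the most values but has the lowest bar, so reading frequency from height alone would be misleading here.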

Reference

https://towardsdatascience.com/covariance-and-correlation-321fdacab168
