Correlation and Regression
Lecture 10

Correlation

Correlation – measure of linear association between two variables

Pearson’s correlation
1. Key word is “linear”

Equation 2

cov(X, Y) shows how two variables vary together
If X = Y, then

Equation 3

var(X) is population variance
If we estimate cov(X, Y), var(X), and var(Y) by population formulas, then

Equation 4
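As a minimal sketch of the calculation, assuming the usual definition r = cov(X, Y) / sqrt(var(X) var(Y)) with divide-by-n estimates. The function name `pearson_r` and the data are made up for illustration and are not from the lecture:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation using divide-by-n covariance and variances."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    return cov_xy / math.sqrt(var_x * var_y)

# Made-up data: y is an exact linear function of x in both cases.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_up = [2.0 * v + 1.0 for v in x]       # upward-sloping line
y_down = [-3.0 * v + 10.0 for v in x]   # downward-sloping line

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
print(r_up, r_down)
```

The 1/n factors cancel in the ratio, so dividing by n - 1 instead would give the same r.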

Correlation ranges over [-1, +1]
If r = -1, then

Perfect Linear Negative Correlation

If r = +1, then

Perfect Positive Correlation

If r = 0, then

Zero Correlation

Note – an exponential function can produce a high correlation, even though the relationship is not linear

Correlation of an Exponential Function

First step
1. Construct a scatter diagram of the data
2. What does the data look like?
Define
1. X is the independent variable
2. Y is the dependent variable
3. Define the function, Y = f(X)
  1. Y is a function of X
  2. Writing Y = f(X) assumes that causality runs from X to Y
  3. Correlation does not establish causality

Example – Chest pains and shortness of breath may have a high positive correlation
1. One does not cause the other
2. Clogged arteries can cause these two symptoms
3. Be very careful, correlation does not establish causality

An Example showing Causality

Correlation t-statistic
1. Assumptions
  1. Both variables are normally distributed
  2. The two variables have a linear relationship
2. Hypothesis test

Equation 5

The t-test is:

Equation 6

Notation
1. r is the correlation coefficient
2. n is the number of observations
3. df = n – 2
4. The 2 is because correlation involves two variables
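A small sketch of the arithmetic, assuming the test statistic has the standard form t = r * sqrt(n - 2) / sqrt(1 - r²) for the null hypothesis of zero correlation. The function name and the input values are hypothetical:

```python
import math

def correlation_t_stat(r, n):
    """t-statistic for H0: correlation is zero, with df = n - 2."""
    if n <= 2:
        raise ValueError("need more than two observations")
    if abs(r) >= 1.0:
        raise ValueError("t is undefined when |r| = 1")
    df = n - 2
    t = r * math.sqrt(df) / math.sqrt(1.0 - r * r)
    return t, df

# Hypothetical values chosen only to illustrate the calculation.
t_value, df = correlation_t_stat(r=0.6, n=18)
print(t_value, df)
```

The resulting t would then be compared with a critical value from the t-distribution with df degrees of freedom.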

Spearman Rank Correlation
1. Reduces problems with outliers
2. Can handle monotonic non-linear relationships, like exponential functions
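To illustrate the difference, the sketch below computes both measures for points that lie exactly on an exponential curve. It assumes Spearman's measure is Pearson's correlation applied to the ranks, and the simple `ranks` helper assumes there are no tied values. All names and data are illustrative:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Rank from 1 (smallest) to n; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = float(rank)
    return result

def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson_r(ranks(xs), ranks(ys))

# Synthetic data lying exactly on y = exp(x).
x = [float(i) for i in range(1, 11)]
y = [math.exp(v) for v in x]

r_pearson = pearson_r(x, y)
rho_spearman = spearman_rho(x, y)
print(r_pearson, rho_spearman)
```

Because the ranks of x and y agree exactly, the rank correlation is 1, while Pearson's r is positive but below 1.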

Regression Equation

You are imposing a relationship onto the variables

Y_i is the dependent variable
1. Value is obtained from X_i and u_i
X_i is the independent variable
β₁ and β₂ are parameters
1. These are estimated from the data
u_i is the random noise term, u_i ~ N(0, σ²)
1. Notation
  1. N denotes the normal distribution
  2. Mean is 0
  3. Variance is σ²
2. Having normally distributed noise allows us to calculate confidence intervals and perform hypothesis testing
3. Estimates of β₁ and β₂ and predictions of Y_i are influenced by the variance of the noise

Linear Regression

We find the estimates of β₁ and β₂ that minimize the errors for the data points
1. Some errors are positive
2. Other errors are negative
3. We cannot add the errors because they may cancel
4. We square the error terms to make them all positive
Derivation

Starting with the equation

Solve for u_i, which yields

Square the errors to make them positive

This is only for one data point. We want to minimize the total errors of all the data points.
Sum over all the data points
Define Sum of Squared Errors (SSE)

Equation 12
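Written out, the steps above would read as follows, assuming the regression model has the usual form with intercept β₁ (b₁), slope β₂ (b₂), and noise term u_i:

```latex
\begin{align}
Y_i   &= \beta_1 + \beta_2 X_i + u_i \\
u_i   &= Y_i - \beta_1 - \beta_2 X_i \\
u_i^2 &= (Y_i - \beta_1 - \beta_2 X_i)^2 \\
SSE   &= \sum_{i=1}^{n} u_i^2
       = \sum_{i=1}^{n} (Y_i - \beta_1 - \beta_2 X_i)^2
\end{align}
```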

We want to find the minimum, thus we take the first partial derivatives with respect to the betas

Equation 13

The second step is the Chain Rule from Calculus. I can put the 2 in front of the summation because each term in the summation has a 2. Set the partial derivative to zero in order to find the minimum value.

Equation 14

Now solve the equation for β₁. It is debatable when you should add hats to the estimators. I added them at this step, when the partial was set to zero.

Equation 15

Summation is a linear operator. We can apply the summation to every term in the parentheses
     β₁ is a constant that is summed n times
     β₂ is a constant that multiplies every X in the summation
     It can therefore be brought to the front

Equation 16
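For reference, the chain of steps for the intercept can be sketched as below. This assumes SSE is the sum over i of (Y_i - β₁ - β₂X_i)², with β₁ and β₂ standing for b₁ and b₂, and uses hats from the point where the partial is set to zero:

```latex
\begin{align}
\frac{\partial SSE}{\partial \beta_1}
  &= \sum_{i=1}^{n} 2\,(Y_i - \beta_1 - \beta_2 X_i)(-1)
   = -2 \sum_{i=1}^{n} (Y_i - \beta_1 - \beta_2 X_i) \\
0 &= -2 \sum_{i=1}^{n} (Y_i - \hat\beta_1 - \hat\beta_2 X_i) \\
0 &= \sum_{i=1}^{n} Y_i - n\,\hat\beta_1 - \hat\beta_2 \sum_{i=1}^{n} X_i \\
\hat\beta_1
  &= \frac{1}{n}\sum_{i=1}^{n} Y_i - \hat\beta_2\,\frac{1}{n}\sum_{i=1}^{n} X_i
   = \bar{Y} - \hat\beta_2 \bar{X}
\end{align}
```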

The last step works because we substitute the average for y and the average for x into the equation. We repeat these steps to get the estimator for β₂.

Equation 17

Similarly, set the partial to zero and solve for β₂:

Equation 18

We substitute the estimator for β₁ into the equation, which yields

Equation 20

I did not break the last summation apart, which makes it easier to solve for the estimator for β₂

Equation 21
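A sketch of the slope derivation under the same assumptions: SSE is the sum of (Y_i - β₁ - β₂X_i)², β₁ and β₂ stand for b₁ and b₂, and the intercept estimator is the mean of Y minus the slope estimator times the mean of X:

```latex
\begin{align}
\frac{\partial SSE}{\partial \beta_2}
  &= -2 \sum_{i=1}^{n} X_i\,(Y_i - \beta_1 - \beta_2 X_i) \\
0 &= \sum_{i=1}^{n} X_i\,(Y_i - \hat\beta_1 - \hat\beta_2 X_i) \\
0 &= \sum_{i=1}^{n} X_i\,(Y_i - \bar{Y})
     - \hat\beta_2 \sum_{i=1}^{n} X_i\,(X_i - \bar{X})
     \qquad \text{using } \hat\beta_1 = \bar{Y} - \hat\beta_2 \bar{X} \\
\hat\beta_2
  &= \frac{\sum_{i=1}^{n} X_i\,(Y_i - \bar{Y})}{\sum_{i=1}^{n} X_i\,(X_i - \bar{X})}
   = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}
\end{align}
```

The last equality holds because the deviations from a mean sum to zero, so subtracting the mean of X inside each sum changes nothing.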

This result holds only for one X variable. We can generalize least squares to Multiple Regression, where there are k parameters to estimate.

Example – Demand for Pepsi
1. Q is quantity and P is market price

Use least squares to find betas
I fitted a line through my data points that gives me the best fit
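As a rough sketch of such a fit, the code below applies the one-variable least squares estimators to a handful of price and quantity pairs. The numbers are made up for illustration and are not the lecture's Pepsi data; the names `least_squares`, `b1_hat`, and `b2_hat` are hypothetical:

```python
def least_squares(xs, ys):
    """Return (b1_hat, b2_hat) for the fitted line y = b1 + b2 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b2_hat = sxy / sxx                  # slope
    b1_hat = mean_y - b2_hat * mean_x   # intercept
    return b1_hat, b2_hat

# Made-up prices (P) and quantities (Q), for illustration only.
price = [1.00, 1.25, 1.50, 1.75, 2.00]
quantity = [98.0, 91.0, 79.0, 72.0, 60.0]

b1_hat, b2_hat = least_squares(price, quantity)
residuals = [q - (b1_hat + b2_hat * p) for p, q in zip(price, quantity)]
print(b1_hat, b2_hat)
```

With an intercept in the model the residuals sum to zero, which is one quick way to check a fit.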

Goodness of Fit

The goodness-of-fit measure is R².

If R² = 0, then the regression equation has no fit
If R² = 1, then the regression equation has a perfect linear fit
Also, R² = 1 when n = k, because the regression becomes an exactly determined algebraic system
Problem – As the number of x variables increases, R² never decreases

Adjusted R² – Penalizes the goodness of fit when more variables are added

Equation 32

As the number of independent variables increases, the penalty increases, but the error could decrease if the new variables explain ‘y’ better.
Adjusted R² can sometimes be negative, indicating a very poor fit
Note – Very important: models being compared must have the same y variable
It cannot be y in one model and ln(y) in another
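A minimal sketch of the penalty, assuming the common form adjusted R² = 1 - (1 - R²)(n - 1)/(n - k), where k counts all estimated parameters including the intercept. The function name and the R² values are hypothetical:

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R^2 for n observations and k estimated parameters.

    k includes the intercept, so the error degrees of freedom are n - k.
    """
    if n <= k:
        raise ValueError("need more observations than parameters")
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - k)

# Hypothetical R^2 values, for illustration only.
weak_fit = adjusted_r_squared(r_squared=0.02, n=20, k=4)
strong_fit = adjusted_r_squared(r_squared=0.90, n=20, k=4)
print(weak_fit, strong_fit)
```

The weak fit comes out negative, and the strong fit comes out slightly below its unadjusted R².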

Analysis of Variance (ANOVA)

ANOVA is used to test hypotheses in many types of statistical analysis, including regression
Sum of Squared Total (SST) is defined as:

Equation 25

Y_i is the dependent variable in the regression. The deviation of Y_i from its mean is the total variation for observation i.
Sum of Squared Regression (SSR) is defined as:

Equation 27

This is the variation explained by the regression
Sum of Squared Errors (SSE) was defined earlier as:

Equation 28

SSE is the amount of variation not explained by the regression equation. Thus, SST = SSR + SSE, which is proved in the lecture.
We can use this information to calculate the R² statistic, showing the relationship:

Equation 30
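The decomposition can be checked numerically. The sketch below fits a line to made-up data, computes the three sums of squares, and assumes R² is SSR / SST, which equals 1 - SSE / SST when the model includes an intercept:

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for one x variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope

# Made-up data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

intercept, slope = fit_line(x, y)
fitted = [intercept + slope * xi for xi in x]
mean_y = sum(y) / len(y)

sst = sum((yi - mean_y) ** 2 for yi in y)                # total variation
ssr = sum((fi - mean_y) ** 2 for fi in fitted)           # explained
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # unexplained

r_squared = ssr / sst
print(sst, ssr + sse, r_squared)
```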

Problem – the more parameters added to the regression, the higher the R².

R² = 1 if n = k, that is, if the number of parameters equals the number of observations

Now we need the degrees of freedom for each measure:

Sum of Squared Regression (SSR) df = k – 1

Sum of Squared Errors (SSE) df = n – k

Sum of Squared Total (SST) df = n – 1

We calculate the Mean Square (MS)

Regression (MS) = SSR / (k – 1)

Residual (MS) = SSE / (n – k)

Total (MS) NA

Additional information
- When you have a variable with a normal distribution
  - If you add or subtract it from other variables with a normal distribution, then the result is still normally distributed
  - Calculating a mean is a first moment
- If you square a random variable with a normal distribution, then you get a chi-square distribution with degrees of freedom.
  - The squares are variances and called the second moment
  - All the Mean Squares are distributed as chi squares
F-distribution – can test a whole group of hypotheses or test a whole regression model
- The F-test can test many other things
- The F-test is a ratio of two chi-squares, each divided by its degrees of freedom
- The F-test is a one-tailed test associated with the right-hand tail.
- Squaring makes all terms positive
The F-distribution and the test are as follows:

The F Distribution

The hypothesis test

H₀: The regression model does not explain the data, i.e. all the parameter estimates are zero

H_a: The regression model does explain the data, i.e. at least one parameter estimate is not zero

First, we need the critical value: α = 0.05, df₁ = 1, and df₂ = 58

In Excel, =finv(0.05,1,58)

F_c = 4.00

Excel calculates the ANOVA

ANOVA
	df	SS	MS	F	Significance F
Regression	1	33.06087	33.06087	6.489695	0.013524
Residual	58	295.4732	5.094365
Total	59	328.534

Calculate the F-value =

Equation 31

The computed F exceeds F_c, so we reject H₀ and conclude that at least one parameter is not equal to zero.
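The arithmetic behind the F-value can be reproduced from the sums of squares and degrees of freedom in the Excel ANOVA table above. This is only a sketch of the calculation, with the critical value taken as quoted in the text rather than recomputed; the variable names are illustrative:

```python
# Values copied from the Excel ANOVA table above.
ssr = 33.06087           # regression sum of squares
sse = 295.4732           # residual sum of squares
df_regression = 1        # k - 1
df_residual = 58         # n - k

ms_regression = ssr / df_regression
ms_residual = sse / df_residual
f_value = ms_regression / ms_residual

f_critical = 4.00        # as quoted from =finv(0.05,1,58)
reject_h0 = f_value > f_critical
print(round(f_value, 4), reject_h0)
```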
How many observations? 10
How many parameters, k? 4

Degrees of freedom for error df = 10 – 4 = 6
Degrees of freedom for total df = 10 – 1 = 9

ANOVA
	Df	SS	MS	F	Significance F
Regression	3	5001.859635	1667.287	232.4265	1.35E-06
Residual	6	43.04036468	7.173394
Total	9	5044.9

Trend Regression

Time series – data collected over time
1. Could have patterns over time

Linear Time Series

Trend variables always start at 1
You never put in the year
You can add powers of the trend
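A small sketch of how the trend variables might be built: the trend runs 1, 2, ..., n in place of the calendar year, and higher powers are added as extra columns. The helper name `trend_columns` and the six-observation example are hypothetical:

```python
def trend_columns(n_obs, max_power=1):
    """Return the columns t, t^2, ..., t^max_power, where t = 1..n_obs."""
    t = list(range(1, n_obs + 1))
    return [[value ** p for value in t] for p in range(1, max_power + 1)]

# Hypothetical annual series with six observations: the trend is 1..6,
# not the calendar years themselves.
t, t_sq, t_cu = trend_columns(6, max_power=3)
print(t)
print(t_sq)
print(t_cu)
```

These columns would then serve as the X variables in a linear, quadratic, or cubic trend regression.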

An example

A Cubic Time Series Trend

Use adjusted R² to find the stopping point
Choose the regression with the largest adjusted R²
Note – R² will always favor the largest regression

Exponential Regression

An Exponential Growth Trend

Transform the equation until it is linear in the parameters by taking the natural logarithm of both sides

Equation 37
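As an illustration of the transformation, the sketch below assumes the growth model has the form Y = A * exp(g * t), so that ln(Y) = ln(A) + g * t is linear in the parameters. The symbols A and g, the function name, and the synthetic noise-free data are assumptions for this example, not the lecture's notation:

```python
import math

def fit_line(xs, ys):
    """Least squares intercept and slope for one x variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope

# Synthetic, noise-free series generated from Y = 2 * exp(0.3 * t).
t = [float(i) for i in range(1, 9)]
y = [2.0 * math.exp(0.3 * ti) for ti in t]

# Regress ln(Y) on the trend, then undo the log on the intercept.
log_y = [math.log(v) for v in y]
intercept, growth_rate = fit_line(t, log_y)
level = math.exp(intercept)
print(level, growth_rate)
```

On this noise-free series the fit recovers the level 2 and the growth rate 0.3 used to generate it. With real data the estimates would only approximate the underlying values.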