Statistical Methods - Lecture 12

# Correlation and Regression Lecture 10

## Correlation

Correlation – measure of linear association between two variables

1. Pearson’s correlation

1. Key word is “linear”

1. cov(X, Y) shows how two variables vary together

2. If X = Y, then cov(X, X) = var(X)

1. var(X) is the population variance

2. If we estimate cov(X, Y), var(X), and var(Y) by the population formulas, then

$$r = \frac{\operatorname{cov}(X, Y)}{\sqrt{\operatorname{var}(X)\,\operatorname{var}(Y)}}$$

1. Correlation ranges over [-1, +1]

1. If r = -1, then X and Y have a perfect negative linear relationship

2. If r = +1, then X and Y have a perfect positive linear relationship

3. If r = 0, then X and Y have no linear relationship

1. Note – an exponential function would have a high correlation, but it is not linear

1. First step

1. Construct a scatter diagram of the data

2. What does the data look like?

2. Define

1. X is the independent variable

2. Y is the dependent variable

3. Define the function, Y = f(X)

1. Y is a function of X

2. Establishes causality

3. Correlation does not establish causality

1. Example – Chest pains and shortness of breath may have a high positive correlation

1. One does not cause the other

2. Clogged arteries can cause these two symptoms

3. Be very careful, correlation does not establish causality

4. Correlation t-statistic

1. Assumptions

1. Both variables are normally distributed

2. Both have a linear relationship

2. Hypothesis test: H0: ρ = 0 versus Ha: ρ ≠ 0

1. The t-test is $t = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}$

1. Notation

1. r is the correlation coefficient

2. n is the number of observations

3. df = n – 2

4. The 2 is because correlation involves two variables

1. Spearman Rank Correlation

1. Reduces problems with outliers

2. Can handle non-linear relationships, like exponential functions
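A short Python sketch of the measures above (numpy and scipy assumed; the data are made up, not from the lecture): it computes Pearson's r, the t-statistic with df = n − 2, and Spearman's rank correlation on a monotone but non-linear (exponential) relationship.

```python
import numpy as np
from scipy import stats

# Made-up data with an exponential (non-linear but monotone) relationship
rng = np.random.default_rng(42)
x = rng.normal(10, 2, size=30)
y = np.exp(0.3 * x) + rng.normal(0, 5, size=30)
n = len(x)

# Pearson's r = cov(X, Y) / sqrt(var(X) var(Y))
r = np.cov(x, y)[0, 1] / np.sqrt(np.var(x, ddof=1) * np.var(y, ddof=1))

# t-test of H0: rho = 0 with df = n - 2
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
p = 2 * stats.t.sf(abs(t), df=n - 2)   # two-tailed p-value
print(f"Pearson r = {r:.3f}, t = {t:.2f}, p = {p:.4f}")

# Spearman's rank correlation reduces outlier problems and handles
# monotone non-linear relationships like this exponential one
rho, p_rho = stats.spearmanr(x, y)
print(f"Spearman rho = {rho:.3f}, p = {p_rho:.4f}")
```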

## Regression Equation

1. You are imposing a relationship onto the variables: $Y_i = b_1 + b_2 X_i + u_i$

1. Yi is the dependent variable

1. Value is obtained from Xi and ui

2. Xi is the independent variable

3. b1 and b2 are parameters

1. These are estimated from the data

4. ui is the random noise term, $u_i \sim N(0, \sigma^2)$

1. Notation

1. N is normally distributed

2. Mean is 0

3. Variance is $\sigma^2$

2. Having normally distributed noise allows us to calculate confidence intervals and perform hypothesis testing

3. Estimates of b1 and b2 and predictions of Yi are influenced by the variance of the noise

1. We find the $\hat{b}_1$ and $\hat{b}_2$ that minimize the errors for the data points

1. Some errors are positive

2. Other errors are negative

3. We cannot add the errors because they may cancel

4. We square the error terms to make them all positive

2. Derivation

Starting with the equation

$$Y_i = b_1 + b_2 X_i + u_i$$

Solve for $u_i$, which yields

$$u_i = Y_i - b_1 - b_2 X_i$$

Square the errors to make them positive:

$$u_i^2 = (Y_i - b_1 - b_2 X_i)^2$$

This is only for one data point. We want to minimize the total errors of all the data points, so sum over all the data points and define the Sum of Squared Errors (SSE):

$$SSE = \sum_{i=1}^{n} u_i^2 = \sum_{i=1}^{n} (Y_i - b_1 - b_2 X_i)^2$$

We want to find the minimum, thus we take the first partial derivatives with respect to the betas:

$$\frac{\partial SSE}{\partial b_1} = \sum_{i=1}^{n} 2\,(Y_i - b_1 - b_2 X_i)(-1) = -2 \sum_{i=1}^{n} (Y_i - b_1 - b_2 X_i)$$

The second step is the Chain Rule from calculus. I can put the 2 in front of the summation because each term in the summation has a 2. Set the partial derivative to zero, in order to find the minimum value:

$$\sum_{i=1}^{n} (Y_i - \hat{b}_1 - \hat{b}_2 X_i) = 0$$

It is debatable when you should add hats to the estimators; I added them at this step, when the partial was set to zero. Now solve the equation for $\hat{b}_1$. Summation is a linear operator, so we can apply the summation to all terms in the parentheses:

$$\sum_{i=1}^{n} Y_i - n \hat{b}_1 - \hat{b}_2 \sum_{i=1}^{n} X_i = 0$$

$\hat{b}_1$ is a constant that is summed n times; $\hat{b}_2$ is a constant multiplied by all the X's in the summation, so it can be brought to the front. The last step works because we divide by n and substitute the average for y and the average for x into the equation:

$$\hat{b}_1 = \bar{Y} - \hat{b}_2 \bar{X}$$

Repeating these steps to get the estimator for $b_2$: similarly, set the partial to zero,

$$\frac{\partial SSE}{\partial b_2} = -2 \sum_{i=1}^{n} X_i (Y_i - b_1 - b_2 X_i) = 0$$

and solve for $\hat{b}_2$. We substitute the estimator $\hat{b}_1 = \bar{Y} - \hat{b}_2 \bar{X}$ into the equation, which yields

$$\hat{b}_2 = \frac{\sum_{i=1}^{n} X_i (Y_i - \bar{Y})}{\sum_{i=1}^{n} X_i (X_i - \bar{X})}$$

I did not break the last summation apart.

1. This is only good for one X variable. We can generalize least squares to Multiple Regression, where there are k parameters to estimate.

1. Example – Demand for Pepsi

1. Q is quantity and P is market price: $Q_i = b_1 + b_2 P_i + u_i$

1. Use least squares to find the betas

2. I fitted a line through my data points that gives me the best fit
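A minimal numpy sketch of the two estimators just derived, applied to hypothetical price and quantity data (the numbers are illustrative, not the lecture's Pepsi data):

```python
import numpy as np

# Hypothetical demand data: market price P and quantity Q
P = np.array([1.0, 1.2, 1.5, 1.8, 2.0, 2.3, 2.5, 2.8])
Q = np.array([98.0, 94.0, 87.0, 82.0, 79.0, 71.0, 68.0, 61.0])

P_bar, Q_bar = P.mean(), Q.mean()

# Estimators exactly as derived above:
# b2_hat = sum X_i (Y_i - Ybar) / sum X_i (X_i - Xbar);  b1_hat = Ybar - b2_hat * Xbar
b2_hat = np.sum(P * (Q - Q_bar)) / np.sum(P * (P - P_bar))
b1_hat = Q_bar - b2_hat * P_bar
print(f"fitted line: Q_hat = {b1_hat:.3f} + {b2_hat:.3f} * P")

# Cross-check against numpy's built-in least-squares line fit
slope, intercept = np.polyfit(P, Q, 1)
assert np.allclose([intercept, slope], [b1_hat, b2_hat])
```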

## Goodness of Fit

1. The goodness-of-fit measure is R².
1. If R² = 0, then the regression equation has no fit
2. If R² = 1, then the regression equation has a perfect linear fit
3. Also, if n = k the equations can be solved exactly (an algebraic system), so R² = 1
4. Problem – as the number of x variables increases, R² always gets larger

1. Adjusted R² – penalize the goodness of fit if more variables are added:

$$\bar{R}^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k}$$

1. As the number of independent variables increases, the penalty increases, but the error could decrease if the new variables explain y better.
2. Adjusted R² can sometimes be negative, indicating a very poor fit
3. Note – very important: the models being compared must have the same y variable
4. The dependent variable cannot be y in one model and ln(y) in another
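A small sketch of the penalty at work, under the convention above that k counts all estimated parameters including the intercept (the data and helper function are hypothetical):

```python
import numpy as np

def r_squared_stats(y, X, k):
    """R^2 and adjusted R^2 for an OLS fit; X includes the intercept column,
    and k is the number of estimated parameters (columns of X)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ beta) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    n = len(y)
    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)
    return r2, adj_r2

# Hypothetical data: y depends on x1 only; x2 is pure noise
rng = np.random.default_rng(0)
x1 = rng.normal(size=40)
x2 = rng.normal(size=40)                 # irrelevant regressor
y = 2 + 3 * x1 + rng.normal(size=40)

X1 = np.column_stack([np.ones(40), x1])        # k = 2
X2 = np.column_stack([np.ones(40), x1, x2])    # k = 3
print(r_squared_stats(y, X1, 2))   # adding x2 below raises plain R^2 ...
print(r_squared_stats(y, X2, 3))   # ... but adjusted R^2 is penalized
```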

## Analysis of Variance (ANOVA)

In terms of regressions, ANOVA is used to test hypotheses in many types of statistical analysis.
Sum of Squared Total (SST) is defined as:

$$SST = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$$

Yi is the dependent variable in the regression.  The term $(Y_i - \bar{Y})$ is the total variation for observation i.
Sum of Squared Regression (SSR) is defined as:

$$SSR = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$$

This is the variation explained by the regression
Sum of Squared Errors (SSE), which was earlier defined as $SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} \hat{u}_i^2$, is the amount of variation not explained by the regression equation.  Thus, SST = SSR + SSE, which is proved in the lecture.
We can use this information to calculate the R² statistic, showing the relationship

$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$

Problem – the more parameters added to the regression, the higher the R².

R² = 1 if n = k, i.e. the number of parameters equals the number of observations.

Now we need the degrees of freedom and the Mean Square (MS) for each measure:

| Source                          | df    | Mean Square (MS) |
|---------------------------------|-------|------------------|
| Sum of Squared Regression (SSR) | k – 1 | SSR / (k – 1)    |
| Sum of Squared Errors (SSE)     | n – k | SSE / (n – k)    |
| Sum of Squared Total (SST)      | n – 1 | NA               |

• When you have a variable with a normal distribution, if you add it to or subtract it from other variables with a normal distribution, then the result is still normally distributed

• Calculating a mean is a first moment

• If you square a random variable with a standard normal distribution, then you get a chi-square distribution with one degree of freedom.

• The squares are variances and are called the second moment

• All the Mean Squares are distributed as chi squares

• An F-distribution can test a whole group of hypotheses or test a whole regression model

• The F-test can test many other things

• The F-test is a ratio of two chi-squares

• The F-test is a one-tailed test associated with the right-hand tail.

• Squaring makes all terms positive

• The F-distribution and test are as follows:

$$F = \frac{SSR / (k - 1)}{SSE / (n - k)} = \frac{\text{Regression MS}}{\text{Residual MS}}, \qquad df_1 = k - 1,\ df_2 = n - k$$

The hypothesis test:

H0: Regression model does not explain the data, i.e. all the parameter estimates are zero

Ha: Regression model does explain the data, i.e. at least one parameter estimate is not zero

First, we need the critical value: α = 0.05, df1 = 1, and df2 = 58

In Excel, =finv(0.05,1,58)

Fc = 4.00

Excel calculates the ANOVA

| ANOVA      | df | SS       | MS       | F        | Significance F |
|------------|----|----------|----------|----------|----------------|
| Regression | 1  | 33.06087 | 33.06087 | 6.489695 | 0.013524       |
| Residual   | 58 | 295.4732 | 5.094365 |          |                |
| Total      | 59 | 328.534  |          |          |                |

Calculate the F-value: F = 33.06087 / 5.094365 ≈ 6.49.  The computed F exceeds Fc, so reject H0 and conclude at least one parameter is not equal to zero.
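As a quick check on the bullets above, that the F-statistic behaves as a ratio of two chi-squares, here is a small simulation sketch (numpy and scipy assumed; the seed and sample size are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
df1, df2 = 1, 58   # the degrees of freedom from the example above

# Ratio of two independent chi-squares, each divided by its degrees of freedom
num = rng.chisquare(df1, size=200_000) / df1
den = rng.chisquare(df2, size=200_000) / df2
ratio = num / den

# The simulated 95th percentile should sit near the theoretical critical value
print(np.quantile(ratio, 0.95))       # roughly 4.0
print(stats.f.ppf(0.95, df1, df2))    # about 4.01, the Fc used above
```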
How many observations? 10
How many parameters, k? 4

Degrees of freedom for error df = 10 – 4 = 6
Degrees of freedom for total df = 10 – 1 = 9

| ANOVA      | df | SS          | MS       | F        | Significance F |
|------------|----|-------------|----------|----------|----------------|
| Regression | 3  | 5001.859635 | 1667.287 | 232.4265 | 1.35E-06       |
| Residual   | 6  | 43.04036468 | 7.173394 |          |                |
| Total      | 9  | 5044.9      |          |          |                |
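A small sketch (scipy assumed) that reproduces the F-statistic, critical value, and p-value from this second ANOVA table:

```python
from scipy import stats

# Values from the second ANOVA table above: n = 10 observations, k = 4 parameters
ssr, sse = 5001.859635, 43.04036468
n, k = 10, 4

msr = ssr / (k - 1)            # Regression MS, df1 = k - 1 = 3
mse = sse / (n - k)            # Residual MS,  df2 = n - k = 6
F = msr / mse
print(f"F = {F:.4f}")          # about 232.43, matching the table

Fc = stats.f.ppf(0.95, k - 1, n - k)   # critical value at alpha = 0.05
p = stats.f.sf(F, k - 1, n - k)        # right-tail p-value ("Significance F")
print(f"Fc = {Fc:.2f}, p = {p:.2e}")   # about 4.76 and 1.35e-06
print("reject H0" if F > Fc else "fail to reject H0")
```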

## Trend Regression

1. Time series – data collected over time

1. Could have patterns over time

1. Trend variables always start at 1

2. You never put in the year

3. You can add powers of the trend

1. An example with powers of the trend: $Y_t = b_1 + b_2 t + b_3 t^2 + u_t$

1. Use adjusted R² to find the stopping point

2. Choose the regression with the largest adjusted R²

3. Note – R² will always pick the largest regression

1. Exponential Regression, e.g. $Y_t = e^{b_1 + b_2 t + u_t}$

1. Transform the equation until it is linear in the parameters by taking the natural logarithm of both sides: $\ln Y_t = b_1 + b_2 t + u_t$
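A brief numpy sketch of the log transformation on a made-up exponential series (the coefficients, noise level, and seed are arbitrary assumptions):

```python
import numpy as np

# Hypothetical exponential-growth series over 20 periods
rng = np.random.default_rng(1)
t = np.arange(1, 21)                    # trend variable always starts at 1
y = np.exp(0.8 + 0.12 * t + rng.normal(0, 0.05, size=20))

# Taking logs makes the model linear in the parameters:
# ln(y_t) = b1 + b2 * t + u_t, so fit ln(y) on the trend with least squares
b2_hat, b1_hat = np.polyfit(t, np.log(y), 1)
print(f"ln(y_hat) = {b1_hat:.3f} + {b2_hat:.3f} * t")   # b2_hat near the true 0.12
```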