Analyzing the Regression Line
The correlation gives us an estimate of how linear the data are. We would also like to know how close the data are to the regression line. For this we use a measurement s_e, which is a point estimate for the standard deviation of the residuals. If s_e is large, then the points lie far from the line; if it is small, then the points are close to the line.
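We have an empirical rule that says that:
approximately 95% of the points lie within 2s_e of the line.
A point estimate for σ², the variance of the residuals, is given by
$$s_e^2 = \frac{\sum (y - \hat{y})^2}{n - 2}$$
where ŷ is the value predicted by the regression line, and the point estimate for σ is its square root, s_e.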
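As a rough illustration of the point estimate for the standard deviation of the residuals, the following Python sketch fits a least-squares line and computes s_e with n - 2 in the denominator. The function name and the five data points are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import math

def residual_standard_error(xs, ys):
    """Fit y = a + b*x by least squares and return (a, b, s_e)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx               # slope
    a = y_bar - b * x_bar       # intercept
    # Sum of squared residuals (observed y minus predicted y)
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s_e = math.sqrt(sse / (n - 2))
    return a, b, s_e

# Hypothetical data, not taken from the text
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b, s_e = residual_standard_error(xs, ys)
print(f"a = {a:.2f}, b = {b:.2f}, s_e = {s_e:.3f}")
```

By the empirical rule, roughly 95% of the points would then be expected to lie within 2 * s_e of the fitted line, although with only five points this is a loose guide at best.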
Inferences on the Slope
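Suppose that the equation of the regression line calculated from the data is
y = a + bx.
If the true (population) regression line is
y = α + βx,
is b a good point estimate for β? We can estimate the standard deviation of b by the formula
$$s_b = \frac{s_e}{\sqrt{\sum (x - \bar{x})^2}}$$
The t statistic is
$$t = \frac{b - \beta}{s_b}$$
with n - 2 degrees of freedom. We can form a confidence interval for β as
$$b \pm t_c \, s_b$$
where t_c is the critical value from the t distribution for the chosen confidence level.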
To interpret this confidence interval we can say, for example, that we are 95% confident that the true slope of the regression line is between two and three.
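The slope's standard error and confidence interval can be sketched in Python as follows. The data set and the function name are hypothetical, and the critical value 3.182 is the two-sided 95% value from a t-table for n - 2 = 3 degrees of freedom, supplied by hand rather than computed.

```python
import math

def slope_inference(xs, ys, t_crit):
    """Return (b, s_b, (low, high)) for the slope of the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s_e = math.sqrt(sse / (n - 2))   # standard deviation of the residuals
    s_b = s_e / math.sqrt(sxx)       # standard error of the slope
    return b, s_b, (b - t_crit * s_b, b + t_crit * s_b)

# Hypothetical data; t_crit is read from a t-table for 3 degrees of freedom
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b, s_b, (low, high) = slope_inference(xs, ys, t_crit=3.182)
print(f"b = {b:.2f}, s_b = {s_b:.3f}, 95% CI = ({low:.2f}, {high:.2f})")
```

With so few points the interval is wide and contains 0, so this particular sample would not let us rule out a slope of 0.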
If the slope of the regression line is 0, then x is of no help in predicting y and the regression line is useless. Hence it is typical to test the hypotheses
H0: β = 0
H1: β ≠ 0
We use the t statistic
$$t = \frac{b}{s_b}$$
and proceed as usual.
Suppose that we have computed the regression line that corresponds to education (years of college) vs. income as
ŷ = 15,000 + 5x
with 200 data points and have
s_b = 2
α = 0.05
Then we have
H0: β = 0
H1: β ≠ 0
t = 5/2 = 2.5
giving a p-value between 0.01 and 0.02. Since p < α, we can reject H0 and accept H1 and conclude that the regression line is useful for predicting income based on years of college. We can make a 95% confidence interval for the slope. With 200 - 2 = 198 degrees of freedom the critical value is approximately 1.97, so the interval is
$$5 \pm 1.97(2) = 5 \pm 3.94,$$
that is, from about 1.06 to 8.94.
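The numbers in this example can be checked with a short Python sketch. It uses only the standard library, so the t distribution's cumulative probability is approximated by numerically integrating its density with Simpson's rule, and the critical value is found by bisection. The helper names t_pdf and t_cdf are hypothetical, and this is one possible approach under those assumptions, not the only way to obtain these values.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log(1 + t * t / df))

def t_cdf(t, df, steps=1000):
    """P(T <= t), using Simpson's rule on [0, |t|] and symmetry about 0."""
    upper = abs(t)
    h = upper / steps            # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if t >= 0 else 0.5 - area

# Values from the example: slope 5, standard error 2, 200 data points
b, s_b, n = 5, 2, 200
df = n - 2
t_stat = b / s_b
p_two_sided = 2 * (1 - t_cdf(abs(t_stat), df))

# Bisection for the two-sided 95% critical value (97.5th percentile)
lo, hi = 0.0, 10.0
for _ in range(50):
    mid = (lo + hi) / 2
    if t_cdf(mid, df) < 0.975:
        lo = mid
    else:
        hi = mid
t_crit = (lo + hi) / 2
ci_low, ci_high = b - t_crit * s_b, b + t_crit * s_b

print(f"t = {t_stat}, p = {p_two_sided:.4f}")
print(f"t_crit = {t_crit:.3f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

Because the interval does not contain 0, it agrees with the decision to reject H0 at the 0.05 level.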
Testing if There is a Correlation
We have talked about the correlation being weak, moderate, or strong; however, a correlation computed from a small sample may not be reliable. Next we will set up a hypothesis test for whether there is a correlation between the two variables. If there is no correlation, then the population correlation coefficient will be 0; otherwise it will not be 0. We can also test whether there is a positive or a negative correlation. As you may guess, the difference between testing for a correlation, a positive correlation, or a negative correlation is whether we use a two-tailed test, a right-tailed test, or a left-tailed test. We will use the Greek letter ρ, pronounced "rho," for the population correlation and r for the sample correlation. The test statistic is given by
$$t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}$$
Notice that this is a "t" statistic. We have
degrees of freedom = n - 2
Notice that the larger the sample size (with the same r), the larger the t value in absolute value. Also, an r farther from 0 will produce a t value farther from 0.
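To see how the t value grows with the sample size and with r, here is a minimal Python sketch of the standard test statistic t = r * sqrt(n - 2) / sqrt(1 - r^2). The function name correlation_t and the sample sizes tried are illustrative choices.

```python
import math

def correlation_t(r, n):
    """t statistic for testing whether the population correlation is 0."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# Same r, increasing sample size: t increases
for n in (10, 50, 175):
    print(f"r = 0.18, n = {n:3d}: t = {correlation_t(0.18, n):.2f}")

# Same sample size, increasing r: t increases
for r in (0.18, 0.50, 0.80):
    print(f"r = {r:.2f}, n =  50: t = {correlation_t(r, 50):.2f}")
```

A weak correlation can therefore still yield a large t value when the sample is large enough.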
A study was done to see if there is a positive correlation between the number of times per month that college students call home and the amount of money that their parents contribute towards their education. 175 students were surveyed and the correlation was found to be 0.18. What can be said at the 0.05 level of significance?
First we write down the null and alternative hypotheses:
H0: ρ = 0
H1: ρ > 0
We compute the test statistic:
$$t = \frac{0.18\sqrt{175 - 2}}{\sqrt{1 - 0.18^2}} \approx 2.41$$
Since the sample size is large, we can use the normal distribution (z-table) to approximate the P-value. Notice also that this is a right tailed test so we need to subtract the table value from 1. We have
P = 1 - .9920 = 0.008
Since P is less than 0.05, we can conclude that there is a positive correlation between the number of times per month that students call home and the amount of money that their parents contribute towards their education.
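The calculation in this example can be reproduced with Python's standard library. As in the text, the sketch below uses the normal distribution as a large-sample approximation to the t distribution, so the P-value is approximate; the variable names are illustrative.

```python
import math
from statistics import NormalDist

r, n = 0.18, 175                      # sample correlation and sample size
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Right-tailed test: area to the right of the test statistic
p_value = 1 - NormalDist().cdf(t_stat)

print(round(t_stat, 2), round(p_value, 3))   # prints 2.41 0.008
print("reject H0" if p_value < 0.05 else "fail to reject H0")
```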
Remark: We were able to conclude that there is strong statistical evidence of a positive correlation. On the other hand, the correlation of 0.18 is a weak correlation. Try not to confuse strong evidence of a correlation with a strong correlation. Also, we cannot conclude that calling parents frequently will induce parents to send more money. We have established correlation, not causation.
Remark: If the correlation is 0, then so is the slope. It turns out that the test statistic for the slope is the same as the test statistic for the correlation. Computers will usually provide the P-value for testing the slope. This is the same as the P-value for testing the correlation.
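The equality of the two test statistics can be checked numerically. The following Python sketch computes t for the slope as b / s_b and t for the correlation as r * sqrt(n - 2) / sqrt(1 - r^2) on a small hypothetical data set and compares them. It is a numerical check on made-up data, not a proof.

```python
import math

# Hypothetical data, not taken from the text
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Test statistic for the slope: t = b / s_b
b = sxy / sxx
a = y_bar - b * x_bar
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
s_e = math.sqrt(sse / (n - 2))
s_b = s_e / math.sqrt(sxx)
t_slope = b / s_b

# Test statistic for the correlation
r = sxy / math.sqrt(sxx * syy)
t_corr = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

print(f"t (slope) = {t_slope:.4f}, t (correlation) = {t_corr:.4f}")
```

The two values agree up to floating-point rounding, which is why software output for the slope test can also be read as a test of the correlation.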