Linear Regression
1. Regression
The correlation coefficient
$$ \rho=\frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i-\mu_x}{\sigma_x}\right)\left(\frac{y_i-\mu_y}{\sigma_y}\right) $$

where \(\mu_x\) and \(\mu_y\) are the averages of \(x_1,x_2,\dots,x_n\) and \(y_1,y_2,\dots,y_n\), and \(\sigma_x\) and \(\sigma_y\) are the corresponding standard deviations.
Standardizing each observation in this way shows the size and direction of its deviation from the mean, measured in standard deviations.
The R functions `var`, `sd`, and `cor` use \(n-1\) rather than \(n\) as their denominator.
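As a quick check of the formula, here is a minimal sketch on simulated data (the variables are made up for illustration) that computes the average of the products of standardized values and compares it with `cor`:

```r
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)

# population-style version of the formula above (denominator n);
# sd() itself uses n - 1, so the result differs slightly from cor()
rho_hat <- mean(((x - mean(x)) / sd(x)) * ((y - mean(y)) / sd(y)))

rho_hat     # close to cor(x, y), but scaled by (n - 1) / n
cor(x, y)   # uses the n - 1 convention throughout
```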
Sample correlation is a random variable
Because the sample correlation is an average of independent draws, the central limit theorem applies. Therefore, for large enough \(N\), the distribution of the sample correlation is approximately normal with expected value \(\rho\). The standard error, which is somewhat complex to derive, is \(\sqrt{\frac{1-r^2}{N-2}}\).
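A minimal Monte Carlo sketch, using simulated bivariate data, illustrates this: the draws of the sample correlation center near \(\rho\) and their spread is close to the formula above.

```r
set.seed(1)
N <- 50
B <- 10000
rho <- 0.5

r_draws <- replicate(B, {
  x <- rnorm(N)
  y <- rho * x + sqrt(1 - rho^2) * rnorm(N)  # population correlation rho
  cor(x, y)
})

mean(r_draws)                # approximately rho
sd(r_draws)                  # compare with sqrt((1 - rho^2) / (N - 2))
hist(r_draws, breaks = 50)   # roughly normal for large enough N
```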
Correlation is not always a useful summary of the relationship between two variables; it is most informative when the data are approximately bivariate normal.
Conditional expectations
The statistical notation for the conditional expectation is:

$$ E(Y \mid X = x) $$
However, a challenge with using this approach to estimate conditional expectations is that, for continuous data, few if any points in our sample match any one value of \(x\) exactly.
A practical way to improve these estimates of the conditional expectations is to define strata of observations with similar values of \(x\). Stratification followed by boxplots lets us see the distribution of \(y\) within each group.
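Here is a minimal sketch of this idea; the data frame and column names are made up for illustration:

```r
set.seed(1)
dat <- data.frame(x = rnorm(1000, 70, 3))
dat$y <- 0.5 * dat$x + rnorm(1000, 0, 2)

dat$stratum <- round(dat$x)       # strata of similar x values

# estimate E(Y | X = x) within each stratum
tapply(dat$y, dat$stratum, mean)

# one boxplot per stratum shows the conditional distribution of y
boxplot(y ~ stratum, data = dat)
```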
The regression line
The regression line for predicting \(Y\) from \(X\) has slope \(m = \rho\frac{\sigma_y}{\sigma_x}\) and intercept \(b = \mu_y - m\mu_x\).
Regression improves precision: a Monte Carlo simulation shows that predictions from the regression line have a lower standard error than estimates based on conditional expectations within strata.
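A minimal sketch on simulated data shows the line computed directly from the summary statistics, and that `lm` recovers the same coefficients:

```r
set.seed(1)
x <- rnorm(500, 70, 3)
y <- 40 + 0.5 * x + rnorm(500, 0, 2)

r <- cor(x, y)
m <- r * sd(y) / sd(x)       # slope: rho * sigma_y / sigma_x
b <- mean(y) - m * mean(x)   # intercept: mu_y - m * mu_x

c(intercept = b, slope = m)
coef(lm(y ~ x))              # lm recovers the same line
```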
Bivariate normal distribution
A more technical way to define the bivariate normal distribution is the following: if \(X\) is a normally distributed random variable, \(Y\) is also a normally distributed random variable, and the conditional distribution of \(Y\) for any \(X=x\) is approximately normal, then the pair is approximately bivariate normal.
If our data is approximately bivariate normal, then the conditional expectation, the best prediction of \(Y\) given we know the value of \(X\), is given by the regression line:

$$ E(Y \mid X = x) = \mu_y + \rho\frac{\sigma_y}{\sigma_x}(x - \mu_x) $$
There are two regression lines
If we reverse the prediction order, the line for predicting \(X\) from \(Y\) is not obtained by computing the inverse function of \(E(Y \mid X=x)\). We need to compute \(E(X \mid Y=y)\), which yields a different line with slope \(\rho\frac{\sigma_x}{\sigma_y}\).
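A minimal sketch with simulated bivariate normal data makes the difference visible:

```r
set.seed(1)
rho <- 0.5
x <- rnorm(1000)
y <- rho * x + sqrt(1 - rho^2) * rnorm(1000)

coef(lm(y ~ x))   # slope approx rho * sd(y) / sd(x)
coef(lm(x ~ y))   # slope approx rho * sd(x) / sd(y), NOT 1 / slope above
```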
Linear models
The linear model is

$$ Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i $$

For the interpretability of the intercept \(\beta_0\), we can rewrite the formula with a centered predictor:

$$ Y_i = \beta_0 + \beta_1 (x_i - \bar{x}) + \varepsilon_i $$

so that \(\beta_0\) is now the expected value of \(Y_i\) when \(x_i = \bar{x}\).
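A minimal sketch of this reparameterization on simulated data: centering the predictor changes the intercept's interpretation but leaves the slope unchanged.

```r
set.seed(1)
x <- rnorm(200, 70, 3)
y <- 40 + 0.5 * x + rnorm(200, 0, 2)

coef(lm(y ~ x))                # beta_0 is E(Y) when x = 0 (an extrapolation)
coef(lm(y ~ I(x - mean(x))))   # beta_0 is now E(Y) when x = mean(x)
```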
Least Squares Estimates
Find the values of the \(\beta\)s that minimize the distance of the fitted model to the data, as measured by the residual sum of squares (RSS):

$$ \text{RSS} = \sum_{i=1}^{n} \left\{ y_i - \left( \beta_0 + \beta_1 x_i \right) \right\}^2 $$
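A minimal sketch, with simulated data standing in for a real dataset, writes the RSS as a function of the two \(\beta\)s:

```r
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)

# residual sum of squares for a candidate (beta0, beta1)
rss <- function(beta0, beta1) {
  sum((y - (beta0 + beta1 * x))^2)
}

rss(2, 3)         # near the truth, small RSS
rss(0, 0)         # far from the truth, large RSS
coef(lm(y ~ x))   # lm finds the minimizing values
```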
In R, we use `lm` to fit this model. The most common way to use `lm` is with a formula containing the character `~`, which tells `lm` which variable we are predicting (left of `~`) and which we are using to predict (right of `~`). The intercept is added automatically to the model that will be fit. We can use the function `summary` to extract more information about the fit.
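A minimal sketch, using the built-in `mtcars` data purely as a stand-in dataset:

```r
# predict mpg (left of ~) from wt (right of ~); intercept added automatically
fit <- lm(mpg ~ wt, data = mtcars)

coef(fit)      # intercept and slope
summary(fit)   # estimates, standard errors, t-values, p-values, R-squared
```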
LSE are random variables
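A minimal Monte Carlo sketch on simulated data shows this sample-to-sample variability directly:

```r
set.seed(1)
B <- 1000
N <- 50

# refit the model on repeated samples from the same population
lse <- replicate(B, {
  x <- rnorm(N)
  y <- 2 + 3 * x + rnorm(N)
  coef(lm(y ~ x))
})

apply(lse, 1, mean)   # centered near the true values 2 and 3
apply(lse, 1, sd)     # Monte Carlo standard errors, comparable to summary()'s
```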
Predicted values are random variables
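A minimal sketch (again with `mtcars` as a stand-in): `predict` returns the fitted values, and `se.fit = TRUE` also returns their standard errors.

```r
fit <- lm(mpg ~ wt, data = mtcars)

pred <- predict(fit, se.fit = TRUE)
head(pred$fit)      # predicted values
head(pred$se.fit)   # their standard errors

# confidence intervals for the conditional expectation at new values of wt
predict(fit, newdata = data.frame(wt = c(2, 3, 4)), interval = "confidence")
```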
Diagnostic plots
The function `plot` applied to an `lm` object can produce six diagnostic plots (by default it shows four of them), and the argument `which` lets you specify which ones you want to see.
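A minimal sketch, with `mtcars` once more as a stand-in dataset:

```r
fit <- lm(mpg ~ wt, data = mtcars)

plot(fit, which = 1)     # residuals vs fitted values
plot(fit, which = 2)     # normal Q-Q plot of standardized residuals
plot(fit, which = 1:6)   # all six diagnostic plots, one after another
```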