1 Regression

1.1 Bivariate data

Data obtained by observing two characteristics of the same object:

(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)

1.2 Sample covariance

Measures the linear relationship between x and y (the sign gives the direction; the magnitude depends on the units):

s_{xy} = \frac{1}{n-1}\sum_{i=1}^n(x_i - \bar x)(y_i - \bar y) = \frac{1}{n-1}(\sum_{i=1}^nx_iy_i - n\bar x\bar y)
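A minimal sketch in Python (numpy assumed available; the sample values are purely illustrative) computing s_{xy} from the definition and checking it against np.cov:

import numpy as np

# Illustrative bivariate sample (hypothetical data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.7, 5.2, 5.8])

n = len(x)
# Sample covariance from the definition (divisor n - 1)
s_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# np.cov uses the same n - 1 divisor by default; [0, 1] is the cross term
assert np.isclose(s_xy, np.cov(x, y)[0, 1])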

1.3 Sample correlation

r_{xy} = \frac{s_{xy}}{s_xs_y}

Note that -1 \le r_{xy} \le 1
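As a quick check of the definition (Python sketch, same illustrative sample as above):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.7, 5.2, 5.8])

# r_xy = s_xy / (s_x s_y); ddof=1 gives the sample standard deviation
r_xy = np.cov(x, y)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))

# Agrees with numpy's correlation matrix
assert np.isclose(r_xy, np.corrcoef(x, y)[0, 1])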

1.4 Spearman correlation coefficient

Measure of monotonicity

\rho_{xy} = \frac{s_{r(x)r(y)}}{s_{r(x)}s_{r(y)}}

Where r(x) denotes the rank of x

If there are no ties:

\rho_{xy} = 1 - \frac{6\sum_{i=1}^n d_i^2}{n(n^2 - 1)}

Where d_i = r(x_i) - r(y_i)
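A sketch (Python with numpy/scipy; the tie-free data are illustrative) computing \rho_{xy} both as the correlation of the ranks and via the shortcut formula:

import numpy as np
from scipy.stats import rankdata, spearmanr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 3.1, 2.8, 9.5, 50.0])  # roughly monotone but far from linear

# General definition: Pearson correlation of the ranks
rx, ry = rankdata(x), rankdata(y)
rho = np.corrcoef(rx, ry)[0, 1]

# Shortcut formula, valid only without ties
d = rx - ry
n = len(x)
rho_shortcut = 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))

# Both agree with scipy's implementation
rho_scipy, _ = spearmanr(x, y)
assert np.isclose(rho, rho_shortcut) and np.isclose(rho, rho_scipy)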

1.5 Relationships

1.5.1 Deterministic

y = f(x)

1.5.2 Nondeterministic

y = f(x) + \varepsilon
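A tiny simulation contrasting the two (Python with numpy; f and the noise level are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)

y_det = 2.0 + 0.5 * x                            # deterministic: y = f(x)
y_rnd = y_det + rng.normal(0, 1, size=x.shape)   # nondeterministic: y = f(x) + eps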

1.6 Linear regression

Y_i = \beta_0 + \beta_1x_i + \varepsilon_i

The regression line \hat y = \hat \beta_0 + \hat \beta_1 x can be estimated using least squares. These estimators are:

\hat \beta_1 = \frac{s_{xy}}{s_x^2}, \qquad \hat \beta_0 = \bar y - \hat \beta_1 \bar x

Unbiased estimator for \sigma^2: s_R^2 = \frac{1}{n-2} \sum_{i=1}^n(y_i - \hat y_i)^2
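A minimal sketch of the fit (Python with numpy; the data are simulated with illustrative true parameters \beta_0 = 2, \beta_1 = 0.5):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=x.shape)

n = len(x)
# Least squares estimates: beta1_hat = s_xy / s_x^2, beta0_hat = ybar - beta1_hat * xbar
beta1_hat = np.cov(x, y)[0, 1] / x.var(ddof=1)
beta0_hat = y.mean() - beta1_hat * x.mean()

# Unbiased estimate of sigma^2 from the residuals (n - 2 degrees of freedom)
y_hat = beta0_hat + beta1_hat * x
s_R2 = np.sum((y - y_hat) ** 2) / (n - 2)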

1.6.1 Parameters

Sampling distributions of the estimators (the second argument of N denotes the standard deviation):

\hat \beta_0 \sim N(\beta_0, \sqrt{\sigma^2(\frac{1}{n} + \frac{\bar x^2}{(n-1)s_x^2})})

\frac{\hat \beta_0 - \beta_0}{\sqrt{s_R^2(\frac{1}{n} + \frac{\bar x^2}{(n-1)s_x^2})}} \sim t_{n-2}

\hat \beta_1 \sim N(\beta_1, \sqrt{\frac{\sigma^2}{(n-1)s_x^2}})

\frac{\hat \beta_1 - \beta_1}{\sqrt{\frac{s_R^2}{(n-1)s_x^2}}} \sim t_{n-2}
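These pivots yield t-based confidence intervals for the parameters. A self-contained sketch (Python with numpy/scipy; same simulated data as above, 95% level chosen for illustration):

import numpy as np
from scipy.stats import t

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=x.shape)

n = len(x)
beta1_hat = np.cov(x, y)[0, 1] / x.var(ddof=1)
beta0_hat = y.mean() - beta1_hat * x.mean()
y_hat = beta0_hat + beta1_hat * x
s_R2 = np.sum((y - y_hat) ** 2) / (n - 2)

sxx = (n - 1) * x.var(ddof=1)  # equals (n-1) s_x^2

# Standard errors from the formulas above
se_beta0 = np.sqrt(s_R2 * (1 / n + x.mean() ** 2 / sxx))
se_beta1 = np.sqrt(s_R2 / sxx)

# 95% confidence intervals from the t_{n-2} distribution
q = t.ppf(0.975, df=n - 2)
ci_beta0 = (beta0_hat - q * se_beta0, beta0_hat + q * se_beta0)
ci_beta1 = (beta1_hat - q * se_beta1, beta1_hat + q * se_beta1)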

1.7 Coefficient of determination

Goodness of fit

R^2 = r_{xy}^2

is the proportion of the sample variability in y that is explained by the model (linear dependence on x). The higher, the better.

1.7.1 Variability decomposition

\begin{aligned} \text{Total sum of squares (SST)} &= \text{Residual sum of squares (SSR)} &&+ \text{Model sum of squares (SSM)} \\ \sum_i (y_i - \bar y)^2 &= \sum_i (y_i - \hat y_i)^2 &&+ \sum_i (\hat y_i - \bar y)^2 \end{aligned}

then

R^2 = 1 - \frac{\text{SSR}}{\text{SST}} = \frac{\text{SSM}}{\text{SST}}
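A sketch verifying the decomposition and both expressions for R^2 numerically (Python with numpy; same simulated data as in the regression sketches):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=x.shape)

beta1_hat = np.cov(x, y)[0, 1] / x.var(ddof=1)
beta0_hat = y.mean() - beta1_hat * x.mean()
y_hat = beta0_hat + beta1_hat * x

sst = np.sum((y - y.mean()) ** 2)
ssr = np.sum((y - y_hat) ** 2)
ssm = np.sum((y_hat - y.mean()) ** 2)

# SST = SSR + SSM holds exactly for a least squares fit with intercept
assert np.isclose(sst, ssr + ssm)

# Both forms of R^2, and the identity R^2 = r_xy^2
r2 = 1 - ssr / sst
assert np.isclose(r2, ssm / sst)
assert np.isclose(r2, np.corrcoef(x, y)[0, 1] ** 2)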