More Notes on Least Squares

Continuing on from last time

This post is actually based on a couple of exercises from Chapters 2 and 3 of Hastie et al.’s “The Elements of Statistical Learning”, which reminded me of my post from yesteryear, which in turn reminded me that I haven’t blogged in quite a while…

Let $y \in \mathbb{R}^{n}$ be the response to a design matrix $X \in \mathbb{R}^{n \times p}$, where $n$ is the number of observations in the given dataset, and $p$ is the number of covariates. Our linear model is then of the form

$$ \begin{equation} y = X\beta + \epsilon \label{lmeqn} \end{equation} $$

The entries of $\epsilon \in \mathbb{R}^{n}$ are called noise variables, which have zero mean, constant variance $\sigma^2$, and zero correlation with each other. Finally, $\beta \in \mathbb{R}^{p}$ is the vector of coefficients in our model.
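
To make the notation concrete, here is a minimal simulation sketch in Python (the values of $n$, $p$, $\beta$, and $\sigma$ below are arbitrary choices for illustration, not taken from the book): it draws data from $\eqref{lmeqn}$ and computes the least squares estimate $\hat{\beta} = (X^{\intercal}X)^{-1}X^{\intercal}y$.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 200, 5                        # arbitrary sample size and number of covariates
sigma = 0.5                          # noise standard deviation
beta = rng.normal(size=p)            # true coefficients

X = rng.normal(size=(n, p))          # design matrix
eps = sigma * rng.normal(size=n)     # zero-mean, constant-variance, uncorrelated noise
y = X @ beta + eps                   # the linear model y = X beta + eps

# Least squares estimate: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat - beta)               # small for reasonably large n
```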

Prediction error

Suppose that we have a test set consisting of a single observation $(X_0, y_0)$, where $X_0 \in \mathbb{R}^{p}$ and $y_0 \in \mathbb{R}$, and we are interested in the expected squared error of the prediction $\hat{y}_0 = X_0^{\intercal} \hat{\beta} \in \mathbb{R}$. Treating $X$ and $X_0$ as fixed, and taking the expectation over both the training responses $y$ and the test response $y_0$, it follows that

$$ \begin{align} \mathbb{E} \left[ MSE(\hat{y}_{0}) \right] &= \mathbb{E} \left[ (y_0 - \hat{y}_0)^2 \right] \nonumber \newline &= \mathbb{E} \left[ \left( (y_0 - X_0^{\intercal} \beta) - (\hat{y}_0 - X_0^{\intercal} \beta) \right)^2 \right] \nonumber \newline &= Var(y_0) + \mathbb{E} \left[ (\hat{y}_0 - X_0^{\intercal} \beta)^2 \right] \nonumber \newline &= \sigma^2 + Var(\hat{y}_0) + \left( \mathbb{E} \left[ \hat{y}_0 \right] - X_0^{\intercal} \beta \right)^2 \nonumber \newline &= \sigma^2 + Var\left( X_0^{\intercal} (X^\intercal X)^{-1} X^\intercal y \right) \nonumber \newline &= \sigma^2 + X_0^{\intercal} (X^\intercal X)^{-1} X^\intercal \cdot Var(y) \cdot X (X^\intercal X)^{-1} X_0 \nonumber \newline &= \sigma^2 \left(1 + X_0^{\intercal} (X^\intercal X)^{-1} X_0 \right) \label{msehaty} \end{align} $$

The cross term in the second line vanishes because the test noise $y_0 - X_0^{\intercal} \beta$ is independent of the training noise (and hence of $\hat{y}_0$), and the squared bias term vanishes because $\hat{y}_0$ is unbiased, i.e. $\mathbb{E}\left[ \hat{y}_0 \right] = X_0^{\intercal} \beta$.

In my last post, I presented this result for the variance of $\hat{y}$:

$$ \begin{align} Var(\hat{y}) &= Var(X \hat{\beta}) \nonumber \newline &= Var(X (X^{\intercal} X)^{-1} X^{\intercal} y) \nonumber \newline &= X (X^{\intercal} X)^{-1} X^{\intercal} \cdot Var(y) \cdot X (X^{\intercal} X)^{-1} X^{\intercal} \nonumber \newline &= \sigma^2 X (X^{\intercal} X)^{-1} X^{\intercal} \label{varhaty} \end{align} $$

Recall that $\hat{y}$ is unbiased, since $\mathbb{E}\left[\hat{y} \right] = \mathbb{E}\left[ y \right]$, so $MSE(\hat{y}) = Var(\hat{y})$. Comparing $\eqref{msehaty}$ with the corresponding diagonal entry of $\eqref{varhaty}$, we see that we incur an additional $\sigma^2$ term in the error when the test point $(X_0, y_0)$ is not part of the training set. This isn’t surprising: the test response carries its own irreducible noise, so the expected error at a held-out point exceeds the corresponding error at a training point.
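
As a sanity check on $\eqref{msehaty}$, here is a small Monte Carlo sketch (the dimensions, noise level, and number of rounds are arbitrary choices): holding $X$ and $X_0$ fixed, it repeatedly redraws the training noise and the test response, and compares the average squared prediction error against $\sigma^2 \left(1 + X_0^{\intercal}(X^{\intercal}X)^{-1}X_0\right)$.

```python
import numpy as np

rng = np.random.default_rng(1)

n, p, sigma = 50, 3, 1.0             # arbitrary simulation settings
beta = rng.normal(size=p)
X = rng.normal(size=(n, p))          # training design, held fixed
x0 = rng.normal(size=p)              # test covariates, held fixed

XtX_inv = np.linalg.inv(X.T @ X)

# Theoretical expected squared prediction error from the derivation above
theory = sigma**2 * (1.0 + x0 @ XtX_inv @ x0)

# Monte Carlo: redraw the training noise and the test response each round
errs = []
for _ in range(20000):
    y = X @ beta + sigma * rng.normal(size=n)
    beta_hat = XtX_inv @ X.T @ y
    y0 = x0 @ beta + sigma * rng.normal()
    errs.append((y0 - x0 @ beta_hat) ** 2)

print(theory, np.mean(errs))         # the two numbers should be close
```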

The benefits of more data

If $n$ is large and the covariates have zero mean, then $X^\intercal X \rightarrow n \, Var(X)$, where $Var(X) \in \mathbb{R}^{p \times p}$ denotes the covariance matrix of a single observation. Assuming the test point $X_0$ is drawn from the same distribution, and now also averaging over it, continuing from $\eqref{msehaty}$ we have

$$ \begin{align} \mathbb{E}_{X_0} \left[ \mathbb{E}\left[ MSE(\hat{y}_{0}) \right] \right] &\rightarrow \frac{\sigma^2}{n} \mathbb{E}_{X_0} \left[ X_0^{\intercal} (Var(X))^{-1} X_0 \right] + \sigma^2 \nonumber \newline &= \frac{\sigma^2}{n} \mathbb{E}_{X_0} \left[ \mathrm{Tr} \left( X_0^{\intercal} (Var(X))^{-1} X_0 \right) \right] + \sigma^2 \label{trace} \newline &= \frac{\sigma^2}{n} \mathbb{E}_{X_0} \left[ \mathrm{Tr} \left( X_0 X_0^{\intercal} (Var(X))^{-1} \right) \right] + \sigma^2 \label{cyclic_trace} \newline &= \frac{\sigma^2}{n} \mathrm{Tr} \left( \mathbb{E}_{X_0} \left[ X_0 X_0^{\intercal} \right] (Var(X))^{-1} \right) + \sigma^2 \label{linear} \newline &= \frac{\sigma^2}{n} \mathrm{Tr} \left( Var(X_0) (Var(X))^{-1} \right) + \sigma^2 \nonumber \newline &= \frac{\sigma^2}{n} \mathrm{Tr} \left( I_{p} \right) + \sigma^2 \nonumber \newline &= \frac{\sigma^2 p}{n} + \sigma^2 \label{exp_err} \end{align} $$

Here, we make use of the fact that $X_0^{\intercal} (Var(X))^{-1} X_0 \in \mathbb{R}$, and thus equals its own trace, in $\eqref{trace}$; the cyclic property of the trace in $\eqref{cyclic_trace}$; the linearity of the trace (and of the expectation) in $\eqref{linear}$; and the fact that $\mathbb{E}_{X_0}\left[ X_0 X_0^{\intercal} \right] = Var(X_0) = Var(X)$ for the zero-mean covariates. The result in $\eqref{exp_err}$ shows that the expected test error exceeds the irreducible error $\sigma^2$ by a term that scales linearly with $\frac{p}{n}$. That is, as long as $p \ll n$ (or $\sigma^2$ is small), this excess error will be small.
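
A rough simulation of the scaling in $\eqref{exp_err}$ (again with arbitrary settings of my own; the approximation is asymptotic, so the agreement tightens as $n$ grows relative to $p$):

```python
import numpy as np

rng = np.random.default_rng(2)

p, sigma, n_rounds = 10, 1.0, 5000       # arbitrary simulation settings

for n in (50, 100, 200, 400):
    errs = []
    for _ in range(n_rounds):
        beta = rng.normal(size=p)
        X = rng.normal(size=(n, p))              # zero-mean covariates
        y = X @ beta + sigma * rng.normal(size=n)
        beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
        x0 = rng.normal(size=p)                  # test point from the same distribution
        y0 = x0 @ beta + sigma * rng.normal()
        errs.append((y0 - x0 @ beta_hat) ** 2)
    # Empirical expected test error vs. the asymptotic value sigma^2 * p / n + sigma^2
    print(n, np.mean(errs), sigma**2 * p / n + sigma**2)
```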

Gauss-Markov

This all quite naturally leads one to recall the Gauss-Markov theorem that underpins least squares: no other unbiased linear estimator of $\beta$ has a variance smaller than that of the least squares estimator $\hat{\beta}$. To see this, let $\tilde{\beta}$ be any unbiased linear estimator of $\beta$. By linearity, we can express $\tilde{\beta}$, for some $M \in \mathbb{R}^{p \times n}$, as

$$ \begin{align} \tilde{\beta} &= \left( (X^{\intercal}X)^{-1} X^{\intercal} + M \right) y \nonumber \end{align} $$

Taking the expectation and using $\mathbb{E}\left[ y \right] = X\beta$ leads to

$$ \begin{align} \mathbb{E} [ \tilde{\beta} ] &= \left( (X^{\intercal}X)^{-1} X^{\intercal} + M \right) X \beta = \beta + MX\beta \nonumber \newline \implies MX &= 0 \label{tilde_beta_eq} \end{align} $$

since $\tilde{\beta}$ is unbiased and this must hold for every $\beta$. Comparing $Var(\tilde{\beta})$ with $Var(\hat{\beta})$, and using $\eqref{tilde_beta_eq}$ to eliminate the cross terms, we have that

$$ \begin{align} Var(\tilde{\beta}) &= \left((X^{\intercal}X)^{-1} X^{\intercal} + M \right) \cdot Var(y) \cdot \left( X(X^{\intercal}X)^{-1} + M^{\intercal} \right) \nonumber \newline &= \sigma^2 \left( (X^{\intercal}X)^{-1} + (X^{\intercal}X)^{-1} (MX)^{\intercal} + MX (X^{\intercal}X)^{-1} + MM^{\intercal} \right) \nonumber \newline &= \sigma^2 \left( (X^{\intercal}X)^{-1} + MM^{\intercal} \right) \nonumber \newline &= Var(\hat{\beta}) + \sigma^2 MM^{\intercal} \nonumber \end{align} $$

Note that $MM^{\intercal} \succeq 0$, and thus $Var(\tilde{\beta}) \succeq Var(\hat{\beta})$, which proves the result.
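
As a numerical illustration of the argument (not part of the original exercises), one way to build an alternative unbiased linear estimator is to take any $M$ of the form $M = C\left(I - X(X^{\intercal}X)^{-1}X^{\intercal}\right)$ for an arbitrary $C \in \mathbb{R}^{p \times n}$, which satisfies $MX = 0$, and then check that $Var(\tilde{\beta}) - Var(\hat{\beta})$ is positive semidefinite:

```python
import numpy as np

rng = np.random.default_rng(3)

n, p, sigma = 30, 4, 1.0             # arbitrary settings
X = rng.normal(size=(n, p))

XtX_inv = np.linalg.inv(X.T @ X)
H = X @ XtX_inv @ X.T                # hat matrix

# Any M of the form C (I - H) satisfies M X = 0, so the resulting
# linear estimator ((X^T X)^{-1} X^T + M) y is unbiased.
C = rng.normal(size=(p, n))
M = C @ (np.eye(n) - H)
print(np.abs(M @ X).max())           # ~0 up to floating point error

var_hat = sigma**2 * XtX_inv         # Var(beta_hat)
A = XtX_inv @ X.T + M
var_tilde = sigma**2 * A @ A.T       # Var(tilde_beta) = sigma^2 A A^T

# The difference should equal sigma^2 M M^T and be positive semidefinite
diff = var_tilde - var_hat
print(np.allclose(diff, sigma**2 * M @ M.T))
print(np.linalg.eigvalsh(diff).min() >= -1e-10)  # eigenvalues nonnegative (up to tolerance)
```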
