Notes on Least Squares
Sometimes it helps to have the math in front of you
Pretty much all the statistics courses I took this quarter emphasised mathematical intuition over raw application and manipulation of mathematics, which is quite different to the courses I’ve taken in the past. The point was that with enough understanding of the underlying principles, it is possible to draw conclusions without having to explicitly prove a result. Nevertheless, sometimes it helps to have the math in front of you – on one hand, it reinforces the ideas you are learning about; on the other, you may not be convinced of a result unless you can also write it down.
Here, we will look into some of the foundational results of the linear least squares model. Some parts will have less detail than others, but hopefully what we end up covering is sufficient as a standalone post for those who already have some familiarity with the topic.
Preliminaries
Let $y \in \mathbb{R}^{n}$ be the vector of responses corresponding to a design matrix $X \in \mathbb{R}^{n \times p}$, where $n$ is the number of observations in the given dataset, and $p$ is the number of covariates. Our linear model is then of the form
$$ \begin{equation} y = X\beta + \epsilon \label{lmeqn} \end{equation} $$
For $i = 1, \dots, n$, let $y_i$ and $\epsilon_i$ be the $i^{th}$ entries of $y$ and $\epsilon$, and let $X_i$ be the $i^{th}$ row of $X$. Then, we can also write $\eqref{lmeqn}$ as
$$ \begin{equation} y_i = X_i \beta + \epsilon_i \label{lmeqnrow} \end{equation} $$
The entries of $\epsilon \in \mathbb{R}^{n}$ are called noise variables, which have zero mean, constant variance $\sigma^2$, and zero correlation with each other. For $j \ne i$, these properties can be written as
$$ \begin{align} \mathbb{E}[\epsilon_i] &= 0 \label{eeps} \newline Var(\epsilon_i) &= \sigma^2 \label{vareps} \newline Cov(\epsilon_i, \epsilon_j) &= 0 \label{coveps} \end{align} $$
These are known as the assumptions of the linear model – sometimes an additional normality assumption is also made, but for our purposes this will not be necessary. Finally, let $\beta \in \mathbb{R}^{p}$ be the vector of coefficients in our model, whose entries are unknown values. We will first look into using the known quantities of $X$ and $y$ to solve for $\beta$.
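To make the setup concrete, here is a minimal simulation sketch in Python with numpy (not part of the derivations that follow; the sizes, seed, and variable names such as `n`, `p`, and `sigma` are arbitrary choices of mine) that generates data satisfying the assumptions above.

```python
import numpy as np

# A small simulation of the linear model y = X beta + eps under the stated assumptions.
rng = np.random.default_rng(0)

n, p = 100, 3            # number of observations and covariates (arbitrary choices)
sigma = 0.5              # noise standard deviation, so Var(eps_i) = sigma**2

X = rng.normal(size=(n, p))            # fixed, known design matrix
beta = np.array([2.0, -1.0, 0.5])      # fixed but unknown coefficients (known here only because we simulate)
eps = rng.normal(scale=sigma, size=n)  # zero mean, constant variance, uncorrelated noise

y = X @ beta + eps                     # observed response
```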
The least squares solution
The least squares solution $\hat{\beta}$ to our linear model is the solution to the following minimisation problem
$$ \begin{equation} \hat{\beta} = \underset{\beta}{\mathrm{argmin}} \frac{1}{2} \Vert y - X \beta \Vert_{2}^{2} \label{lsprb} \end{equation} $$
Here, $\Vert \cdot \Vert_{2}$ is the $\ell_2$ norm. It follows that
$$ \begin{align*} \underset{\beta}{\mathrm{argmin}} \frac{1}{2} \Vert y - X \beta \Vert_{2}^{2} &= \underset{\beta}{\mathrm{argmin}} \frac{1}{2} (y - X \beta)^{\intercal} (y - X \beta) \newline &= \underset{\beta}{\mathrm{argmin}} \frac{1}{2} \left( y^{\intercal} y - y^{\intercal} X \beta - \beta^{\intercal} X^{\intercal} y + \beta^{\intercal} X^{\intercal} X \beta \right) \newline &= \underset{\beta}{\mathrm{argmin}} \frac{1}{2} \left( y^{\intercal} y - 2 \beta^{\intercal} X^{\intercal} y + \beta^{\intercal} X^{\intercal} X \beta \right) \end{align*} $$
Then, taking the derivative with respect to $\beta$ of the above expression, and setting this to zero, we have
$$ \begin{align} & \frac{\partial}{\partial \beta} \frac{1}{2} \left( y^{\intercal} y - 2 \beta^{\intercal} X^{\intercal} y + \beta^{\intercal} X^{\intercal} X \beta \right) = X^{\intercal} X \beta - X^{\intercal} y \nonumber \newline \implies \qquad & X^{\intercal} X \hat{\beta} - X^{\intercal} y = 0 \nonumber \newline \qquad & \hat{\beta} = (X^{\intercal} X)^{-1} X^{\intercal} y \label{lssol} \end{align} $$
Note that the last step requires $X^{\intercal} X$ to be invertible, which holds whenever $X$ has full column rank. For completeness, we should also check that $\hat{\beta}$ is indeed a minimiser of $\eqref{lsprb}$. One way is to check the second derivative of $\frac{1}{2} \Vert y - X \beta \Vert_{2}^{2}$ with respect to $\beta$, but a more straightforward way is to note that $\frac{1}{2} \Vert y - X \beta \Vert_{2}^{2}$ is a convex function of $\beta$ (it is the convex squared $\ell_2$ norm composed with an affine map of $\beta$), and thus any point where the gradient is zero is a global minimum by first order conditions. Now that we have the least squares solution $\hat{\beta}$, our vector of fitted response values $\hat{y}$ can be expressed as
$$ \begin{align} \hat{y} &= X \hat{\beta} \label{lseqn} \newline &= X (X^{\intercal}X)^{-1} X^{\intercal} y \label{lseqnfull} \end{align} $$
For $i = 1, \dots, n$, we can also write this as
$$ \begin{align} \hat{y}_i &= X_i \hat{\beta} \label{lseqnrow} \newline &= X_i (X^{\intercal}X)^{-1} X^{\intercal} y \label{lseqnrowfull} \end{align} $$
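Continuing the simulation sketch from earlier, $\hat{\beta}$ and $\hat{y}$ can be computed directly. This is only an illustrative sketch: solving the normal equations with `np.linalg.solve` (or using `np.linalg.lstsq`) is generally preferred to forming $(X^{\intercal}X)^{-1}$ explicitly.

```python
# Solve the normal equations X^T X beta_hat = X^T y for the least squares solution.
XtX = X.T @ X
Xty = X.T @ y

beta_hat = np.linalg.solve(XtX, Xty)            # equivalent to (X^T X)^{-1} X^T y
beta_hat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_hat_lstsq)    # both routes give the same solution

y_hat = X @ beta_hat                            # fitted response values
```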
What’s random and what isn’t
It’s a good idea to revisit which quantities are random, and which are not, before we go into the properties of the least squares solution. Recall that $X$ is a fixed and known quantity, and $\beta$ is a fixed but unknown quantity. Thus, it follows that
$$ \begin{align} \mathbb{E}[X] &= X \label{ex} \newline \mathbb{E}[\beta] &= \beta \label{ebeta} \newline Var(X) = Var(\beta) &= 0 \label{var0} \end{align} $$
What about $y$? Let $I$ be the identity matrix of size $n \times n$. Then, following from $\eqref{lmeqn}$, and the results from $\eqref{eeps}$, $\eqref{vareps}$, $\eqref{ex}$, $\eqref{ebeta}$, and $\eqref{var0}$, we have
$$ \begin{align} \mathbb{E}[y] &= \mathbb{E}[X \beta + \epsilon] \nonumber \newline &= X \beta + \mathbb{E}[\epsilon] \nonumber \newline &= X \beta \label{ey} \newline Var(y) &= Var(X \beta + \epsilon) \nonumber \newline &= Var(\epsilon) \nonumber \newline &= \sigma^2 \cdot I \label{vary} \end{align} $$
The notation is a bit liberal here. In particular, note that the expectation and variance of $\epsilon$ are actually operations on each entry of $\epsilon$. More precisely, the expectation of $\epsilon$ is a zero vector of length $n$, and the variance of $\epsilon$ is a covariance matrix $\Sigma = \sigma^2 \cdot I$ of size $n \times n$ with diagonal entries equal to $\sigma^2$ and zero everywhere else. We can also see that the variance in $y$ arises purely from $\epsilon$, which is a random and unknown quantity. This may seem odd, since the $y$ we observe is a known and fixed quantity, but one way to reconcile this is to treat the observed $y$ as a single realisation of a random process: the $X \beta$ component is fixed, while the $\epsilon$ component contributes randomness that sits outside of our linear model.
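If you want to see these moments emerge numerically, here is a rough Monte Carlo check (continuing the earlier sketch; `n_sims` is another arbitrary choice): hold $X$ and $\beta$ fixed, redraw $\epsilon$ many times, and compare the sample mean and covariance of $y$ with $X \beta$ and $\sigma^2 I$.

```python
# Redraw eps many times to estimate E[y] and Var(y) empirically.
n_sims = 20000
Y = X @ beta + rng.normal(scale=sigma, size=(n_sims, n))   # each row is one realisation of y

print(np.max(np.abs(Y.mean(axis=0) - X @ beta)))           # ~0 up to Monte Carlo error
emp_cov = np.cov(Y, rowvar=False)                          # n x n sample covariance of y
print(np.max(np.abs(emp_cov - sigma**2 * np.eye(n))))      # ~0 up to Monte Carlo error
```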
Let’s turn our attention to the $\hat{\beta}$ and $\hat{y}$ estimates. $\eqref{lseqn}$ tells us that $\hat{y}$ depends on $\hat{\beta}$, and $\eqref{lssol}$ tells us that $\hat{\beta}$ depends on $y$. It follows that $\hat{y}$ and $\hat{\beta}$ are random quantities, so we will look at the expectations and variances of these next.
Quality of least squares estimates
The estimates $\hat{\beta}$ and $\hat{y}$ are important because these results reflect the quality of our linear model. Using $\eqref{lssol}$, $\eqref{ex}$, and $\eqref{ey}$, we have
$$ \begin{align} \mathbb{E}[\hat{\beta}] &= \mathbb{E}[(X^{\intercal} X)^{-1} X^{\intercal} y] \nonumber \newline &= (X^{\intercal} X)^{-1} X^{\intercal} \mathbb{E}[y] \nonumber \newline &= (X^{\intercal} X)^{-1} X^{\intercal} X \beta \nonumber \newline &= \beta \label{ehatbeta} \newline \mathbb{E}[\hat{y}] &= \mathbb{E}[X \hat{\beta}] \nonumber \newline &= X \beta \nonumber \newline &= \mathbb{E}[y] \label{ehaty} \end{align} $$
We have now shown that $\hat{\beta}$ is an unbiased estimator of $\beta$, and that the expected value of $\hat{y}$ is equal to the expected value of $y$. Furthermore,
$$ \begin{align} Var(\hat{\beta}) &= Var((X^{\intercal} X)^{-1} X^{\intercal} y) \nonumber \newline &= (X^{\intercal} X)^{-1} X^{\intercal} \cdot Var(y) \cdot X (X^{\intercal} X)^{-1} \nonumber \newline &= \sigma^2 (X^{\intercal}X)^{-1} \label{varhatbeta} \newline Var(\hat{y}) &= Var(X \hat{\beta}) \nonumber \newline &= Var(X (X^{\intercal} X)^{-1} X^{\intercal} y) \nonumber \newline &= X (X^{\intercal} X)^{-1} X^{\intercal} \cdot Var(y) \cdot X (X^{\intercal} X)^{-1} X^{\intercal} \nonumber \newline &= \sigma^2 X (X^{\intercal} X)^{-1} X^{\intercal} \label{varhaty} \end{align} $$
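As a sanity check on $\eqref{ehatbeta}$ and $\eqref{varhatbeta}$, we can refit the model on many simulated responses (again continuing the earlier sketch) and compare the sample mean and covariance of $\hat{\beta}$ with $\beta$ and $\sigma^2 (X^{\intercal}X)^{-1}$.

```python
# Monte Carlo check that beta_hat is unbiased and has covariance sigma^2 (X^T X)^{-1}.
XtX_inv = np.linalg.inv(XtX)

beta_hats = np.array([
    np.linalg.solve(XtX, X.T @ (X @ beta + rng.normal(scale=sigma, size=n)))
    for _ in range(20000)
])

print(np.max(np.abs(beta_hats.mean(axis=0) - beta)))                          # ~0 (unbiasedness)
print(np.max(np.abs(np.cov(beta_hats, rowvar=False) - sigma**2 * XtX_inv)))   # ~0
```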
Observe that the variances of $\hat{\beta}$ and $\hat{y}$ reveal a number of interesting properties regarding linear models:
- The variances of both our least squares solution and our fitted response are positively associated with $\sigma^2$. That is, if $\sigma^2$ is large, the variances are also large. Intuitively, if the variance of our noise variables is large, $y$ has a larger component of randomness, and precisely estimating the underlying $\beta$ coefficients becomes more difficult – imagine a scenario where we had zero noise (or noise with zero variance) so that our response could be perfectly explained by $X \beta$; here $\hat{\beta}$ would equal $\beta$, and as we introduce noise with increasingly large variance into our model, our ability to estimate $\beta$ deteriorates. It follows that our fitted response values $\hat{y}$ suffer from this increased variability in $\hat{\beta}$
- The variances are inversely associated with $X^{\intercal} X$. It’s worth looking at $X^{\intercal} X$ more closely here. Firstly, recall that $X \in \mathbb{R}^{n \times p}$, and thus $X^{\intercal} X \in \mathbb{R}^{p \times p}$. Now, consider the scenario where a new covariate is added to $X$ so that our new $\tilde{X}$ is of size $n \times (p+1)$, and $\tilde{X}^{\intercal} \tilde{X}$ is of size $(p+1) \times (p+1)$. Then, let $\tilde{\beta}$ be the least squares solution when $\tilde{X}$ is the design matrix. One result you should convince yourself of is that the variances of the entries of $\tilde{\beta}$ corresponding to the original $p$ covariates are (almost always) larger than the variances of the corresponding entries of $\hat{\beta}$. That is, the more covariates that are added to $X$, the more variance we can expect in our least squares solution. We will skip the technical details, but the underlying principle that makes this result true is a positive semi-definite ordering: the leading $p \times p$ block of $(\tilde{X}^{\intercal}\tilde{X})^{-1}$ minus $(X^{\intercal}X)^{-1}$ can be shown to be a positive semi-definite matrix (see the numerical sketch after this list). Intuitively, every time we add a covariate to the design matrix, we increase the chances of getting a more complex least squares solution, and more complex models tend to have higher variances than simpler ones
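Here is the numerical sketch mentioned above: append one extra (deliberately correlated, and entirely made up) column to $X$, then compare the implied coefficient variances $\sigma^2 \cdot \mathrm{diag}((X^{\intercal}X)^{-1})$ before and after, along with the positive semi-definite ordering of the leading $p \times p$ block.

```python
# Adding a covariate: coefficient variances for the original covariates do not decrease.
x_new = X[:, 0] + rng.normal(scale=0.5, size=n)    # hypothetical extra covariate, correlated with column 0
X_tilde = np.column_stack([X, x_new])

var_before = sigma**2 * np.diag(np.linalg.inv(X.T @ X))
var_after = sigma**2 * np.diag(np.linalg.inv(X_tilde.T @ X_tilde))[:p]   # same p covariates

print(var_before)
print(var_after)      # entry-wise >= var_before (strictly larger unless x_new is orthogonal to X)

# The leading p x p block of (X_tilde^T X_tilde)^{-1} minus (X^T X)^{-1} is positive semi-definite.
diff = np.linalg.inv(X_tilde.T @ X_tilde)[:p, :p] - np.linalg.inv(X.T @ X)
print(np.linalg.eigvalsh(diff).min())   # >= 0 up to floating point error
```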
The quantity we really want to understand, though, is not the least squares solution or its implied fitted response, but rather the difference between the observed response $y$ and the fitted response $\hat{y}$. This is called the vector of residuals $\hat{\epsilon} \in \mathbb{R}^{n}$ of the model, and is formalised as
$$ \begin{align} \hat{\epsilon} &= y - \hat{y} \nonumber \newline &= y - X \hat{\beta} \label{reseqn} \end{align} $$
Analysis of residuals
Recall that $\epsilon$ is the difference between $y$ and $X \beta$, so another way of interpreting $\epsilon$ is as the unobserved error in the linear model. Using $\hat{\epsilon}$ to denote the “error” between $y$ and $X \hat{\beta}$ then seems quite natural. Let’s look at the expectation and variance of $\hat{\epsilon}$. Using $\eqref{vary}$, $\eqref{ehatbeta}$, $\eqref{ehaty}$, and $\eqref{reseqn}$, we have
$$ \begin{align} \mathbb{E}[\hat{\epsilon}] &= \mathbb{E}[y - X \hat{\beta}] \nonumber \newline &= X \beta - X \mathbb{E}[\hat{\beta}] \nonumber \newline &= X \beta - X \beta \nonumber \newline &= 0 \label{ehateps} \newline Var(\hat{\epsilon}) &= Var(y - X \hat{\beta}) \nonumber \newline &= Var(y - X (X^{\intercal}X)^{-1} X^{\intercal} y) \nonumber \newline &= (I - X(X^{\intercal}X)^{-1} X^{\intercal}) \cdot Var(y) \cdot (I - X(X^{\intercal}X)^{-1} X^{\intercal}) \nonumber \newline &= \sigma^2 (I - X(X^{\intercal}X)^{-1} X^{\intercal}) \label{varhateps} \end{align} $$
Here, we make use of the property that $H = X(X^{\intercal}X)^{-1} X^{\intercal}$ is idempotent so that $H \cdot H = H$. We have now shown that the expected value of $\hat{\epsilon}$ is equal to zero, and that its variance, expectedly, scales with $\sigma^2$. What about the $I - H$ component of the variance? Firstly, note that $H \in \mathbb{R}^{n \times n}$, and observe that the diagonal entries of $H$ (often called leverages) are bounded between 0 and 1 – this is a direct consequence of the fact that $H$ is a symmetric and idempotent matrix. Then, observations with high values on this diagonal will tend to have smaller residual variances. That is, the linear model will fit these observations more closely than others, which can lead to unnatural results, and this behaviour is one of the main topics of interest in model diagnostics.
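The properties of $H$ discussed above are easy to verify numerically, continuing the same sketch: idempotence, diagonal entries between 0 and 1, and residual variances of $\sigma^2 (1 - H_{ii})$, so that high-leverage observations have less variable residuals.

```python
# Check the hat matrix properties: H is idempotent and its diagonal entries lie in [0, 1].
H = X @ np.linalg.solve(X.T @ X, X.T)       # H = X (X^T X)^{-1} X^T
print(np.allclose(H @ H, H))                # idempotent
h = np.diag(H)
print(h.min(), h.max())                     # both within [0, 1]

# Residual variances are sigma^2 * (1 - H_ii): simulate many responses and compare.
Y_sim = X @ beta + rng.normal(scale=sigma, size=(20000, n))
resid_sim = Y_sim - Y_sim @ H               # eps_hat = (I - H) y for each simulated y (H is symmetric)
print(np.max(np.abs(resid_sim.var(axis=0) - sigma**2 * (1 - h))))   # ~0 up to Monte Carlo error
```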
So far we have only looked at the behaviour of residuals without considering the relationships between residuals and other estimated quantities. We now turn to look at the covariance between the residuals and the least squares solution, and the covariance between the residuals and the fitted response. Using $\eqref{lssol}$, $\eqref{vary}$, and $\eqref{varhatbeta}$, we have
$$ \begin{align} Cov(\hat{\beta}, \hat{\epsilon}) &= Cov(\hat{\beta}, y - X \hat{\beta}) \nonumber \newline &= Cov(\hat{\beta}, y) - Cov(\hat{\beta}, X \hat{\beta}) \nonumber \newline &= (X^{\intercal}X)^{-1} X^{\intercal} \cdot Var(y) - Var(\hat{\beta}) \cdot X^{\intercal} \nonumber \newline &= \sigma^2 (X^{\intercal}X)^{-1} X^{\intercal} - \sigma^2 (X^{\intercal}X)^{-1} X^{\intercal} \nonumber \newline &= 0 \label{covhatbetaeps} \newline Cov(\hat{y}, \hat{\epsilon}) &= Cov(X \hat{\beta}, y - X \hat{\beta}) \nonumber \newline &= Cov(X \hat{\beta}, y) - Var(X \hat{\beta}) \nonumber \newline &= X (X^{\intercal}X)^{-1} X^{\intercal} \cdot Var(y) - X \cdot Var(\hat{\beta}) \cdot X^{\intercal} \nonumber \newline &= \sigma^2 X (X^{\intercal}X)^{-1} X^{\intercal} - \sigma^2 X (X^{\intercal}X)^{-1} X^{\intercal} \nonumber \newline &= 0 \label{covhatyeps} \end{align} $$
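These zero covariances have a tangible counterpart in any single fit: the normal equations force $X^{\intercal} \hat{\epsilon} = 0$, so the residuals are exactly orthogonal to the columns of $X$ and to $\hat{y}$. A rough check of both the exact orthogonality and the Monte Carlo cross-covariance (continuing the earlier sketch) might look like the following.

```python
# Exact orthogonality for the single fit computed earlier: X^T eps_hat = 0 and y_hat^T eps_hat = 0.
eps_hat = y - y_hat
print(np.max(np.abs(X.T @ eps_hat)))        # ~0 up to floating point error
print(abs(y_hat @ eps_hat))                 # ~0 as well

# Monte Carlo estimate of the cross-covariance between fitted values and residuals.
Y_sim = X @ beta + rng.normal(scale=sigma, size=(20000, n))
B = np.linalg.solve(X.T @ X, X.T @ Y_sim.T).T        # row k is beta_hat for the k-th simulated y
Yhat_sim = B @ X.T
Resid_sim = Y_sim - Yhat_sim
cross = (Yhat_sim - Yhat_sim.mean(axis=0)).T @ (Resid_sim - Resid_sim.mean(axis=0)) / (len(Y_sim) - 1)
print(np.max(np.abs(cross)))                # ~0 up to Monte Carlo error
```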
We have just proven an important, yet expected, result: the covariance between the residuals and the least squares estimates is always zero. Why is this important? It shows that least squares guarantees there is no linear association between the residuals and our fitted response $\hat{y}$ (via $\hat{\beta}$). In other words, the residuals retain none of the linear structure in $y$ that $X$ can explain. Let’s derive one additional result to reiterate this point.
$$ \begin{align} Cov(\hat{\beta}, \hat{y}) &= Cov(\hat{\beta}, X \hat{\beta}) \nonumber \newline &= Var(\hat{\beta}) \cdot X^{\intercal} \nonumber \newline &= \sigma^2 (X^{\intercal}X)^{-1} X^{\intercal} \label{covhatbetay} \end{align} $$
Here, we can see that the covariance between $\hat{\beta}$ and $\hat{y}$ is non-zero as long as we aren’t dealing with trivial cases. Combining this with the results above, we can see that our least squares solution $\hat{\beta}$ accounts for all of the linear association between $X$ and $y$, leaving none behind in the residuals.
Wrapping up
There’s a lot of math you can write out if you want to see it written down, and we only covered a small subset of it here. Writing it out isn’t a bad thing even if you already have the intuition for how things should behave. Shout out if you notice any mistakes!