Linear Regression by Hand

Univariate Regression

Suppose we have the linear model $y = x \beta + \epsilon$, where $y, x, \epsilon \in \mathbb{R}^{n}$, $\beta \in \mathbb{R}$, and the entries of $\epsilon$ have zero mean, constant variance $\sigma^2$, and are uncorrelated with each other. Then, regressing $y$ on $x$ gives

$$ \begin{align} \hat{\beta} &= \frac{x^{\intercal} y}{x^{\intercal} x} \label{univariate_x} \end{align} $$
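
As a quick numerical check of $\eqref{univariate_x}$, here is a minimal numpy sketch on synthetic data (the names x, y, and beta_hat simply mirror the notation above) comparing the closed form with numpy's general least squares solver:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 3.0 * x + rng.normal(scale=0.1, size=50)

beta_hat = (x @ y) / (x @ x)                                    # x'y / x'x
beta_lstsq = np.linalg.lstsq(x[:, None], y, rcond=None)[0][0]   # no-intercept least squares
print(beta_hat, beta_lstsq)                                     # the two values should agree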

Now, suppose we add an intercept $x_0 = \mathbf{1} \in \mathbb{R}^{n}$, and its corresponding coefficient $\beta_0 \in \mathbb{R}$ to the model, so that $y = x_0 \beta_0 + x \beta + \epsilon$. Then, regressing $y$ on $x_0$ and $x$ gives

$$ \begin{align} \hat{\beta} &= \frac{(x - \bar{x}\mathbf{1})^{\intercal} y}{(x - \bar{x}\mathbf{1})^{\intercal} (x - \bar{x}\mathbf{1})} \label{univariate_1x_1} \newline \hat{\beta}_0 \mathbf{1} &= \hat{y} - x \hat{\beta} \nonumber \newline \hat{\beta}_0 \mathbf{1} \cdot \frac{\mathbf{1}}{\mathbf{1}^{\intercal} \mathbf{1}} &= (\hat{y} - x \hat{\beta}) \cdot \frac{\mathbf{1}}{\mathbf{1}^{\intercal} \mathbf{1}} \nonumber \newline \hat{\beta}_0 &= \bar{y} - \bar{x} \hat{\beta} \label{univariate_1x_0} \end{align} $$

where $\bar{x}$ and $\bar{y}$ are the respective sample means of $x$ and $y$. (The last step uses the fact that the residuals sum to zero once an intercept is included, so $\mathbf{1}^{\intercal} \hat{y} = \mathbf{1}^{\intercal} y$ and the mean of $\hat{y}$ is just $\bar{y}$.)
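
Similarly, here is a small sketch on synthetic data confirming that these closed forms match the slope and intercept returned by np.polyfit:

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 0.7 + 3.0 * x + rng.normal(scale=0.1, size=50)

xc = x - x.mean()                       # mean-centred x
beta_hat = (xc @ y) / (xc @ xc)
beta0_hat = y.mean() - x.mean() * beta_hat
slope, intercept = np.polyfit(x, y, 1)  # degree-1 polynomial fit: slope first, then intercept
print(beta_hat, slope)                  # should agree
print(beta0_hat, intercept)             # should agree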

What are we actually doing here?

Comparing $\eqref{univariate_x}$ and $\eqref{univariate_1x_1}$, we see that the fitted $\beta$ when regressing $y$ on $x_0$ and $x$ is the same fitted $\beta$ when regressing $y$ on $x - \bar{x}\mathbf{1}$. That is, including an intercept in the regression of $y$ on $x$ results in a $\hat{\beta}$ that is equivalent to the $\hat{\beta}$ in a regression of $y$ on a mean-centred $x$ (and we then use this to solve for $\hat{\beta}_0$).

Why?

Suppose we were to regress $x$ on $x_0 = \mathbf{1}$ instead. Then, applying $\eqref{univariate_x}$, we (unsurprisingly) have

$$ \begin{align} \hat{\beta}_0 &= \frac{\mathbf{1}^{\intercal} x}{\mathbf{1}^{\intercal} \mathbf{1}} \nonumber \newline &= \bar{x} \label{univariate_1x_0_tmp} \end{align} $$

So, $z_0 = x - \bar{x} \mathbf{1}$ is the residual from regressing $x$ on $\mathbf{1}$, and then $\hat{\beta}$ from $\eqref{univariate_1x_1}$ is the regression of $y$ on $z_0$. Recall that the fitted response of a regression is the projection of the response on to the subspace spanned by the covariates. It follows that the residual is orthogonal to the subspace spanned by the covariates. In the regression of $x$ on $\mathbf{1}$, $z_0$ is therefore orthogonal to $\mathbf{1}$, and in the regression of $y$ on $z_0$, we are actually regressing $y$ on the component of $x$ that is orthogonal to $\mathbf{1}$.

In total, we have shown that the regression of $y$ on $x_0 = \mathbf{1}$ and $x$ can be thought of as first regressing $x$ on $\mathbf{1}$, and then regressing $y$ on the residual $z_0 = x - \bar{x} \mathbf{1}$ to compute $\hat{\beta}$ using $\eqref{univariate_1x_1}$, and then solving for $\hat{\beta}_0$ using $\eqref{univariate_1x_0}$.

But again, why does this work?

Note that $\text{span}\{ \mathbf{1},x \} = \text{span}\{\mathbf{1}, x - \bar{x} \mathbf{1} \}$, so we are regressing $y$ on to the same subspace. If $x$ is not orthogonal to $\mathbf{1}$, then $x$ has a component along $\mathbf{1}$ (equivalently, a non-zero mean), and performing the usual least squares regression resolves this overlap for us. Alternatively, we can regress $y$ on to the same subspace spanned by vectors that are orthogonal to each other, and resolve the overlap ourselves. Doing so gives us $\hat{\beta}$, and we can then back-solve for the remaining coefficients.
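
As a small sketch on synthetic data, we can check both facts at once: the residual $z_0$ is orthogonal to $\mathbf{1}$, and regressing $y$ on $[\mathbf{1}, x]$ or on $[\mathbf{1}, x - \bar{x}\mathbf{1}]$ produces identical fitted values, since the column spans coincide:

import numpy as np

rng = np.random.default_rng(2)
n = 50
ones = np.ones(n)
x = rng.normal(loc=2.0, size=n)               # deliberately non-zero mean, so x is not orthogonal to 1
y = 0.7 + 3.0 * x + rng.normal(scale=0.1, size=n)

z0 = x - x.mean() * ones                      # residual from regressing x on 1
X_raw = np.column_stack([ones, x])
X_orth = np.column_stack([ones, z0])

def fitted(X):
    return X @ np.linalg.lstsq(X, y, rcond=None)[0]

print(ones @ z0)                              # ~0: z0 is orthogonal to 1
print(np.allclose(fitted(X_raw), fitted(X_orth)))  # True: same column span, same projection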

Multivariate Regression

Following a similar argument to $\eqref{univariate_1x_1}$ and $\eqref{univariate_1x_0}$, in a two-variable regression of $y = x_1 \beta_1 + x_2 \beta_2 + \epsilon$, where $x_1, x_2 \in \mathbb{R}^{n}$, $\beta_1, \beta_2 \in \mathbb{R}$, and $x_1 \perp x_2$, we have that

$$ \begin{align} \hat{\beta}_2 &= \frac{x^{\intercal}_2 y}{x^{\intercal}_2 x_2} \label{univariate_x_2} \nonumber \newline \hat{\beta}_1 x_1 &= \hat{y} - x_2 \hat{\beta}_2 \nonumber \newline \hat{\beta}_1 x_1 \cdot \frac{x_1}{x_1^{\intercal} x_1} &= (\hat{y} - x_2 \hat{\beta}_2) \cdot \frac{x_1}{x_1^{\intercal} x_1} \nonumber \newline \hat{\beta}_1 &= \frac{x_1^{\intercal} y}{x_1^{\intercal} x_1} \label{univariate_x_1} \nonumber \end{align} $$

These are, of course, the same fitted coefficients we would get by regressing $y$ on $x_1$ and on $x_2$ separately (with no intercept), as per the result in $\eqref{univariate_x}$. This isn’t all that interesting on its own, because we don’t actually expect $x_1 \perp x_2$ in most cases.
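
We can still sketch the orthogonal case with synthetic data, forcing $x_1 \perp x_2$ by construction and checking that the joint fit and the two separate univariate fits agree:

import numpy as np

rng = np.random.default_rng(4)
n = 60
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x2 = x2 - x1 * (x1 @ x2) / (x1 @ x1)          # orthogonalise x2 against x1, so x1'x2 = 0
y = 2.0 * x1 - 1.0 * x2 + rng.normal(scale=0.1, size=n)

joint = np.linalg.lstsq(np.column_stack([x1, x2]), y, rcond=None)[0]
separate = np.array([(x1 @ y) / (x1 @ x1), (x2 @ y) / (x2 @ x2)])
print(joint, separate)                        # the two pairs of coefficients should agree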

Suppose we have our familiar multivariate, least squares model $y = X \beta + \epsilon$, where $X = [\mathbf{1}, x_2, \dots, x_p] \in \mathbb{R}^{n \times p}$ includes a column for the intercept, and $\beta \in \mathbb{R}^{p}$ now. Recall that the procedure outlined in $\eqref{univariate_1x_1}$ and $\eqref{univariate_1x_0_tmp}$ essentially follows

  1. Define $z_1 = x_1 = \mathbf{1}$
  2. Regress $x_2$ on $x_1$ to obtain $\tilde{\beta}_{1, 2} = \frac{x_1^{\intercal} x_2}{x_1^{\intercal} x_1}$, and the residual $z_2 = x_2 - \tilde{\beta}_{1, 2} x_1$
  3. Regress $x_j$ on each of $z_1, z_2, \dots, z_{j-1}$ to obtain $\tilde{\beta}_{k,j} = \frac{z_k^{\intercal} x_j}{z_k^{\intercal} z_k}$ for $k = 1, \dots, j-1$, and the residual $z_j = x_j - \sum_{l=1}^{j-1} \tilde{\beta}_{l,j} z_{l}$ for $j = 3, \dots, p$
  4. Regress $y$ on $z_p$ to compute $\hat{\beta}_p = \frac{z_p^{\intercal} y}{z_p^{\intercal} z_p}$

This is, of course, the famous Gram-Schmidt orthogonalisation procedure. Note that we can rearrange the columns so that any given $x_j$ plays the role of the “last” vector in this procedure, and thus we can compute $\hat{\beta}_j$ for any $j = 1, \dots, p$. It follows that $\hat{\beta}_j$ is the coefficient from regressing $y$ on the residual left over from regressing $x_j$ on each of $x_1, \dots, x_{j-1}, x_{j+1}, \dots, x_{p}$.
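
Here is a sketch of that claim on synthetic data, residualising one (arbitrarily chosen) column against all of the others with a generic least squares solve and comparing against the full multivariate fit:

import numpy as np

rng = np.random.default_rng(5)
n = 80
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([0.5, 1.0, -2.0, 0.3]) + rng.normal(scale=0.1, size=n)

beta_full = np.linalg.lstsq(X, y, rcond=None)[0]   # full multivariate fit

j = 2                                              # any column works here
others = np.delete(X, j, axis=1)
# residual of x_j after regressing it on all of the other columns
z_j = X[:, j] - others @ np.linalg.lstsq(others, X[:, j], rcond=None)[0]
print(beta_full[j], (z_j @ y) / (z_j @ z_j))       # should agree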

Observe that if $x_j$ is highly correlated with the other columns $x_k$, $k \ne j$, then the residual left over from this regression will be close to zero, so the denominator in step 4 (which follows from $\eqref{univariate_x}$) will be tiny. Small changes in $y$ then lead to large changes in $\hat{\beta}_j$, making the estimate highly unstable. This is why we prefer our covariates to be uncorrelated (or nearly uncorrelated) with each other.
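
A small synthetic sketch of this instability: make $x_2$ an almost exact copy of $x_1$, note that the residual of $x_2$ on $x_1$ is tiny, and watch $\hat{\beta}_2$ swing wildly across repeated draws of the noise:

import numpy as np

rng = np.random.default_rng(6)
n = 80
x1 = rng.normal(size=n)
x2 = x1 + 1e-3 * rng.normal(size=n)            # x2 is almost an exact copy of x1
X = np.column_stack([x1, x2])

z2 = x2 - x1 * (x1 @ x2) / (x1 @ x1)           # residual of x2 on x1
print(np.linalg.norm(z2))                      # tiny, so the denominator z2'z2 is tiny

beta2_hats = []
for _ in range(200):
    y = 2.0 * x1 - 1.0 * x2 + rng.normal(scale=0.1, size=n)
    beta2_hats.append(np.linalg.lstsq(X, y, rcond=None)[0][1])
print(np.std(beta2_hats))                      # huge spread relative to the small noise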

Multivariate Regression in One Gram-Schmidt Pass

If we want to solve for all of $\hat{\beta}_1, \dots, \hat{\beta}_p$ in $\hat{\beta}$, we might be tempted to apply the Gram-Schmidt procedure once for each $j \in \{1, \dots, p \}$, but this would mean solving many regression problems. We alluded to back-solving for the remaining coefficients earlier, where we first solved for $\hat{\beta}$ and then computed $\hat{\beta}_0$ in the univariate setting. Extending this to the multivariate setting removes the need for multiple Gram-Schmidt passes. Specifically, we want to solve for $\hat{\beta}_p$ using Gram-Schmidt, and then back-solve for $\hat{\beta}_{p-1}, \dots, \hat{\beta}_1$.

Observe that step 3 of the Gram-Schmidt procedure can be rewritten as

$$ \begin{align} X &= Z \Gamma \nonumber \end{align} $$

where $Z = [z_1, \dots, z_p] \in \mathbb{R}^{n \times p}$, and $\Gamma \in \mathbb{R}^{p \times p}$ is the upper triangular matrix with $\Gamma_{k,j} = \tilde{\beta}_{k,j}$ for $k < j$, $\Gamma_{k,k} = 1$ for $k = 1, \dots, p$, and zeroes below the diagonal. Normalising this with the diagonal matrix $D \in \mathbb{R}^{p \times p}$, where $D_{j,j} = \Vert z_j \Vert$ for $j = 1, \dots, p$, gives a QR decomposition of $X$ that follows

$$ \begin{align} X &= (Z D^{-1}) (D \Gamma) \nonumber \newline &= Q R \label{qr_decomposition} \end{align} $$

where $Q \in \mathbb{R}^{n \times p}$ has orthonormal columns, and $R \in \mathbb{R}^{p \times p}$ is upper triangular. Recall that the least squares solution $\hat{\beta}$ is given by the normal equations

$$ \begin{align} (X^{\intercal}X) \hat{\beta} &= X^{\intercal} y \label{least_squares} \end{align} $$

and substituting $\eqref{qr_decomposition}$ into $\eqref{least_squares}$, then using $Q^{\intercal} Q = I$ and the invertibility of $R$ (which holds when $X$ has full column rank), gives

$$ \begin{align} R^{\intercal} Q^{\intercal} Q R \hat{\beta} &= R^{\intercal} Q^{\intercal} y \nonumber \newline R \hat{\beta} &= Q^{\intercal} y \label{least_squares_qr} \end{align} $$
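
As a sketch, we can reproduce this with numpy's built-in QR factorisation on synthetic data. Note that np.linalg.qr uses Householder reflections rather than the Gram-Schmidt pass above, so its $Q$ and $R$ may differ from ours by column signs, but solving $R \hat{\beta} = Q^{\intercal} y$ gives the same $\hat{\beta}$:

import numpy as np

rng = np.random.default_rng(7)
n = 80
X = np.column_stack([np.ones(n), rng.random(n), rng.normal(size=n)])
y = X @ np.array([0.5, 2.5, -1.44]) + rng.normal(scale=0.1, size=n)

Q, R = np.linalg.qr(X)                   # reduced QR: Q is n x p with orthonormal columns, R is p x p upper triangular
beta_qr = np.linalg.solve(R, Q.T @ y)    # solve R beta = Q'y
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(beta_qr, beta_lstsq))  # True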

Now, the last entry of each side of $\eqref{least_squares_qr}$ is

$$ \begin{align} \Vert z_p \Vert \hat{\beta}_p &= \frac{z_p^{\intercal} y}{\Vert z_p \Vert} \nonumber \newline \hat{\beta}_p &= \frac{z_p^{\intercal} y}{z_p^{\intercal} z_p} \nonumber \end{align} $$

which is the result from step 4 of Gram-Schmidt. Back-solving for entry $p-1$ can be done via

$$ \begin{align} R_{p-1, p-1} \hat{\beta}_{p-1} + R_{p-1, p} \hat{\beta}_{p} &= \frac{z_{p-1}^{\intercal} y}{\Vert z_{p-1} \Vert} \nonumber \newline \Vert z_{p-1} \Vert \hat{\beta}_{p-1} + \Vert z_{p-1} \Vert \frac{z_{p-1}^{\intercal} x_p}{z_{p-1}^{\intercal} z_{p-1}} \hat{\beta}_{p} &= \frac{z_{p-1}^{\intercal} y}{\Vert z_{p-1} \Vert} \nonumber \newline \hat{\beta}_{p-1} &= \frac{z_{p-1}^{\intercal} y}{z_{p-1}^{\intercal} z_{p-1}} - \frac{z_{p-1}^{\intercal} x_p}{z_{p-1}^{\intercal} z_{p-1}} \hat{\beta}_{p} \nonumber \end{align} $$

Likewise, we can back-solve for the remaining entries of $\hat{\beta}$, working upwards one row at a time.
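
For completeness, here is a sketch of that back-substitution as an explicit loop; back_substitute is just an illustrative helper, checked against np.linalg.solve on a small upper triangular system:

import numpy as np

def back_substitute(R, b):
    # solve R beta = b for an upper triangular R, starting from the last row
    p = R.shape[0]
    beta = np.zeros(p)
    for j in range(p - 1, -1, -1):
        beta[j] = (b[j] - R[j, j + 1:] @ beta[j + 1:]) / R[j, j]
    return beta

R = np.array([[2.0, 1.0, 0.5],
              [0.0, 3.0, -1.0],
              [0.0, 0.0, 1.5]])
b = np.array([1.0, 2.0, 3.0])
print(np.allclose(back_substitute(R, b), np.linalg.solve(R, b)))   # True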

A Python Implementation of Gram-Schmidt

Here is a simple implementation of the Gram-Schmidt procedure we outlined above. The main function is gram_schmidt, which utilises the helper functions compute_gamma_and_residuals, compute_gamma_val, and compute_D.

We test this on a small toy example, and compare how well our implementation does against the LinearRegression implementation in sklearn. The actual $\beta$ used to generate the data is $\beta_0 = 0.5$, $\beta_1 = 2.5$, $\beta_2 = -1.44$, and you can see that the fitted coefficients from LinearRegression are very close to the true coefficients. Since our procedure solves the same least squares problem through a QR decomposition, its fitted coefficients should match sklearn's up to floating-point error.

import numpy as np
from sklearn.linear_model import LinearRegression

n = 120
beta = np.array([0.5,2.5,-1.44])
p = np.shape(beta)[0]

mu = 0
sigma = 0.098765
seed = 4321
rng = np.random.default_rng(seed)
epsilon = rng.normal(mu, sigma, n)

x1 = np.ones((n,1))
x2 = rng.random((n,1))
x3 = np.repeat([1,2,3,4,5], n // 5)[:,np.newaxis]
x = np.concatenate((x1,x2,x3), axis=1)
y = x @ beta + epsilon

reg = LinearRegression().fit(x,y)
sklearn_beta_hat = np.append(reg.intercept_, reg.coef_[1:p])  # x already has a ones column, so drop its (redundant) coefficient and use the fitted intercept

def compute_gamma_val(regressor, regressand):
    # univariate (no-intercept) regression coefficient of regressand on regressor
    return (regressor.T @ regressand) / (regressor.T @ regressor)

def compute_gamma_and_residuals(x):
    # gamma[k, j] holds the coefficient from regressing x_j on the residual z_k,
    # and the columns of Z hold the residuals z_1, ..., z_p themselves
    n, p = np.shape(x)
    gamma = np.diag(np.ones(p))
    Z = np.empty((n, p))
    Z[:, 0] = x[:, 0]   # z_1 = x_1 (the intercept column in our example)
    for j_idx in range(1, p):
        for k_idx in range(j_idx):
            gamma[k_idx, j_idx] = compute_gamma_val(Z[:, k_idx], x[:, j_idx])
        # residual of x_j after regressing it on z_1, ..., z_{j-1}
        Z[:, j_idx] = x[:, j_idx] - sum(gamma[i_idx, j_idx] * Z[:, i_idx] for i_idx in range(j_idx))
    return (gamma, Z)

def compute_D(residuals):
    _,p = np.shape(residuals)
    norms = np.array([np.linalg.norm(residuals[:,col_idx]) for col_idx in range(p)])
    D = np.diag(norms)
    return D

def gram_schmidt(x, y):
    gamma, Z = compute_gamma_and_residuals(x)
    D = compute_D(Z)
    Q = Z @ np.linalg.inv(D)                 # Q = Z D^{-1}: normalise each residual to unit length
    R = D @ gamma                            # R = D Gamma is upper triangular
    beta_hat = np.linalg.solve(R, Q.T @ y)   # solve R beta = Q'y
    return beta_hat

print("Beta hat from sklearn: ", true_beta_hat) # this prints "Beta hat from sklearn:  [ 0.49597804  2.48354263 -1.432845]"
print("Beta hat from our G-S procedure:", gram_schmidt(x,y)) # this prints "Beta hat from our G-S procedure: [ 0.42041058  2.46752273 -1.40497118]"