[Yet Another] Backpropagation by Hand [Blog Post]

Too short; want more

I’m using Neural Networks and Deep Learning by Aggarwal as a reference; it already does a very good job of explaining the underlying mathematics, but I’m compelled to rewrite it with a bit more exposition that fills in some minor details. I’m also dropping the illustrations; they may be helpful if you haven’t seen this material before, and can be found in the textbook. Most of the notation is identical to the textbook’s, although I’ve made some stylistic changes here and there.

The short

Denote by $\mathcal{P}$ the set of forward paths through the hidden layers, where a path through $l$ hidden layers can be expressed as the sequence of outputs $(h_1, \dots, h_l)$ emitted by those layers, followed by the output $o$ at the end of the sequence. Furthermore, let the weight between layers $j$ and $j+1$ be $w_{(j,j+1)}$ for $j = 1, \dots, l-1$, and let the loss associated with $o$ be $L$. Specifically, we have

$$ \begin{align} h_{j+1} &= \Phi_{j+1}(a_{j+1}) \nonumber \newline &= \Phi_{j+1}(w_{(j,j+1)} \cdot h_j) \qquad \forall j = 1, \dots, l-1 \label{layer_output} \end{align} $$

where $\Phi_{j+1}$ is the activation function at the $(j+1)^{th}$ hidden layer of the given path, that is applied to the pre-activation value $a_{j+1}$ passed from the $j^{th}$ hidden layer. For the edge cases involving the input $i$ and output $o$, we use a similar convention of

$$ \begin{align} h_1 &= \Phi_1(a_1) = \Phi_1(w_{(0,1)} \cdot i) \nonumber \newline o &= \Phi_o(a_o) = \Phi_o(w_{(l,o)} \cdot h_l) \nonumber \end{align} $$
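Before moving on, here is a minimal sketch of the forward pass for the scalar single-chain case described above. The function name, the use of numpy, and the choice of tanh for every $\Phi$ are my own placeholders; the derivation keeps the activations general.

```python
import numpy as np

# Minimal forward-pass sketch for a single chain of scalar hidden units.
# w[0] is w_{(0,1)} (input -> h_1), w[j] is w_{(j,j+1)}, w[-1] is w_{(l,o)}.
# tanh stands in for every Phi; the derivation keeps Phi general.
def forward(i, w, phi=np.tanh):
    a, h = [], []                 # pre-activations a_j and activations h_j
    x = i
    for wj in w[:-1]:             # hidden layers 1, ..., l
        a.append(wj * x)          # a_{j+1} = w_{(j,j+1)} * h_j
        h.append(phi(a[-1]))      # h_{j+1} = Phi_{j+1}(a_{j+1})
        x = h[-1]
    a_o = w[-1] * x               # a_o = w_{(l,o)} * h_l
    o = phi(a_o)                  # o = Phi_o(a_o)
    return a, h, a_o, o

# For example, forward(0.5, [0.1, 0.2, 0.3]) runs a chain with two hidden layers.
```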

Now by the chain rule, the gradient of $L$ with respect to $w_{(j,j+1)}$ along a particular path $(\tilde{h}_{j+1}, \dots, \tilde{h}_l, o)$ can be expanded as

$$ \begin{align} \frac{\partial L}{\partial w_{(j, j+1)}} &= \frac{\partial L}{\partial o} \cdot \frac{\partial o}{\partial w_{(j, j+1)}} \nonumber \newline &= \frac{\partial L}{\partial o} \cdot \frac{\partial o}{\partial \tilde{h}_l} \cdot \frac{\partial \tilde{h}_l}{\partial w_{(j, j+1)}} \nonumber \newline &= \frac{\partial L}{\partial o} \cdot \frac{\partial o}{\partial \tilde{h}_l} \cdot \frac{\partial \tilde{h}_l}{\partial \tilde{h}_{l-1}} \cdot \dots \cdot \frac{\partial \tilde{h}_{j+1}}{\partial w_{(j, j+1)}} \nonumber \newline &= \frac{\partial L}{\partial o} \cdot \frac{\partial o}{\partial \tilde{h}_l} \cdot \left[ \prod_{k=j+1}^{l-1} \frac{\partial \tilde{h}_{k+1}}{\partial \tilde{h}_{k}} \right] \cdot \frac{\partial \tilde{h}_{j+1}}{\partial w_{(j, j+1)}} \label{onevar_grad_h} \end{align} $$
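To make this concrete, take $l = 3$ and $j = 1$; the single path above $w_{(1,2)}$ is $(\tilde{h}_2, \tilde{h}_3, o)$, the product over $k$ collapses to the single factor $k = 2$, and $\eqref{onevar_grad_h}$ reads

$$ \frac{\partial L}{\partial w_{(1, 2)}} = \frac{\partial L}{\partial o} \cdot \frac{\partial o}{\partial \tilde{h}_3} \cdot \frac{\partial \tilde{h}_3}{\partial \tilde{h}_2} \cdot \frac{\partial \tilde{h}_2}{\partial w_{(1, 2)}} $$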

However, if there is more than one unique path between $L$ and $w_{(j,j+1)}$, then $\eqref{onevar_grad_h}$ needs to be adjusted using the multivariable chain rule, which essentially traverses all viable paths in $\mathcal{P}$ from $o$ down to $h_{j+1}$. Using $\eqref{layer_output}$, it follows that

$$ \begin{align} \frac{\partial L}{\partial w_{(j, j+1)}} &= \frac{\partial L}{\partial o} \cdot \frac{\partial o}{\partial w_{(j, j+1)}} \nonumber \newline &= \frac{\partial L}{\partial o} \cdot \left[ \sum_{(k,l): (h_{j+1}, \dots, h_l, o) \in \mathcal{P}} \frac{\partial o}{\partial h_l} \cdot \left[ \prod_{k=j+1}^{l-1} \frac{\partial h_{k+1}}{\partial h_{k}} \right] \right] \cdot \frac{\partial h_{j+1}}{\partial w_{(j, j+1)}} \nonumber \newline &= \frac{\partial L}{\partial o} \cdot \left[ \sum_{(k,l): (h_{j+1}, \dots, h_l, o) \in \mathcal{P}} \frac{\partial o}{\partial h_l} \cdot \left[ \prod_{k=j+1}^{l-1} \frac{\partial h_{k+1}}{\partial h_{k}} \right] \right] \cdot \left[ h_j \cdot \Phi_{j+1}'(w_{(j,j+1)} \cdot h_j) \right] \label{multivar_grad_h} \end{align} $$
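The bracketed sum can be read very literally: enumerate every path from $h_{j+1}$ to $o$ and add up the product of local derivatives along each one. The sketch below is my own toy illustration of that reading, not an efficient algorithm; `local_grad[(u, v)]` is assumed to hold $\partial h_v / \partial h_u$ for each edge. The number of paths grows exponentially with depth, which is precisely what the recursion below avoids.

```python
# Toy illustration of the bracketed path sum: enumerate every path from
# `node` to `output` in a small DAG and accumulate the product of local
# derivatives along each path. `edges[u]` lists the successors of unit u,
# and `local_grad[(u, v)]` holds the local derivative of v w.r.t. u.
def path_sum(node, output, edges, local_grad):
    if node == output:
        return 1.0
    return sum(
        local_grad[(node, nxt)] * path_sum(nxt, output, edges, local_grad)
        for nxt in edges.get(node, [])
    )

# Two paths from "h2" to "o": h2 -> h3a -> o and h2 -> h3b -> o.
edges = {"h2": ["h3a", "h3b"], "h3a": ["o"], "h3b": ["o"]}
local_grad = {("h2", "h3a"): 0.5, ("h2", "h3b"): -0.2,
              ("h3a", "o"): 1.0, ("h3b", "o"): 2.0}
print(path_sum("h2", "o", edges, local_grad))  # 0.5*1.0 + (-0.2)*2.0 = 0.1
```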

Furthermore, observe that

$$ \begin{align} & \frac{\partial L}{\partial o} \cdot \sum_{(k,l): (h_{j+1}, \dots, h_l, o) \in \mathcal{P}} \frac{\partial o}{\partial h_l} \cdot \left[ \prod_{k=j+1}^{l-1} \frac{\partial h_{k+1}}{\partial h_{k}} \right] \nonumber \newline = \qquad & \frac{\partial L}{\partial h_{j+1}} \nonumber \newline = \qquad & \sum_{j+2: (h_{j+1}, h_{j+2}) \in \mathcal{P}} \frac{\partial L}{\partial h_{j+2}} \cdot \frac{\partial h_{j+2}}{\partial h_{j+1}} \nonumber \newline = \qquad & \sum_{j+2: (h_{j+1}, h_{j+2}) \in \mathcal{P}} \frac{\partial L}{\partial h_{j+2}} \cdot \frac{\partial h_{j+2}}{\partial a_{j+2}} \cdot \frac{\partial a_{j+2}}{\partial h_{j+1}} \nonumber \newline = \qquad & \sum_{j+2: (h_{j+1}, h_{j+2}) \in \mathcal{P}} \frac{\partial L}{\partial h_{j+2}} \cdot \Phi_{j+2}'(a_{j+2}) \cdot w_{(j+1,j+2)} \label{backprop_h} \end{align} $$

Note that $\eqref{backprop_h}$ is a backwards recursion for computing $\frac{\partial L}{\partial h_j}$ for all $j = l, \dots, 1$, starting from the base case $\frac{\partial L}{\partial h_l} = \frac{\partial L}{\partial o} \cdot \Phi_o'(a_o) \cdot w_{(l,o)}$. Together, $\eqref{multivar_grad_h}$ and $\eqref{backprop_h}$ allow us to perform backpropagation for training neural networks.
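To see $\eqref{multivar_grad_h}$ and $\eqref{backprop_h}$ working together, here is a minimal backward-pass sketch for the same scalar single-chain setup as the forward sketch above, so the sum over successors collapses to a single term; the tanh derivative and the argument layout are again my own placeholders.

```python
import numpy as np

# Backward pass for the scalar single-chain case, paired with the forward
# sketch above. Assumes tanh activations, so Phi'(a) = 1 - tanh(a)**2.
def backward(i, w, a, h, a_o, dL_do):
    phi_prime = lambda x: 1.0 - np.tanh(x) ** 2
    grads = [0.0] * len(w)                     # dL/dw for every weight
    # Base case at the output: dL/dh_l and dL/dw_{(l,o)}.
    dL_dh = dL_do * phi_prime(a_o) * w[-1]
    grads[-1] = dL_do * phi_prime(a_o) * h[-1]
    # Hidden layers, traversed backwards.
    for j in reversed(range(len(w) - 1)):
        h_prev = h[j - 1] if j > 0 else i      # h_j (the input i when j = 0)
        grads[j] = dL_dh * phi_prime(a[j]) * h_prev   # gradient rule (multivar_grad_h)
        dL_dh = dL_dh * phi_prime(a[j]) * w[j]        # recursion (backprop_h): next dL/dh_j
    return grads

# Example usage with the forward sketch above and a squared-error loss L = (o - y)^2:
a, h, a_o, o = forward(0.5, [0.1, 0.2, 0.3])
grads = backward(0.5, [0.1, 0.2, 0.3], a, h, a_o, dL_do=2.0 * (o - 1.0))
```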

The short (revisited)

It is also possible to derive a backpropagation algorithm that applies the chain rule using the pre-activations $a_j$, instead of $h_j$, as the intermediate variables. In this case, $\eqref{multivar_grad_h}$ becomes

$$ \begin{align} \frac{\partial L}{\partial w_{(j, j+1)}} &= \frac{\partial L}{\partial o} \cdot \frac{\partial o}{\partial w_{(j, j+1)}} \nonumber \newline &= \frac{\partial L}{\partial o} \cdot \frac{\partial o}{\partial a_o} \cdot \left[ \sum_{(k,l): (h_{j+1}, \dots, h_l, o) \in \mathcal{P}} \frac{\partial a_o}{\partial a_l} \cdot \left[ \prod_{k=j+1}^{l-1} \frac{\partial a_{k+1}}{\partial a_{k}} \right] \right] \cdot \frac{\partial a_{j+1}}{\partial w_{(j, j+1)}} \nonumber \newline &= \frac{\partial L}{\partial o} \cdot \Phi_o'(a_o) \cdot \left[ \sum_{(k,l): (h_{j+1}, \dots, h_l, o) \in \mathcal{P}} \frac{\partial a_o}{\partial a_l} \cdot \left[ \prod_{k=j+1}^{l-1} \frac{\partial a_{k+1}}{\partial a_{k}} \right] \right] \cdot h_j \label{multivar_grad_a} \end{align} $$
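As a one-line check, the two formulations are tied together by $h_{j+1} = \Phi_{j+1}(a_{j+1})$ from $\eqref{layer_output}$:

$$ \frac{\partial L}{\partial a_{j+1}} = \frac{\partial L}{\partial h_{j+1}} \cdot \frac{\partial h_{j+1}}{\partial a_{j+1}} = \frac{\partial L}{\partial h_{j+1}} \cdot \Phi_{j+1}'(a_{j+1}) $$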

Similarly, $\eqref{backprop_h}$ becomes

$$ \begin{align} & \frac{\partial L}{\partial o} \cdot \Phi_o'(a_o) \cdot \sum_{(k,l): (h_{j+1}, \dots, h_l, o) \in \mathcal{P}} \frac{\partial a_o}{\partial a_l} \cdot \left[ \prod_{k=j+1}^{l-1} \frac{\partial a_{k+1}}{\partial a_{k}} \right] \nonumber \newline = \qquad & \frac{\partial L}{\partial a_{j+1}} \nonumber \newline = \qquad & \sum_{j+2: (h_{j+1}, h_{j+2}) \in \mathcal{P}} \frac{\partial L}{\partial a_{j+2}} \cdot \frac{\partial a_{j+2}}{\partial a_{j+1}} \nonumber \newline = \qquad & \Phi_{j+1}'(a_{j+1}) \cdot \sum_{j+2: (h_{j+1}, h_{j+2}) \in \mathcal{P}} \frac{\partial L}{\partial a_{j+2}} \cdot w_{(j+1, j+2)} \label{backprop_a} \end{align} $$

which together are equivalent to $\eqref{multivar_grad_h}$ and $\eqref{backprop_h}$.
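In code, the pre-activation form is the familiar "delta" recursion; the sketch below mirrors the scalar single-chain sketches above, with the same tanh placeholder for $\Phi$.

```python
import numpy as np

# Backward pass expressed with pre-activation gradients delta_j = dL/da_j,
# i.e. the recursion (backprop_a) plus the gradient rule (multivar_grad_a),
# for the scalar single-chain case with tanh activations.
def backward_preact(i, w, a, h, a_o, dL_do):
    phi_prime = lambda x: 1.0 - np.tanh(x) ** 2
    grads = [0.0] * len(w)
    delta = dL_do * phi_prime(a_o)        # dL/da_o
    grads[-1] = delta * h[-1]             # dL/dw_{(l,o)} = dL/da_o * h_l
    for j in reversed(range(len(w) - 1)):
        # (backprop_a): dL/da_{j+1} = Phi'_{j+1}(a_{j+1}) * dL/da_{j+2} * w_{(j+1,j+2)}
        delta = phi_prime(a[j]) * delta * w[j + 1]
        h_prev = h[j - 1] if j > 0 else i
        grads[j] = delta * h_prev         # (multivar_grad_a): dL/dw_{(j,j+1)} = dL/da_{j+1} * h_j
    return grads
```

Running both backward sketches on the same forward pass should produce identical gradients, which is a quick numerical check of the equivalence above.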

The really short

Backpropagation boils down to a product-wise accumulation of the derivatives from later layers, combined with a local derivative built from the preceding layer's output, the derivative of the current activation function, and the succeeding weights, summed over all paths through the next layer.
