**Keywords:** backpropagation, BP

## Architecture^{1}

We have seen a three-layer network is flexible in approximating functions. If we had a more-than-three-layer network, it could be used to approximate any functions as close as we want. However, another trouble came to us is how to train these networks. This problem almost killed neural networks in the 1970s. Until backpropagation(BP for short) algorithm was found that it is an efficient algorithm in training multiple layers networks.

3-layer network is also used in this post for it is the simplest multiple-layer network whose abbreviated notation is:

and a more short way to represent its architecture is:

R – S^1 – S^2 – S^3 \tag{1}

$$

For the three-layer network has only three layers which are not too large to denote mathematically, then it can be written as:

\boldsymbol{a}=f^3(W^3\cdot f^2(W^2\cdot f^1(W^1\cdot \boldsymbol{p}+\boldsymbol{b}^1)+\boldsymbol{b}^2)+\boldsymbol{b}^3)\tag{2}

$$

but, it becomes impossible when we have a 10-layer network or 100-layer network. Then we can use some short equations that describe the whole operation of $M$-layer network:

a^{m+1}=f^{m+1}(W^{m+1}\boldsymbol{a}^{m}+\boldsymbol{b}^{m+1})\tag{3}

$$

for $m = 1, 2, 3, \cdots M-1$. $M$ is the number of layers in the neural networks. And:

- $\boldsymbol{a}^0=\boldsymbol{p}$ is its input
- $\boldsymbol{a}=\boldsymbol{a}^M$ is its output

## Performance Index

We have had a network now. And ‘performance learning’ is a kind of learning rule in learning tasks. Then we need a definite performance index for the 3-layer network. MSE is used here as the performance index the same as what the LMS algorithm did in post ‘Widrow-Hoff Learning(Part I)’ and ‘Widrow-Hoff Learning(Part II)’. And a set of examples of proper network behavior are needed:

\{\boldsymbol{p}_1,\boldsymbol{t}_1\},\{\boldsymbol{p}_2,\boldsymbol{t}_2\},\cdots \{\boldsymbol{p}_Q,\boldsymbol{t}_Q\}\tag{4}

$$

where $\boldsymbol{p}_1$ is the input and $\boldsymbol{t}_1$ is the corresponding output.

Recall that in the steepest descent algorithm, we have a definite objective function, minimizing which is the task of the algorithm iteratively. BP is the generation of LMS algorithms, and both of them try to minimize the mean square error. However, the best structure of the network that is the closest to the true data is unknown. So each iteration an approximation of ‘Error’ which was calculated from our observation was used in the algorithm. And what we finally get is a trained neural network that fits the training set but not guarantees fitting the original task. So a good training set which can represent the original task as close as possible is necessary.

To make it easier to understand from the steepest descent algorithm to LMS and BP, we convert the weights and bias in the neural network form $w$ and $b$ into a vector $\boldsymbol{x}$. Then the performance index is:

F(\boldsymbol{x})=\mathbb E[e^2]=\mathbb E[(t-a)^2]\tag{5}

$$

When network has multiple outputs this generalizes to:

F(\boldsymbol{x})=\mathbb E[\boldsymbol{e}^T\boldsymbol{e}]=\mathbb E[(\boldsymbol{t}-\boldsymbol{a})^T(\boldsymbol{t}-\boldsymbol{a})]\tag{6}

$$

During an iteration, in the LMS algorithm, MSE(mean square error) is approximated by SE(square error):

\hat{F}(\boldsymbol{x})=(\boldsymbol{t}-\boldsymbol{a})^T(\boldsymbol{t}-\boldsymbol{a})=\boldsymbol{e}^T\boldsymbol{e}\tag{7}

$$

where the expectations are replaced by the calculation of current input, output, and target.

Reviewing the ‘steepest descent algorithm’, the approximate MSE which is also called stochastic gradient descent:

\begin{aligned}

w^m_{i,j}(k+1)&=w^m_{i,j}(k)-\alpha \frac{\partial \hat{F}}{\partial w^m_{i,j}}\\

b^m_{i}(k+1)&=b^m_{i}(k)-\alpha \frac{\partial \hat{F}}{\partial b^m_{i}}

\end{aligned}\tag{8}

$$

where $\alpha$ is the learning rate.

However, the steep descent algorithm seems can not work on a multiple-layer network for we can not just calculate the partial derivative in the hidden layer and input layer directly. So we should recall another mathematical tool – **chain rule**.

## Chain Rule

### Calculus

when $f$ is explicit function of $\boldsymbol{n}$ and $\boldsymbol{n}$ is a explicit function of $\boldsymbol{w}$, we can calculate the partial derivative $\frac{\partial f}{\partial w}$ by:

\frac{\partial f}{\partial w}=\frac{\partial f}{\partial n}\frac{\partial n}{\partial w}\tag{9}

$$

The whole process looks like a chain. And let’s look at a simple example: when we have $f(n)=e^n$ and $n=2w$, we have $f(n(w))=e^{2w}$. We can easily calculate the direvative $\frac{\partial f}{\partial w}=\frac{\partial e^2w}{\partial w}=2e^{2w}$. And when chain rule is used, we have:

\frac{\partial f(n(w))}{\partial w}=\frac{\partial e^n}{\partial n}\frac{\partial n}{\partial w}=\frac{\partial e^n}{\partial n}\frac{\partial 2w}{\partial w}=e^n\cdot 2=2e^{2w}\tag{10}

$$

that is the same as what we get directly.

When the chain rule is used in the second part on the right of equation (8), we get the way to calculate the derivative of the weight of hidden layers:

\begin{aligned}

\frac{\partial \hat{F}}{\partial w^m_{i,j}}&=\frac{\partial \hat{F}}{\partial n^m_i}\cdot \frac{\partial n^m_i}{\partial w^m_{i,j}}\\

\frac{\partial \hat{F}}{\partial b^m_{i}}&=\frac{\partial \hat{F}}{\partial n^m_i}\cdot \frac{\partial n^m_i}{\partial b^m_{i}}

\end{aligned}\tag{11}

$$

from the abbreviated notation, we know that $n^m_i=\sum^{S^{m-1}}_{j=1}w^m_{i,j}a^{m-1}_{j}+b^m_i$. Then equation (11) can be writen as:

\begin{aligned}

\frac{\partial \hat{F}}{\partial w^m_{i,j}}&=\frac{\partial \hat{F}}{\partial n^m_i}\cdot \frac{\partial \sum^{S^{m-1}}_{j=1}w^m_{i,j}a^{m-1}_{j}+b^m_i}{\partial w^m_{i,j}}=\frac{\partial \hat{F}}{\partial n^m_i}\cdot a^{m-1}_j\\

\frac{\partial \hat{F}}{\partial b^m_{i}}&=\frac{\partial \hat{F}}{\partial n^m_i}\cdot \frac{\partial \sum^{S^{m-1}}_{j=1}w^m_{i,j}a^{m-1}_{j}+b^m_i}{\partial b^m_{i}}=\frac{\partial \hat{F}}{\partial n^m_i}\cdot 1

\end{aligned}\tag{12}

$$

Equation (12) could also be simplified by defining a new concept: sensitivity.

### Sensitivity

We define **sensitivity** as $s^m_i\equiv \frac{\partial \hat{F}}{\partial n^m_{i}}$ that means the sensitivity of $\hat{F}$ to changes in the $i^{\text{th}}$ element of the net input at layer $m$. Then equation (12) can be simplified as:

\begin{aligned}

\frac{\partial \hat{F}}{\partial w^m_{i,j}}&=s^m_{i}\cdot a^{m-1}_j\\

\frac{\partial \hat{F}}{\partial b^m_{i}}&=s^m_{i}\cdot 1 \end{aligned}\tag{13}

$$

Then the steepest descent algorithm is:

\begin{aligned}

w^m_{i,j}(k+1)&=w^m_{i,j}(k)-\alpha s^m_{i}\cdot a^{m-1}_j\\

b^m_{i}(k+1)&=b^m_{i}(k)-\alpha s^m_{i}\cdot 1

\end{aligned}\tag{14}

$$

This can also be written in a matrix form:

\begin{aligned}

W^m(k+1)&=W^m(k)-\alpha \boldsymbol{s}^m(\boldsymbol{a}^{m-1})^T\\

\boldsymbol{b}^m(k+1)&=\boldsymbol{b}^m(k)-\alpha \boldsymbol{s}^m\cdot 1

\end{aligned}\tag{15}

$$

where:

\boldsymbol{s}^m=\frac{\partial \hat{F}}{\alpha \boldsymbol{n}^m}=\begin{bmatrix}

\frac{\partial \hat{F}}{\partial n^m_1}\\

\frac{\partial \hat{F}}{\partial n^m_2}\\

\vdots\\

\frac{\partial \hat{F}}{\partial n^m_{S^m}}\\

\end{bmatrix}\tag{16}

$$

And be careful of the $\boldsymbol{s}$ which means the sensitivity and $S^m$ which means the number of layers $m$