**Keywords:** BP, backpropagation, multilayer perceptrons

## From LMS to Backpropagation^{1}

The LMS algorithm was introduced earlier as a kind of ‘performance learning’. Before that, we studied several learning rules (algorithms), such as the ‘Perceptron learning rule’ and ‘Supervised Hebbian learning’, which were based on the physical mechanisms of biological neural networks. Performance learning came after them, and from that point on we have moved further and further away from natural intelligence.

LMS can only solve classification tasks that are linearly separable. Backpropagation (BP for short), a generalization of the LMS algorithm, was introduced for more complex problems. Like LMS, backpropagation is an approximation of the steepest descent algorithm, and the performance index it minimizes is the mean squared error (MSE).

The difference between BP and LMS lies in how the derivative is calculated:

- Single-layer network: $\frac{\partial e}{\partial w}$ is relatively easy to compute.
- Multilayer network: $\frac{\partial e}{\partial w_{i,j}}$ is more complicated, so the chain rule is used to handle multiple layers with nonlinear transfer functions.
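
To make the contrast concrete, here is a minimal worked sketch. It assumes a scalar network with one input and one neuron per layer, which is an illustrative simplification and not a setup taken from the text; superscripts on $w$, $b$, $n$, $a$ and $f$ are layer indices, while $e^2$ is the squared error.

For a single linear neuron $a = wp + b$ with error $e = t - a$, the derivative follows directly:

$$
\frac{\partial e}{\partial w} = -p, \qquad \frac{\partial e^2}{\partial w} = 2e\,\frac{\partial e}{\partial w} = -2ep
$$

For two layers with $n^1 = w^1 p + b^1$, $a^1 = f^1(n^1)$, $n^2 = w^2 a^1 + b^2$, $a^2 = f^2(n^2)$ and $e = t - a^2$, the first-layer weight only reaches the error through the second layer, so the chain rule is needed:

$$
\frac{\partial e}{\partial w^1} = -\dot{f}^2(n^2)\, w^2\, \dot{f}^1(n^1)\, p
$$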

## Brief History of BP

Rosenblatt and Widrow were aware of the limitation of single-layer networks: they can only solve linearly separable tasks. They therefore proposed multilayer networks, but they did not develop an efficient learning rule to train them.

In 1974, Paul Werbos described the first procedure for training a multilayer network in his thesis. However, the thesis went unnoticed by researchers. It was not until 1985 and 1986 that David Parker, Yann LeCun, and Geoffrey Hinton independently proposed the BP algorithm. In 1986, the book ‘Parallel Distributed Processing’ by David Rumelhart and James McClelland made the algorithm widely known.

In the next several posts, we will investigate:

- The capacity of the multilayer network
- BP algorithm

## Multilayer Perceptrons

Let’s consider the 3-layer network:

whose output is:

$$
\boldsymbol{a}^3=\boldsymbol{f}^3(W^3\boldsymbol{f}^2(W^2\boldsymbol{f}^1(W^1\boldsymbol{p}+\boldsymbol{b}^1)+\boldsymbol{b}^2)+\boldsymbol{b}^3)
$$
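
The nested output expression can be sketched in code as a loop over layers, each computing $\boldsymbol{f}(W\boldsymbol{a}+\boldsymbol{b})$. This is only a minimal sketch: the 2-3-2-1 layer sizes, the weight values, and the helper names `layer` and `forward` are hypothetical choices made for illustration, as is the use of log-sigmoid hidden layers with a linear output layer.

```python
import math

def logsig(n):
    """Log-sigmoid transfer function."""
    return 1.0 / (1.0 + math.exp(-n))

def purelin(n):
    """Linear transfer function."""
    return n

def layer(W, b, p, f):
    """One layer: a = f(W p + b), with W given as a list of rows."""
    return [f(sum(w_ij * p_j for w_ij, p_j in zip(row, p)) + b_i)
            for row, b_i in zip(W, b)]

def forward(p, params):
    """Feed the output of each layer into the next one."""
    a = p
    for W, b, f in params:
        a = layer(W, b, a, f)
    return a

# Hypothetical 2-3-2-1 network (R=2, S1=3, S2=2, S3=1); values are arbitrary.
W1 = [[0.5, -0.5], [1.0, 1.0], [-1.0, 0.5]]
b1 = [0.0, -1.0, 0.5]
W2 = [[1.0, -1.0, 0.5], [0.5, 0.5, -0.5]]
b2 = [0.0, 0.1]
W3 = [[1.0, -2.0]]
b3 = [0.5]

params = [(W1, b1, logsig), (W2, b2, logsig), (W3, b3, purelin)]
p = [1.0, 0.0]
a3 = forward(p, params)
print(a3)
```

The loop form and the fully nested form compute the same thing; the loop just makes it easy to change the number of layers.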

Because the outputs of each layer are the inputs to the next layer, the network can be written in abbreviated notation as:

where $R$ is the number of inputs and $S^i$ for $i=1,2,3$ is the number of neurons in layer $i$.

We now have the architecture of the new model, the multilayer network. Next, we investigate its capacity for:

- Pattern classification
- Function Approximation

### Pattern Classification

First, let’s look at a famous logic problem, ‘exclusive-or’, or ‘XOR’ for short. It is famous because it cannot be solved by a single-layer network, as Minsky and Papert pointed out in 1969. This simple problem, unsolvable for a single layer, caused interest in neural networks to fade for a long time.

Multilayer networks were then used to solve the ‘XOR’ problem. The input/target pairs of ‘XOR’ are

$$
\begin{aligned}
&\{\boldsymbol{p}_1=\begin{bmatrix}0\\0\end{bmatrix},\boldsymbol{t}_1=0\}\\
&\{\boldsymbol{p}_2=\begin{bmatrix}0\\1\end{bmatrix},\boldsymbol{t}_2=1\}\\
&\{\boldsymbol{p}_3=\begin{bmatrix}1\\0\end{bmatrix},\boldsymbol{t}_3=1\}\\
&\{\boldsymbol{p}_4=\begin{bmatrix}1\\1\end{bmatrix},\boldsymbol{t}_4=0\}
\end{aligned}
$$

These points are not linearly separable:

If we use a 2-layer network, the XOR problem can be solved:

where the two lines can be constructed by two neurons:

- The blue line can be $y=-x+0.5$ and its neuron model is:

- The green line can be $y=-x+1.5$ and its neuron model is:

These two lines (neurons) can be combined into a 2-layer network (a 2-2-1 network):

This gives a solution to the ‘XOR’ problem, which is not linearly separable. However, it is a hand-built construction rather than a learning rule, so it cannot be generalized to other, more complex problems.
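
The 2-2-1 construction can be checked with a few lines of code. This is a sketch under stated assumptions: the two first-layer neurons use the lines $y=-x+0.5$ and $y=-x+1.5$ from the text, but the hard-limit transfer function, the orientation of each neuron, and the second-layer weights (1, -1) with bias -0.5 are one possible choice, not necessarily the one shown in the original figures.

```python
def hardlim(n):
    """Hard-limit transfer function: 1 if n >= 0, else 0."""
    return 1 if n >= 0 else 0

def xor_net(p1, p2):
    # Layer 1, neuron 1: fires when the point is above the blue line y = -x + 0.5
    a1 = hardlim(p1 + p2 - 0.5)
    # Layer 1, neuron 2: fires when the point is above the green line y = -x + 1.5
    a2 = hardlim(p1 + p2 - 1.5)
    # Layer 2 (assumed weights): fires only between the two lines,
    # i.e. above the blue line but not above the green line
    return hardlim(a1 - a2 - 0.5)

# The four XOR input/target pairs from the text
pairs = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
outputs = [xor_net(*p) for p, _ in pairs]
for (p, t), a in zip(pairs, outputs):
    print(p, "target:", t, "output:", a)
```

All weights here are set by hand, which is exactly why this is a construction and not a learning rule.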

### Function Approximation

Besides classification, another task for neural networks is function approximation. If we regard intelligence as a very intricate function, then the capacity of a neural network to approximate functions is worth investigating. This capacity is also known as the model’s flexibility. Let’s now discuss the flexibility of a multilayer perceptron for implementing functions. A simple example is a good way to look at the properties of a model without unnecessary detail, so we introduce a ‘1-2-1’ network whose transfer function is log-sigmoid in the first layer and linear in the second layer:

When $w^1_{1,1}=10$, $w^1_{2,1}=10$, $b^1_1=-10$, $b^1_2=10$, $w^2_{1,1}=1$, $w^2_{1,2}=1$, $b^2=0$, the network response looks like:

Each step can be adjusted by changing the parameters. The steps are centered where the net input of a first-layer neuron is zero:

- $n^1_1=0$ at $p=1$
- $n^1_2=0$ at $p=-1$

The steps can also be changed by changing the weights. When $w^1_{1,1}=20$, $w^1_{2,1}=20$, $b^1_1=-10$, $b^1_2=10$, $w^2_{1,1}=1$, $w^2_{1,2}=1$, $b^2=0$, the response looks like the gray line in the figure:
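
The effect of the two parameter sets can also be explored numerically. The following is a minimal sketch of the 1-2-1 network with a log-sigmoid first layer and a linear second layer; the function names `net_121` and `step_centers` and the dictionary layout are hypothetical, introduced only for this example. A step is centered where a first-layer net input $wp+b$ is zero, that is at $p=-b/w$.

```python
import math

def logsig(n):
    """Log-sigmoid transfer function."""
    return 1.0 / (1.0 + math.exp(-n))

def net_121(p, w1, b1, w2, b2):
    """1-2-1 network: log-sigmoid first layer, linear second layer."""
    a1 = [logsig(w * p + b) for w, b in zip(w1, b1)]
    return sum(w * a for w, a in zip(w2, a1)) + b2

def step_centers(w1, b1):
    """Input values where each first-layer net input w*p + b is zero."""
    return [-b / w for w, b in zip(w1, b1)]

# First parameter set from the text (first-layer weights of 10)
first = dict(w1=[10, 10], b1=[-10, 10], w2=[1, 1], b2=0)
# Second parameter set from the text (first-layer weights of 20)
second = dict(w1=[20, 20], b1=[-10, 10], w2=[1, 1], b2=0)

centers_first = step_centers(first["w1"], first["b1"])
centers_second = step_centers(second["w1"], second["b1"])
print("centers:", centers_first, centers_second)

# Sample both responses on a few points in [-2, 2]
for p in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    print(p, round(net_121(p, **first), 4), round(net_121(p, **second), 4))
```

With the stated values, doubling the first-layer weights while keeping the biases fixed makes each step steeper and also moves its center, since the center $-b/w$ depends on both the weight and the bias.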

Now let’s take a close look at how the response curve of the network changes when one of the parameters is varied.