## Boosting1

The committee has an equal weight for every prediction from all models, and it gives little improvement than a single model. Then boosting was built for this problem. Boosting is a technique for combining multiple ‘base’ classifiers to produce a form of the committee that:

1. performances better than any of base classifier and
2. each base classifier has a different weight factor

Adaboost is short for adaptive boosting. It is a method combining several weak classifiers which are just better than random guesses and it gives a better performance than the committee. The base classifiers in AdaBoost are trained in sequence, and their training set is the same but with different weights. So if we consider the training data distribution, the distribution faced by every weak classifier is different. This might be an important reason for the improvement of AdaBoost from the committee. And the weights for weak classifiers are generated depending on the performance of the previous classifier. Then the weak classifiers in the algorithm would be trained one by one. During the prediction process, the input data flows from classifier to classifier and the final result is some kind of combination of all output of weak classifiers.

Key points of the AdaBoost algorithm are:

1. the data points are predicted incorrectly in current classifier gives a greater weight
2. once the algorithm was trained, prediction of each classifier is combined through a weighted majority voting scheme as:

where $w_n^{(1)}$ is the initial weights of input data of the $1$st weak classifier, $y_1(x)$ is the prediction of the $1$st weak classifier, $\alpha_m$ is the weight of each prediction(notably this weight belongs to $y_m(x)$ and $w_n^{(1)}$ belongs to input data of the first classifier.). And the final output is the sign function of the weighted sum of all predictions.
The procedure of algorithm is:

1. Initial data weighting coefficients $\{\boldsymbol{w}_n\}$ by $w_n^{(1)}=\frac{1}{N}$ for $n=1,2,\cdots,N$
2. For $m=1,\dots,M$:
• Fit a classifier $y_m(\boldsymbol{x})$ to training set by minimizing the weighted error function:
$$J_m=\sum_{}^{}w_n^{(m)}I(y_m(\boldsymbol{x}_n)\neq t_n)$$
where $I(y_m(\boldsymbol{x})\neq t_n)$is the indicator function and equals 1 when $y_m(\boldsymbol{x})\neq t_n$ and 0 otherwise
• Evaluate the quatities:
$$\epsilon_m=\frac{\sum_{n=1}^Nw_n^{(m)}I(y_m(\boldsymbol{x})\neq t_n)}{\sum_{n=1}^{N}w_n^{(m)}}$$
and then use this to evaluate $\alpha_m=\ln \{\frac{1-\epsilon_m}{\epsilon_m}\}$
• Updata the data weighting coefficients:
$$w_n^{(m+1)}=w_n^{(m)}\exp\{\alpha_mI(y_m(\boldsymbol{x})\neq t_n)\}$$
1. Make predictions using the final model, which is given by:
$$Y_M = \mathrm{sign} (\sum_{m=1}^{M}\alpha_my_m(x))$$

This procedure comes from ‘Pattern recognition and machine learning’1

the entire project can be found https://github.com/Tony-Tan/ML. And please star me! Thanks!

When we use different numbers of classifiers, the results of the algorithm are like:

where the blue circles are correct classification of class 1 and red circles are correct classification of class 2. And the blue crosses belong to class 2 but were classified into class 1, and so do the red crosses.

A 40-classifiers AdaBoost gives a relatively good prediction:

where there is only one misclassification point.

## References

1 Bishop, Christopher M. Pattern recognition and machine learning. springer, 2006.