Keywords: committees

## Analysis of Committees1

The committee is a native inspiration for how to combine several models(or we can say how to combine the outputs of several models). For example, we can combine all the models by:

$$y_{COM}(X)=\frac{1}{M}\sum_{m=1}^My_m(X)\tag{1}$$

However, we want to analyze whether this average prediction of models is good than a single one of them.

To compare the committee and a single model, we should first build a criterion depending on which we can distinguish which model is better. Assuming that the true generator of the training data $x$ is:

$$h(x)\tag{2}$$

So our prediction of the $m$th model for $m=1,2\cdots,M$ can be represented as:

$$y_m(x) = h(x) +\epsilon_m(x)\tag{3}$$

then the average sum-of-square of error can be a nice criterion. And the criterion of sigle model is:

$$\mathbb{E}_x[(y_m(x)-h(x))^2] = \mathbb{E}_x[\epsilon_m(x)^2] \tag{4}$$

where the $\mathbb{E}[\cdot]$ is the frequentist expectation. To make the creterion more concrete, we consider the average of the error over $M$ models:

$$E_{AV} = \frac{1}{M}\sum_{m=1}^M\mathbb{E}_x[\epsilon_m(x)^2]\tag{5}$$

And on the other hand, the committees have the error given by equation(1), (3) and (4):

\begin{aligned} E_{COM}&=\mathbb{E}_x[(\frac{1}{M}\sum_{m=1}^My_m(x)-h(x))^2] \\ &=\mathbb{E}_x[\{\frac{1}{M}\sum_{m=1}^M\epsilon_m(x)\}^2] \end{aligned} \tag{6}

Now we assume that the random variables $\epsilon_i(x)$ for $i=1,2,\cdots,M$ have mean 0 and uncorrelated, so that:

\begin{aligned} \mathbb{E}_x[\epsilon_m(x)]&=0 &\\ \mathbb{E}_x[\epsilon_m(x)\epsilon_l(x)]&=0,&m\neq l \end{aligned} \tag{7}

Then subsitute equ (7) into equ (6), we can get:

$$E_{COM}=\frac{1}{M^2}\mathbb{E}_x[\epsilon_m(x)]\tag{8}$$

According the equation (5) and (8):

$$E_{AV}=\frac{1}{M}E_{COM}\tag{9}$$

All the mathematics above is based on the assumption that the error of each model is uncorrelated. However, most time they are highly correlated and the reduction of error is generally small. But the relation:

$$E_{COM}\leq E_{AV}\tag{10}$$

exists for sure. Then the boosting method was established to make the combining model more powerful.

## References

1 Bishop, Christopher M. Pattern recognition and machine learning. springer, 2006.