**Keywords:** maximum likelihood estimation

## Square Loss Function for Regression

For any input $\boldsymbol{x}$, the goal of a regression task is to produce a prediction $\hat{y}=y(\boldsymbol{x})$ that approximates the target $t$, where the function $y$ is the chosen hypothesis. The difference between $t$ and $\hat{y}$ is called the ‘error’ or, more precisely, the ‘loss’: in an approximation task some error occurs by chance and always exists, and ‘loss’ is the conventional word for this difference. The loss can be written generally as a function $\ell(y(\boldsymbol{x}),t)$. Intuitively, the smaller the loss, the better the approximation, so the expectation of the loss:

$$
\mathbb E[\ell]=\int\int \ell(y(\boldsymbol{x}),t)p(\boldsymbol{x},t)d \boldsymbol{x}dt\tag{1}
$$

should be small.

From a probabilistic viewpoint, the input vector $\boldsymbol{x}$, the target $t$, and the parameters of the function $y$ are all random variables, so the expectation of the loss function is well defined.

Consider the square loss function $\ell=(y(\boldsymbol{x})-t)^2$, which measures the difference between our prediction and the target. Substituting this loss function into equation (1), we have:

$$
\mathbb E[\ell]=\int\int (y(\boldsymbol{x})-t)^2p(\boldsymbol{x},t)d \boldsymbol{x}dt\tag{2}
$$

To minimize this functional, we can use the Euler-Lagrange equation together with Fubini’s theorem.

Fubini’s theorem tells us that we can change the order of integration:

$$
\begin{aligned}
\mathbb E[\ell]&=\int\int (y(\boldsymbol{x})-t)^2p(\boldsymbol{x},t)d \boldsymbol{x}dt\\
&=\int\int (y(\boldsymbol{x})-t)^2p(\boldsymbol{x},t)dtd \boldsymbol{x}
\end{aligned}\tag{3}
$$

Following the Euler-Lagrange formulation, we first define a new function $G(x,y,y')$ as the inner integral:

$$
G(x,y,y')= \int (y(\boldsymbol{x})-t)^2p(\boldsymbol{x},t)dt\tag{4}
$$

The Euler-Lagrange equation gives the condition for minimizing equation (2):

$$
\frac{\partial G}{\partial y}-\frac{d}{dx}\frac{\partial G}{\partial y'}=0\tag{5}
$$

Because there is no $y'$ term in $G$, the equation:

$$
\frac{\partial G}{\partial y}=0\tag{6}
$$

becomes the necessary condition for minimizing equation (2); applied to equation (4), it reads:

$$
2\int (y(\boldsymbol{x})-t)p(\boldsymbol{x},t)dt=0 \tag{7}
$$

What we want, in order to minimize the loss, is the predictor $y$, so we rearrange equation (7) and obtain the predictor that minimizes the square loss function:

$$
\begin{aligned}
\int (y(\boldsymbol{x})-t)p(\boldsymbol{x},t)dt&=0\\
\int y(\boldsymbol{x})p(\boldsymbol{x},t)dt-\int tp(\boldsymbol{x},t)dt&=0\\
y(\boldsymbol{x})\int p(\boldsymbol{x},t)dt&=\int tp(\boldsymbol{x},t)dt\\
y(\boldsymbol{x})&=\frac{\int tp(\boldsymbol{x},t)dt}{\int p(\boldsymbol{x},t)dt}\\
y(\boldsymbol{x})&=\frac{\int tp(\boldsymbol{x},t)dt}{p(\boldsymbol{x})}\\
y(\boldsymbol{x})&=\int tp(t|\boldsymbol{x})dt\\
y(\boldsymbol{x})&= \mathbb{E}_t[t|\boldsymbol{x}]
\end{aligned}\tag{8}
$$

To minimize the expectation of the square loss function, we find that the expectation of $t$ given $\boldsymbol{x}$ is the optimal solution to the task, i.e. $y(\boldsymbol{x})=\mathbb{E}[t|\boldsymbol{x}]$. The expectation of $t$ given $\boldsymbol{x}$ is also called the **regression function**.

A small summary: $\mathbb{E}[t|\boldsymbol{x}]$ is the optimal choice of $y(\boldsymbol{x})$ under the square loss.
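This result can be checked numerically. The sketch below uses a toy setup with made-up values (a discrete input and unit-variance Gaussian noise, both my own assumptions): it estimates the conditional mean from samples and confirms that shifting away from it increases the expected square loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy joint distribution (hypothetical): x uniform on {0, 1, 2},
# t = x + zero-mean Gaussian noise with unit variance.
x = rng.integers(0, 3, size=100_000)
t = x + rng.normal(0.0, 1.0, size=x.shape)

# Empirical regression function E[t|x] for each value of x.
means = np.array([t[x == v].mean() for v in (0, 1, 2)])
pred = means[x]

# Expected square loss of the conditional mean vs. a shifted predictor.
loss_mean = np.mean((pred - t) ** 2)
loss_shifted = np.mean((pred + 0.5 - t) ** 2)

print(loss_mean < loss_shifted)  # True: E[t|x] minimizes the square loss
```

Since the within-group residuals of the empirical conditional mean sum to zero exactly, the shifted predictor’s loss exceeds it by exactly $0.25$ here.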

## Maximum Likelihood Estimation

Generally, we assume that there is a generator behind the data:

$$
t=g(\boldsymbol{x},\boldsymbol{w})+\epsilon\tag{9}
$$

where $g(\boldsymbol{x},\boldsymbol{w})$ is a deterministic function, $t$ is the target variable, and $\epsilon$ is a zero-mean Gaussian random variable with precision $\beta$ (the inverse variance). By the properties of the Gaussian distribution, $t$ is also Gaussian, with mean (expectation) $g(\boldsymbol{x},\boldsymbol{w})$ and precision $\beta$. Recalling the standard form of the Gaussian distribution:

$$
\begin{aligned}
\mathbb{P}(t|\boldsymbol{x},\boldsymbol{w},\beta)&=\mathcal{N}(t|g(\boldsymbol{x},\boldsymbol{w}),\beta^{-1})\\
&=\sqrt{\frac{\beta}{2\pi}}\mathrm{e}^{-\frac{\beta}{2}(t-g(\boldsymbol{x},\boldsymbol{w}))^2}
\end{aligned}\tag{10}
$$
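As a quick sanity check on the precision parameterization, this sketch (with arbitrary values for $\beta$ and the mean, chosen only for illustration) evaluates the Gaussian density in precision form and verifies that it integrates to one and that samples drawn with standard deviation $\beta^{-1/2}$ have variance $1/\beta$.

```python
import numpy as np

beta = 4.0   # precision (inverse variance); arbitrary choice
mu = 1.5     # stands in for g(x, w) at a fixed x

# Gaussian density in precision form, as in equation (10).
def gauss(t):
    return np.sqrt(beta / (2 * np.pi)) * np.exp(-0.5 * beta * (t - mu) ** 2)

# Riemann-sum check that the density integrates to 1.
grid = np.linspace(mu - 10.0, mu + 10.0, 200_001)
area = gauss(grid).sum() * (grid[1] - grid[0])

# Samples drawn with std = beta**-0.5 should have variance 1/beta = 0.25.
rng = np.random.default_rng(1)
samples = rng.normal(mu, beta ** -0.5, size=200_000)

print(area, samples.var())
```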

Our task here is to approximate the generator in equation (9) with a linear function. When we use the square loss function, the optimal solution is $\mathbb{E}[t|\boldsymbol{x}]$ by equation (8), and since $\epsilon$ has zero mean, for the distribution in equation (10) this solution is:

$$
\mathbb{E}[t|\boldsymbol{x}]=g(\boldsymbol{x},\boldsymbol{w})\tag{11}
$$

Then we set our linear model as:

$$
y(\boldsymbol{x})=\boldsymbol{w}^T\boldsymbol{x}+b\tag{12}
$$

and this can be transformed as:

$$
y(\boldsymbol{x})=
\begin{bmatrix}
b&\boldsymbol{w}^T
\end{bmatrix}
\begin{bmatrix}
1\\
\boldsymbol{x}
\end{bmatrix}=\boldsymbol{w}_a^T\boldsymbol{x}_a
\tag{13}
$$

For brevity, we write $\boldsymbol{w}_a$ and $\boldsymbol{x}_a$ simply as $\boldsymbol{w}$ and $\boldsymbol{x}$. The linear model then becomes:

$$
y(\boldsymbol{x})=\boldsymbol{w}^T\boldsymbol{x}\tag{14}
$$
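The bias-absorption trick in equation (13) is easy to verify with a small numerical sketch; the weights, bias, and input below are made-up values.

```python
import numpy as np

w = np.array([2.0, -1.0])   # weights (hypothetical values)
b = 0.5                     # bias
x = np.array([3.0, 4.0])    # an input vector

# Affine model of equation (12): y = w^T x + b.
y_affine = w @ x + b

# Augmented form of equations (13)-(14): prepend b to w and 1 to x.
w_a = np.concatenate(([b], w))
x_a = np.concatenate(([1.0], x))
y_aug = w_a @ x_a

print(y_affine, y_aug)  # both 2.5: the two forms agree
```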

To make predictions we need to find $\boldsymbol{w}$. As mentioned above, we treat all parameters as random variables, so the conditional distribution of $\boldsymbol{w}$ is $\mathbb{P}(\boldsymbol{w}|\boldsymbol{t},\beta)$; the inputs $X$ (or $\boldsymbol{x}$) are omitted from the conditioning because we do not model their distribution. Bayes’ theorem tells us:

$$
\mathbb{P}(\boldsymbol{w}|\boldsymbol{t},\beta)=\frac{\mathbb{P}(\boldsymbol{t}|\boldsymbol{w},\beta)\mathbb{P}(\boldsymbol{w})}{\mathbb{P}(\boldsymbol{t})}=\frac{\text{Likelihood}\times \text{Prior}}{\text{Evidence}}\tag{15}
$$

We want to find the $\boldsymbol{w}^{\star}$ that maximizes the posterior probability $\mathbb{P}(\boldsymbol{w}|\boldsymbol{t},\beta)$. If we assume the prior $\mathbb{P}(\boldsymbol{w})$ is constant (a flat prior), and note that the evidence $\mathbb{P}(\boldsymbol{t})$ does not depend on $\boldsymbol{w}$, then maximizing the likelihood $\mathbb{P}(\boldsymbol{t}|\boldsymbol{w},\beta)$ also maximizes the posterior.

$$
\begin{aligned}
\mathbb{P}(\boldsymbol{t}|\boldsymbol{w},\beta)&=\prod_{i=0}^{N}\mathcal{N}(t_i|\boldsymbol{w}^T\boldsymbol{x}_i,\beta^{-1})\\
\ln \mathbb{P}(\boldsymbol{t}|\boldsymbol{w},\beta)&=\sum_{i=0}^{N}\ln \mathcal{N}(t_i|\boldsymbol{w}^T\boldsymbol{x}_i,\beta^{-1})\\
&=\sum_{i=0}^{N}\ln \sqrt{\frac{\beta}{2\pi}}\mathrm{e}^{-\frac{\beta}{2}(t_i-\boldsymbol{w}^T\boldsymbol{x}_i)^2}\\
&=\sum_{i=0}^{N} \frac{1}{2}\ln \beta - \sum_{i=0}^{N} \frac{1}{2}\ln 2\pi - \frac{\beta}{2}\sum_{i=0}^{N}(t_i-\boldsymbol{w}^T\boldsymbol{x}_i)^2
\end{aligned}\tag{16}
$$

This gives us a wonderful result. In the last line of equation (16), only the term $\frac{\beta}{2}\sum_{i=0}^{N}(t_i-\boldsymbol{w}^T\boldsymbol{x}_i)^2$ depends on $\boldsymbol{w}$; the terms $\sum_{i=0}^{N} \frac{1}{2}\ln \beta$ and $-\sum_{i=0}^{N} \frac{1}{2}\ln 2\pi$ are fixed by the assumptions. In other words, to maximize the likelihood, we just need to minimize:

$$
\sum_{i=0}^{N}(t_i-\boldsymbol{w}^T\boldsymbol{x}_i)^2\tag{17}
$$

This is just minimizing the sum of squares, so the optimization problem reduces to the least squares problem.
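This equivalence can be demonstrated numerically. The sketch below (generator weights, sample size, and precision are all invented for illustration) fits $\boldsymbol{w}$ by least squares and checks that the same $\boldsymbol{w}$ also maximizes the Gaussian log-likelihood of equation (16).

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 25.0                     # noise precision (assumed known)
w_true = np.array([0.5, 2.0])   # hypothetical generator weights, bias absorbed

# Data from the assumed generator t = w^T x + eps, eps ~ N(0, 1/beta).
X = np.column_stack([np.ones(500), rng.uniform(-1.0, 1.0, 500)])
t = X @ w_true + rng.normal(0.0, beta ** -0.5, 500)

# Least squares: minimize sum_i (t_i - w^T x_i)^2.
w_ls, *_ = np.linalg.lstsq(X, t, rcond=None)

def log_likelihood(w):
    # Gaussian log-likelihood, as in equation (16).
    r = t - X @ w
    n = len(t)
    return 0.5 * n * np.log(beta) - 0.5 * n * np.log(2 * np.pi) - 0.5 * beta * (r ** 2).sum()

# The least-squares solution is also the maximum likelihood solution:
# the log-likelihood decreases monotonically in the residual sum of squares.
print(log_likelihood(w_ls) >= log_likelihood(w_true))  # True
```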

## Least Square Estimation and Maximum Likelihood Estimation

When we assume there is a generator:

$$
t=g(\boldsymbol{x},\boldsymbol{w})+\epsilon\tag{18}
$$

behind the data, and $\epsilon$ has a zero-mean Gaussian distribution with some precision $\beta$, maximum likelihood estimation reduces to least square estimation. This holds not only for linear regression, because we made no assumption about what $g(\boldsymbol{x},\boldsymbol{w})$ is.

However, when $\epsilon$ follows a distribution other than the Gaussian, least square estimation is no longer the maximum likelihood solution and need not be optimal.
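A simple illustration of this last point (all values invented): estimate a constant target from Laplace-noised samples. The least square estimate is the sample mean, while the true maximum likelihood estimate under Laplace noise is the sample median, and over many trials the median is the more accurate estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 3.0        # the constant target to estimate
b = 1.0         # Laplace scale; noise variance is 2 * b**2
trials, n = 2000, 200

# Laplace noise instead of Gaussian: least squares is no longer the MLE.
data = mu + rng.laplace(0.0, b, size=(trials, n))

# Least squares estimate = sample mean; Laplace MLE = sample median.
err_ls = ((data.mean(axis=1) - mu) ** 2).mean()
err_mle = ((np.median(data, axis=1) - mu) ** 2).mean()

print(err_mle < err_ls)  # True: the median beats least squares here
```

Asymptotically the median’s squared error is about half the mean’s under Laplace noise, which the experiment reproduces.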