Keywords: logistic regression, cross-entropy

## Idea of Logistic Regression [1]

The logistic sigmoid function (logistic function for short) was introduced in the post ‘An Introduction to Probabilistic Generative Models for Linear Classification’. It has an elegant form:

$$\delta(a)=\frac{1}{1+e^{-a}}\tag{1}$$

and when $a=0$, $\delta(a)=\frac{1}{2}$, which is exactly the midpoint of the logistic function's range. This strongly suggests that we can set $a$ equal to some function $y(\boldsymbol{x})$, so that

$$a=y(\boldsymbol{x})=0\tag{2}$$

becomes a decision boundary. Here the logistic function plays the same role as the threshold function described in the post ‘From Linear Regression to Linear Classification’.

## Logistic Regression for Linear Classification

The simplest function is a constant, which corresponds to a 1-dimensional input. However, the decision boundary for a 1-dimensional input is a degenerate line, namely a point. So we consider a slightly more complex case: a 2-dimensional input vector $[x_1\;x_2]^T$, for which the function $y(\boldsymbol{x})$ is:

$$y(\boldsymbol{x})=w_0+w_1x_1+w_2x_2=\boldsymbol{w}^T\boldsymbol{x}= \begin{bmatrix}w_0&w_1&w_2\end{bmatrix} \begin{bmatrix} 1\\ x_1\\ x_2 \end{bmatrix}\tag{3}$$

Substituting this into equation (1), we get our linear logistic regression function:

$$y(\boldsymbol{x})=\delta(\boldsymbol{w}^T\boldsymbol{x})=\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}}}\tag{4}$$

The output of function (4) is 1-dimensional, and its range is $(0,1)$. So it can be used to represent the probability that the input belongs to class $\mathcal{C}_1$, whose label is $1$, rather than $\mathcal{C}_0$, whose label is $0$, in the training set. In this form we can only deal with a two-class task, because we have only one decision boundary.
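As a concrete illustration (not the repository's code), here is a minimal NumPy sketch of function (4); the names `sigmoid` and `predict_probability` are my own:

```python
import numpy as np

def sigmoid(a):
    """Logistic function, equation (1)."""
    return 1.0 / (1.0 + np.exp(-a))

def predict_probability(w, x):
    """Equation (4): P(C_1 | x) for an input x = [1, x_1, x_2]."""
    return sigmoid(w @ x)

# Example: a point on the positive side of the boundary w^T x = 0
w = np.array([0.0, 1.0, 1.0])     # [w_0, w_1, w_2]
x = np.array([1.0, 0.5, 0.5])     # bias term 1 prepended
print(predict_probability(w, x))  # ~0.73, so x is assigned to C_1
```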

## Estimating the Parameters in Logistic Regression

Although logistic regression is called regression, it acts as a classifier. Our mission now is to estimate the parameters in equation (4).

Recall that the output of equation (4) is $\mathbb{P}(\mathcal{C}_1|\boldsymbol{x},M)$, where $M$ is the model we selected; a model can often be represented by its parameters. Our mission, should we choose to accept it, is to build the probability $\mathbb{P}(\boldsymbol{w}|\boldsymbol{x},t)$, where $\boldsymbol{x}$ is the input vector and $t\in\{0,1\}$ is the corresponding label (the condition on $\boldsymbol{x}$ is usually omitted). The label $t$ indicates one of $\mathcal{C}_0$ or $\mathcal{C}_1$, so the Bayesian relation for $\mathbb{P}(\boldsymbol{w}|t)$ is:

$$\mathbb{P}(\boldsymbol{w}|t)=\frac{\mathbb{P}(t|\boldsymbol{w})\mathbb{P}(\boldsymbol{w})}{\mathbb{P}(t)}=\frac{\mathbb{P}(\mathcal{C}_i|\boldsymbol{w})\mathbb{P}(\boldsymbol{w})}{\mathbb{P}(t)}=\frac{\mathbb{P}(\mathcal{C}_i|\boldsymbol{x},M)\mathbb{P}(\boldsymbol{w})}{\mathbb{P}(t)}\tag{5}$$

Then maximum likelihood is employed to estimate the parameters. The likelihood of a single observation is just:

$$\mathbb{P}(\mathcal{C}_1|\boldsymbol{x},M)=\delta(\boldsymbol{w}^T\boldsymbol{x})=\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}}}\tag{6}$$

When we have the training set:

$$\{\boldsymbol{x}_1,t_1\},\{\boldsymbol{x}_2,t_2\},\cdots,\{\boldsymbol{x}_N,t_N\}\tag{7}$$

The likelihood becomes:

$$\prod_{t_i\in \mathcal{C}_1}\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}_i}}\prod_{t_i\in \mathcal{C}_0}\left(1-\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}_i}}\right)\tag{8}$$

In the second product, when $\boldsymbol{x}_i$ belongs to $\mathcal{C}_0$ the label is $0$, and the output for this class should approach $0$; so minimising $\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}_i}}$ is equivalent to maximising $1-\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}_i}}$. Since equation (8) is not convenient to optimise, we can use the property that $t_i\in\{0,1\}$, which gives:

\begin{aligned} &\prod_{t_i\in \mathcal{C}_1}\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}_i}}\prod_{t_i\in \mathcal{C}_0}\left(1-\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}_i}}\right)\\ =&\prod_{i=1}^N\left(\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}_i}}\right)^{t_i}\left(1-\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}_i}}\right)^{1-t_i} \end{aligned} \tag{9}
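To check that equation (9) really reproduces equation (8), consider the two possible exponent values:

$$t_i=1:\;\; y_i^{1}(1-y_i)^{0}=y_i \qquad\qquad t_i=0:\;\; y_i^{0}(1-y_i)^{1}=1-y_i$$

where $y_i=\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}_i}}$, so each training point contributes exactly the factor that equation (8) assigns to its class.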

From now on, we turn to an optimization problem. Maximizing equation (9) is equivalent to minimizing its negative logarithm (we use $y_i$ to represent $\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}_i}}$):

\begin{aligned} E&=-\ln\prod^N_{i=1}y_i^{t_i}(1-y_i)^{1-t_i}\\ &=-\sum^N_{i=1}\left(t_i\ln y_i+(1-t_i)\ln(1-y_i)\right) \end{aligned} \tag{10}

Equation (10) is the cross-entropy, a very important concept in information theory; in machine learning it is known as the cross-entropy error, and it is a very useful loss function.
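In code, equation (10) can be computed directly. Here is a small NumPy sketch; the `eps` clipping is my own addition to keep the logarithms finite, not part of equation (10):

```python
import numpy as np

def cross_entropy_error(y, t, eps=1e-12):
    """Cross-entropy error, equation (10).
    y: predicted probabilities, t: labels in {0, 1}."""
    y = np.clip(y, eps, 1.0 - eps)  # avoid log(0)
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))
```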

Since there is no closed-form solution, the steepest descent algorithm is employed. The first thing we need to calculate is the derivative of equation (10). Because $\boldsymbol{w}$ appears inside the function $y_i(\boldsymbol{x})$, the chain rule is necessary:

\begin{aligned} \frac{dE}{d\boldsymbol{w}}&=-\frac{d}{d\boldsymbol{w}}\sum^N_{i=1}\left(t_i\ln y_i+(1-t_i)\ln(1-y_i)\right)\\ &=-\sum^N_{i=1}\left(\frac{d}{d\boldsymbol{w}}t_i\ln y_i+\frac{d}{d\boldsymbol{w}}(1-t_i)\ln(1-y_i)\right)\\ &=-\sum^N_{i=1}\left(t_i\frac{y_i'}{y_i}+(1-t_i)\frac{-y_i'}{1-y_i}\right) \end{aligned} \tag{11}

and the derivative $y_i'$ is:

\begin{aligned} y_i'&=\frac{d}{d\boldsymbol{w}}\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}_i}}\\ &=\frac{\boldsymbol{x}_ie^{-\boldsymbol{w}^T\boldsymbol{x}_i}}{(1+e^{-\boldsymbol{w}^T\boldsymbol{x}_i})^2} \end{aligned}\tag{12}
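Notice that equation (12) is just the well-known derivative identity of the logistic function, which is what makes the cancellations in the next step possible: since $1-y_i=\frac{e^{-\boldsymbol{w}^T\boldsymbol{x}_i}}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}_i}}$, we can write

$$y_i'=y_i(1-y_i)\boldsymbol{x}_i$$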

Substituting equation (12) into equation (11), we have:

\begin{aligned} \frac{dE}{d\boldsymbol{w}}&=-\sum_{i=1}^N\left(t_i\frac{\frac{\boldsymbol{x}_ie^{-\boldsymbol{w}^T\boldsymbol{x}_i}}{(1+e^{-\boldsymbol{w}^T\boldsymbol{x}_i})^2}}{\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}_i}}}-(1-t_i)\frac{\frac{\boldsymbol{x}_ie^{-\boldsymbol{w}^T\boldsymbol{x}_i}}{(1+e^{-\boldsymbol{w}^T\boldsymbol{x}_i})^2}}{\frac{e^{-\boldsymbol{w}^T\boldsymbol{x}_i}}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}_i}}}\right)\\ &=-\sum_{i=1}^N\left(t_i\frac{\boldsymbol{x}_ie^{-\boldsymbol{w}^T\boldsymbol{x}_i}}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}_i}}-(1-t_i)\frac{\boldsymbol{x}_i}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}_i}}\right)\\ &=-\sum_{i=1}^N\left(t_i\frac{e^{-\boldsymbol{w}^T\boldsymbol{x}_i}}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}_i}}-(1-t_i)\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}_i}}\right)\boldsymbol{x}_i\\ &=-\sum_{i=1}^N\left(t_i(1-y_i)-(1-t_i)y_i\right)\boldsymbol{x}_i\\ &=-\sum_{i=1}^N(t_i-y_i)\boldsymbol{x}_i \end{aligned}\tag{13}

Then we can update $\boldsymbol{w}$:

$$\boldsymbol{w} = \boldsymbol{w} - \mathrm{learning\; rate} \times \left(-\frac{1}{N}\sum_{i=1}^N(t_i-y_i)\boldsymbol{x}_i\right)\tag{14}$$
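Putting equations (13) and (14) together gives a steepest descent loop. The sketch below is my own minimal version, not the project's actual implementation; the name `train_logistic_regression`, its arguments, and the `max_iter` safety guard are assumptions. It expects the training inputs stacked in a matrix `X` with the bias 1 prepended, and `learning_rate` and `stop_value` are the two hyperparameters discussed in the next section:

```python
import numpy as np

def train_logistic_regression(X, t, learning_rate=1.0, stop_value=1e-3,
                              max_iter=100000):
    """Steepest descent on the cross-entropy error, equations (13) and (14).
    X: (N, d) inputs with bias 1 prepended; t: (N,) labels in {0, 1}."""
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(max_iter):
        y = 1.0 / (1.0 + np.exp(-X @ w))  # equation (4) for all points at once
        gradient = -(t - y) @ X / N       # equation (13), averaged by 1/N
        w = w - learning_rate * gradient  # update rule, equation (14)
        # cross-entropy error, equation (10), clipped to keep the logs finite
        y_c = np.clip(y, 1e-12, 1.0 - 1e-12)
        error = -np.mean(t * np.log(y_c) + (1 - t) * np.log(1 - y_c))
        if error < stop_value:            # stopping condition
            break
    return w
```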

## Code for logistic regression

The entire project can be found at https://github.com/Tony-Tan/ML, and please give it a star.

There are two hyperparameters: the learning rate and the stopping condition. When the error falls below the stopping value, the algorithm stops.
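For example, with the hypothetical `train_logistic_regression` sketch above (the real repository code may expose a different interface), the two hyperparameters appear as arguments:

```python
import numpy as np

# 100 two-dimensional points with the bias 1 prepended
rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.standard_normal((100, 2))])
t = (X[:, 1] + X[:, 2] > 0).astype(float)  # labels from the boundary x_1 + x_2 = 0
w = train_logistic_regression(X, t, learning_rate=1.0, stop_value=1e-2)
print(w)  # should point roughly along [0, 1, 1]
```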

### Experiment 1

(Figure omitted: run with learning rate 1; different learning rates lead to different convergence rates.)

### Experiment 2

(Figure omitted: run with learning rate 1; different learning rates lead to different convergence rates.)

### Experiment 3

(Figure omitted: run with learning rate 1; different learning rates lead to different convergence rates.)

## Several Traps in Logistic Regression

1. Input values should be normalized and centered at 0.
2. The learning rate is chosen according to equation (14), not according to the raw gradient $-\sum_{i=1}^N(t_i-y_i)\boldsymbol{x}_i$, because of the varying coefficient $N$; averaging by $\frac{1}{N}$ keeps the step size independent of the training-set size.
3. The parameter vector $\boldsymbol{w}$ only identifies a direction, so its magnitude can be arbitrarily large. For a large $\boldsymbol{w}$, $y_i(\boldsymbol{x})$ is very close to $1$ or $0$ but can never reach them. So there is no optimal point, and equation (13) can never be exactly $0$, which means the algorithm can never stop by itself.
4. Considering $\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}}}$: when the magnitude of $\boldsymbol{w}$ grows, we can write $\boldsymbol{w}$ as a combination of a magnitude $M$ and a direction vector $\boldsymbol{w}_d$, giving $\frac{1}{1+e^{-M(\boldsymbol{w}_d^T\boldsymbol{x})}}$. How the function $\frac{1}{1+e^{-Mx}}$ varies with $M$ is shown in the sketch below: as $M$ grows, the logistic function looks more and more like the Heaviside step function. (Original figure omitted.)
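Since the original figure is missing, the effect of $M$ can be reproduced with a few lines; this is a sketch of mine, assuming matplotlib is available:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 500)
for M in [1, 2, 5, 20]:
    # logistic function with the magnitude M factored out of w
    plt.plot(x, 1.0 / (1.0 + np.exp(-M * x)), label=f"M = {M}")
plt.legend()
plt.title("Logistic function approaching the Heaviside step as M grows")
plt.show()
```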

## References

[1] Bishop, Christopher M. *Pattern Recognition and Machine Learning*. Springer, 2006.