Keywords: logistic regression, cross-entropy

Idea of logistic regression [1]

The logistic sigmoid function (logistic function for short) was introduced in the post ‘An Introduction to Probabilistic Generative Models for Linear Classification’. It has an elegant form:

$$
\delta(a)=\frac{1}{1+e^{-a}}\tag{1}
$$

When $a=0$, $\delta(a)=\frac{1}{2}$, which is exactly the midpoint of the logistic function's range. This strongly suggests that we can let $a$ be some function $a(\boldsymbol{x})$ of the input, and then

$$
a(\boldsymbol{x})=0\tag{2}
$$

becomes a decision boundary. Here the logistic function plays the same role as the threshold function described in the post ‘From Linear Regression to Linear Classification’.
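As a tiny illustration (my own sketch, not code from the accompanying repository), the logistic function and the induced decision rule can be written in a few lines of NumPy:

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

# delta(0) = 0.5, the midpoint of the range (0, 1)
print(sigmoid(0.0))                      # 0.5

# thresholding the sigmoid at 0.5 is the same as thresholding a at 0
a = np.array([-2.0, -0.1, 0.0, 0.1, 2.0])
print((sigmoid(a) >= 0.5).astype(int))   # [0 0 1 1 1]
```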

Logistic Regression for Linear Classification

The simplest case is a 1-dimensional input. However, the decision boundary for a 1-dimensional input is degenerate, namely a single point. So we consider a slightly more complex case – a 2-dimensional input vector $[x_1\;x_2]^T$ – for which the function $a(\boldsymbol{x})$ is:

$$
a(\boldsymbol{x})=w_0+w_1x_1+w_2x_2=\boldsymbol{w}^T\boldsymbol{x}=
\begin{bmatrix}w_0&w_1&w_2\end{bmatrix}
\begin{bmatrix}
1\\
x_1\\
x_2
\end{bmatrix}\tag{3}
$$

Substituting this into equation (1), we get our linear logistic regression function:

$$
y(\boldsymbol{x})=\delta(\boldsymbol{w}^T\boldsymbol{x})=\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}}}\tag{4}
$$

The output of function (4) is 1-dimensional and lies between $0$ and $1$, so it can be used to represent the probability that the input belongs to $\mathcal{C}_1$ (whose label is $1$) rather than $\mathcal{C}_0$ (whose label is $0$) in the training set. In this form we can only deal with a two-class task, because we have only one decision boundary.
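To make equation (4) concrete, here is a minimal sketch with the bias folded into the weight vector; the weight values are made up for the example and are not taken from the post:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predict_proba(w, x):
    """P(C1 | x) for a single 2-d input x = [x1, x2] and weights w = [w0, w1, w2]."""
    x_aug = np.concatenate(([1.0], x))   # prepend 1 so that w0 acts as the bias
    return sigmoid(w @ x_aug)

w = np.array([-1.0, 2.0, 0.5])           # hypothetical weights
x = np.array([0.3, 0.8])
p_c1 = predict_proba(w, x)
label = int(p_c1 >= 0.5)                 # decision boundary at a(x) = 0, i.e. p = 0.5
```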

Estimating the Parameters in Logistic Regression

Although logistic regression is called regression, it acts as a classifier. Our mission now is to estimate the parameters in equation (4).

Recall that the output of equation (4) is $\mathbb{P}(\mathcal{C}_1|\boldsymbol{x},M)$, where $M$ is the model we selected; a model can often be represented by its parameters. And the mission, should we choose to accept it, is to build the probability $\mathbb{P}(\boldsymbol{w}|\boldsymbol{x},t)$, where $\boldsymbol{x}$ is the input vector and $t\in\{0,1\}$ is the corresponding label (the conditioning on $\boldsymbol{x}$ is usually omitted). Since $t$ indicates one of $\mathcal{C}_1$ or $\mathcal{C}_0$, the Bayesian relation for $\mathbb{P}(\boldsymbol{w}|t)$ is:

$$
\mathbb{P}(\boldsymbol{w}|t)=\frac{\mathbb{P}(t|\boldsymbol{w})\mathbb{P}(\boldsymbol{w})}{\mathbb{P}(t)}=\frac{\mathbb{P}(\mathcal{C}_i|\boldsymbol{w})\mathbb{P}(\boldsymbol{w})}{\mathbb{P}(t)}=\frac{\mathbb{P}(\mathcal{C}_i|\boldsymbol{x},M)\mathbb{P}(\boldsymbol{w})}{\mathbb{P}(t)}\tag{5}
$$

Then maximum likelihood is employed to estimate the parameters. The likelihood of a single input belonging to $\mathcal{C}_1$ is just:

$$
\mathbb{P}(\mathcal{C}_1|\boldsymbol{x},M)=\delta(\boldsymbol{w}^T\boldsymbol{x})=\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}}}\tag{6}
$$

Once we have the training set:

$$
\{\boldsymbol{x}_1,t_1\},\{\boldsymbol{x}_2,t_2\},\cdots,\{\boldsymbol{x}_N,t_N\}\tag{7}
$$

The likelihood becomes:

$$
\prod_{t_i=1}\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x_i}}}\prod_{t_i=0}(1-\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x_i}}})\tag{8}
$$

In the second part, when $\boldsymbol{x_i}$ belongs to $\mathcal{C}_0$, the label is $0$ and the output for this sample should approach $0$, so minimising $\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x_i}}}$ is equivalent to maximising $1-\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x_i}}}$. Because the case-by-case form of equation (8) is not convenient to optimise, we can use the property that $t_i\in\{0,1\}$ to write:

$$
\begin{aligned}
&\prod_{t_i=1}\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x_i}}}\prod_{t_i=0}(1-\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x_i}}})\\
=&\prod_{i=1}^N(\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x_i}}})^{t_i}(1-\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x_i}}})^{1-t_i}
\end{aligned}
\tag{9}
$$

From now on, we have an optimization problem. Maximizing equation (9) is equivalent to minimizing its negative logarithm (we use $y_i$ to represent $\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x_i}}}$):

$$
\begin{aligned}
E&=-\ln\prod^N_{i=1}y_i^{t_i}(1-y_i)^{1-t_i}\\
&=-\sum^N_{i=1}(t_i\ln y_i+(1-t_i)\ln(1-y_i))
\end{aligned}
\tag{10}
$$

Equation (10) is the cross-entropy, a very important concept in information theory; in machine learning it is known as the cross-entropy error, and it is a very useful loss function.
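A minimal NumPy sketch of the cross-entropy error in equation (10); the small clipping constant is an assumption added here purely for numerical safety and is not part of the derivation:

```python
import numpy as np

def cross_entropy(y, t, eps=1e-12):
    """E = -sum(t*ln(y) + (1-t)*ln(1-y)) for predictions y in (0, 1) and labels t in {0, 1}."""
    y = np.clip(y, eps, 1.0 - eps)   # keep ln() away from 0
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

t = np.array([1, 0, 1])
y = np.array([0.9, 0.2, 0.6])
print(cross_entropy(y, t))           # about 0.84
```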

Since there is no closed-form solution, the steepest descent (gradient descent) algorithm is employed. The first thing we need is the derivative of equation (10). Because $\boldsymbol{w}$ appears inside the function $y_i(\boldsymbol{x})$, the chain rule is necessary:

$$
\begin{aligned}
\frac{dE}{d\boldsymbol{w}}&=-\frac{d}{d\boldsymbol{w}}\sum^N_{i=1}(t_i\ln y_i+(1-t_i)\ln(1-y_i))\\
&=-\sum^N_{i=1}\left(\frac{d}{d\boldsymbol{w}}t_i\ln y_i+\frac{d}{d\boldsymbol{w}}(1-t_i)\ln(1-y_i)\right)\\
&=-\sum^N_{i=1}\left(t_i\frac{y_i'}{y_i}+(1-t_i)\frac{-y_i'}{1-y_i}\right)
\end{aligned}
\tag{11}
$$

and since

$$
\begin{aligned}
y_i'&=\frac{d}{d\boldsymbol{w}}\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x_i}}}\\
&=\frac{\boldsymbol{x_i}e^{-\boldsymbol{w}^T\boldsymbol{x_i}}}{(1+e^{-\boldsymbol{w}^T\boldsymbol{x_i}})^2}
\end{aligned}\tag{12}
$$

Substituting equation (12) into equation (11), we have:

$$
\begin{aligned}
\frac{dE}{d\boldsymbol{w}}&=-\sum_{i=1}^N\left(t_i\frac{\frac{\boldsymbol{x_i}e^{-\boldsymbol{w}^T\boldsymbol{x_i}}}{(1+e^{-\boldsymbol{w}^T\boldsymbol{x_i}})^2}}{\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x_i}}}}-(1-t_i)\frac{\frac{\boldsymbol{x_i}e^{-\boldsymbol{w}^T\boldsymbol{x_i}}}{(1+e^{-\boldsymbol{w}^T\boldsymbol{x_i}})^2}}{\frac{e^{-\boldsymbol{w}^T\boldsymbol{x_i}}}{1+e^{-\boldsymbol{w}^T\boldsymbol{x_i}}}}\right)\\
&=-\sum_{i=1}^N\left(t_i\frac{\boldsymbol{x_i}e^{-\boldsymbol{w}^T\boldsymbol{x_i}}}{1+e^{-\boldsymbol{w}^T\boldsymbol{x_i}}}-(1-t_i)\frac{\boldsymbol{x_i}}{1+e^{-\boldsymbol{w}^T\boldsymbol{x_i}}}\right)\\
&=-\sum_{i=1}^N\left(t_i\frac{e^{-\boldsymbol{w}^T\boldsymbol{x_i}}}{1+e^{-\boldsymbol{w}^T\boldsymbol{x_i}}}-(1-t_i)\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x_i}}}\right)\boldsymbol{x_i}\\
&=-\sum_{i=1}^N(t_i(1-y_i)-(1-t_i)y_i)\boldsymbol{x_i}\\
&=-\sum_{i=1}^N(t_i-y_i)\boldsymbol{x_i}
\end{aligned}\tag{13}
$$
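In vectorized form, if the rows of a design matrix `X` are the augmented inputs, equation (13) is simply `-X.T @ (t - y)`. A minimal sketch (the function and variable names are my own, not from the accompanying repository):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gradient(w, X, t):
    """dE/dw = -sum_i (t_i - y_i) x_i, written as a matrix product.

    X : (N, D) design matrix, first column all ones for the bias w0
    t : (N,)   labels in {0, 1}
    w : (D,)   weight vector
    """
    y = sigmoid(X @ w)
    return -X.T @ (t - y)
```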

Then we can update $\boldsymbol{w}$:

$$
\boldsymbol{w} = \boldsymbol{w} - \mathrm{learning\; rate} \times (-\frac{1}{N}\sum_{i=1}^N(t_i-y_i)\boldsymbol{x_i})\tag{14}
$$

Code for logistic regression

The entire project can be found at https://github.com/Tony-Tan/ML, and please give it a star.

There are two hyperparameters: the learning rate and the stopping condition. When the error falls below the stopping value, the algorithm stops.
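Putting the pieces together, here is a minimal training-loop sketch of the procedure described above; it is my own illustration of equations (10) and (14) rather than the repository's code, and the hyperparameter values are placeholders:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_logistic_regression(X, t, learning_rate=0.1, stop_value=1e-3, max_iter=100000):
    """Steepest descent on the cross-entropy error of equation (10).

    X : (N, D) design matrix with a leading column of ones for the bias
    t : (N,)   labels in {0, 1}
    """
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(max_iter):
        y = sigmoid(X @ w)
        error = -np.mean(t * np.log(y + 1e-12) + (1.0 - t) * np.log(1.0 - y + 1e-12))
        if error < stop_value:                      # stop when the error is low enough
            break
        # equation (14): move against the averaged gradient
        w -= learning_rate * (-(X.T @ (t - y)) / N)
    return w

# toy usage on two linearly separable clusters
rng = np.random.default_rng(0)
X0 = rng.normal(-1.0, 0.5, size=(50, 2))
X1 = rng.normal(+1.0, 0.5, size=(50, 2))
X = np.hstack([np.ones((100, 1)), np.vstack([X0, X1])])
t = np.concatenate([np.zeros(50), np.ones(50)])
w = train_logistic_regression(X, t)
```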

Experiment 1

With a learning rate of 1; different learning rates lead to different convergence rates.

Experiment 2

With a learning rate of 1; different learning rates lead to different convergence rates.

Experiment 3

With a learning rate of 1; different learning rates lead to different convergence rates.

Several Traps in Logistic Regression

  1. Input values should be normalized and centered at 0 (see the sketch after this list).
  2. The learning rate is chosen according to equation (14) rather than to $-\sum_{i=1}^N(t_i-y_i)\boldsymbol{x_i}$, because of the uncertain coefficient $N$.
  3. The parameter vector $\boldsymbol{w}$ only identifies a direction, so its magnitude can be arbitrarily large. For a large $\boldsymbol{w}$, $y_i(\boldsymbol{x})$ is very close to $1$ or $0$, but it can never be exactly $1$ or $0$. So there is no optimal position, and equation (13) can never be $0$, which means the algorithm can never stop by itself.
  4. Considering $\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}}}$, when the magnitude of $\boldsymbol{w}$ grows we can write it as a combination of a magnitude $M$ and a direction vector $\boldsymbol{w}_d$: $\frac{1}{1+e^{-M(\boldsymbol{w}_d^T\boldsymbol{x})}}$. As $M$ grows, the function $\frac{1}{1+e^{-Mx}}$ behaves more and more like the Heaviside step function.
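For the first point, here is a minimal sketch of standardizing the inputs (zero mean, unit variance per feature) before training; it is my own illustration rather than the repository's preprocessing code:

```python
import numpy as np

def standardize(X):
    """Center each feature at 0 and scale it to unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0          # guard against constant features
    return (X - mean) / std, mean, std

# the same mean/std must be reused to transform any test inputs:
# X_train_s, mean, std = standardize(X_train)
# X_test_s = (X_test - mean) / std
```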

References


[1] Bishop, Christopher M. *Pattern Recognition and Machine Learning*. Springer, 2006.