**Keywords:** logistic regression, cross-entropy

## Idea of logistic regression

The logistic sigmoid function (logistic function for short) was introduced in the post ‘An Introduction to Probabilistic Generative Models for Linear Classification’. It has an elegant form:

$$
\delta(a)=\frac{1}{1+e^{-a}}\tag{1}

$$

and when $a=0$, $\delta(a)=\frac{1}{2}$, which is the midpoint of the logistic function's range. This strongly suggests that we can set $a$ equal to some function $y(\boldsymbol{x})$, and then

$$
a=y(\boldsymbol{x})=0\tag{2}

$$

becomes a decision boundary. Here the logistic function plays the same role as the threshold function described in the post ‘From Linear Regression to Linear Classification’.
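The midpoint property is easy to verify numerically. A minimal NumPy sketch (the function name `logistic_sigmoid` is ours):

```python
import numpy as np

def logistic_sigmoid(a):
    """Equation (1): squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

# The midpoint: a = 0 gives exactly 1/2, which is why y(x) = 0
# in equation (2) serves as the decision boundary.
print(logistic_sigmoid(0.0))                      # 0.5
print(logistic_sigmoid(np.array([-5.0, 0.0, 5.0])))
```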

## Logistic Regression of Linear Classification

The simplest function is a constant, which corresponds to a 1-dimensional input. However, the decision boundary for a 1-dimensional input is a degenerate line, namely a point. So we consider a slightly more complex function – a 2-dimensional input vector $[x_1\;x_2]^T$ and the function $y(\boldsymbol{x})$:

$$
y(\boldsymbol{x})=w_0+w_1x_1+w_2x_2=\boldsymbol{w}^T\boldsymbol{x}=
\begin{bmatrix}w_0&w_1&w_2\end{bmatrix}
\begin{bmatrix}
1\\
x_1\\
x_2
\end{bmatrix}\tag{3}

$$

Substituting this into equation (1), we get our linear logistic regression function:

$$
y(\boldsymbol{x})=\delta(\boldsymbol{w}^T\boldsymbol{x})=\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}}}\tag{4}

$$

The output of function (4) is 1-dimensional and its range is $(0,1)$. So it can represent the probability that the input belongs to $\mathcal{C}_1$ (whose label is $1$) rather than $\mathcal{C}_0$ (whose label is $0$) in the training set. In this form, we can only handle a two-class task because we have only one decision boundary.
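As a small sketch of equation (4) in action, with hypothetical weights $\boldsymbol{w}=[-1\;2\;1]^T$ (any values would do):

```python
import numpy as np

# Hypothetical weights [w0, w1, w2] for equation (4); any point with
# w^T x > 0 gets a probability above 1/2 and is assigned to class C_1.
w = np.array([-1.0, 2.0, 1.0])

def predict_proba(x1, x2):
    x = np.array([1.0, x1, x2])            # augmented input [1, x1, x2]^T
    return 1.0 / (1.0 + np.exp(-w.dot(x)))

p = predict_proba(1.0, 1.0)   # w^T x = -1 + 2 + 1 = 2 > 0
print(p)                      # > 0.5, so this point is assigned to C_1
```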

## Estimating the Parameters in Logistic Regression

Although logistic regression is called regression, it acts as a classifier. Our task now is to estimate the parameters in equation (4).

Recall that the output of equation (4) is $\mathbb{P}(\mathcal{C}_1|\boldsymbol{x},M)$, where $M$ is the model we selected; the model can be represented by its parameters. Our mission, should we choose to accept it, is to build the probability $\mathbb{P}(\boldsymbol{w}|\boldsymbol{x},t)$, where $\boldsymbol{x}$ is the input vector and $t\in\{0,1\}$ is the corresponding label (the condition on $\boldsymbol{x}$ is usually omitted). $t$ indicates one of $\mathcal{C}_1$ or $\mathcal{C}_0$, so the Bayesian relation for $\mathbb{P}(\boldsymbol{w}|t)$ is:

$$
\mathbb{P}(\boldsymbol{w}|t)=\frac{\mathbb{P}(t|\boldsymbol{w})\mathbb{P}(\boldsymbol{w})}{\mathbb{P}(t)}=\frac{\mathbb{P}(\mathcal{C}_i|\boldsymbol{w})\mathbb{P}(\boldsymbol{w})}{\mathbb{P}(t)}=\frac{\mathbb{P}(\mathcal{C}_i|\boldsymbol{x},M)\mathbb{P}(\boldsymbol{w})}{\mathbb{P}(t)}\tag{5}

$$

Then maximum likelihood is employed to estimate the parameters. The likelihood of a single input belonging to $\mathcal{C}_1$ is just:

$$
\mathbb{P}(\mathcal{C}_1|\boldsymbol{x},M)=\delta(\boldsymbol{w}^T\boldsymbol{x})=\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}}}\tag{6}

$$

Once we have the training set:

$$
\{\boldsymbol{x}_1,t_1\},\{\boldsymbol{x}_2,t_2\},\cdots,\{\boldsymbol{x}_N,t_N\}\tag{7}

$$

The likelihood becomes:

$$
\prod_{t_i\in \mathcal{C}_1}\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x_i}}}\prod_{t_i\in \mathcal{C}_0}\left(1-\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x_i}}}\right)\tag{8}

$$

In the second part, when $\boldsymbol{x}$ belongs to $\mathcal{C}_0$, the label is $0$ and the output should approach $0$, so minimising $\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x_i}}}$ is equivalent to maximising $1-\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x_i}}}$. Since equation (8) is not convenient to optimise, we can use the property that $t_i\in\{0,1\}$ to write:

$$
\begin{aligned}
&\prod_{t_i\in \mathcal{C}_1}\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x_i}}}\prod_{t_i\in \mathcal{C}_0}\left(1-\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x_i}}}\right)\\
=&\prod_{i=1}^N\left(\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x_i}}}\right)^{t_i}\left(1-\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x_i}}}\right)^{1-t_i}
\end{aligned}
\tag{9}

$$
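We can check numerically that the two forms of the likelihood in equation (9) agree, using made-up toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([0.5, -1.0, 2.0])
X = np.c_[np.ones(6), rng.normal(size=(6, 2))]   # toy augmented inputs
t = np.array([1, 0, 1, 1, 0, 0])
y = 1.0 / (1.0 + np.exp(-X.dot(w)))              # sigmoid outputs y_i

# Left-hand side of (9): split the product by class membership.
lhs = np.prod(y[t == 1]) * np.prod(1.0 - y[t == 0])
# Right-hand side of (9): one product using t_i as an exponent switch.
rhs = np.prod(y**t * (1.0 - y)**(1 - t))
print(np.isclose(lhs, rhs))   # True
```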

From now on, we have an optimization problem. Maximizing equation (9) is equivalent to minimizing its negative logarithm (we use $y_i$ to represent $\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x_i}}}$):

$$
\begin{aligned}
E&=-\ln\;\prod^N_{i=1}y_i^{t_i}(1-y_i)^{1-t_i}\\
&=-\sum^N_{i=1}(t_i\ln y_i+(1-t_i)\ln(1-y_i))
\end{aligned}
\tag{10}

$$

Equation (10) is called the **cross-entropy**, an important concept in information theory; in machine learning it is known as the cross-entropy error, a very useful loss function.
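A minimal sketch of the cross-entropy error of equation (10), with made-up labels and predicted probabilities:

```python
import numpy as np

def cross_entropy(t, y):
    """Equation (10): E = -sum(t ln y + (1 - t) ln(1 - y))."""
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

t = np.array([1, 0, 1])         # made-up labels
y = np.array([0.9, 0.2, 0.7])   # made-up predicted probabilities
print(cross_entropy(t, y))      # -(ln 0.9 + ln 0.8 + ln 0.7) ≈ 0.6852
```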

Since there is no closed-form solution, the steepest descent algorithm is employed, and the first thing we need is the derivative of equation (10). Because we want the derivative with respect to $\boldsymbol{w}$, which appears inside $y_i(\boldsymbol{x})$, the chain rule is necessary:

$$
\begin{aligned}
\frac{dE}{d\boldsymbol{w}}&=-\frac{d}{d\boldsymbol{w}}\sum^N_{i=1}(t_i\ln y_i+(1-t_i)\ln(1-y_i))\\
&=-\sum^N_{i=1}\frac{d}{d\boldsymbol{w}}t_i\ln y_i+\frac{d}{d\boldsymbol{w}}(1-t_i)\ln(1-y_i)\\
&=-\sum^N_{i=1}t_i\frac{y_i'}{y_i}+(1-t_i)\frac{-y_i'}{1-y_i}
\end{aligned}
\tag{11}

$$

and since

$$
\begin{aligned}
y_i'&=\frac{d}{d\boldsymbol{w}}\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x_i}}}\\
&=\frac{\boldsymbol{x_i}e^{-\boldsymbol{w}^T\boldsymbol{x_i}}}{(1+e^{-\boldsymbol{w}^T\boldsymbol{x_i}})^2}
\end{aligned}\tag{12}

$$
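Equation (12) can be sanity-checked against a finite-difference approximation in the scalar case (the values of $w$ and $x$ below are arbitrary):

```python
import numpy as np

# For scalar w and x, y = sigma(w * x), so equation (12) says
# dy/dw = x * e^{-wx} / (1 + e^{-wx})^2.
w, x, eps = 0.7, 1.3, 1e-6
sigma = lambda a: 1.0 / (1.0 + np.exp(-a))

analytic = x * np.exp(-w * x) / (1.0 + np.exp(-w * x))**2
numeric = (sigma((w + eps) * x) - sigma((w - eps) * x)) / (2 * eps)
print(np.isclose(analytic, numeric))   # True
```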

Substituting equation (12) into equation (11), we have:

$$
\begin{aligned}
\frac{dE}{d\boldsymbol{w}}&=-\sum_{i=1}^Nt_i\frac{\frac{\boldsymbol{x_i}e^{-\boldsymbol{w}^T\boldsymbol{x_i}}}{(1+e^{-\boldsymbol{w}^T\boldsymbol{x_i}})^2}}{\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x_i}}}}-(1-t_i)\frac{\frac{\boldsymbol{x_i}e^{-\boldsymbol{w}^T\boldsymbol{x_i}}}{(1+e^{-\boldsymbol{w}^T\boldsymbol{x_i}})^2}}{\frac{e^{-\boldsymbol{w}^T\boldsymbol{x_i}}}{1+e^{-\boldsymbol{w}^T\boldsymbol{x_i}}}}\\
&=-\sum_{i=1}^Nt_i\frac{\boldsymbol{x_i}e^{-\boldsymbol{w}^T\boldsymbol{x_i}}}{1+e^{-\boldsymbol{w}^T\boldsymbol{x_i}}}-(1-t_i)\frac{\boldsymbol{x_i}}{1+e^{-\boldsymbol{w}^T\boldsymbol{x_i}}}\\
&=-\sum_{i=1}^N\left(t_i\frac{e^{-\boldsymbol{w}^T\boldsymbol{x_i}}}{1+e^{-\boldsymbol{w}^T\boldsymbol{x_i}}}-(1-t_i)\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x_i}}}\right)\boldsymbol{x_i}\\
&=-\sum_{i=1}^N(t_i(1-y_i)-(1-t_i)y_i)\boldsymbol{x_i}\\
&=-\sum_{i=1}^N(t_i-y_i)\boldsymbol{x_i}
\end{aligned}\tag{13}

$$
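The final form of equation (13) can be verified with a finite-difference gradient check on toy data (all values below are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.c_[np.ones(5), rng.normal(size=(5, 2))]   # toy augmented inputs
t = np.array([1.0, 0.0, 1.0, 0.0, 1.0])
w = rng.normal(size=3)

def E(w):
    """Cross-entropy error of equation (10)."""
    y = 1.0 / (1.0 + np.exp(-X.dot(w)))
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

# Analytic gradient from equation (13): -sum_i (t_i - y_i) x_i = -X^T (t - y)
y = 1.0 / (1.0 + np.exp(-X.dot(w)))
analytic = -X.T.dot(t - y)

# Central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.array([(E(w + eps * np.eye(3)[j]) - E(w - eps * np.eye(3)[j]))
                    / (2 * eps) for j in range(3)])
print(np.allclose(analytic, numeric, atol=1e-5))   # True
```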

Then we can update $\boldsymbol{w}$:

$$
\boldsymbol{w} = \boldsymbol{w} - \mathrm{learning\; rate} \times \left(-\frac{1}{N}\sum_{i=1}^N(t_i-y_i)\boldsymbol{x_i}\right)\tag{14}

$$

## Code for logistic regression

```python
import numpy as np


class LogisticRegression():
    def logistic_sigmoid(self, a):
        return 1. / (1.0 + np.exp(-a))

    def fit(self, x, y, e_threshold, lr=1):
        # rescale the inputs (assumed to lie in [0, 640]) to [-1, 1]
        x = np.array(x) / 320. - 1
        # augment the input with a bias column of ones
        x_dim = x.shape[0]
        x = np.c_[np.ones(x_dim), x]
        # initial parameters in 0.01 to 1
        w = np.random.randint(1, 100, x.shape[1]) / 100.
        number_of_points = np.size(y)
        for dummy in range(1000):
            y_output = self.logistic_sigmoid(w.dot(x.transpose()))
            # gradient calculation, equation (13), averaged over N points
            e_gradient = np.zeros(x.shape[1])
            for i in range(number_of_points):
                e_gradient += (y_output[i] - y[i]) * x[i]
            e_gradient = e_gradient / number_of_points
            # update parameter, equation (14)
            w += -e_gradient * lr
            # cross-entropy error, equation (10)
            e = 0
            for i in range(number_of_points):
                e += -(y[i] * np.log(y_output[i])
                       + (1 - y[i]) * np.log(1 - y_output[i]))
            e /= number_of_points
            if e <= e_threshold:
                break
        return w
```

The entire project can be found at https://github.com/Tony-Tan/ML – please give it a star.

There are two hyperparameters: the learning rate and the stopping threshold. When the error falls below the threshold, the algorithm stops.

### Experiment 1

With learning rate 1; different learning rates lead to different convergence rates:

### Experiment 2

With learning rate 1; different learning rates lead to different convergence rates:

### Experiment 3

With learning rate 1; different learning rates lead to different convergence rates:

## Several Traps in Logistic Regression

- Input values should be normalized and centered at 0.
- The learning rate is chosen according to equation (14), not to $-\sum_{i=1}^N(t_i-y_i)\boldsymbol{x_i}$, because the coefficient $N$ varies with the size of the training set.
- The parameter vector $\boldsymbol{w}$ identifies a direction, so its magnitude can be arbitrarily large. For a large $\boldsymbol{w}$, $y_i(\boldsymbol{x})$ gets very close to $1$ or $0$ but can never reach them. So there is no finite optimum, and equation (13) can never be $0$, which means the algorithm can never stop by itself.
- Considering $\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}}}$, as the magnitude of $\boldsymbol{w}$ grows, we can write it as a combination of a magnitude $M$ and a direction vector $\boldsymbol{w}_d$: $\frac{1}{1+e^{-M\boldsymbol{w}_d^T\boldsymbol{x}}}$. The function $\frac{1}{1+e^{-Mx}}$ varies with $M$: as $M$ grows, the logistic function looks more and more like the Heaviside step function.
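A quick sketch of this last point: as $M$ grows, $\delta(Ma)$ saturates toward $0$ or $1$ everywhere except right at the boundary:

```python
import numpy as np

# As the magnitude M grows, sigma(M * a) approaches the Heaviside step
# function: values away from the boundary a = 0 saturate toward 0 or 1.
sigma = lambda a: 1.0 / (1.0 + np.exp(-a))
a = np.array([-1.0, -0.1, 0.1, 1.0])
for M in [1, 10, 100]:
    print(M, np.round(sigma(M * a), 4))
```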