Keywords: Generative models

Probabilistic Generative Models [1]

The generative model used for making decisions contains two steps, an inference step and a decision step (a minimal sketch of both follows the list):

  1. The inference step uses probability theory to compute $\mathbb{P}(\mathcal{C}_k|\boldsymbol{x})$, the probability that a given $\boldsymbol{x}$ belongs to class $\mathcal{C}_k$
  2. The decision step makes a decision based on the $\mathbb{P}(\mathcal{C}_k|\boldsymbol{x})$ computed in step 1
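
As an illustration of the two steps, here is a minimal sketch in Python. The likelihoods and priors are hypothetical numbers chosen only for illustration, for a two-class problem evaluated at a single input $\boldsymbol{x}$:

```python
import numpy as np

# Hypothetical class-conditional densities P(x|C_k) and priors P(C_k)
# for a two-class problem, evaluated at some fixed input x.
likelihoods = np.array([0.20, 0.05])  # P(x|C_1), P(x|C_2)
priors = np.array([0.5, 0.5])         # P(C_1),   P(C_2)

# 1. Inference step: Bayes' formula gives the posterior P(C_k|x).
joint = likelihoods * priors
posterior = joint / joint.sum()

# 2. Decision step: here, simply pick the most probable class.
decision = np.argmax(posterior)

print(posterior)  # [0.8 0.2]
print(decision)   # 0  (i.e., class C_1)
```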

In this post, we just give an introduction to, and a framework for, the probabilistic generative model in classification. The details of how to estimate the parameters of the model will not be covered.

From Bayes' Formula to the Logistic Sigmoid Function

To build $\mathbb{P}(\mathcal{C}_k|\boldsymbol{x})$, we can start from Bayes' formula. For class $\mathcal{C}_1$ in a two-class problem, the posterior probability is:

$$
\begin{aligned}
\mathbb{P}(\mathcal{C}_1|\boldsymbol{x})&=\frac{\mathbb{P}(\boldsymbol{x}|\mathcal{C}_1)\mathbb{P}(\mathcal{C}_1)}{\mathbb{P}(\boldsymbol{x}|\mathcal{C}_1)\mathbb{P}(\mathcal{C}_1)+\mathbb{P}(\boldsymbol{x}|\mathcal{C}_2)\mathbb{P}(\mathcal{C}_2)}\\
&=\frac{1}{1+\frac{\mathbb{P}(\boldsymbol{x}|\mathcal{C}_2)\mathbb{P}(\mathcal{C}_2)}{\mathbb{P}(\boldsymbol{x}|\mathcal{C}_1)\mathbb{P}(\mathcal{C}_1)}}
\end{aligned}\tag{1}
$$

This can be rewritten in terms of a new function:

$$
\begin{aligned}
\mathbb{P}(\mathcal{C}_1|\boldsymbol{x})&=\delta(a)\\
&=\frac{1}{1+e^{-a}}
\end{aligned}\tag{2}
$$

where:

$$
a=\mathrm{ln}\frac{\mathbb{P}(\boldsymbol{x}|\mathcal{C}_1)\mathbb{P}(\mathcal{C}_1)}{\mathbb{P}(\boldsymbol{x}|\mathcal{C}_2)\mathbb{P}(\mathcal{C}_2)}\tag{3}
$$
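
The equivalence of equations (1), (2), and (3) is easy to check numerically. A small sketch, using hypothetical values for $\mathbb{P}(\boldsymbol{x}|\mathcal{C}_k)\mathbb{P}(\mathcal{C}_k)$:

```python
import numpy as np

# Hypothetical values of P(x|C_k) P(C_k) for the two classes.
p1 = 0.20 * 0.5   # P(x|C_1) P(C_1)
p2 = 0.05 * 0.5   # P(x|C_2) P(C_2)

# Posterior directly from Bayes' formula, equation (1).
posterior_bayes = p1 / (p1 + p2)

# The same posterior through the sigmoid, equations (2) and (3).
a = np.log(p1 / p2)
posterior_sigmoid = 1.0 / (1.0 + np.exp(-a))

print(np.isclose(posterior_bayes, posterior_sigmoid))  # True
```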

A usual question is why we set $a=\mathrm{ln}\frac{\mathbb{P}(\boldsymbol{x}|\mathcal{C}_1)\mathbb{P}(\mathcal{C}_1)}{\mathbb{P}(\boldsymbol{x}|\mathcal{C}_2)\mathbb{P}(\mathcal{C}_2)}$ and not $a=\mathrm{ln}\frac{\mathbb{P}(\boldsymbol{x}|\mathcal{C}_2)\mathbb{P}(\mathcal{C}_2)}{\mathbb{P}(\boldsymbol{x}|\mathcal{C}_1)\mathbb{P}(\mathcal{C}_1)}$. In my opinion, this choice of $a$ just determines the shape of the function $\delta(a)$. However, we prefer a monotonically increasing function, and $\frac{1}{1+e^{-a}}$ is monotonically increasing while $\frac{1}{1+e^{a}}$ is not.

$\delta(\cdot)$ is called the logistic sigmoid function, or squashing function, because it maps any real number into the interval $(0,1)$. The range of the function coincides with the range of a probability, so it is a good way to represent certain probabilities, such as $\mathbb{P}(\mathcal{C}_1|\boldsymbol{x})$. The graph of the logistic sigmoid is the familiar S-shaped curve.
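
A minimal sketch with NumPy and Matplotlib to draw the curve:

```python
import numpy as np
import matplotlib.pyplot as plt

a = np.linspace(-10, 10, 200)
delta = 1.0 / (1.0 + np.exp(-a))  # the logistic sigmoid, equation (2)

plt.plot(a, delta)
plt.xlabel("a")
plt.ylabel(r"$\delta(a)$")
plt.title("Logistic sigmoid function")
plt.grid(True)
plt.show()
```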

Some Properties of the Logistic Sigmoid

The logistic sigmoid function has a useful symmetry. First, note that:

$$
1-\delta(a)=\frac{e^{-a}}{1+e^{-a}}=\frac{1}{e^a+1}\tag{4}
$$

and:

$$
\delta(-a)=\frac{1}{1+e^a}\tag{5}
$$

So, we have an important equation:

$$
1-\delta(a)=\delta(-a)\tag{6}
$$

The inverse of $y=\delta(a)$, known as the logit function, is found as follows:

$$
\begin{aligned}
y&=\frac{1}{1+e^{-a}}\\
e^{-a}&=\frac{1}{y}-1\\
a&=-\mathrm{ln}(\frac{1-y}{y})\\
a&=\mathrm{ln}(\frac{y}{1-y})
\end{aligned}\tag{7}
$$

The derivative of logistic sigmoid function is:

$$
\frac{d\delta(a)}{d a}=\frac{e^{-a}}{(1+e^{-a})^2}=(1-\delta(a))\delta(a)\tag{8}
$$
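
All three properties, equations (6), (7), and (8), can be verified numerically. A small sketch (the derivative is checked against a finite difference):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a = np.linspace(-5, 5, 101)
y = sigmoid(a)

# Symmetry, equation (6): 1 - delta(a) == delta(-a)
print(np.allclose(1 - y, sigmoid(-a)))       # True

# Inverse (logit), equation (7): ln(y / (1 - y)) recovers a
print(np.allclose(np.log(y / (1 - y)), a))   # True

# Derivative, equation (8): delta'(a) == delta(a) (1 - delta(a))
h = 1e-6
numeric = (sigmoid(a + h) - sigmoid(a - h)) / (2 * h)
print(np.allclose(numeric, y * (1 - y)))     # True
```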

Multi-Class Problems

We now extend the logistic sigmoid function to the multi-class case, again starting from Bayes' formula:

$$
\mathbb{P}(\mathcal{C}_k|\boldsymbol{x})=\frac{\mathbb{P}(\boldsymbol{x}|\mathcal{C}_k)\mathbb{P}(\mathcal{C}_k)}{\sum_i\mathbb{P}(\boldsymbol{x}|\mathcal{C}_i)\mathbb{P}(\mathcal{C}_i)}\tag{9}
$$

In this case, if we set $a_i=\mathrm{ln}\frac{\mathbb{P}(\boldsymbol{x}|\mathcal{C}_k)\mathbb{P}(\mathcal{C}_k)}{\mathbb{P}(\boldsymbol{x}|\mathcal{C}_i)\mathbb{P}(\mathcal{C}_i)}$, the whole formula becomes too complicated. To simplify the equation, we instead set:

$$
a_i=\mathrm{ln}\, \mathbb{P}(\boldsymbol{x}|\mathcal{C}_i)\mathbb{P}(\mathcal{C}_i)\tag{10}
$$

and we obtain the posterior probability:

$$
\mathbb{P}(\mathcal{C}_k|\boldsymbol{x})=\frac{e^{a_k}}{\sum_i e^{a_i}}\tag{11}
$$

According to the properties of probability, the value of the function:

$$
y_k(\boldsymbol{a})=\frac{e^{a_k}}{\sum_i e^{a_i}}\tag{12}
$$

lies in the interval $[0,1]$. This function is called the softmax function. Although, according to equation (10), the inputs $a_i$ lie in $(-\infty,0]$, the softmax function itself is defined for any real numbers. It is called softmax because it is a smooth version of the max function.
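
In practice, softmax is usually implemented by subtracting the maximum activation before exponentiating, which avoids overflow and leaves the result unchanged. A minimal sketch, with hypothetical values of $\mathbb{P}(\boldsymbol{x}|\mathcal{C}_i)\mathbb{P}(\mathcal{C}_i)$ for three classes:

```python
import numpy as np

def softmax(a):
    """Softmax, equation (12); subtracting max(a) avoids overflow
    without changing the result."""
    e = np.exp(a - np.max(a))
    return e / e.sum()

# a_i = ln P(x|C_i) P(C_i) for three hypothetical classes, equation (10).
a = np.log(np.array([0.10, 0.02, 0.08]))
print(softmax(a))        # [0.5 0.1 0.4]
print(softmax(a).sum())  # 1.0 (up to floating-point rounding)
```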

When $a_k\gg a_j$ for all $j\neq k$, we have:

$$
\begin{aligned}
\mathbb{P}(\mathcal{C}_k|\boldsymbol{x})&\simeq1\\
\mathbb{P}(\mathcal{C}_j|\boldsymbol{x})&\simeq0
\end{aligned}
$$
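
This "smoothed max" behaviour is easy to see numerically; the activations below are arbitrary, with one much larger than the others:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

# When a_1 is much larger than the other activations,
# the softmax output approaches a one-hot max indicator.
a = np.array([10.0, 0.0, 1.0])
print(softmax(a))  # approximately [0.99983, 0.00005, 0.00012]
```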

So both the logistic sigmoid function and the softmax function can be used to build generative classifiers, because they produce a probability representing $\mathbb{P}(\mathcal{C}_k|\boldsymbol{x})$, which is the value handed to the decision step.

References


[1] Bishop, Christopher M. *Pattern Recognition and Machine Learning*. Springer, 2006.
Last modified: March 24, 2020