Keywords: Generative models

## Probabilistic Generative Models [1]

A generative model used for making decisions consists of an inference step and a decision step:

1. The inference step uses probability theory to calculate $\mathbb{P}(\mathcal{C}_k|\boldsymbol{x})$, the probability that a given $\boldsymbol{x}$ belongs to class $\mathcal{C}_k$
2. The decision step makes a decision based on the $\mathbb{P}(\mathcal{C}_k|\boldsymbol{x})$ calculated in step 1

In this post, we only give an introduction to and a framework for the probabilistic generative model in classification. The details of how to estimate the parameters of the model will not be covered.

## From Bayes' Formula to the Logistic Sigmoid Function

To build $\mathbb{P}(\mathcal{C}_k|\boldsymbol{x})$, we can start from Bayes' formula. For class $\mathcal{C}_1$ in a two-class problem, the posterior probability is:

\begin{aligned} \mathbb{P}(\mathcal{C}_1|\boldsymbol{x})&=\frac{\mathbb{P}(\boldsymbol{x}|\mathcal{C}_1)\mathbb{P}(\mathcal{C}_1)}{\mathbb{P}(\boldsymbol{x}|\mathcal{C}_1)\mathbb{P}(\mathcal{C}_1)+\mathbb{P}(\boldsymbol{x}|\mathcal{C}_2)\mathbb{P}(\mathcal{C}_2)}\\ &=\frac{1}{1+\frac{\mathbb{P}(\boldsymbol{x}|\mathcal{C}_2)\mathbb{P}(\mathcal{C}_2)}{\mathbb{P}(\boldsymbol{x}|\mathcal{C}_1)\mathbb{P}(\mathcal{C}_1)}} \end{aligned}\tag{1}

which can be expressed through a new function:

\begin{aligned} \mathbb{P}(\mathcal{C}_1|\boldsymbol{x})&=\delta(a)\\ &=\frac{1}{1+e^{-a}} \end{aligned}\tag{2}

where:

$$a=\mathrm{ln}\frac{\mathbb{P}(\boldsymbol{x}|\mathcal{C}_1)\mathbb{P}(\mathcal{C}_1)}{\mathbb{P}(\boldsymbol{x}|\mathcal{C}_2)\mathbb{P}(\mathcal{C}_2)}\tag{3}$$

A common question is why we set $a=\mathrm{ln}\frac{\mathbb{P}(\boldsymbol{x}|\mathcal{C}_1)\mathbb{P}(\mathcal{C}_1)}{\mathbb{P}(\boldsymbol{x}|\mathcal{C}_2)\mathbb{P}(\mathcal{C}_2)}$ rather than $a=\mathrm{ln}\frac{\mathbb{P}(\boldsymbol{x}|\mathcal{C}_2)\mathbb{P}(\mathcal{C}_2)}{\mathbb{P}(\boldsymbol{x}|\mathcal{C}_1)\mathbb{P}(\mathcal{C}_1)}$. In my opinion, this choice of $a$ only determines the shape of the function $\delta(a)$. However, we prefer a monotonically increasing function, and $\frac{1}{1+e^{-a}}$ is monotonically increasing while $\frac{1}{1+e^{a}}$ is not.
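As a quick sanity check of equations (1)–(3), here is a minimal sketch that computes the posterior both directly from Bayes' formula and through the sigmoid of $a$; the likelihoods and priors are made-up numbers, not from the text:

```python
import math

# Hypothetical class-conditional likelihoods and priors for a two-class problem.
p_x_c1, p_c1 = 0.30, 0.5   # P(x|C1), P(C1)
p_x_c2, p_c2 = 0.10, 0.5   # P(x|C2), P(C2)

# Posterior directly from Bayes' formula, equation (1).
posterior = (p_x_c1 * p_c1) / (p_x_c1 * p_c1 + p_x_c2 * p_c2)

# The same posterior via the sigmoid of a, equations (2)-(3).
a = math.log((p_x_c1 * p_c1) / (p_x_c2 * p_c2))
sigmoid = 1.0 / (1.0 + math.exp(-a))

print(posterior, sigmoid)  # both 0.75
```

The two routes agree exactly, since equation (2) is just an algebraic rewriting of equation (1).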

$\delta(\cdot)$ is called the logistic sigmoid function, or squashing function, because it maps any real number into the interval $(0,1)$. Its range matches the range of a probability, so it is a natural way to represent probabilities such as $\mathbb{P}(\mathcal{C}_1|\boldsymbol{x})$. The logistic sigmoid has an S-shaped curve that increases monotonically from 0 to 1.
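The squashing behaviour is easy to see numerically; a small sketch tabulating $\delta(a)$ at a few arbitrary points:

```python
import math

def sigmoid(a):
    # Logistic sigmoid, equation (2): maps any real a into (0, 1).
    return 1.0 / (1.0 + math.exp(-a))

# Large negative inputs squash toward 0, large positive toward 1.
for a in [-10, -2, 0, 2, 10]:
    print(a, round(sigmoid(a), 4))
```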

## Some Properties of Logistic Sigmoid

The logistic sigmoid function is symmetric. To see this, first note:

$$1-\delta(a)=\frac{e^{-a}}{1+e^{-a}}=\frac{1}{e^a+1}\tag{4}$$

and:

$$\delta(-a)=\frac{1}{1+e^a}\tag{5}$$

So, we have an important equation:

$$1-\delta(a)=\delta(-a)\tag{6}$$

The inverse function of $y=\delta(a)$ is:

\begin{aligned} y&=\frac{1}{1+e^{-a}}\\ e^{-a}&=\frac{1}{y}-1\\ a&=-\mathrm{ln}(\frac{1-y}{y})\\ a&=\mathrm{ln}(\frac{y}{1-y}) \end{aligned}\tag{7}

The derivative of logistic sigmoid function is:

$$\frac{d\delta(a)}{d a}=\frac{e^{-a}}{(1+e^{-a})^2}=(1-\delta(a))\delta(a)\tag{8}$$
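The three properties above, equations (6)–(8), can be verified numerically; a minimal sketch at an arbitrary test point:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

a = 1.7  # an arbitrary test point
y = sigmoid(a)

# Equation (6): 1 - sigmoid(a) = sigmoid(-a)
symmetry_gap = abs((1 - y) - sigmoid(-a))

# Equation (7): the inverse recovers a from y
inverse_gap = abs(math.log(y / (1 - y)) - a)

# Equation (8): derivative equals sigmoid(a) * (1 - sigmoid(a)),
# checked against a central finite difference
h = 1e-6
numeric = (sigmoid(a + h) - sigmoid(a - h)) / (2 * h)
derivative_gap = abs(numeric - y * (1 - y))

print(symmetry_gap, inverse_gap, derivative_gap)  # all close to 0
```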

## Multiple Classes Problems

We now extend the logistic sigmoid function to the multi-class case, again starting from Bayes' formula:

$$\mathbb{P}(\mathcal{C}_k|\boldsymbol{x})=\frac{\mathbb{P}(\boldsymbol{x}|\mathcal{C}_k)\mathbb{P}(\mathcal{C}_k)}{\sum_i\mathbb{P}(\boldsymbol{x}|\mathcal{C}_i)\mathbb{P}(\mathcal{C}_i)}\tag{9}$$

In this case, if we set $a_i=\mathrm{ln}\frac{\mathbb{P}(\boldsymbol{x}|\mathcal{C}_k)\mathbb{P}(\mathcal{C}_k)}{\mathbb{P}(\boldsymbol{x}|\mathcal{C}_i)\mathbb{P}(\mathcal{C}_i)}$, the whole formula becomes too complicated. To simplify the equation, we instead set:

$$a_i=\mathrm{ln}\, \mathbb{P}(\boldsymbol{x}|\mathcal{C}_i)\mathbb{P}(\mathcal{C}_i)\tag{10}$$

and we get a function of posterior probability:

$$\mathbb{P}(\mathcal{C}_k|\boldsymbol{x})=\frac{e^{a_k}}{\sum_i e^{a_i}}\tag{11}$$

And according to the properties of probability, the value of the function:

$$y(a)=\frac{e^{a_k}}{\sum_i e^{a_i}}\tag{12}$$

lies in the interval $[0,1]$. This is called the softmax function. Although, according to equation (10), each $a_i$ lies in $(-\infty,0]$, the softmax function itself is defined for any real inputs. It is called softmax because it is a smooth version of the max function.
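A minimal softmax sketch following equation (12); the max-subtraction is a standard numerical-stability trick (it leaves the result unchanged because the common factor cancels) rather than part of the derivation above:

```python
import math

def softmax(a):
    # Subtract max(a) before exponentiating to avoid overflow;
    # the shared factor e^{-max(a)} cancels in the ratio.
    m = max(a)
    exps = [math.exp(ai - m) for ai in a]
    s = sum(exps)
    return [e / s for e in exps]

# Arbitrary activations for a three-class problem.
probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))  # probabilities summing to 1
```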

When $a_k\gg a_j$ for all $j\neq k$, we have:

\begin{aligned} \mathbb{P}(\mathcal{C}_k|\boldsymbol{x})&\simeq1\\ \mathbb{P}(\mathcal{C}_j|\boldsymbol{x})&\simeq0 \end{aligned}
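This saturation can be seen with one dominant activation (the values below are chosen arbitrarily): the softmax output is then nearly a one-hot vector, which is why it behaves like a smoothed max.

```python
import math

def softmax(a):
    # Stable softmax, as in equation (12).
    m = max(a)
    exps = [math.exp(ai - m) for ai in a]
    s = sum(exps)
    return [e / s for e in exps]

# a_0 dominates the other activations, so P(C_0|x) is nearly 1.
p = softmax([10.0, 0.0, -1.0])
print(p)
```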

So both the logistic sigmoid function and the softmax function can be used to build generative classifiers: they produce the probability $\mathbb{P}(\mathcal{C}_k|\boldsymbol{x})$, which supplies the value needed by the decision step.

## References

[1] Bishop, Christopher M. *Pattern Recognition and Machine Learning*. Springer, 2006.