Keywords: discriminant functions, decision boundary

Discriminant Function in Classification

The discriminant function, or discriminant model, stands in contrast to the generative model. Here we take a look at the behavior of discriminant functions in linear classification.1

In the post ‘Least Squares Classification’, we saw that in a linear classification task the decision boundary is a line or hyperplane that tells two classes apart. If our model is built on this decision boundary, or, in other words, if we partition the inputs by a function and a threshold, then the model is a discriminant model and the decision boundary is formed by the function and the threshold.

Now we are going to look at what the decision boundaries look like in the $K$-class problem for $K=2$ and $K>2$. To illustrate the boundaries, we only consider the 2D (two-dimensional) input vector $\boldsymbol{x}$, which has just two components.

Two classes

The simplest case is a 2-dimensional input space that is partitioned into 2 classes,

whose decision boundary is:

$$
\boldsymbol{w}^T\boldsymbol{x}+w_0=\text{ constant }\tag{1}
$$

This equation is equivalent to $\boldsymbol{w}^T\boldsymbol{x}+w_0=0$, because $w_0$ is itself a constant and the two constants can be merged. Of course, a 1-dimensional input space is even simpler than a 2-dimensional one, but its decision boundary is just a single point, which is not very illustrative.

Returning to the line, some properties of the linear decision boundary are:

  1. The vector $\boldsymbol{w}$ is perpendicular to the line and always points toward the region where $y(\boldsymbol{x})>0$.
  2. $w_0$ determines the location of the boundary relative to the origin.
  3. The signed perpendicular distance $r$ from a point $\boldsymbol{x}$ to the line is $r=\frac{y(\boldsymbol{x})}{||\boldsymbol{w}||}$, where $y(\boldsymbol{x})=\boldsymbol{w}^T\boldsymbol{x}+w_0$.

Since these three properties are all basic facts about a line, we only sketch a proof of the third one:

Proof: Let $\boldsymbol{x}_{\perp}$ be the projection of $\boldsymbol{x}$ onto the line.

Using the first property, $\boldsymbol{w}$ is perpendicular to the line, so $\frac{\boldsymbol{w}}{||\boldsymbol{w}||}$ is a unit vector normal to it, and we can write:

$$
\boldsymbol{x}=\boldsymbol{x}_{\perp}+r\frac{\boldsymbol{w}}{||\boldsymbol{w}||}\tag{2}
$$

Substituting equation (2) into the line function $y(\boldsymbol{x})=\boldsymbol{w}^T\boldsymbol{x}+w_0$, and noting that $\boldsymbol{w}^T\boldsymbol{x}_{\perp}+w_0=0$ because $\boldsymbol{x}_{\perp}$ lies on the line, we get:

$$
\begin{aligned}
y(\boldsymbol{x})&=\boldsymbol{w}^T(\boldsymbol{x}_{\perp}+r\frac{\boldsymbol{w}}{||\boldsymbol{w}||})+w_0\\
&=\boldsymbol{w}^T\boldsymbol{x}_{\perp}+\boldsymbol{w}^Tr\frac{\boldsymbol{w}}{||\boldsymbol{w}||}+w_0\\
&=\boldsymbol{w}^Tr\frac{\boldsymbol{w}}{||\boldsymbol{w}||}\\
&=r\frac{||\boldsymbol{w}||^2}{||\boldsymbol{w}||}\\
\end{aligned}\tag{3}
$$

So we have

$$
r=\frac{y(\boldsymbol{x})}{||\boldsymbol{w}||}\tag{4}
$$

Q.E.D

However, the augmented vectors $\boldsymbol{w}= \begin{bmatrix}w_0\\w_1\\ \vdots\\w_d\end{bmatrix}$ and $\boldsymbol{x}= \begin{bmatrix}1\\x_1\\ \vdots\\x_d\end{bmatrix}$ absorb $w_0$ into the weight vector of the boundary equation. In this way, a hyperplane through the origin in the $(d+1)$-dimensional augmented space takes the place of an arbitrary hyperplane in the original $d$-dimensional space.
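
To make these properties concrete, here is a minimal Python sketch; the weights $\boldsymbol{w}$ and bias $w_0$ below are made up purely for illustration. It evaluates $y(\boldsymbol{x})$, computes the signed distance from property 3, and checks that the augmented form gives the same value.

```python
import numpy as np

# Made-up weights for illustration only (not from the post).
w = np.array([2.0, 1.0])   # weight vector, perpendicular to the boundary
w0 = -4.0                  # bias, fixes the boundary's location relative to the origin

def y(x):
    """Discriminant value y(x) = w^T x + w0."""
    return w @ x + w0

def signed_distance(x):
    """Signed perpendicular distance r = y(x) / ||w|| (property 3)."""
    return y(x) / np.linalg.norm(w)

x = np.array([3.0, 1.0])
print(y(x))                # 3.0 > 0, so x lies on the side w points toward
print(signed_distance(x))  # about 1.34; the sign tells us which side of the line x is on

# Equivalent augmented form: absorb w0 into the weight vector.
w_tilde = np.concatenate(([w0], w))    # [w0, w1, ..., wd]
x_tilde = np.concatenate(([1.0], x))   # [1, x1, ..., xd]
print(np.isclose(w_tilde @ x_tilde, y(x)))  # True: the two forms agree
```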

Multiple Classes

Things change when we consider more than 2 classes. The boundaries become more complicated, and there are 3 intuitive strategies for this problem:

1-versus-the-rest Classifier

This strategy needs at least $K-1$ classifiers (boundaries). Each classifier $k$ only decides whether an input belongs to class $\mathcal{C}_k$ or not: one side of its boundary belongs to $\mathcal{C}_k$ and the other side does not. So when we have two boundaries, the situation is as follows:

Here the region $R_4$ is ambiguous: based on the properties of the decision boundary and the definition of classification in the post ‘From Linear Regression to Linear Classification’, region $R_4$ cannot belong to $\mathcal{C}_1$ and $\mathcal{C}_2$ simultaneously.

So the first strategy works for some regions, but it leaves ‘black hole’ regions where the input $\boldsymbol{x}$ belongs to more than one class, and ‘white hole’ regions where the input $\boldsymbol{x}$ belongs to no class (region $R_3$ could be such a region).
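
As a small illustration of this ambiguity, here is a minimal sketch with two made-up binary discriminants (the weights are hypothetical, chosen only for demonstration): one input is claimed by both classifiers, and another by neither.

```python
import numpy as np

# Hypothetical 1-versus-the-rest discriminants, one per class; sign > 0 means "class k".
w1, w10 = np.array([1.0, 0.0]), -1.0   # y1(x) = x1 - 1
w2, w20 = np.array([0.0, 1.0]), -1.0   # y2(x) = x2 - 1

def one_vs_rest(x):
    """Return the list of classes whose classifier claims x."""
    return [k + 1 for k, (w, b) in enumerate([(w1, w10), (w2, w20)]) if w @ x + b > 0]

print(one_vs_rest(np.array([2.0, 2.0])))  # [1, 2]: both classifiers claim x -> ambiguous
print(one_vs_rest(np.array([0.0, 0.0])))  # []: no classifier claims x -> also ambiguous
```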

1-versus-1 Classifier

Another kind of multi-class boundary is the combination of several 1-versus-1 linear decision boundaries. Unlike the 1-versus-the-rest classifier, both sides of each decision boundary belong to a certain class. For a $K$-class task, this requires $K(K-1)/2$ binary discriminant functions.

However, the ambiguity still exists: region $R_4$ belongs to classes $\mathcal{C}_1$, $\mathcal{C}_2$, and $\mathcal{C}_3$ simultaneously.
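
The counting argument and the tie problem can be sketched in a few lines; the pairwise outcomes below are hypothetical, chosen only to show how a majority vote over the $K(K-1)/2$ classifiers can fail to pick a single class.

```python
from itertools import combinations

# One binary discriminant per unordered pair of classes.
K = 3
pairs = list(combinations(range(1, K + 1), 2))
print(pairs)       # [(1, 2), (1, 3), (2, 3)]
print(len(pairs))  # 3 == K * (K - 1) // 2

# Hypothetical pairwise decisions for one input x: each classifier picks a different winner.
pairwise_winner = {(1, 2): 1, (1, 3): 3, (2, 3): 2}
votes = {k: sum(1 for w in pairwise_winner.values() if w == k) for k in range(1, K + 1)}
print(votes)       # {1: 1, 2: 1, 3: 1} -- a three-way tie, so x cannot be assigned a class
```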

$K$ Linear Functions

We use a set of $K$ linear functions:

$$
\begin{aligned}
y_1(\boldsymbol{x})&=\boldsymbol{w}^T_1\boldsymbol{x}+w_{10}\\
y_2(\boldsymbol{x})&=\boldsymbol{w}^T_2\boldsymbol{x}+w_{20}\\
&\vdots \\
y_K(\boldsymbol{x})&=\boldsymbol{w}^T_K\boldsymbol{x}+w_{K0}\\
\end{aligned}\tag{5}
$$

and an input is assigned to class $k$ when $y_k(\boldsymbol{x})>y_j(\boldsymbol{x})$ for all $j\in \{1,2,\cdots,K\}$ with $j\neq k$. According to this definition, the decision boundary between class $k$ and class $j$ is $y_k(\boldsymbol{x})=y_j(\boldsymbol{x})$, where $k,j\in\{1,2,\cdots,K\}$ and $j\neq k$. Then the decision hyperplane is defined as:

$$
(\boldsymbol{w}_k-\boldsymbol{w}_j)^T\boldsymbol{x}+(w_{k0}-w_{j0})=0\tag{6}
$$
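
To make the decision rule concrete, here is a minimal sketch of the $K$-linear-functions classifier; the weight matrix and biases are made up for illustration ($K=3$, two-dimensional inputs), and the rule is simply the argmax of the $K$ discriminant values.

```python
import numpy as np

# Made-up parameters for illustration (K = 3 classes, d = 2 inputs).
W = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])      # row k holds w_k^T
w0 = np.array([0.0, 0.0, 1.0])    # w_k0 for each class

def classify(x):
    """Assign x to the class whose y_k(x) = w_k^T x + w_k0 is largest (equation (5))."""
    scores = W @ x + w0            # y_1(x), ..., y_K(x)
    return int(np.argmax(scores)) + 1

print(classify(np.array([2.0, 0.5])))    # 1, since y_1 = 2.0 is the largest score
print(classify(np.array([0.5, 2.0])))    # 2
print(classify(np.array([-1.0, -1.0])))  # 3, since y_3 = 3.0 beats y_1 = y_2 = -1.0
```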

These decision boundaries separate the input space into $K$ singly connected, convex regions.

Proof:
Choose two points $\boldsymbol{x}_A$ and $\boldsymbol{x}_B$ in the region of class $k$, where $k\in \{1,2,\cdots,K\}$. An arbitrary point on the line segment between $\boldsymbol{x}_A$ and $\boldsymbol{x}_B$ can be written as $\boldsymbol{x}'=\lambda \boldsymbol{x}_A + (1-\lambda)\boldsymbol{x}_B$, where $0\leq\lambda\leq1$. By the linearity of $y_k(\boldsymbol{x})$ we have:

$$
y_k(\boldsymbol{x}')=\lambda y_k(\boldsymbol{x}_A) + (1-\lambda)y_k(\boldsymbol{x}_B)\tag{7}
$$

Because $\boldsymbol{x}_A$ and $\boldsymbol{x}_B$ belong to class $k$, we have $y_k(\boldsymbol{x}_A)>y_j(\boldsymbol{x}_A)$ and $y_k(\boldsymbol{x}_B)>y_j(\boldsymbol{x}_B)$ for every $j\neq k$. Since $\boldsymbol{x}'$ is a convex combination of $\boldsymbol{x}_A$ and $\boldsymbol{x}_B$, equation (7) gives $y_k(\boldsymbol{x}')>y_j(\boldsymbol{x}')$, so the region of class $k$ is convex.

Q.E.D

The last strategy works well. What remains is to estimate the parameters of the model. The best-known approaches for estimating the parameters of discriminant functions are:

  1. Least squares
  2. Fisher’s linear discriminant
  3. Perceptron algorithm

 

References

 


1 Bishop, Christopher M. *Pattern Recognition and Machine Learning*. Springer, 2006.