The post [Review] ImageNet Classification with Deep Convolutional Neural Networks appeared first on Brain Bomb.


All the figures and tables in this post come from ‘ImageNet Classification with Deep Convolutional Neural Networks’^{1}

- Large training sets are available
- GPU
- Convolutional neural networks, such as ‘Handwritten digit recognition with back-propagation networks’^{2}

- Controlling the capacity of CNNs by varying their depth and breadth.
- CNNs can make strong and mostly correct assumptions about the nature of images.

- Training one of the largest convolutional neural networks on ImageNet ILSVRC-2010
- architecture

- architecture
- GPU is used
- Some new features of CNNs improving performance
- ReLU Nonlinearity $f(x)= max(0,x)$
- ReLU is much faster than $\tanh(x)$ or $\frac{1}{1+e^{-x}}$
- Jarrett et al.^{3} had tried other nonlinear functions, such as $f(x)=|\tanh(x)|$, to prevent overfitting

- Local Response Normalization
- ReLUs do not require input normalization to prevent them from saturating.
- Local response normalization scheme aids generalization:

$$
b^{i}_{x,y}=\frac{a^i_{x,y}}{\left(k+\alpha\sum^{\min(N-1,\,i+n/2)}_{j=\max(0,\,i-n/2)}(a^j_{x,y})^2\right)^\beta}
$$

- where $a^i_{x,y}$ is the activity of a neuron computed by applying kernel $i$ at position $(x,y)$ and then applying the ReLU nonlinearity, and $b^i_{x,y}$ is the response-normalized activity. $n$ is the number of ‘adjacent’ kernel maps at the same spatial position and $N$ is the total number of kernels in the layer
- $k$, $n$, $\alpha$ and $\beta$ are hyper-parameters whose values are determined using a validation set; $k=2$, $n=5$, $\alpha = 10^{-4}$ and $\beta =0.75$ are used
- normalization is applied after the ReLU
- 13% error without normalization and 11% with normalization
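The normalization scheme above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the paper's implementation: the function name and the `(kernels, height, width)` activation layout are my own choices.

```python
import numpy as np

def local_response_norm(a, k=2.0, alpha=1e-4, beta=0.75, n=5):
    """Local response normalization across kernel maps.

    a: ReLU activations of shape (N_kernels, H, W).
    k, alpha, beta, n: hyper-parameters; defaults are the paper's values.
    """
    N = a.shape[0]
    b = np.empty_like(a)
    for i in range(N):
        # sum squared activities over up to n 'adjacent' kernel maps
        lo = max(0, i - n // 2)
        hi = min(N - 1, i + n // 2)
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b
```

With the paper's hyper-parameters the denominator is always greater than 1, so normalization only ever shrinks an activation.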

- Overlapping Pooling
- Traditionally, the neighborhoods summarized by adjacent pooling units do not overlap
- using $stride = 2$ and $kernel\,size=3$ reduces the error rate by 0.4%
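Overlapping pooling can be sketched as below (the function name and shape convention are my own): with $kernel\,size=3$ and $stride=2$, adjacent 3×3 neighborhoods share a row or column.

```python
import numpy as np

def max_pool2d(x, kernel_size=3, stride=2):
    """Max pooling over a 2-D map; kernel_size > stride makes neighborhoods overlap."""
    h, w = x.shape
    out_h = (h - kernel_size) // stride + 1
    out_w = (w - kernel_size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i * stride:i * stride + kernel_size,
                          j * stride:j * stride + kernel_size].max()
    return out
```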

- preventing overfitting
- Data augmentation
- extracting random patches from a bigger image
- horizontal reflections
- at test time, extract five patches (the four corner patches and the center patch) as well as their horizontal reflections, and average the predictions (softmax outputs) over all ten patches
- altering the intensities of the RGB channels in training images, with PCA applied first:

$$
\begin{bmatrix}\boldsymbol{p}_1&\boldsymbol{p}_2&\boldsymbol{p}_3\end{bmatrix}
\begin{bmatrix}\alpha_1 \lambda_1\\\alpha_2 \lambda_2\\\alpha_3 \lambda_3\end{bmatrix}
$$

- where $\boldsymbol{p}_i$ is the $i$th eigenvector, $\lambda_i$ the $i$th eigenvalue, and $\alpha_i$ a random variable. For a particular training image, $\alpha_i$ is drawn only once; when the image is used again, $\alpha_i$ may be redrawn
- dropout: select some neurons and set their outputs to $0$ with probability 0.5. This means that for each input, a different architecture is used.
- dropout is used in the first two fully-connected layers in figure 2.
- without dropout the network exhibits substantial overfitting.
- dropout roughly doubles the number of iterations required to converge.
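The RGB-intensity alteration above can be sketched as follows. This is an illustrative version with names of my own: in the paper the eigen-decomposition is computed once over all training-set pixels (here it is computed per image for self-containedness), and $\alpha_i$ is drawn from a Gaussian with standard deviation 0.1.

```python
import numpy as np

def pca_color_shift(image, rng, sigma=0.1):
    """Add [p1 p2 p3][a1*l1, a2*l2, a3*l3]^T to every pixel of an RGB image."""
    pixels = image.reshape(-1, 3)
    cov = np.cov(pixels, rowvar=False)      # 3x3 covariance of RGB values
    eigvals, eigvecs = np.linalg.eigh(cov)  # lambda_i and p_i
    alpha = rng.normal(0.0, sigma, size=3)  # drawn once per use of the image
    shift = eigvecs @ (alpha * eigvals)     # [p1 p2 p3][alpha_i * lambda_i]
    return image + shift                    # broadcast over all pixels
```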


- batch size of 128 examples, momentum 0.9, and weight decay of 0.0005:

$$
\begin{aligned}
v_{i+1} &:= 0.9 \cdot v_i - 0.0005\cdot\epsilon\cdot w_i-\epsilon\left(\frac{\partial L}{\partial w}\Big|_{w_i}\right)_{D_i}\\
w_{i+1} &:= w_i + v_{i+1}
\end{aligned}
$$

- weight decay is not merely a regularizer: it reduces the model’s training error
- where $i$ is iteration index, $v$ is the momentum variable, $\epsilon$ is the learning rate, $(\frac{\partial L}{\partial w}|_{w_i})_{D_i}$ is the average gradient of batch $D_i$ with respect to $w$, evaluated at $w_i$

- equal learning rate for each layer.
- when the validation error rate stopped improving, divide the learning rate by $10$.
- learning rate initialized at $0.01$
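One step of the update rule above can be sketched as follows (the function and variable names are my own):

```python
import numpy as np

def sgd_step(w, v, grad, lr=0.01, momentum=0.9, weight_decay=0.0005):
    """One momentum + weight-decay update:
    v <- 0.9*v - 0.0005*lr*w - lr*grad,  w <- w + v."""
    v_next = momentum * v - weight_decay * lr * w - lr * grad
    w_next = w + v_next
    return w_next, v_next
```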

This paper revived CNNs in 2012. GPUs were an important enabler for training CNNs.


The post [Review] Learning algorithms for classification-a comparison on handwritten digit recognition appeared first on Brain Bomb.


All the figures in this post come from ‘Learning algorithms for classification-a comparison on handwritten digit recognition’^{1}

- Convolutional network^{2}

- Raw accuracy, training time, recognition time and memory requirements should be considered in classification.
- The experiments and comparisons illuminate which method is better
- Selected competitors(Baseline)
- Linear Classifier
- Nearest Neighbor Classifier
- Large Fully Connected Multi-Layer Neural Network
- LeNet(1,4,5)
- Boosted LeNet 4
- Tangent Distance Classifier(TDC)
- LeNet 4 with K-Nearest Neighbors
- Local Learning with LeNet 4
- Optimal Margin Classifier(OMC)


- Listing some data sets that were used in recognition
- Details in comparison:
- large fully connected multi-layer neural network
- It is over-parameterized but still works well, thanks to some built-in “self-regularization” mechanism. This is due to the nature of the error surface: gradient-descent training invariably goes through a phase where the weights are small. Small weights cause the sigmoid (activation function) to operate in its quasi-linear region, making the network essentially equivalent to a low-capacity, single-layer network. (This needs more empirical evidence.)

- LeNet 1
- Convolutional neural network
- first few layers:
- local ‘receptive field’
- the output of the convolution is called ‘feature map’
- followed by a squashing function

- share a single weight vector(weight sharing technique)
- reduce the number of free parameters
- shift-invariance
- need multiple feature maps, extracting different feature types from the same image
- weights are trained by gradient descent

- local, convolutional feature maps in hidden layers
- increasing complexity and abstraction
- higher-level features require less precise coding of their location
- local averaging and subsampling are used to reduce the resolution of the feature map
- invariance to distortions
- the resulting architecture is a ‘bi-pyramid’

- 1.7% error in the MNIST test
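The shared-weight feature map described above can be sketched as a plain 2-D ‘valid’ convolution followed by a squashing function. This is an illustration, not LeNet's actual code; using tanh as the squashing function is an assumption of mine.

```python
import numpy as np

def feature_map(image, kernel, bias=0.0):
    """Slide one shared weight kernel over the image (valid convolution),
    then apply a squashing function. Every output unit reuses the same
    weights, which is what cuts the number of free parameters."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel) + bias
    return np.tanh(out)
```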

- LeNet 4

- expanded version of LeNet 1: the input grows from 28×28 to 32×32
- 1.1% error in MNIST test

- LeNet 5

- more feature maps
- a large fully-connected layer
- a distributed representation to encode the categories at the output layer rather than “1 of N”
- 0.9% error in the MNIST test

- Boosted LeNet 4

- insufficient data to train 3 models separately
- affine transformations and line-thickness variations were used to augment the training set
- 0.7% error in the MNIST test


This is another of LeCun’s papers, following his 1990 work on training convolutional neural networks by back-propagation. The experiments are the kernel of this paper, and some techniques mentioned here are still employed today, such as training-data augmentation. In 1995, LeCun tried to improve CNNs through different activation functions, different numbers of feature maps in each layer, different training data, and different combining methods. He concluded that as training databases grow, CNNs will become more striking.


The post [Combining Models] Boosting and AdaBoost appeared first on Brain Bomb.

The committee gives an equal weight to every prediction from all models, and it gives little improvement over a single model. Boosting was built for this problem. Boosting is a technique for combining multiple ‘base’ classifiers to produce a form of committee that:

- performs better than any of the base classifiers, and
- gives each base classifier a different weight factor

AdaBoost is short for adaptive boosting. It combines several weak classifiers, each of which is only slightly better than random guessing, and it gives better performance than the committee. The base classifiers in AdaBoost are trained in sequence on the same training set but with different weights, so if we consider the training-data distribution, the distribution faced by each weak classifier is different. **This might be an important reason for AdaBoost’s improvement over the committee.** The weights for the weak classifiers are generated depending on the performance of the previous classifier, and the weak classifiers are trained one by one. During prediction, the input data flows from classifier to classifier, and the final result is a weighted combination of the outputs of all weak classifiers.

Key points of the AdaBoost algorithm are:

- data points misclassified by the current classifier are given greater weight
- once the algorithm is trained, the predictions of all classifiers are combined through a weighted majority voting scheme:

$$Y_M(\boldsymbol{x}) = \mathrm{sign}(\sum_{m=1}^{M}\alpha_my_m(\boldsymbol{x}))$$

where $w_n^{(1)}$ is the initial weight of the input data for the $1$st weak classifier, $y_1(x)$ is the prediction of the $1$st weak classifier, and $\alpha_m$ is the weight of each prediction (notably this weight belongs to $y_m(x)$, while $w_n^{(1)}$ belongs to the input data of the first classifier). The final output is the sign of the weighted sum of all predictions.

The procedure of the algorithm is:

- Initial data weighting coefficients $\{\boldsymbol{w}_n\}$ by $w_n^{(1)}=\frac{1}{N}$ for $n=1,2,\cdots,N$
- For $m=1,\dots,M$:

- Fit a classifier $y_m(\boldsymbol{x})$ to the training set by minimizing the weighted error function:

$$J_m=\sum_{n=1}^{N}w_n^{(m)}I(y_m(\boldsymbol{x}_n)\neq t_n)$$

where $I(y_m(\boldsymbol{x}_n)\neq t_n)$ is the indicator function, equal to 1 when $y_m(\boldsymbol{x}_n)\neq t_n$ and 0 otherwise

- Evaluate the quantities:

$$\epsilon_m=\frac{\sum_{n=1}^Nw_n^{(m)}I(y_m(\boldsymbol{x}_n)\neq t_n)}{\sum_{n=1}^{N}w_n^{(m)}}$$

and then use this to evaluate $\alpha_m=\ln \{\frac{1-\epsilon_m}{\epsilon_m}\}$

- Update the data weighting coefficients:

$$w_n^{(m+1)}=w_n^{(m)}\exp\{\alpha_mI(y_m(\boldsymbol{x}_n)\neq t_n)\}$$

- Make predictions using the final model, which is given by:

$$Y_M = \mathrm{sign} (\sum_{m=1}^{M}\alpha_my_m(x))$$

This procedure comes from ‘Pattern recognition and machine learning’^{1}
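A tiny numeric walk-through of one round of this procedure may help fix the formulas. The prediction vector below is made up: with 4 equally weighted points and one mistake, $\epsilon_m=0.25$, $\alpha_m=\ln 3$, and the misclassified point's weight triples.

```python
import numpy as np

N = 4
w = np.ones(N) / N                   # initial weights w_n^(1) = 1/N
y_pred = np.array([1, 1, -1, -1])    # hypothetical weak-classifier output
t = np.array([1, -1, -1, -1])        # targets: one mistake (index 1)

miss = (y_pred != t).astype(float)   # indicator I(y_m(x_n) != t_n)
eps = np.sum(w * miss) / np.sum(w)   # weighted error: 0.25
alpha = np.log((1 - eps) / eps)      # alpha_m = ln(3)
w = w * np.exp(alpha * miss)         # misclassified point's weight: 0.25 -> 0.75
```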

```python
import numpy as np


# weak classifier:
# test each dimension, each value, and each direction to find the
# best threshold and direction ('<' or '>')
class Stump():
    def __init__(self):
        self.feature = 0
        self.threshold = 0
        self.direction = '<'

    def loss(self, y_hat, y, weights):
        """weighted 0-1 loss: sum of the weights of misclassified points"""
        loss_sum = 0
        example_size = y.shape[0]
        for i in range(example_size):
            if y_hat[i] != y[i]:
                loss_sum += weights[i]
        return loss_sum

    def test_in_training(self, x, feature, threshold, direction='<'):
        """classify x with a given feature, threshold and direction"""
        example_size = x.shape[0]
        classification_result = -np.ones(example_size)
        for i in range(example_size):
            if direction == '<':
                if x[i][feature] < threshold:
                    classification_result[i] = 1
            else:
                if x[i][feature] > threshold:
                    classification_result[i] = 1
        return classification_result

    def test(self, x):
        """classify x with the learned feature, threshold and direction"""
        return self.test_in_training(x, self.feature, self.threshold, self.direction)

    def training(self, x, y, weights):
        """exhaustive search for the feature, threshold and direction
        with the minimal weighted loss"""
        example_size = x.shape[0]
        example_dimension = x.shape[1]
        loss_matrix_less = np.zeros(np.shape(x))
        loss_matrix_more = np.zeros(np.shape(x))
        for i in range(example_dimension):
            for j in range(example_size):
                results_ji_less = self.test_in_training(x, i, x[j][i], '<')
                results_ji_more = self.test_in_training(x, i, x[j][i], '>')
                loss_matrix_less[j][i] = self.loss(results_ji_less, y, weights)
                loss_matrix_more[j][i] = self.loss(results_ji_more, y, weights)
        loss_matrix_less_min = np.min(loss_matrix_less)
        loss_matrix_more_min = np.min(loss_matrix_more)
        if loss_matrix_less_min > loss_matrix_more_min:
            minimum_position = np.where(loss_matrix_more == loss_matrix_more_min)
            self.threshold = x[minimum_position[0][0]][minimum_position[1][0]]
            self.feature = minimum_position[1][0]
            self.direction = '>'
        else:
            minimum_position = np.where(loss_matrix_less == loss_matrix_less_min)
            self.threshold = x[minimum_position[0][0]][minimum_position[1][0]]
            self.feature = minimum_position[1][0]
            self.direction = '<'


class Adaboost():
    def __init__(self, maximum_classifier_size):
        self.max_classifier_size = maximum_classifier_size
        self.classifiers = []
        self.alpha = np.ones(self.max_classifier_size)

    def training(self, x, y, classifier_class):
        """main AdaBoost loop: fit weak classifiers in sequence and
        reweight the data after each one (here we use the stump above)"""
        example_size = x.shape[0]
        weights = np.ones(example_size) / example_size
        for i in range(self.max_classifier_size):
            classifier = classifier_class()
            classifier.training(x, y, weights)
            test_res = classifier.test(x)
            indicator = np.zeros(len(weights))
            for j in range(len(indicator)):
                if test_res[j] != y[j]:
                    indicator[j] = 1
            cost_function = np.sum(weights * indicator)
            epsilon = cost_function / np.sum(weights)
            self.alpha[i] = np.log((1 - epsilon) / epsilon)
            self.classifiers.append(classifier)
            weights = weights * np.exp(self.alpha[i] * indicator)

    def predictor(self, x):
        """weighted majority vote of all weak classifiers"""
        example_size = x.shape[0]
        results = np.zeros(example_size)
        for i in range(example_size):
            y = np.zeros(self.max_classifier_size)
            for j in range(self.max_classifier_size):
                y[j] = self.classifiers[j].test(x[i].reshape(1, -1))
            results[i] = np.sign(np.sum(self.alpha * y))
        return results
```

The entire project can be found at https://github.com/Tony-Tan/ML. And please star it! Thanks!

When we use different numbers of classifiers, the results of the algorithm are like:

where the blue circles are correct classifications of class 1 and the red circles are correct classifications of class 2. The blue crosses belong to class 2 but were classified into class 1, and similarly for the red crosses.

A 40-classifier AdaBoost gives a relatively good prediction:

where there is only one misclassified point.


The post [Combining Models] Committees appeared first on Brain Bomb.

The committee is a natural inspiration for how to combine several models (or, we can say, how to combine the outputs of several models). For example, we can combine all the models by:

$$
y_{COM}(X)=\frac{1}{M}\sum_{m=1}^My_m(X)\tag{1}
$$

However, we want to analyze whether this averaged prediction is better than a single one of the models.

To compare the committee and a single model, we first need a criterion by which we can distinguish which model is better. Assume the true generator of the training data $x$ is:

$$
h(x)\tag{2}
$$

Then the prediction of the $m$th model, for $m=1,2,\cdots,M$, can be represented as:

$$
y_m(x) = h(x) +\epsilon_m(x)\tag{3}
$$

and the average sum-of-squares error is a natural criterion. The criterion for a single model is:

$$
\mathbb{E}_x[(y_m(x)-h(x))^2] = \mathbb{E}_x[\epsilon_m(x)^2] \tag{4}
$$

where $\mathbb{E}_x[\cdot]$ is the frequentist expectation. To make the criterion more concrete, we consider the average error over the $M$ models:

$$
E_{AV} = \frac{1}{M}\sum_{m=1}^M\mathbb{E}_x[\epsilon_m(x)^2]\tag{5}
$$

On the other hand, the committee has the error given by equations (1), (3) and (4):

$$
\begin{aligned}
E_{COM}&=\mathbb{E}_x[(\frac{1}{M}\sum_{m=1}^My_m(x)-h(x))^2] \\
&=\mathbb{E}_x[\{\frac{1}{M}\sum_{m=1}^M\epsilon_m(x)\}^2]
\end{aligned} \tag{6}
$$

Now assume that the random variables $\epsilon_i(x)$ for $i=1,2,\cdots,M$ have **zero mean** and are **uncorrelated**, so that:

$$
\begin{aligned}
\mathbb{E}_x[\epsilon_m(x)]&=0 &\\
\mathbb{E}_x[\epsilon_m(x)\epsilon_l(x)]&=0,&m\neq l
\end{aligned} \tag{7}
$$

Then, substituting equation (7) into equation (6), we get:

$$
E_{COM}=\frac{1}{M^2}\sum_{m=1}^M\mathbb{E}_x[\epsilon_m(x)^2]\tag{8}
$$

According to equations (5) and (8):

$$
E_{COM}=\frac{1}{M}E_{AV}\tag{9}
$$

**All the mathematics above is based on the assumption that the errors of the individual models are uncorrelated**. However, most of the time they are highly correlated and the reduction in error is generally small. But the relation:

$$
E_{COM}\leq E_{AV}\tag{10}
$$

always holds: it follows from Jensen’s inequality applied to the convex function $x^2$. Boosting was then established to make the combined model more powerful.
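The factor-of-$M$ behavior under the zero-mean, uncorrelated assumption can be checked numerically; the Monte Carlo sketch below uses arbitrary sample sizes of my own choosing.

```python
import numpy as np

rng = np.random.default_rng(0)
M, S = 10, 200_000                          # M models, S sample points
eps = rng.normal(0.0, 1.0, size=(M, S))     # independent zero-mean errors

E_av = np.mean(eps ** 2)                    # average single-model error, about 1
E_com = np.mean(np.mean(eps, axis=0) ** 2)  # committee error, about 1/M
ratio = E_av / E_com                        # should be close to M
```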


The post [Combining Models] Bayesian Model Averaging(BMA) and Combining Models appeared first on Brain Bomb.

Bayesian model averaging (BMA) is another widely used method that looks very much like combining models. However, the difference between BMA and combining models is significant.

Bayesian model averaging is a Bayesian formula in which the random variables are models (hypotheses) $h=1,2,\cdots,H$ with prior probabilities $p(h)$; the marginal distribution over the data $X$ is then:

$$
p(X)=\sum_{h=1}^{H}p(X|h)p(h)
$$

BMA is used to select the model (hypothesis) that models the data best through Bayesian theory. When we have a larger $X$, the posterior probability

$$
p(h|X)=\frac{p(X|h)p(h)}{\sum_{i=1}^{H}p(X|i)p(i)}
$$

becomes sharper, and we obtain a good hypothesis.
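The sharpening of the posterior can be illustrated with a toy example: two hypothetical hypotheses about a coin's bias, compared at two sample sizes. All numbers here are made up for illustration.

```python
import numpy as np

def posterior(n_heads, n_tails, biases=(0.5, 0.7), prior=(0.5, 0.5)):
    """p(h|X) for two hypotheses about a coin's bias."""
    like = np.array([b ** n_heads * (1 - b) ** n_tails for b in biases])
    post = like * np.array(prior)
    return post / post.sum()

small = posterior(7, 3)    # 10 tosses: posterior still spread out
large = posterior(70, 30)  # 100 tosses: posterior concentrates on bias 0.7
```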

In the post ‘Mixtures of Gaussians’, we have seen how a mixture of Gaussians works. The joint distribution of the input data $\boldsymbol{x}$ and the latent variable $\boldsymbol{z}$ is:

$$
p(\boldsymbol{x},\boldsymbol{z})
$$

and the marginal distribution of $\boldsymbol{x}$ is

$$
p(\boldsymbol{x})=\sum_{\boldsymbol{z}}p(\boldsymbol{x},\boldsymbol{z})
$$

For the mixture of Gaussians:

$$
p(\boldsymbol{x})=\sum_{k=1}^{K}\pi_k\mathcal{N}(\boldsymbol{x}|\boldsymbol{\mu}_k,\Sigma_k)
$$

the latent variable $\boldsymbol{z}$ is designed so that:

$$
p(z_k) = \pi_k
$$

for $k=\{1,2,\cdots,K\}$. And $z_k\in\{0,1\}$ is a $1$-of-$K$ representation.

Then this mixture of Gaussians is a kind of combining model. Each time, only one $k$ is selected (since $\boldsymbol{z}$ is a $1$-of-$K$ representation). An example of a mixture of Gaussians and its original curve looks like:

And the latent variables $\boldsymbol{z}$ separate the whole distribution into several Gaussian distributions:

This is the simplest combining model, where each expert is a Gaussian model. During the voting, only the one model selected by $\boldsymbol{z}$ makes the final decision.
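Sampling from the mixture makes the role of $\boldsymbol{z}$ explicit: each draw first picks exactly one component via $\boldsymbol{z}$, then samples from that Gaussian. A 1-D sketch with made-up parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
pi = np.array([0.3, 0.7])      # mixing coefficients p(z_k = 1) = pi_k
mu = np.array([-2.0, 3.0])     # component means (1-D for simplicity)
sigma = np.array([0.5, 1.0])   # component standard deviations

# ancestral sampling: draw z (1-of-K), then x from the selected Gaussian
z = rng.choice(len(pi), size=10_000, p=pi)
x = rng.normal(mu[z], sigma[z])
```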

A combining-model method contains several models and predicts by voting or other rules. Bayesian model averaging, by contrast, is used to select a hypothesis from several candidates.


The post [Combining Models] An Introduction to Combining Models appeared first on Brain Bomb.

The mixture of Gaussians was discussed in the post ‘Mixtures of Gaussians’. It can not only be used to introduce the ‘EM algorithm’ but also contains a strategy to improve model performance. All the models we have studied, besides neural networks, are single-distribution models. It is as if, to solve a problem, we invite one expert who is very good at that problem and just do what the expert says. However, if our problem is so hard that no expert can deal with it alone, it is natural to think about inviting more experts. This inspiration gives a new way to improve performance: combining multiple models, rather than just improving a single model.

A naive idea is to let several models vote equally, which means averaging the predictions of all models. However, different models have different abilities, so voting equally is not a good idea. Boosting and other methods were therefore introduced.

In some combining methods, such as AdaBoost (boosting), bootstrap, and bagging, the input data has the same distribution as the training set. In other methods, however, the training set is cut into several subsets whose distributions differ from the original training set. The decision tree is such a method: a decision tree is a sequence of binary selections, and it can be employed in both regression and classification tasks.

We will briefly discuss these methods in the following posts.


The post [Mixture Models] EM Algorithm appeared first on Brain Bomb.

Maximum likelihood cannot be applied to the Gaussian mixture model directly, because of the severe defects we came across in ‘Maximum Likelihood of Gaussian Mixtures’. Inspired by K-means, a two-step algorithm was developed.

The objective function is the log-likelihood function:

$$
\begin{aligned}
\ln p(X|\boldsymbol{\pi},\boldsymbol{\mu},\Sigma)&=\ln \left(\prod_{n=1}^N\sum_{j=1}^{K}\pi_j\mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_j,\Sigma_j)\right)\\
&=\sum_{n=1}^{N}\ln \sum_{j=1}^{K}\pi_j\mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_j,\Sigma_j)
\end{aligned}\tag{1}
$$

At a maximum of the log-likelihood, the partial derivatives with respect to the parameters must be 0. So we calculate the partial derivative with respect to $\boldsymbol{\mu}_k$:

$$
\begin{aligned}
\frac{\partial \ln p(X|\pi,\mu,\Sigma)}{\partial \mu_k}&=\sum_{n=1}^N\frac{-\pi_k \mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\Sigma_k)\Sigma_k^{-1}(\boldsymbol{x}_n-\boldsymbol{\mu}_k)}{\sum_{j=1}^{K}\pi_j\mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_j,\Sigma_j)}\\
&=-\sum_{n=1}^N\frac{\pi_k \mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\Sigma_k)}{\sum_{j=1}^{K}\pi_j\mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_j,\Sigma_j)}\Sigma_k^{-1}(\boldsymbol{x}_n-\boldsymbol{\mu}_k)
\end{aligned}\tag{2}
$$

and then set equation (2) equal to 0 and rearrange it as:

$$
\begin{aligned}
\sum_{n=1}^N\frac{\pi_k \mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\Sigma_k)}{\sum_{j=1}^{K}\pi_j\mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_j,\Sigma_j)}\boldsymbol{x}_n&=\sum_{n=1}^N\frac{\pi_k \mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\Sigma_k)}{\sum_{j=1}^{K}\pi_j\mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_j,\Sigma_j)}\boldsymbol{\mu}_k
\end{aligned}\tag{3}
$$

In the post ‘Mixtures of Gaussians’, we had defined:

$$
\gamma_{nk}=p(z_k=1|\boldsymbol{x}_n)=\frac{\pi_k \mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\Sigma_k)}{\sum_{j=1}^{K}\pi_j\mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_j,\Sigma_j)}\tag{4}
$$

called the responsibility. Substituting equation (4) into equation (3):

$$
\begin{aligned}
\sum_{n=1}^N\gamma_{nk}\boldsymbol{x}_n&=\sum_{n=1}^N\gamma_{nk}\boldsymbol{\mu}_k\\
\sum_{n=1}^N\gamma_{nk}\boldsymbol{x}_n&=\boldsymbol{\mu}_k\sum_{n=1}^N\gamma_{nk}\\
\boldsymbol{\mu}_k&=\frac{\sum_{n=1}^N\gamma_{nk}\boldsymbol{x}_n}{\sum_{n=1}^N\gamma_{nk}}
\end{aligned}\tag{5}
$$

and to simplify equation (5) we define:

$$
N_k = \sum_{n=1}^N\gamma_{nk}\tag{6}
$$

Then the equation (5) can be simplified as:

$$
\boldsymbol{\mu}_k=\frac{1}{N_k}\sum_{n=1}^N\gamma_{nk}\boldsymbol{x}_n\tag{7}
$$

The same calculation can be done for $\frac{\partial \ln p(X|\pi,\mu,\Sigma)}{\partial \Sigma_k}=0$:

$$
\Sigma_k = \frac{1}{N_k}\sum_{n=1}^N\gamma_{nk}(\boldsymbol{x}_n - \boldsymbol{\mu}_k)(\boldsymbol{x}_n - \boldsymbol{\mu}_k)^T\tag{8}
$$

The situation for $\pi_k$, however, is a little more complex, for it has a constraint:

$$
\sum_{k=1}^K \pi_k = 1 \tag{9}
$$

so a Lagrange multiplier is employed and the objective function becomes:

$$
\ln p(X|\boldsymbol{\pi},\boldsymbol{\mu},\Sigma)+\lambda \left(\sum_{k=1}^K \pi_k-1\right)\tag{10}
$$

Setting the partial derivative of equation (10) with respect to $\pi_k$ to 0:

$$
0 = \sum_{n=1}^N\frac{\mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\Sigma_k)}{\sum_{j=1}^{K}\pi_j\mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_j,\Sigma_j)}+\lambda\tag{11}
$$

Multiplying both sides by $\pi_k$ and summing over $k$:

$$
\begin{aligned}
0 &= \sum_{k=1}^K\left(\sum_{n=1}^N\frac{\pi_k\mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\Sigma_k)}{\sum_{j=1}^{K}\pi_j\mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_j,\Sigma_j)}+\lambda\pi_k\right)\\
0&=\sum_{k=1}^K\sum_{n=1}^N\frac{\pi_k\mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\Sigma_k)}{\sum_{j=1}^{K}\pi_j\mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_j,\Sigma_j)}+\sum_{k=1}^K\lambda\pi_k\\
0&=\sum_{n=1}^N\sum_{k=1}^K\gamma_{nk}+\lambda\sum_{k=1}^K\pi_k\\
\lambda &= -N
\end{aligned}\tag{12}
$$

The last step of equation (12) holds because $\sum_{k=1}^K\pi_k=1$ and $\sum_{k=1}^K\gamma_{nk}=1$.

Then we substitute equation (12) into equation (11):

$$
\begin{aligned}
0 &= \sum_{n=1}^N\frac{\mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\Sigma_k)}{\sum_{j=1}^{K}\pi_j\mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_j,\Sigma_j)}-N\\
N &= \frac{1}{\pi_k}\sum_{n=1}^N\gamma_{nk}\\
\pi_k&=\frac{N_k}{N}
\end{aligned}\tag{13}
$$

The last step of equation (13) follows from the definition in equation (6).

Equations (5), (8) and (13) do not constitute a closed-form solution, because, for example, both sides of equation (5) contain the parameter $\boldsymbol{\mu}_k$ (the responsibilities $\gamma_{nk}$ depend on it).

However, the equations suggest an iterative scheme for finding a solution, consisting of two steps, expectation and maximization:

- E step: calculating the posterior probability of equation (4) with the current parameter
- M step: update parameters by equation (5), (8) and (13)

The initial values of the parameters can be selected randomly, but other tricks are often used, such as K-means initialization. And the stopping condition can be one of:

- increase of log-likelihood falls below some threshold
- change of parameters falls below some threshold.

The input data should be normalized as we did in ‘K-means algorithm’.

```python
import numpy as np
import k_means  # K-means implementation from the same project, used for initialization


def Gaussian(x, u, variance):
    """multivariate Gaussian density N(x | u, variance)"""
    k = len(x)
    return np.power(2 * np.pi, -k / 2.) \
        * np.power(np.linalg.det(variance), -1 / 2) \
        * np.exp(-0.5 * (x - u).dot(np.linalg.inv(variance)).dot((x - u).transpose()))


class EM():
    def mixed_Gaussian(self, x, pi, u, covariance):
        """mixture density: sum_k pi_k N(x | u_k, Sigma_k)"""
        res = 0
        for i in range(len(pi)):
            res += pi[i] * Gaussian(x, u[i], covariance[i])
        return res

    def clusturing(self, x, d, initial_method='K_Means'):
        data_dimension = x.shape[1]
        data_size = x.shape[0]
        if initial_method == 'K_Means':
            km = k_means.K_Means()
            # K-means gives the initial mean vectors; each row is a mean vector's transpose
            centers, cluster_for_each_point = km.clusturing(x, d)
        # initial mixing coefficients pi
        pi = np.ones(d) / d
        # initial covariances
        covariance = np.zeros((d, data_dimension, data_dimension))
        for i in range(d):
            covariance[i] = np.identity(data_dimension) / 10.0
        responsibility = np.zeros((data_size, d))
        log_likelihood = 0
        for dummy in range(1, 1000):
            log_likelihood_last_time = log_likelihood
            # E step: calculate responsibilities, equation (4)
            for i in range(data_size):
                responsibility_numerator = np.zeros(d)
                responsibility_denominator = 0
                for j in range(d):
                    responsibility_numerator[j] = pi[j] * Gaussian(x[i], centers[j], covariance[j])
                    responsibility_denominator += responsibility_numerator[j]
                for j in range(d):
                    responsibility[i][j] = responsibility_numerator[j] / responsibility_denominator
            # M step: update parameters, equations (5), (8) and (13)
            N_k = np.zeros(d)
            for j in range(d):
                for i in range(data_size):
                    N_k[j] += responsibility[i][j]
            for i in range(d):
                # mean: weighted sum of x, equation (7)
                sum_r_x = 0
                for j in range(data_size):
                    sum_r_x += responsibility[j][i] * x[j]
                if N_k[i] != 0:
                    centers[i] = 1 / N_k[i] * sum_r_x
                # covariance: weighted sum of outer products, equation (8)
                sum_r_v = np.zeros((data_dimension, data_dimension))
                for j in range(data_size):
                    temp = (x[j] - centers[i]).reshape(1, -1)
                    temp_T = (x[j] - centers[i]).reshape(-1, 1)
                    sum_r_v += responsibility[j][i] * (temp_T.dot(temp))
                if N_k[i] != 0:
                    covariance[i] = 1 / N_k[i] * sum_r_v
                # mixing coefficient, equation (13)
                pi[i] = N_k[i] / data_size
            # stop when the increase of the log-likelihood falls below a threshold
            log_likelihood = 0
            for i in range(data_size):
                log_likelihood += np.log(self.mixed_Gaussian(x[i], pi, centers, covariance))
            if np.abs(log_likelihood - log_likelihood_last_time) < 0.001:
                break
            print(log_likelihood_last_time)
        return pi, centers, covariance
```

The entire project can be found at https://github.com/Tony-Tan/ML; please star it.

The progress of EM (initialized with K-means):

and the final result is:

where each ellipse represents a covariance matrix: the axes of the ellipse point along the eigenvectors of the covariance matrix, and their lengths correspond to the eigenvalues.
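The relationship between the ellipse axes and the covariance matrix can be checked numerically with an eigendecomposition. A minimal sketch, not part of the original project; the covariance values are made up:

```python
import numpy as np

# a made-up 2-D covariance matrix for illustration
cov = np.array([[2.0, 0.8],
                [0.8, 1.0]])

# for symmetric matrices, eigh returns real eigenvalues in ascending order
# and orthonormal eigenvectors as columns
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# the ellipse axes point along the eigenvectors; their lengths correspond
# to the eigenvalues (their square roots, for a constant-density contour)
major_axis_direction = eigenvectors[:, np.argmax(eigenvalues)]
major_axis_length = eigenvalues.max()

print(major_axis_direction, major_axis_length)
```

Plotting libraries follow the same recipe when drawing covariance ellipses.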

The post [Mixture Models] EM Algorithm appeared first on Brain Bomb.


Gaussian mixtures were discussed in ‘Mixtures of Gaussians’. Once we have training data and a hypothesis, the next step is to estimate the parameters of the model. Both kinds of parameters of a mixture of Gaussians

$$

p(\boldsymbol{x})= \sum_{k=1}^{K}\pi_k\mathcal{N}(\boldsymbol{x}|\boldsymbol{\mu}_k,\Sigma_k)\tag{1}

$$


and latent variables:

$$

\boldsymbol{z}\tag{2}

$$


that we defined need to be estimated. When we investigated the linear model in ‘Maximum Likelihood Estimation’, we assumed the prediction $y$ of the linear model has a Gaussian distribution, and maximum likelihood was then used to derive the parameters of the model. So we can try maximum likelihood again on the Gaussian mixture task and see whether it works well.

First, we prepare the notation that will be used in the following analysis:

- Input data $\{\boldsymbol{x}_1,\cdots,\boldsymbol{x}_N\}$, where $\boldsymbol{x}_i\in \mathbb{R}^D$ for $i\in\{1,2,\cdots,N\}$, assumed i.i.d. We rearrange them in a matrix:

$$

X = \begin{bmatrix}

-&\boldsymbol{x}_1^T&-\\

-&\boldsymbol{x}_2^T&-\\

&\vdots&\\

-&\boldsymbol{x}_N^T&-\\

\end{bmatrix}\tag{3}

$$


- Latent variables $\boldsymbol{z}_i$, the auxiliary random variable associated with $\boldsymbol{x}_i$, for $i\in\{1,\cdots,N\}$. Analogously to matrix (3), the matrix of latent variables is

$$

Z = \begin{bmatrix}

-&\boldsymbol{z}_1^T&-\\

-&\boldsymbol{z}_2^T&-\\

&\vdots&\\

-&\boldsymbol{z}_N^T&-\\

\end{bmatrix}\tag{4}

$$


Once we have these two matrices, based on equation (1) in ‘Mixtures of Gaussians’:

$$

p(\boldsymbol{x})= \sum_{k=1}^{K}\pi_k\mathcal{N}(\boldsymbol{x}|\boldsymbol{\mu}_k,\Sigma_k)\tag{5}

$$


the log-likelihood function is given by:

$$

\begin{aligned}

\ln p(X|\boldsymbol{\pi},\boldsymbol{\mu},\Sigma)&=\ln \left(\prod_{n=1}^N\sum_{k=1}^{K}\pi_k\mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\Sigma_k)\right)\\

&=\sum_{n=1}^{N}\ln \sum_{k=1}^{K}\pi_k\mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\Sigma_k)\\

\end{aligned}\tag{6}

$$


This looks different from the single Gaussian model, where the logarithm acts directly on $\mathcal{N}(\boldsymbol{x}|\boldsymbol{\mu},\Sigma)$, which is an exponential function. Here, the summation inside the logarithm makes the problem hard to solve analytically.
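Equation (6) translates directly into NumPy. A minimal 1-D sketch with made-up parameters, just to show that the sum over components sits inside the logarithm:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    # 1-D Gaussian density
    return np.exp(-0.5*((x-mu)/sigma)**2) / (sigma*np.sqrt(2*np.pi))

def gmm_log_likelihood(x, pi, mu, sigma):
    # equation (6): sum_n log( sum_k pi_k * N(x_n | mu_k, sigma_k) )
    weighted = np.array([p*gaussian_pdf(x, m, s) for p, m, s in zip(pi, mu, sigma)])
    return np.sum(np.log(weighted.sum(axis=0)))

x = np.array([-1.0, 0.0, 2.0, 2.5])   # made-up data
pi = np.array([0.4, 0.6])             # made-up mixing coefficients
mu = np.array([0.0, 2.0])             # made-up means
sigma = np.array([1.0, 0.5])          # made-up standard deviations
print(gmm_log_likelihood(x, pi, mu, sigma))
```

With a single component ($K=1$) the inner sum disappears and the expression reduces to the familiar single-Gaussian log-likelihood.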

Moreover, the labeling of the mixture components is arbitrary: there are $K!$ equivalent solutions, one for each permutation of the components, so which one we obtain does not affect the model.

The other differences from the single Gaussian model are the possible singularity of a covariance matrix and the degenerate case $\boldsymbol{x}_n=\boldsymbol{\mu}_j$.

In a Gaussian distribution the covariance matrix must be invertible, so in the following discussion we assume all the covariance matrices are invertible. For simplicity we take $\Sigma_k=\delta_k^2 I$, where $I$ is the identity matrix.

When a sample point happens to equal the mean $\boldsymbol{\mu}_j$, the Gaussian density of the random variable $\boldsymbol{x}_n$ is:

$$

\mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_j,\delta_j^2I)=\frac{1}{(2\pi)^{\frac{D}{2}}}\frac{1}{\delta_j^D}\tag{7}

$$


As the standard deviation $\delta_j\to 0$, this term goes to infinity, the log-likelihood diverges, and the whole algorithm fails.

This problem does not arise in a single Gaussian model: shrinking the variance to fit one point drives the likelihood of every other point to zero. In a mixture, however, one component can collapse onto a single point while the remaining components explain the rest of the data.
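The collapse can be illustrated numerically. A sketch, with a made-up data point, evaluating equation (7) as $\delta_j$ shrinks:

```python
import numpy as np

x_n = np.array([1.0, 2.0])    # a made-up data point
mu_j = x_n.copy()             # a component mean collapsed onto that point
D = len(x_n)

densities = []
for delta in [1.0, 0.1, 0.01, 0.001]:
    # N(x_n | mu_j, delta^2 I) with x_n == mu_j: the exponent vanishes,
    # leaving only the normalizing constant (2*pi)^(-D/2) * delta^(-D)
    densities.append((2*np.pi)**(-D/2) * delta**(-D))

print(densities)  # grows without bound as delta shrinks
```

One term of the log-likelihood can therefore be made arbitrarily large, which is exactly the singularity described above.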

So plain maximum likelihood is not well suited to a Gaussian mixture model; we introduce the EM algorithm in the next post.

The post [Mixture Models] Maximum Likelihood of Gaussian Mixtures appeared first on Brain Bomb.


We introduced mixture distributions in the post ‘An Introduction to Mixture Models’, where the example was just a two-component Gaussian mixture. In this post we discuss Gaussian mixtures formally, which also serves to motivate the expectation-maximization (EM) algorithm.

A Gaussian mixture distribution can be written as:

$$

p(\boldsymbol{x})= \sum_{k=1}^{K}\pi_k\mathcal{N}(\boldsymbol{x}|\boldsymbol{\mu}_k,\Sigma_k)\tag{1}

$$


where $\sum_{k=1}^K \pi_k =1$ and $0\leq \pi_k\leq 1$.
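Equation (1) can be evaluated directly. A minimal 1-D sketch with two made-up components, checking numerically that the mixture density integrates to about 1:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    # 1-D Gaussian density
    return np.exp(-0.5*((x-mu)/sigma)**2) / (sigma*np.sqrt(2*np.pi))

pi = [0.3, 0.7]        # mixing coefficients, sum to 1
mu = [-2.0, 1.0]       # made-up component means
sigma = [0.5, 1.5]     # made-up component standard deviations

grid = np.linspace(-10, 10, 20001)
mixture = sum(p*normal_pdf(grid, m, s) for p, m, s in zip(pi, mu, sigma))

# Riemann-sum check that the mixture density integrates to about 1
print(mixture.sum() * (grid[1] - grid[0]))
```

Because each component integrates to 1 and the $\pi_k$ sum to 1, the mixture is itself a valid probability density.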

We then introduce a random vector called the latent variable $\boldsymbol{z}$, each component of which satisfies:

$$

z_k\in\{0,1\}\tag{2}

$$


and $\boldsymbol{z}$ uses a $1$-of-$K$ representation: exactly one component is $1$ and all others are $0$. To build the joint distribution $p(\boldsymbol{x},\boldsymbol{z})$, we first need $p(\boldsymbol{x}|\boldsymbol{z})$ and $p(\boldsymbol{z})$. For the distribution of $\boldsymbol{z}$,

$$

p(z_k=1)=\pi_k\tag{3}

$$


is a natural choice, since $\{\pi_k\}$ for $k=1,\cdots,K$ satisfies the requirements of a probability distribution. For the entire vector $\boldsymbol{z}$, equation (3) can be written as:

$$

p(\boldsymbol{z}) = \Pi_{k=1}^K \pi_k^{z_k}\tag{4}

$$


According to the definition of $p(\boldsymbol{z})$, we can specify the conditional distribution of $\boldsymbol{x}$ given $\boldsymbol{z}$. Under the condition $z_k=1$ we have:

$$

p(\boldsymbol{x}|z_k=1)=\mathcal{N}(\boldsymbol{x}|\mu_k,\Sigma_k)\tag{5}

$$


and then the vector form of the conditional distribution is:

$$

p(\boldsymbol{x}|\boldsymbol{z})=\Pi_{k=1}^{K}\mathcal{N}(\boldsymbol{x}|\boldsymbol{\mu}_k,\Sigma_k)^{z_k}\tag{6}

$$


Once we have both the distribution $p(\boldsymbol{z})$ and the conditional distribution $p(\boldsymbol{x}|\boldsymbol{z})$, we can build the joint distribution by the product rule:

$$

p(\boldsymbol{x},\boldsymbol{z}) = p(\boldsymbol{z})\cdot p(\boldsymbol{x}|\boldsymbol{z})\tag{7}

$$


However, what concerns us is still the marginal distribution of $\boldsymbol{x}$, which we obtain by summing over $\boldsymbol{z}$:

$$

p(\boldsymbol{x}) = \sum_{j}p(\boldsymbol{x},\boldsymbol{z}_j) = \sum_{j}p(\boldsymbol{z}_j)\cdot p(\boldsymbol{x}|\boldsymbol{z}_j)\tag{8}

$$


where $\boldsymbol{z}_j$ ranges over the $K$ possible values of the random vector $\boldsymbol{z}$; summing over them recovers equation (1).

This is how latent variables construct a Gaussian mixture, and this form makes the distribution of a mixture model easier to analyze.

Bayes’ theorem gives us the posterior. Using equation (7), the posterior probability of the latent variable $\boldsymbol{z}$ is:

$$

p(z_k=1|\boldsymbol{x})=\frac{p(z_k=1)p(\boldsymbol{x}|z_k=1)}{\sum_j^K p(z_j=1)p(\boldsymbol{x}|z_j=1)}\tag{9}

$$


Substituting equations (3) and (5) into equation (9), we get:

$$

p(z_k=1|\boldsymbol{x})=\frac{\pi_k\mathcal{N}(\boldsymbol{x}|\mu_k,\Sigma_k)}{\sum^K_j \pi_j\mathcal{N}(\boldsymbol{x}|\mu_j,\Sigma_j)}\tag{10}

$$


$p(z_k=1|\boldsymbol{x})$ is also called the responsibility, denoted:

$$

\gamma(z_k)=p(z_k=1|\boldsymbol{x})\tag{11}

$$

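Equation (10) translates directly into NumPy. A sketch with made-up 1-D parameters, computing the responsibilities $\gamma(z_k)$ for a batch of points; each row sums to 1, since the responsibilities form a posterior distribution over components:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    # 1-D Gaussian density
    return np.exp(-0.5*((x-mu)/sigma)**2) / (sigma*np.sqrt(2*np.pi))

def responsibilities(x, pi, mu, sigma):
    # equation (10): gamma_nk = pi_k N(x_n|mu_k) / sum_j pi_j N(x_n|mu_j)
    weighted = np.stack([p*normal_pdf(x, m, s) for p, m, s in zip(pi, mu, sigma)],
                        axis=1)  # shape (N, K)
    return weighted / weighted.sum(axis=1, keepdims=True)

x = np.array([-1.0, 0.5, 3.0])                                # made-up points
gamma = responsibilities(x, pi=[0.5, 0.5], mu=[0.0, 2.0], sigma=[1.0, 1.0])
print(gamma.sum(axis=1))  # each row sums to 1
```

The point at $3.0$ receives most of its responsibility from the component centered at $2.0$, as the posterior intuition suggests.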

The post [Mixture Models] Mixtures of Gaussians appeared first on Brain Bomb.


The K-means algorithm in its original form may be one of the most accessible algorithms in machine learning, and many books and courses start with it. However, converting the task K-means deals with into a more mathematical form reveals more interesting aspects.

The first thing to do before introducing the algorithm is to make the task clear, and a precise mathematical form is usually the best way.

Clustering is an unsupervised learning task, so there is no correct or incorrect solution. Clustering resembles classification at prediction time, since both produce discrete outputs. However, while training a classifier we always have a target for every input; clustering has no targets at all, and what we have is only

$$

\{x_1,\cdots, x_N\}\tag{1}

$$


where $x_i\in\mathbb{R}^D$ for $i=1,2,\cdots,N$. Our mission is to separate the dataset into $K$ groups, where $K$ is given in advance.

An intuitive clustering strategy is based on two considerations:

- The distance between data points in the same group should be as small as possible.
- The distance between data points in different groups should be as large as possible.

Based on these two points, some concepts are formed. The first is how to represent a group. We take

$$

\mu_i:i\in\{1,2,\cdots, K\}\tag{2}

$$


as the prototype associated with the $i$th group. A group always contains several points, and a natural idea is to use the center of all the points belonging to the group as its prototype. To represent which group a point $\boldsymbol{x}_n$ in equation (1) belongs to, an indicator is necessary; a 1-of-$K$ coding scheme is used, where the indicator:

$$

r_{nk}\in\{0,1\}\tag{3}

$$


where $k=1,2,\cdots,K$ indexes the group and $n = 1,2,\cdots,N$ indexes the sample point, and where $r_{nk}=1$ implies $r_{nj}=0$ for all $j\neq k$.

A loss function is a good way to measure the quality of a model during both training and testing. In a clustering task a loss function cannot be used directly, since we have no targets to measure loss against. However, we can build another function that plays the same role as a loss function and expresses exactly what we want.

According to the two considerations above, we build the objective function:

$$

J=\sum_{n=1}^{N}\sum_{k=1}^{K}r_{nk}||\boldsymbol{x}_n-\mu_k||^2\tag{4}

$$


In this objective function, the distance is the Euclidean distance (other similarity measures could also be used). The mission is then to minimize $J$ by choosing $\{r_{nk}\}$ and $\{\mu_k\}$.

Now let’s present the famous K-means algorithm. The method alternates two steps:

- Minimizing $J$ with respect to $r_{nk}$, keeping $\mu_k$ fixed
- Minimizing $J$ with respect to $\mu_k$, keeping $r_{nk}$ fixed

In the first step, by equation (4) the objective function is linear in $r_{nk}$, so there is a closed-form solution. We set:

$$

r_{nk}=\begin{cases}

1&\text{ if } k=\mathop{argmin}_{j}||x_n-\mu_j||^2\\

0&\text{otherwise}

\end{cases}\tag{5}

$$


In the second step, $r_{nk}$ is fixed and we minimize the objective function $J$ with respect to $\mu_k$. Since $J$ is quadratic in $\mu_k$, the minimum is at the stationary point where:

$$

\frac{\partial J}{\partial \mu_k}=-\sum_{n=1}^{N}r_{nk}(x_n-\mu_k)=0\tag{6}

$$


and we get:

$$

\mu_k = \frac{\sum_{n=1}^{N}r_{nk}x_n}{\sum_{n=1}^{N}r_{nk}}\tag{7}

$$


The denominator $\sum_{n=1}^{N}r_{nk}$ is the number of points from the sample $\{x_1,\cdots, x_N\}$ assigned to prototype $\mu_k$, i.e. to group $k$, at the current step. So $\mu_k$ is just the average of all the points in group $k$.

These two steps, given by equations (5) and (7), repeat until $r_{nk}$ and $\mu_k$ no longer change.
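The two updates, equations (5) and (7), can also be written compactly in vectorized NumPy. This is a sketch on made-up data, separate from the full implementation later in the post:

```python
import numpy as np

def kmeans_step(x, centers):
    # E step (equation 5): assign each point to its nearest center
    distances = ((x[:, None, :] - centers[None, :, :])**2).sum(axis=2)  # (N, K)
    assignment = distances.argmin(axis=1)
    # M step (equation 7): move each center to the mean of its assigned points
    new_centers = centers.copy()
    for k in range(len(centers)):
        members = x[assignment == k]
        if len(members) > 0:
            new_centers[k] = members.mean(axis=0)
    return assignment, new_centers

rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0.0, 0.3, (20, 2)),   # made-up cluster around (0, 0)
               rng.normal(3.0, 0.3, (20, 2))])  # made-up cluster around (3, 3)
centers = np.array([[-1.0, -1.0], [4.0, 4.0]])  # made-up initial centers
for _ in range(10):
    assignment, centers = kmeans_step(x, centers)
print(centers)
```

After convergence each center is exactly the mean of the points assigned to it, which is equation (7).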

The K-means algorithm is guaranteed to converge because every step reduces (or at least does not increase) the objective function $J$. However, it may converge to a local rather than the global minimum.

Most algorithms need their input data to obey some conditions. For the K-means algorithm, we rescale the input data to mean 0 and variance 1. This is usually done by

$$

x_n^{(i)} = \frac{x_n^{(i)}- \bar{x}^{(i)}}{\delta^{i}}

$$


where $x_n^{(i)}$ is the $i$th component of the $n$th data point $x_n$ from equation (1), and $\bar{x}^{(i)}$ and $\delta^{i}$ are the mean and standard deviation of the $i$th component.
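This standardization is a few lines in NumPy. A sketch with made-up data, one row per point:

```python
import numpy as np

x = np.array([[1.0, 200.0],
              [2.0, 180.0],
              [3.0, 220.0],
              [4.0, 160.0]])   # made-up data, one row per point

# per-component mean and standard deviation, taken over the N points
mean = x.mean(axis=0)
std = x.std(axis=0)
x_normalized = (x - mean) / std

print(x_normalized.mean(axis=0))  # ~0 in each component
print(x_normalized.std(axis=0))   # ~1 in each component
```

Without this step, the component with the largest scale (here the second one) would dominate the Euclidean distances in equation (4).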

```python
import numpy as np
import matplotlib.pyplot as plt


class K_Means():
    """input data should be normalized: mean 0, variance 1"""
    def clusturing(self, x, K):
        """
        :param x: inputs
        :param K: how many groups
        :return: prototypes (center of each group) and, for each point,
                 the index of the group it belongs to
        """
        data_point_dimension = x.shape[1]
        data_point_size = x.shape[0]
        # initialize each center with a randomly chosen data point
        center_matrix = np.zeros((K, data_point_dimension))
        for i in range(len(center_matrix)):
            center_matrix[i] = x[np.random.randint(0, len(x)-1)]
        center_matrix_last_time = np.zeros((K, data_point_dimension))
        cluster_for_each_point = np.zeros(data_point_size, dtype=np.int32)
        # -----------------------------------visualization-----------------------------------
        # this part can be deleted
        center_color = np.random.randint(0, 1000, (K, 3))/1000.
        plt.scatter(x[:, 0], x[:, 1], color='green', s=30, marker='o', alpha=0.3)
        for i in range(len(center_matrix)):
            plt.scatter(center_matrix[i][0], center_matrix[i][1],
                        marker='x', s=65, color=center_color[i])
        plt.show()
        # ------------------------------------------------------------------------------------
        # iterate until the centers no longer change
        while not np.array_equal(center_matrix_last_time, center_matrix):
            # E step: assign every point to its nearest center
            for i in range(len(x)):
                distance_to_center = np.zeros(K)
                for k in range(K):
                    distance_to_center[k] = (center_matrix[k]-x[i]).dot(center_matrix[k]-x[i])
                cluster_for_each_point[i] = int(np.argmin(distance_to_center))
            # M step: move every center to the mean of its assigned points
            number_of_point_in_k = np.zeros(K)
            center_matrix_last_time = center_matrix
            center_matrix = np.zeros((K, data_point_dimension))
            for i in range(len(x)):
                center_matrix[cluster_for_each_point[i]] += x[i]
                number_of_point_in_k[cluster_for_each_point[i]] += 1
            for i in range(len(center_matrix)):
                if number_of_point_in_k[i] != 0:
                    center_matrix[i] /= number_of_point_in_k[i]
            # -----------------------------------visualization-----------------------------------
            # this part can be deleted
            print(center_matrix)
            plt.cla()
            for i in range(len(center_matrix)):
                plt.scatter(center_matrix[i][0], center_matrix[i][1],
                            marker='x', s=65, color=center_color[i])
            for i in range(len(x)):
                plt.scatter(x[i][0], x[i][1], marker='o', s=30,
                            color=center_color[cluster_for_each_point[i]], alpha=0.7)
            plt.show()
            # ------------------------------------------------------------------------------------
        return center_matrix, cluster_for_each_point
```

The entire project can be found at https://github.com/Tony-Tan/ML; please star it (^_^).

We use the tool https://github.com/Tony-Tan/2DRandomSampleGenerater to generate the input data:

There are two classes, a brown circle and a green circle. The K-means algorithm then initializes two prototypes, the centers of the groups, randomly:

The two crosses represent the centers.

And then we iterate the two steps:

Iteration 1

Iteration 2

Iteration 3

Iteration 4

Iterations 3 and 4 are unchanged in both the objective function value $J$ and the parameters, so the algorithm stopped. Different initial centers may lead to different convergence speeds.

The post [Mixture Models] K-means Clustering appeared first on Brain Bomb.
