**Keywords:** AlexNet, Dropout, ReLU

All the figures and tables in this post come from ‘ImageNet Classification with Deep Convolutional Neural Networks’^{1}

## Background

- Large labeled training sets, such as ImageNet, are available
- GPUs are powerful enough to train large CNNs
- Convolutional neural networks, such as ‘Handwritten digit recognition with a back-propagation network’^{2}

## Inspiration

- Controlling the capacity of CNNs by varying their depth and breadth.
- CNNs can make strong and mostly correct assumptions about the nature of images.

## Contribution

- Training one of the largest convolutional neural networks on the ImageNet ILSVRC-2010 dataset
- Architecture
- a highly optimized GPU implementation is used
- several new features of CNNs improve performance and reduce training time
- ReLU Nonlinearity $f(x)=\max(0,x)$
- training with ReLU is much faster than with $\tanh(x)$ or $\frac{1}{1+e^{-x}}$
- Jarrett et al.^{3} had tried other nonlinear functions such as $f(x)=|\tanh(x)|$ to prevent overfitting
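The ReLU above is simple enough to sketch directly; this is a minimal numpy illustration (the function name `relu` is my own, not from the paper):

```python
import numpy as np

def relu(x):
    """ReLU nonlinearity f(x) = max(0, x): cheap to compute
    and non-saturating for positive inputs."""
    return np.maximum(0.0, x)
```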

- Local Response Normalization
- ReLU does not require input normalization to prevent them from saturating.
- Local response normalization scheme aids generalization:

$$
b^{i}_{x,y}=\frac{a^i_{x,y}}{\left(k+\alpha\sum^{\min(N-1,\,i+n/2)}_{j=\max(0,\,i-n/2)}\left(a^j_{x,y}\right)^2\right)^\beta}
$$

- where $a^i_{x,y}$ is the activity of a neuron computed by applying kernel $i$ at position $(x,y)$ and then applying the ReLU nonlinearity, and $b^i_{x,y}$ is the response-normalized activity. The sum runs over $n$ ‘adjacent’ kernel maps at the same spatial position, and $N$ is the total number of kernels in the layer
- the hyper-parameters are determined using a validation set; $k=2$, $n=5$, $\alpha = 10^{-4}$ and $\beta =0.75$ are used
- normalization is applied after the ReLU
- 13% error without normalization and 11% with normalization
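The normalization formula above can be sketched in numpy; this is a straightforward (unoptimized) rendering of the equation, with `local_response_norm` as my own name for it:

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Cross-channel local response normalization.
    a: activations of shape (N_channels, H, W); defaults match the paper."""
    N = a.shape[0]
    b = np.empty_like(a)
    for i in range(N):
        # sum of squares over up to n adjacent kernel maps around channel i
        lo = max(0, i - n // 2)
        hi = min(N - 1, i + n // 2)
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b
```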

- Overlapping Pooling
- traditionally, the neighborhoods summarized by adjacent pooling units do not overlap
- using $stride=2$ and $kernel\,size=3$ reduces the top-1 error rate by 0.4% compared with non-overlapping pooling ($stride=2$, $kernel\,size=2$)
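Overlapping simply means the pooling window is larger than the stride; a minimal 2-D max-pooling sketch in numpy (function name `max_pool2d` is mine):

```python
import numpy as np

def max_pool2d(x, size=3, stride=2):
    """Max pooling over a 2-D feature map.
    With size > stride (as here), adjacent windows overlap."""
    H, W = x.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out
```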

- Preventing overfitting
- Data augmentation
- extracting random patches from a bigger image
- horizontal reflections
- at test time, evaluating five patches (the four corner patches and the center patch) as well as their horizontal reflections, and averaging the softmax predictions over all ten
- altering the intensities of the RGB channels in training images, using PCA on the set of RGB pixel values first:

$$
\begin{bmatrix}\boldsymbol{p}_1&\boldsymbol{p}_2&\boldsymbol{p}_3\end{bmatrix}
\begin{bmatrix}\alpha_1 \lambda_1\\\alpha_2 \lambda_2\\\alpha_3 \lambda_3\end{bmatrix}
$$

- where $\boldsymbol{p}_i$ and $\lambda_i$ are the $i$-th eigenvector and eigenvalue of the covariance matrix of RGB pixel values, and $\alpha_i$ is a random variable. For a particular training image, $\alpha_i$ is drawn only once; when the image is used again, $\alpha_i$ may be redrawn
- dropout: set the output of each hidden neuron to zero with probability 0.5. This means that for each input, the network effectively uses a different architecture
- dropout is used in the first two fully-connected layers in figure 2
- without dropout the network exhibits substantial overfitting
- dropout roughly doubles the number of iterations required to converge
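The PCA colour augmentation above can be sketched end to end in numpy; this is an illustrative implementation under my own naming (`pca_color_jitter`), not code from the paper:

```python
import numpy as np

def pca_color_jitter(image, rng=np.random.default_rng(0), sigma=0.1):
    """Add the PCA-based RGB perturbation to a float image of shape (H, W, 3)."""
    pixels = image.reshape(-1, 3)
    cov = np.cov(pixels, rowvar=False)        # 3x3 covariance of RGB values
    eigvals, eigvecs = np.linalg.eigh(cov)    # columns of eigvecs are the p_i
    alpha = rng.normal(0.0, sigma, size=3)    # drawn once per use of the image
    delta = eigvecs @ (alpha * eigvals)       # [p1 p2 p3][a1*l1, a2*l2, a3*l3]^T
    return image + delta                      # broadcast over all pixels
```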
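The dropout scheme above is also easy to sketch; a minimal numpy version (the paper scales outputs by 0.5 at test time rather than rescaling during training):

```python
import numpy as np

def dropout(x, p=0.5, train=True, rng=np.random.default_rng(0)):
    """Zero each activation with probability p during training;
    at test time, multiply outputs by (1 - p) instead."""
    if not train:
        return x * (1.0 - p)
    mask = rng.random(x.shape) >= p  # keep each unit with probability 1 - p
    return x * mask
```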

## Experiment

- batch size of 128 examples, momentum 0.9, and weight decay of 0.0005:

$$
\begin{aligned}
v_{i+1} &:= 0.9 \cdot v_i - 0.0005\cdot\epsilon\cdot w_i-\epsilon\left(\frac{\partial L}{\partial w}\Big|_{w_i}\right)_{D_i}\\
w_{i+1} &:= w_i + v_{i+1}
\end{aligned}
$$

- where $i$ is the iteration index, $v$ is the momentum variable, $\epsilon$ is the learning rate, and $\left(\frac{\partial L}{\partial w}\big|_{w_i}\right)_{D_i}$ is the average gradient of batch $D_i$ with respect to $w$, evaluated at $w_i$
- weight decay here is not merely a regularizer: it reduces the model’s training error

- an equal learning rate is used for all layers, initialized at $0.01$
- when the validation error rate stops improving, the learning rate is divided by $10$
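The update rule above can be written as a single step function; a minimal numpy sketch under my own naming (`sgd_step`), using the paper's defaults:

```python
import numpy as np

def sgd_step(w, v, grad, lr=0.01, momentum=0.9, weight_decay=0.0005):
    """One momentum-SGD update with weight decay:
    v <- 0.9*v - wd*lr*w - lr*grad;  w <- w + v."""
    v = momentum * v - weight_decay * lr * w - lr * grad
    w = w + v
    return w, v
```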

## Result

## Personal Summary

This paper revived CNNs in 2012; GPUs were an essential enabler for training them.

## References

1 Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “Imagenet classification with deep convolutional neural networks.” In Advances in neural information processing systems, pp. 1097-1105. 2012. ↩

2 LeCun, Yann, Bernhard E. Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne E. Hubbard, and Lawrence D. Jackel. “Handwritten digit recognition with a back-propagation network.” In Advances in neural information processing systems, pp. 396-404. 1990. ↩