AMMAI: Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

Introduction:

DNNs are powerful in computer vision tasks such as image classification and detection. But their disadvantage is that they have too many parameters. If we can discard many parameters while keeping the performance, we can save a lot of memory. This paper describes a three-stage pipeline for shrinking a network.

Here is a brief summary of the method:

Stage 1: Pruning

If the magnitude of a weight is below a threshold, we set it to zero. Now we only need to keep the nonzero weights and their positions. Instead of keeping each absolute position, we store the difference between consecutive positions. Here is the illustration:

Pruning reduced the number of parameters by 9× and 13× for the AlexNet and VGG-16 models, respectively.
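
The pruning and relative-position ideas from Stage 1 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the paper's implementation: the function names (prune, to_relative, from_relative), the example weights and the threshold of 0.1 are all made up here, the first gap is counted from position 0, and the sketch ignores how many bits each gap would take on disk.

```python
def prune(weights, threshold):
    """Set every weight whose magnitude is below the threshold to zero."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]


def to_relative(weights):
    """Encode a pruned list as (gap, value) pairs.

    gap is the distance from the previous nonzero position; for the
    first nonzero weight it is the distance from position 0
    (an assumption of this sketch).
    """
    pairs = []
    prev = 0
    for i, w in enumerate(weights):
        if w != 0.0:
            pairs.append((i - prev, w))
            prev = i
    return pairs


def from_relative(pairs, length):
    """Rebuild the dense list from (gap, value) pairs."""
    dense = [0.0] * length
    pos = 0
    for gap, w in pairs:
        pos += gap
        dense[pos] = w
    return dense


# Hypothetical example values, chosen only for illustration.
weights = [0.05, 0.9, -0.02, 0.0, -0.7, 0.01, 0.0, 0.3]
pruned = prune(weights, threshold=0.1)
pairs = to_relative(pruned)
restored = from_relative(pairs, len(weights))
```

Here three of the eight weights survive pruning, and each one is stored with a small gap rather than its absolute position, which is what keeps the position information cheap when the surviving weights are sparse.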

Stage 2: Quantization and weight sharing

Next, we group the weights into clusters, and all weights in the same cluster share a single value. In this figure, instead of using a 32-bit floating-point number to store each weight, we only need a 2-bit index, since there are only 4 clusters.
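
A minimal sketch of the weight-sharing idea follows, assuming a plain 1-D k-means with evenly spaced initial centroids. The function name kmeans_1d, the example weights and the iteration count are hypothetical choices for this sketch. It also counts the storage needed for the codebook of shared values, which has to be kept alongside the indices.

```python
import math


def kmeans_1d(values, k, iterations=20):
    """Cluster scalars into k groups; return (centroids, assignments)."""
    lo, hi = min(values), max(values)
    # Start with centroids evenly spaced between the min and max weight.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]

    def nearest(v):
        return min(range(k), key=lambda c: abs(v - centroids[c]))

    for _ in range(iterations):
        assign = [nearest(v) for v in values]
        for c in range(k):
            members = [v for v, a in zip(values, assign) if a == c]
            if members:  # leave an empty cluster's centroid unchanged
                centroids[c] = sum(members) / len(members)
    return centroids, [nearest(v) for v in values]


# Hypothetical example weights, chosen only for illustration.
weights = [2.1, -1.0, 1.5, 0.1, 0.0, -0.1, -1.1, 2.0, -0.9, 1.9, 1.4, 1.6]
k = 4

codebook, indices = kmeans_1d(weights, k)
# Each weight is replaced by the shared value of its cluster.
quantized = [codebook[i] for i in indices]

bits_per_index = math.ceil(math.log2(k))  # 2 bits when k = 4
original_bits = len(weights) * 32
# Indices for every weight plus the codebook of k 32-bit shared values.
compressed_bits = len(weights) * bits_per_index + k * 32
ratio = original_bits / compressed_bits
```

With only 12 weights the codebook dominates the cost, so the saving is modest. For a real layer with many more weights than clusters, the per-weight cost approaches the index width.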