AMMAI

Tuesday, June 21, 2016

Nonlinear dimensionality reduction by locally linear embedding

Introduction:
The paper introduces locally linear embedding (LLE), a method for nonlinear dimensionality reduction.

Method:

In this figure, the middle panel (B) shows points sampled from the manifold in A.

If we flatten the points in B, we get C.

[Figure: screenshot, 2016-03-28]

Here we have 3 steps:

1. For each point, select its K nearest neighbors.

2. Find the K weights that minimize the squared error between each point and its reconstruction from its neighbors (between X and WX).

3. In the low-dimensional space, keep W fixed and find Y (the low-dimensional coordinates corresponding to X) that minimizes the squared error between Y and WY.
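
The three steps above can be sketched in a few lines of NumPy. This is only a rough sketch, not the authors' code: the function name `lle`, the regularization constant `reg`, and the brute-force neighbor search are my own assumptions, and the weights of each point are constrained to sum to one.

```python
import numpy as np

def lle(X, n_neighbors=5, n_components=2, reg=1e-3):
    """Minimal LLE sketch. X has shape (n_points, n_features)."""
    n = X.shape[0]

    # Step 1: K nearest neighbors by brute-force Euclidean distance.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)          # a point is not its own neighbor
    nbrs = np.argsort(d2, axis=1)[:, :n_neighbors]

    # Step 2: weights that best reconstruct each point from its neighbors.
    W = np.zeros((n, n))
    for i in range(n):
        Z = X[nbrs[i]] - X[i]             # neighbors centered on point i
        G = Z @ Z.T                       # local Gram matrix (K x K)
        # Assumed regularization, keeps G invertible when K > n_features.
        G += reg * np.trace(G) * np.eye(n_neighbors)
        w = np.linalg.solve(G, np.ones(n_neighbors))
        W[i, nbrs[i]] = w / w.sum()       # enforce sum-to-one

    # Step 3: Y minimizing |Y - WY|^2 comes from the bottom eigenvectors
    # of M = (I - W)^T (I - W); the first (constant) one is discarded.
    I = np.eye(n)
    M = (I - W).T @ (I - W)
    _, vecs = np.linalg.eigh(M)
    Y = vecs[:, 1:n_components + 1]
    return Y, W

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
Y, W = lle(X, n_neighbors=5, n_components=2)
```

The dense n-by-n matrices are fine for a toy example but would need a sparse eigen-solver for large datasets.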

Faster R-CNN

Introduction:

The work aims to improve both the speed and the accuracy of Fast R-CNN. The bottleneck of Fast R-CNN is object proposal generation, so the method focuses on region proposals.

Method:

Region Proposal Network (RPN)

The RPN takes a CNN convolutional feature map as input and outputs a set of rectangular object proposals, each with an objectness score.

Fast R-CNN uses selective search to generate proposals; Faster R-CNN replaces it with the RPN.

Here is the illustration.
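
To make "rectangular object proposals" more concrete, here is a small sketch of how candidate boxes can be laid out over the conv feature map. These notes do not describe this step; the sketch assumes the anchor scheme as I remember it from the paper (3 scales and 3 aspect ratios at every feature-map position, stride 16). The function name `make_anchors` and the choice to center boxes on each cell are my own.

```python
import numpy as np

def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchor boxes (x1, y1, x2, y2) in image coordinates."""
    base = []
    for s in scales:
        for r in ratios:
            # r = height / width; the area stays s * s for every ratio.
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            base.append([-w / 2, -h / 2, w / 2, h / 2])
    base = np.array(base)                                  # (9, 4)

    # Center of every feature-map cell, mapped back to image pixels.
    cx = (np.arange(feat_w) + 0.5) * stride
    cy = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(cx, cy)                           # (feat_h, feat_w)
    centers = np.stack([cx, cy, cx, cy], axis=-1).reshape(-1, 1, 4)

    return (centers + base[None]).reshape(-1, 4)

anchors = make_anchors(2, 3)
```

In the real network the RPN then predicts, for each such box, an objectness score and a box refinement; that part is learned and not shown here.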

Results Compared with Fast R-CNN:

With fewer proposals, Faster R-CNN still achieves a higher mAP than Fast R-CNN, and its execution time is also much shorter. So Faster R-CNN outperforms Fast R-CNN in both speed and accuracy.

Deep neural networks for acoustic modeling in speech recognition

Introduction:
GMM-HMM systems used to dominate speech recognition. But the GMM has a big problem. If most points lie on or near the surface of a sphere, a suitable model needs only a few parameters to describe them, but a GMM requires a very large number of parameters. Researchers now replace the GMM with a DNN, and they have shown that the DNN outperforms the GMM.

Method:
Restricted Boltzmann machine (RBM)

1. First, a Gaussian-Bernoulli RBM (GRBM) is trained to model a window of frames of real-valued acoustic coefficients.

2. Then the states of the binary hidden units of the GRBM are used as data for training an RBM. This is repeated to create as many hidden layers as desired.

3. Then the stack of RBMs is converted to a single generative model, a DBN, by replacing the undirected connections of the lower level RBMs by top-down, directed connections.

4. Finally, a pre-trained DBN-DNN is created by adding a “softmax” output layer that contains one unit for each possible state of each HMM.
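
Steps 1 and 2 (greedy layer-by-layer pre-training) can be sketched roughly as follows. This is a toy illustration, not the paper's recipe: it uses one step of contrastive divergence (CD-1) with full-batch updates, assumes the input is normalized to unit variance, and passes hidden probabilities rather than sampled binary states to the next layer. The names `train_rbm` and `pretrain_stack` and all hyperparameters are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=5, lr=0.05,
              gaussian_visible=False, seed=0):
    """Train one RBM with CD-1. Returns (weights, hidden biases)."""
    rng = np.random.default_rng(seed)
    n, n_visible = data.shape
    W = rng.normal(0, 0.01, size=(n_visible, n_hidden))
    b_v = np.zeros(n_visible)
    b_h = np.zeros(n_hidden)
    for _ in range(epochs):
        # Positive phase: hidden units driven by the data.
        h_prob = sigmoid(data @ W + b_h)
        h_state = (rng.random(h_prob.shape) < h_prob).astype(float)
        # Negative phase: one Gibbs step back to the visible units.
        v_act = h_state @ W.T + b_v
        # Gaussian visible units (unit variance assumed) use the mean.
        v_recon = v_act if gaussian_visible else sigmoid(v_act)
        h_recon = sigmoid(v_recon @ W + b_h)
        # CD-1 update: data statistics minus reconstruction statistics.
        W += lr * (data.T @ h_prob - v_recon.T @ h_recon) / n
        b_v += lr * (data - v_recon).mean(axis=0)
        b_h += lr * (h_prob - h_recon).mean(axis=0)
    return W, b_h

def pretrain_stack(frames, layer_sizes):
    """Greedily stack RBMs; the first one is a GRBM on real-valued input."""
    layers, x = [], frames
    for i, n_hidden in enumerate(layer_sizes):
        W, b_h = train_rbm(x, n_hidden, gaussian_visible=(i == 0), seed=i)
        layers.append((W, b_h))
        x = sigmoid(x @ W + b_h)   # hidden activations feed the next RBM
    return layers

frames = np.random.default_rng(0).normal(size=(20, 11))
layers = pretrain_stack(frames, [8, 6])
```

The resulting weights would then initialize a feed-forward net, with the softmax layer from step 4 added on top and the whole thing fine-tuned with backpropagation.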

Experiment:

Sequence to Sequence – Video to Text

Introduction:
Given a sequence of video frames as input, can a machine output a sequence of words? This problem is called video captioning. Current state-of-the-art work uses LSTMs. In this paper, we combine a CNN and LSTMs to generate sentences describing the video.

Method:
First, we extract features from the input frames using AlexNet, VGG, and GoogLeNet.

Then 2 LSTMs are used in the second part. In this picture, the red one models the video sequence, and the green one models the word sequence.

The green one is given the text input and the hidden representation of the video, and predicts the next word.

The flow images use optical flow features extracted between consecutive frames.
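
The way the two LSTMs are wired together can be sketched as below. This is only my reading of the description above, written in plain NumPy with random weights: the class `LSTMCell`, the function `caption_step`, and all sizes are hypothetical, and the softmax over the vocabulary that would turn the output into a word is left out.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    """A standard LSTM cell with untrained random weights."""
    def __init__(self, n_in, n_hidden, rng):
        # One matrix holds the input, forget, output and candidate gates.
        self.W = rng.normal(0, 0.1, size=(n_in + n_hidden, 4 * n_hidden))
        self.b = np.zeros(4 * n_hidden)

    def step(self, x, h, c):
        z = np.concatenate([x, h]) @ self.W + self.b
        i, f, o, g = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        return h, c

def caption_step(video_lstm, word_lstm, frame_feat, word_emb, state):
    """One time step: the first LSTM reads a frame feature, the second
    reads the previous word together with the first LSTM's hidden state."""
    (h1, c1), (h2, c2) = state
    h1, c1 = video_lstm.step(frame_feat, h1, c1)
    h2, c2 = word_lstm.step(np.concatenate([word_emb, h1]), h2, c2)
    return h2, ((h1, c1), (h2, c2))

rng = np.random.default_rng(0)
n_feat, n_word, n_hidden = 16, 5, 8
video_lstm = LSTMCell(n_feat, n_hidden, rng)
word_lstm = LSTMCell(n_word + n_hidden, n_hidden, rng)
zeros = np.zeros(n_hidden)
state = ((zeros, zeros), (zeros, zeros))
out, state = caption_step(video_lstm, word_lstm,
                          rng.normal(size=n_feat), np.zeros(n_word), state)
```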

Dataset:

Here we have 3 datasets.

[Figure: screenshot, 2016-06-02]

Experiment Result:

Thursday, May 19, 2016

Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding

Introduction:

DNNs are powerful in computer vision tasks such as image classification and detection. But their disadvantage is that they have too many parameters. If we can discard many parameters while keeping the performance, we can save a lot of memory. In this paper, we describe 3 stages to reduce the storage the parameters need.

Here is a brief summary of the method:

Stage 1: Pruning

If the magnitude of a weight is below a threshold, we set it to zero. Now we only need to keep the positions of the nonzero weights. Instead of keeping the absolute positions, we store the difference between consecutive positions. Here is the illustration:

Pruning reduced the number of parameters by 9× and 13× for the AlexNet and VGG-16 models, respectively.
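
A small sketch of stage 1, assuming a flat weight vector and a hand-picked threshold. The function names are my own, and the handling of gaps that do not fit in the index bit-width is left out.

```python
import numpy as np

def prune(weights, threshold):
    """Zero out every weight whose magnitude is below the threshold."""
    return weights * (np.abs(weights) >= threshold)

def to_relative_sparse(flat):
    """Keep nonzero values plus the gap to the previous nonzero position."""
    idx = np.flatnonzero(flat)
    gaps = np.diff(idx, prepend=0)   # first gap is the absolute position
    return flat[idx], gaps

def from_relative_sparse(values, gaps, size):
    """Rebuild the dense vector from values and gaps."""
    out = np.zeros(size, dtype=values.dtype)
    out[np.cumsum(gaps)] = values
    return out

w = np.array([0.05, -0.8, 0.01, 0.0, 0.6, -0.3])
pruned = prune(w, threshold=0.1)
values, gaps = to_relative_sparse(pruned)
restored = from_relative_sparse(values, gaps, w.size)
```

Here the surviving weights sit at positions 1, 4 and 5, so the stored gaps are 1, 3 and 1. Small gaps need fewer bits than absolute positions, which is the point of the relative encoding.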

Stage 2: Quantization and weight sharing

Now we group the weights into clusters, and all weights in a cluster share one value. In this figure, instead of using a 32-bit floating-point number to store each weight, we only store a 2-bit index, since there are only 4 clusters.
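
Weight sharing can be sketched with a tiny 1-D k-means. This is a simplified illustration: the function name `share_weights`, the evenly spaced initial centroids, and the iteration count are assumptions, and the fine-tuning of the shared values during training is not shown.

```python
import numpy as np

def share_weights(weights, n_clusters=4, n_iter=20):
    """Cluster weights with 1-D k-means; return (index map, centroids)."""
    flat = weights.ravel()
    # Assumed initialization: centroids evenly spaced over the weight range.
    centroids = np.linspace(flat.min(), flat.max(), n_clusters)
    for _ in range(n_iter):
        idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
        for k in range(n_clusters):
            if np.any(idx == k):              # skip empty clusters
                centroids[k] = flat[idx == k].mean()
    idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
    return idx.reshape(weights.shape).astype(np.uint8), centroids

w = np.random.default_rng(0).normal(size=(4, 4))
idx, centroids = share_weights(w, n_clusters=4)
approx = centroids[idx]                       # weights as the network sees them
```

With 4 clusters each index needs log2(4) = 2 bits, and only the 4 centroid values are kept as full-precision numbers.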

Stage 3: Huffman Coding

The basic idea of Huffman coding is to use fewer bits to represent the symbols that occur more frequently.
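
A compact sketch of building a Huffman code for a list of symbols (for example, the cluster indices from stage 2). The function name `huffman_code` and the toy input are my own.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Return a dict mapping each symbol to its bit string."""
    freq = Counter(symbols)
    if len(freq) == 1:                        # degenerate single-symbol case
        return {next(iter(freq)): "0"}
    # Heap entries: (count, tie-breaker, partial code table).
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)       # the two rarest subtrees
        n2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]

indices = [0] * 8 + [1] * 4 + [2] * 2 + [3] * 2
code = huffman_code(indices)
total_bits = sum(len(code[s]) for s in indices)
```

In this toy input the most frequent index gets a 1-bit code and the two rarest get 3-bit codes, so the 16 indices take 28 bits instead of the 32 bits a fixed 2-bit code would need.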

Experiment:

1. The number of parameters is reduced without loss of accuracy.

2. Statistics about compressing AlexNet

3. Speedup

Text Understanding from Scratch

Introduction:

When we process sentences, many NLP models extract word-level or semantic features, such as word2vec or n-gram models. This work instead encodes sentences as character-level features, which perform better than the former.

ConvNet

The main component of the ConvNet is the convolutional module, which computes a 1-D convolution of its input with a set of kernels to produce its output.

The idea is briefly illustrated in this figure.
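
A rough sketch of what such a 1-D convolutional module computes along the character axis. Stride 1, no padding, and the array layout are my assumptions; like most deep learning libraries it slides the kernel without flipping it.

```python
import numpy as np

def conv1d(x, kernels, bias):
    """x: (in_features, length); kernels: (out_features, in_features, k)."""
    out_features, _, k = kernels.shape
    length = x.shape[1] - k + 1
    out = np.empty((out_features, length))
    for t in range(length):
        window = x[:, t:t + k]                     # (in_features, k)
        # Sum over input features and kernel positions for each output.
        out[:, t] = np.tensordot(kernels, window,
                                 axes=([1, 2], [0, 1])) + bias
    return out

x = np.arange(10.0).reshape(2, 5)      # 2 input features, 5 positions
kernels = np.ones((3, 2, 2))           # 3 output features, kernel width 2
y = conv1d(x, kernels, bias=np.zeros(3))
```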

Character Quantization

Given a sentence, we quantize each character using 1-of-m encoding, where m is the size of the alphabet. It is very simple, but it works like Braille, which helps blind people read.

If a sentence is longer than L characters, we drop the characters beyond the L-th.
We use
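
The quantization can be sketched as follows. The alphabet and the length limit in the example are toy values chosen only for illustration; lowercasing the input and mapping unknown characters to all-zero columns are my assumptions.

```python
import numpy as np

def quantize(text, alphabet, max_len):
    """Encode text as a (len(alphabet), max_len) 1-of-m matrix."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    out = np.zeros((len(alphabet), max_len), dtype=np.float32)
    # Characters past max_len are dropped; shorter text leaves zero columns.
    for pos, ch in enumerate(text.lower()[:max_len]):
        i = index.get(ch)
        if i is not None:              # unknown characters stay all-zero
            out[i, pos] = 1.0
    return out

q = quantize("abca!", alphabet="abc", max_len=4)
```

Each column of `q` is the encoding of one character, so the matrix can be fed directly to a 1-D convolution along the position axis.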

Model Design

We design two ConvNets, each with 6 convolutional layers and 3 fully connected layers.
They differ in frame size. Here is the illustration of the model.

Data Augmentation

Text datasets are often too small. When we do not have enough data, we need data augmentation. Here we replace some words in a sentence with their synonyms.
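
A minimal sketch of synonym replacement. The thesaurus here is a hypothetical hand-written dict, and replacing each word independently with probability `p` is a simplification of my own, not necessarily how the paper picks the words.

```python
import random

def augment(sentence, thesaurus, p=0.5, seed=None):
    """Replace words that have synonyms, each with probability p."""
    rng = random.Random(seed)
    out = []
    for word in sentence.split():
        synonyms = thesaurus.get(word.lower())
        if synonyms and rng.random() < p:
            out.append(rng.choice(synonyms))
        else:
            out.append(word)
    return " ".join(out)

thesaurus = {"good": ["fine"], "movie": ["film"]}   # toy thesaurus
always = augment("a good movie", thesaurus, p=1.0)
never = augment("a good movie", thesaurus, p=0.0)
```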

Dataset

Here we use 5 datasets to evaluate our method.

(1) DBpedia, which has 14 classes, 560K training, 70K testing.
(2) Amazon reviews, which has 5 classes, 3M training, 650K testing.
(3) Yahoo! Answers, which has 10 classes, 1.4M training, 60K testing.
(4) AG's news corpus, which has 4 classes, 120K training, 7.6K testing.
(5) Sogou News, which has 5 classes, 360K training, 60K testing.

Here we show the result of (1):

Here we show the result of (2):

Here we show the result of (3):

Here we show the result of (4):

Here we show the result of (5):

Thursday, May 12, 2016

DeepFace: Closing the Gap to Human-Level Performance in Face Verification

Introduction:
Most work on face recognition is made up of 4 stages: detect, align, represent, classify. In the paper, we focus on the alignment and representation stages. Based on this method, we get performance that is better than the state-of-the-art methods and close to human-level performance.

Face Alignment:
The pipeline is briefly introduced as follows:

(a) Use 6 fiducial points to locate and crop the face.
(b) Use another 67 fiducial points to fit a 3D face shape.

Feature Representation:
The frontalized crop is the input to the following DNN architecture.

Experiment:

Dataset:

Social Face Classification (SFC), 4.4M images, 4030 people.
Labeled Faces in the Wild (LFW), 13.2K images, 5749 people.
YouTube Faces (YTF), 3425 YouTube videos, 1595 subjects.

Result:
DeepFace beats the state-of-the-art methods and comes close to human-level performance.
