Tuesday, June 21, 2016

Nonlinear dimensionality reduction by locally linear embedding

Introduction:
The paper introduces locally linear embedding (LLE), a method for nonlinear dimensionality reduction.

Method:

In this figure, the middle panel (B) shows points sampled from the manifold in A.

If we flatten (unroll) the points in B, we get C.

螢幕快照 2016-03-28 下午8.28.06.png

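The algorithm has 3 steps:

1. For each point, select its K nearest neighbors.

2. For each point, find the K weights that minimize the squared error between the point and its reconstruction from its neighbors (that is, between X and WX), with the weights constrained to sum to one.

3. In the low-dimensional space, find Y (the embedding coordinates corresponding to X) that minimizes the squared error between Y and WY, keeping the weights W fixed.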
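The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's reference implementation: the function name `lle`, the brute-force neighbor search, the regularization term `reg`, and the toy Swiss-roll data are all assumptions made for the example.

```python
import numpy as np

def lle(X, n_neighbors=10, n_components=2, reg=1e-3):
    """Minimal LLE sketch. X has shape (n_samples, n_features)."""
    n = X.shape[0]

    # Step 1: K nearest neighbors by Euclidean distance (brute force).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)              # a point is not its own neighbor
    neighbors = np.argsort(d2, axis=1)[:, :n_neighbors]

    # Step 2: weights that best reconstruct each point from its neighbors,
    # constrained to sum to one.
    W = np.zeros((n, n))
    for i in range(n):
        Z = X[neighbors[i]] - X[i]            # neighbors centered on x_i
        G = Z @ Z.T                           # local Gram matrix (K x K)
        # Regularize so G is invertible even when K > n_features (assumption).
        G += np.eye(n_neighbors) * reg * max(np.trace(G), 1e-12)
        w = np.linalg.solve(G, np.ones(n_neighbors))
        W[i, neighbors[i]] = w / w.sum()

    # Step 3: with W fixed, minimize sum_i |y_i - sum_j W_ij y_j|^2.
    # The solution is given by the bottom eigenvectors of (I - W)^T (I - W).
    I_minus_W = np.eye(n) - W
    M = I_minus_W.T @ I_minus_W
    eigvals, eigvecs = np.linalg.eigh(M)      # eigenvalues in ascending order
    # Skip the first eigenvector (constant vector, eigenvalue close to 0).
    Y = eigvecs[:, 1:n_components + 1]
    return Y, W

# Toy usage: points sampled from a Swiss-roll-like surface in 3-D.
rng = np.random.default_rng(0)
t = 1.5 * np.pi * (1 + 2 * rng.random(300))
X = np.column_stack([t * np.cos(t), 20 * rng.random(300), t * np.sin(t)])
Y, W = lle(X, n_neighbors=12, n_components=2)
print(Y.shape)  # (300, 2)
```

Each row of W has at most K nonzero entries and sums to one, which is what lets the same weights be reused unchanged in the low-dimensional space.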



Faster R-CNN

Introduction:
This work aims to improve both the speed and the accuracy of Fast R-CNN. The bottleneck of Fast R-CNN is object proposal generation, so the method focuses on the region proposal step.

Method:

Region Proposal Network (RPN)

The RPN takes a convolutional feature map from the CNN as input and outputs a set of rectangular object proposals, each with an objectness score.

Fast R-CNN uses selective search to generate proposals; Faster R-CNN replaces it with the RPN.
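To make the RPN's output concrete, here is a small sketch of how scored boxes could be reduced to a short list of proposals: sort by score, drop near-duplicates with greedy non-maximum suppression, and keep the top N. The notes above do not describe this post-processing, so treat it as an assumption; the function names, the IoU threshold, and the toy boxes are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def select_proposals(boxes, scores, iou_threshold=0.7, top_n=300):
    """Greedy NMS over scored boxes; returns indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a higher-scored kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
        if len(keep) == top_n:
            break
    return keep

# Toy usage: three boxes around the same object and one elsewhere.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 150, 160), (11, 9, 49, 51)]
scores = [0.9, 0.8, 0.75, 0.6]
kept = select_proposals(boxes, scores)
print(kept)  # [0, 2]
```

Because the proposals arrive already ranked by score, a downstream detector can work with only the top few hundred of them.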

Here is the illustration.

Results compared with Fast R-CNN:


With fewer proposals, Faster R-CNN still achieves a higher mAP than Fast R-CNN, and its execution time is also much lower. Faster R-CNN therefore outperforms Fast R-CNN in both speed and accuracy.

Deep neural networks for acoustic modeling in speech recognition

Introduction:
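In speech recognition, GMM-HMM systems were long the dominant approach. But GMMs have a serious weakness: they are inefficient at modeling data that lie on or near a nonlinear manifold. For example, if most points lie close to the surface of a sphere, a suitable model needs only a few parameters to describe them, whereas a GMM requires a very large number of parameters. Researchers have therefore replaced the GMM with a DNN, and they have shown that the DNN outperforms the GMM.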

Method:
Restricted Boltzmann machine (RBM)

1. First, a Gaussian-Bernoulli RBM (GRBM) is trained to model a window of frames of real-valued acoustic coefficients.

2. Then the states of the binary hidden units of the GRBM are used as data for training an RBM. This is repeated to create as many hidden layers as desired.

3. Then the stack of RBMs is converted to a single generative model, a DBN, by replacing the undirected connections of the lower level RBMs by top-down, directed connections.

4. Finally, a pre-trained DBN-DNN is created by adding a “softmax” output layer that contains one unit for each possible state of each HMM.
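The greedy layer-by-layer procedure above can be sketched as follows. This is a simplified illustration under several assumptions, not the paper's setup: every layer is a Bernoulli RBM trained with one-step contrastive divergence (the first layer in the paper is a GRBM), hidden probabilities are passed upward instead of sampled binary states, training is full-batch, and the data, layer sizes, and number of HMM states are made up.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=5, lr=0.1, rng=None):
    """Train a Bernoulli RBM with one-step contrastive divergence (CD-1)."""
    rng = np.random.default_rng(0) if rng is None else rng
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_vis, b_hid = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        # Positive phase: hidden probabilities and sampled states given data.
        h_prob = sigmoid(data @ W + b_hid)
        h_state = (rng.random(h_prob.shape) < h_prob).astype(float)
        # Negative phase: one Gibbs step down to the visibles and up again.
        v_recon = sigmoid(h_state @ W.T + b_vis)
        h_recon = sigmoid(v_recon @ W + b_hid)
        # CD-1 update: data statistics minus reconstruction statistics.
        W += lr * (data.T @ h_prob - v_recon.T @ h_recon) / len(data)
        b_vis += lr * (data - v_recon).mean(axis=0)
        b_hid += lr * (h_prob - h_recon).mean(axis=0)
    return W, b_hid

def pretrain_stack(data, layer_sizes):
    """Steps 1-2: train RBMs one at a time, each on the layer below's output."""
    layers, x = [], data
    for n_hidden in layer_sizes:
        W, b = train_rbm(x, n_hidden)
        layers.append((W, b))
        x = sigmoid(x @ W + b)        # hidden activations feed the next RBM
    return layers

def add_softmax_layer(layers, n_states, rng=None):
    """Step 4: append an output layer with one unit per HMM state."""
    rng = np.random.default_rng(1) if rng is None else rng
    n_top = layers[-1][0].shape[1]
    W_out = 0.01 * rng.standard_normal((n_top, n_states))
    return layers + [(W_out, np.zeros(n_states))]

def forward(layers, x):
    """Feed-forward pass of the resulting DNN; returns state posteriors."""
    for W, b in layers[:-1]:
        x = sigmoid(x @ W + b)
    W_out, b_out = layers[-1]
    logits = x @ W_out + b_out
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

# Toy usage: random values in [0, 1] stand in for windows of acoustic frames.
frames = np.random.default_rng(42).random((200, 40))
dnn = add_softmax_layer(pretrain_stack(frames, [64, 64]), n_states=10)
posteriors = forward(dnn, frames)
print(posteriors.shape)  # (200, 10)
```

After this initialization the whole network would normally be fine-tuned with backpropagation on HMM state labels; that stage is omitted here.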



Experiment:


Sequence to Sequence – Video to Text



Introduction:
Given a sequence of video frames as input, can a machine output a sequence of words? This problem is called video captioning. Current state-of-the-art approaches use LSTMs. This paper combines a CNN with LSTMs to generate sentences describing the video.

Method:
First, features are extracted from the input frames using AlexNet, VGG, and GoogLeNet.

Then, in the second part, 2 LSTMs are used. In this picture, the red one models the video sequence, and the green one models the word sequence.

螢幕快照 2016-06-02 下午3.18.58.png

The green one is given the text input and the hidden representation of the video, and predicts the next word.

The flow images are built from optical flow features extracted between frames.
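The two-LSTM arrangement described above can be sketched in NumPy as follows. This is only an illustration of the data flow, with untrained random weights: the `LSTMCell` class, the `caption` function, the zero padding used when one input is absent, the greedy decoding, and all sizes are assumptions for the example. Real frame features would come from the CNN.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    """Plain LSTM cell with randomly initialized (untrained) weights."""
    def __init__(self, n_in, n_hidden, rng):
        self.n_hidden = n_hidden
        self.W = 0.1 * rng.standard_normal((n_in + n_hidden, 4 * n_hidden))
        self.b = np.zeros(4 * n_hidden)

    def step(self, x, h, c):
        z = np.concatenate([x, h]) @ self.W + self.b
        i, f, o, g = np.split(z, 4)           # input, forget, output, candidate
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        return h, c

def caption(frames, video_lstm, word_lstm, embed, W_vocab, bos, eos, max_len=10):
    """Read all frames, then emit word ids greedily until eos or max_len."""
    h1, c1 = np.zeros(video_lstm.n_hidden), np.zeros(video_lstm.n_hidden)
    h2, c2 = np.zeros(word_lstm.n_hidden), np.zeros(word_lstm.n_hidden)
    pad_word = np.zeros(embed.shape[1])
    pad_frame = np.zeros(frames.shape[1])
    # Encoding: the video LSTM (red) reads frame features; the word LSTM
    # (green) sees its hidden state plus a zero pad in place of a word.
    for x in frames:
        h1, c1 = video_lstm.step(x, h1, c1)
        h2, c2 = word_lstm.step(np.concatenate([h1, pad_word]), h2, c2)
    # Decoding: the video LSTM gets zero frames; the word LSTM gets the
    # video hidden state plus the previous word and predicts the next word.
    words, prev = [], bos
    for _ in range(max_len):
        h1, c1 = video_lstm.step(pad_frame, h1, c1)
        h2, c2 = word_lstm.step(np.concatenate([h1, embed[prev]]), h2, c2)
        prev = int(np.argmax(h2 @ W_vocab))
        if prev == eos:
            break
        words.append(prev)
    return words

# Toy usage with made-up sizes: 8 frames of 16-d features, 20-word vocabulary.
rng = np.random.default_rng(0)
n_feat, n_hidden, n_embed, n_vocab = 16, 32, 12, 20
video_lstm = LSTMCell(n_feat, n_hidden, rng)
word_lstm = LSTMCell(n_hidden + n_embed, n_hidden, rng)
embed = rng.standard_normal((n_vocab, n_embed))
W_vocab = rng.standard_normal((n_hidden, n_vocab))
frames = rng.standard_normal((8, n_feat))
words = caption(frames, video_lstm, word_lstm, embed, W_vocab, bos=0, eos=1)
print(words)
```

With untrained weights the output word ids are meaningless; the point is only that the same pair of LSTMs handles both reading the frames and generating the sentence.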


Dataset:

Here we have 3 datasets.

螢幕快照 2016-06-02 下午3.25.41.png

Experiment Result: