2016年6月21日 星期二

Nonlinear dimensionality reduction by locally linear embedding

Introduction:
The paper introduces the method of nonlinear dimension reduction.

Method:

In this figure, the middle part is the points sampled from the hyper-plane in A.

If we flatten the points in B, we could get C.

螢幕快照 2016-03-28 下午8.28.06.png

Here we have 3 steps:

1. For each point, select K neighbors.

2. Find K weights which minimize the squared error between X and WX.

3. In low-dimensional space, find Y (which is corresponding feature of X) which minimize the squared error between Y and WY.



Faster RCNN

Introduction:
The work aims to boost the speed and the performance of fast-rcnn. The bottleneck of fast-rcnn is object proposal. So the main method will focus on region proposal.

Method:

Region Proposal Network (RPN)

For RPN, the input is CNN conv feature map, and the output is rectangular object proposals along with object scores.

In fast-rcnn, selective search is used, and RPN will replace it in faster-rcnn.

Here is the illustration.

Result Compared with Fast-rcnn:


With fewer proposals, the mAP still beat fast-rcnn, and the execution time is also much less than fast-rcnn. So faster-rcnn performs better than fast-rcnn on both speed and accuracy. 

Deep neural networks for acoustic modeling in speech recognition

Introduction:
In speech recognition, GMM-HMM is strong before. But GMM has a big problem. If most points lie on the surface of a sphere, we can use little parameters to model that. But in GMM, it requires very large number of parameters. Now researchers replace GMM with DNN, and they have shown that DNN outperforms GMM.

Method:
Restricted Boltzmann machine (RBM)

1. First a GRBM is trained to model a window of frames of real-valued acoustic coefficients.

2. Then the states of the binary hidden units of the GRBM are used as data for training an RBM. This is repeated to create as many hidden layers as desired.

3. Then the stack of RBMs is converted to a single generative model, a DBN, by replacing the undirected connections of the lower level RBMs by top-down, directed connections.

4.Finally, a pre-trained DBN-DNN is created by adding a “softmax” output layer that contains one unit for each possible state of each HMM.



Experiment:


Sequence to Sequence – Video to Text



Introduction:
Given the input of sequences of video frames, can a machine outputs a sequence of words? Such kind of this problem is called video captioning. Current state-of-the-art work is using LSTM. In this paper, we combine CNN and LSTM, to generate sentences describing the video.

Method:
First, we extract the features of input frames, we apply AlexNet, VGG, and GoogleNet to do this work.

Then 2 LSTMs are used in second part. In this picture, the red one models the video sequence, and the green one models the word sequence.

螢幕快照 2016-06-02 下午3.18.58.pngThe green one are given text input and video hidden representation, which will predict the next word.

The flow images uses optical flow features extracted between frames.












Dataset:

Here we have 3 datasets.

螢幕快照 2016-06-02 下午3.25.41.png

Experiment Result:



2016年5月19日 星期四

Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding

Introduction:

DNN is powerful in computer vision, such as image classifier, detection. But its advantage is that it has too much parameters. If we can discard many parameters but keep the performance, we can save more memory usage. In this paper, we describe 3 stages to remove parameters.

Here is the brief summary of this method:


Stage 1: Pruning

If a weight is below a threshold, we set it to zero. Now we only need to keep the position of the weights which is not zero. Instead of keeping the absolute position, we store the difference between consecutive addresses. Here is the illustration:
Pruning reduced the number of parameters by 9× and 13× for AlexNet and VGG-16 model.

Stage 2: Quantization and weight sharing

Now we cluster the weights into different clusters. In this figure, instead of using 32-bit floating number to store a weight, we now only use 2-bit index, since we have only 4 clusters.  


Stage 3: Huffman Coding

The basic idea of Huffman Coding is that using few bits to represent high-frequency things.

Experiment:

1. Parameters reduced but retain the loss.


2. Statistics about compressing AlexNet


3. Speedup

Text Understanding from Scratch



Introduction:

When we process sentences, some NLP models extract semantic level feature, like word2vec or N-gram models. In this work, it encodes sentences to character-level feature, which performs better than the former feature.

ConvNet

The main component of ConvNet is convolutional module, which computes a 1-D convolution between input and output.

The idea is briefly illustrated in this figure.

Character Quantization

Given a sentence, we quantize each character using 1-of-m encoding, where m is the number of alphabets. It's very simple but it works like Braille, which helps blind people reading.

If a sentence is longer than L characters, we remove those exceeding characters.
We use

Model Design

We design two ConvNets, which both have 6 convolutional layers and 3 fc layers.
The difference is the frame size. Here is the illustration of model.


Data Augmentation

The size of text data is always annoying. We need to do data augmentation if we have no sufficient data. Here we replace some words in a sentence with their synonyms.

Dataset

Here we use 5 datasets to evaluate our method.

(1) DBpedia, which has 14 classes, 560K training , 70K testing
(2) Amazon reviews, which has 5 classes, 3M training and 650K testing.
(3) Yahoo! Answers, which has 10 classes, 1.4M training, 60K testing.
(4) AG's news corpus, which has 4 classes, 120K training, 7.6K testing.
(5) Sogou News, which has 5 classes, 360K training, 60K testing.

Here we show these the result of (1):


Here we show these the result of (2):


Here we show these the result of (3):


Here we show these the result of (4):


Here we show these the result of (5):


2016年5月12日 星期四

DeepFace: Closing the Gap to Human-Level Performance in Face Verification

Introduction:
Most of works about face recognition is made up of 4 stages, detect, align, represent, classify. In the paper, we focus on the stage of detection and alignment, based on this method, we get the performance which is better than the state-of-the-art method and close to human-level performance.

Face Alignment:
The pipeline is briefly introduced as follows:

(a) Use 6 base points to bound face.
(b) Use another 67 points to get 3D shape face.


Feature Representation:
The frontalized crop will be the input of the following DNN architecture.



Experiment:

Dataset:

Social Face Classification (SFC), 4.4M images, 4030 people.
Labeled Face in the Wild (LFW), 13.2K images, 5749 people.
Youtube Face (YTF), 3425 Youtube videos, 1595 subjects.

Result:
DeepFace can beat state-of-the-art method and be close to human-level performance.




2016年4月20日 星期三

Two-Stream Convolutional Networks for Action Recognition in Videos

When we are watching an video, we first select objects that we interested, then we see how the objects moving or doing something. Based on this intuition, we proposed two networks to do action recognition, where one is called spatial ConvNet, and the other is called temporal Convnet.

In spatial network, we use single still frame extracted from videos as the input. And in temporal network, we use optical flow displacement fields extracted from consecutive frames as our input. This paper mainly focuses on how they deal with optical flow features.

Optical flow

In high school physics, if an object moves from point a to b, the distance is called displacement. In consecutive 2 frames, we use the following figure to explain:

See the rectangle in (a) and (b), the hands is moving
In (c), the hands moving toward some direction is represented by the optical flow.
In (d), it's horizontal movement part.
In (e), it's vertical movement part.

It is obviously that if we see consecutive frames, we can easily guess what an object is doing, the more frames we see the answer is clearer. Now suppose we watch L consecutive frames, we introduce some stacking method to combine these feature extracted from L frames.

Optical flow stacking 

As we see in (d) and (e), for each consecutive 2 frames, we get 2 channels like (d) and (e). Given a static point, we record the displacement vector at that point for each frame. The idea is illustrated here:


Trajectory stacking

Given a point, in frame 0, we record the displacement vector and the stop point. Then in next frame (frame 1), we record the displacement starting from the stop point at the previous frame. The idea is also illustrated here:

For optical flow stacking, the object may be doing something but stand still, for example, archery competition. And for trajectory stacking, we may want to indicate that a man is walking.

Bi-directional optical flow

For frame 0 to L/2, we do the same thing as the previous two method do. But from frame L/2 to L, we record the reversed displacement vector, do this repeatedly until trace back to frame 0.

Mean flow subtraction

It's simple normalization, the details are not discussed here.
----------------------------------------------------------------------------------------------
Result

For spatial network, we can reach 72.8% only given still images. Adding information from consecutive frames, the result is as follow:
If we watch more frames, the answer we guess could be more precise. The result follows our intuition.

2016年4月6日 星期三

A Bayesian Hierarchical Model for Learning Natural Scene Categories

This paper proposes a method to automatically learn intermediate representations of an image, which each image is constructed of many codewords. Then in the learning process, we construct a model to figure out how likely the composition of codewords for each class.

The task can be simply described by this figure:
If we are given the following clues to figure out which category an image belong to, how will we do?
    (a) There are C classes, K themes, T codewords.
    (b) For each class, we know that some of themes are likely seen.
    (c) Given a class and some themes, it's easily to describe an image.
So we can do the following procedure:
    (a) Select a class c based on a probability distribution, called 'eta'.
    (b) Then select some themes which is likely to be seen given class c, also based on distribution 'sita'.
    (c) Using the given materials, then we choose some words from codebook to describe an image.
The description of an image can be also drawing from a distribution called 'beta'.
So here is the illustration of the procedure:
    
The arrows in this figure follow the procedure described in the texts. 'Eta', 'beta', and 'sita' is called latent variables. We cannot direct find it since the original data didn't give us. That means we can only guess it based on some observations.

Dataset

The dataset contains 13 categories and about 3.7K images.

The following two figures describes (a) the accuracy for each category (b) the size of each distribution


2016年3月16日 星期三

Iterative Quantization: A Procrustean Approach to Learning Binary Codes

      We usually use d-dimension vector to represent an image, but if we consider the usage of memory, we hope to use smaller dimension binary vector to represent an image. But when we do this way, we may have some error, the error is better to be smaller. And the other way, for a new image, the algorithm should be efficient to calculate the binary representation vector.
    We hope that each bit contains more information, means that the variance should be high for each bit, and the inner-product for each pair-wise binary code, so we have the following formula:
    If we set the distribution of a bit to be half 1s and half 0s, we can easily find the maximum, so we rewrite the formula as follow:
    The objective function is the same as what PCA does, so we take the first c eigenvectors of XTX to be the initial W.
    In PCA, the variance for each component is not the same, so we want to rotate the components generated by PCA, makes the variance closer. Here is the image illustrating this idea: 
    With rotation, the quantization error in (c) is the smallest. We rewrite the objective function as follows:
    Where V=XW, B is the encoding result, and R is the rotation matrix, here we proposed a method called ITQ, for each iteration, two steps are implemented, (1)fix R and update B, (2)fix B, update R, with higher iterations, quantization error gets smaller, here is the experiment result:


2016年3月15日 星期二

Aggregating local descriptors into a compact image representation

    This paper focuses on searching images from a large dataset (10 million images or more). But here we have three parts to optimize.

  1. The representation of an image
  2. The dimension reduction of these representation vectors.
  3. The indexing algorithm.

    In this paper, it uses a descriptor called VLAD, which is derived from Bag-of-Words and Fisher Kernel, then aggregates SIFT descriptors.
    In the second part, it uses asymmetric distance computation to do approximate nearest neighbors search. Then it uses PCA to do dimension reduction.
   We mainly focus on part 1 and part 2. And here is the flow chart[1] to describe what this paper does.





     [1] http://users.auth.gr/espyromi/publications/slides/spyromitrosWIAMIS2012slides.pdf

Online Dictionary Learning for Sparse Coding

     When we write a sentence, we would like to use less words to express what we want to say. In sparse coding, if we have some vectors (called D), we are likely to do simple linear combination of these vectors to represent an image.
    Based on our goal, we have the following cost function:

    The loss function acts like a distance function, we should use D to represent images but don't want lose much information. Suppose the each x is m-dimensional vector, then D is a m*k-dimensional matrix, where k is smaller than the numbers of images.(e.g. k=200, n=100,000)
    The detail of the cost function is as follow:
    The second term is the regularization term, if we decrease alpha, make D larger, we can minimize the second term, but it helps nothing. We hope that we have the simplest alpha, so we rewrite the object function:  
    The above formula limits the size for each column of D, and the cost function follows the limitation:
    
    In the paper, we use online learning, which means we add an images each iteration, then update alpha and D, here is the algorithm:
     In the procedure of updating D, we use D_(t-1) as the initial point, here is the algorithm of updating D.
    The length for each column of D would be less than 1, which follows our constraint. Here we discuss some conditions we may counter:
    (1) In real data, we may have fewer images, which means we may select the same image in different iterations, suppose at time t0 we draw an image, at time t we draw the same image again, here is the way we update A:
    We discard the information of time t0. In the implementation, we hardly memory the history of images drawed.
   (2) Here we use a batch of images to update:
   (3) Some columns of D contain nothing, we should discard the columns when we update D.