2016年4月20日 星期三

Two-Stream Convolutional Networks for Action Recognition in Videos

When we are watching an video, we first select objects that we interested, then we see how the objects moving or doing something. Based on this intuition, we proposed two networks to do action recognition, where one is called spatial ConvNet, and the other is called temporal Convnet.

In spatial network, we use single still frame extracted from videos as the input. And in temporal network, we use optical flow displacement fields extracted from consecutive frames as our input. This paper mainly focuses on how they deal with optical flow features.

Optical flow

In high school physics, if an object moves from point a to b, the distance is called displacement. In consecutive 2 frames, we use the following figure to explain:

See the rectangle in (a) and (b), the hands is moving
In (c), the hands moving toward some direction is represented by the optical flow.
In (d), it's horizontal movement part.
In (e), it's vertical movement part.

It is obviously that if we see consecutive frames, we can easily guess what an object is doing, the more frames we see the answer is clearer. Now suppose we watch L consecutive frames, we introduce some stacking method to combine these feature extracted from L frames.

Optical flow stacking 

As we see in (d) and (e), for each consecutive 2 frames, we get 2 channels like (d) and (e). Given a static point, we record the displacement vector at that point for each frame. The idea is illustrated here:


Trajectory stacking

Given a point, in frame 0, we record the displacement vector and the stop point. Then in next frame (frame 1), we record the displacement starting from the stop point at the previous frame. The idea is also illustrated here:

For optical flow stacking, the object may be doing something but stand still, for example, archery competition. And for trajectory stacking, we may want to indicate that a man is walking.

Bi-directional optical flow

For frame 0 to L/2, we do the same thing as the previous two method do. But from frame L/2 to L, we record the reversed displacement vector, do this repeatedly until trace back to frame 0.

Mean flow subtraction

It's simple normalization, the details are not discussed here.
----------------------------------------------------------------------------------------------
Result

For spatial network, we can reach 72.8% only given still images. Adding information from consecutive frames, the result is as follow:
If we watch more frames, the answer we guess could be more precise. The result follows our intuition.

2016年4月6日 星期三

A Bayesian Hierarchical Model for Learning Natural Scene Categories

This paper proposes a method to automatically learn intermediate representations of an image, which each image is constructed of many codewords. Then in the learning process, we construct a model to figure out how likely the composition of codewords for each class.

The task can be simply described by this figure:
If we are given the following clues to figure out which category an image belong to, how will we do?
    (a) There are C classes, K themes, T codewords.
    (b) For each class, we know that some of themes are likely seen.
    (c) Given a class and some themes, it's easily to describe an image.
So we can do the following procedure:
    (a) Select a class c based on a probability distribution, called 'eta'.
    (b) Then select some themes which is likely to be seen given class c, also based on distribution 'sita'.
    (c) Using the given materials, then we choose some words from codebook to describe an image.
The description of an image can be also drawing from a distribution called 'beta'.
So here is the illustration of the procedure:
    
The arrows in this figure follow the procedure described in the texts. 'Eta', 'beta', and 'sita' is called latent variables. We cannot direct find it since the original data didn't give us. That means we can only guess it based on some observations.

Dataset

The dataset contains 13 categories and about 3.7K images.

The following two figures describes (a) the accuracy for each category (b) the size of each distribution