AMMAI: A Bayesian Hierarchical Model for Learning Natural Scene Categories

This paper proposes a method to automatically learn intermediate representations of an image, which each image is constructed of many codewords. Then in the learning process, we construct a model to figure out how likely the composition of codewords for each class.

The task can be simply described by this figure:

If we are given the following clues to figure out which category an image belong to, how will we do?

(a) There are C classes, K themes, T codewords.

(b) For each class, we know that some of themes are likely seen.

So we can do the following procedure:

(a) Select a class c based on a probability distribution, called 'eta'.

(b) Then select some themes which is likely to be seen given class c, also based on distribution 'sita'.

The description of an image can be also drawing from a distribution called 'beta'.

So here is the illustration of the procedure:

The arrows in this figure follow the procedure described in the texts. 'Eta', 'beta', and 'sita' is called latent variables. We cannot direct find it since the original data didn't give us. That means we can only guess it based on some observations.

Dataset

The dataset contains 13 categories and about 3.7K images.

The following two figures describes (a) the accuracy for each category (b) the size of each distribution