Geniferology: On visual recognition

I will try to consider visual recognition from a high level, though my approach may not be the only possible one.

We're given the image projected on the retina. We seek the model that generated the image. A mathematics-favored way is to think of the model as finitely generated from a set of primitives. (The primitives could be infinite too, but let's start with simpler.)

For example, the primitives could be a class of geometric shapes such as cylinders, rectangles, polygons, etc. These can be given concisely by their positions and size / shape parameters.

Bayesian learning

We are interested in finding the maximum of $P(model | data)$, ie, the most probable model given the data. The models would be specified by some parameters as model$(r)$.

Bayes' law says that:

$$P(model | data) = \frac{P(data | model) P(model)}{P(data)}.$$

What is special about this formula is that the right hand side contains $P(data | model)$ which is usually much easier to compute than the left hand side.

In our vision example, $P(data | model)$ means that we are given the model, so we can use the known function $f$ (see above figure) to compute the data. Whereas, $P(model | data)$ requires us to use $f^{-1}$ to compute the model which is much harder.

But I have not spelled out the details of the computation, particularly at the pixel level...

A highly abstract and simple example

Now let's try to understand the whole process from an abstract perspective (without reference to Bayesian learning).

To reduce matters to the simplest case, let's consider recognizing straight lines on a $100 \times 100$ image (with noise). Assume the input image always contains only 1 line, that could be slightly non-straight.

Every straight line can be represented by 2 parameters $(c,d)$ via its equation:

$ cx + dx = 1. $

So a naive, but impractical, algorithm is simply to:

visit all points in the parameter space $(c,d)$
draw each line as image data, $D$
compare $D$ with our input image $D_0$
return the line that minimizes the difference $\Delta(D, D_0)$

But of course, this is unfeasible because the search space $(c,d)$ is too large (indeed infinite).

Feature extraction

Vision researchers usually use a pre-processing stage known as feature extraction. I'm not saying it is necessary, but let's see what happens. The input image can be broken into a smaller number of edgels. Their number is vastly smaller than the number of pixels. Say, each edgel is 5 pixels long. So a line across a $100 \times 100$ image would be on the order of ~20 edgels long.

The situation can be summarized by this commutative diagram:

We know how to do $f$, which we assume is easy. And assume we know how to do $\Phi$. So we know how to do $\Psi = \Phi \circ f$.

Now instead of computing the error with the raw image:

$\Delta_1 = || f(\mathbf{r}) - D_0 || $

we can compare their features-space representations:

$\Delta_2 = || \Psi(\mathbf{r}) - \Phi(D_0) || $

which is more efficient (maybe?).

For our example, we can use $(c,d)$ to generate a set of features $F$, which are edgels each having a position and slope. In symbols:

$(c,d) \stackrel{\Psi}{\mapsto} F = \bigcup_{i=1}^N \{ (x_i, y_i, slope_i) \} $

where the x, y and slopes are simple functions of the parameters $\mathbf{r} = (c,d)$. The error could be defined as some kind of average of differences between individual edgel-pairs.

Notice that $\Psi^{-1}$ may still be difficult to compute.

See, the biggest inefficiency is actually not $\Psi$ itself, but is the size of the parameter space $(c,d)$, which forces us to evaluate $\Psi$ many times in a generate-and-test loop.

We can also see that feature extraction does not directly attack the bottleneck.

It would be very nice if we can directly calculate $\Psi^{-1}$. Then we can simply go from the input data $D_0$ via $\Phi$ and $\Psi^{-1}$ to get our answer, $(c,d)$. That does not require any searching! How can that be done?

Two possible strategies

In fact we have 2 possible strategies:

Along every step of the generative process $f$ or $\Psi$, use only invertible functions.
Use only differentiable functions in every step, so then we can set $\frac{\partial \Delta}{\partial r} = 0$ and so on to find the minimum of the error.

In both strategies, we are writing a functional program using a restricted class of functions. Then the computer can automatically find the inverse or derivatives for us, either numerically or symbolically.

In the first strategy, notice that some commonly-used functions do not have inverses. In particular, if the function is idempotent (ie, $f \circ f = f$), which includes min, max, projection, and set union. Note however, that these functions may be invertible if we allow interval-valued or set-valued functions. (Indeed, in vision we generally accept that one cannot see behind opaque objects, a consequence of the idempotency of occlusion.)

In the second strategy, we may need to create "soft" versions of non-differentiable functions such as min, max, and projection.

These are still rather vague, but I think are promising directions...

What about AI?

A final word is that we are just recognizing images as models composed of geometric primitives (such as cylinders). That is still different from natural labels such as "hands", "trees", "cars", or even descriptions such as "nude descending a staircase".

For the latter purpose, we need a further map, say $\Theta$, from natural-language descriptions to geometric models:

I have an impression that this map $\Theta$ is best represented using logical form. That is because many natural concepts, such as "chairs", do not have representations as connected regions in geometric space. Rather, they are defined via complex and intricate, "intensional" relationships. In other words, logical or relational. The formalism I'm developing is somewhat like logic, somewhat like algebra, and a bit like neurons. One can think of it as a universal language for expressing patterns in general.

1 comment:

Carsten CruseJanuary 15, 2013 at 4:59 AM
Dear YKY,

to analyse images by geometrical primitves or a logical/relational combination is only possible in easy worlds, not in complex worlds.

A better way is to find algorithms which automatic generate good features and their relation.

Best regards
Carsten

Blog Archive

Latex Maths

Tuesday, January 15, 2013

On visual recognition