Latex Maths

Thursday, January 24, 2013

Mathematics behind Google's page rank algorithm

Thanks to professor Raymond Chan of CUHK mathematics, who gave an undergrad talk on the mathematics of Google, which I attended by chance.  In this post I'll try to re-present what he said, with some of my own variations.  (Mistakes, if any, are mine.)

I also combined materials from chapter 2 of the book "A First Course in Applied Mathematics" by Jorge Rebaza (2012):



How much is the PageRank algorithm worth?


In 2012:
    Google annual revenue = US$ 50.2  billion
    Google total assets   = US$ 93.8  billion
    Google market cap     = US$ 248.2 billion
    Hong Kong's GDP       = US$ 243.7 billion

So the entire Hong Kong population working for a year is roughly worth the outstanding stock of Google.  Perhaps learning the PageRank algorithm can give us an estimate of how useful mathematics can be :)

The algorithm

As of 2012 Google is searching 8.9 billion web pages.  First, it builds a directed graph ($\Gamma$) of how the web pages are linked to each other.

Example 1:

$$ A = \left[ \begin{array}{c c c c c} 0 & 0 & 0 & 1 & 0 \\  0 & 0 & 0 & 0 & 1 \\  0 & 1/2 & 0 & 0 & 1/2 \\ 1 & 0 & 0 & 0 & 0 \\  0 & 1 & 0 & 0 & 0  \end{array}  \right] $$

Example 2:

$$ A = \left[ \begin{array}{c c c c c c} 0 & 1 & 0 & 0 & 0 & 0 \\  1/2 & 0 & 1/2 & 0 & 0 & 0 \\  1/2 & 1/2 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1/3 & 0 & 1/3 & 1/3 \\  0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1/2 & 1/2 & 0 \end{array}  \right] $$

In the graph:
    nodes = web pages
    links   = out-going links from page to page

The matrices are the adjacency matrices of the graphs.  Each row represents the out-going links from the page $P_i$.  If a page has $n$ out-going links, each link is weighted by $1/n$.

Notice that each row sums to 1 (except the all-zero row in Example 2, which corresponds to a "dangling" page with no out-going links).  A matrix whose rows are nonnegative and sum to 1 is called a (row-)stochastic matrix.
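As a concrete illustration (a minimal Python sketch, not Google's actual code), here is how one might build the matrix of Example 1 from its link structure and check the row sums:

```python
import numpy as np

# Out-links of Example 1: page i -> list of pages it links to (0-indexed)
links = {0: [3], 1: [4], 2: [1, 4], 3: [0], 4: [1]}

n = 5
H = np.zeros((n, n))
for i, outs in links.items():
    for j in outs:
        H[i, j] = 1.0 / len(outs)   # each out-going link weighted by 1/|P_i|

print(H)
print(H.sum(axis=1))   # every row sums to 1 (a dangling page, as in Example 2, would give 0)
```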

Then the rank of a page $P$ is defined by adding up the ranks of the pages whose out-links point to it:
$$  r(P) = \sum_{P_i \in B_P} \frac{r(P_i)}{|P_i|} $$
where $B_P$ is the set of pages linking to $P$, $r(P_i)$ is the rank of page $P_i$, and $|P_i|$ denotes the total number of out-going links from page $P_i$.  The rationale is that it is a good thing to be pointed to, so the "goodness" of page $P_i$ is passed on to page $P$; but it has to be divided by $|P_i|$, because a page that points outwards "promiscuously" should have its influence diluted.
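For instance, in Example 1, page $P_2$ is pointed to by $P_3$ (which has 2 out-going links) and by $P_5$ (which has 1), so $r(P_2) = r(P_3)/2 + r(P_5)/1$.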

But this ranking has to be calculated iteratively, starting from an initial ranking.  The iterative equation has the form:
$$ r_k(P) = \sum_{P_i \in B_P} \frac{r_{k-1}(P_i)}{|P_i|}, \quad \quad k = 1, 2, \ldots $$

This means that the "goodness" (ranking) is passed around the graph iteratively until their values at every node converge to stable values.

In vector notation:  let $v_k = [ r_k(P_1) \; r_k(P_2) \; ... \; r_k(P_N) ]^T $.  Then the iterative equation becomes:
$$ v^T_k = v^T_{k-1} H $$
where
$$ H_{ij} = \left\{ \begin{array}{ll} \frac{1}{|P_i|} & \mbox{if } P_i \mbox{ has a link to } P_j \\ 0 & \mbox{otherwise} \end{array} \right. $$

Notice that the matrix $H$ is constant throughout the iteration.  That means we are repeatedly multiplying by $H$:
$$ v^T = \lim_{k \rightarrow \infty} v^T_k = \lim_{k \rightarrow \infty} v^T_0 H^k $$
where $v$ is the PageRank vector.
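As a minimal sketch (using the matrix of Example 1 rather than real web data), the iteration is just repeated right-multiplication by $H$:

```python
import numpy as np

# Hyperlink matrix H of Example 1 (row i = out-going link weights of page i)
H = np.array([
    [0, 0,   0, 1, 0  ],
    [0, 0,   0, 0, 1  ],
    [0, 0.5, 0, 0, 0.5],
    [1, 0,   0, 0, 0  ],
    [0, 1,   0, 0, 0  ],
])

v = np.full(5, 1 / 5)        # start from the uniform ranking
for k in range(1, 6):
    v = v @ H                # v_k^T = v_{k-1}^T H
    print(k, v)
# For a general graph, convergence is only guaranteed after the
# "Google matrix" modification described further below.
```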

The above is reminiscent of the way we repeatedly press the "cosine" button on a calculator, and see the results converge.  That value is the solution to the equation:
$$ \cos x = x. $$
In general, when we want to solve an equation of the form $ f(x) = x $, we can repeatedly apply $f$ starting from an initial value $x_0$.  A point $x^*$ for which $f(x^*) = x^*$ is called a fixed point, a very useful concept in mathematics.
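The calculator experiment, for instance, is just a short loop (radians assumed):

```python
import math

x = 1.0                      # any starting value works
for _ in range(100):
    x = math.cos(x)          # repeatedly press the "cosine" button

print(x, math.cos(x))        # both ~0.739085..., the fixed point of cos x = x
```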

The above iteration is called the power method, which converges to the eigenvector associated with the eigenvalue of largest magnitude of the matrix $H$ (here a left eigenvector, since we multiply $H$ from the left).  This is called the dominant eigenvector, which I will now explain...

Power method

This section explains why the power method converges.  Recall the definition of eigenvector and eigenvalue:
$$ A x = \lambda x $$
where $x$ = eigenvector, $\lambda$ = eigenvalue.

Assume that the matrix A satisfies these conditions:

  • $A$ has a complete set of linearly independent eigenvectors $v_1, ... v_n$
  • the corresponding eigenvalues can be ordered:
                  $ |\lambda_1| > |\lambda_2| \ge ... \ge |\lambda_n| $
    (notice the strict > in the first term)
For the iteration, choose an arbitrary initial vector $x_0$.  Because the eigenvectors form a basis, we can express $x_0$ as:
$$ x_0 = c_1 v_1 + ... + c_n v_n $$

Now we multiply the above with matrix A and get:
$$ \begin{array}{rl} A x_0 & = c_1 \lambda_1 v_1 + c_2 \lambda_2 v_2 + \ldots + c_n \lambda_n v_n \\ A^2 x_0 & = c_1 \lambda_1^2 v_1 + c_2 \lambda_2^2 v_2 + \ldots + c_n \lambda_n^2 v_n \\ & \vdots \\ A^k x_0 & = c_1 \lambda_1^k v_1 + c_2 \lambda_2^k v_2 + \ldots + c_n \lambda_n^k v_n \end{array} $$

And because $ |\lambda_1| > |\lambda_2| \ge \ldots \ge |\lambda_n| $, the weights of the later terms shrink relative to the first one as $k \rightarrow \infty$ (divide everything by $\lambda_1^k$ and note that $(\lambda_i / \lambda_1)^k \rightarrow 0$ for $i \ge 2$).  So, provided $c_1 \neq 0$, the direction of $A^k x_0$ converges to that of the dominant eigenvector $v_1$.

The rate of convergence is determined by the ratio $\frac{ |\lambda_2| } { |\lambda_1| }$.
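To make this concrete, here is a small self-contained power-method sketch (a toy $2 \times 2$ matrix, not the web matrix; the normalization keeps the iterates from growing or shrinking without changing their direction):

```python
import numpy as np

def power_method(A, iters=200):
    """Approximate the dominant eigenvector of A by repeated multiplication."""
    x = np.random.rand(A.shape[0])
    for _ in range(iters):
        x = A @ x
        x /= np.linalg.norm(x)      # renormalize; only the direction matters
    return x

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
print(power_method(A))              # ~ the dominant eigenvector (up to sign)
print(np.linalg.eigvals(A))         # |lambda_2| / |lambda_1| governs the convergence rate
```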


For the power method to converge, we need to ensure that the matrix is stochastic and irreducible....

Irreducible matrices

This section explains the condition needed for the power method to converge; but the measure that forces convergence also gave rise to Google's personalized search, as we shall see.

A matrix is called reducible if there exists a permutation matrix $P$ such that:
$$ P^T A P = \left[ \begin{array}{cc} C& D \\ \mathbf{0} & E \end{array} \right] $$
where $\mathbf{0}$ is an $(n-r) \times r$ block of zeros (so $C$ is $r \times r$ and $E$ is $(n-r) \times (n-r)$, for some $1 \le r < n$).

Observe that the above definition says $A$ is similar to a block upper triangular matrix.  (Note: two matrices $A$ and $B$ are similar if there is an invertible matrix $P$ such that $B = P A P^{-1}$.  This means the effect of applying $A$ is the same as applying $B$, up to a change of coordinates.  For a permutation matrix, $P^{-1}$ coincides with $P^T$.)

The significance of this definition is that:
matrix A is irreducible $\Longleftrightarrow \Gamma(A) $ is strongly connected

A graph is strongly connected if every node is reachable from any other node.

Example:  let
$$ A = \left[ \begin{array}{c c c c} 0 & 1/3 & 1/3 & 1/3 \\ 1/2 & 0 & 1/2 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{array} \right] $$
and
$$ B = P^T A P = \left[ \begin{array}{c c c c} 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1/3 & 1/3 & 0 & 1/3 \\ 1/2 & 0 & 1/2 & 0  \end{array} \right] $$
where
$$ P = \left[ \begin{array}{c c c c} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0  \end{array} \right] $$

The effect of multiplying by $P$ (a permutation matrix) is to exchange some rows (or columns), hence the name; conjugating as $P^T A P$ applies the same reordering to both rows and columns, i.e. it simply relabels the nodes of the graph.  (Here the zero block of $B$ sits in the upper-right corner; listing the nodes in reverse order moves it to the lower-left, matching the definition above.)
Notice that in matrix $A$ there is no way to go from nodes 3 & 4 to nodes 1 & 2.  Thus matrix $A$ is reducible.
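This reachability condition is easy to check mechanically.  Below is a minimal sketch (plain breadth-first search over adjacency lists, with the graph of matrix $A$ above written out by hand):

```python
from collections import deque

def reachable(graph, start):
    """Set of nodes reachable from `start` by following directed links."""
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def strongly_connected(graph):
    nodes = set(graph)
    return all(reachable(graph, s) == nodes for s in graph)

# Graph of matrix A above: nodes 3 & 4 cannot reach nodes 1 & 2
A_graph = {1: [2, 3, 4], 2: [1, 3], 3: [4], 4: [3]}
print(strongly_connected(A_graph))   # False -> A is reducible
```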

To ensure irreducibility (in order for the power method to work), Google forces every page to be reachable from every other page.

Google's solution is to add a perturbation matrix to $H$:
$$ G = \alpha H + (1 - \alpha) E  $$
where $G$ is the Google matrix, $\alpha \in (0,1)$, and $E = e \; e^T / n$, where $e$ is a vector of 1's.  This makes every page reachable from every other page.  Later, Google modified this to:
$$ E = e \; u^T $$
where $u$ is a probability vector called the personalization vector, which plays a great role in Google's success.  More details are explained in Rebaza's book.
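Putting the pieces together, here is a minimal PageRank sketch on the 5-page graph of Example 1 (the value $\alpha = 0.85$ and the uniform personalization vector are illustrative choices, not Google's actual settings):

```python
import numpy as np

# Hyperlink matrix of Example 1 (no dangling pages)
H = np.array([
    [0, 0,   0, 1, 0  ],
    [0, 0,   0, 0, 1  ],
    [0, 0.5, 0, 0, 0.5],
    [1, 0,   0, 0, 0  ],
    [0, 1,   0, 0, 0  ],
])

n = H.shape[0]
alpha = 0.85                      # damping factor (illustrative value)
u = np.full(n, 1.0 / n)           # personalization vector (uniform here)
e = np.ones(n)

G = alpha * H + (1 - alpha) * np.outer(e, u)   # G = alpha*H + (1-alpha)*e*u^T

v = np.full(n, 1.0 / n)
for _ in range(100):
    v = v @ G                     # power iteration: v^T <- v^T G
print(v, v.sum())                 # the PageRank vector (sums to 1)
```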

Hope this is helpful to you :)

Tuesday, January 15, 2013

On visual recognition

I will try to consider visual recognition from a high level, though my approach may not be the only possible one.



We're given the image projected on the retina, and we seek the model that generated the image.  A mathematically natural way is to think of the model as finitely generated from a set of primitives.  (The set of primitives could be infinite too, but let's start with the simpler case.)

For example, the primitives could be a class of geometric shapes such as cylinders, rectangles, polygons, etc.  These can be given concisely by their positions and size / shape parameters.

Bayesian learning


We are interested in finding the maximum of $P(model | data)$, i.e., the most probable model given the data.  The models are specified by some parameters $\mathbf{r}$, written model$(\mathbf{r})$.

Bayes' law says that:
$$P(model | data) = \frac{P(data | model) P(model)}{P(data)}.$$
What is special about this formula is that the right hand side contains $P(data | model)$ which is usually much easier to compute than the left hand side.

In our vision example, $P(data | model)$ means that we are given the model, so we can use the known function $f$ (see the figure above) to compute the data.  In contrast, $P(model | data)$ requires us to use $f^{-1}$ to recover the model, which is much harder.
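As a toy illustration (a made-up one-dimensional "image" and Gaussian noise, purely to show the mechanics), the forward direction $P(data | model)$ is a direct computation, and Bayes' law then scores every candidate model without ever inverting $f$:

```python
import numpy as np

xs = np.linspace(0, 1, 50)

def f(r):
    """Forward model: parameter r -> ideal (1-D) image."""
    return np.sin(2 * np.pi * r * xs)

sigma = 0.3
true_r = 0.42
data = f(true_r) + sigma * np.random.randn(xs.size)   # noisy observation

models = np.linspace(0.1, 1.0, 200)                   # discretized model space
prior = np.ones_like(models) / models.size            # flat prior P(model)

# P(data | model): easy -- run the forward model and score the residual
log_like = np.array([-np.sum((data - f(r))**2) / (2 * sigma**2) for r in models])

log_post = log_like + np.log(prior)                   # Bayes' law, unnormalized, in log space
print("MAP estimate of r:", models[np.argmax(log_post)])
```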

But I have not spelled out the details of the computation, particularly at the pixel level...


A highly abstract and simple example


Now let's try to understand the whole process from an abstract perspective (without reference to Bayesian learning).

To reduce matters to the simplest case, let's consider recognizing straight lines in a $100 \times 100$ image (with noise).  Assume the input image always contains exactly one line, which could be slightly non-straight.

Every straight line can be represented by 2 parameters $(c,d)$ via its equation:
$ cx + dy = 1. $

So a naive, but impractical, algorithm is simply to:
  • visit all points in the parameter space $(c,d)$
  • draw each line as image data, $D$
  • compare $D$ with our input image $D_0$
  • return the line that minimizes the difference $\Delta(D, D_0)$
But of course, this is unfeasible because the search space $(c,d)$ is too large (indeed infinite).
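A literal (and deliberately brute-force) sketch of this naive algorithm, with a crude made-up rasterizer and a small grid standing in for the infinite parameter space $(c,d)$:

```python
import numpy as np

N = 100

def draw_line(c, d):
    """Rasterize the line c*x + d*y = 1 on an N x N binary image (a crude stand-in)."""
    img = np.zeros((N, N))
    xs = np.linspace(0, 1, N)
    for i, x in enumerate(xs):
        if abs(d) > 1e-9:
            j = int(round((1 - c * x) / d * (N - 1)))
            if 0 <= j < N:
                img[j, i] = 1
    return img

# Noisy input image D_0 generated from a "true" line
D0 = np.clip(draw_line(1.3, 2.1) + 0.1 * np.random.rand(N, N), 0, 1)

# Generate-and-test over a coarse grid of (c, d)
grid = np.linspace(0.5, 3.0, 40)
best = min((np.sum((draw_line(c, d) - D0)**2), c, d)
           for c in grid for d in grid)
print("best (c, d):", best[1:], "error:", best[0])
```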

Feature extraction

Vision researchers usually use a pre-processing stage known as feature extraction.  I'm not saying it is necessary, but let's see what happens.  The input image can be broken into edgels (short edge segments), whose number is vastly smaller than the number of pixels.  Say each edgel is 5 pixels long; then a line across a $100 \times 100$ image would be on the order of ~20 edgels long.

The situation can be summarized by this commutative diagram:


We know how to compute $f$ (from parameters to image data), which we assume is easy, and assume we also know how to compute $\Phi$ (from image data to features).  So we know how to compute $\Psi = \Phi \circ f$.

Now instead of computing the error with the raw image:
$\Delta_1 = || f(\mathbf{r}) - D_0 || $
we can compare their features-space representations:
$\Delta_2 = || \Psi(\mathbf{r}) - \Phi(D_0) || $
which is more efficient (maybe?).

For our example, we can use $(c,d)$ to generate a set of features $F$, which are edgels each having a position and slope.  In symbols:
$(c,d) \stackrel{\Psi}{\mapsto} F = \bigcup_{i=1}^N \{ (x_i, y_i, slope_i) \} $
where the x, y and slopes are simple functions of the parameters $\mathbf{r} = (c,d)$.  The error could be defined as some kind of average of differences between individual edgel-pairs.
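A sketch of what $\Psi$ and the feature-space error $\Delta_2$ might look like for the line example (the number of edgels and the error measure are arbitrary illustrative choices):

```python
import numpy as np

def psi(c, d, n_edgels=20):
    """Map parameters (c, d) to edgels (x, y, slope) along the line c*x + d*y = 1."""
    xs = np.linspace(0, 1, n_edgels)
    ys = (1 - c * xs) / d              # assumes d != 0 for simplicity
    slope = -c / d                     # constant along a straight line
    return np.stack([xs, ys, np.full_like(xs, slope)], axis=1)

def delta2(F1, F2):
    """Average difference between corresponding edgel triples."""
    return np.mean(np.linalg.norm(F1 - F2, axis=1))

F_input = psi(1.3, 2.1)                # features of the input (here: simulated)
print(delta2(psi(1.2, 2.0), F_input))
print(delta2(psi(3.0, 0.5), F_input))  # a worse hypothesis gives a larger error
```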

Notice that $\Psi^{-1}$ may still be difficult to compute.

See, the biggest inefficiency is actually not $\Psi$ itself, but the size of the parameter space $(c,d)$, which forces us to evaluate $\Psi$ many times in a generate-and-test loop.

We can also see that feature extraction does not directly attack the bottleneck.

It would be very nice if we can directly calculate $\Psi^{-1}$.  Then we can simply go from the input data $D_0$ via $\Phi$ and $\Psi^{-1}$ to get our answer, $(c,d)$.  That does not require any searching!  How can that be done?


Two possible strategies

In fact we have 2 possible strategies:
  • Along every step of the generative process $f$ or $\Psi$, use only invertible functions.
  • Use only differentiable functions in every step, so that we can set $\frac{\partial \Delta}{\partial \mathbf{r}} = 0$ (and so on) to find the minimum of the error.
In both strategies, we are writing a functional program using a restricted class of functions.  Then the computer can automatically find the inverse or derivatives for us, either numerically or symbolically.

In the first strategy, notice that some commonly-used functions do not have inverses.  In particular, idempotent functions (i.e., $f \circ f = f$) are not invertible; this includes min, max, projection, and set union.  Note, however, that these functions may become invertible if we allow interval-valued or set-valued functions.  (Indeed, in vision we generally accept that one cannot see behind opaque objects, a consequence of the idempotency of occlusion.)

In the second strategy, we may need to create "soft" versions of non-differentiable functions such as min, max, and projection.
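For instance, a "soft" minimum can be built from log-sum-exp (one common choice; the temperature $\tau$ is a smoothing parameter), which is differentiable everywhere and approaches the hard min as $\tau \to 0$:

```python
import numpy as np

def softmin(xs, tau=0.1):
    """Smooth, differentiable approximation of min(xs); hard min as tau -> 0."""
    xs = np.asarray(xs, dtype=float)
    return -tau * np.logaddexp.reduce(-xs / tau)

xs = [3.0, 1.0, 2.5]
print(min(xs), softmin(xs, tau=0.5), softmin(xs, tau=0.01))
```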

These are still rather vague, but I think are promising directions...

What about AI?

A final word is that we are just recognizing images as models composed of geometric primitives (such as cylinders).  That is still different from natural labels such as "hands", "trees", "cars", or even descriptions such as "nude descending a staircase".

For the latter purpose, we need a further map, say $\Theta$, from natural-language descriptions to geometric models:


I have an impression that this map $\Theta$ is best represented using logical form.  That is because many natural concepts, such as "chairs", do not have representations as connected regions in geometric space.  Rather, they are defined via complex and intricate, "intensional" relationships.  In other words, logical or relational.  The formalism I'm developing is somewhat like logic, somewhat like algebra, and a bit like neurons.  One can think of it as a universal language for expressing patterns in general.