Latex Maths

Tuesday, June 3, 2014

Cayley graphs of free groups and monoids

This is a Cayley graph:



Its nodes represent elements of the free group generated by {a, b}.  It goes out in 4 directions from the origin:  $\{a, b, a^{-1}, b^{-1}\}$.

This is an animated GIF of the same graph:

But it only repeats recursively in 3 directions, because going backwards, ie $a \cdot a^{-1} = 1$, is cancelled out.

The graph itself is of fractal dimension, but it is embedded in the 2D plane;  the plane has dimension 2 because the group is generated by 2 elements.

If we ignore the non-commutativity of the group, in other words if we let $a \cdot b = b \cdot a$, then we don't have to shrink the edges, and the Cayley graph becomes a regular grid like this:


This can be regarded simply as the direct product of the two (infinite cyclic) groups generated by a and b.  And it clearly shows why 2 dimensions are needed.

But interestingly, we can also plot the Cayley graph of a free group with 3 generators on the 2D plane.

My first attempt was to divide the full circle of $360^\circ$ into 6 directions, with 3 of them representing {a, b, c} and the other 3 their inverses.  But unfortunately there is a "clash" between some elements:


So the solution is to shrink the edges not by 1/2 but by 1/3.  This is the result:


This trick works for any number of generators, by dividing the full circle into 2n directions.  For example, this is the case n=4:


Case n=10:



Free monoids

What if we ignore the inverse elements, so that we get free monoids?  We can have 4 generators {a, b, c, d}, and the graph is as follows:


It is clear that the nodes are "space-filling", as every point in the square has coordinates expressible as binary fractions.

Thus every point uniquely identifies a word in the free monoid generated by 4 elements.  Beware that some edges are overlapping in this graph.
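To make the word-to-point correspondence concrete, here is a small Python sketch (my own illustration, not the original JavaScript code; it assumes the same drawing rule as above: the four generators step in the four compass directions and the step length halves at each level):

    DIRS = {'a': (1, 0), 'b': (0, 1), 'c': (-1, 0), 'd': (0, -1)}

    def monoid_point(word):
        """Map a word over {a,b,c,d} to its node in the plane."""
        x = y = 0.0
        step = 0.5
        for letter in word:
            dx, dy = DIRS[letter]
            x += step * dx
            y += step * dy
            step /= 2        # edges shrink by half at each level
        return (x, y)

    print(monoid_point('ab'))    # (0.5, 0.25)

Reading off the accumulated x and y coordinates of any node gives exactly the binary fractions mentioned above.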

The case of the n=3 monoid is particularly interesting, as it turns out to be the Sierpinski triangle!


When n gets big, the monoid graphs look just like the group graphs for n/2.  For example, this is the monoid with n=20:


And this is monoid, n=100:



Code

The code is very short, written in HTML5 Canvas and Javascript.  You can download the .zip here.
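For readers who just want the gist of the recursion, here is a rough Python re-sketch of the drawing loop (the original is in JavaScript/Canvas; the 2n equally spaced directions, the skipped backward direction and the shrink factor follow the description above, everything else is my own guess):

    import math

    def cayley_edges(n, depth, x=0.0, y=0.0, length=1.0, back=None, shrink=1/3):
        """Collect the edges of the free-group Cayley graph on n generators,
        drawn with 2n equally spaced directions; direction k and direction
        k+n are mutual inverses, so we never step straight back."""
        if depth == 0:
            return []
        edges = []
        for k in range(2 * n):
            if back is not None and k == (back + n) % (2 * n):
                continue                      # a * a^{-1} = 1 is cancelled out
            angle = 2 * math.pi * k / (2 * n)
            nx, ny = x + length * math.cos(angle), y + length * math.sin(angle)
            edges.append(((x, y), (nx, ny)))
            edges += cayley_edges(n, depth - 1, nx, ny, length * shrink, k, shrink)
        return edges

    segments = cayley_edges(n=2, depth=6)     # feed these to any 2D plotting library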

Why the interest?

I am interested in this because the free groups / monoids can represent logical formulas (when the logic is in algebraic form).  If we add a "gray scale" or "colors" to the nodes, that gives them extra dimensions that can be regarded as truth values.  Thus, an entire knowledge base can be represented by such a continuously-colored graph.

In other words, we have a geometric representation of knowledge elements that can be multiplied and added together.  Multiplication is the monoid action, whereas addition is the scalar addition of truth (color) values.

I will write about this last idea in more detail.... later :)

Friday, May 23, 2014

P = NP, my first stab at the problem

[ This idea is inspired by Karmarkar's polynomial-time interior point method for linear programming.  But it turned out to be a dead end, at least from my current perspective.  Anyway, I leave the record here for posterity, and it might illuminate others.  ]

In mathematics, there is a long tradition of using "smooth" things to calculate "discrete, combinatorial" things. A famous early example is Euler's use of Newton/Leibniz's integral to approximate series of the form $ \sum^n_{i=1} f(i) $, now known as Euler-Maclaurin summation [cf the book "Mathematical Masterpieces", Knoebel et al 2007]. A more recent example is Alexander Grothendieck's invention of schemes, which built a bridge between (smooth) algebraic geometry and (discrete) Diophantine geometry.

So my hunch is to "move the SAT problem to a smooth setting and then apply smooth techniques". In fact, that is exactly what happened for the linear programming problem -- the simplex algorithm is discrete in the sense that it traverses the vertex points on the boundary of the feasible solution space, and such a traversal tends to be exponential because the number of vertices grows exponentially with the number of equations. The "interior point" methods, by contrast, avoid the boundary and its vertices and instead manipulate "ellipsoids" that bound the feasible space.

Using an analogy, imagine a hotel with an exponential number of rooms, where some of the rooms contain the prize, but we have only a polynomial amount of time to check the rooms. On the surface it seems impossible; that is why P=NP is hard. But it is only the size of the ambient space that is exponential; the targets are finitely generated from a description of length N (= the input size); so there is hope.

Now add to the mix the inspiration from "interior point" methods: our algorithm should avoid testing the hotel rooms -- because as soon as we test the first room we may get trapped into testing an exponential number of rooms, just as in linear programming we have to avoid "touching" the boundary. Perhaps we should iterate through the input constraints, updating a certain "data structure" (which could be any mathematical structure in a broad sense), and such a structure would asymptotically approach the feasible solution(s).

The final shape of this data structure cannot contain the entire solution set, since that set has an exponential number of vertices, and there isn't enough time to update that many vertices before the answer should be out. It cannot even contain a subset of those vertices, because then some of the vertices would have to be eliminated along the way, making our function discontinuous. (Topologically, continuous means that the pre-image of every open set is open.)

Our function cannot simply output the answer as 0 or 1 either, because that answer may jump between 0 and 1 -- again discontinuous.

Notice that SAT is equivalent to deciding whether a set of polynomials has solutions in the reals (the "existential theory of the reals"). Currently the best algorithm for this is singly exponential [cf: "Algorithms in real algebraic geometry", Basu et al 2006, ch.13].

But our function cannot output the volume of the set of real solutions, as the volume depends on the entirety of the boundary, which we want to avoid. The Oleinik-Petrovsky-Thom-Milnor theorem [cf Basu et al 2006, p.4] sets a bound on the sum of the Betti numbers of an algebraic set, which is polynomial in the degree and exponential in the number of variables. The zeroth Betti number is the number of connected components of the set. This seems to confirm my intuition that the boundary is of "exponential" complexity and thus should be avoided.

After eliminating these options, there may still be a viable path, which is to find the "minimal" real solution in the algebraic set.  That means we take only the real part of $z$ and require that its Euclidean norm, $||\mathrm{Re}\, z||_2$, be smallest.  A slight glitch is that the real part of a complex number is given by $\mathrm{Re}\, z = (z + z^*)/2$, where $z^*$ denotes complex conjugation, which cannot be expressed as a polynomial formula.

But I realized that a fatal problem of this approach is that we still have not avoided testing an exponential number of points.  This can be illustrated by this picture:



The regions are defined by a set of $k$ polynomials of maximum degree $m$, in $n$ variables.  We need to find the common intersections (if any exist) of all $k$ polynomials.  Figuratively speaking (as this diagram is not realistic), the potential number of intersections is related to the number of "peaks", ie the degree of the polynomials.

If we must test each "peak" against each other, for all $k$ polynomials, the maximal number of tests would be $m^k = O(m^n)$, ie, exponential in the number of variables.

To put it succinctly, the problem is:  When we test a peak at a certain location, our effort spent at that location does not contribute to testing the peaks at other locations, and we have an exponential number of peak-to-peak combinations to be tested.

This is the general feeling one has when tackling an NP-hard problem.

So, the critical insight that may ultimately solve NP is that we should not waste time testing local cases, but we should exploit the fact that the localities are somehow correlated to each other.

It is well known that 2-SAT is in P while 3-SAT is NP-complete.  In other words, NP-hardness appears when the degree of the polynomials reaches 3.  Any problem of degree higher than 3 can be reduced to degree-3 instances, so we only need to consider cubic surfaces.  That would be my next goal.... :)

Tuesday, February 25, 2014

P = NP can be proven?

I have a hunch that P=NP can actually be proven.

It is known that in this chain of inclusions:
$$ P \subseteq NP \subseteq PSPACE \subseteq EXPTIME $$ at least one inclusion is proper (ie, equality does not hold), by the "Time Hierarchy theorem". Other than that, none of the individual inclusions has been proven to be proper so far. Note that this theorem is proved via diagonalization, ie, an argument in the style of Cantor's theorem that the cardinality of a power set is strictly bigger than the cardinality of the set itself. In other words, the proof is ultimately based on cardinality. But (with my limited knowledge, and saying this vaguely) I do not know of similar cardinality arguments that are "smaller" than Cantor's. Does that mean there is only one proper inclusion in the chain?

Wednesday, May 15, 2013

What is Montague grammar?

Montague grammar

Richard Montague developed his theory of grammar from the 1950s to the 1970s.  What is special about Montague grammar is that, whereas Chomsky's transformational grammar provides a formal account of the syntax of natural language, Montague grammar has both a formal syntax and a formal semantics.

Basically, Montague uses categorial grammar to describe the syntax of a natural language.  Then he establishes formal translation rules that map natural-language sentences to logical formulas in intensional logic.  Such formulas have well-defined semantics based on possible worlds.

Let's turn to the semantics first, since it is the most interesting and relevant to AGI...

Intensional logic (內涵邏輯)

Intensional logic originated as an attempt to resolve paradoxes that arise from confusing references with senses (指涉 和 涵義), or extensions with intensions (外延 和 內涵).  These ideas have been studied by Frege (1892), Lewis (1914), Tarski (1933), Carnap (1947), Kripke (1963), and Montague (1970).

To say that:
       The Empire State Building is the tallest building in New York
is not the same as saying:
       The tallest building in New York is the tallest building in New York
even though we have substituted equals for equals.  How come?

The problem is that "the tallest building in New York" is a symbol that has a reference (or extension) which is the object it refers to, and a sense (or intension) which is more subtle and depends on the context under which it is interpreted.

In different possible worlds (such as different time periods), "the tallest building in New York" can refer to different objects.  While the extension of a symbol can be taken as the set of individuals it refers to, the intension of the symbol cannot be such a set, but has to be the function that maps from possible worlds to individuals in the worlds.  As shown in the diagram:






In other words, the function associates to each possible world an individual in that world which the symbol refers to.  That is:
$ f: I \rightarrow U $
$ f: i \mapsto u $
where $I$ is the set of possible worlds (or interpretations), and $U$ is the set of individuals that figure in any of the worlds.
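In programming terms the idea is almost trivial; here is a toy Python sketch (the worlds and referents below are made up purely for illustration):

    # Hypothetical intension: which building the phrase picks out in each world.
    tallest_in_ny = {
        1930: "Chrysler Building",
        1931: "Empire State Building",
        2015: "One World Trade Center",
    }

    def extension(intension, world):
        """The extension at a world is just the intension applied to that world."""
        return intension[world]

    print(extension(tallest_in_ny, 1931))    # Empire State Building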

The intensional logic (IL) described by Montague is a meta-language based on $\lambda$-calculus that allows one to define various modal operators, so that it can subsume modal logic, temporal logic, deontic logic, epistemic logic, etc.

Compound nouns

I discovered that intension functions may also help us define the semantics of compound nouns, such as:


  • cyber sex: sex that happens on the internet
  • glass slippers: slippers made of glass
  • couch potato: a person who sits on the couch all the time

Some compound nouns are easy to interpret as they are simply the conjunction of their constituent atomic concepts.  For example, a "girl scout" is both a "girl" and a "scout".  However, a "girl school" is not a "girl", but a school for girls.  The meaning of such atypical compound nouns seems to require abductive interpretation (溯因解釋).  Abduction means searching for explanations of what is known, the reverse of deduction, which moves from what is known to new conclusions.

Now with the insight from intensional logic, perhaps we can interpret compound nouns like this:


So each modifier is a function that maps from possible worlds to individuals.

The advantage of such a treatment is that concept composition in general can be uniformly represented by intensional logic.

What is a function?

When we say "there exists a function...", it can be deceptive as the function can be any arbitrary mapping -- for example, it can include a bizarre map that maps an individual to Confucius in world$_1$, to the number 7 in world$_2$, to the Empire State Building in world$_3$, etc.  What we need is to operationalize the concept of functions.

In the logic-based paradigm, a function is implemented as a set of logic formulas that determine the elements of a predicate.  For example, the Prolog program:
    ancestor(X,Y) :- father(X,Y).
    ancestor(X,Y) :- father(X,Z), ancestor(Z,Y).
    % etc... (the father/2 facts go here)
determines the elements of the "ancestor" relation.

So we have found an operational implementation for functions, and as logic programs are Turing universal, they are also the broadest class of functions we can define.  One question is whether we can simplify the logical form, while keeping its Turing universality.

Possible world semantics

Possible world semantics were developed by Carnap and Kripke, among others, but Leibniz was the first "logician" to use the term "possible world".

The relevance to intensional logic is that each intensional individual is a function from the set of possible worlds (W) to the set of possible individuals (D):
    $ f : W \rightarrow D $.
In other words, the set of all intensional individuals is $D^W$.

Possible world semantics is a way to model modal logics (模態邏輯).  Modal logic is distinguished by the use of the modal operators $\Box$ "necessarily" and $\diamond$ "possibly".  To model a logic means to establish a correspondence between propositions (命題) in the logic and an underlying algebraic structure called interpretations, such that the truth of propositions in the logic can be determined (or evaluated) by looking into that algebraic structure.  In the case of modal logic, its model can be given by Kripke structures, or possible worlds.

What is a Kripke structure?  It is a tuple $\left< W, R, V \right>$ consisting of a set of possible worlds $W$, a binary accessibility relation $R$ between worlds, and a valuation map $V$ assigning true or false to each proposition $p$ at each possible world $w$.

Modal truths are defined by this interpretation rule:
     $ w \models \Box \phi \; \Leftrightarrow \; \forall w' \in W. \; (w R w' \Rightarrow w' \models \phi) $
which means that $\Box \phi$ ("necessarily $\phi$") holds at a world $w$ if $\phi$ is true in every possible world accessible from $w$;  when every world is accessible, this reduces to "$\phi$ is true in all possible worlds in $W$".  There is also a dual rule for $\diamond$.

This is an example Kripke structure:


Here the $w$'s are possible worlds, with $w_0$ being a distinguished "current state" or "real world", the arrows are the accessibility ("reachable") relations, and the $a, b$'s are propositions true in those worlds.  Such a structure may model a computational process, where we want to ensure, say, that "state 0 will never be reached after N steps".  Such a statement can be translated into modal logic and verified by a model checker.  This debugging technique is part of what enables the design of highly complex modern microprocessors.
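As a minimal sketch of how a model checker evaluates such formulas (the Kripke structure below is hypothetical, not the one in the figure):

    # A toy Kripke structure <W, R, V>.
    W = {'w0', 'w1', 'w2'}
    R = {('w0', 'w1'), ('w0', 'w2'), ('w1', 'w1')}      # accessibility relation
    V = {'w0': {'a'}, 'w1': {'a', 'b'}, 'w2': {'a'}}    # propositions true at each world

    def box(p, w):
        """w |= []p  iff  p holds in every world accessible from w."""
        return all(p in V[v] for (u, v) in R if u == w)

    def diamond(p, w):
        """w |= <>p  iff  p holds in some world accessible from w."""
        return any(p in V[v] for (u, v) in R if u == w)

    print(box('a', 'w0'), diamond('b', 'w0'))    # True True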


Categorial grammar


{ For completeness' sake ... }

References:


[1] Portner 2005.  What is meaning? - fundamentals of formal semantics.  Blackwell Publishing.

[2] James McCawley 1993.  Everything that linguists have always wanted to know about logic *but were ashamed to ask, 2nd ed.  Chicago.

[3] 朱水林 1992 (ed). 邏輯語義學研究 (Studies in logical semantics)。上海教育出版社

[4] 鄒崇理 2000.  自然語言研究 (Natural language linguistics)。北京大學出版社

[5] 陳道德 2007 (ed).  二十世紀意義理論的發展與語言邏輯的興起 (20th century semantic theory and the rise of logical linguistics)。 北京社會科學出版社

[6] 方立 2000.  邏輯語義學 (Logical semantics: an introduction)。  北京語言文化大學出版社

[7] J v Benthem 2010.  Modal logic for open minds.  CSLI publications

[8] Gopalakrishnan 2006.  Computation engineering - applied automata theory and logic.  Springer.

Thursday, January 24, 2013

Mathematics behind Google's page rank algorithm

Thanks to professor Raymond Chan of CUHK mathematics, who gave an undergrad talk on the mathematics of Google, which I attended by chance.  In this post I'll try to re-present what he said, with some of my own variations.  (Mistakes, if any, are mine.)

I also combined materials from chapter 2 of the book "A First Course in Applied Mathematics" by Jorge Rebaza (2012):



How much is the PageRank algorithm worth?


In 2012:
    Google annual revenue = US$ 50.2 billion
    Google total assets = US$ 93.8 billion
    Google market cap = US$ 248.2 billion
    Hong Kong's GDP = US$ 243.7 billion

So the entire Hong Kong population working for a year is roughly worth the outstanding stock of Google.  Perhaps learning the PageRank algorithm can give us an estimate of how useful mathematics can be :)

The algorithm

As of 2012, Google searches some 8.9 billion web pages.  First, it builds a directed graph ($\Gamma$) of how the web pages link to each other.

Example 1:

$$ A = \left[ \begin{array}{c c c c c} 0 & 0 & 0 & 1 & 0 \\  0 & 0 & 0 & 0 & 1 \\  0 & 1/2 & 0 & 0 & 1/2 \\ 1 & 0 & 0 & 0 & 0 \\  0 & 1 & 0 & 0 & 0  \end{array}  \right] $$

Example 2:

$$ A = \left[ \begin{array}{c c c c c c} 0 & 1 & 0 & 0 & 0 & 0 \\  1/2 & 0 & 1/2 & 0 & 0 & 0 \\  1/2 & 1/2 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1/3 & 0 & 1/3 & 1/3 \\  0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1/2 & 1/2 & 0 \end{array}  \right] $$

In the graph:
    nodes = web pages
    links = out-going links from page to page

The matrices are the adjacency matrices of the graphs.  Each row represents the out-going links from the page $P_i$.  If a page has $n$ out-going links, each link is weighted by $1/n$.

Notice that each row of the matrices sums to 1 (except for a "dangling" page with no out-going links, such as page 5 in Example 2, whose row is all zeros).  Such matrices are called stochastic matrices.

Then the rank of a page $P$ is defined by adding up the ranks of the pages with out-links pointing to it:
$$  r(P) = \sum_{i} \frac{r(P_i)}{|P_i|} $$
where the sum runs over the pages $P_i$ that link to $P$, $r(P_i)$ is the rank of page $P_i$, and $|P_i|$ denotes the total number of out-going links from page $P_i$.  The rationale is that it is a good thing to be pointed to, so the "goodness" of page $P_i$ is passed to page $P$;  but it has to be divided by $|P_i|$ because a page that points outwards "promiscuously" should have its influence diluted.

But this ranking has to be calculated iteratively starting from an initial ranking.  The iterative equation has the form:
$$ r_k(P) = \sum_i \frac{r_{k-1}(P_i)}{|P_i|}, \quad \quad k = 1, 2, ... $$

This means that the "goodness" (ranking) is passed around the graph iteratively until their values at every node converge to stable values.

In vector notation:  let $v_k = [ r_k(P_1) \; r_k(P_2) \; ... \; r_k(P_N) ]^T $.  Then the iterative equation becomes:
$$ v^T_k = v^T_{k-1} H $$
where
$$ H_{ij} = \left\{ \begin{array}{ll} \frac{1}{|P_i|} & \mbox{if } P_i \mbox{ has a link to } P_j \\ 0 & \mbox{otherwise} \end{array} \right. $$

Notice that the matrix $H$ is constant throughout the iteration.  That means we are repeatedly multiplying by $H$:
$$ v^T = \lim_{k \rightarrow \infty} v^T_k = \lim_{k \rightarrow \infty} v^T_0 H^k $$
where $v$ is the PageRank vector.

The above is reminiscent of the way we repeatedly press the "cosine" button on a calculator, and see the results converge.  That value is the solution to the equation:
$$ \cos x = x. $$
In general, when we want to solve an equation of the form $ f(x) = x $, we can repeatedly apply $f()$ starting from an initial value $x_0$.  The point for which $f(x^*) = x^*$ is called the fixed point, a very useful concept in mathematics.
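The calculator experiment takes only a few lines to reproduce (a quick Python sketch):

    import math

    x = 1.0
    for _ in range(100):      # repeatedly press the "cosine" button
        x = math.cos(x)
    print(x)                  # ~0.739085, the fixed point of cos x = x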

The PageRank iteration above is an instance of the power method, which converges to the eigenvector associated with the eigenvalue of largest magnitude of the matrix $H$.  This is called the dominant eigenvector, which I will now explain...

Power method

This section explains why the power method converges.  Recall the definition of eigenvector and eigenvalue:
$$ A x = \lambda x $$
where $x$ = eigenvector, $\lambda$ = eigenvalue.

Assume that the matrix A satisfies these conditions:

  • $A$ has a complete set of linearly independent eigenvectors $v_1, ... v_n$
  • the corresponding eigenvalues can be ordered:
                  $ |\lambda_1| > |\lambda_2| \ge ... \ge |\lambda_n| $
    (notice the strict > in the first term)
For iteration, choose an arbitrary initial vector $x_0$.  Due to the eigenvectors forming a basis, we can express $x_0$ as:
$$ x_0 = c_1 v_1 + ... + c_n v_n $$

Now we multiply the above with matrix A and get:
$$ \begin{array}{rl} A x_0 & = c_1 \lambda_1 v_1 + c_2 \lambda_2 v_2 + ... + c_n \lambda_n v_n \\ A^2 x_0 & = c_1 \lambda_1^2 v_1 + c_2 \lambda_2^2 v_2 + ... + c_n \lambda_n^2 v_n \\ & \vdots \\ A^k x_0 & = c_1 \lambda_1^k v_1 + c_2 \lambda_2^k v_2 + ... + c_n \lambda_n^k v_n \end{array} $$

And because $ |\lambda_1| > |\lambda_2| \ge ... \ge |\lambda_n| $, as $k \rightarrow \infty$, the weight of the first term outgrows those of the rest, so the vector $A^k x_0$ moves towards the dominant eigenvector direction.

The rate of convergence is determined by the ratio $\frac{ |\lambda_2| } { |\lambda_1| }$.
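Here is a compact numpy sketch of the power iteration (the 3-page matrix is made up; renormalizing at each step just keeps the numbers tidy):

    import numpy as np

    H = np.array([[0,   1/2, 1/2],
                  [1/3, 1/3, 1/3],
                  [1,   0,   0  ]])    # a small row-stochastic example

    v = np.ones(3) / 3                 # initial ranking
    for _ in range(100):
        v = v @ H                      # v_k^T = v_{k-1}^T H
        v /= v.sum()
    print(v)                           # approximates the dominant (left) eigenvector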


For the power method to converge, we need to ensure that the matrix is stochastic and irreducible....

Irreducible matrices

This section explains the condition needed for the power method to converge;  but the measure used to force convergence also gave rise to Google's personalized search, as we shall see.

A matrix is called reducible if there exists a permutation matrix $P$ such that:
$$ P^T A P = \left[ \begin{array}{cc} C& D \\ \mathbf{0} & E \end{array} \right] $$
where $\mathbf{0}$ is a $(n-r) \times r$ block matrix with 0's.

Observe that the above definition expresses the similarity between $A$ and a block upper triangular matrix.  (Note:  2 matrices $A$ and $B$ are similar if there is a $P$ such that $B = P A P^{-1}$, where P is any invertible matrix.  This means that the effect of applying $A$ is the same as applying $B$ between a change of coordinates and back.  In the case of permutation matrices, $P^{-1}$ coincides with $P^T$.)

The significance of this definition is that:
matrix A is irreducible $\Longleftrightarrow \Gamma(A) $ is strongly connected

A graph is strongly connected if every node is reachable from any other node.

Example:
$$ A = \left[ \begin{array}{c c c c} 0 & 1/3 & 1/3 & 1/3 \\ 1/2 & 0 & 1/2 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{array} \right] $$


$$ B = \left[ \begin{array}{c c c c} 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1/3 & 1/3 & 0 & 1/3 \\ 1/2 & 0 & 1/2 & 0  \end{array} \right] $$

$$ B = P^T A P $$ 

$$ P = \left[ \begin{array}{c c c c} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0  \end{array} \right] $$

The effect of multiplying by $P$ (a permutation matrix) is to permute the rows or columns, hence the name.
Notice that in matrix $A$ there is no way to go from nodes 3 & 4 to nodes 1 & 2.  Thus matrix $A$ is reducible.

To ensure irreducibility (in order for the power method to work), Google forces every page to be reachable from every other page.

Google's solution is to add a perturbation matrix to $H$:
$$ G = \alpha H + (1 - \alpha) E  $$
where $G$ is the Google matrix, $\alpha \in (0,1)$, and $E = e \; e^T / n$ where $e$ is a vector of 1's.  This makes every page reachable from every other page, but later Google modified this to:
$$ E = e \; u^T $$
where $u$ is a probabilistic vector called the personalization vector, which plays a great role in Google's success.  More details are explained in Rebaza's book.
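Putting the pieces together, here is a hedged numpy sketch that forms a Google matrix for Example 1 above and iterates it ($\alpha$ = 0.85 is the commonly quoted damping value, not something taken from Rebaza's book):

    import numpy as np

    H = np.array([[0, 0,   0, 1, 0  ],
                  [0, 0,   0, 0, 1  ],
                  [0, 1/2, 0, 0, 1/2],
                  [1, 0,   0, 0, 0  ],
                  [0, 1,   0, 0, 0  ]])      # Example 1

    n = H.shape[0]
    alpha = 0.85
    E = np.ones((n, n)) / n                  # E = e e^T / n
    G = alpha * H + (1 - alpha) * E          # the Google matrix

    v = np.ones(n) / n
    for _ in range(100):
        v = v @ G                            # power iteration on G
    print(v)                                 # the PageRank vector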

Hope this is helpful to you :)

Tuesday, January 15, 2013

On visual recognition

I will try to consider visual recognition from a high level, though my approach may not be the only possible one.



We're given the image projected on the retina.  We seek the model that generated the image.  A mathematically flavored way is to think of the model as finitely generated from a set of primitives.  (The set of primitives could be infinite too, but let's start simple.)

For example, the primitives could be a class of geometric shapes such as cylinders, rectangles, polygons, etc.  These can be given concisely by their positions and size / shape parameters.

Bayesian learning


We are interested in finding the maximum of  $P(model | data)$, ie, the most probable model given the data.  The models are specified by some parameters $\mathbf{r}$, written model$(\mathbf{r})$.

Bayes' law says that:
$$P(model | data) = \frac{P(data | model) P(model)}{P(data)}.$$
What is special about this formula is that the right hand side contains $P(data | model)$ which is usually much easier to compute than the left hand side.

In our vision example, $P(data | model)$ means that we are given the model, so we can use the known function $f$ (see the figure above) to compute the data.  By contrast, $P(model | data)$ requires us to use $f^{-1}$ to recover the model, which is much harder.

But I have not spelled out the details of the computation, particularly at the pixel level...


A highly abstract and simple example


Now let's try to understand the whole process from an abstract perspective (without reference to Bayesian learning).

To reduce matters to the simplest case, let's consider recognizing straight lines on a $100 \times 100$ image (with noise).  Assume the input image always contains only 1 line, that could be slightly non-straight.

Every straight line can be represented by 2 parameters $(c,d)$ via its equation:
$ cx + dy = 1. $

So a naive, but impractical, algorithm is simply to:
  • visit all points in the parameter space $(c,d)$
  • draw each line as image data, $D$
  • compare $D$ with our input image $D_0$
  • return the line that minimizes the difference $\Delta(D, D_0)$
But of course, this is unfeasible because the search space $(c,d)$ is too large (indeed infinite).
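Just to see the brute force in action on a coarse grid (a sketch; the rasterization, the grid range and the error measure are all my own assumptions):

    import numpy as np

    def render(c, d, size=100, thickness=1.5):
        """Rasterize the line c*x + d*y = 1 on a size x size image."""
        ys, xs = np.mgrid[0:size, 0:size] / size
        dist = np.abs(c * xs + d * ys - 1) / np.hypot(c, d)
        return (dist < thickness / size).astype(float)

    def naive_fit(D0, grid=np.linspace(-5, 5, 81)):
        """Generate-and-test over a coarse (c, d) grid."""
        best, best_err = None, np.inf
        for c in grid:
            for d in grid:
                if c == 0 and d == 0:
                    continue
                err = np.abs(render(c, d) - D0).sum()    # Delta(D, D0)
                if err < best_err:
                    best, best_err = (c, d), err
        return best

    D0 = render(2.0, 3.0)       # a synthetic input image
    print(naive_fit(D0))        # recovers roughly (2, 3)

Even this toy version evaluates render() several thousand times, which is precisely the cost the rest of the post tries to avoid.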

Feature extraction

Vision researchers usually use a pre-processing stage known as feature extraction.  I'm not saying it is necessary, but let's see what happens.  The input image can be broken into a smaller number of edgels.  Their number is vastly smaller than the number of pixels.  Say, each edgel is 5 pixels long.  So a line across a $100 \times 100$ image would be on the order of ~20 edgels long.

The situation can be summarized by this commutative diagram:


We know how to do $f$, which we assume is easy.  And assume we know how to do $\Phi$.  So we know how to do $\Psi = \Phi \circ f$.

Now instead of computing the error with the raw image:
$\Delta_1 = || f(\mathbf{r}) - D_0 || $
we can compare their features-space representations:
$\Delta_2 = || \Psi(\mathbf{r}) - \Phi(D_0) || $
which is more efficient (maybe?).

For our example, we can use $(c,d)$ to generate a set of features $F$, which are edgels each having a position and slope.  In symbols:
$(c,d) \stackrel{\Psi}{\mapsto} F = \bigcup_{i=1}^N \{ (x_i, y_i, slope_i) \} $
where the x, y and slopes are simple functions of the parameters $\mathbf{r} = (c,d)$.  The error could be defined as some kind of average of differences between individual edgel-pairs.

Notice that $\Psi^{-1}$ may still be difficult to compute.

See, the biggest inefficiency is actually not $\Psi$ itself, but the size of the parameter space $(c,d)$, which forces us to evaluate $\Psi$ many times in a generate-and-test loop.

We can also see that feature extraction does not directly attack the bottleneck.

It would be very nice if we can directly calculate $\Psi^{-1}$.  Then we can simply go from the input data $D_0$ via $\Phi$ and $\Psi^{-1}$ to get our answer, $(c,d)$.  That does not require any searching!  How can that be done?


Two possible strategies

In fact we have 2 possible strategies:
  • Along every step of the generative process $f$ or $\Psi$, use only invertible functions.
  • Use only differentiable functions in every step, so then we can set $\frac{\partial \Delta}{\partial r} = 0$ and so on to find the minimum of the error.
In both strategies, we are writing a functional program using a restricted class of functions.  Then the computer can automatically find the inverse or derivatives for us, either numerically or symbolically.

In the first strategy, notice that some commonly-used functions do not have inverses.  In particular, idempotent functions (ie, $f \circ f = f$), such as min, max, projection, and set union, are not invertible.  Note however, that these functions may become invertible if we allow interval-valued or set-valued functions.  (Indeed, in vision we generally accept that one cannot see behind opaque objects, a consequence of the idempotency of occlusion.)

In the second strategy, we may need to create "soft" versions of non-differentiable functions such as min, max, and projection.

These are still rather vague, but I think are promising directions...

What about AI?

A final word is that we are just recognizing images as models composed of geometric primitives (such as cylinders).  That is still different from natural labels such as "hands", "trees", "cars", or even descriptions such as "nude descending a staircase".

For the latter purpose, we need a further map, say $\Theta$, from natural-language descriptions to geometric models:


I have an impression that this map $\Theta$ is best represented using logical form.  That is because many natural concepts, such as "chairs", do not have representations as connected regions in geometric space.  Rather, they are defined via complex and intricate, "intensional" relationships.  In other words, logical or relational.  The formalism I'm developing is somewhat like logic, somewhat like algebra, and a bit like neurons.  One can think of it as a universal language for expressing patterns in general.



Friday, December 7, 2012

Matrix trick and the ontology ball

Genifer's logic has the structure of a semi-ring obeying:


$\mbox{associative addition:} \quad (A + B) + C = A + (B + C) $
$\mbox{associative multiplication:} \quad (A \times B) \times C = A \times (B \times C) $
$\mbox{non-commutative multiplication:} \quad A \times B \neq B \times A $
$\mbox{distribution on the right}\quad (A + B) \times C = A \times C + B \times C $
$\mbox{commutative addition:} \quad A + B = B + A $
$\mbox{idempotent addition:} \quad A + A = A $

In short, multiplication is similar to "concatenation of letters" and addition is like "set union".  In programming (data-structure) terms, the algebra is like a set of strings composed of letters, and the letters are what I call concepts.
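A minimal Python model of this algebra, with an element represented as a set of words (tuples of concepts); note that this toy model happens to distribute on both sides, while the axioms above only demand distribution on the right:

    def add(A, B):
        """Addition = set union (commutative, idempotent)."""
        return A | B

    def mul(A, B):
        """Multiplication = concatenation of words (associative, non-commutative)."""
        return {a + b for a in A for b in B}

    john, loves, mary = {('john',)}, {('loves',)}, {('mary',)}

    print(mul(mul(john, loves), mary))          # {('john', 'loves', 'mary')}
    print(mul(john, mary) == mul(mary, john))   # False: non-commutative
    print(add(john, john) == john)              # True: idempotent addition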

Matrix trick

The matrix trick is simply to represent the elements of the algebra by matrices.  (This may be related to representation theory in abstract algebra).

What is a matrix?  A matrix is a linear transformation.  For example, this is an original image:

This is the image after a matrix multiplication:

In short, a matrix multiplication is equivalent to some combination of rotation, reflection, scaling, and/or shear.  You can play with this web app to understand it intuitively.

So my trick is to represent each concept, say "love", by a matrix.  Then a sentence such as "John loves Mary" would be represented by the matrix product "john $\otimes$ loves $\otimes$ mary".


Each arrow represents a matrix multiplication.  The transformations need not occur on the same plane.

By embedding the matrices in a vector space (simply writing the matrix entries as a flat vector), I can "locate" any sentence in the vector space, or cognitive space.

Keep in mind that the rationale for using matrices is solely because $AB \neq BA$ in matrix multiplication, or:
John loves Mary $\neq$ Mary loves John.
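A quick numpy illustration of that point (random matrices stand in for the learned concept matrices):

    import numpy as np

    rng = np.random.default_rng(0)
    john, loves, mary = (rng.standard_normal((3, 3)) for _ in range(3))

    s1 = john @ loves @ mary      # "John loves Mary"
    s2 = mary @ loves @ john      # "Mary loves John"
    print(np.allclose(s1, s2))    # False: the two sentences land on different points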

Ontology ball

This is another trick I invented:


It is just a representation of an ontology in space (as a ball).  To determine if $A \subset B$, we see if $A$ is located within the "light cone" of $B$.  Note that fuzzy $\in$ and $\subset$ can be represented by using the deviation from the central axis as a fuzzy measure.

I think this trick is important since it replaces the operation of unification (as in Prolog or first-order logic) with something spatial.  Traditionally, unification is done by comparing 2 logic terms syntactically, via a recursive-descent algorithm on their syntax trees.  This algorithm is discrete.  With this trick we can now perform the same operation using spatial techniques, which is potentially amenable to statistical learning.
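One way to realize the cone test numerically (a sketch only; the cone half-angle and the choice of cosine as the fuzzy measure are my assumptions, and the two vectors are invented):

    import numpy as np

    def cos_to_axis(A, B):
        """Cosine of the angle between A and B's central axis (ray from the origin)."""
        return float(A @ B / (np.linalg.norm(A) * np.linalg.norm(B)))

    def in_cone(A, B, half_angle_deg=20.0):
        """Crisp test: is A inside the light cone of B?"""
        angle = np.degrees(np.arccos(np.clip(cos_to_axis(A, B), -1.0, 1.0)))
        return angle <= half_angle_deg

    cat, animal = np.array([0.9, 1.0]), np.array([1.0, 1.0])
    print(in_cone(cat, animal), round(cos_to_axis(cat, animal), 3))   # True 0.999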

Update:  Unfortunately I realized that the light-cone trick is not compatible with the matrix trick, as matrix multiplication rotates the space, which may destroy the $\in$ or $\subset$ relations (as they rely on straight rays projecting from the origin to infinity).  In fact, a matrix multiplication can be seen as a rotation by some angle $\theta$, which is a rational multiple of $\pi$ (or arbitrarily close to one if irrational).  Then we can always find a $k$ such that $A^k = Id$, where $Id$ is the identity.  In other words, we can find a power of any word, say "love", such that:
love $\circ$ love $\circ$ love ... = $Id$
which doesn't make sense.

I am still thinking of how to combine the 2 tricks.  If successful, it may enable us to perform logic inference in a spatial setting.  

Thursday, November 29, 2012

Support vector machines

The 3 parts of an SVM

The SVM consists of 3 "tricks" combined to give one of the most influential and arguably state-of-the-art machine learning algorithms.  The 3rd, the "kernel trick", is what gives the SVM its power.

During the talk I will refer to Cortes and Vapnik's 1995 paper.
  1. The linear classification problem, ie, separating a set of data points by a straight line / plane.  This is very simple.
  2. Transforming the primal optimization problem to its dual via Lagrangian multipliers.  This is a standard trick that does not result in any gain / loss in problem complexity.  But the dual form has the advantage of depending only on the inner product of input values.
  3. The kernel trick that transforms the input space to a high dimensional feature space.  The transform function can be non-linear and highly complex, but it is defined implicitly by the inner product.

1. The maximum-margin classifier



The margin is the separation between these 2 hyper-planes:
 $ \bar{w}^* \cdot \bar{x} = b^* + k $
 $ \bar{w}^* \cdot \bar{x} = b^* - k $
so the maximum margin is given by:
 $ m^* = \frac{2k}{|| \bar{w}^* ||} $

The optimization problem is to maximize $m^*$, which is equivalent to minimizing:
$$ \max (m^*) \Rightarrow \max_{\bar{w}} (\frac{2k}{ || \bar{w} || })
\Rightarrow \min_{\bar{w}} (\frac{ || \bar{w} || }{2k})  \Rightarrow \min_{\bar{w}} ( \frac{1}{2} \bar{w} \cdot \bar{w} ) $$subject to the constraints that the hyper-plane separates the + and - data points:
$$ \bar{w} \cdot (y_i \bar{x_i} ) \ge 1 + y_i b $$

This is a convex optimization problem because the objective function has a nice quadratic form $\bar{w} \cdot \bar{w} = \sum_i w_i^2$, which has a convex shape:



2. The Lagrangian dual

The Lagrangian dual of the maximum-margin classifier is to maximize:
$$ \max_{\bar{\alpha}} ( \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \bar{x_i} \cdot \bar{x_j} ) $$subject to the constraints:
$$ \sum_{i=1}^l \alpha_i y_i = 0, \quad  \alpha_i \ge 0 $$
Lagrange multipliers are a standard technique, so I'll not explain them here.

Question:  why does the inner product appear in the dual?  Whatever the answer, the important point is that the dual form depends only on the inner product $(\bar{x_i} \cdot \bar{x_j})$, whereas the primal form depends on $(\bar{w} \cdot \bar{w})$.  This allows us to apply the kernel trick to the dual form.

3. The kernel trick

A kernel is just a symmetric inner product defined as:
$$ k(\mathbf{x}, \mathbf{y}) = \Phi(\mathbf{x}) \cdot \Phi(\mathbf{y}) $$By leaving $\Phi$ unspecified we can define the kernel as we like, for example, $ k(\mathbf{x}, \mathbf{y}) = \Phi(\mathbf{x}) \cdot \Phi(\mathbf{y}) = (\mathbf{x} \cdot \mathbf{y} )^2 $.

In the Lagrangian dual, the dot product $(\bar{x_i} \cdot \bar{x_j})$ appears, which can be replaced by the kernel $k(\bar{x_i}, \bar{x_j}) = (\Phi(\bar{x_i}) \cdot \Phi(\bar{x_j}))$.

Doing so has the effect of transforming the input space $\mathbf{x}$ into a feature space $\Phi(\mathbf{x})$, as shown in the diagram:



But notice that $\Phi$ is not explicitly defined;  it is defined via the kernel, as an inner product.  In fact, $\Phi$ is not uniquely determined by a kernel function.

This is a possible transformation function $\Phi$ for our example:
$$ \Phi(\mathbf{x}) = (x_1^2, x_2^2, \sqrt{2} x_1 x_2) $$It just makes $ \Phi(\mathbf{x}) \cdot \Phi(\mathbf{y}) $ equal to $ (\mathbf{x} \cdot \mathbf{y} )^2 $.
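It is easy to check this numerically (a small sketch with an arbitrary pair of 2-D points):

    import numpy as np

    def phi(v):
        """Explicit feature map for the kernel k(x, y) = (x . y)^2 in 2-D."""
        x1, x2 = v
        return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

    x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
    print(np.dot(x, y) ** 2)          # 1.0 : (x.y)^2 computed in the input space
    print(np.dot(phi(x), phi(y)))     # ~1.0 : the same value via the feature space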

Recall that the inner product induces a norm via:
$ ||\mathbf{x}|| = \sqrt{\mathbf{x} \cdot \mathbf{x}} $
and the distance between 2 points x and y can be expressed as:
$ d(\mathbf{x},\mathbf{y})^2 = || \mathbf{x - y} ||^2 = \mathbf{x} \cdot \mathbf{x} + \mathbf{y} \cdot \mathbf{y} - 2 \mathbf{x} \cdot \mathbf{y}$
Thus, defining the inner product is akin to re-defining the distances between data points, effectively "warping" the input space into a feature space with a different geometry.

Some questions:  What determines the dimension of the feature space?  How is the decision surface determined by the parameters $\alpha_i$?

Hilbert Space

A Hilbert space is just a vector space endowed with an inner product or dot product, notation $\langle \mathbf{x},\mathbf{y}\rangle$ or $(\mathbf{x} \cdot \mathbf{y})$.

The Hilbert-Schmidt theorem (used in equation (35) in the Cortes-Vapnik paper) says that a compact self-adjoint operator on a Hilbert space H (which can be infinite-dimensional) has eigenvectors that form an orthonormal basis of H.  In other words, every vector in the space H has a unique representation of the form:
$$ \mathbf{x} = \sum_{i=1}^\infty \alpha_i \mathbf{u}_i $$This is called a spectral decomposition.


*  *  *

As a digression, Hilbert spaces can be used to define function spaces.  For example, this is the graph of a function with an integer domain:
with values of $f(1), f(2), f(3), ... $ on to infinity.  So, each graph corresponds to one function.

Imagine an infinite dimensional space.  Each point in such a space has dimensions or coordinates $(x_1, x_2, ... )$ up to infinity.  So we can say that each point defines a function via
$$f(1) = x_1, \quad f(2) = x_2, \quad f(3) = x_3, ... $$
In other words, each point corresponds to one function.  That is the idea of a function space.  We can generalize from the integer domain ($\aleph_0$) to the continuum ($2^{\aleph_0}$), thus getting the function space of continuous functions.

End of digression :)

Sunday, November 25, 2012

Introducing Genifer, 2012

I started working on AGI in 2004, and my greatest "achievement" up to 2012 (8 years) is the invention of a new logic.  All my work thus far is focused on the thinking aspect of AGI, deferring the aspects of planning and acting.

Concept composition logic


Consider a complex sentence:
     "John walks to church slowly, thinking about Turing"

It can be decomposed into:

  1. "John walks to church"
  2. "John walks slowly"
  3. "John walks thinking about Turing"
Using the Genifer GUI (github code is here) we can manually create a graph that represents the sentence structure as follows:



The GUI automatically generates a "factorization" formula of the graph.

The key idea of the graph is that if a word A modifies word B, we create an arrow $A \multimap B$.  More than 1 word can modify a target, but each word can only modify 1 target.  This ensures that the result is a tree, and can be expressed as a sum of products, using $+$ and $\times$.

We can use the distributive law of multiplication to expand the factorized formula.

A problematic case:  If we translate these 2 sentences into logic:

  1. "John gives the ball to Mary"  and
  2. "John gives $2 to Kevin"
After factorizing and expanding we get:

  1. john gives to mary
  2. john gives the ball
  3. john gives to kevin
  4. john gives $2
so one might wrongly conclude that John gives $2 to Mary.  But I reckon that the distributive law is not the real cause of the error.  The error arises because the two "johns" are the same John, but the two "gives" are different instances of giving.

Variable-free logic

This is an example of the logical definition of "downstairs" from the SUMO ontology:


On the left side is the logic formula, on the right the English translation.  As you can see from the translation, it uses a lot of "$n^{th}$ objects" which are logic variables.  But notice that all these variables invariably belong to some classes (such as "building_level").  So we can just do without variables and directly use ontology classes for matching.

For example:
    "All girls like to dress pretty"
    "Jane is a girl"
implies "Jane likes to dress pretty".

In logic the steps would be:
    girls like to dress pretty
    jane $\in$ girls
    $\Rightarrow$ jane likes to dress pretty
because the "girl" atom in both formulas match, so "jane" is substituted into the conclusion.

Parsing of natural language

This is an example parse tree of an English sentence:



A simple grammar rule is:
   "A noun phrase (NP) followed by a verb phrase (VP) is a sentence (S)".

In classical AI, which uses first-order predicate logic, we have to parse the English sentence to obtain the parse tree (syntax parsing), and then translate the parse tree into logic (semantic parsing).  The final logic formula would be like:  loves(john, mary).  The semantic parsing process is complicated and involves using $\mathbf{\lambda}$-calculus to represent sentence fragments.

In our new logic we can just write:
    NP $\circ$ VP $\in$ S

And we have these as data:
    loves $\circ$ mary $\in$ VP                           "loves mary is a verb phrase"
    john $\in$ NP                                          "john is a noun phrase"
    $\Rightarrow$ john $\circ$ loves $\circ$ mary $\in$ S             "john loves mary is a sentence"

Thus, we have achieved both syntax and semantic parsing using just one logic rule (for each grammar rule)!  This is a tremendous improvement over classical AI.  (But I would add that my new logic is just a kind of syntactic sugar for classical logic.  Nevertheless, the syntactic sugar in this case results in very significant simplification of implementation, whereas the classical approach may be so complicated as to be unusable.)
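To make the mechanics concrete, here is a tiny Python sketch of that single rule in action (the data structures are invented for illustration and are not Genifer's actual ones):

    # A fact says: this sequence of concepts is a member of this class.
    facts = {
        ('john',): 'NP',
        ('loves', 'mary'): 'VP',
    }

    def apply_rule(facts):
        """Rule: NP o VP is a member of S."""
        derived = {}
        for seq1, cls1 in facts.items():
            for seq2, cls2 in facts.items():
                if cls1 == 'NP' and cls2 == 'VP':
                    derived[seq1 + seq2] = 'S'
        return derived

    print(apply_rule(facts))    # {('john', 'loves', 'mary'): 'S'}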

In summary, my innovation provides a new algebraic format of logic that can change our perspectives and open up new algorithmic possibilities.