Monday, December 15, 2014

Searching in a lattice (在格中搜寻)

[ From now on I will write in English, though my English is less fluent than Chinese, but I find it more important to address international partners. ]

Last time I briefly introduced inductive learning;  how the search can be conducted in 2 directions: generalization (一般化) and specialization (特殊化).

Let me first introduce the simplest searching algorithm:  binary search (二分法寻找).


Binary search

Remember the game "20 Questions", in which one player thinks of a "thing" and the other player tries to guess what it is by asking only YES / NO questions.

It may be a bit surprising that an arbitrary thing in the mind can be categorized only by 20 yes/no questions, but that is an illustration of the power of binary search.

The number of distinct things that can be specified by 20 questions is $2^{20} = 1048576$.  That is because, at each level, the number of nodes is doubled.  In other words, the number of nodes grows exponentially as powers of 2 (以2的幂级数作指数上升).


If I play the 20 Questions game, (given that I don't have any clue as to the player's preference), I would first ask:  "Is the thing physical or abstract?"  and if physical I would ask "Is it natural or artificial?"  or if abstract I would ask "Is it (good or bad) or neutral?"

The point is that I would try to divide the classes into roughly half-and-half proportions.  That is going to maximize the information I can gain.  If, however, I ask the first question "Is it spaghetti with meatballs?" it would be too specific and a waste of chance to narrow down the domain.

The more I can split the domain into 1/2, the more optimal my efficiency.  The more it deviates from 1/2 would be the less efficient.

The simplest case is binary search over a list of numbers (ie, we search for a given number in a list of unknown numbers;  all we know is that the list is sorted from small to big).

This is a nice video explaining the binary search algorithm:
There are many more web pages explaining binary search, so I will skip it here.

Notice that in mathematics, we have what are called order relations (大小关系).  For example the natural numbers can be ordered as follows:

1 < 2 < 3 < 4 < 5 < 6 .... $\infty$

and the integers can be ordered similarly:

$-\infty$ .... -5 < -4 < -3 < -2 < -1 < 0 < 1 < 2 < 3 < 4 < 5 .... $\infty$

The rational numbers can also be ordered as a chain, but the chain cannot be listed out because there is always another rational number between every 2 of them.  But these numbers still form a chain (链) in the sense that any 2 number can be compared and ordered (sorted).

The computational complexity (计算复杂度) of binary search is $O(\log N)$, meaning that if we have a list of $N$ numbers, it would take on average $\log N$ steps in the algorithm to find the position of a target.  Mathematically, the relation of $a^N$ and $\log_a N$ are inverses (互逆运算) of each other.

In other words, when the domain (ie, the search space) is of size $N$, the number of search steps is the logarithm of $N$.  Remember that when we take the exponent (power, $n$ 次方) of something, the result gets very large.  Conversely, when we take the logarithm of something, the result is shrunk to very small.  For example, taking the logarithm in base 10 is like counting the number of digits in a number.  So Bill Gates (a multi-billionaire) 's wealth is a 10-digit number.  My wealth is 5-digit number.

The idea of binary search (or in general, splitting the domain successively into equal partitions) is very important because it allows us to "take the logarithm" of the search space, shrinking the complexity to a manageable size.


Lattices (格)

Now we turn to our main concern.  For logic-based machine learning, we conduct a search in the hypothesis space (假设空间).  That space is organized along a general-to-specific axis (由一般到特殊的方向轴).  Even more difficult is that the elements are not ordered as a chain.  They form a lattice which is a mathematical generalization of an order relation.

We can denote $F_1 > F_2$ to mean "Formula 1 is more general than Formula 2".  However, not all formulas can be compared by generality.  This is why our search space is not a simple chain but a lattice.

For example, we may say that a woman is someone who has long hair, breasts, and wears skirts:
    $F_1:  \mbox{woman} \leftarrow \mbox{long hair}, \mbox{breasts}, \mbox{wears skirts}$

But we can be slightly more specific, adding that she has finer facial features, painted fingernails, speaks in a softer voice, etc:
    $F_2:  \mbox{woman} \leftarrow \mbox{long hair}, \mbox{breasts}, \mbox{wears skirts}, \mbox{fine facial features}, \mbox{painted nails}, \mbox{softer voice}$

So $F_1 > F_2$ (Formula 1 is more general than Formula 2).

We can also be more general by saying:  "a woman is anyone who has long hair":

    $F_3:  \mbox{woman} \leftarrow \mbox{long hair}$

So $F_3 > F_1$.

Generally, we see that if we add more terms to the conditional side of a formula, the more specific it becomes (because the conditions are harder to satisfy);  and if we remove terms from the conditions, the formula becomes more general (easier to satisfy).

But how about this alternative definition:  "a woman is someone with XX chromosome and female genitalia":

    $F_4:  \mbox{woman} \leftarrow \mbox{XX chromosome}, \mbox{female genitalia}$

The formula $F_4$ cannot be compared with the above formulas along the general-specific axis, because it removes some conditions and adds new ones at the same time.  But $F_4$ is also located along some chains of other formulas.

The general situation is that:  not all formulas can be compared with each other, but for any 2 formulas, we can always find a third one that is more general (or specific) than the two.

As an example, this is a lattice (a line going from high to low means ">"):


This is also an example of a power set (幂集).  Given a set, say, $\{a, b, c\}$, the power set is the set of all its subsets.  For example, $\{a, b\}$ is one subset, $\{c\}$ is another subset.  The empty set $\emptyset$ is also a subset.

Here, ">" means the superset relation.

Notice that the power set in the above example has 8 elements, and $8 = 2^3$.  The 3 is because the original set $\{a, b, c\}$ has 3 elements.  In general, the power set of a set $A$ has $2^{|A|}$ elements, where $|A|$ denotes the number of elements in $A$, ie, the size of $A$. 
In mathematics, everything has an explanation.  There is not a single thing in mathematics that has no explanations.  Why does the power set have $2^{|A|}$ elements?  Let's try to list all the elements of the power set of $\{a, b, c\}$.
From the bottom of the lattice diagram:   First we list the empty set (the set with 0 elements).  Then we list the sets with 1 element:  $\{a\}$, $\{b\}$, and $\{c\}$.  Then we list the sets with 2 elements:  $\{a, b\}$, $\{a, c\}$, and $\{b, c\}$.  Then we list the set with all 3 elements:  $\{a, b, c\}$.  Then we're done. 
Another way to do this, is to ask, for each element of $\{a, b, c\}$, whether to include it in a set or not.  So for each element we are asking a YES / NO question.  For each question we can have 2 answers.  So there would be a total of $2 \times 2 \times 2$ different answers.  Because for each element we can have 2 choices:  whether to include it or not. 
Even more generally, if we have a mapping from a set $A$ to a set $B$, the total number of different mappings from $A$ to $B$ is equal to $|B|^{|A|}$.  For example, if we are trying to match 5 boys to 3 girls, there would be $3^5 = 243$ different mappings (multiple boys can map to the same girl, but each boy cannot choose more than 1 girl).  That is different from $5^3 = 125$ where multiple girls can choose the same boy, but each girl can only choose 1 boy. 
That is the meaning of exponentiation, from a source to a set of choices, $\mbox{choices}^\mbox{source}$.  The size of $B^A$ is the number of possible functions (mappings) from $A \rightarrow B$. 
Even if the sets are uncountable, such as the points on the line interval [0,1], the exponentiation can be understood as functions from a set to another set.

Notice that in the lattice above, $\{a\}$ cannot be compared to $\{b\}$, but $\{a, b\} > \{a\}$ and $\{a, b\} > \{b\}$, so we can find an element that is "bigger" than both $\{a\}$ and $\{b\}$.

Also notice that the common element that is bigger than both $x$ and $y$ is denoted $x \vee y$.  Such an element should always be unique in a lattice.  This means that the following diagram is forbidden (for lattice orders):


because the red lines and blue lines both indicate a possible $x \vee y$.

This is also not allowed:

because both $y \vee z = z$ and $y \vee z = w$ are possible.

These are some more complicated, but acceptable, lattices:






Conclusion

In Genifer, our learning algorithm needs to search within the lattice of formulas (hypotheses).  Whenever Genifer experiences some new examples, it searches for the most general generalization (mgg) of the given examples, as long as the hypothesis does not contradict her existing knowledge.

It is still a somewhat open question as to how to conduct this search efficiently.  We wish to find a fast algorithm analogous to binary search along a chain, but within a lattice.  Maybe the reader can help me solve this problem?

No comments:

Post a Comment