Saturday, December 20, 2014

Cognitive micro-actions

Consider this definition of "grandfather" in predicate logic:

$$\mbox{father}(X,Y) \wedge \mbox{father}(Y,Z) \rightarrow \mbox{grandfather}(X,Z).$$

The use of variables makes this formula rather long and unnatural.  Long formulas are difficult to learn via inductive learning.

My latest idea is to introduce procedural elements into logic so that certain operations intuitive to the human mind can be carried out via very short instructions.

Working memory structure

Cognitive science studies have shown that the human brain can hold approximately "7 plus or minus 2" items in working memory.

Imagine that working memory is like a data structure (eg a linked-list, tree, stack, queue, graph, etc..) that can be programmed with micro-instructions.



In logic, every problem must be formulated as deduction.  This makes certain simple operations complicated to define.  For example, this is the definition of "append" in Prolog:

$$\mbox{append } [\mbox{Head}|\mbox{Tail}] \mbox{ to List gives } [\mbox{Head}|\mbox{List2}] \leftarrow \mbox{append Tail to List gives List2}$$

which is 13 symbols long.

Notice that the variables are inter-dependent among the conjuncts of the formula.


This creates complications in pattern matching (unification) and substitution.

Suppose that Genifer's working memory has a linked-list tool kit, with micro-instructions such as:

  • focus on first list item
  • move to next list item
  • replace focused list item with X
  • insert X before current focus
  • etc...
Such instructions are very short: they are usually atomic or take only 1 argument.  This is a significant advantage for machine learning.
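As a concrete illustration, here is a minimal Python sketch of such a toolkit; the class and method names are my own invention, not Genifer's actual instruction set:

    # A minimal sketch of a working-memory "linked list" driven by micro-instructions.
    # The class and method names are illustrative, not Genifer's actual instruction set.

    class WorkingList:
        def __init__(self, items):
            self.items = list(items)   # the list currently held in working memory
            self.focus = 0             # index of the currently focused item

        def focus_first(self):         # "focus on first list item"
            self.focus = 0

        def move_next(self):           # "move to next list item"
            if self.focus < len(self.items) - 1:
                self.focus += 1

        def replace(self, x):          # "replace focused list item with X"
            self.items[self.focus] = x

        def insert_before(self, x):    # "insert X before current focus"
            self.items.insert(self.focus, x)

    # Each micro-instruction is atomic or takes a single argument, so a learner
    # only has to choose an opcode and at most one operand.
    wm = WorkingList(["a", "b", "c"])
    wm.move_next()
    wm.replace("B")
    print(wm.items)   # ['a', 'B', 'c']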



Eliminating variables in logic

Consider again the grandfather formula:



$$\mbox{father}(X,Y) \wedge \mbox{father}(Y,Z) \rightarrow \mbox{grandfather}(X,Z).$$


Relation algebra (not to be confused with "relational algebra" in database theory) offers a way to eliminate variables.  It can be regarded as a form of combinatory logic (the major theory of eliminating variables), focused on relations.

For example, the grandfather example is formulated as:

$$\mbox{father} \circ \mbox{father} = \mbox{grandfather}$$

which is similar to the natural-language statement "father's father is grandfather".  Notice that the above is a complete statement about the "operators" father(·) and grandfather(·).

Look at the following inference carried out in relation algebra, using only equations and the substitution of equals for equals:

$\mbox{john = father paul}$

$\mbox{paul = father pete}$

$\mbox{john = father father pete = grandfather pete}$

Also notice how similar the above derivation is to natural language.
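As a small sketch (my own illustration, not Genifer code), a relation can be stored as a set of pairs and composed directly, so that no variables appear in the rule itself:

    # Sketch: relations stored as sets of pairs, composed without variables.
    # (x, y) in father means "x is the father of y".

    def compose(r, s):
        """Relational composition: x (r o s) z  iff  there is a y with x r y and y s z."""
        return {(x, z) for (x, y1) in r for (y2, z) in s if y1 == y2}

    father = {("john", "paul"), ("paul", "pete")}

    grandfather = compose(father, father)
    print(grandfather)   # {('john', 'pete')} -- "father's father is grandfather"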


Genifer logic

I suggest using an eclectic mix of logical and procedural elements; such a set will almost certainly be Turing-universal, provided it is not too impoverished:
  • equations
  • probabilistic conditioning (Bayesian arrow)
  • subsumption


Learning

A hybrid formula of the form:

$$conditions \rightarrow action$$

is neither purely logical nor purely procedural, but it can be learned via either logical inductive learning or reinforcement learning:

  • In logical learning, the above conditional statement is construed as "if conditions are satisfied, it would be appropriate to carry out the action".  Such a statement becomes declarative and thus can be learned via logical inductive learning.
  • In the procedural setting, when a reward is received, the sequence of actions leading to it is recorded, and that sequence may include the above hybrid formula, so the formula can be up-scored as usual.
These 2 learning processes may occur independently.
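A minimal sketch of the second (reinforcement) route, assuming a toy representation of rules as condition-set / action / score records; this is illustrative, not Genifer's actual learner:

    # Sketch: up-scoring condition -> action rules after a reward.
    # The record format and the update rule are toy assumptions.

    rules = [
        {"conditions": {"hungry", "food visible"}, "action": "eat",  "score": 0.5},
        {"conditions": {"tired"},                  "action": "rest", "score": 0.5},
    ]

    def fired_rules(state, action):
        """Rules whose conditions held in the state and whose action was the one taken."""
        return [r for r in rules if r["conditions"] <= state and r["action"] == action]

    def reinforce(episode, reward, lr=0.1):
        """After a reward, nudge the score of every rule that fired during the episode."""
        for state, action in episode:
            for r in fired_rules(state, action):
                r["score"] += lr * (reward - r["score"])

    episode = [({"hungry", "food visible"}, "eat")]
    reinforce(episode, reward=1.0)
    print(rules[0]["score"])   # 0.55 -- the "eat" rule was up-scored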


Monday, December 15, 2014

Searching in a lattice (在格中搜寻)

[ From now on I will write in English; though my English is less fluent than my Chinese, I find it more important to address international partners. ]

Last time I briefly introduced inductive learning, and how the search can be conducted in 2 directions: generalization (一般化) and specialization (特殊化).

Let me first introduce the simplest searching algorithm:  binary search (二分法寻找).


Binary search

Remember the game "20 Questions", in which one player thinks of a "thing" and the other player tries to guess what it is by asking only YES / NO questions.

It may be a bit surprising that an arbitrary thing someone has in mind can be pinned down by only 20 yes/no questions, but that is an illustration of the power of binary search.

The number of distinct things that can be specified by 20 questions is $2^{20} = 1048576$.  That is because, at each level, the number of nodes is doubled.  In other words, the number of nodes grows exponentially as powers of 2 (以2的幂级数作指数上升).


If I play the 20 Questions game, (given that I don't have any clue as to the player's preference), I would first ask:  "Is the thing physical or abstract?"  and if physical I would ask "Is it natural or artificial?"  or if abstract I would ask "Is it (good or bad) or neutral?"

The point is that I would try to divide the classes into roughly half-and-half proportions.  That is going to maximize the information I can gain.  If, however, my first question were "Is it spaghetti with meatballs?", it would be too specific and would waste a chance to narrow down the domain.

The closer each question comes to splitting the domain in half, the more efficient the search; the more the split deviates from 1/2, the less efficient it becomes.

The simplest case is binary search over a list of numbers (ie, we search for a given number in a list of unknown numbers;  all we know is that the list is sorted from small to big).

There are many videos and web pages explaining the binary search algorithm, so I will give only a brief sketch here.
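For reference, here is a standard implementation in Python (nothing Genifer-specific):

    def binary_search(sorted_list, target):
        """Return the index of target in sorted_list, or -1 if it is absent.
        Each probe halves the remaining interval, so the search takes O(log N) steps."""
        lo, hi = 0, len(sorted_list) - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            if sorted_list[mid] == target:
                return mid
            elif sorted_list[mid] < target:
                lo = mid + 1    # the target can only be in the upper half
            else:
                hi = mid - 1    # the target can only be in the lower half
        return -1

    print(binary_search([1, 3, 5, 7, 9, 11], 7))   # 3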

Notice that in mathematics, we have what are called order relations (大小关系).  For example the natural numbers can be ordered as follows:

1 < 2 < 3 < 4 < 5 < 6 .... $\infty$

and the integers can be ordered similarly:

$-\infty$ .... -5 < -4 < -3 < -2 < -1 < 0 < 1 < 2 < 3 < 4 < 5 .... $\infty$

The rational numbers can also be ordered as a chain, but the chain cannot be listed out because there is always another rational number between every 2 of them.  But these numbers still form a chain (链) in the sense that any 2 numbers can be compared and ordered (sorted).

The computational complexity (计算复杂度) of binary search is $O(\log N)$, meaning that if we have a list of $N$ numbers, it takes about $\log N$ steps for the algorithm to find the position of a target.  Mathematically, the operations $a^N$ and $\log_a N$ are inverses (互逆运算) of each other.

In other words, when the domain (ie, the search space) is of size $N$, the number of search steps is the logarithm of $N$.  Remember that when we take the exponent (power, $n$ 次方) of something, the result gets very large.  Conversely, when we take the logarithm of something, the result shrinks to something very small.  For example, taking the logarithm in base 10 is like counting the number of digits in a number.  So the wealth of Bill Gates (a multi-billionaire) is a 10-digit number; my wealth is a 5-digit number.

The idea of binary search (or in general, splitting the domain successively into equal partitions) is very important because it allows us to "take the logarithm" of the search space, shrinking the complexity to a manageable size.


Lattices (格)

Now we turn to our main concern.  For logic-based machine learning, we conduct a search in the hypothesis space (假设空间).  That space is organized along a general-to-specific axis (由一般到特殊的方向轴).  What makes the search harder is that the elements are not ordered as a chain; they form a lattice, a generalization of a chain in which any 2 elements have a least upper bound (join) and a greatest lower bound (meet) even when they cannot be compared directly.

We can denote $F_1 > F_2$ to mean "Formula 1 is more general than Formula 2".  However, not all formulas can be compared by generality.  This is why our search space is not a simple chain but a lattice.

For example, we may say that a woman is someone who has long hair, breasts, and wears skirts:
    $F_1:  \mbox{woman} \leftarrow \mbox{long hair}, \mbox{breasts}, \mbox{wears skirts}$

But we can be slightly more specific, adding that she has finer facial features, painted fingernails, speaks in a softer voice, etc:
    $F_2:  \mbox{woman} \leftarrow \mbox{long hair}, \mbox{breasts}, \mbox{wears skirts}, \mbox{fine facial features}, \mbox{painted nails}, \mbox{softer voice}$

So $F_1 > F_2$ (Formula 1 is more general than Formula 2).

We can also be more general by saying:  "a woman is anyone who has long hair":

    $F_3:  \mbox{woman} \leftarrow \mbox{long hair}$

So $F_3 > F_1$.

Generally, if we add more terms to the conditional side of a formula, it becomes more specific (because the conditions are harder to satisfy); and if we remove terms from the conditions, the formula becomes more general (easier to satisfy).

But how about this alternative definition:  "a woman is someone with XX chromosome and female genitalia":

    $F_4:  \mbox{woman} \leftarrow \mbox{XX chromosome}, \mbox{female genitalia}$

The formula $F_4$ cannot be compared with the above formulas along the general-specific axis, because it removes some conditions and adds new ones at the same time.  But $F_4$ is also located along some chains of other formulas.

The general situation is this:  not all formulas can be compared with each other, but for any 2 formulas we can always find a third formula that is more general (or more specific) than both.
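To make this concrete, treat each formula as the set of its conditions: then $F > G$ exactly when the conditions of $F$ are a proper subset of those of $G$, and a common generalization of any 2 formulas is obtained by intersecting their condition sets.  A small sketch using the "woman" formulas above (the set representation is my own):

    # Each "woman <- conditions" formula is represented by its set of conditions.
    F1 = {"long hair", "breasts", "wears skirts"}
    F2 = F1 | {"fine facial features", "painted nails", "softer voice"}
    F3 = {"long hair"}
    F4 = {"XX chromosome", "female genitalia"}

    def more_general(f, g):
        """f > g: fewer conditions are easier to satisfy, hence more general."""
        return f < g   # proper-subset test on the condition sets

    print(more_general(F1, F2))   # True:  F1 > F2
    print(more_general(F3, F1))   # True:  F3 > F1
    print(more_general(F4, F1), more_general(F1, F4))   # False False: incomparable

    # A formula more general than both F1 and F4: intersect their condition sets.
    # Here the intersection is empty, i.e. the always-true formula "woman <- (nothing)".
    print(F1 & F4)   # set()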

As an example, consider the lattice formed by the subsets of $\{a, b, c\}$, drawn so that a line going from a higher node to a lower node means ">".


This is also an example of a power set (幂集).  Given a set, say, $\{a, b, c\}$, the power set is the set of all its subsets.  For example, $\{a, b\}$ is one subset, $\{c\}$ is another subset.  The empty set $\emptyset$ is also a subset.

Here, ">" means the superset relation.

Notice that the power set in the above example has 8 elements, and $8 = 2^3$.  The 3 is because the original set $\{a, b, c\}$ has 3 elements.  In general, the power set of a set $A$ has $2^{|A|}$ elements, where $|A|$ denotes the number of elements in $A$, ie, the size of $A$. 
In mathematics, everything has an explanation.  There is not a single thing in mathematics that has no explanation.  Why does the power set have $2^{|A|}$ elements?  Let's try to list all the elements of the power set of $\{a, b, c\}$.
From the bottom of the lattice diagram:   First we list the empty set (the set with 0 elements).  Then we list the sets with 1 element:  $\{a\}$, $\{b\}$, and $\{c\}$.  Then we list the sets with 2 elements:  $\{a, b\}$, $\{a, c\}$, and $\{b, c\}$.  Then we list the set with all 3 elements:  $\{a, b, c\}$.  Then we're done. 
Another way to do this, is to ask, for each element of $\{a, b, c\}$, whether to include it in a set or not.  So for each element we are asking a YES / NO question.  For each question we can have 2 answers.  So there would be a total of $2 \times 2 \times 2$ different answers.  Because for each element we can have 2 choices:  whether to include it or not. 
Even more generally, if we have a mapping from a set $A$ to a set $B$, the total number of different mappings from $A$ to $B$ is equal to $|B|^{|A|}$.  For example, if we are trying to match 5 boys to 3 girls, there would be $3^5 = 243$ different mappings (multiple boys can map to the same girl, but each boy cannot choose more than 1 girl).  That is different from $5^3 = 125$ where multiple girls can choose the same boy, but each girl can only choose 1 boy. 
That is the meaning of exponentiation, from a source to a set of choices, $\mbox{choices}^\mbox{source}$.  The size of $B^A$ is the number of possible functions (mappings) from $A \rightarrow B$. 
Even if the sets are uncountable, such as the points on the line interval [0,1], the exponentiation can be understood as functions from a set to another set.
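Here is a quick Python check of the finite counts above (the power set of a 3-element set, and the $3^5$ boy-to-girl mappings); this is purely illustrative:

    from itertools import chain, combinations, product

    A = {"a", "b", "c"}

    # Power set: all subsets of A, listed by size 0, 1, 2, 3.
    power_set = list(chain.from_iterable(combinations(A, k) for k in range(len(A) + 1)))
    print(len(power_set))   # 8 == 2 ** 3 == 2 ** |A|

    # Mappings from a 5-element set (boys) to a 3-element set (girls):
    boys, girls = range(5), range(3)
    mappings = list(product(girls, repeat=len(boys)))   # one girl chosen per boy
    print(len(mappings))    # 243 == 3 ** 5 == |B| ** |A|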

Notice that in the lattice above, $\{a\}$ cannot be compared to $\{b\}$, but $\{a, b\} > \{a\}$ and $\{a, b\} > \{b\}$, so we can find an element that is "bigger" than both $\{a\}$ and $\{b\}$.

Also notice that the least element that is bigger than both $x$ and $y$ (their least upper bound, or join) is denoted $x \vee y$.  Such an element must be unique in a lattice.  This means that the following diagram is forbidden (for lattice orders):


because the red lines and blue lines both indicate a possible $x \vee y$.

This is also not allowed:

because both $y \vee z = z$ and $y \vee z = w$ are possible.

These are some more complicated, but acceptable, lattices:






Conclusion

In Genifer, our learning algorithm needs to search within the lattice of formulas (hypotheses).  Whenever Genifer experiences some new examples, she searches for the most general generalization (mgg) of the given examples, as long as the hypothesis does not contradict her existing knowledge.

It is still a somewhat open question as to how to conduct this search efficiently.  We wish to find a fast algorithm analogous to binary search along a chain, but within a lattice.  Maybe the reader can help me solve this problem?

Sunday, December 7, 2014

What are Bayesian networks? (什么是贝叶斯网络?)

Propositions and the relations between them


Last time I introduced "propositional logic".  A proposition is something that can be assigned a truth value, for example "it rained yesterday" or "humans are animals".

There are "relations" between propositions, for example: "ate unclean food" → "stomach ache";
this is written as P → Q, read "P implies Q".

This "arrow" is the most important thing in all of logic: it is what makes "deduction" possible.  Without it, knowledge would be just a heap of discrete, unconnected ideas.

The universe is infinite, but the human brain is finite; how can a finite brain understand an infinite world?  The answer is general rules.  Those rules contain variables, so logical formulas can be applied, like templates, to different instances and yield different conclusions; in this way the "experienced world" can be compressed into a finite amount of information.

In short: variables are the basis of compression, and the arrow is the basis of deduction.


Probability


But in classical logic every proposition is either true or false (binary), which is of course a limitation, so researchers of our generation no longer use classical logic.  My Genifer, Ben Goertzel's OpenCog, and Pei Wang's OpenNARS all incorporate some form of "uncertainty logic".



The approach I use is classical probability theory plus fuzziness (fuzzy); for example, if we say "Mary is tall", the probability distribution could look like the red curve in the figure below:


That is, Mary's height follows an uncertain probability distribution, and the most likely height is at the peak.  Note that I do not plot the actual height; instead I use the interval [0,1] to represent the degree of "tallness": for example 0.7 is taller than average, and 0.5 is average.

But if the uncertainty increases, the curve becomes flatter, for example:

Also, a basic property of probability is that the total area under the distribution curve must be 1, because however Mary's height is distributed, the sum of all possibilities is a "certain event", and the probability of a certain event is 1.
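Here is a tiny sketch (with made-up toy numbers, not a real model of Mary's height) of such a distribution, discretized over the degree-of-tallness interval [0,1], confirming that the total probability is 1:

    import numpy as np

    # Toy numbers only: a bell-shaped curve over the degree-of-tallness interval [0,1],
    # peaking around 0.7 (taller than average).
    degrees = np.linspace(0, 1, 101)
    curve = np.exp(-((degrees - 0.7) ** 2) / (2 * 0.1 ** 2))   # narrow = confident
    curve /= curve.sum()             # normalize so the total probability is 1

    print(degrees[curve.argmax()])   # ~0.7, the most likely degree of "tallness"
    print(curve.sum())               # ~1.0: a certain event has probability 1

    # A flatter (more uncertain) curve: the same shape with a wider spread.
    flat = np.exp(-((degrees - 0.7) ** 2) / (2 * 0.3 ** 2))
    flat /= flat.sum()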


Bayesian network

Below is an example of a Bayesian network:

Each arrow from X to Y represents "Y given X", that is, "if we know that X has happened, what is the probability of Y?"

For example, the graph shows that smoking affects lung cancer, lung cancer affects fatigue, and lung cancer also affects the X-ray result.

Each arrow in the graph carries conditional probabilities of the form P(X|Y), as listed in the table below (no need to study it in detail):

P(X1=no)=0.8
P(X1 = yes)=0.2
P(X2=absent | X1=no)=0.95
P(X2=absent | X1=yes)=0.75
P(X2=present | X1=no)=0.05
P(X2=present | X1=yes)=0.25
P(X3=absent | X1=no)=0.99995
P(X3=absent | X1=yes)=0.997
P(X3=present | X1=no)=0.00005
P(X3=present | X1=yes)=0.003
P(X4=absent | X2=absent, X3=absent)=0.95
P(X4=absent | X2=absent, X3=present)=0.5
P(X4=absent | X2=present, X3=absent)=0.9
P(X4=absent | X2=present, X3=present)=0.25
P(X4=present | X2=absent, X3=absent)=0.05
P(X4=present | X2=absent, X3=present)=0.5
P(X4=present | X2=present, X3=absent)=0.1
P(X4=present | X2=present, X3=present)=0.75
P(X5=absent | X3=absent)=0.98
P(X5=absent | X3=present)=0.4
P(X5=present | X3=absent)=0.02
P(X5=present | X3=present)=0.6

The network is called a Bayesian network after Bayes, a mathematician contemporary with Newton:


He discovered a formula that can be used to compute probabilities of the form P(X|Y):

$ P(A|B) = \frac{P(B|A) \, P(A)}{P(B)} $

which can also be written more symmetrically as:

$ P(A|B) P(B) = P(A,B) = P(B|A) P(A) $

Here P(A,B) is the probability that both "A and B" occur, which is different from "A given B".

This formula is taught in junior or senior high school; it is not difficult to understand.

The difficulty with Bayesian networks is that the network ties many different probabilities together, so Bayes' rule has to be applied over and over, one application nested inside another; the computation becomes complicated, with many sums of products.  But with patience and care the algorithm should not be hard, because apart from Bayes' rule and the basic operations of probability, there is nothing else.
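As a tiny worked example using the numbers in the table above (with X1 = smoking and X2 = lung cancer, as in the graph), here is the posterior P(X1=yes | X2=present) computed with one application of Bayes' rule; the variable names are just for illustration:

    # Numbers taken from the table above.
    P_X1_yes = 0.2
    P_X1_no = 0.8
    P_X2_present_given_X1_yes = 0.25
    P_X2_present_given_X1_no = 0.05

    # Total probability of X2 = present (a small "sum of products"):
    P_X2_present = (P_X2_present_given_X1_yes * P_X1_yes +
                    P_X2_present_given_X1_no * P_X1_no)    # 0.05 + 0.04 = 0.09

    # Bayes' rule: P(X1=yes | X2=present) = P(X2=present | X1=yes) P(X1=yes) / P(X2=present)
    posterior = P_X2_present_given_X1_yes * P_X1_yes / P_X2_present
    print(round(posterior, 3))   # 0.556: seeing X2 present raises P(X1=yes) from 0.2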


References


Classical AI did not use probability, until in the 1980s Judea Pearl wrote a book analyzing Bayesian networks in great detail, which made them mainstream:



Pearl's son was an American journalist (kidnapped in Pakistan); the family is Jewish.  Tragically he was captured by terrorists and beheaded, and the killing was filmed.  I cannot tell whether he was extremely pro-American, or an innocent sympathizer who was killed.

This is Daphne Koller, one of the authors of this newer monumental work (2009, 1280 pages!):



But that book is too technical and detailed, and is not easy for beginners to read.

Daphne, at Stanford U, is one of the founders of Coursera.  Her (free) course https://www.coursera.org/course/pgm explains the algorithms and learning methods of Bayesian networks in detail.

By the way, this is Peter Norvig; "AIMA", which he co-authored with Stuart Russell, is currently the best artificial-intelligence textbook.