## Tuesday, June 29, 2010

### Prolog Trick

My misunderstanding about the nature of the prolog trick has been cleared up a bit, and it is time for an analysis of just what the idea is.

Previously, I thought that the basic idea was simply to construct Bayesian Networks incrementally with a backward-chaining type search inspired by Prolog. However, this is just a part of the idea. The more important thing is the way the probabilities are calculated incrementally as the search progresses. Here's how it works if we support only deduction, not abduction (ie, we only follow baye's net arrows in one direction):

We have a query node, call it A. We might find B&C->A, making new query nodes B and C. We then might find X&Y->C, and find X, and find Y. At this point we have enough information to propagate forward from X and Y to C. We check whether we have enough to propagate even further, but we do not, so we continue the network construction. We could then find B, which would give us enough to propagate again; when we do so, we get A, so we are done.

"Finding X" here means finding a stored probability for the statement X, and propagating along X&Y->C means using the conditional table stored in that BN link to compute a probability distribution for  C.

Now, the interesting thing is, what we're doing here is a single pass of the belief propagation algorithm. A single pass like this is what you do if you only want to marginalize one variable, and of the network you've got is a tree. If you want to marginalize all the variables in a tree, all you need is a second pass, going in the opposite direction; this finishes off the computation for all the other variables with just twice the work. But, no need for twice the work if we only want the one!

If the network isn't a tree, then belief propagation is no longer guaranteed to be correct, but it's been shown that multiple passes (keep doing passes until the numbers appear to converge) can generate good approximations of the correct probabilities in practice. (The exact conditions in which this happens are not well-understood). The prolog trick still just does a single pass for non-tree-like situations, so (in a not-well-understood subset of cases) we should expect worse approximations than we'd get with a bunch of passes. However, one pass will obviously be quicker than a bunch, and it seems reasonable (to me) to expect somewhat graceful degradation.

In any case, here is how you do it. Suppose that node C is connected to rules C&D->A, C&E->B, and F&G->C. We know E, B, F, and G, and from that want to compute C. Suppose also that A is the node which C was invoked in order to estimate, so (via the rules of belief propagation) we ignore it in coming up with our estimate of C. So, then, how do we calculate C?

The basic idea is that we regard the messages coming down as priors and the messages coming up as likelihoods, so (applying Baye's rule) we just multiply them together to get the posterior. That is: the link F&G->C combined with the probabilities F and G will generate a probability for C, call it the "prior" (but keep in mind we're just temporarily treating it as a prior; it's not the prior of C in any long-term sense; indeed, it's an estimate given some evidence). This is a list of C's values and probabilities for each. Using the link C&E->B plus our knowledge of E and B, we can generate a likelihood function: a list of C's values with a probability of the evidence we have concerning B for each. I am going to grind in basic Baye's Law stuff for a sec, in case any readers don't "get it" yet: this is not a probability function. It does not necessarily sum to 1. It is a likelihood function, which means it measures how well different possible values of C would explain the evidence. We multiply it component-wise by the prior on C and normalize the vector to obtain the updated estimate on C (the posterior).

An example. Let's just suppose all these variables have just two states, true and false. Suppose F&G->C stores the following table of probabilities for C being true, given combinations of F and G:

F and G:.1
F and not G: .2
not F, but G: .3
neither F nor G: .4

Then, suppose that we find G and F both have probability .5. We calculate the "prior" for C being true as .025 + .05 + .075 + .1 = .25, with the value for C being false coming to .75 similarly (or by just taking 1-.25).

Now, suppose C&E->B stores the following probabilities for B:

C and E: .6
C but not E: .7
not C, but E: .8
neither C nor E: .9

Further suppose we know that B and E have probability 1. We calculate the likelihood for C being true as .6, and the likelihood for C being false as .8. [The likelihood of C being true will be: ((the probability of B being true supposing C is true and using our probability for E being true or false) times (the probability we had for B being true to start with)) plus ((the probability of B being false supposing C is true and using our probability for E being true or false) times (the probability we had for B being false to start with)). The likelihood for C being false will be the same replacing "supposing C is false" with "supposing C is true".]

Now, we multiply the two vectors together and normalize! The prior is [.25, .75] and the likelihood is [.6, .8]. Multiplying, we get [.15, .6]. Normalizing, we get [.2, .8], which is our final message for C which we'd pass along to help construct a probability for A.

To sum up: adding abduction to the prolog trick calculations can be accomplished via the belief prop rule, which says to multiply the deductive value (what I called the "prior" value) by a likelihood vector, and then normalize the resulting vector.

1. We need to make the explanation simpler, as AGI spans many areas and most people are unfamiliar with one area or another. Do you mind if I edit your post?

Now, sorry to continue with the vulgar example:

rotten-eggs -> h2s
farting -> h2s
farting -> farting-noise

Your solution can increase the probability of "farting" but does not seem to decrease the probability of "rotten eggs". Still, I think the answer is good enough =)

Another question: if "rotten-eggs" is found to be true, we should also expect P(farting) to decrease, even when "farting-noise" is unknown (ie, at baseline).

2. YKY,

It's true-- my post is a mix between just talking towards you and providing a bit more context, so it does not read well. Feel free to edit! I'll try to aim at a broader audience in the future...

It should decrease the resulting probability of a query for "rotten eggs" if the probability of "noise" gets increased... suppose h2s=true for simplicity, and let's specify some simplistic numbers...

P(h2s|egg)=1
P(h2s|fart)=1
P(h2s|neither)=0
P(noise|fart)=1
P(noise|no fart)=0
P(fart)=.5
P(egg)=.5

So, suppose first that our sub-query for "noise" (as part of a larger query for "egg") returns .5. [IMPORTANT NOTE: abductive queries like this *cannot* be satisfied by deductive information! An independent cause for "noise" might appear to increase its probability and therefore increase the probability of "fart", but this is wrong! This emerges from the rules of belief propagation, because if we found a new rule S->noise, we'd have to combine it with the existing fart->noise rule, so we'd end up with just one rule pointing to "noise". We then would get the *right* answer when we calculate likelihood, namely that the addition of an independent cause S actually explains away the influence of "noise" rather than increasing it! So, the .5 result I'm using in the example would *have* to come from further *abductive* queries. This has to be dealt with carefully in the algorithm; sorry I didn't realize it was a bit of a complication in the original post. Note also that the way in which we combine rules becomes fairly important here for the meaning of an individual rule.]

The likelihood for "fart" will be (.5*1 + .5*0 = ) .5, and for "no fart" will be (.5*0 + .5*1 = ) .5. The prior is also [.5, .5] so multiplying and normalizing leaves [.5, .5]. Thus the likelihood of "egg" given "h2s" is 1, and the likelihood of "no egg" given "h2s" is .5. Multiplying and normalizing we get the query result, that "egg" has probability 2/3.

Now suppose instead that "noise"=true. The likelihood for "fart" will be 1, and for "no fart" will be 0, making the result for "fart" (multiplying by the prior and then normalizing) [1, 0]. The likelihood for "egg" is now [1, 1], so (multiplying by the prior and normalizing) we get a probability of .5 for "egg".

"h2s" has been explained away and no longer increases the probability of "egg". :)

3. (I hate this example too!! ...but can't think of a better one at the moment...)

The first point is about abduction. Abduction means going from some facts (maybe call them "evidence") to some "explanations" / "abducibles" that can derive / prove the evidence.

The problem is, in the fart example, abduction means going from the facts:
1. "noise"
2. "h2s"
to all possible explanations.

The problem here is that we will not be running EXPLICIT queries on "fart" or "eggs". These have to be searched outward from "noise" and "h2s".

And that presents a problem, because I'm still envisioning a KB where rules are stored individually, such as:
fart -> h2s
fart -> noise
egg -> h2s
fart, egg -> h2s

Why I do this is because the real KB may have many rules about a concept, eg:
fart -> embarrassment
fart -> indigestion
fart -> methane
egg -> good source of protein
egg -> has vitamins
egg -> is not strictly vegetarian
egg -> can be brown or white
egg -> fragile
.... etc ....

So I'm guessing the rules should be stored individually. But this creates problems for abduction.

Secondly: I don't understand how you calculate your likelihoods. From Pearl's book the formula should be:
likelihood = P(e|H)/P(e|~H) where e = evidence, H = hypothesis

4. YKY,

Sorry, I should have been more explicit about my definition of likelihood. Pearl calls the number you mention a _*likelihood ratio*_.

My terminology is Wikipedia's:

http://en.wikipedia.org/wiki/Likelihood_function

Simply, the probability of A given B is P(A|B), but the likelihood of A given B is P(B|A). A probability function is defined f(x)=P(x|E), and obeys the standard probabilistic axioms including summing to 1, whereas a likelihood function is defined f(x)=P(E|x), and doesn't necessarily sum to 1 or do other things probabilities do.

In any case, I will try and put together some slides explaining my math.

Your point about queries. If you don't want to run explicit queries for the different explanations, that's a different use case ;). What you want is not a marginalization of one specific variable anymore-- you'll want to marginalize a bunch of things and then look through them. This means Prolog Trick is no longer what you want, and you should be doing full belief propagation! *However*-- incorporating abduction into the prolog trick is still important, as it improves the quality of results for the single-query use case.

Finally, your point about storing separate rules: yes, this is fine. What I'm working with in the post is, when the Bayes Net gets constructed, the rules need to be combined in some way to form a conditional probability table (such as the noisy-or combination). So, if we're really using un-combined rules, the combination will happen in the likelihood calculation.

5. RE: storing Bayesian rules in the KB.

My current approach is: store rules individually and generate CPTs on-the-fly.

An alternative is to collect rules by their heads (goals), and store "collective" CPTs in some way.

From the computer memory's perspective, there isn't much difference.

From the computational perspective, a big collective CPT may be clumsy to work with, when all we need is just an inference step with a few factors.

6. Re abduction: IMO abduction is not merely reasoning in the reverse direction of a Bayesian link. My definition of abduction is "finding explanations of facts".

The Prolog trick *definitely* can find explanations in the abductive direction. Its problem is that it ignores long-range probabilistic dependence, thus unable to do the analog of "explaining away". Maybe there's a way to solve this problem without doing full belief propagation?

Also, the abduction algorithm is almost identical to backward chaining (the search space is identical), and now I also see a similarity with forward chaining (going out from known facts to unknown directions). Maybe we can just combine BC and FC techniques to make abduction?

7. You are right, the use-case for abduction that you are interested in seems like forward-chaining (in the sense of starting with a known and computing probabilities for unknowns, rather than starting with an unknown and looking for knowns to compute from) but restricted to the abductive direction, ie, forward-chaining that happens backwards along BN arrows.

We have a terminology problem now, though. What terms should we use to distinguish between backwards-chaining probability calculations that utilize rules in the reverse direction (ie, what I explain how to do in the post) and a *search* for explanations (ie, what you are saying you want)?

I'm tempted to suggest that mine is "adding abduction to backwards chaining" and yours is "adding abduction to forward chaining," so that I can keep calling the likelihood math "abduction" (and the math that uses rules in their forward direction "deduction"). Abduction/deduction is then orthogonal to forward-chaining/backward-chaining.

What do you think? Can we resolve the terminology issue? It seems we need to before proceeding.

8. Abduction SHOULD be defined as "finding explanations", IMO. That accords with the definition of induction, "finding hypotheses".

However, your "utilizing rules in the reverse direction" is just plain old Bayesian reasoning -- precisely what Bayes rule allows to do.

9. So, I'm "adding likelihood reasoning to prolog-trick calculations", which is "useful for abduction", yes?

10. Yes, looks like that'll be useful.

We need a term for "reasoning in the abductive direction of rules".

Last time I had that in my Lisp code, I encountered a problem where rules go cyclic. For instance the KB has a rule "A -> B", and the user query B. Then A becomes a subgoal. Reasoning in the abductive direction, we get back B. And so on... (My solution was to allow this, but give such abductive steps lower priority).

As of now the Prolog trick has 2 problems:
1. unable to "explain away", and its analog in the abductive direction
2. cycles may cause inefficiency

But, I guess using likelihoods is a good move anyway....

11. YKY,

> We need a term for "reasoning in the abductive direction of rules".

I'll just call that "likelihood propagation" or "likelihood reasoning" if that's ok.

1. What I'm trying to tell you is that explaining away works fine if you do the likelihood math!
2. Don't we already have to deal with this in the, um, deductive direction of reasoning, if we allow searching for more rules with A as a conclusion after we find some B->A (to improve the estimate of A)? ie, in general we need to avoid adding the same link to the network twice?

Sigh. Terminology suggestions:

*deductive propagation*: propagation of probabilities forward on links (this is not optimal-- any other suggestions to indicate the forward direction?)
*likelihood propagation*: propagation backwards on links

12. Hmm. For the moment instead of "deductive propagation" I will use "cause-based reasoning" for propagation forward on Bayes Net arrows, and contrast it with "evidence-based reasoning" for propagation backwards along Bayes Network arrows. If anyone has any better suggestion, please chime in!!

13. PS: I know YKY wants to distinguish causal reasoning from bayes net reasoning! That is one reason why the above is not good terminology... any other suggestions welcome...

14. PPS: YKY has suggested "forward propagation" and "backward propagation" which I'll use for the moment.

15. Re: not using the same Bayesian link during a proof search:

In FOL, repeatedly applying a formula is allowed (eg, if a formula uses the successor function S). This is one reason why FOL is undecidable.

Similarly, I guess the KB can generate BNs by repeatedly applying one formula. Of course, the generated BN should not have duplicated links.

In the BC (backward chaining) algorithm, this is still a problem, as it could be costly to check for duplicated links during the proof-tree search. So my solution was to allow them, but with lower priority. This remains an open problem as of now...