Saturday, June 26, 2010

Careful Ascent

YKY has suggested that the strategy will be as follows:

  1. Start with Horn clauses, replacing the classical conditional with a probabilistic one. This is relatively simple to reason about, even including abduction and lifted reasoning, with a Bayes-net construction algorithm that YKY calls the "prolog trick". (A minimal sketch of the rule representation appears just after this list.)
  2. Add in negation. If we were dealing with the classical conditional, we'd now have full FOL clauses -- so we've got what can be thought of as a probabilistic generalization of FOL clauses. The extra complication to the Bayesian networks is trivial (and doesn't need to fall back on negation-as-failure), but I want to examine some less obvious complications.
  3. Where the "prolog trick" uses FOL resolution (to minimally instantiate a rule so as to apply to the present case), put in HOL resolution so that we can start using HOL predicates and probabilistically generalizing over HOL variables. (The second of those is the exciting part.) This also seems simple, except for the bit about having to implement HOL unification. Again, I'll look at the complications.
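
To make the roadmap a bit more concrete, here is a minimal sketch -- hypothetical names, not our actual code -- of the rule representation that steps 1 and 2 imply: a conditional probability in place of the classical conditional, with negation as a flag on literals.

```python
# Rough sketch (not our actual code) of the rule representation implied by
# steps 1 and 2: a conditional probability replaces the classical conditional,
# and negation becomes a flag on literals.  All names here are hypothetical.
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Literal:
    predicate: str             # e.g. "smokes"
    args: Tuple[str, ...]      # variables or constants, e.g. ("X",)
    negated: bool = False      # step 2: negation allowed in rules

@dataclass(frozen=True)
class Rule:
    body: Tuple[Literal, ...]  # conjunction of (possibly negated) literals
    head: Literal
    prob: float                # P(head | body) -- the probabilistic conditional

# Step 1 flavour: P(cancer(X) | smokes(X)) = 0.3
r1 = Rule(body=(Literal("smokes", ("X",)),),
          head=Literal("cancer", ("X",)),
          prob=0.3)

# Step 2 flavour: a negated premise, P(long_hair(X) | ~male(X)) = 0.7
r2 = Rule(body=(Literal("male", ("X",), negated=True),),
          head=Literal("long_hair", ("X",)),
          prob=0.7)
```
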
Complications with #2

Completeness for FOL and completeness for Horn clauses are, to an extent, different beasts. Horn-clause reasoning is complete with respect to the goal of establishing "concrete" facts (statements which do not include alternatives). FOL completeness, in contrast, demands the ability to reason about truth values for any statement the system can express.

It isn't hard to apply FOL resolution to Horn clauses to deduce more rules which are consequences of the existing ones (in fact, it's quite easy). It's also not hard to allow conditional queries to the Bayesian networks we'll be constructing, which would let us deduce new rules from existing information. My worry, though, is that we may end up with two rules whose information cannot be integrated by the system.

An example:

A1: P(C|A&B)=1
A2: P(~B|C)=1

We should be able to deduce P(~B|A)=1.

C1: P(A&B&C)=P(A&B) [from A1]
C2: P(B|C)=0 [from A2]
C3: P(B&C)=0 [mult. C2 by P(C)]
C4: P(A&B&C)=P(A|B&C)*P(B&C)=0 [decompose P(A&B&C) and then apply C3]
C5: P(A&B)=0 [from C1, C4]
C6: P(B|A)=0 [divide by P(A), assuming it's not 0]
C7: P(~B|A)=1 [complement of C6]
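
As a sanity check on that algebra, here's a throwaway script (not part of the system) that samples random joint distributions over A, B, C consistent with A1 and A2 and confirms that P(~B|A)=1 whenever P(A)>0:

```python
# Throwaway sanity check for the derivation above: any joint distribution
# over A, B, C satisfying A1 and A2 gives P(~B|A) = 1 (when P(A) > 0).
import itertools, random

def random_joint():
    """Random distribution over the 8 truth assignments to (A, B, C),
    with mass zeroed out on assignments forbidden by A1 and A2."""
    events = list(itertools.product([False, True], repeat=3))
    weights = [random.random() for _ in events]
    for i, (a, b, c) in enumerate(events):
        if a and b and not c:   # A1: P(C | A&B) = 1 forbids A & B & ~C
            weights[i] = 0.0
        if c and b:             # A2: P(~B | C) = 1 forbids B & C
            weights[i] = 0.0
    total = sum(weights)
    return {e: w / total for e, w in zip(events, weights)}

for _ in range(1000):
    p = random_joint()
    p_a = sum(w for (a, b, c), w in p.items() if a)
    p_notb_and_a = sum(w for (a, b, c), w in p.items() if a and not b)
    if p_a > 0:
        assert abs(p_notb_and_a / p_a - 1.0) < 1e-9   # P(~B|A) = 1
```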

Let's call this "rule-level reasoning". YKY has already asserted (private email) that this sort of reasoning is needed for abduction, although I disagree with him there: I think abduction should be handled by the standard Bayes-net belief propagation, rather than by reversing the direction of a rule via probabilistic reasoning. This means that I think, when we do "prolog trick" construction of the Bayes net, we don't just look for rules whose consequences are nodes in our current network; we also look for rules whose premises are in our current network. When we glue those rules onto the network, the consequences are added as uncertain variables, which means we now try to determine their truth. If we find any evidence one way or the other, it gets added to the network, and the network starts doing abductive reasoning automatically via belief propagation.
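
Here is a rough sketch of that construction loop as I picture it -- hypothetical names, ground/propositional rules only, with instantiation-by-resolution omitted:

```python
# Rough sketch of the construction loop described above (hypothetical names;
# ground/propositional rules only, so instantiation via resolution is omitted).
from collections import namedtuple

GroundRule = namedtuple("GroundRule", ["body", "head", "prob"])   # body: tuple of node names

def build_network(query_vars, rules):
    """Grow a Bayes-net skeleton outward from the query variables, attaching
    rules whose consequence matches a node (backward, Prolog-style) and also
    rules whose premises match nodes, so that evidence on the consequences
    lets ordinary belief propagation do the abduction."""
    nodes, edges = set(query_vars), set()
    frontier, used = list(query_vars), set()
    while frontier:
        var = frontier.pop()
        for i, rule in enumerate(rules):
            if i in used:
                continue
            if rule.head != var and var not in rule.body:
                continue                            # rule doesn't touch this node
            used.add(i)
            for premise in rule.body:
                edges.add((premise, rule.head))     # premise -> consequence
            for new in rule.body + (rule.head,):
                if new not in nodes:
                    nodes.add(new)
                    frontier.append(new)
    return nodes, edges        # CPTs built from rule.prob are attached afterwards

# Usage: the rule whose premise 'rain' is already in the network gets glued on,
# so later evidence about 'wet_grass' feeds back into P(rain) abductively.
rules = [GroundRule(("rain",), "wet_grass", 0.9),
         GroundRule(("cloudy",), "rain", 0.5)]
print(build_network(["rain"], rules))
```
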

My impression is that YKY prefers to look for rule candidates in basically the same way, but, before adding a rule to the network, to use rule-level reasoning to change its direction so that all reasoning is deduction. I don't see the point of this, but I do agree that rule-level reasoning is needed when we can't paste a rule into the network without creating a directed cycle.
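
(Detecting that condition is cheap; here's a quick sketch, again with hypothetical names, of checking whether gluing a rule in would close a directed cycle.)

```python
# Sketch of the cycle condition above: a rule premise -> head can be pasted in
# only if head cannot already reach any of the premises.
def creates_cycle(edges, body, head):
    """True if adding the edges premise->head for each premise in body would
    close a directed cycle in the existing edge set."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, set()).add(dst)
    # Depth-first reachability from head; if a premise is reachable, the new
    # premise->head edge would complete a cycle.
    stack, seen = [head], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(adjacency.get(node, ()))
    return any(premise in seen for premise in body)

# e.g. with a->b already in the net, the rule b -> a cannot be pasted in:
print(creates_cycle({("a", "b")}, body=("b",), head="a"))   # True
```
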

How should rule-level reasoning be implemented? I think the right way of doing it is simply to support conditional queries (a standard belief-propagation task, which complicates "prolog trick" network construction only in that we'd be looking for links between the query variables). When we want to perform rule-level reasoning, we just make a query in the form of the rule we want to construct. The resulting probability is the conditional probability for the rule! Easy as pie, and we don't have to worry about all the probability algebra I did in the explicit derivation above. (This gets approximated by belief propagation.)
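
In miniature, with a hand-built three-node net and brute-force enumeration standing in for belief propagation (purely a toy illustration):

```python
# Toy illustration: querying a small net in "rule form" directly yields the
# conditional probability to attach to the derived rule.  Enumeration stands
# in for belief propagation; the net and its numbers are made up.
import itertools

p_cloudy = 0.5
p_rain = {True: 0.8, False: 0.1}     # P(rain | cloudy), keyed by cloudy
p_wet = {True: 0.9, False: 0.2}      # P(wet_grass | rain), keyed by rain

def joint(cloudy, rain, wet):
    p = p_cloudy if cloudy else 1 - p_cloudy
    p *= p_rain[cloudy] if rain else 1 - p_rain[cloudy]
    p *= p_wet[rain] if wet else 1 - p_wet[rain]
    return p

def query(target, given):
    """P(target | given), each a dict of var -> bool, by brute enumeration."""
    num = den = 0.0
    for cloudy, rain, wet in itertools.product([False, True], repeat=3):
        world = {"cloudy": cloudy, "rain": rain, "wet_grass": wet}
        if any(world[v] != val for v, val in given.items()):
            continue
        p = joint(cloudy, rain, wet)
        den += p
        if all(world[v] == val for v, val in target.items()):
            num += p
    return num / den

# The derived rule "wet_grass -> ~rain" gets this conditional probability:
print(query({"rain": False}, {"wet_grass": True}))
```
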

This does not quite tidy up the completeness concerns. Essentially, I want to know that the reasoning is (an approximation of) complete reasoning given the Kolmogorov axioms and the semantics for our probabilistic generalizations, ie, "the probability of a statement with unbound variables is the probability that it will be true under a random instantiation."

Bayes net reasoning (belief prop) is approximately complete for propositional Horn-with-negation networks so long as the rules form no directed cycles. (It is exact if there are no cycles at all, and the single-loop case has also been analyzed exactly; the precise conditions under which belief prop gives exact answers are unknown.) With rule-level reasoning, it should be complete for rule-sets with (correctable) directed cycles as well. If that were established rigorously (showing, for example, that this sort of reasoning will not accidentally use a rule twice in different forms and get the probabilities wrong as a result), then what would remain would be precisely the question of whether our reasoning system is complete over our definition of probabilistic generalizations.

An approximate reasoning method that would be "obviously complete" would be to actually take random instantiations and use the frequency of their truth (ie, their average probability) to decide the probability of the generalization. YKY has proposed essentially this technique for the estimation of a generalization's probability. However, we also want to do lifted estimation: deduce the probability of a predicate as a marginal probability in the lifted network constructed by instantiating rules only enough to apply to the predicate. This lifted estimation is certainly correct under the assumption that there is no other relevant information, but certainly not correct for all cases. At present we have no formalism for combining the lifted reasoning with random sampling or other sources of information.

We need one.

(Tackling this will probably be the topic of my next post.)
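
(For reference, the random-instantiation baseline itself is trivial to sketch; the missing piece is combining its estimate with the lifted one. Toy code below, with ground_prob standing in for whatever supplies the probability of a ground instance.)

```python
# "Obviously complete" but expensive estimator mentioned above: estimate the
# probability of a generalization by averaging over random instantiations.
# ground_prob is a stand-in for whatever gives us the probability of a ground
# statement (evidence lookup, a constructed net, etc.) -- hypothetical interface.
import random

def estimate_generalization(statement_vars, domain, ground_prob, samples=10000):
    """Average probability of the statement under random instantiations of its
    free variables, per the semantics 'the probability of an open statement is
    the probability that it is true under a random instantiation'."""
    total = 0.0
    for _ in range(samples):
        binding = {v: random.choice(domain) for v in statement_vars}
        total += ground_prob(binding)
    return total / samples

# Toy usage: P(long_hair(X)) where 40% of the individuals have long hair.
domain = [f"person{i}" for i in range(100)]
has_long_hair = set(domain[:40])
est = estimate_generalization(["X"], domain,
                              lambda b: 1.0 if b["X"] in has_long_hair else 0.0)
print(est)   # ~0.4
```
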

Moving on to HOL...

Complications with #3

My reservations here arise essentially from my desire to take the "descent" approach I mentioned in my last post: start with system IG, an untyped HOL, and create a "prolog trick" reasoning system compatible with that. YKY's approach is an "ascent" path, which leaves me less sure of how we should approach soundness and completeness.

HOL is never classically complete, but it can be complete in various weaker senses. A popular option is to use Henkin semantics rather than classical semantics. IG is known to be sound and complete w.r.t. its FOL subset, which is not quite as good; as far as I know, its Henkin soundness and completeness is an open question. Another appealing approach would be to show soundness and completeness with respect to an interpretation of ZFC, typed HOL, or another standard system. We could investigate other untyped HOLs to find systems complete in stronger senses if we wanted.

That's all for today. :)

5 comments:

  1. Re your #1:

    Notice that the Prolog trick is entirely optional. We should always fall back on the BN formulation, which has rigorous and well-defined semantics, to get our bearings.

    I have not completely figured out the "Prolog trick", so it is still an actively developing idea. What I intend is for it to be faster and simpler than BN evaluation; the latter can be a major pain because we can't just use single-valued BNs for AGI (I assume you agree with this?). Most people (and I assume you too) would agree we need at least P(P), and if you agree to normalize the inner P to [0,1] then we get P(Z).

    Re your #2:

    In a BN, the 2 formulas:
    male(X) -> long_hair(X)
    ~male(X) -> long_hair(X)
    become the same Bayesian link. So, it seems that negation is absorbed into the BN formulation.

    In this approach, NAF (negation as failure) seems unnecessary. For example we can say:
    married(john) <0.01> "John is unlikely to be married"
    which can replace the classical logic statement:
    ~married(john)
    and replace the Prolog NAF statement:
    married(john) :- fail.

    As far as the BN formulation is concerned, deduction and abduction are both reduced to BN belief propagation. So they are *solved* problems. The question now is how to use the Prolog trick to approximate them, in order to build the prototype. As I said, there is a strong reason to use the trick, because we need P(P) or P(Z).

    Re your #3:

    It's a bad idea to think of soundness / completeness in terms of the Prolog trick, which is just an approximation / hack. When considering the theoretical soundness / completeness of the logic, *always* refer to the BN formulation. The Prolog trick is an approximation of the BN.

    It is not obvious to me how to reason about the completeness of HOBN. For that you need to consider the simpler case, FOBN, first. What does it mean for FOBN to be complete? The semantics is probabilistic rather than binary. It seems to be a relatively new and unexplored research area, and it may cost us too much time to have the semantics pinned down. Last time I checked, Brian Milch's Bayesian logic (BLOG), as a first-order probabilistic language, deals with probabilistic / possible-world semantics; but I haven't looked into details.

  2. YKY,

    #3: Yes, that is exactly what I'm trying to do: pin down FOBN before HOBN. I'm not trying to make a detailed soundness/completeness proof, though; I'm just trying to do enough that I feel we know what we are doing. The main product of this thinking so far is the realization that we need a more precise theory of how to integrate a predicate probability obtained by lifted reasoning with one obtained by counting true/false instances.

    #2: You are right.

    #1: Ok. I was misunderstanding the meaning of the term "prolog trick" to some extent. I was thinking of it as the method of constructing partial Bayesian networks, and also the naive independence assumptions that go with them. We need to talk more about this, then...

  3. A better explanation of what the "Prolog trick" is all about:

    For "exact" inference, for each query, we'd generate a BN from a KB of FO/HO rules. Then we evaluate that BN to give the answer.

    The Prolog trick is that we try to calculate intermediate truth values *during* the proof search, ie, calculating TVs "in situ" in the proof tree while constructing it. Moreover, the trick ignores the long-range probabilistic dependencies that exact BN inference would have to take into account.

    I *had* the impression that the trick would be faster and easier to code than BN inference. But I'm now having more doubts. Firstly, the gain in speed may be insignificant -- for example, we know that exact BN inference for poly-trees is linear. Secondly, the trickiness and the need to understand the probabilistic approximations require some thinking too -- right now I don't even know what exactly the trick is doing, as I have not laid out the details on paper; I'm just doing some hacks guided by intuition.

    Perhaps one thing you can help me do is to figure out how to evaluate the BN quickly and soundly (I mean, even approximations can be somehow justified to be sound).

  4. YKY,

    Well, currently I like the option of using belief propagation for your "exact" inference; this makes the same independence assumptions that you mention for the prolog trick. However, it does not incrementally construct probabilities, which does seem possible (under certain approx. assumptions) and useful... I will try to spell out the details as I see them sometime soon, and we'll see how well that matches with your intuition.

    P(P) would be very different from P(Z) as far as I can see. Doing P(Z) in a BN would be within the realm of existing research, since it's just a continuous-valued BN. P(P) would be incorporating higher-level probability in a BN, which I have not heard of and honestly do not know how to do at first sight... How do you represent the presence of 2 different levels of probability simultaneously? Where do you keep them? S is *not* equivalent to P(S)=1, whereas in P(Z) we do have S equiv. Z(S)=1. This means that we'd be throwing out a layer if we tried to represent P(P) as just a BN of continuous variables.

    For doing belief propagation over P(Z)... remember I mentioned Paul Rosenbloom before? Well, he's got continuous-valued belief propagation all coded up. It might seem a little silly to ask him for his code (he is using it for his own cognitive architecture, still under development), but he's got it working. I do not know what language, etc.

  5. Perhaps P(P) is the wrong notation. You tried to use P(height), ie, a P distribution over heights. So it is simple P after all.

    So I guess your P approach requires BNs with continuous distros. Whereas my P(Z) requires BNs with beta-distros. One additional problem in your P approach is how to represent distros over a great variety of domains (such as tallness, smartness, beautifulness, etc; while tallness is readily convertible to numbers, I wonder how you do the others).

    Whether P or P(Z), Rosenbloom would be very helpful! I was looking for just that kind of BN evaluator and couldn't find any, so I gave up last time =)
