
Thursday, August 5, 2010

My Current HOL Algorithm

The following is equivalent in strength to HOL Light. Continues this post.

  • As an extension of the notion of term (defined here), add the notation of polymorphically typed lambda calculus, an equality predicate, and the choice operator (I'll call it epsilon-- it's a function that takes a predicate and forms a term representing an arbitrary satisfier of that predicate, if one exists; otherwise, it just represents an arbitrary term). (A toy encoding of these terms is sketched just after this list.)
  • Take terms with boolean type as a drop-in replacement for atomic statements in that design. (It may seem strange to use them to replace atomic statements, since they are compound statements, but bear with me. Rules will correspond to valid sequents, rather than to compound statements in the logic.)
  • Use lambda-calculus unification as a drop-in replacement for first-order unification. This includes normalization; we could either store the normal forms of statements in the KB or let the unification algorithm normalize on the spot, depending on whether space efficiency of the KB or time efficiency of the unifier is a bigger concern.
  • Implement equational reasoning. Fundamentally, we treat an equality statement which has a given truth value, T, as if it represented two rules with that truth value: one justifying inferences in each direction, substituting one term for another. These rules cannot be stated as FOL rules, however, since they represent substitution of terms within larger statements. Nonetheless, the same basic propagation scheme for the truth values will work. In addition to the 2 pseudo-rules for each equality statement, we treat the system as knowing A=A for all terms A. Knuth-Bendix completion should be implemented eventually, which allows us to ignore one of the two inference directions for each equality statement without losing anything. Ultimately, I believe this means equational reasoning can be implemented as an extension to the normalization step of the unification algorithm (but I could be wrong here).
  • Finally, we also need one hypothetical reasoning rule: if assuming A implies B with truth value TV1, and assuming B implies A with truth value TV2, then A=B with truth value (& TV1 TV2) for an appropriate &-formula. Rather than implementing this as one complicated hypothetical reasoning step, it seems to me that the most convenient way would be to implement rule-level reasoning in general (ie, a query can be a rule, rules can be combined with other rules to produce new rules, etc) and then add this as a way of recording high-probability mutual implication.
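To make the first and third bullets concrete, here is a minimal sketch in Python of one possible term encoding and the beta-normalization step the unifier would rely on. The encoding, the constructor names, and the treatment of types as elided tags are all my own illustrative choices, not Genifer's actual data structures:

```python
from dataclasses import dataclass

# A toy term encoding. Types are elided entirely; equality and epsilon
# are just applied constants, as the bullet list above suggests.

@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Lam:
    var: str
    body: object

@dataclass(frozen=True)
class App:
    fun: object
    arg: object

def Eq(a, b):
    """The equality predicate, as a curried applied constant."""
    return App(App(Var("="), a), b)

def Eps(pred):
    """The choice operator applied to a predicate term."""
    return App(Var("epsilon"), pred)

def substitute(t, name, value):
    """Naive substitution: only safe when bound names are globally
    unique (capture-avoiding renaming is elided for brevity)."""
    if isinstance(t, Var):
        return value if t.name == name else t
    if isinstance(t, Lam):
        return t if t.var == name else Lam(t.var, substitute(t.body, name, value))
    return App(substitute(t.fun, name, value), substitute(t.arg, name, value))

def normalize(t):
    """Beta-normalize: the step the unification algorithm depends on."""
    if isinstance(t, App):
        f, a = normalize(t.fun), normalize(t.arg)
        if isinstance(f, Lam):
            return normalize(substitute(f.body, f.var, a))
        return App(f, a)
    if isinstance(t, Lam):
        return Lam(t.var, normalize(t.body))
    return t

# (\x. x) y = y normalizes so both sides of the equation are identical:
print(normalize(Eq(App(Lam("x", Var("x")), Var("y")), Var("y"))))
```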
That's all, folks!

PS:

In addition to the above reasoning system, for HOL Light, we also need the axiom of choice and the axiom of infinity. (The axiom of extensionality is covered by normalization.) Both of these axioms are very much "mathematical" in nature, and it isn't clear whether they are needed for more common-sense reasoning. (My suspicion is that they would emerge in other forms, ie, in coding the KB we'd encode something like these plus much more.)

Tuesday, July 27, 2010

Rule Combination

The rule combination problem is: if we have X->A and Y->A, and we get probabilistic information about both X and Y, how do we integrate the two resulting estimates of A?

The normal Bayes net semantics would give us an interval of possible probabilities, based on the probability of X&Y (which we may not know, but perhaps we could use some naive estimate). We could then take the midpoint of this interval, or some such. Call this combination approach "averaging".

Another approach is to use the noisy-or combination rule, which can be thought of as adding the two probabilities together (though we also subtract out the product of the two, which makes sure we don't add up to greater than 1). This is great for making highly mathematical systems (such as higher-order logic) probabilistic, because it captures the monotonic nature of logical reasoning: finding additional rules which apply can only bolster the probability of A, never decrease it. A system based on averaging will not be very good for doing this, because if several arguments don't work very well but one does, then an averaging approach will come out with a not-very-large probability; an additive approach, however, will come up with a large probability (as seems right).

A purely additive approach is limited, however: sometimes we want additional information to detract probability! The noisy-and rule does this. Noisy-and adds together the probabilities that an event *won't* happen that are contributed by the individual rules, in the exact way that noisy-or adds the probabilities that it *will* happen. This makes sense when we want new rules to represent additional reasons that an event might not happen. Noisy-and can complement noisy-or if desired, in a way that preserves the additive "mathematical" style of rule combination where it's desired, but allows subtractive combination where it's more appropriate. ("Bayesian logic programming" uses the approach of allowing specific predicates to declare themselves noisy-or predicates or noisy-and predicates; rules whose consequence is an or-predicate get treated additively, while rules whose consequence is an and-predicate get treated subtractively. YKY suggests an alternate approach, in which we declare individual rules to be either additive or subtractive; this seems a bit more powerful, and loses nothing.)
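Here is a small sketch of the three combination rules, plus one possible reading of the per-rule additive/subtractive declaration. The function names and the exact form of `combine` are my own guesses at a reasonable scheme, not a settled design:

```python
from math import prod

def averaging(ps):
    """Midpoint-style combination: just average the estimates."""
    return sum(ps) / len(ps)

def noisy_or(ps):
    """Additive: 1 - prod(1 - p_i). Finding more applicable rules can
    only raise the result, matching the monotonic flavor of logic."""
    return 1.0 - prod(1.0 - p for p in ps)

def noisy_and(ps):
    """Subtractive: the failure probabilities (1 - p_i) are combined
    noisy-or style, which works out to prod(p_i). More rules can only
    lower the result."""
    return prod(ps)

def combine(additive, subtractive):
    """One possible reading of the per-rule declaration: noisy-or the
    additive contributions, then let each subtractive rule scale the
    result down by its own survival probability."""
    return noisy_or(additive) * noisy_and(subtractive)
```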

One problem is that the probabilities given by additive and subtractive rules can no longer be learned simply by counting conditional frequencies. Learning does not become impossible (and perhaps not even very difficult), but the meaning of the numbers is no longer very local, since a probability must be calculated as a sum over many rules. (Think about it: if we learned X->A and Y->A by counting the conditional frequencies, our estimate for A in the presence of both X and Y would be way too high.) The averaging approach may allow simpler learning algorithms. Yet, it seems less powerful as a representational choice.

A problem related to rule combination is how we should reason using lifted estimates. If we have a probability for C, a formula which has some free variables, then the Genifer approach is to allow this to be instantiated and consider it a probability estimate for the instantiated formula as well. Supposing it instantiates to B, and we've also got a rule B->E, which connects it to a larger Bayes Net we're making in memory, the way to handle this is clear: it's fine, even necessary, for initial items in Bayes Nets to have probabilities like this. Suppose, though, that it instead instantiates to a formula that's in the middle of a network, Q->B->E. How do we use the info now? Again, we have an information combination problem. Similar thoughts about averaging, additive, and subtractive approaches apply.

Fortunately, these two problems are just examples of a single more general problem. Assume we have rules X->A and Y->B, where A and B unify to C (that is, instantiating either A or B can result in C). Further, treat unconditional probabilities (like A) as just conditional probabilities where the list of conditions is empty-- that is, X, Y, or both may be tautologous. This is the general form of the combination problem.

If this were all there was to it, it would boil down to that choice: averaging vs additive/subtractive combination, or perhaps some mix of the two. (We could declare rules to be either a frequency, a reason-for, or a reason-against; but we'd then need a special combination rule to tell us what to do when we had a mix of frequencies and additive/subtractive rules, and it's not at all clear what combination would be appropriate... the result would seem rather hackish.) However, there are further complications.

It seems appropriate to sometimes derive rules from one another. A->B and B->C can combine to give an abbreviated rule A->C, summarizing what knowing A does to C's probability. However, all of the combination rules so far will mess up if we allow such derivations to take place. An additive (or subtractive) rule will now take the additional rule into account as well, resulting in twice as much modification from A as is merited. An averaging rule will not do that, but if we have other knowledge about B, then an averaging rule will fallaciously discount its importance. The effect will multiply as more derivations are allowed. It seems that, essentially, we are taking the same information into account more than once. This suggests that we should avoid applying two different rules if we know that one is derived from the other. Ideally, it seems we should prefer the original rules to the derived rules; however, that's only the case when we have enough time to reconstruct all the steps which are accounted for in the derivation (to check them for the particular situation); before that point, it's less clear when we should trust which rule (a partial derivation may already take into account critical situation-specific info that would invalidate the derived rule, but on the other hand, a derived rule may summarize a long and valuable chain of reasoning which is essentially sound but which would take too long to replicate). Some heuristic for choosing between the two should be employed.
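To see the double counting concretely, here is a toy calculation. The numbers are invented, and the chain reading of A->C assumes B depends on nothing but A:

```python
# Derive A->C from A->B (0.8) and B->C (0.9), then naively noisy-or the
# derived rule together with the chain it summarizes.
p_ab, p_bc = 0.8, 0.9
p_ac_derived = p_ab * p_bc                    # 0.72, the summary rule A->C

chain_estimate = p_ac_derived                 # using the original chain only
naive_both = 1 - (1 - chain_estimate) * (1 - p_ac_derived)
print(chain_estimate, naive_both)             # 0.72 vs 0.9216: counted twice
```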

In other words: the rule combination problem is solved by first asking if one rule is derived from the other. If not, either averaging or additive/subtractive combination can be done (a choice which has to be made for the system); if so, a heuristic is employed to choose between the two rules.

This has a strong flavor of NARS revision, for those who know NARS.

YKY has pointed out an interesting example problem to use for thinking about rule combination. I will give two solutions. The problem is as follows: Sue is punctual. Punctual people arrive on time. However, Sue dies in a car crash on the way to a meeting. Dead people do not arrive on time. This is represented with "P" ("Sue is punctual"), "P->A" ("If someone is punctual she'll probably arrive on time"), "D" ("Sue is dead"), "D->~A" ("Dead people probably do not arrive on time"). Assume all the rules get properly instantiated to talk about Sue, and of course to talk about some other specific variables we're hiding (what time it is, arrive where, etc).

We get a very bad result if we just apply noisy-or: noticing that Sue is dead will actually increase her chances of arrival slightly! We do just a little better by averaging, but it is still not satisfactory. One good solution here is to model the rule for death as subtractive: being dead subtracts from your chances no matter what they were. This gets a good result; that is, it'll be very near zero.
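Plugging in hypothetical numbers (say punctual people arrive with probability 0.9 and dead people with probability 0.01) makes the contrast plain:

```python
# Hypothetical strengths: P(arrive | punctual) and P(arrive | dead).
p_punctual, p_dead = 0.9, 0.01

# Pure noisy-or: learning that Sue is dead *raises* the estimate.
print(1 - (1 - p_punctual) * (1 - p_dead))    # 0.901 > 0.9

# Averaging: a little better, but still far from zero.
print((p_punctual + p_dead) / 2)              # 0.455

# Death as a subtractive rule: scales the estimate down, near zero.
print(p_punctual * p_dead)                    # 0.009
```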

Another solution, though, can be hit upon: perhaps death overrides punctuality not because it is a negative rule, but rather because the punctuality rule must ultimately be seen as derived from a chain of reasoning which includes the assumption that the subject is alive! This seems more fundamentally right to me. It also brings up an idea: rules entered in the starting knowledge base may not all be "basic" in the sense of non-derived; rather, some rules should already be noted as derived from one another, even if we don't actually have all the rules necessary to replicate the derivation. Consider a system like Cyc. Cyc has different "layers" of concepts, ranging from physical concepts (hard, soft, long, short...) to social concepts (like, dislike, ...). The idea I'm putting forward is that in a probabilistic version of this sort of KB, it might be useful to tell the system that social phenomena ultimately depend on physical phenomena, even if we can't trace out the lines of dependence in a precise way. What this does is let the system know that physical facts will override social facts when the two suggest conflicting outcomes; this will usually be the correct inference. (There are exceptions, however; biological organisms tend to systematically violate rules of thumb which work for simpler physical systems, such as dissipation of heat (through homeostasis), Newton-style physics (through self-guided motion), et cetera.)

In fact, it might be useful to be able to learn the dependence relationships of rules without actually having derived one from another... This would reflect the way that we assume, based on large amounts of evidence, that biology is just a very complicated dance of chemistry (ie, biological rules follow from combinations of chemical rules) without actually knowing how everything works. That seems like dangerous territory, though, so it'll have to be a thought for later...

Sunday, July 11, 2010

Eliminating my HOL "Meta-Rules"

HOL systems, as I've mentioned before, are not usually specified just as inference rules on statements. Instead, they are specified indirectly, as inference rules on sequents. A sequent is a set of premises and a conclusion, representing a valid argument in the logic. (This is represented symbolically in the form A, B, C,... |- D where A, B, C,... is the set of premises and D is the conclusion. The |- symbol is called the turnstile.) So, specifying a system via a sequent calculus is a slightly curious practice: rather than just specifying inference rules directly, we specify a system of reasoning which comes up with valid inference rules. Historically, this move to the meta-level is what led to the field we call "proof theory".

My previous approach was to preserve this multi-level structure, calling the sequents "rules" and the sequent inference rules "meta-rules" (since they are rules for creating rules). However, it would be more efficient to eliminate the extra level if possible.

In general, this may be a complicated process. Some sequent derivation rules have special syntactical restrictions, such as "... X and Y share no free variables ...". Also consider sequent rules like "If A,B |- C, then A |- (B->C)". It isn't obvious that there is a simple way to turn a specification of a system in terms of sequent derivation rules into a direct specification consisting of inference rules on statements.

Typical systems, though, will obey a few nice properties. First, adding more premises to a valid sequent will always result in a valid sequent. Furthermore, many sequent derivation rules just propagate along the premises from the sequents they start with to the sequents they produce. For sequents which do that, the job is easy. Consider an example:

"From A, B, C... |- P and D, E, F... |- Q conclude A, B, C... D, E, F... |- (P&Q)"


In this case, we'd just translate the whole thing to the rule "From P and Q conclude (P&Q)". Rules of this sort can be eliminated from the meta-level immediately, which is good (every little bit helps).

Unfortunately, not all cases are so easy. Many inference rules will almost pass along the premises untouched, but make some little modifications to the list. These modifications will usually be to remove specific premises, occasionally to add premises, and quite rarely to modify the internals of premises. The example above, "If A,B |- C, then A |- (B->C)", is one example of this: one premise is eliminated in the process of drawing the conclusion. Getting rid of meta-rules of this sort requires cleverness.

First-order natural deduction systems handle that particular example in a nice way: at any time during reasoning, we may make a new assumption. This can be anything. We keep the list of assumptions that have been made. Now, say we just concluded something, S. One deduction move we can now make is to remove an assumption from the list (call it A), and stick it on the beginning of S with an arrow: A->S is our new conclusion, which is true in the new context (ie, the old context minus one assumption).

This general idea could be extended to any rules which add or eliminate premises: they are seen as just modifying the list of assumptions.
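Here is a tiny sketch of both kinds of meta-rule, using an invented Sequent structure (formulas are plain strings here, purely for illustration):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sequent:
    premises: frozenset
    conclusion: str

def conj(s1, s2):
    """A premise-propagating rule: from A,B,..|-P and D,E,..|-Q conclude
    A,B,..,D,E,..|-(P&Q). At the statement level this collapses to
    'from P and Q conclude (P&Q)'."""
    return Sequent(s1.premises | s2.premises,
                   f"({s1.conclusion}&{s2.conclusion})")

def discharge(s, a):
    """A premise-modifying rule: from A,B |- C conclude A |- (B->C).
    One assumption leaves the list and reappears inside the conclusion,
    just like discharging an assumption in natural deduction."""
    assert a in s.premises
    return Sequent(s.premises - {a}, f"({a}->{s.conclusion})")

s = Sequent(frozenset({"A", "B"}), "C")
print(discharge(s, "B"))  # premises={'A'}, conclusion='(B->C)'
```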

A practical implementation of this sort of idea for Genifer might be to engage in hypothetical reasoning via conditional queries: we make a query with a list of assumptions attached, to be used in addition to the actual facts. However, some issues may come up with respect to the difference between the classical conditional and the probabilistic conditional. I need to consider this more. (The approach is similar to my "rule-level reasoning", but rather than reasoning about rules using conditional queries, we reason about statements...)

This leaves rules which modify the sequents internally. The primary example of this is variable instantiation, but that's handled by unification! So, in summary, it appears to me that my "meta-level" is not needed in Genifer and can be eliminated with a little patience...

All of the above discussion was done with a translation of HOL Light in mind, but could apply to IG or an IG variant equally well.

Thursday, July 1, 2010

HOL, Paradox Free: a plan of attack, or two.

Looking at the Illative Combinatory Logic paper today, and thinking about paradox-free HOL.

There are plenty of paradox-free HOLs out there. The question is, how do we translate them into our system? Most HOLs seem to be described in sequent calculus. This means the system has an extra meta-level that ours lacks: it's not really specified as a system for reasoning about HOL statements; rather, it's specified as a system for reasoning about HOL proofs. This means there are at least 3 levels at which the translation can be made: (H) we could try to use the sequent derivation rules as rules in our system, (M) we could use the sequents themselves as rules in our system, and (L) we could use the conditional statements of the HOL as our rules. (I'm labeling these as High, Medium, and Low, in case you're wondering what the letters stand for.)

H might be easiest in that it would only require a finite number of rules to be added to the system. However, it would mean making room for sequents as a type of statement in the system, which could be awkward. We would still want to be able to make basic statements, ie, statements about actual things rather than just about what follows from what... so we would need to also add rules for using sequents to derive base facts from one another. On top of all this, we need to worry about how the probabilistic reasoning is done. It seems that the sequent reasoning would end up being crisp, which means the probabilistic reasoning would only happen when we applied sequents to uncertain base-statements. Again, this seems awkward.

M would require an infinite number of rules, meaning that we'd have to implement rule-level reasoning to derive new rules as they are needed. I discussed rule-level reasoning before; the additional sequent-derivation rules would have to be integrated with my earlier thoughts. For the most part, I think this would mean restricting my earlier suggestion to make sure that it's consistent for the same reasons the sequent rules are consistent. Figuring out the maximum level of rule-level reasoning I can preserve may be difficult. Figuring out the minimum level that's sufficient probably won't be, though! What would be somewhat difficult would be generalizing the backwards-chaining reasoner to create new rules for itself to backwards-chain on (in a sorta meta-backwards-chaining way).

L would require a doubly infinite number of rules (so to speak). The approach would again require rule-level reasoning, but this time rather than implementing a finite number of meta-rules (based on the rules for deriving new sequents) we'd appear to need an infinite number, specified by a finite number of meta-meta rules. This is obviously too much; so, we'd have to pull back and instead figure out a system which satisfied the constraints laid down by the sequent rules. In other words, we treat the sequent-calculus description of the logic as a specification of target features rather than a description of the deduction rules of a system. This is obviously a more difficult approach... an interesting challenge, but more difficult.

Also worth a mention: a tempting, but probably incorrect direction: merge the levels so that we don't have to deal with so many. We'd start allowing conditionals to compose with other language elements, like in regular FOL, so that we can have stuff like A->(B->C). We'd then try to use regular reasoning to implement all the different levels at once. This sounds powerful, but it would probably just create an inconsistent system, defeating the point.

Conclusion

Approach M seems to be the best plan, so I will get to work on that!

Saturday, June 26, 2010

Careful Ascent

YKY has suggested that the strategy will be as follows:

  1. Start with Horn clauses, replacing the classical conditional with a probabilistic one. This is relatively simple to reason about, even including abduction and lifted reasoning, with a Bayes-net construction algorithm that YKY calls the "prolog trick" (a minimal sketch of this construction appears after this list).
  2. Add in negation. If we were dealing with the classical conditional, we'd now have full FOL clauses-- so we've got what can be thought of as a probabilistic generalization of FOL clauses. The extra complication to the Bayesian networks is trivial (and doesn't need to fall back on negation-as-failure), but I want to examine some less obvious complications.
  3. Where the "prolog trick" uses FOL resolution (to minimally instantiate a rule so as to apply to the present case), put in HOL resolution so that we can start using HOL predicates and probabilistically generalizing over HOL variables. (The second one there is exciting.) This also seems simple, except for the bit about having to implement HOL unification. Again, I'll see about complications.
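As promised, a minimal sketch of step 1. I am guessing at the shape of the construction from the description above; the rule syntax, the example KB, and the omission of occurs checks and renaming rules apart are all simplifications:

```python
# Variables are uppercase strings; atoms are tuples.

def is_var(t):
    return isinstance(t, str) and t[:1].isupper()

def walk(t, s):
    while is_var(t) and t in s:
        t = s[t]
    return t

def unify(x, y, s):
    """First-order unification over atoms; returns an extended
    substitution dict or None (no occurs check, for brevity)."""
    x, y = walk(x, s), walk(y, s)
    if x == y:
        return s
    if is_var(x):
        return {**s, x: y}
    if is_var(y):
        return {**s, y: x}
    if isinstance(x, tuple) and isinstance(y, tuple) and len(x) == len(y):
        for a, b in zip(x, y):
            s = unify(a, b, s)
            if s is None:
                return None
        return s
    return None

def subst_term(t, s):
    t = walk(t, s)
    if isinstance(t, tuple):
        return tuple(subst_term(a, s) for a in t)
    return t

# Rules are (head, body, P(head | body)); the KB is hypothetical.
rules = [(("arrives", "X"), [("punctual", "X")], 0.9)]
facts = {("punctual", "sue"): 1.0}

def build_net(goal, net=None):
    """Instantiate each rule just enough to cover the goal, then recurse
    on the body atoms, collecting a Bayes-net fragment."""
    net = net if net is not None else {}
    if goal in net or goal in facts:
        return net
    net[goal] = []
    for head, body, p in rules:
        s = unify(head, goal, {})
        if s is not None:
            parents = [subst_term(b, s) for b in body]
            net[goal].append((parents, p))
            for b in parents:
                build_net(b, net)
    return net

print(build_net(("arrives", "sue")))
# {('arrives', 'sue'): [([('punctual', 'sue')], 0.9)]}
```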
Complications with #2

Completeness for FOL and completeness for Horn clauses are, to an extent, different beasts. Horn-clause reasoning is complete with respect to the goal of establishing "concrete" facts (statements which do not include alternatives). FOL completeness, in contrast, wants to be able to reason about truth values for any statement that the system can express.

It isn't hard to apply FOL resolution to Horn clauses to deduce more rules which are consequences of the existing ones (in fact, it's quite easy). It's also not hard to allow conditional queries to the Bayesian networks which we'll be constructing, which would allow us to deduce new rules from existing information.  My worry, though, is that we'll be able to have two rules whose information cannot be integrated by the system.

An example:

A1: P(C|A&B)=1
A2: P(~B|C)=1

We should be able to deduce P(~B|A)=1.

C1: P(A&B&C)=P(A&B) [from A1]
C2: P(B|C)=0 [from A2]
C3: P(B&C)=0 [mult. C2 by P(C)]
C4: P(A&B&C)=P(A|B&C)*P(B&C)=0 [decompose P(A&B&C) and then apply C3]
C5: P(A&B)=0 [from C1, C4]
C6: P(B|A)=0 [divide by P(A), assuming it's not 0]
C7: P(~B|A)=1

Let's call this "rule-level reasoning". YKY has already asserted (private email) that this sort of reasoning is needed for abduction, although I disagree with him there: I think abduction should be handled by the standard Bayes Net belief propagation, rather than by reversing the direction of a rule via probabilistic reasoning. This means that I think, when we do "prolog trick" construction of the Bayes net, we don't just look for rules whose consequences are nodes in our current network; we also look for rules whose premises are in our current network. When we glue those rules onto the network, the consequences are added as uncertain variables, which means we now try to determine their truth. If we find any evidence one way or the other, it gets added to the network, and the network starts doing abductive reasoning automatically via belief propagation.

My impression is that YKY prefers the idea of looking for rule candidates in basically the same way, but before adding the rule to the network, use rule-level reasoning to change its direction so that all reasoning is deduction. I don't see the point of this, but I do agree that rule-level reasoning is needed when we can't paste a rule into the network without creating a directed cycle.

How should rule-level reasoning be implemented? I think the right way of doing it is just to implement the possibility of conditional queries (a standard belief propagation task, which would complicate "prolog trick" network construction just in that we'd be looking for links between the query variables). When we want to perform rule-level reasoning, we just make a query that is in the form of the rule we want to construct. The resulting probability is the conditional probability for the rule! Easy as pie, and we don't have to worry about all the probability algebra stuff I did in the explicit derivation above. (This gets approximated by belief propagation.)
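As a sanity check on both the derivation above and the "query in the form of the rule" idea, here is a brute-force version using enumeration rather than belief propagation. The joint distribution is hypothetical, chosen only to respect A1 and A2:

```python
from itertools import product

# A1 (P(C|A&B)=1) rules out worlds where A and B hold but C fails;
# A2 (P(~B|C)=1) rules out worlds where B and C both hold. We take a
# uniform distribution over the remaining worlds.
weights = {}
for world in product([True, False], repeat=3):
    a, b, c = world
    weights[world] = 0.0 if (a and b and not c) or (b and c) else 1.0
total = sum(weights.values())
joint = {w: p / total for w, p in weights.items()}

def prob(pred):
    return sum(p for world, p in joint.items() if pred(*world))

def query(target, given):
    """P(target | given): the query has the form of the rule we want."""
    return prob(lambda *w: target(*w) and given(*w)) / prob(given)

# The derived rule P(~B|A)=1 falls out:
print(query(lambda a, b, c: not b, lambda a, b, c: a))  # 1.0
```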

This does not quite tidy up the completeness concerns. Essentially, I want to know that the reasoning is (an approximation of) complete reasoning given the Kolmogorov axioms and the semantics for our probabilistic generalizations, ie, "the probability of a statement with unbound variables is the probability that it will be true under a random instantiation."

Bayes net reasoning (belief prop) is approximately complete for propositional Horn-with-negation networks so long as the rules form no directed cycles. (It is precisely complete if there are no cycles at all. Actually, it's also been proven to be precise for networks with single cycles. The precise conditions for belief prop giving exact answers are unknown.) With rule-level reasoning, it should be complete for rule-sets with (correctable) directed cycles as well. If that were established rigorously (showing for example that this sort of reasoning will not accidentally use a rule twice in different forms and get the probabilities wrong as a result), then what would remain would be precisely the question of whether our reasoning system is complete over our definition of probabilistic generalizations.

An approximate reasoning method that would be "obviously complete" would be to actually take random instantiations and use the frequency of their truth (ie, their average probability) to decide the probability of the generalization. YKY has proposed essentially this technique for the estimation of a generalization's probability. However, we also want to do lifted estimation: deduce the probability of a predicate as a marginal probability in the lifted network constructed by instantiating rules only enough to apply to the predicate. This lifted estimation is certainly correct under the assumption that there is no other relevant information, but certainly not correct for all cases. At present we have no formalism for combining the lifted reasoning with random sampling or other sources of information.

We need one.

(Tackling this will probably be the topic of my next post.)
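Before moving on, here is the "obviously complete" sampling estimator from a few paragraphs back, as code. The domain and predicate are hypothetical stand-ins for ground facts in the KB:

```python
import random

domain = ["alice", "bob", "carol", "dave"]
likes_coffee = {"alice", "bob", "carol"}

def prob_of_generalization(pred, samples=10_000):
    """Estimate the probability of a generalization as the frequency
    with which random instantiations come out true."""
    hits = sum(pred(random.choice(domain)) for _ in range(samples))
    return hits / samples

# P(likes_coffee(X)) under random instantiation of X: about 0.75.
print(prob_of_generalization(lambda x: x in likes_coffee))
```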

Moving on to HOL...

Complications with #3

My reservations here arise essentially from my desire to take the "descent" approach I mentioned in my last post: start with system IG, an untyped HOL, and create a "prolog trick" reasoning system compatible with that. YKY's approach is an "ascent" path, which catches me off-guard as to how we should approach soundness and completeness.

HOL is never classically complete, but it can be complete in various weaker senses. A popular option is to use Henkin semantics rather than classical semantics. IG is known to be sound and complete w.r.t. its FOL subset, which is not quite as good; as far as I know, its Henkin soundness and completeness are an open question. Another appealing approach would be to show soundness and completeness with respect to an interpretation of ZFC, typed HOL, or another standard system. We could investigate other untyped HOLs to find systems complete in stronger senses if we wanted.

That's all for today. :)