The normal bayes net semantics would give us an interval of possible probabilities, based on the probability of X&Y (which we may not know, but perhaps we could use some naive estimate). We could then take the midpoint of this or some such. Call this combination approach "averaging".

Another approach is to use the noisy-or combination rule, which can be thought of as adding the two probabilities together (though we also subtract out the product of the two, which makes sure we don't add up to greater than 1). This is great for making highly

*mathematical*systems (such as higher-order logic) probabilistic, because it captures the monotonic nature of logical reasoning: finding additional rules which apply can only bolster the probability of A, never decrease it. A system based on averaging will not be very good for doing this, because if several arguments don't work very well but one does, then an averaging approach will come out with a not-very-large probability; an additive approach, however, will come up with a large probability (as seems right).

A purely additive approach is limited, however: sometimes we want additional information to detract probability! The noisy-and rule does this. Noisy-and adds together the probability that an event

*won't*happen that are contributed by the individual rules, in the exact way that noisy-or adds the probabilities that it *will* happen. This makes sense when we want new rules that we find to represent additional reasons that an event might

*not*happen. This can be complement noisy-or if desired, in a way that preserves the additive "mathematical" style of rule combination where it's desired, but allows subtractive combination where it's more appropriate. ("Bayesian logic programming" uses the approach of allowing specific predicates to declare themselves noisy-or predicates or noisy-and predicates; rules whose consequence is an or-predicate will get treated additively, while rules whose consequence is an and-predicate will get treated as subtractive. YKY suggests an alternate approach, in which we declare individual rules to be either additive or subtractive; this seems a bit more powerful, and loses nothing.)

One problem is that the probabilities given by additive and subtractive rules no longer can be learned simply by counting conditional frequencies. Learning does not become impossible (and perhaps not even very difficult), but the meaning of the numbers is no longer very local, since a probability must be calculated as a sum of many rules. (Think about it: if we learned X->A and Y->A by counting the conditional frequencies, our estimate for A in the presence of both X and Y would be way too high.) The averaging approach may allow simpler learning algorithms. Yet, it seems less powerful as a representational choice.

A problem related to rule combination is how we should reason using lifted estimates. If we have a probability for C, a formula which has some free variables, then the Genifer approach is to allow this to be instantiated and consider it a probability estimate for the instantiated formula as well. Supposing it instantiates to B, and we've also got a rule B->E, which connects it to a larger Bayes Net we're making in memory, the way to handle this is clear: it's fine, even necessary, for initial items in Bayes Nets to have probabilities like this. Suppose, though, that it instead instantiates to a formula that's in the middle of a network, Q->B->E. How do we use the info now? Again, we have an information combination problem. Similar thoughts about averaging, additive, and subtractive approaches apply.

Fortunately, these two problems are just examples of a single more general problem. Assume we have rules X->A and Y->B, where A and B unify to C (that is, instantiating either A or B can result in C). Further, treat unconditional probabilities (like A) as just conditional probabilities where the list of conditions is empty-- that is, X , Y. or both may be tautologous. This is the general form of the combination problem.

If this was all there was to it, it would boil down to that choice: averaging vs additive/subtractive combination, or perhaps some mix of the two. (We could declare rules to be either a frequency, a reason-for, or a reason-against; but we'd then need a special combination rule to tell us what to do when we had a mix of frequencies and additive/subtractive rules, and it's not clear at all what combination would be appropriate... the result would seem rather hackish.) However, there are further complications.

It seems appropriate to sometimes derive rules from one another. A->B and B->C can combine to give an abbreviated rule A->C, summarizing what knowing A does to C's probability. However, all of the combination rules so far will mess up if we allow such derivations to take place. An additive (or subtractive) rule will now take the additional rule into account as well, resulting in twice as much modification from A as is merited. An averaging rule will not do that, but if we have other knowledge about B, then an averaging rule will fallaciously discount its importance. The effect will multiply as more derivations are allowed. It seems that, essentially, we are taking the same information into account more than once. This suggests that we should avoid applying two different rules if we know that one is derived from the other. Ideally, it seems we should prefer the original rules to the derived rules; however, that's only the case when we have enough time to reconstruct all the steps which are accounted for in the derivation (to check them for the particular situation); before that point, it's less clear when we should trust which rule (a partial derivation may already take into account critical situation-specific info that would invalidate the derived rule, but on the other hand, a derived rule may summarize a long and valuable chain of reasoning which is essentially sound but which would take too long to replicate). Some heuristic for choosing between the two should be employed.

In other words: the rule combination problem is solved by first asking if one rule is derived from the other. If not, either averaging or additive/subtractive combination can be done (a choice which has to be made for the system); if so, a heuristic is employed to choose between the two rules.

This has a strong flavor of NARS revision, for those who know NARS.

YKY has pointed out an interesting example problem to use for thinking about rule combination. I will give two solutions. The problem is as follows: Sue is punctual. Punctual people arrive on time. However, Sue dies in a car crash on the way to a meeting. Dead people do not arrive on time. This is represented with "P" ("Sue is punctual"), "P->A" ("If someone is punctual she'll probably arrive on time"), "D" ("Sue is dead"), "D->~A" ("Dead people probably do not arrive on time"). Assume all the rules get properly instantiated to talk about Sue, and of course to talk about some other specific variables we're hiding (what time it is, arrive where, etc).

We get a very bad result if we just apply noisy-or: noticing that Sue is dead actually will increase her chances of arrival slightly! We get just a little better by averaging, but still not satisfactory. One good solution here is to model the rule for death as subtractive: being dead subtracts from your chances no matter what they were. This gets a good result; that is, it'll be very near zero.

Another solution, though, can be hit upon: perhaps death does not override punctuality because it is a negative rule, but rather, because the punctuality rule must ultimately be seen as derived from a chain of reasoning which includes the assumption that the subject is alive! This seems more fundamentally right to me. It also brings up an idea: it might be the case that rules entered in the starting knowledge base may not all be "basic" in the sense of non-derived: rather, it may be that some rules should already be noted as derived from one another, even if we don't actually have all the rules necessary to replicate the derivation. Consider a system like Cyc. Cyc has different "layers" of concepts, ranging from physical concepts (hard, soft, long, short...) to social concepts (like, dislike, ...). The idea I'm putting forward is that in a probabilistic version of this sort of KB, it might be useful to tell the system that social phenomena ultimately depend on physical phenomena, even if we can't trace out the lines of dependence in a precise way. What this does is lets the system know that physical facts will override social facts when the two suggest conflicting outcomes; this will usually be the correct inference. (There are exceptions, however; biological organisms tend to systematically violate rules of thumb which work for simpler physical systems, such as dissipation of heat (through homeostasis), Newton-style physics (through self-guided motion), et cetera.)

In fact, it might be useful to be able to

*learn*the dependence relationships of rules

*without*actually having derived one from another... This would reflect the way that we assume, based on large amounts of evidence, that biology is just a very complicated dance of chemistry (ie, biological rules follow from combinations of chemical rules) without actually knowing how everything works. That seems like dangerous territory, though, so it'll have to be a thought for later...

Read my blog post on "causality" =) It's still a stub but I think it is key to answering your question. What you call "basic rules" and "derived rules" seem to be just causality in disguise. A causal Bayes net can determine what rules are basic and what are derived, and thus determine when to use averaging and when to use overriding. (I think only 2 types of combination are needed: averaging and overriding. Overriding is a special case of noisy-AND in which one rule wins).

ReplyDeleteTo summarize my algorithm:

Given that 2 chains of inference (chain A and chain B) converge on a single conclusion, whose respective conclusions may be mutually in conflict.

1. If *no* node in chain A is in a causal relation with any nodes in chain B, nor vice versa, (ie, the 2 chains have no causally related nodes), then APPLY AVERAGING RULE.

2. If some node(s) in chain A causes some node(s) in chain B to be invalid (thus making chain B invalid), then chain A's result OVERRIDES chain B's. Vice versa. Note: we need a causal Bayes net to draw that information.

In the "Sue is punctual" example, at least one way to look at it is that Sue's death _causes_ her to become unable to be punctual. You can also say that Sue's being alive causes her to be able to be punctual, but that her death causes her to NOT be alive, thus the origin of that chain is rendered invalid.

Another way to express this problem is as follows. You are given conditional probabilities P(A|X) and P(A|Y). What is P(A|X,Y)? The laws of probability do not say. It could be anything at all. For example if A = X xor Y, then P(A|X) = P(A|Y) = 1, P(A|X,Y) = 0. It is possible, but you would not derive it by any of the proposed rules such as addition, multiplication, min, max, or average.

ReplyDeleteThis problem occurs in data compression when X and Y are contexts and A is the next symbol or bit that we wish to predict. The values P(A|X) and P(A|Y) are based on counting how many times A occurs in context X and Y. We don't have statistics for the case of X and Y both occurring, either because they have not occurred together before, or no statistics were collected because of finite computing resources.

Solomonoff induction would say to average all possible functions f: (0,1)^2 -> (0,1) weighted by 2^-|f| where by |f| I mean the length of the description of f in bits. This suggests some kind of averaging because

f(X,Y) = 1/2

f(X,Y) = X

f(X,Y) = Y

f(X,Y) = min(X,Y)

f(X,Y) = max(X,Y)

f(X,Y) = (X+Y)/2

all have the proper form and are algorithmically simple. Of course you can probably think of many other functions. Whatever the result, it must be symmetric because swapping X and Y does not change |f|. Also, it should be monotonic in X and Y because functions like f(X,Y) = X are simpler than f(X,Y) = 1-X.

Experimentally the best result I have obtained is to average in the logistic domain:

f(X,Y) = squash((stretch(X)+stretch(Y))/2)

where

stretch(X) = ln(X/(1-X))

squash(X) = 1/(1 + e^-X) = stretch^-1(X)

which is what I use in PAQ8 and derivative programs like LPAQ and ZPAQ. Or more precisely, I use a weighted average in the logistic domain where the weights are adaptive based on past performance over the classes of contexts containing X and Y and the weights might be selected by other contexts.

For example, suppose we have

P(Sue is on time|Sue is punctual) = 0.9

P(Sue is on time|Sue is dead) = 0.0001

P(Sue is on time|Sue is punctual, Sue is dead) = 0.0009

which at least seems intuitively right.

Matt,

ReplyDeleteSeems like a reasonable approach to averaging, yes. I didn't go into the averaging formula in much detail at all, but I know there are lengthy discussions of the issue elsewhere...

I think I should emphasize, though, that noisy-or is *not* properly regarded as a solution to the same problem (ie, how to estimate P(A|BC) from P(A|B) and P(A|C)). Instead, I think it's better to regard it as a way *around* the problem: we give up the idea that we're recording conditional probabilities, and instead we record contributions which come from different factors. Informally, we refer to the number we keep for A->B as "the probability that A, when present, will cause B". (This should not be taken as literal causation unless we've got a causal BN.)

YKY,

How does your alg handle the case in the post where we have A->B and B->C and a rule A->C?

What I say for this case is that it's critical to also know whether A->C was derived from A->B and B->C, which tells us that it doesn't provide more information once we've taken A->B and B->C into account. If it isn't derived from them, then it is proper to combine its output with B->C using some combination rule. If it is, it would be improper to do so.

To take a concrete example, let's say A, B, and C are heat, fire, and smoke-- so all the rules are oriented in the right causal direction: heat->fire, fire->smoke, heat->smoke. heat->smoke is a derived rule, ie, it simply summarizes the probability estimate that we get from combining the two other rules. Suppose smoke is our query. Now, suppose we also happen to know fire->light, which is still causally oriented; and furthermore we know light. The rule heat->smoke doesn't summarize the effect of our knowing light, so we should prefer the argument involving rules heat->fire, fire->light, fire->smoke. (Likelihood propagation needs to be used to take the light into account, but that is luckily not a rule combination problem. :D) However, since the causal link is oriented away from fire rather than towards it, it appears that your suggested algorithm doesn't know this. Am I right in this understanding? It looks to me like your alg will not know what to do here-- it will not discard the A->C rule, because there are no causal links directed towards A.

@Matt:

ReplyDeleteYour analysis seems to be a very restricted application of Solomonoff induction. According to this page:

http://wiki.lesswrong.com/wiki/Solomonoff_induction

we should take the weighted average of all functions that are compatible with the "data". In your case it seems that there is no data at all, just 2 inputs X and Y.

After considering some real-life examples, I found out that different combination rules seem to apply to examples that share the exact structural form (ie, sometimes it looks like noisy-AND, sometimes noisy-OR, etc.) And I also found out that causality plays a crucial role in all examples where it's present. This is interesting, because causal relations cannot be read off the Bayes net.

If we apply Solomonoff induction with more information, I guess we would arrive at a solution more similar to causal reasoning, but I'm still learning about this topic....