The rule combination problem is: if we have X->A and Y->A, and we get probabilistic information about both X and Y, how do we integrate the two resulting estimates of A?
The normal Bayes net semantics would give us an interval of possible probabilities, based on the probability of X&Y (which we may not know, though perhaps we could use some naive estimate). We could then take the midpoint of this interval, or something similar. Call this combination approach "averaging".
Another approach is to use the noisy-or combination rule, which can be thought of as adding the two probabilities together (though we also subtract out the product of the two, which makes sure the result never exceeds 1). This is great for making highly mathematical systems (such as higher-order logic) probabilistic, because it captures the monotonic nature of logical reasoning: finding additional rules which apply can only bolster the probability of A, never decrease it. A system based on averaging will not be very good at this: if several arguments don't work very well but one does, an averaging approach will come out with a not-very-large probability, whereas an additive approach will come up with a large probability (as seems right).
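To make the contrast concrete, here is a minimal sketch of the two combination styles. The function names and the example numbers are mine, chosen only for illustration; noisy-or is written in its usual product form, which for two rules reduces to p + q - p*q.

```python
def noisy_or(probs):
    """Additive combination: 1 - product of (1 - p_i)."""
    miss = 1.0
    for p in probs:
        miss *= (1.0 - p)  # chance that every rule fails to produce A
    return 1.0 - miss


def average(probs):
    """Plain midpoint-style combination of the estimates."""
    return sum(probs) / len(probs)


# Two weak arguments for A and one strong one (illustrative numbers).
arguments = [0.1, 0.1, 0.9]
print("noisy-or:", noisy_or(arguments))
print("average: ", average(arguments))
```

With these numbers noisy-or lands a little above 0.9 while the average drops below 0.4, which is the behavior described above: the weak arguments drag an average down but can only help an additive rule.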
A purely additive approach is limited, however: sometimes we want additional information to detract from the probability! The noisy-and rule does this. Noisy-and adds together the probabilities, contributed by the individual rules, that an event won't happen, in exactly the way that noisy-or adds the probabilities that it *will* happen. This makes sense when we want the new rules we find to represent additional reasons that an event might not happen. Noisy-and can complement noisy-or if desired, in a way that preserves the additive "mathematical" style of rule combination where it's wanted but allows subtractive combination where that's more appropriate. ("Bayesian logic programming" uses the approach of allowing specific predicates to declare themselves noisy-or predicates or noisy-and predicates; rules whose consequence is an or-predicate get treated additively, while rules whose consequence is an and-predicate get treated subtractively. YKY suggests an alternate approach, in which we declare individual rules to be either additive or subtractive; this seems a bit more powerful, and loses nothing.)
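A rough sketch of how the two rules might sit side by side follows. The noisy-and function matches the description above (the reasons against A are combined the way noisy-or combines reasons for it). The way I mix the two in combine(), by multiplying the additive result by the subtractive one, is only one plausible choice and is an assumption of this sketch, not something fixed by either proposal.

```python
import math


def noisy_or(reasons_for):
    """Additive rules: probability that at least one reason produces A."""
    return 1.0 - math.prod(1.0 - p for p in reasons_for)


def noisy_and(reasons_against):
    """Subtractive rules: each q is a probability that A is blocked.
    A survives only if no reason against it fires."""
    return math.prod(1.0 - q for q in reasons_against)


def combine(reasons_for, reasons_against):
    # ASSUMPTION: additive support is computed first, then scaled down by
    # the subtractive rules. Other ways of mixing the two are possible.
    return noisy_or(reasons_for) * noisy_and(reasons_against)


print(combine([0.6, 0.5], []))     # support only
print(combine([0.6, 0.5], [0.5]))  # same support, one reason against
```

Under this choice, declaring a rule additive or subtractive (YKY's per-rule approach) just decides which of the two lists its probability goes into.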
One problem is that the probabilities given by additive and subtractive rules no longer can be learned simply by counting conditional frequencies. Learning does not become impossible (and perhaps not even very difficult), but the meaning of the numbers is no longer very local, since a probability must be calculated as a sum of many rules. (Think about it: if we learned X->A and Y->A by counting the conditional frequencies, our estimate for A in the presence of both X and Y would be way too high.) The averaging approach may allow simpler learning algorithms. Yet, it seems less powerful as a representational choice.
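The parenthetical point can be checked on a toy data set. The counts below are invented purely for illustration: X and Y usually occur together, so each counted conditional frequency already contains most of the other rule's contribution, and adding them with noisy-or counts that contribution twice.

```python
from collections import Counter

# Invented observation counts, keyed by (x, y, a).
counts = Counter({
    (1, 1, 1): 6, (1, 1, 0): 2,  # X and Y both present
    (1, 0, 1): 1, (1, 0, 0): 1,  # X only
    (0, 1, 1): 1, (0, 1, 0): 1,  # Y only
})


def cond_freq(counts, condition):
    """Frequency of A among the observations satisfying `condition`."""
    matching = sum(n for obs, n in counts.items() if condition(obs))
    positive = sum(n for obs, n in counts.items()
                   if condition(obs) and obs[2] == 1)
    return positive / matching


p_a_given_x = cond_freq(counts, lambda o: o[0] == 1)
p_a_given_y = cond_freq(counts, lambda o: o[1] == 1)
p_a_given_xy = cond_freq(counts, lambda o: o[0] == 1 and o[1] == 1)

# Noisy-or applied directly to the counted frequencies.
combined = p_a_given_x + p_a_given_y - p_a_given_x * p_a_given_y

print("counted P(A|X), P(A|Y):", p_a_given_x, p_a_given_y)
print("counted P(A|X,Y):      ", p_a_given_xy)
print("noisy-or of the two:   ", combined)
```

Here the noisy-or estimate comes out well above the frequency actually observed when both X and Y hold, so the rule strengths for an additive system would have to be fitted jointly rather than read off the counts.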
A problem related to rule combination is how we should reason using lifted estimates. If we have a probability for C, a formula which has some free variables, then the Genifer approach is to allow this to be instantiated and consider it a probability estimate for the instantiated formula as well. Supposing it instantiates to B, and we've also got a rule B->E, which connects it to a larger Bayes Net we're making in memory, the way to handle this is clear: it's fine, even necessary, for initial items in Bayes Nets to have probabilities like this. Suppose, though, that it instead instantiates to a formula that's in the middle of a network, Q->B->E. How do we use the info now? Again, we have an information combination problem. Similar thoughts about averaging, additive, and subtractive approaches apply.
Fortunately, these two problems are just examples of a single more general problem. Assume we have rules X->A and Y->B, where A and B unify to C (that is, instantiating either A or B can result in C). Further, treat unconditional probabilities (like A) as just conditional probabilities where the list of conditions is empty; that is, X, Y, or both may be tautologous. This is the general form of the combination problem.
If this was all there was to it, it would boil down to that choice: averaging vs additive/subtractive combination, or perhaps some mix of the two. (We could declare rules to be either a frequency, a reason-for, or a reason-against; but we'd then need a special combination rule to tell us what to do when we had a mix of frequencies and additive/subtractive rules, and it's not clear at all what combination would be appropriate... the result would seem rather hackish.) However, there are further complications.
It seems appropriate to sometimes derive rules from one another. A->B and B->C can combine to give an abbreviated rule A->C, summarizing what knowing A does to C's probability. However, all of the combination rules so far will mess up if we allow such derivations to take place. An additive (or subtractive) rule will now take the additional rule into account as well, resulting in twice as much modification from A as is merited. An averaging rule will not do that, but if we have other knowledge about B, then an averaging rule will fallaciously discount its importance. The effect will multiply as more derivations are allowed. It seems that, essentially, we are taking the same information into account more than once. This suggests that we should avoid applying two different rules if we know that one is derived from the other. Ideally, it seems we should prefer the original rules to the derived rules; however, that's only the case when we have enough time to reconstruct all the steps which are accounted for in the derivation (to check them for the particular situation); before that point, it's less clear when we should trust which rule (a partial derivation may already take into account critical situation-specific info that would invalidate the derived rule, but on the other hand, a derived rule may summarize a long and valuable chain of reasoning which is essentially sound but which would take too long to replicate). Some heuristic for choosing between the two should be employed.
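A small numeric sketch of both failure modes, under simplifying assumptions of my own: A is known to be true, C has no cause other than B (so the derived strength of A->C is just the product of the two links), and all the numbers are made up.

```python
def noisy_or(probs):
    miss = 1.0
    for p in probs:
        miss *= (1.0 - p)
    return 1.0 - miss


p_b_given_a = 0.9
p_c_given_b = 0.8

# Derived rule A->C, ASSUMING C can only come about through B.
derived_a_to_c = p_b_given_a * p_c_given_b

# Failure 1: additive combination. Reasoning through the chain already
# gives the same estimate as the derived rule; treating the derived rule
# as a second, independent argument inflates it.
via_chain = p_b_given_a * p_c_given_b
double_counted = noisy_or([via_chain, derived_a_to_c])

# Failure 2: averaging. Suppose we now learn that B is definitely true.
# The chain gives P(C) = P(C|B), but the stale derived rule pulls the
# average back toward its own value.
via_chain_b_known = p_c_given_b
averaged = (via_chain_b_known + derived_a_to_c) / 2

print("chain alone:             ", via_chain)
print("chain + derived, additive:", double_counted)
print("B known, chain alone:    ", via_chain_b_known)
print("B known, averaged:       ", averaged)
```

In the first case A's influence is counted twice; in the second, the direct knowledge about B is discounted by a rule that only summarized what we knew before observing B.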
In other words: the rule combination problem is solved by first asking if one rule is derived from the other. If not, either averaging or additive/subtractive combination can be done (a choice which has to be made for the system); if so, a heuristic is employed to choose between the two rules.
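The procedure can be sketched as follows. Everything here is hypothetical scaffolding: the Rule class, the derived_from bookkeeping (direct derivations only, recorded by rule name), and the "prefer the original rule" heuristic are placeholders I made up to show the control flow, not a worked-out design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    prob: float                           # estimate this rule gives for the conclusion
    derived_from: frozenset = frozenset()  # names of the rules it was derived from


def is_derived_from(rule, other):
    return other.name in rule.derived_from


def noisy_or_pair(p, q):
    return p + q - p * q


def prefer_original(r1, r2):
    # Placeholder heuristic: always trust the rule that was not derived.
    return r2 if is_derived_from(r1, r2) else r1


def combine(r1, r2, merge=noisy_or_pair, choose=prefer_original):
    """Merge independent rules; pick one if either is derived from the other."""
    if is_derived_from(r1, r2) or is_derived_from(r2, r1):
        return choose(r1, r2).prob
    return merge(r1.prob, r2.prob)


x_a = Rule("X->A", 0.3)
y_a = Rule("Y->A", 0.5)
b_c = Rule("B->C", 0.8)
a_c = Rule("A->C", 0.72, derived_from=frozenset({"A->B", "B->C"}))

print(combine(x_a, y_a))  # unrelated rules: merged
print(combine(a_c, b_c))  # one derived from the other: heuristic picks one
```

The merge argument is where the system-wide choice between averaging and additive/subtractive combination would plug in, and choose is where a better heuristic than "always prefer the original" would go.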
This has a strong flavor of NARS revision, for those who know NARS.
YKY has pointed out an interesting example problem to use for thinking about rule combination. I will give two solutions. The problem is as follows: Sue is punctual. Punctual people arrive on time. However, Sue dies in a car crash on the way to a meeting. Dead people do not arrive on time. This is represented with "P" ("Sue is punctual"), "P->A" ("If someone is punctual she'll probably arrive on time"), "D" ("Sue is dead"), "D->~A" ("Dead people probably do not arrive on time"). Assume all the rules get properly instantiated to talk about Sue, and of course to talk about some other specific variables we're hiding (what time it is, arrive where, etc).
We get a very bad result if we just apply noisy-or: noticing that Sue is dead actually will increase her chances of arrival slightly! We get just a little better by averaging, but still not satisfactory. One good solution here is to model the rule for death as subtractive: being dead subtracts from your chances no matter what they were. This gets a good result; that is, it'll be very near zero.
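With some made-up numbers the three outcomes look like this. I assume the punctuality rule gives A a probability of 0.9 and that the death rule says the dead arrive on time with probability 0.01 (equivalently, it contributes 0.99 toward ~A); both figures are illustrative only.

```python
p_arrive_if_punctual = 0.9
p_arrive_if_dead = 0.01
p_not_arrive_if_dead = 1.0 - p_arrive_if_dead

# Noisy-or: both rules are treated as reasons FOR arriving.
noisy_or = (p_arrive_if_punctual + p_arrive_if_dead
            - p_arrive_if_punctual * p_arrive_if_dead)

# Averaging the two estimates of A.
averaged = (p_arrive_if_punctual + p_arrive_if_dead) / 2

# Subtractive: death is a reason AGAINST arriving, so it scales down
# whatever chance the other rules had established.
subtractive = p_arrive_if_punctual * (1.0 - p_not_arrive_if_dead)

print("noisy-or:   ", noisy_or)
print("averaging:  ", averaged)
print("subtractive:", subtractive)
```

Noisy-or ends up slightly above the 0.9 we started with, averaging still leaves Sue with close to even odds of arriving, and only the subtractive treatment pushes the estimate down near zero.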
Another solution, though, can be hit upon: perhaps death does not override punctuality because it is a negative rule, but rather, because the punctuality rule must ultimately be seen as derived from a chain of reasoning which includes the assumption that the subject is alive! This seems more fundamentally right to me. It also brings up an idea: it might be the case that rules entered in the starting knowledge base may not all be "basic" in the sense of non-derived: rather, it may be that some rules should already be noted as derived from one another, even if we don't actually have all the rules necessary to replicate the derivation. Consider a system like Cyc. Cyc has different "layers" of concepts, ranging from physical concepts (hard, soft, long, short...) to social concepts (like, dislike, ...). The idea I'm putting forward is that in a probabilistic version of this sort of KB, it might be useful to tell the system that social phenomena ultimately depend on physical phenomena, even if we can't trace out the lines of dependence in a precise way. What this does is lets the system know that physical facts will override social facts when the two suggest conflicting outcomes; this will usually be the correct inference. (There are exceptions, however; biological organisms tend to systematically violate rules of thumb which work for simpler physical systems, such as dissipation of heat (through homeostasis), Newton-style physics (through self-guided motion), et cetera.)
In fact, it might be useful to be able to learn the dependence relationships of rules without actually having derived one from another... This would reflect the way that we assume, based on large amounts of evidence, that biology is just a very complicated dance of chemistry (i.e., biological rules follow from combinations of chemical rules) without actually knowing how everything works. That seems like dangerous territory, though, so it'll have to be a thought for later...