9  Shortcut rules



The fundamental rules above are in principle all we need to draw inferences from other inferences. From them, however, it is possible to derive some “shortcut” rules.

9.1 Falsity and truth cannot be altered by additional knowledge

Suppose that sentence \(\mathsfit{X}\) is judged to be completely impossible conditional on sentence \(\mathsfit{Z}\):

\[ \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) = 0 \]

It can then be proved, from the fundamental rules, that \(\mathsfit{X}\) is also completely impossible if we add information to \(\mathsfit{Z}\). That is, for any sentence \(\mathsfit{Y}\) we’ll also have

\[ \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Y}\land \mathsfit{Z}) = 0 \]


Try to prove this. (Hint: try using the and-rule one or more times.)

What if we use \(\mathsfit{X}\) for \(\mathsfit{Y}\), that is, what if we acquire knowledge that \(\mathsfit{X}\) is actually true? Then it can be proved that all probability calculations break down. The problem is that \(\mathsfit{X}\) and \(\mathsfit{Z}\) turn out to be mutually contradictory, so all inferences are starting from contradictory premises. Just as in formal logic, from contradictory premises we can obtain any conclusion whatsoever.

Note that this problem does not arise, however, if \(\mathsfit{X}\) is only extremely improbable conditional on \(\mathsfit{Z}\) – say with a probability of \(10^{-100}\) – rather than flat-out impossible. In practical applications we often approximate extremely small probabilities by \(0\), and probabilities extremely close to \(1\) by \(1\). If the probability calculations then break down, we must step back and correct the approximation.

By using the not-rule it is possible to prove that full certainty about a sentence behaves in a similar manner. If sentence \(\mathsfit{X}\) is judged to be completely certain conditional on sentence \(\mathsfit{Z}\):

\[ \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) = 1 \]

then, from the fundamental rules, \(\mathsfit{X}\) is also completely certain if we add information to \(\mathsfit{Z}\). That is, for any sentence \(\mathsfit{Y}\) we’ll also have

\[ \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Y}\land \mathsfit{Z}) = 1 \]
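As a numerical sanity check (not a proof), the following minimal Python sketch builds a small joint probability table for \(\mathsfit{X}\) and \(\mathsfit{Y}\) conditional on \(\mathsfit{Z}\), in which \(\mathsfit{X}\) is impossible. The numbers in the table and the helper name `prob` are assumptions made only for this illustration. The sketch also shows the breakdown that occurs when we try to condition on \(\mathsfit{X}\) itself.

```python
from fractions import Fraction as F

# Hypothetical joint probabilities P(X=x ∧ Y=y | Z), chosen only for illustration.
# X is impossible given Z: every entry with x=True has probability 0.
joint = {
    (True, True): F(0),
    (True, False): F(0),
    (False, True): F(3, 10),
    (False, False): F(7, 10),
}

def prob(event, given=lambda x, y: True):
    """Return P(event | given ∧ Z) by summing entries of the joint table."""
    num = sum(p for (x, y), p in joint.items() if event(x, y) and given(x, y))
    den = sum(p for (x, y), p in joint.items() if given(x, y))
    if den == 0:
        # The conditional itself is impossible: contradictory premises.
        raise ZeroDivisionError("conditional has probability 0")
    return num / den

p_x = prob(lambda x, y: x)                                        # P(X | Z)
p_x_given_y = prob(lambda x, y: x, given=lambda x, y: y)          # P(X | Y ∧ Z)
p_x_given_not_y = prob(lambda x, y: x, given=lambda x, y: not y)  # P(X | ¬Y ∧ Z)
print(p_x, p_x_given_y, p_x_given_not_y)

# Conditioning on X itself, which is impossible given Z, cannot be computed.
try:
    prob(lambda x, y: y, given=lambda x, y: x)
    breaks_down = False
except ZeroDivisionError:
    breaks_down = True
print("calculation breaks down:", breaks_down)
```

In this table the probability of \(\mathsfit{X}\) stays at zero whichever of \(\mathsfit{Y}\) or \(\lnot\mathsfit{Y}\) we add to the conditional, while adding \(\mathsfit{X}\) leaves no consistent value to compute.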

9.2 Boolean algebra

It is possible to show that all rules you may know from Boolean algebra are a consequence of the fundamental rules of § 8.4. So we can always make the following convenient replacements anywhere in a probability expression:

Derived rules: Boolean algebra

\[ \begin{gathered} \lnot\lnot \mathsfit{X}= \mathsfit{X} \\[1ex] \mathsfit{X}\land \mathsfit{X}= \mathsfit{X} \\[1ex] \mathsfit{X}\lor \mathsfit{X}= \mathsfit{X} \\[1ex] \mathsfit{X}\land \mathsfit{Y}= \mathsfit{Y}\land \mathsfit{X} \\[1ex] \mathsfit{X}\lor \mathsfit{Y}= \mathsfit{Y}\lor \mathsfit{X} \\[1ex] \mathsfit{X}\land (\mathsfit{Y}\lor \mathsfit{Z}) = (\mathsfit{X}\land \mathsfit{Y}) \lor (\mathsfit{X}\land \mathsfit{Z}) \\[1ex] \mathsfit{X}\lor (\mathsfit{Y}\land \mathsfit{Z}) = (\mathsfit{X}\lor \mathsfit{Y}) \land (\mathsfit{X}\lor \mathsfit{Z}) \\[1ex] \lnot (\mathsfit{X}\land \mathsfit{Y}) = \lnot \mathsfit{X}\lor \lnot \mathsfit{Y} \\[1ex] \lnot (\mathsfit{X}\lor \mathsfit{Y}) = \lnot \mathsfit{X}\land \lnot \mathsfit{Y} \end{gathered} \]

For example, a partial proof of the rule \(\mathsfit{X}\land \mathsfit{X}= \mathsfit{X}\), called “and-idempotence”, from the four fundamental rules goes as follows:

\[ \begin{aligned} &\mathrm{P}(\mathsfit{X}\land \mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z})&& \\[1ex] &\qquad= \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{X}\land \mathsfit{Z}) \cdot \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z})&& &&\text{\small ∧-rule} \\[1ex] &\qquad= 1 \cdot \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z})&& &&\text{\small truth-rule} \\[1ex] &\qquad= \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) \end{aligned} \]

and with a similar procedure it can be shown that \(\mathsfit{X}\land \mathsfit{X}\) can be replaced with \(\mathsfit{X}\) no matter where it appears. The above proof shows that the and-idempotence rule is tightly connected with the truth-rule of inference.
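The nine replacements listed above can at least be checked as identities of ordinary two-valued logic by brute force over all truth assignments. The Python sketch below does this; it is only a truth-table check, not a derivation from the fundamental rules of inference, and the labels used as dictionary keys (other than “and-idempotence”) are our own naming assumptions.

```python
from itertools import product

# Each entry maps a label to a pair (left-hand side, right-hand side).
identities = {
    "double negation":    (lambda x, y, z: not (not x),    lambda x, y, z: x),
    "and-idempotence":    (lambda x, y, z: x and x,        lambda x, y, z: x),
    "or-idempotence":     (lambda x, y, z: x or x,         lambda x, y, z: x),
    "and-commutativity":  (lambda x, y, z: x and y,        lambda x, y, z: y and x),
    "or-commutativity":   (lambda x, y, z: x or y,         lambda x, y, z: y or x),
    "and-distributivity": (lambda x, y, z: x and (y or z),
                           lambda x, y, z: (x and y) or (x and z)),
    "or-distributivity":  (lambda x, y, z: x or (y and z),
                           lambda x, y, z: (x or y) and (x or z)),
    "De Morgan (and)":    (lambda x, y, z: not (x and y),
                           lambda x, y, z: (not x) or (not y)),
    "De Morgan (or)":     (lambda x, y, z: not (x or y),
                           lambda x, y, z: (not x) and (not y)),
}

def holds(lhs, rhs):
    """True if both sides agree for every truth assignment of x, y, z."""
    return all(lhs(*v) == rhs(*v) for v in product([False, True], repeat=3))

results = {name: holds(lhs, rhs) for name, (lhs, rhs) in identities.items()}
for name, ok in results.items():
    print(f"{name}: {ok}")
```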

9.3 Subtle importance of apparently trivial rules

Some of the fundamental or derived rules may seem obvious or unimportant, but actually are of extreme importance in data science. For instance the and-idempotence rule effectively asserts that whenever we draw inferences, redundant information or data is automatically counted only once.

This amazing feature saves us from a lot of headaches. Imagine that an AI decision agent at the assembly line has been given background information equivalent to saying that if an electronic component passes the heating test (\(\mathsfit{h}\)), then its probability of early failure (\(\mathsfit{f}\)) is only 10%:

\[\mathrm{P}(\mathsfit{f}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{h}\land \mathsfit{Z}) = 0.1\]

A new voltage test has also been devised, and if a component passes this test (\(\mathsfit{v}\)) then its probability of early failure is also only 10%:

\[\mathrm{P}(\mathsfit{f}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{v}\land \mathsfit{Z}) = 0.1\]

However, it is discovered that the voltage test works in exactly the same way as the heating test – they’re basically the same test: \(\mathsfit{v}=\mathsfit{h}\). This means that if a component passes the heating test, then it has automatically passed the voltage test as well:

\[\mathrm{P}(\mathsfit{v}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{h}\land \mathsfit{Z}) = 1\]

or equivalently, \(\mathsfit{v}\land \mathsfit{h}= \mathsfit{h}\land \mathsfit{h}= \mathsfit{h}\).

Now suppose that we inadvertently give our AI decision agent the redundant information that the component has passed the heating test and (therefore) the voltage test. What will the agent say about the probability of early failure, given this duplicate information? Let’s calculate:

\[ \begin{aligned} &\mathrm{P}(\mathsfit{f}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{v}\land \mathsfit{h}\land \mathsfit{Z})&& \\[1ex] &\qquad= \frac{\mathrm{P}(\mathsfit{f}\land \mathsfit{v}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{h}\land \mathsfit{Z})}{ \mathrm{P}(\mathsfit{v}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{h}\land \mathsfit{Z}) } &&\text{\small ∧-rule} \\[1ex] &\qquad= \frac{\mathrm{P}(\mathsfit{f}\land \mathsfit{v}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{h}\land \mathsfit{Z})}{1} =\mathrm{P}(\mathsfit{f}\land \mathsfit{v}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{h}\land \mathsfit{Z}) &&\text{\small initial probability} \\[1ex] &\qquad= \mathrm{P}(\mathsfit{v}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{h}\land \mathsfit{Z}) \cdot \mathrm{P}(\mathsfit{f}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{h}\land \mathsfit{Z}) &&\text{\small ∧-rule} \\[1ex] &\qquad= 1 \cdot \mathrm{P}(\mathsfit{f}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{h}\land \mathsfit{Z}) &&\text{\small truth cannot be altered} \\[1ex] &\qquad= 0.1 &&\text{\small initial probability} \end{aligned} \]

Luckily the agent correctly detected the redundancy of the sentence \(\mathsfit{v}\) (“the component passed the voltage test”) and automatically discarded it, thanks to the truth-rule, or equivalently the and-idempotence rule.

This feature is of paramount importance in machine learning and data-driven engineering: the “features” that we give as an input to a machine-learning classifier could contain redundancies that we don’t recognize owing to the complexity of the data space. But if the classifier makes inferences according to the four fundamental rules, it will automatically discard any redundant features.
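The component example can be reproduced numerically with a small joint probability table over \(\mathsfit{f}\), \(\mathsfit{h}\), \(\mathsfit{v}\). In the Python sketch below only the value \(\mathrm{P}(\mathsfit{f}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{h}\land \mathsfit{Z}) = 0.1\) and the identity \(\mathsfit{v}=\mathsfit{h}\) come from the text; the probability of passing the heating test, the failure probability for components that do not pass it, and the helper name `prob` are assumptions introduced just to make the table complete.

```python
from fractions import Fraction as F

P_H = F(1, 2)              # assumed P(h | Z); not given in the text
P_F_GIVEN_H = F(1, 10)     # P(f | h ∧ Z), from the text
P_F_GIVEN_NOT_H = F(2, 5)  # assumed P(f | ¬h ∧ Z); not given in the text

# Build the joint table P(f ∧ h ∧ v | Z) with repeated use of the and-rule.
joint = {}
for h in (True, False):
    p_h = P_H if h else 1 - P_H
    p_f_true = P_F_GIVEN_H if h else P_F_GIVEN_NOT_H
    for f in (True, False):
        p_f = p_f_true if f else 1 - p_f_true
        for v in (True, False):
            # v is the same test as h: P(v | h) = 1 and P(v | ¬h) = 0.
            p_v = F(1) if v == h else F(0)
            joint[(f, h, v)] = p_h * p_f * p_v

def prob(event, given):
    """Return P(event | given ∧ Z) by summing entries of the joint table."""
    num = sum(p for w, p in joint.items() if event(*w) and given(*w))
    den = sum(p for w, p in joint.items() if given(*w))
    return num / den

p_f_h = prob(lambda f, h, v: f, lambda f, h, v: h)         # P(f | h ∧ Z)
p_f_v = prob(lambda f, h, v: f, lambda f, h, v: v)         # P(f | v ∧ Z)
p_f_vh = prob(lambda f, h, v: f, lambda f, h, v: v and h)  # P(f | v ∧ h ∧ Z)
print(p_f_h, p_f_v, p_f_vh)
```

All three probabilities coincide: supplying the duplicate sentence \(\mathsfit{v}\) next to \(\mathsfit{h}\) does not change the result.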

9.4 Law of total probability or “extension of the conversation”

Suppose we have a set of \(n\) sentences \(\set{\mathsfit{H}_1, \mathsfit{H}_2, \dotsc, \mathsfit{H}_n}\) having these two properties:

  • They are mutually exclusive, meaning that the “and” of any two of them is false, given some background knowledge \(\mathsfit{Z}\):

\[ \mathrm{P}(\mathsfit{H}_1\land\mathsfit{H}_2\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{Z}) = 0\ , \quad \mathrm{P}(\mathsfit{H}_1\land\mathsfit{H}_3\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{Z}) = 0\ , \quad \dotsc \ , \quad \mathrm{P}(\mathsfit{H}_{n-1}\land\mathsfit{H}_n\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{Z}) = 0 \]

  • They are exhaustive, meaning that the “or” of all of them is true, given the background knowledge \(\mathsfit{Z}\):

    \[ \mathrm{P}(\mathsfit{H}_1\lor \mathsfit{H}_2 \lor \dotsb \lor \mathsfit{H}_n \nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{Z}) = 1 \]

Then the probability of a sentence \(\mathsfit{X}\), conditional on \(\mathsfit{Z}\), is equal to a combination of probabilities conditional on \(\mathsfit{H}_1,\mathsfit{H}_2,\dotsc\):

Derived rule: extension of the conversation

\[ \begin{aligned} &\mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) \\[2ex] &\quad{}= \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{H}_1 \land \mathsfit{Z})\cdot \mathrm{P}(\mathsfit{H}_1 \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) + \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{H}_2 \land \mathsfit{Z})\cdot \mathrm{P}(\mathsfit{H}_2 \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) + \dotsb + \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{H}_n \land \mathsfit{Z})\cdot \mathrm{P}(\mathsfit{H}_n \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) \end{aligned} \]

This rule is useful when it is difficult to assess the probability of a sentence conditional on the background information, but it is easier to assess the probabilities of that sentence conditional on several auxiliary sentences. These often represent hypotheses (this is why we used the symbol \(\mathsfit{H}\) for them) that exclude one another, and of which we know at least one is true. The name “extension of the conversation” for this derived rule comes from the fact that we are able to call the additional sentences into play.

This situation occurs very often in concrete applications, especially in problems where the probabilities of several competing hypotheses have to be assessed.
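As a minimal sketch of how the rule is applied numerically, the Python function below sums the products \(\mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{H}_i \land \mathsfit{Z})\cdot \mathrm{P}(\mathsfit{H}_i \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z})\). The function name `extend_conversation`, its arguments, and the three-hypothesis numbers are assumptions made for illustration. The check that the probabilities of the hypotheses sum to 1 is only a necessary consequence of mutual exclusivity and exhaustivity; it cannot verify those two properties by itself.

```python
import math

def extend_conversation(p_x_given_h, p_h):
    """Return P(X | Z) from the lists P(X | H_i ∧ Z) and P(H_i | Z).

    The hypotheses H_i are assumed to be mutually exclusive and exhaustive
    given Z; this function can only check that their probabilities sum to 1.
    """
    if len(p_x_given_h) != len(p_h):
        raise ValueError("need one conditional probability per hypothesis")
    if not math.isclose(sum(p_h), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities of the hypotheses must sum to 1")
    return sum(px * ph for px, ph in zip(p_x_given_h, p_h))

# Hypothetical numbers for three competing hypotheses H_1, H_2, H_3.
p_h = [0.5, 0.3, 0.2]          # P(H_i | Z)
p_x_given_h = [0.1, 0.2, 0.4]  # P(X | H_i ∧ Z)

p_x = extend_conversation(p_x_given_h, p_h)
print(round(p_x, 6))  # 0.05 + 0.06 + 0.08, approximately 0.19
```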

9.5 Bayes’s theorem

Probably the most famous – or infamous – rule derived from the laws of inference is Bayes’s theorem. It allows us to relate the probability in which a sentence \(\mathsfit{Y}\) appears in the proposal and another sentence \(\mathsfit{X}\) in the conditional, with the probability in which they are exchanged:

[Figure: Bayes’s theorem guest-starring in The Big Bang Theory]
Derived rule: Bayes’s theorem

\[ \mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{X}\land \mathsfit{Z}) = \frac{\mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Y}\land \mathsfit{Z})\cdot \mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z})}{\mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z})} \]

Obviously this rule can only be used if \(\mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) > 0\), that is, if the sentence \(\mathsfit{X}\) is not false conditional on \(\mathsfit{Z}\).


Prove Bayes’s theorem from the fundamental rules of inference.

Bayes’s theorem is extremely useful when we want to assess the probability of a sentence, typically a hypothesis, given some conditional, typically data; and we can easily assess the probability of the data conditional on the hypothesis. Note, however, that the sentences \(\mathsfit{Y}\) and \(\mathsfit{X}\) in the theorem can be about anything whatsoever: \(\mathsfit{Y}\) does not always need to be a “hypothesis”, and \(\mathsfit{X}\) “data”.
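A direct transcription of the theorem into Python might look like the sketch below. The function name `bayes`, its argument names, and the example numbers are assumptions made for illustration; exact fractions are used so that the arithmetic can be followed by hand.

```python
from fractions import Fraction as F

def bayes(p_x_given_y, p_y, p_x):
    """Return P(Y | X ∧ Z) from P(X | Y ∧ Z), P(Y | Z) and P(X | Z)."""
    if p_x <= 0:
        # The theorem cannot be used if X is impossible conditional on Z.
        raise ValueError("P(X | Z) must be positive")
    return p_x_given_y * p_y / p_x

# Hypothetical numbers, chosen only to illustrate the calculation.
p_y = F(1, 50)          # P(Y | Z)
p_x_given_y = F(9, 10)  # P(X | Y ∧ Z)
p_x = F(1, 20)          # P(X | Z)

p_y_given_x = bayes(p_x_given_y, p_y, p_x)
print(p_y_given_x)  # (9/10 · 1/50) / (1/20) = 9/25
```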


9.6 Bayes’s theorem & extension of the conversation

Bayes’s theorem is often used with several sentences \(\set{\mathsfit{Y}_1, \mathsfit{Y}_2, \dotsc, \mathsfit{Y}_n}\) that are mutually exclusive and exhaustive. Typically these represent competing hypotheses. In this case the probability of the sentence \(\mathsfit{X}\) in the denominator can be expressed using the rule of extension of the conversation:

Derived rule: Bayes’s theorem with extension of the conversation

\[ \mathrm{P}(\mathsfit{Y}_1 \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{X}\land \mathsfit{Z}) = \frac{\mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Y}_1 \land \mathsfit{Z})\cdot \mathrm{P}(\mathsfit{Y}_1 \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z})}{ \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Y}_1 \land \mathsfit{Z})\cdot \mathrm{P}(\mathsfit{Y}_1 \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) + \dotsb + \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Y}_n \land \mathsfit{Z})\cdot \mathrm{P}(\mathsfit{Y}_n \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) } \]

and similarly for \(\mathsfit{Y}_2\) and so on.

We will use this form of Bayes’s theorem very frequently.
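In this form the denominator is simply the sum of the numerators for all hypotheses, so the whole set of probabilities \(\mathrm{P}(\mathsfit{Y}_i \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{X}\land \mathsfit{Z})\) can be computed in one pass. The Python sketch below illustrates this; the function name `posteriors` and the numbers for three hypotheses are assumptions made for illustration, and the hypotheses are taken to be mutually exclusive and exhaustive without the code checking it.

```python
from fractions import Fraction as F

def posteriors(p_x_given_y, p_y):
    """Return the list of P(Y_i | X ∧ Z) for all hypotheses Y_i.

    p_x_given_y[i] is P(X | Y_i ∧ Z) and p_y[i] is P(Y_i | Z). The Y_i are
    assumed to be mutually exclusive and exhaustive given Z.
    """
    # Numerators of Bayes's theorem, one per hypothesis.
    numerators = [px * py for px, py in zip(p_x_given_y, p_y)]
    # Denominator: P(X | Z) by extension of the conversation.
    p_x = sum(numerators)
    if p_x <= 0:
        raise ValueError("P(X | Z) must be positive")
    return [num / p_x for num in numerators]

# Hypothetical numbers for three competing hypotheses Y_1, Y_2, Y_3.
p_y = [F(1, 2), F(3, 10), F(1, 5)]           # P(Y_i | Z)
p_x_given_y = [F(1, 10), F(1, 5), F(2, 5)]   # P(X | Y_i ∧ Z)

post = posteriors(p_x_given_y, p_y)
print(post)  # numerators 5/100, 6/100, 8/100 divided by their sum 19/100
```

With these numbers the hypothesis that was least probable before learning \(\mathsfit{X}\) becomes the most probable one afterwards, because \(\mathsfit{X}\) is much more probable under it than under the others.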

9.7 The many facets of Bayes’s theorem

Bayes’s theorem is a very general result of the fundamental rules of inference, valid for any sentences \(\mathsfit{X},\mathsfit{Y},\mathsfit{Z}\). This generality leads to many uses and interpretations.

The theorem is often proclaimed to be the rule according to which we “update our beliefs”. The meaning of this proclamation is the following. Let’s say that at some point \(\mathsfit{Z}\) represents all your knowledge. Your degree of belief about some sentence \(\mathsfit{Y}\) is then (at least in theory) the value of \(\mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z})\). At some later point, let’s say that you get to know – maybe thanks to an observation you made – that the sentence \(\mathsfit{X}\) is true. Your whole knowledge at that point is represented no longer by \(\mathsfit{Z}\), but by \(\mathsfit{X}\land \mathsfit{Z}\). Your degree of belief about \(\mathsfit{Y}\) is then given by the value of \(\mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{X}\land\mathsfit{Z})\). Bayes’s theorem allows you to find your degree of belief about \(\mathsfit{Y}\) conditional on your new state of knowledge, from the one conditional on your old state of knowledge.

This chronological element, however, comes only from this particular way of using Bayes’s theorem. The theorem can more generally be used to connect any two states of knowledge \(\mathsfit{Z}\) and \(\mathsfit{X}\land\mathsfit{Z}\), no matter their temporal order, even if they happen simultaneously, and even if they belong to two different agents.


Using Bayes’s theorem and the fundamental laws of inference, prove that if \(\mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z})=1\), that is, if you already know that \(\mathsfit{X}\) is true in your current state of knowledge \(\mathsfit{Z}\), then

\[ \mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{X}\land \mathsfit{Z}) = \mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) \]

that is, your degree of belief about \(\mathsfit{Y}\) doesn’t change.

Is this result reasonable?

Study reading
  • §§ 4.1–4.3 in Medical Decision Making give one more point of view on Bayes’s theorem.

  • Ch. 3 of Probability

  • A graphical explanation of how Bayes’s theorem works mathematically (using a specific interpretation of the theorem):