9  Shortcut rules

Published

2024-09-28

The fundamental rules introduced in chapter  8 are all we need, and all an AI needs, in order to draw inferences from other inferences and from initial data.

From them, however, it is possible to derive some “shortcut” rules than can make the inferences shorter and faster. The situation is similar to what happens with some rules in algebra: for instance, we know that whenever we find the expression

\[ (a+b) \cdot (a-b) \]

then we can automatically substitute it with

\[ a^2 - b^2 \]

no matter the values of \(a\) and \(b\). The rule \((a+b) \cdot (a-b) = a^2-b^2\) is not a new algebraic rule: it’s simply the result of the application of the rules for addition \(+\) and multiplication \(\cdot\), and indeed we could just apply them directly:

\[ \begin{aligned} (a+b) \cdot (a-b) &=a\cdot a + b\cdot a - a\cdot b - b\cdot b \\ &=a^2 + b\cdot a - b\cdot a - b^2 \\ &=a^2 - b^2 \end{aligned} \]

But if we remember that they always lead to the result \(a^2-b^2\), then we can directly use the “shortcut” rule \((a+b) \cdot (a-b) = a^2-b^2\) and save ourselves some time.

 Likewise with the four rules of inference. Some particular sequences of application of the rules occur very often. We can then simply memorize the starting and final steps of these sequences, and use them directly, skipping all the steps in between. These shortcut rules are not only useful for saving time, however. We shall see that they reveal interesting and intuitive inference patterns, which are implicit in the four inference rules.

It is possible and legitimate to implement these shortcut rules in an AI agent, besides the four fundamental ones. Such an agent will arrive at the same results and decisions of an identical AI agent that doesn’t use the shortcut rules – but a little faster.

Here are the shortcut rules we’ll frequently use in the rest of the course:

9.1 Falsity and truth cannot be altered by additional knowledge

Suppose that sentence \(\mathsfit{X}\) is judged to be completely impossible, conditional on sentence \(\mathsfit{Z}\):

\[ \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) = 0 \]

It can then be proved, from the fundamental rules, that \(\mathsfit{X}\) is also completely impossible if we add information to \(\mathsfit{Z}\). That is, for any sentence \(\mathsfit{Y}\) we’ll also have

\[ \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Y}\land \mathsfit{Z}) = 0 \]

Exercise

Try to prove this. (Hint: try using the and-rule one or more times.)

What if we use \(\lnot\mathsfit{X}\) for \(\mathsfit{Y}\), that is, what if we acquire knowledge that \(\mathsfit{X}\) is actually true? Then it can be proved that all probability calculations break down. The problem is that \(\lnot\mathsfit{X}\) and \(\mathsfit{Z}\) turn out to be mutually contradictory, so all inferences are starting from contradictory premises. You probably know that in formal logic if we start from contradictory premises then we can obtain any conclusion whatsoever. The same happens with probability logic.

Note that this problem does not arise, however, if \(\mathsfit{X}\) is only extremely improbable conditional on \(\mathsfit{Z}\), say with a probability of \(10^{-100}\), rather than flat-out impossible. In practical applications we often approximate extremely small probabilities by \(0\), or extremely large ones by \(1\). If the probability calculations break down, we must then step back and correct the approximation.


By using the not-rule it is possible to prove that full certainty about a sentence behaves in a similar manner. If sentence \(\mathsfit{X}\) is judged to be completely certain conditional on sentence \(\mathsfit{Z}\):

\[ \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) = 1 \]

then, from the fundamental rules, \(\mathsfit{X}\) is also completely certain if we add information to \(\mathsfit{Z}\). That is, for any sentence \(\mathsfit{Y}\) we’ll also have

\[ \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Y}\land \mathsfit{Z}) = 1 \]

Shortcut rules: permanence of truth and falsity

\[ \begin{aligned} &\text{if}\quad \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) = 0\text{ or }1 \\ &\text{then}\quad \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Y}\land \mathsfit{Z}) = \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) \quad\text{for any $\mathsfit{Y}$ not contradicting $\mathsfit{Z}$} \end{aligned} \]


9.2 Boolean algebra

It is possible to show that all rules you may know from Boolean algebra are a consequence of the fundamental rules of § 8.4. So we can always make the following convenient replacements anywhere in a probability expression:

Shortcut rules: Boolean algebra

\[ \begin{gathered} \lnot\lnot \mathsfit{X}= \mathsfit{X} \\[1ex] \mathsfit{X}\land \mathsfit{X}= \mathsfit{X} \\[1ex] \mathsfit{X}\lor \mathsfit{X}= \mathsfit{X} \\[1ex] \mathsfit{X}\land \mathsfit{Y}= \mathsfit{Y}\land \mathsfit{X} \\[1ex] \mathsfit{X}\lor \mathsfit{Y}= \mathsfit{Y}\lor \mathsfit{X} \\[1ex] \mathsfit{X}\land (\mathsfit{Y}\lor \mathsfit{Z}) = (\mathsfit{X}\land \mathsfit{Y}) \lor (\mathsfit{X}\land \mathsfit{Z}) \\[1ex] \mathsfit{X}\lor (\mathsfit{Y}\land \mathsfit{Z}) = (\mathsfit{X}\lor \mathsfit{Y}) \land (\mathsfit{X}\lor \mathsfit{Z}) \\[1ex] \lnot (\mathsfit{X}\land \mathsfit{Y}) = \lnot \mathsfit{X}\lor \lnot \mathsfit{Y} \\[1ex] \lnot (\mathsfit{X}\lor \mathsfit{Y}) = \lnot \mathsfit{X}\land \lnot \mathsfit{Y} \end{gathered} \]

For example, if we have the probability

\[\mathrm{P}[\mathsfit{X}\lor (\mathsfit{Y}\land \mathsfit{Y}) \nonscript\:\vert\nonscript\:\mathopen{} (\lnot\lnot\mathsfit{Z}) \land \mathsfit{I}]\]

we can directly replace it with

\[\mathrm{P}[\mathsfit{X}\lor \mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}\land \mathsfit{I}]\]


The derivation of the Boolean-algebra rules from the four inference rules is somewhat involved. As as example, a partial proof of the rule \(\mathsfit{X}\land \mathsfit{X}= \mathsfit{X}\), called “and-idempotence” goes as follows:

\[ \begin{aligned} &\mathrm{P}(\mathsfit{X}\land \mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z})&& \\[1ex] &\qquad= \mathrm{P}(\mathsfit{X}| \mathsfit{X}\land \mathsfit{Z}) \cdot \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z})&& &&\text{\small ∧-rule} \\[1ex] &\qquad= 1 \cdot \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z})&& &&\text{\small truth-rule} \\[1ex] &\qquad= \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) \end{aligned} \]

and with a similar procedure it can be shown that \(\mathsfit{X}\land \mathsfit{X}\) can be replaced with \(\mathsfit{X}\) no matter where it appears. The above proof shows that the and-idempotence rule is tightly connected with the truth-rule of inference.

9.3 Law of total probability or “extension of the conversation”

Suppose we have a set of \(n\) sentences \(\set{\mathsfit{H}_1, \mathsfit{H}_2, \dotsc, \mathsfit{H}_n}\) having these two properties:

  • They are mutually exclusive, meaning that the “and” of any two of them is false, given some background knowledge \(\mathsfit{Z}\):

\[ \mathrm{P}(\mathsfit{H}_1\land\mathsfit{H}_2\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{Z}) = 0\ , \quad \mathrm{P}(\mathsfit{H}_1\land\mathsfit{H}_3\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{Z}) = 0\ , \quad \dotsc \ , \quad \mathrm{P}(\mathsfit{H}_{n-1}\land\mathsfit{H}_n\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{Z}) = 0 \]

  • They are exhaustive, meaning that the “or” of all of them is true, given the background knowledge \(\mathsfit{Z}\):

    \[ \mathrm{P}(\mathsfit{H}_1\lor \mathsfit{H}_2 \lor \dotsb \lor \mathsfit{H}_n \nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{Z}) = 1 \]

In other words, according to our background knowledge, one of those sentences must be true, but only one.

Then the probability of a sentence \(\mathsfit{X}\), conditional on \(\mathsfit{Z}\), is equal to a combination of probabilities conditional on \(\mathsfit{H}_1,\mathsfit{H}_2,\dotsc\):

Shortcut rule: extension of the conversation

\[ \begin{aligned} &\mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) \\[2ex] &\quad{}= \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{H}_1 \land \mathsfit{Z})\cdot \mathrm{P}(\mathsfit{H}_1 \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) + \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{H}_2 \land \mathsfit{Z})\cdot \mathrm{P}(\mathsfit{H}_2 \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) + \dotsb + \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{H}_n \land \mathsfit{Z})\cdot \mathrm{P}(\mathsfit{H}_n \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) \end{aligned} \]

This rule is useful when it is difficult to assess the probability of a sentence conditional on the background information, but it is easier to assess the probability of that sentence conditional on several auxiliary “scenarios” or hypotheses1. The name extension of the conversation for this shortcut rule comes from the fact that we are able to call these additional scenarios or hypotheses into play. This situation occurs very often in concrete applications.

1 this is why we used the symbol \(\mathsfit{H}\) for these sentences

9.4 Bayes’s theorem

The probably most famous – or infamous – rule derived from the laws of inference is Bayes’s theorem. It allows us to relate the probability of a proposal \(\mathsfit{Y}\) and a conditional \(\mathsfit{X}\) to the probability where their proposal-conditional roles are exchanged:

Bayes’s theorem guest-starring in The Big Bang Theory
Shortcut rule: Bayes’s theorem

\[ \mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{X}\land \mathsfit{Z}) = \frac{\mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Y}\land \mathsfit{Z})\cdot \mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z})}{\mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z})} \]

Obviously this rule can only be used if \(\mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) > 0\), that is, if the sentence \(\mathsfit{X}\) is not false conditional on \(\mathsfit{Z}\).

Bayes’s theorem is extremely useful when we want to assess the probability of a hypothesis (the proposal) given some data (the conditional), and it is easy to assess the probability of the data conditional on the hypothesis. Note, however, that the sentences \(\mathsfit{Y}\) and \(\mathsfit{X}\) in the theorem can be about anything whatsoever: \(\mathsfit{Y}\) doesn’t always need to be a “hypothesis”, and \(\mathsfit{X}\) doesn’t always need to be “data”.

Exercise

Prove Bayes’s theorem from the fundamental rules of inference.

Study reading

9.5 Bayes’s theorem & extension of the conversation

Bayes’s theorem is often used with several sentences \(\set{\mathsfit{Y}_1, \mathsfit{Y}_2, \dotsc, \mathsfit{Y}_n}\) that are mutually exclusive and exhaustive. Typically these represent competing hypotheses. In this case the probability of the sentence \(\mathsfit{X}\) in the denominator can be expressed using the rule of extension of the conversation:

Shortcut rule: Bayes’s theorem with extension of the conversation

\[ \mathrm{P}(\mathsfit{Y}_1 \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{X}\land \mathsfit{Z}) = \frac{\mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Y}_1 \land \mathsfit{Z})\cdot \mathrm{P}(\mathsfit{Y}_1 \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z})}{ \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Y}_1 \land \mathsfit{Z})\cdot \mathrm{P}(\mathsfit{Y}_1 \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) + \dotsb + \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Y}_n \land \mathsfit{Z})\cdot \mathrm{P}(\mathsfit{Y}_n \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) } \]

and similarly for \(\mathsfit{Y}_2\) and so on.

We will use this form of Bayes’s theorem very frequently.

9.6 The many facets of Bayes’s theorem

Bayes’s theorem is a very general result of the fundamental rules of inference, valid for any sentences \(\mathsfit{X},\mathsfit{Y},\mathsfit{Z}\). This generality leads to many uses and interpretations.

The theorem is often proclaimed to be the rule for “updating an agent’s beliefs”. The meaning of this proclamation is the following. Let’s say that at some point \(\mathsfit{Z}\) represents all the agent’s knowledge. The agent’s degree of belief about some sentence \(\mathsfit{Y}\) is then (at least in theory) the value of \(\mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z})\). At some later point, the agent gets to know – maybe thanks to an observation or measurement – that the sentence \(\mathsfit{X}\) is true. The agent’s whole knowledge at that point is represented no longer by \(\mathsfit{Z}\), but by \(\mathsfit{X}\land \mathsfit{Z}\). The agent’s degree of belief about \(\mathsfit{Y}\) is then given by the value of \(\mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{X}\land\mathsfit{Z})\). Bayes’s theorem allows us to find the agent’s degree of belief about \(\mathsfit{Y}\) conditional on the new state of knowledge, from the one conditional on the old state of knowledge.

This chronological element, however, comes only from this particular way of using Bayes’s theorem. The theorem can more generally be used to connect any two states of knowledge \(\mathsfit{Z}\) and \(\mathsfit{X}\land\mathsfit{Z}\), no matter their temporal order, even if they happen simultaneously, and even if they belong to two different agents.

Exercise

Using Bayes’s theorem and the fundamental laws of inference, prove that if \(\mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z})=1\), that is, if you already know that \(\mathsfit{X}\) is true in your current state of knowledge \(\mathsfit{Z}\), then

\[ \mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{X}\land \mathsfit{Z}) = \mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) \]

that is, your degree of belief about \(\mathsfit{Y}\) doesn’t change (note that this is different from the rule of truth-permanence of § 9.1).

Is this result reasonable?

Study reading
  • §§4.1–4.3 in Medical Decision Making give one more point of view on Bayes’s theorem.

  • Ch. 3 of Probability

  • A graphical explanation of how Bayes’s theorem works mathematically (using a specific interpretation of the theorem):

9.7 Importance of seemingly trivial rules

Some of the fundamental or shortcut rules may seem obvious or unimportant, but are of extreme importance in data science. For instance, the and-idempotence rule    \(\mathsfit{X}\land\mathsfit{X}= \mathsfit{X}\)   effectively asserts that whenever we draw inferences, redundant information or data is automatically counted only once.

This amazing feature saves us from a lot of headaches. Imagine that an AI decision agent at the assembly line has been given the following background information: if an electronic component passes the heating test (\(\mathsfit{h}\)), then its probability of early failure (\(\mathsfit{f}\)) is only 10%:

\[\mathrm{P}(\mathsfit{f}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{h}\land \mathsfit{Z}) = 0.1\]

Now let’s say that a new voltage test has also been devised, and if a component passes this test (\(\mathsfit{v}\)) then its probability of early failure is also 10%:

\[\mathrm{P}(\mathsfit{f}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{v}\land \mathsfit{Z}) = 0.1\]

However, it is discovered that the voltage test works in exactly the same way as the heating test – they’re basically the same test! \(\mathsfit{v}=\mathsfit{h}\). This means that if an element passes the heating test then it will automatically pass the voltage test, and vice versa (they’re the same test!):2

2 We are assuming that a test, if repeated, will always give the same result.

\[\mathrm{P}(\mathsfit{v}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{h}\land \mathsfit{Z}) = 1\]

or equivalently   \(\mathsfit{v}\land \mathsfit{h}= \mathsfit{h}\land \mathsfit{h}= \mathsfit{h}\).

Now suppose that inadvertently we give our AI agent the redundant information that an electronic component has passed the heating test and the voltage test. What will the agent say about the probability of early failure, given this duplicate information? will it count the test twice? Let’s calculate:

\[ \begin{aligned} &\mathrm{P}(\mathsfit{f}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{v}\land \mathsfit{h}\land \mathsfit{Z})&& \\[1ex] &\qquad= \frac{\mathrm{P}(\mathsfit{f}\land \mathsfit{v}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{h}\land \mathsfit{Z})}{ \mathrm{P}(\mathsfit{v}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{h}\land \mathsfit{Z}) } &&\text{\small ∧-rule} \\[1ex] &\qquad= \frac{\mathrm{P}(\mathsfit{f}\land \mathsfit{v}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{h}\land \mathsfit{Z})}{1} =\mathrm{P}(\mathsfit{f}\land \mathsfit{v}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{h}\land \mathsfit{Z}) &&\text{\small initial probability} \\[1ex] &\qquad= \mathrm{P}(\mathsfit{v}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{h}\land \mathsfit{Z}) \cdot \mathrm{P}(\mathsfit{f}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{h}\land \mathsfit{Z}) &&\text{\small ∧-rule} \\[1ex] &\qquad= 1 \cdot \mathrm{P}(\mathsfit{f}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{h}\land \mathsfit{Z}) &&\text{\small truth cannot be altered} \\[1ex] &\qquad= 0.1 &&\text{\small initial probability} \end{aligned} \]

The AI agent, thanks to the truth-rule or equivalently the and-idempotence rule, correctly detected the redundancy of the sentence \(\mathsfit{v}\) (“the element passed the voltage test”) and automatically discarded it.

This feature is of paramount importance in machine learning and data-driven engineering: the “features” that we give as an input to a machine-learning classifier could contain redundancies that we don’t recognize, owing to the complexity of the data space. But if the classifier makes inferences according to the four fundamental rules, it will automatically discard any redundant features.