8  Probability inference

Published

2025-11-18

In most engineering and data-science problems we don’t know the truth or falsity of outcomes and hypotheses that interest us. But this doesn’t mean that nothing can be said or done in such situations. Now we shall finally see how to draw uncertain inferences, that is, how to calculate the probability of something that interests us, given particular data, information, and assumptions.

So far we have used the term “probability” somewhat informally and intuitively. It’s time to make it more precise and to emphasize some of its most important aspects, especially for Artificial Intelligence. Then we’ll dive into the rules of probability-inference.

8.1 When truth isn’t known: beliefs and probability

If we want to design an AI agent that can make decisions and act in uncertain situations, we need to equip it with probabilities and utilities. This is an inescapable necessity, as discussed in § 2.5. It is intuitively clear what utilities are: they quantify how desirable different outcomes are to the agent. What do probabilities quantify?

A first tentative answer could be this: the probability of an outcome quantifies how often the agent has seen that outcome. That is, it represents the observed frequency of that outcome.

Although it seems reasonable in many respects, the answer above still doesn’t capture several elements that appear in the way we make rational inferences, and that are extremely important in designing an AI agent that can operate in uncertain conditions.

To see what’s missing, consider the following scenario:

Someone shows you and a friend of yours, called Aisha, a mechanical device designed to toss a coin. The device looks complicated; neither of you has a full grasp of its design. A normal coin is placed on the device, and your friend Aisha is required to place a bet on whether the tossed coin will land heads or tails. If she guesses correctly she wins $1; nothing otherwise. Note that not betting is not an option: feel free to imagine circumstances (possibly very nasty) where your friend can’t refuse to bet. Let’s also make clear that no “cheating” is taking place: for instance, the outcome does not depend on the betting choice (you can imagine that the coin is tossed right before Aisha bets, with the outcome kept hidden from her until she has placed her bet). From now on you can only observe Aisha’s choices and hear what she says, but you cannot talk with her.

In this situation your friend says that it does not matter whether she bets on heads or tails; she has no more belief in one outcome than the other. In fact, she decides by tossing a coin she had in her pocket.

Do you think Aisha’s judgement and choice are rational?

Aisha’s conclusion comes not only from the fact that the win is the same in either outcome, but also from the fact that her beliefs in the heads outcome and in the tails outcome are equal. If one belief were stronger than the other, she would definitely bet on that outcome instead.

Yet this equality of beliefs does not come from an equality of frequencies. In fact neither you nor your friend has observed any frequencies whatsoever; neither of you has ever seen a similar tossing device in operation before. For all you know, the device might be designed to always produce heads, or always tails, or one or the other with peculiar frequencies different from 50%/50%.

Let’s continue:

Now you and your friend observe the toss outcome. It’s tails. The device is prepared again, exactly as in the first toss, and Aisha is required to bet again. However, she is now offered two different bets: if she bets on tails and guesses correctly, she wins $1; if she bets on heads and guesses correctly, she wins $2. (And remember: the outcome won’t depend on her bet; no cheating is taking place.)

Aisha says the following. If she had been offered a $1 win on either heads or tails, then she would have bet on tails, simply because the device had shown tails once. But her belief in tails is still not much stronger than her belief in heads; so, given the double win on heads, she now bets heads.

Do you think Aisha’s judgement and choice are rational?

Clearly Aisha’s beliefs in the two possible outcomes are almost equal. And this near-equality is different from the observed frequency: the frequency right now is 100% tails. The frequency does affect Aisha’s belief a little (she’d choose tails if offered an equal bet), but is different from her belief.

Fast-forward in time:

The device is used repeatedly, say 1000 times (plus some extra times until you or your friend says “stop”, if you like). In these repetitions you and Aisha count that heads occurred 781 times, and tails 219 times, without any recognizable pattern. Now Aisha is asked again to place a bet on the next toss.

Aisha says that she believes more strongly that the coin will land heads than tails. She even quantifies her belief at around 78% for heads and 22% for tails, very close to (or identical with) the frequency you both observed. Aisha bets accordingly.

Do you think Aisha’s judgement and choice are rational?

Now we can say that her beliefs and the observed frequencies are aligned.

Final part of this scenario:

Finally, another friend of yours, called Aiden, is brought in and shown the device. Aiden cannot communicate with you or Aisha; he is in the same situation you and Aisha were in around 1000 tosses ago. Aiden is required to place a bet, and he says that he has no preference between betting on heads or tails – with exactly the same reasoning Aisha used 1000 tosses ago.

Do you think Aiden’s judgement and choice are rational?


If you think that Aisha’s and Aiden’s behaviour and choices were rational, consider a situation in which two AI agents were in their place. It is apparent that we need to equip an AI agent with a sort of quantified, rational “degree of belief” in order for it to make rational decisions in uncertain situations. This is also the result discussed in § 2.5. This degree of belief can have connections with an observed frequency, but is generally different from it. The two notions must therefore be kept separate, and quantified separately. In fact, there are situations in which an agent is uncertain about a frequency and needs to quantify its own belief about that frequency; we’ll meet such situations in the Inference III part.

What we call probability is this quantified degree of belief:

 

The probability of a sentence is an agent’s quantified degree of belief in that sentence.

We shall take the notion of degree of belief as intuitively understood, just as we did with the notion of truth. We shall use the terms probability, degree of belief, belief, plausibility, credibility1 as synonyms.

1 credibility literally means “believability” (from Latin credo = to believe).

Probabilities are quantified between \(0\) and \(1\), or equivalently between \(0\%\) and \(100\%\). Assigning to a sentence a probability 1 is the same as saying that it is true; and a probability 0, that it is false. A probability of 0.5 represents a belief completely symmetric with respect to truth and falsity.

Said otherwise, if an agent assigns to a sentence a probability 1, it means that the agent is completely certain that the sentence is true. If the agent assigns a probability 0, it means that the agent is completely certain that the sentence is false. If the agent assigns a probability 0.5, it means that the agent is equally uncertain about the truth as about the falsity of the sentence.

8.2 Important aspects of probabilities

From our discussion of the tossing-device scenario above we can gather several important aspects and facts about probabilities:

Probabilities are not frequencies

This fact is clear by now. Indeed:

  • there can be a degree of belief when no frequencies are available;
  • a degree of belief can be very different from observed frequencies;
  • a degree of belief can be equal to an observed frequency.

One remarkable feature of the framework that we’re going to study is that the quantitative connection between these two notions is taken care of automatically! We shall see how this connection works in the Inference III part.

Note also the different status of the notions “frequency” and “probability/belief” from an epistemological point of view.2 Frequencies can be unknown to some agents. Degrees of belief cannot be “unknown”: the agent must have them in order to act. At worst, degrees of belief can be difficult to calculate.

2 That is, from the point of view of an agent’s knowledge.

Be careful when you read authors speaking of an “unknown probability”: they actually mean either “unknown frequency”, or a probability that has to be calculated; it’s “unknown” in the same sense that the value of  \(1-0.7 \cdot 0.2/(1-0.3)\)  is “unknown” to you right now.

Probabilities are agent- and knowledge-dependent

The tossing-device scenario shows that different agents can have different probabilities, that is, degrees of belief, about the same situation.

This happened when Aiden entered the scene. Aisha had beliefs around 78% for heads and 22% for tails; but Aiden had 50%/50% beliefs for the same toss. Both sets of beliefs were rational and appropriate to their respective situations. They simply reflected the different states of knowledge of the agents that held them.

An omniscient agent would know the truth or falsity of every sentence, and would assign only probabilities 0 or 1. Some literature speaks of “actual (but unknown) probabilities”. But if there were “actual” probabilities, they would all be 0 or 1, and it would be pointless to speak about probabilities at all – every inference would be a truth-inference.

Probabilities are not physical properties

The fact that two agents can hold different probabilities in the same situation also shows that probabilities are not physical properties, which could be objectively measured with some meter.

Whether a tossed coin lands heads or tails is fully determined by the initial conditions (position, orientation, momentum, angular momentum) of the toss and the boundary conditions (air velocity and pressure) during the flight. The same is true for all macroscopic engineering phenomena (even quantum phenomena have never been proved to be non-deterministic, and there are deterministic, experimentally consistent mathematical representations of quantum theory). So we cannot measure a probability using some physical apparatus.

Study reading

Skim through Diaconis et al. 2007: Dynamical Bias in the Coin Toss.

We can objectively measure frequencies, in several instances of a phenomenon. Frequencies, as opposed to probabilities, are physically measurable quantities. This shows again the difference between probabilities and frequencies.

Probabilities are assigned to sentences

We already discussed this point in § 6.3, but let’s reiterate it. Consider an engineer working on a problem of electric-power distribution in a specific geographical region. At a given moment the engineer may believe with \(75\%\) probability that the measured average power output in the next hour will be 100 MW. The \(75\%\) probability is assigned not to the quantity “100 MW”, but to the sentence

\[ \textsf{\small`The measured average power output in the next hour will be 100\,MW'} \]

This difference is extremely important. Consider the alternative sentence

\[ \textsf{\small`The average power output in the next hour will be \emph{set} to 100\,MW'} \]

the numerical quantity is the same, but the meaning is very different. The probability can therefore be very different. If the engineer is the person who decides how to set that output, and has decided to set it to 100 MW, then the probability is obviously \(100\%\) (or very close to it), because the engineer already knows what the output will be. The probability depends not only on a number, but on what is being done with that number: measuring, setting, third-party reporting, and so on. Often we write simply \(O \mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}100\,\mathrm{MW}\), provided that the full sentence behind this shorthand is understood.


The points listed above are not just a matter of principle. They have important practical consequences. A data scientist who is not attentive to the source of the data (measured? set? reported, and so maybe less trustworthy?), or who does not carefully assess the context of a probability, or who mixes up a probability with a frequency, or who does not take advantage (when possible) of the physics involved in the problem – such a data scientist will design systems with sub-optimal performance3, or even cause deaths.

3 This fact can be mathematically proven.

8.3 The many uses of the word “probability”

The terms “degree of belief” and “frequency” are quite distinct and are used more or less consistently in the literature. Whenever you encounter these terms you more or less know what’s intended.

Sadly the situation is completely different with the term “probability”, which is used in the literature in incompatible ways. Some literature uses this term as a synonym of “frequency”. Other literature uses it as a synonym of “degree of belief”, as we do; use of “probability” as “degree of belief” is called Bayesian probability theory.

Some recent literature in machine learning uses “probability” in yet another way, to denote the numeric output of some machine-learning algorithms. This numeric output is neither a frequency nor a degree of belief, and has only vague associations with them; we’ll discuss this in ch. 42.

There are also a couple of further uses of the term “probability” in the literature. It’s a mess.

In this course we could have stuck to the terms “degree of belief” and “frequency” only, avoiding the problematic “probability”. But such a choice would not be very helpful to you, because you will nevertheless encounter this term in your readings and scientific discussions. In the AI literature the most common use is as “degree of belief”, so we also adopt it, sometimes using “(degree of) belief” and sometimes “probability”, interchangeably.

But beware of this term when you read the literature, or in your scientific discussions. You must try to understand the intended meaning from the context. You’re also free to choose (preferably consistently) the terminology you like most. What’s important is that the notions underlying these words are clear to you.

Beware of likelihood as a synonym for probability

In everyday language, “likelihood” is a synonym of “probability”. In technical writings about probability or statistics, however, “likelihood” means something different and is not a synonym of “probability”, as we explain below (§ 8.10.1).



8.4 An unsure inference. Probability notation

Consider now the following variation of the trivial inference problem of § 7.1.

This electric component had an early failure. If an electric component fails early, then at production either it didn’t pass the heating test or it didn’t pass the shock test (or both). The probability that it passed neither test (that is, that it failed both) is 10%. There’s no more reason to believe that the component passed the heating test than to believe that it passed the shock test.

Again the inspector wants to assess whether the component did not pass the heating test.

From the data and information given, what would you say is the probability that the component didn’t pass the heating test?

Exercise 8.1
  • Try to argue why a conclusion cannot be drawn with certainty in this case. One way to argue this is by presenting two different scenarios that fit the given data but have opposite conclusions.

  • Try to reason intuitively and assess the probability that the component didn’t pass the heating test. Should it be larger or smaller than 50%? Why?

For this inference problem we cannot find a true or false final value. The truth-inference rules (7.1)–(7.4) therefore cannot help us here. In fact even the \(\mathrm{T}(\dotso \nonscript\:\vert\nonscript\:\mathopen{} \dotso)\) notation is unsuitable, because it only admits the values \(1\) (true) and \(0\) (false).

Let us first generalize this notation in a straightforward way:

First, let’s represent the probability or degree of belief of a sentence by a number in the range \([0,1]\), that is, between \(\mathbf{0}\) (impossibility, or falsity) and \(\mathbf{1}\) (certainty, or truth). The value \(0.5\) represents a belief in the truth of the sentence which is as strong as the belief in its falsity.

Second, let’s symbolically write in the following way that the probability of a proposal \(\mathsfit{Y}\), given a conditional \(\mathsfit{X}\), is some number \(p\):

\[ \mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{X}) = p \]

Note that this notation includes the notation for truth-values as a special case:

\[ \mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{X}) = 1 \quad\Longleftrightarrow\quad \mathrm{T}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{X}) = 1 \,, \qquad \mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{X}) = 0 \quad\Longleftrightarrow\quad \mathrm{T}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{X}) = 0 \]

8.5 Inference rules

Extending our truth-inference notation to probability-inference notation has been straightforward. But which rules should we use for drawing inferences when probabilities are involved?

The amazing result is that the rules for truth-inference, formulae (7.1)–(7.4), extend also to probability-inference. The only difference is that they now hold for all values in the range \([0,1]\), rather than only for \(0\) and \(1\).

This important result was taken more or less for granted at least since Laplace in the 1700s, but was formally proven for the first time in 1946 by R. T. Cox. The proof has been refined since then. What kind of proof is it? It shows that if we don’t follow the rules we are doomed to arrive at illogical conclusions; we’ll show some examples later.


Finally, here are the fundamental rules of all inference. They are encoded by the following equations, which must always hold for any atomic or composite sentences \(\mathsfit{X},\mathsfit{Y},\mathsfit{Z}\):

  THE FUNDAMENTAL LAWS OF INFERENCE  
\(\boldsymbol{\lnot}\) “Not” rule
\[\mathrm{P}(\lnot \mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) = 1 - \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z})\]
\(\boldsymbol{\land}\) “And” rule
\[ \mathrm{P}(\mathsfit{X}\land \mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) = \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Y}\land \mathsfit{Z}) \cdot \mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) = \mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{X}\land \mathsfit{Z}) \cdot \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) \]
\(\boldsymbol{\lor}\) “Or” rule
\[\mathrm{P}(\mathsfit{X}\lor \mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) = \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) + \mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) - \mathrm{P}(\mathsfit{X}\land \mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) \]
Truth rule
\[\mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{X}\land \mathsfit{Z}) = 1 \]



How to use the rules:
Each equality can be rewritten in different ways according to the usual rules of algebra. The resulting left side can then be replaced by the right side, and vice versa. The numerical values of starting inferences can be substituted into the corresponding expressions.
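As a small illustration of such rewriting, the four rules together force certainty in “\(\mathsfit{X}\) or not-\(\mathsfit{X}\)”, whatever the conditional \(\mathsfit{Z}\) may be:

\[ \begin{aligned} \mathrm{P}(\mathsfit{X}\land \lnot\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) &= \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \lnot\mathsfit{X}\land \mathsfit{Z}) \cdot \mathrm{P}(\lnot\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) && \text{\small ∧-rule} \\[1ex] &= [1 - \mathrm{P}(\lnot\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \lnot\mathsfit{X}\land \mathsfit{Z})] \cdot \mathrm{P}(\lnot\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) && \text{\small ¬-rule} \\[1ex] &= (1 - 1) \cdot \mathrm{P}(\lnot\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) = 0 && \text{\small truth rule} \\[2ex] \mathrm{P}(\mathsfit{X}\lor \lnot\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) &= \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) + \mathrm{P}(\lnot\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) - \mathrm{P}(\mathsfit{X}\land \lnot\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) && \text{\small ∨-rule} \\[1ex] &= \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) + [1 - \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z})] - 0 = 1 && \text{\small ¬-rule} \end{aligned} \]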

It is amazing that ALL inference is nothing else but a repeated application of these four rules – maybe billions of times or more. All machine-learning algorithms are just applications or approximations of these rules. Methods that you may have heard about in statistics are just specific applications of these rules. Truth inferences are also special applications of these rules. Most of this course is just a study of how to apply these rules to particular kinds of problems.

Study reading

Read:

Skim through:


The fundamental inference rules are used in the same way as their truth-inference counterparts (7.1)–(7.4): each equality can be rewritten in different ways according to the usual rules of algebra. The left and right sides of the equality thus obtained can replace each other in a proof.

8.6 Solution of the uncertain-inference example

Armed with the fundamental rules of inference, let’s solve our earlier inference problem. As usual, we first analyse it and represent it in terms of atomic sentences; we find what are its proposal and conditional; and we find which initial inferences are given in the problem.

1. Atomic sentences

\[ \begin{aligned} \mathsfit{h}&\coloneqq\textsf{\small`The component passed the heating test'} \\ \mathsfit{s}&\coloneqq\textsf{\small`The component passed the shock test'} \\ \mathsfit{f}&\coloneqq\textsf{\small`The component had an early failure'} \\ \mathsfit{J}&\coloneqq\textsf{\small (all other implicit background information)} \end{aligned} \]

The background information in this example is different from the previous, truth-inference one, so we use the different symbol \(\mathsfit{J}\) for it.

2. Proposal, conditional, and target inference

The proposal is \(\lnot\mathsfit{h}\), just like in the truth-inference example.

The conditional is different now. We know that the component failed early, but we don’t know whether it passed the shock test. Hence the conditional is \(\mathsfit{f}\land \mathsfit{J}\).

The target inference is therefore

\[ \color[RGB]{238,102,119}\mathrm{P}(\lnot\mathsfit{h}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J}) \]

3. Starting inferences

We are told that if an electric component fails early, then at production it didn’t pass the heating test or it didn’t pass the shock test (possibly it passed neither). This is given as a sure fact. Let’s write it as

\[ \color[RGB]{34,136,51}\mathrm{P}(\lnot\mathsfit{h}\lor \lnot\mathsfit{s}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J}) = 1 \tag{8.1}\]

We are also told that there is a \(10\%\) probability that both tests fail:

\[ \color[RGB]{34,136,51}\mathrm{P}(\lnot\mathsfit{h}\land \lnot\mathsfit{s}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J}) = 0.1 \tag{8.2}\]

Finally, the problem says that there’s no more reason to believe that the component passed the heating test than that it passed the shock test. This can be written as follows:

\[ \color[RGB]{34,136,51}\mathrm{P}(\mathsfit{h}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J}) = \mathrm{P}(\mathsfit{s}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J}) \tag{8.3}\]

Note the interesting situation above: we are not given the numerical values of these two probabilities; we are only told that they are equal. This is an example of application of the principle of indifference, which we’ll discuss in more detail later.

Finding the target inference

In this case too there is no unique way of applying the rules to reach our target inference, but all paths lead to the same result. Let’s proceed backwards:

\[ \begin{aligned} &\color[RGB]{238,102,119}\mathrm{P}(\lnot\mathsfit{h}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J})&& \text{\small ∨-rule} \\[1ex] &\qquad= {\color[RGB]{34,136,51}\mathrm{P}(\lnot\mathsfit{s}\lor \lnot\mathsfit{h}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J})} + {\color[RGB]{34,136,51}\mathrm{P}(\lnot\mathsfit{s}\land \lnot\mathsfit{h}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J})} - \mathrm{P}(\lnot\mathsfit{s}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J}) && \text{\small starting inferences (8.1–2)} \\[1ex] &\qquad= {\color[RGB]{34,136,51}1} + {\color[RGB]{34,136,51}0.1} - \mathrm{P}(\lnot\mathsfit{s}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J}) && \text{\small ¬-rule} \\[1ex] &\qquad= 0.1 + \color[RGB]{34,136,51}\mathrm{P}(\mathsfit{s}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J}) && \text{\small starting inference (8.3)} \\[1ex] &\qquad= 0.1 + \mathrm{P}(\mathsfit{h}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J}) && \text{\small ¬-rule} \\[1ex] &\qquad= 0.1 + 1 -\color[RGB]{238,102,119}\mathrm{P}(\lnot\mathsfit{h}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J}) && \end{aligned} \]

The target probability appears on the left and right side with opposite signs. We can solve for it:

\[ \begin{aligned} 2\,{\color[RGB]{238,102,119}\mathrm{P}(\lnot\mathsfit{h}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J})} &= 0.1 + 1 \\[1ex] {\color[RGB]{238,102,119}\mathrm{P}(\lnot\mathsfit{h}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J})} &= 0.55 \end{aligned} \]

So the probability that the component didn’t pass the heating test is \(55\%\).
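As a quick consistency check: by the starting inference (8.3) and the ¬-rule, \(\mathrm{P}(\lnot\mathsfit{s}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J})\) must also equal \(0.55\), and the ∨-rule then gives back the first starting inference (8.1):

\[ \mathrm{P}(\lnot\mathsfit{h}\lor \lnot\mathsfit{s}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J}) = 0.55 + 0.55 - 0.1 = 1 \]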

Exercise 8.2
  • Try to find an intuitive explanation of why the probability is 55%, slightly larger than 50%. If your intuition says this probability is wrong, then:

    • Check the proof of the inference for mistakes, or try to find a proof with a different path.
    • Examine your intuition critically and educate it.
  • Check how the target probability \(\mathrm{P}(\lnot\mathsfit{h}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J})\) changes if we change the value of the probability \(\mathrm{P}(\lnot\mathsfit{s}\land \lnot\mathsfit{h}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J})\) from \(0.1\).

    • What result do we obtain if \(\mathrm{P}(\lnot\mathsfit{s}\land \lnot\mathsfit{h}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J})=0\)? Can it be intuitively explained?
    • What if \(\mathrm{P}(\lnot\mathsfit{s}\land \lnot\mathsfit{h}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J})=1\)? Does the result make sense?

8.7 Use and implementation of the inference rules

In the step-wise solution above you noticed that the equations of the fundamental rules were not used only to calculate the probability on their left side from those on their right side. For example, in the very first step, when we went from the probability

\[\mathrm{P}(\lnot\mathsfit{h}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J})\]

to the sum

\[\mathrm{P}(\lnot\mathsfit{s}\lor \lnot\mathsfit{h}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J}) + \mathrm{P}(\lnot\mathsfit{s}\land \lnot\mathsfit{h}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J}) - \mathrm{P}(\lnot\mathsfit{s}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J}) \,, \]

we used the or-rule rewritten, by simple algebra, as follows:

\[ \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) = \mathrm{P}(\mathsfit{X}{\color[RGB]{238,102,119}\lor} \mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) - \mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) + \mathrm{P}(\mathsfit{X}\land \mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) \,. \]

The four rules in fact represent, first of all, constraints of logical consistency (the precise technical term is coherence) among probabilities. For instance, if we have probabilities

\[\begin{aligned} \mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{X}\land \mathsfit{Z}) &= 0.1 \\ \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{Z}) &= 0.7 \\ \mathrm{P}(\mathsfit{X}\land \mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{Z}) &= 0.2 \,, \end{aligned}\]

then there must be an inconsistency in the agent’s degrees of belief, because these values violate the and-rule:

\[0.2 \ne 0.1 \cdot 0.7 \,.\]

When this happens, the inconsistency must be found and resolved. (However, since probabilities are quantified by real numbers, slight discrepancies within numerical round-off errors are possible and acceptable.)

The fundamental rules also imply more general constraints. For example we must always have

\[ \begin{gathered} \mathrm{P}(\mathsfit{X}\land \mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{Z}) \le \min\set[\big]{\mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{Z}),\ \mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{Z})} \\ \mathrm{P}(\mathsfit{X}\lor \mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{Z}) \ge \max \set[\big]{\mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{Z}),\ \mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{Z})} \end{gathered} \]

Exercise 8.3
  • Try to prove the two constraints above from the four fundamental rules.

  • Translate the two constraints above verbally. Do they make intuitive sense? (For example, the first constraint says, very roughly speaking, that our belief in two things can’t be stronger than our belief in either one of the two things alone.)


In following the step-wise solution you may have asked yourself after some steps: “OK, why this particular step now, and not some other?”. This is a very intelligent question. You essentially noticed the lack of an algorithm. Compare this situation with the solution of a basic decision problem: there we had a clear sequence of steps: write down some tables, do some element-wise multiplications and then some sums, and finally find the largest of a list of numbers.

Is there an algorithmic way of drawing inferences, that is, of calculating some target probabilities given some initial ones? If our inference problem involves a finite number of sentences, then the answer is yes, but with some caveats.

Given a set of initial probabilities, and a target probability whose value we want to find, there is an algorithm that yields the minimum and the maximum possible values of the target probability. Specifically it can yield the following results (note that some are particular instances of others):

For the extra curious

The complete proof that inference problems can be solved this way seems to have been given first by Hailperin, although particular cases were explored already by Boole.

   No values
This means that the initial probabilities have inconsistent values; that is, they violate some of the four rules, as in the \(0.2 \ne 0.1 \cdot 0.7\) example above.

   \(0\) and \(1\)
That is, the target probability is completely unspecified. This means that the initial probabilities are not sufficient to determine the target probability. For example, if we have \(\mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) = 0.2\), and \(\mathsfit{Y}\) is a sentence completely unrelated to \(\mathsfit{X}\) (no atomic sentences in common), then \(\mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z})\) could have any value.

   Values \(v\), \(V\) with \(v < V\) and at least one different from 0 or 1
That is, the target probability cannot take just any value, but its exact value is still undetermined. This means that the initial probabilities constrain the target probability somewhat, but do not determine it. The two constraints discussed above are an example of this: if we have \(\mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) = 0.2\) and \(\mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) = 0.3\), then we can say that \(0 \le \mathrm{P}(\mathsfit{X}\land \mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) \le 0.2\), but its precise value is not determined.

   One value \(p\)
The target probability is completely determined by the initial ones. This occurred in our example inference about the electronic component.

This algorithm is not mathematically difficult, but it is somewhat involved. It boils down to: (1) writing the sentences appearing in the initial and target probabilities in disjunctive normal form; (2) rewriting the initial and target probabilities as sums of probabilities of basic conjunctions; (3) solving two linear-fractional optimization problems, equivalent to linear optimization ones (for which algorithms are available).
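To make step (3) concrete, here is a minimal sketch – our own illustration, not the implementation of the inferP() function introduced in the next section – of how the component inference of § 8.6 becomes two linear programs, using the lpSolve package. Since all the given probabilities share the same conditional \(\mathsfit{f}\land \mathsfit{J}\), this problem happens to be purely linear and no fractional step is needed.

library(lpSolve)   ## linear-programming package, also required by inferP()

## probabilities of the four basic conjunctions of h and s,
## all implicitly conditional on f & J:
##   p1 = P(h & s),  p2 = P(h & -s),  p3 = P(-h & s),  p4 = P(-h & -s)
A <- rbind(
    c(1, 1, 1, 1),    ## the four probabilities must sum to 1
    c(0, 1, 1, 1),    ## P(-h + -s | f & J) = p2 + p3 + p4 = 1
    c(0, 0, 0, 1),    ## P(-h & -s | f & J) = p4 = 0.1
    c(0, 1, -1, 0)    ## P(h | f & J) = P(s | f & J), i.e. p2 - p3 = 0
)
dirs <- rep('=', 4)
rhs  <- c(1, 1, 0.1, 0)
target <- c(0, 0, 1, 1)   ## target: P(-h | f & J) = p3 + p4

lp('min', target, A, dirs, rhs)$objval
lp('max', target, A, dirs, rhs)$objval
## both optimizations should return 0.55, the value found by hand above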

8.8 An implementation in R

The algorithm described in the previous section is implemented in the R function inferP(); it requires the lpSolve R package, which we installed in the R introduction.

The inferP() function takes as its first argument, target =, the probability we want to find, and as remaining arguments the values of the given probabilities, or equalities among probabilities. We must use the following notation:

  • \(\lnot\)  becomes  -
  • \(\land\)  becomes  & or equivalently *
  • \(\lor\)  becomes  +
  • \(=\)  becomes  ==
  • conditional bar \(\nonscript\:\vert\nonscript\:\mathopen{}\)  becomes  |

The output is the minimum and maximum possible values of the target probability, or NA if the starting probabilities are inconsistent.

We can apply this function in the example above about the electronic component:

source('inferP.R') ## load the function

inferP(
    ## target probability:
    target = P(-h | f & J),
    ## given probabilities:
    P(-h + -s | f & J) == 1,
    P(-h & -s | f & J) == 0.1,
    P(h | f  & J) == P(s | f & J)
)
 min  max 
0.55 0.55 

Obviously the algorithm becomes more expensive as the number of sentences in the inference problem grows. In inference problems involving continuous quantities, such as energy, such an algorithm cannot be applied in practice. Also, the way this algorithm works cannot be represented as a sequence of “logical” steps with consecutive applications of the rules, as was done in the step-wise solution above. Later on in this course we shall focus on specific kinds of inferences for which other, less opaque inference algorithms are available.
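As another small experiment (the sentence symbols are of our choosing, and the expected output follows from the two constraints shown in § 8.7), we can ask inferP() about the partially-determined case mentioned earlier:

inferP(
    ## target probability:
    target = P(X & Y | Z),
    ## given probabilities:
    P(X | Z) == 0.2,
    P(Y | Z) == 0.3
)
## we would expect min = 0 and max = 0.2:
## the given probabilities only constrain P(X & Y | Z) to that interval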

Exercise 8.4
  • Play with the function inferP(): test it in other inference problems, find problems where there is no solution and others where the min and max values are different.

  • Consider the following inference:

inferP(
    target = P(hypothesis  |  evidence1 & evidence2 & I),
    P(hypothesis  |  evidence1 & I) == 0.1,
    P(hypothesis  |  evidence2 & I) == 0.9
)
min max 
  0   1 

What does this result tell us about “combining evidence” to prove a hypothesis?

8.9 Consequences of not following the rules

The fundamental rules of inference guarantee that the agent’s uncertain reasoning is self-consistent, and that it follows ordinary logic when there’s no uncertainty. Breaking the rules means that the resulting inference contains some logical inconsistency or irrationality.

There are many examples of inconsistencies that appear when the rules are broken. Imagine for instance an agent that gives an 80% probability that it rains4 in the next hour, and also gives a 90% probability that it rains and the average wind is above 3 m/s in the next hour. This is clearly unreasonable: the rain scenario alone covers wind speeds both above and below 3 m/s, and therefore can be no less probable than the scenario where it rains with wind above 3 m/s. And indeed the two given probabilities break the and-rule, showing that they are unreasonable or illogical.

4 to be precise, let’s say “it rains above 1 mm”
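We can also let the inferP() function of § 8.8 detect this inconsistency numerically (the sentence symbols rain, wind, I below are our own shorthand for the sentences of this example):

inferP(
    ## any target will do; what matters are the given beliefs:
    target = P(wind | rain & I),
    ## the agent's two stated beliefs:
    P(rain | I) == 0.8,
    P(rain & wind | I) == 0.9
)
## we would expect NA here: no coherent set of probabilities can make
## the 'rain and wind' scenario more probable than the 'rain' scenario alone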

Exercise 8.5

Prove that the two probabilities in the example above break the and-rule.  (Hint: you must use the fact that probabilities are numbers between 0 and 1, and that multiplying a number by something between 0 and 1 can only yield a smaller number.)

Study reading

8.10 Remarks on terminology and notation

Likelihood

In everyday language, “likely” is often a synonym of “probable”, and “likelihood” of “probability”. But in technical writings about probability, inference, and decision-making, “likelihood” has a very different meaning. Beware of this important difference in definitions:

\(\mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{X})\) is:

  • the probability of \(\mathsfit{Y}\) given \(\mathsfit{X}\) (or conditional on \(\mathsfit{X}\)),

  • the likelihood of \(\mathsfit{X}\) in view of \(\mathsfit{Y}\).

We can also say:

  • the probability of \(\mathsfit{Y}\) given \(\mathsfit{X}\), is \(\mathrm{P}({\color[RGB]{68,119,170}\mathsfit{Y}}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{X})\).

  • the likelihood of \(\mathsfit{Y}\) in view of \(\mathsfit{X}\), is \(\mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{}{\color[RGB]{68,119,170}\mathsfit{Y}})\).

Probability and likelihood can be very different

A priori there is no relation between the probability and the likelihood of a sentence \(\mathsfit{Y}\): this sentence could have very high probability and very low likelihood, and vice versa.
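Here is a small numerical illustration (the sentence symbols and numbers are of our choosing, and the expected result can be checked with the inferP() function of § 8.8). Suppose defects are rare (probability 0.001 given background information I), a test flags defective components reliably (probability 0.99 given a defective component), and it raises false alarms with probability 0.05 given a non-defective one. Then the likelihood of ‘the component is defective’ in view of ‘the test flags it’ is the high value 0.99, while the probability of ‘the component is defective’ given ‘the test flags it’ turns out to be low:

inferP(
    target = P(defective | flagged & I),
    P(flagged |  defective & I) == 0.99,
    P(flagged | -defective & I) == 0.05,
    P(defective | I) == 0.001
)
## we would expect min and max to be roughly 0.019:
## high likelihood of 'defective', but low probability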

In these notes we’ll avoid the possibly confusing term “likelihood”. All we need to express can be phrased in terms of probability.

Omitting background information

In the analyses of the inference examples of § 7.1 and § 8.4 we defined sentences (\(\mathsfit{I}\) and \(\mathsfit{J}\)) expressing all background information, and always included these sentences in the conditionals of the inferences – because those inferences obviously depended on that specific background information.

In many concrete inference problems the background information usually stays in the conditional from beginning to end, while the other sentences jump around between conditional and proposal as we apply the rules of inference. For this reason the background information is often omitted from the notation, being implicitly understood. For instance, if the background information is denoted \(\mathsfit{I}\), one writes

  • \(\mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{X})\)  instead of  \(\mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{X}\land \mathsfit{I})\)

  • \(\mathrm{P}(\mathsfit{Y})\)  instead of  \(\mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{I})\)

This is what’s happening in books where you see \(P(x)\) without conditional.

Such practice may be convenient, but be wary of it, especially in the following kinds of situations:

  • In some inference problems we suddenly realize that we must distinguish between cases that depend on hypotheses, say \(\mathsfit{H}_1\) and \(\mathsfit{H}_2\), that were buried in the background information \(\mathsfit{I}\). If the background information \(\mathsfit{I}\) is explicitly reported in the notation, this is no problem: we can rewrite it as

    \[ \mathsfit{I}= (\mathsfit{H}_1 \lor \mathsfit{H}_2) \land \mathsfit{I}'\]

    and then proceed as usual. If the background information was not explicitly written, this may lead to confusion and mistakes: there may suddenly appear two instances of \(\mathrm{P}(\mathsfit{X})\) with different values, just because one of them is invisibly conditional on \(\mathsfit{I}\), the other on \(\mathsfit{I}'\).

  • In some inference problems we are considering several different instances of background information – for example because more than one agent is involved. It’s then extremely important to write the background information explicitly, lest we mix up the degrees of belief of different agents.

For the extra curious

A once-famous paper in the quantum-theory literature (Brukner & Zeilinger 2000: Conceptual inadequacy of the Shannon information in quantum measurements) arrived at completely wrong results simply by omitting background information, thereby mixing up probabilities having different conditionals.

This kind of confusion from poor notation happens more often than one thinks, and even appears in the scientific literature.

“Random variables”

Some texts speak of the probability of a “random variable”, or more precisely of the probability “that a random variable takes on a particular value”. As you notice, we have just expressed that idea by means of a sentence. The viewpoint and terminology of random variables is therefore a special case of that based on sentences, which we use here.

The dialect of “random variables” does not offer any advantages in concepts, notation, terminology, or calculations, and it has several shortcomings:

  • As discussed in § 8.2, in concrete applications it is important to know how a quantity “takes on” a value: for example it could be directly measured, indirectly reported, or purposely set to that specific value. Thinking and working in terms of sentences, rather than of random variables, allows us to account for these important differences.

  • We want a general AI agent to be able to deal with uncertainty and probability also in situations that do not involve mathematical sets.

  • Very often the object (proposal) of a probability is not a “variable”: it is actually a constant value that is simply unknown (simple example: we are uncertain about the mass of a particular block of concrete, so we speak of the probability of some mass value; this doesn’t mean that the mass of the block of concrete is changing).

  • What does “random” (or “chance”) mean? Strangely enough, texts that use that word never seem to define it, and good luck finding an understandable, non-circular definition. In these notes, if the word “random” is ever used, it stands for “unpredictable” or “unsystematic”.

James Clerk Maxwell is one of the main founders of statistical mechanics and kinetic theory (and electromagnetism), and he is known for being very clear and meticulous with explanations and terminology. Yet he never used the word “random” in his technical writings.

It’s a question for sociology of science why some people keep on using less flexible points of view or terminologies. Probably they just memorize them as students and then a fossilization process sets in.


Finally, some texts speak of the probability of an “event”. For all purposes, an “event” is just what’s expressed in a sentence.