8  Probability inference

Published 2024-02-02

In most engineering and data-science problems we don’t know the truth or falsity of outcomes and hypotheses that interest us. But this doesn’t mean that nothing can be said or done in such situations. Now we shall finally see how to draw uncertain inferences, that is, how to calculate the probability of something that interests us, given particular data, information, and assumptions.

So far we have used the term “probability” somewhat informally and intuitively. It is time to make it more precise and to emphasize some of its most important aspects. Then we’ll dive into the rules of probability-inference.

8.1 When truth isn’t known: probability

When we cross a busy city street we look left and right to check whether any cars are approaching. We typically don’t look up to check whether something is falling from the sky. Yet couldn’t it be false that cars are approaching at that moment? And couldn’t it be true that some object is falling from the sky? Of course both events are possible. Then why do we look left and right, but not up?

The main reason is that we believe strongly that cars might be approaching, and believe very weakly that some object might be falling from the sky. In other words, we consider the first occurrence to be very probable, and the second extremely improbable.

We shall take the notion of probability as intuitively understood (just as we did with the notion of truth). Terms equivalent to “probability” are degree of belief, plausibility, credibility¹.

1 credibility literally means “believability” (from Latin credo = to believe).

Beware of likelihood as a synonym for probability

In everyday language, “likelihood” is a synonym of “probability”. In technical writing about probability or statistics, however, “likelihood” means something different and is not a synonym of “probability”, as we explain below (§ 8.8.1).

Probabilities are quantified between \(0\) and \(1\), or equivalently between \(0\%\) and \(100\%\). Assigning to a sentence a probability 1 is the same as saying that it is true; and a probability 0, that it is false. A probability of \(0.5\) represents a belief completely symmetric with respect to truth and falsity.

Let’s emphasize and agree on some important facts about probabilities:

  • Probabilities are assigned to sentences. We already discussed this point in §  6.3, but let’s reiterate it. Consider an engineer working on a problem of electric-power distribution in a specific geographical region. At a given moment the engineer may believe with \(75\%\) probability that the measured average power output in the next hour will be 100 MW. The \(75\%\) probability is assigned not to the quantity “100 MW”, but to the sentence

    \[ \textsf{\small`The measured average power output in the next hour will be 100\,MW'} \]

    This difference is extremely important. Consider the alternative sentence

    \[ \textsf{\small`The average power output in the next hour will be \emph{set} to 100\,MW'} \]

    the numerical quantity is the same, but the meaning is very different. The probability can therefore be very different (if the engineer is the person deciding how to set that output, then the probability is \(100\%\), because the engineer has decided, and knows, what the output will be). The probability depends not only on a number, but on what is being done with that number – measuring, setting, third-party reporting, and so on. Often we write simply \(O = 100\,\mathrm{MW}\), provided that the full sentence behind this kind of shorthand is understood.

  • Probabilities are agent- and knowledge-dependent. A coin is tossed, comes down heads, and is quickly hidden from view. Alice sees that it landed heads-up. Bob instead doesn’t manage to see the outcome and has no clue. Alice considers the sentence \(\textsf{\small`Coin came down heads'}\) to be true, that is, to have \(100\%\) probability. Bob considers the same sentence to have \(50\%\) probability.

    Note how Alice and Bob assign two different probabilities to the same sentence; yet both assignments are completely rational. If Bob assigned \(100\%\) to \(\textsf{\small`heads'}\), we would suspect that he had seen the outcome after all; if he assigned \(0\%\) to \(\textsf{\small`heads'}\), we would consider that unreasonable (he didn’t see the outcome, so why exclude \(\textsf{\small`heads'}\)?). At the same time we would be baffled if Alice assigned only \(50\%\) to \(\textsf{\small`heads'}\), because she actually saw that the outcome was heads; maybe we would hypothesize that she feels unsure about what she saw.

    An omniscient agent would know the truth or falsity of every sentence, and assign only probabilities 0 or 1. Some authors speak of “actual (but unknown) probabilities”. But if there were “actual” probabilities, they would all be 0 or 1, and it would be pointless to speak about probabilities at all – every inference would be a truth-inference.

  • Probabilities are not frequencies. Consider the fraction of defective mechanical components to total components produced per year in some factory. This quantity can be physically measured and, once measured, would be agreed upon by every agent. It is a frequency, not a degree of belief or probability.

    It is important to understand the difference between probability and frequency: mixing them up may lead to sub-optimal decisions. Later we shall say more about the difference and the precise relations between probability and frequency.

    Frequencies can be unknown to some agents. Probabilities cannot be “unknown”: they can only be difficult to calculate. Be careful when you read authors speaking of an “unknown probability”: they actually mean either “unknown frequency”, or a probability that has to be calculated (it’s “unknown” in the same sense that the value of \(1-0.7 \cdot 0.2/(1-0.3)\) is “unknown” to you right now; see the small check after this list).

  • Probabilities are not physical properties. Whether a tossed coin lands heads up or tails up is fully determined by the initial conditions (position, orientation, momentum, angular momentum) of the toss and the boundary conditions (air velocity and pressure) during the flight. The same is true for all macroscopic engineering phenomena (even quantum phenomena have never been proved to be non-deterministic, and there are deterministic and experimentally consistent mathematical representations of quantum theory). So we cannot measure a probability using some physical apparatus; and the mechanisms underlying any engineering problem boil down to physical laws, not to probabilities.
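
As a tiny concrete illustration of the point about “unknown” probabilities made above: such a value is not unknown in any deep sense, it merely hasn’t been computed yet. A one-line check (in Python; my own illustration, not part of the original text):

```python
# The "unknown" value mentioned in the frequency point above:
# it only needs to be computed, not discovered empirically.
p = 1 - 0.7 * 0.2 / (1 - 0.3)
print(p)  # ≈ 0.8
```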


The points listed above are not just a matter of principle. They have important practical consequences. A data scientist who is not attentive to the source of the data (measured? set? reported, and so maybe less trustworthy?), or who does not carefully assess the context of a probability, or who mixes up probability and frequency, or who does not take advantage (when possible) of the physics involved in the problem – such a data scientist will design systems with sub-optimal performance², or even cause deaths.

2 This fact can be mathematically proven.

8.2 An unsure inference

Consider now the following variation of the trivial inference problem of §  7.1.

This electric component had an early failure. If an electric component fails early, then at production it didn’t pass the heating test or didn’t pass the shock test (or both). The probability that it passed neither test is 10%. There’s no reason to believe that the component passed the heating test any more than that it passed the shock test.

In this case too, the inspector wants to assess whether the component did not pass the heating test.

From the data and information given, what would you say is the probability that the component didn’t pass the heating test?

Exercises
  • Try to argue why a conclusion cannot be drawn with certainty in this case. One way to argue this is to present two different scenarios that fit the given data but have opposite conclusions.

  • Try to reason intuitively and assess the probability that the component didn’t pass the heating test. Should it be larger or smaller than 50%? Why?

8.3 Probability notation

For this inference problem we can’t find a true or false final value. The truth-inference rules (7.1)–(7.4) therefore cannot help us here. In fact even the \(\mathrm{T}(\dotso \nonscript\:\vert\nonscript\:\mathopen{} \dotso)\) notation is unsuitable, because it only admits the values \(1\) (true) and \(0\) (false).

Let us first generalize this notation in a straightforward way:

First, let’s represent the probability or degree of belief of a sentence by a number in the range \([0,1]\), that is, between \(\mathbf{0}\) (impossibility or falsity) and \(\mathbf{1}\) (certainty or truth). The value \(0.5\) represents a belief in the truth of the sentence that is exactly as strong as the belief in its falsity.

Second, let’s symbolically write that the probability of a proposal \(\mathsfit{Y}\), given a conditional \(\mathsfit{X}\), is some number \(p\), as follows:

\[ \mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{X}) = p \]

Note that this notation includes the notation for truth-values as a special case:

\[ \mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{X}) = 0\text{ or }1 \quad\Longleftrightarrow\quad \mathrm{T}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{X}) = 0\text{ or }1 \]

8.4 Inference rules

Extending our truth-inference notation to probability-inference notation has been straightforward. But which rules should we use for drawing inferences when probabilities are involved?

The amazing result is that the rules for truth-inference, formulae (7.1)–(7.4), extend also to probability-inference. The only difference is that they now hold for all values in the range \([0,1]\), rather than for \(0\) and \(1\) only.

This important result was taken more or less for granted at least since Laplace in the 1700s, but it was formally proven for the first time in 1946 by R. T. Cox. The proof has been refined since then. What kind of proof is it? It shows that if we don’t follow the rules, we are doomed to arrive at illogical conclusions; we’ll show some examples later.


Finally, here are the fundamental rules of all inference. They are encoded by the following equations, which must always hold for any atomic or composite sentences \(\mathsfit{X},\mathsfit{Y},\mathsfit{Z}\):

THE FUNDAMENTAL LAWS OF INFERENCE
\(\boldsymbol{\lnot}\) “Not” rule
\[\mathrm{P}(\lnot \mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) + \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) = 1\]
\(\boldsymbol{\land}\) “And” rule
\[ \mathrm{P}(\mathsfit{X}\land \mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) = \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Y}\land \mathsfit{Z}) \cdot \mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) = \mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{X}\land \mathsfit{Z}) \cdot \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) \]
\(\boldsymbol{\lor}\) “Or” rule
\[\mathrm{P}(\mathsfit{X}\lor \mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) = \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) + \mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) - \mathrm{P}(\mathsfit{X}\land \mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) \]
Truth rule
\[\mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{X}\land \mathsfit{Z}) = 1 \]
Note

How to use the rules: Each equality can be rewritten in different ways according to the usual rules of algebra. The resulting left side can then be replaced by the right side, and vice versa. The numerical values of starting inferences can be substituted into the corresponding expressions.
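
To make these abstract equations concrete, here is a minimal Python sketch (an illustration of mine, not part of the course materials) that encodes the four rules as numerical consistency checks, with a small tolerance for round-off:

```python
# The four fundamental rules as numerical consistency checks.
# All names are illustrative; probabilities are plain floats.

TOL = 1e-9  # tolerance for round-off errors

def not_rule(p_X, p_not_X):
    """Not-rule: P(¬X|Z) + P(X|Z) = 1."""
    return abs(p_not_X + p_X - 1) < TOL

def and_rule(p_X_and_Y, p_X_given_YZ, p_Y):
    """And-rule: P(X∧Y|Z) = P(X|Y∧Z) · P(Y|Z)."""
    return abs(p_X_and_Y - p_X_given_YZ * p_Y) < TOL

def or_rule(p_X_or_Y, p_X, p_Y, p_X_and_Y):
    """Or-rule: P(X∨Y|Z) = P(X|Z) + P(Y|Z) − P(X∧Y|Z)."""
    return abs(p_X_or_Y - (p_X + p_Y - p_X_and_Y)) < TOL

def truth_rule(p_X_given_XZ):
    """Truth-rule: P(X|X∧Z) = 1."""
    return abs(p_X_given_XZ - 1) < TOL

# A consistent assignment passes all checks:
assert not_rule(0.3, 0.7)
assert and_rule(0.06, 0.2, 0.3)       # 0.06 = 0.2 · 0.3
assert or_rule(0.54, 0.3, 0.3, 0.06)  # 0.54 = 0.3 + 0.3 − 0.06
assert truth_rule(1.0)
```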

It is amazing that ALL inference is nothing but repeated application of these four rules – billions of times or more in some cases. All machine-learning algorithms are just applications or approximations of these rules. The methods you may have heard about in statistics are also just specific applications of these rules. So are truth-inferences. Most of this course is, at bottom, just a study of how to apply these rules to particular kinds of problems.

Study reading

 


8.5 Solution of the uncertain-inference example

Armed with the fundamental rules of inference, let’s solve our earlier inference problem. As usual we first analyse it, identify its proposal and conditional, and collect the starting inferences given in the problem.

Atomic sentences

\[ \begin{aligned} \mathsfit{h}&\coloneqq\textsf{\small`The component passed the heating test'} \\ \mathsfit{s}&\coloneqq\textsf{\small`The component passed the shock test'} \\ \mathsfit{f}&\coloneqq\textsf{\small`The component had an early failure'} \\ \mathsfit{J}&\coloneqq\textsf{\small (all other implicit background information)} \end{aligned} \]

The background information in this example is different from the previous, truth-inference one, so we use the different symbol \(\mathsfit{J}\) for it.

Proposal, conditional, and target inference

The proposal is \(\lnot\mathsfit{h}\), just like in the truth-inference example.

The conditional is different now. We know that the component failed early, but we don’t know whether it passed the shock test. Hence the conditional is \(\mathsfit{f}\land \mathsfit{J}\).

The target inference is therefore

\[ \color[RGB]{238,102,119}\mathrm{P}(\lnot\mathsfit{h}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J}) \]

Starting inferences

We are told that if an electric component fails early, then at production it didn’t pass the heating test or didn’t pass the shock test (or both). Let’s write this as

\[ \color[RGB]{34,136,51}\mathrm{P}(\lnot\mathsfit{h}\lor \lnot\mathsfit{s}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J}) = 1 \]

We are also told that there is a \(10\%\) probability that the component passed neither test:

\[ \color[RGB]{34,136,51}\mathrm{P}(\lnot\mathsfit{h}\land \lnot\mathsfit{s}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J}) = 0.1 \]

Finally, the problem says that there’s no reason to believe that the component passed the heating test any more than that it passed the shock test. This can be written as follows:

\[ \color[RGB]{34,136,51}\mathrm{P}(\mathsfit{h}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J}) = \mathrm{P}(\mathsfit{s}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J}) \]

Note this interesting situation: we are not given the numerical values of these two probabilities, we are only told that they are equal. This is an example of application of the principle of indifference, which we’ll discuss more in detail later.

Final inference

Also in this case there is no unique way of applying the rules to reach our target inference, but all ways lead to the same result. Let’s try to proceed backwards:

\[ \begin{aligned} &\color[RGB]{238,102,119}\mathrm{P}(\lnot\mathsfit{h}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J})&& \\[1ex] &\qquad= {\color[RGB]{34,136,51}\mathrm{P}(\lnot\mathsfit{s}\lor \lnot\mathsfit{h}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J})} + {\color[RGB]{34,136,51}\mathrm{P}(\lnot\mathsfit{s}\land \lnot\mathsfit{h}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J})} - \mathrm{P}(\lnot\mathsfit{s}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J}) &&\text{\small ∨-rule} \\[1ex] &\qquad= {\color[RGB]{34,136,51}1} + {\color[RGB]{34,136,51}0.1} - \mathrm{P}(\lnot\mathsfit{s}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J}) &&\text{\small starting inferences} \\[1ex] &\qquad= 0.1 + \color[RGB]{34,136,51}\mathrm{P}(\mathsfit{s}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J}) &&\text{\small ¬-rule} \\[1ex] &\qquad= 0.1 + \mathrm{P}(\mathsfit{h}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J}) &&\text{\small starting inference} \\[1ex] &\qquad= 0.1 + 1 -\color[RGB]{238,102,119}\mathrm{P}(\lnot\mathsfit{h}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J}) &&\text{\small ¬-rule} \end{aligned} \]

The target probability appears on the left and right side with opposite signs. We can solve for it:

\[ \begin{aligned} 2\,{\color[RGB]{238,102,119}\mathrm{P}(\lnot\mathsfit{h}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J})} &= 0.1 + 1 \\[1ex] {\color[RGB]{238,102,119}\mathrm{P}(\lnot\mathsfit{h}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J})} &= 0.55 \end{aligned} \]

So the probability that the component didn’t pass the heating test is \(55\%\).
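
As a check on the algebra above (a sketch of mine, not part of the original solution), the same derivation can be done symbolically with sympy, keeping the probability of failing both tests as a parameter \(q\); it also comes in handy for the exercises below:

```python
# Symbolic re-derivation of the solution. Unknowns: P(¬h|f∧J) and
# P(¬s|f∧J); parameter: q = P(¬h∧¬s|f∧J) (0.1 in the problem).
import sympy as sp

p_not_h, p_not_s, q = sp.symbols('p_not_h p_not_s q')

equations = [
    # Or-rule combined with the starting inference P(¬h∨¬s|f∧J) = 1:
    sp.Eq(1, p_not_h + p_not_s - q),
    # Indifference P(h|f∧J) = P(s|f∧J), rewritten with the not-rule:
    sp.Eq(1 - p_not_h, 1 - p_not_s),
]

solution = sp.solve(equations, [p_not_h, p_not_s], dict=True)[0]
print(solution[p_not_h])                              # q/2 + 1/2
print(solution[p_not_h].subs(q, sp.Rational(1, 10)))  # 11/20 = 0.55
```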

Exercises
  • Try to find an intuitive explanation of why the probability is 55%, slightly larger than 50%. If your intuition says this probability is wrong, then

    • Check the proof of the inference for mistakes, or try to find a proof with a different path.
    • Examine your intuition critically and educate it.
  • Check how the target probability \(\mathrm{P}(\lnot\mathsfit{h}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J})\) changes if we change the value of the probability \(\mathrm{P}(\lnot\mathsfit{s}\land \lnot\mathsfit{h}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J})\) from \(0.1\).

    • What result do we obtain if \(\mathrm{P}(\lnot\mathsfit{s}\land \lnot\mathsfit{h}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J})=0\)? Can it be intuitively explained?
    • What if \(\mathrm{P}(\lnot\mathsfit{s}\land \lnot\mathsfit{h}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J})=1\)? Does the result make sense?

8.6 How the inference rules are used

In the solution above you will have noticed that the equations of the fundamental rules are not used only to obtain some of the probabilities appearing in them from the remaining probabilities.

The rules represent, first of all, constraints of logical consistency³ among probabilities. For instance, if we have probabilities  \(\mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{X}\land \mathsfit{Z})=0.1\),  \(\mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{Z})=0.7\),  and \(\mathrm{P}(\mathsfit{X}\land \mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{Z})=0.2\),  then there’s an inconsistency somewhere, because these values violate the and-rule:  \(0.2 \ne 0.1 \cdot 0.7\).  In this case we must find the inconsistency and resolve it. However, since probabilities are quantified by real numbers, slight discrepancies within numerical round-off errors are possible and acceptable.

3 The technical term is coherence.
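
Such an inconsistency can be detected mechanically. A minimal check (again my own illustration):

```python
# Checking the three probabilities above against the and-rule.
p_Y_given_XZ = 0.1  # P(Y|X∧Z)
p_X = 0.7           # P(X|Z)
p_X_and_Y = 0.2     # P(X∧Y|Z)

# And-rule: P(X∧Y|Z) must equal P(Y|X∧Z) · P(X|Z)
expected = p_Y_given_XZ * p_X
print(expected)                          # ≈ 0.07, not 0.2
print(abs(p_X_and_Y - expected) < 1e-9)  # False: incoherent
```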

The rules also imply more general constraints. For example we must always have

\[ \begin{gathered} \mathrm{P}(\mathsfit{X}\land \mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{Z}) \le \min\set[\big]{\mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{Z}),\ \mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{Z})} \\ \mathrm{P}(\mathsfit{X}\lor \mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{Z}) \ge \max \set[\big]{\mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{Z}),\ \mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{Z})} \end{gathered} \]

Exercise

Try to prove the two constraints above.
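
The proof itself is left to you. As a complementary numerical sanity check (of mine, not a substitute for the proof), one can sample assignments that are coherent by construction – built with the and- and or-rules – and confirm that the two bounds hold:

```python
# Sample coherent probability assignments and verify the two bounds.
# Not a proof, just a numerical sanity check.
import random

random.seed(0)
for _ in range(10_000):
    p_X = random.random()             # P(X|Z)
    p_Y_given_XZ = random.random()    # P(Y|X∧Z)
    p_X_and_Y = p_Y_given_XZ * p_X    # and-rule
    # Any P(Y|Z) coherent with the values above:
    p_Y = random.uniform(p_X_and_Y, p_X_and_Y + 1 - p_X)
    p_X_or_Y = p_X + p_Y - p_X_and_Y  # or-rule
    assert p_X_and_Y <= min(p_X, p_Y) + 1e-12
    assert p_X_or_Y >= max(p_X, p_Y) - 1e-12
print("bounds held for all sampled coherent assignments")
```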

8.7 Consequences of not following the rules

The fundamental rules of inference guarantee that an agent’s uncertain reasoning is self-consistent, and that it reduces to ordinary logic when there’s no uncertainty. Breaking the rules means that the resulting inferences contain logical inconsistencies or irrationalities.

There are many examples of inconsistencies that appear when the rules are broken. Imagine for instance an agent that assigns an 80% probability to rain (above 1 mm) in the next hour, and at the same time a 90% probability to rain (above 1 mm) together with average wind above 3 m/s in the next hour. This is clearly unreasonable: the rain scenario alone covers wind above 3 m/s as well as wind below 3 m/s, so it should be at least as probable as the scenario where the wind is also above 3 m/s. In fact those two probabilities break the and-rule.

Exercise

Prove that the two probabilities in the example above break the and-rule. (Hint: you must use the fact that probabilities are numbers between 0 and 1, and that multiplying a number by something between 0 and 1 cannot yield a larger number.)
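
A quick numerical way to see the clash (this complements the requested proof, it doesn’t replace it):

```python
# The agent's two assignments from the rain example above.
p_rain = 0.8           # P(rain above 1 mm | I)
p_rain_and_wind = 0.9  # P(rain above 1 mm ∧ wind above 3 m/s | I)

# By the and-rule, P(rain∧wind|I) = P(wind|rain∧I) · P(rain|I),
# so the implied conditional probability would have to be:
implied = p_rain_and_wind / p_rain
print(implied)  # 1.125: greater than 1, so not a probability
```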

Study reading

8.8 Remarks on terminology and notation

Likelihood

In everyday language, “likely” is often a synonym of “probable”; and “likelihood”, of “probability”. But in technical writings about probability, inference, and decision-making, “likelihood” has a very different meaning. Beware of this important difference in definition:

\(\mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{X})\) is:

  • the probability of \(\mathsfit{Y}\) given \(\mathsfit{X}\) (or conditional on \(\mathsfit{X}\)),

  • the likelihood of \(\mathsfit{X}\) in view of \(\mathsfit{Y}\).

Let’s express this also in a different way:

  • the probability of \(\mathsfit{Y}\) given \(\mathsfit{X}\), is \(\mathrm{P}({\color[RGB]{68,119,170}\mathsfit{Y}}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{X})\).

  • the likelihood of \(\mathsfit{Y}\) in view of \(\mathsfit{X}\), is \(\mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{}{\color[RGB]{68,119,170}\mathsfit{Y}})\).

Important

A priori there is no relation between the probability and the likelihood of a sentence \(\mathsfit{Y}\): this sentence could have very high probability and very low likelihood, and vice versa.

In these notes we’ll avoid the possibly confusing term “likelihood”. All we need to express can be phrased in terms of probability.

Omitting background information

In the analyses of the inference examples of §  7.1 and §  8.2 we defined sentences (\(\mathsfit{I}\) and \(\mathsfit{J}\)) expressing all background information, and always included these sentences in the conditionals of the inferences – because those inferences obviously depended on that background information.

In many concrete inference problems the background information stays in the conditional from beginning to end, while the other sentences jump around between conditional and proposal as we apply the rules of inference. For this reason the background information is often omitted from the notation and left implicitly understood. For instance, if the background information is denoted \(\mathsfit{I}\), one writes

  • \(\mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{X})\)  instead of  \(\mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{X}\land \mathsfit{I})\)

  • \(\mathrm{P}(\mathsfit{Y})\)  instead of  \(\mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{I})\)

This is what’s happening when you see probabilities such as \(P(x)\), written without a conditional, in other texts.

Such practice may be convenient, but be wary of it, especially in particular situations:

  • In some inference problems we suddenly realize that we must distinguish between cases that depend on hypotheses, say \(\mathsfit{H}_1\) and \(\mathsfit{H}_2\), that were buried in the background information \(\mathsfit{I}\). If the background information \(\mathsfit{I}\) is explicitly reported in the notation, this is no problem: we can rewrite it as

    \[ \mathsfit{I}= (\mathsfit{H}_1 \lor \mathsfit{H}_2) \land \mathsfit{I}'\]

    and proceed, for example using the rule of extension of the conversation. If the background information is not explicitly written, this may lead to confusion and mistakes. For instance, two instances of \(\mathrm{P}(\mathsfit{X})\) with different values may suddenly appear, just because one of them is invisibly conditional on \(\mathsfit{I}\), the other on \(\mathsfit{I}'\).

  • In some inference problems we are considering several different instances of background information – for example because more than one agent is involved. It’s then extremely important to write the background information explicitly, lest we mix up the different agents’ degrees of belief.

For the extra curious

A once-famous paper published in the quantum-theory literature arrived at completely wrong results simply by omitting background information, mixing up probabilities having different conditionals.

This kind of confusion from poor notation happens more often than one thinks, and even appears in the scientific literature.

“Random variables”

Some texts speak of the probability of a “random variable”, or more precisely of the probability “that a random variable takes on a particular value”. As you notice, we have just expressed that idea by means of a sentence. The viewpoint and terminology of random variables is therefore a special case of that based on sentences, which we use here.

The dialect of “random variables” does not offer any advantages in concepts, notation, terminology, or calculations, and it has several shortcomings:

  • As discussed in §  8.1, in concrete applications it is important to know how a quantity “takes on” a value: for example it could be directly measured, indirectly reported, or purposely set to that specific value. Thinking and working in terms of sentences, rather than of random variables, allows us to account for these important differences.

  • We want a general AI agent to be able to deal with uncertainty and probability even in situations that do not involve mathematical sets.

  • Very often the object (proposal) of a probability is not a “variable”: it is actually a constant value that is simply unknown (simple example: we are uncertain about the mass of a particular block of concrete, so we speak of the probability of some mass value; this doesn’t mean that the mass of the block is changing).

  • What does “random” (or “chance”) mean? Good luck finding an understandable and non-circular definition in texts that use that word; strangely enough, they never define it. In these notes, if the word “random” is ever used, it stands for “unpredictable” or “unsystematic”.

    (An aside: James Clerk Maxwell is one of the main founders of statistical mechanics and kinetic theory, and of electromagnetism. Yet he never used the word “random” in his technical writings. Maxwell is known for being very clear and meticulous with explanations and terminology.)

It’s a question for sociology of science why some people keep on using less flexible points of view or terminologies. Probably they just memorize them as students and then a fossilization process sets in.


Finally, some texts speak of the probability of an “event”. For all purposes an “event” is just what’s expressed in a sentence.