In most engineering and data-science problems we don’t know the truth or falsity of outcomes and hypotheses that interest us. But this doesn’t mean that nothing can be said or done in such situations. Now we shall finally see how to draw uncertain inferences, that is, how to calculate the probability of something that interests us, given particular data, information, and assumptions.
So far we have used the term “probability” somewhat informally and intuitively. It is time to make it more precise and to emphasize some of its most important aspects. Then we’ll dive into the rules of probability-inference.
8.1 When truth isn’t known: probability
When we cross a busy city street we look left and right to check whether any cars are approaching. We typically don’t look up to check whether something is falling from the sky. Yet couldn’t it be false that cars are approaching at that moment? And couldn’t it be true that some object is falling from the sky? Of course both events are possible. Then why do we look left and right, but not up?
The main reason is that we believe strongly that cars might be approaching, and believe very weakly that some object might be falling from the sky. In other words, we consider the first occurrence to be very probable, and the second extremely improbable.
We shall take the notion of probability as intuitively understood (just as we did with the notion of truth). Terms equivalent to “probability” are degree of belief, plausibility, credibility1, certainty.
1 credibility literally means “believability” (from Latin credo = to believe).
Probabilities are quantified between \(0\) and \(1\), or equivalently between \(0\%\) and \(100\%\). Assigning to a sentence a probability 1 is the same as saying that it is true; and a probability 0, that it is false. A probability of \(0.5\) represents a belief completely symmetric with respect to truth and falsity.
Alternatively, if an agent assigns to a sentence a probability 1, it means that the agent is completely certain that the sentence is true. If the agent assigns a probability 0, it means that the agent is completely certain that the sentence is false. If the agent assigns a probability 0.5, it means that the agent is equally uncertain about the truth as about the falsity of the sentence.
Let’s emphasize and agree on some important facts about probabilities:
Probabilities are assigned to sentences. We already discussed this point in § 6.3, but let’s reiterate it. Consider an engineer working on a problem of electric-power distribution in a specific geographical region. At a given moment the engineer may believe with \(75\%\) probability that the measured average power output in the next hour will be 100 MW. The \(75\%\) probability is assigned not to the quantity “100 MW”, but to the sentence
\[
\textsf{\small`The measured average power output in the next hour will be 100\,MW'}
\]
This difference is extremely important. Consider the alternative sentence
\[
\textsf{\small`The average power output in the next hour will be \emph{set} to 100\,MW'}
\]
the numerical quantity is the same, but the meaning is very different. The probability can therefore be very different. If the engineer is the person who decides how to set that output, and has decided to set it to 100 MW, then the probability is obviously \(100\%\) (or very close to it), because the engineer already knows what the output will be. The probability depends not only on a number, but on what is being done with that number: measuring, setting, third-party reporting, and so on. Often we write simply “\(O \mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}10\,\mathrm{W}\)”, provided that the full sentence behind this shorthand is understood.
Probabilities are agent- and knowledge-dependent. A coin is tossed, comes down heads, and is quickly hidden from view. Alice sees that it landed heads-up. Bob instead doesn’t manage to see the outcome and has no clue. Alice considers the sentence \(\textsf{\small`Coin came down heads'}\) to be true, that is, to have \(100\%\) probability. Bob considers the same sentence to have \(50\%\) probability.
Note how Alice and Bob assign two different probabilities to the same sentence; yet both assignments are completely rational. If Bob assigned \(100\%\) to \(\textsf{\small`heads'}\), we would suspect that he had seen the outcome after all. If he assigned \(0\%\) to \(\textsf{\small`heads'}\), we would consider it unreasonable (he didn’t see the outcome, so why exclude \(\textsf{\small`heads'}\)?). At the same time we would be baffled if Alice assigned only \(50\%\) to \(\textsf{\small`heads'}\), because she actually saw that the outcome was heads; maybe we would wonder whether she feels unsure about what she saw.
An omniscient agent would know the truth or falsity of every sentence, and assign only probabilities 0 or 1. Some authors speak of “actual (but unknown) probabilities”. But if there were “actual” probabilities, they would all be 0 or 1, and it would be pointless to speak about probabilities at all – every inference would be a truth-inference.
Probabilities are not frequencies. Consider the fraction of defective mechanical components to total components produced per year in some factory. This quantity can be physically measured and, once measured, would be agreed upon by every agent. It is a frequency, not a degree of belief or probability.
It is important to understand the difference between probability and frequency: mixing them up may lead to sub-optimal decisions. Later we shall say more about the difference and the precise relations between probability and frequency.
Frequencies can be unknown to some agents. Probabilities cannot be “unknown”: they can only be difficult to calculate. Be careful when you read authors speaking of an “unknown probability”: they actually mean either “unknown frequency”, or a probability that has to be calculated (it’s “unknown” in the same sense that the value of \(1-0.7 \cdot 0.2/(1-0.3)\) is “unknown” to you right now).
Probabilities are not physical properties. Whether a tossed coin lands heads up or tails up is fully determined by the initial conditions (position, orientation, momentum, rotational momentum) of the toss and the boundary conditions (air velocity and pressure) during the flight. The same is true for all macroscopic engineering phenomena (even quantum phenomena have never been proved to be non-deterministic, and there are deterministic and experimentally consistent mathematical representations of quantum theory). So we cannot measure a probability using some physical apparatus; and the mechanisms underlying any engineering problem boil down to physical laws, not to probabilities.
These points listed above are not just a matter of principle. They have important practical consequences. A data scientist who is not attentive to the source of the data (measured? set? reported, and so maybe less trustworthy?), or who does not carefully assess the context of a probability, or who mixes a probability with a frequency, or who does not take advantage (when possible) of the physics involved in a problem – such a data scientist will design systems with sub-optimal performance2, or even cause deaths.
2 This fact can be mathematically proven.
8.2 The many uses of the word “probability”
In these notes we shall consistently use the term “probability” in the sense explained above. But beware that this term is used in many different and incompatible senses, depending on whom you’re speaking with or which literature you’re reading.
Some people use this term in the sense of “frequency”: the number of times something happened in a series of repetitions. But a frequency is an objective, measurable quantity; it doesn’t depend on the knowledge of an agent. So it is not useful in building an AI that can reason about uncertainty, because it doesn’t quantify the belief or certainty of an agent. Suppose a coin is tossed 100 times, and it comes up heads 80 times. The frequency of heads is 80/100. Now suppose that an agent who has no knowledge about the 100 tosses and their outcomes is asked to predict whether a toss of the same coin will be heads or tails. What should the agent’s degree of belief or of certainty be? Obviously it should be 50%/50%. If we were to program the agent so that it has a degree of belief of 80% for heads in situations where nothing is known about previous tosses (because that’s the situation our agent was in), then such an agent would on average make poor inferences and decisions in dealing with new coins.
But frequency is data, and if a frequency is known, then obviously an agent should take it into account in quantifying a credibility. If an agent knows that the coin came up heads 80 times in 100, then it is reasonable that the agent’s degree of belief for the next toss should be around 80% for heads. And we shall see that this is indeed what happens.
So the distinction between “frequency” and “probability” is crucial. Frequencies do not enable us to quantify an agent’s beliefs in situations where data are missing. But an agent’s beliefs will automatically come close to frequencies – when doing so is reasonable.
Some people use “probability” in the sense of the number of a-priori successes over the number of possibilities. For instance, if you roll a die and it comes up showing a particular face, then this result was 1 among six possible results. This is fine, but this definition does not allow us to capture the difference between an agent who has seen the outcome of the roll and an agent who has not seen it. We expect the two agents to behave differently, because they have different knowledge. If we programmed the agent to have a degree of belief of 1/6 even after seeing the outcome of a die roll, we would have programmed an agent incapable of using new data. Would you be happy if a clinician made some medical tests, and then ignored the results of the tests, behaving as if they were still unknown?
All these different uses are just a matter of semantics, and in the end it doesn’t matter which word we use, as long as we understand its meaning and as long as we’re adopting the meaning that is useful for our present task.
Beware of likelihood as a synonym for probability
In everyday language, “likelihood” is a synonym of “probability”. In technical writings about probability or statistics, however, “likelihood” means something different and is not a synonym of “probability”, as we explain below (§ 8.8.1).
8.3 An unsure inference. Probability notation
Consider now the following variation of the trivial inference problem of § 7.1.
This electric component had an early failure. If an electric component fails early, then at production either it didn’t pass the heating test or it didn’t pass the shock test. The probability that it passed neither test (that is, that it failed both) is 10%. There’s no reason to believe that the component passed the heating test, more than to believe that it passed the shock test.
Again the inspector wants to assess whether the component did not pass the heating test.
From the data and information given, what would you say is the probability that the component didn’t pass the heating test?
Exercise 1
Try to argue why a conclusion cannot be drawn with certainty in this case. One way to argue this is by presenting two different scenarios that fit the given data but have opposite conclusions.
Try to reason intuitively and assess the probability that the component didn’t pass the heating test. Should it be larger or smaller than 50%? Why?
For this inference problem we cannot find a true or false final value. The truth-inference rules (7.1)–(7.4) therefore cannot help us here. In fact even the “\(\mathrm{T}(\dotso \nonscript\:\vert\nonscript\:\mathopen{} \dotso)\)” notation is unsuitable, because it only admits the values \(1\) (true) and \(0\) (false).
Let us first generalize this notation in a straightforward way:
First, let’s represent the probability or degree of belief of a sentence by a number in the range \([0,1]\), that is, between \(\mathbf{0}\) (impossibility, or falsity) and \(\mathbf{1}\) (certainty, or truth). The value \(0.5\) represents a belief in the truth of the sentence which is as strong as the belief in its falsity.
Second, let’s symbolically write in the following way that the probability of a proposal \(\mathsfit{Y}\), given a conditional \(\mathsfit{X}\), is some number \(p\):
\[
\mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{X}) = p
\]
Note that this notation includes the notation for truth-values as a special case:
\[
\mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{X}) = 0\text{ or }1
\quad\Longleftrightarrow\quad
\mathrm{T}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{X}) = 0\text{ or }1
\]
8.4 Inference rules
Extending our truth-inference notation to probability-inference notation has been straightforward. But which rules should we use for drawing inferences when probabilities are involved?
The amazing result is that the rules for truth-inference, formulae (7.1)–(7.4), extend also to probability-inference. The only difference is that they now hold for all values in the range \([0,1]\), rather than only for \(0\) and \(1\).
This important result was taken more or less for granted at least since Laplace in the 1700s, but was formally proven for the first time in 1946 by R. T. Cox. The proof has been refined since then. What kind of proof is it? It shows that if we don’t follow the rules we are doomed to arrive at illogical conclusions; we’ll show some examples later.
Finally, here are the fundamental rules of all inference. They are encoded by the following equations, which must always hold for any atomic or composite sentences \(\mathsfit{X},\mathsfit{Y},\mathsfit{Z}\):
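In the probability notation introduced above, the four rules are the standard ones of the probability calculus (the names below – not-rule, and-rule, or-rule, truth-rule – are used again later in this section):

\[
\begin{aligned}
&\text{not-rule:} && \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) + \mathrm{P}(\lnot\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) = 1
\\[1ex]
&\text{and-rule:} && \mathrm{P}(\mathsfit{X}\land \mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) = \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Y}\land \mathsfit{Z}) \cdot \mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) = \mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{X}\land \mathsfit{Z}) \cdot \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z})
\\[1ex]
&\text{or-rule:} && \mathrm{P}(\mathsfit{X}\lor \mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) = \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) + \mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) - \mathrm{P}(\mathsfit{X}\land \mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z})
\\[1ex]
&\text{truth-rule:} && \mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{X}\land \mathsfit{Z}) = 1
\end{aligned}
\]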
How to use the rules:
Each equality can be rewritten in different ways according to the usual rules of algebra. The resulting left side can then be replaced by the right side, and vice versa. The numerical values of the starting inferences can be substituted into the corresponding expressions.
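For instance (an illustrative rearrangement, not an additional rule), dividing both sides of the and-rule by \(\mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z})\), when this probability is not zero, gives

\[
\mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Y}\land \mathsfit{Z}) = \frac{\mathrm{P}(\mathsfit{X}\land \mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z})}{\mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z})}
\]

and either side of this new equality can replace the other in a derivation.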
It is amazing that ALL inference is nothing else but a repeated application of these four rules – maybe billions of times or more. All machine-learning algorithms are just applications or approximations of these rules. Methods that you may have heard about in statistics are just specific applications of these rules. Truth inferences are also special applications of these rules. Most of this course is just a study of how to apply these rules to particular kinds of problems.
In particular, the fundamental inference rules are used in exactly the same way as their truth-inference counterparts (7.1)–(7.4).
8.5 Solution of the uncertain-inference example
Armed with the fundamental rules of inference, let’s solve our earlier inference problem. As usual, we first analyse it and represent it in terms of atomic sentences; we identify its proposal and conditional; and we find which initial inferences are given in the problem.
(1) Atomic sentences
\[
\begin{aligned}
\mathsfit{h}&\coloneqq\textsf{\small`The component passed the heating test'}
\\
\mathsfit{s}&\coloneqq\textsf{\small`The component passed the shock test'}
\\
\mathsfit{f}&\coloneqq\textsf{\small`The component had an early failure'}
\\
\mathsfit{J}&\coloneqq\textsf{\small (all other implicit background information)}
\end{aligned}
\]
The background information in this example is different from the previous, truth-inference one, so we use the different symbol \(\mathsfit{J}\) for it.
The conditional is different now. We know that the component failed early, but we don’t know whether it passed the shock test. Hence the conditional is \(\mathsfit{f}\land \mathsfit{J}\).
We are told that if an electric component fails early, then at production it didn’t pass the heating test or the shock test (possibly neither). This is given as a sure fact. Let’s write it as

\[
\mathrm{P}(\lnot\mathsfit{h}\lor \lnot\mathsfit{s}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J}) = 1
\]

We are also told that the probability that the component passed neither test is 10%:

\[
\mathrm{P}(\lnot\mathsfit{h}\land \lnot\mathsfit{s}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J}) = 0.1
\]
Finally the problem says that there’s no reason to believe that the component didn’t pass the heating test more than that it didn’t pass the shock test. This can be written as follows:

\[
\mathrm{P}(\lnot\mathsfit{h}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J}) = \mathrm{P}(\lnot\mathsfit{s}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J})
\qquad\text{or, equivalently by the not-rule,}\qquad
\mathrm{P}(\mathsfit{h}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J}) = \mathrm{P}(\mathsfit{s}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J})
\]
Note the interesting situation above: we are not given the numerical values of these two probabilities; we are only told that they are equal. This is an example of application of the principle of indifference, which we’ll discuss more in detail later.
Finding the target inference
Also in this case there is no unique way of applying the rules to reach our target inference, but all paths will lead to the same result. Let’s try to proceed backwards:
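Here is one possible path, sketched compactly (other orderings of the rule applications lead to the same value). The target probability \(\mathrm{P}(\lnot\mathsfit{h}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J})\) appears on the right side of the or-rule applied to the known probability \(\mathrm{P}(\lnot\mathsfit{h}\lor \lnot\mathsfit{s}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J})\):

\[
\mathrm{P}(\lnot\mathsfit{h}\lor \lnot\mathsfit{s}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J})
= \mathrm{P}(\lnot\mathsfit{h}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J})
+ \mathrm{P}(\lnot\mathsfit{s}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J})
- \mathrm{P}(\lnot\mathsfit{h}\land \lnot\mathsfit{s}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J})
\]

The left side equals \(1\), the last term on the right equals \(0.1\), and the two remaining terms are equal to each other by the indifference assumption. Substituting these values and solving for the target:

\[
1 = 2\,\mathrm{P}(\lnot\mathsfit{h}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J}) - 0.1
\quad\Longrightarrow\quad
\mathrm{P}(\lnot\mathsfit{h}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J}) = \frac{1 + 0.1}{2} = 0.55
\]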
So the probability that the component didn’t pass the heating test is \(55\%\).
Exercise 2
Try to find an intuitive explanation of why the probability is 55%, slightly larger than 50%. If your intuition says this probability is wrong, then:
Check the proof of the inference for mistakes, or try to find a proof with a different path.
Examine your intuition critically and educate it.
Check how the target probability \(\mathrm{P}(\lnot\mathsfit{h}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J})\) changes if we change the value of the probability \(\mathrm{P}(\lnot\mathsfit{s}\land \lnot\mathsfit{h}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J})\) from \(0.1\).
What result do we obtain if \(\mathrm{P}(\lnot\mathsfit{s}\land \lnot\mathsfit{h}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J})=0\)? Can it be intuitively explained?
What if \(\mathrm{P}(\lnot\mathsfit{s}\land \lnot\mathsfit{h}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{f}\land \mathsfit{J})=1\)? Does the result make sense?
8.6 Use and implementation of the inference rules
In the step-wise solution above you may have noticed that the equations of the fundamental rules were not used only to calculate the probability on their left side from those on their right side. In the first step, for example, the or-rule was applied to a probability whose value was already known, in order to pin down an unknown probability appearing on its right side.
The four rules in fact represent, first of all, constraints of logical consistency (the precise technical term is coherence) among probabilities. For instance, if we have the probabilities

\[
\mathrm{P}(\mathsfit{X}\land \mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) = 0.2 \,,\qquad
\mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Y}\land \mathsfit{Z}) = 0.1 \,,\qquad
\mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) = 0.7
\]

then there must be an inconsistency in the agent’s degrees of belief, because these values violate the and-rule:
\[0.2 \ne 0.1 \cdot 0.7 \,.\]
When this happens, the inconsistency must be found and solved. (However, since probabilities are quantified by real numbers, it’s possible and acceptable to have slight discrepancies within numerical round-off errors.)
The fundamental rules also imply more general constraints. For example we must always have

\[
\mathrm{P}(\mathsfit{X}\land \mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) \le \min\bigl[\mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}),\, \mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z})\bigr]
\qquad\text{and}\qquad
\mathrm{P}(\mathsfit{X}\lor \mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) \ge \max\bigl[\mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}),\, \mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z})\bigr]
\]
Try to prove the two constraints above from the four fundamental rules.
Translate the two constraints above verbally. Do they make intuitive sense? (For example, the first constraint says, very roughly speaking, that our belief in two things can’t be stronger than our belief in either one of the two things alone.)
In following the step-wise solution you may have been asking yourself after some steps: “OK, why this peculiar step now, and not some other step?”. This is a very intelligent question. You essentially noticed the lack of an algorithm. Compare this situation with the solution of a basic decision problem: there we had a clear sequence of steps: write down some tables, do some element-wise multiplications and then some sums, and finally find the largest of a list of numbers.
Is there an algorithmic way of drawing inferences, that is, of calculating some target probabilities given some initial ones? If our inference problem involves a finite number of sentences, then the answer is yes, but with some caveats.
Given a set of initial probabilities, and a target probability whose value we want to find, there is an algorithm that yields the minimum and the maximum possible values of the target probability. Specifically it can yield the following results (note that some are particular instances of others):
For the extra curious
The complete proof that inference problems can be solved this way seems to have been given first by Hailperin, although particular cases were explored already by Boole.
No values
This means that the initial probabilities have inconsistent values; that is, they violate some of the four rules, as in the “\(0.2 \ne 0.1 \cdot 0.7\)” example above.
\(0\) and \(1\)
That is, the target probability is completely unspecified. This means that the initial probabilities are not sufficient to determine the target probability. For example, if we have \(\mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) = 0.2\), and \(\mathsfit{Y}\) is a sentence completely unrelated to \(\mathsfit{X}\) (no atomic sentences in common), then \(\mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z})\) could have any value.
Values \(v\), \(V\) with \(v < V\) and at least one different from 0 or 1
That is, the target probability cannot have just any value whatsoever, but its value is still not fully determined. This means that the initial probabilities constrain the target probability somewhat, but do not determine it. The two constraints discussed above are an example of this: if we have \(\mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) = 0.2\) and \(\mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) = 0.3\), then we can say that \(0 \le \mathrm{P}(\mathsfit{X}\land \mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) \le 0.2\), but its precise value is not determined.
One value \(p\)
The target probability is completely determined by the initial ones. This occurred in our example inference about the electronic component.
This algorithm is not mathematically difficult, but it is somewhat involved. It boils down to: (1) writing the sentences in the initial and target probabilities in disjunctive normal form; (2) rewriting the initial and target probabilities as sums of probabilities of basic conjunctions; (3) solving two linear-fractional optimization problems, equivalent to linear optimization ones (for which there are algorithms available).
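As a rough illustration of step (3), here is a minimal sketch (it is not the implementation of the findP() function introduced next) that bounds the target probability of the two-sentence example above, \(\mathrm{P}(\mathsfit{X}\land \mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z})\) given \(\mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) = 0.2\) and \(\mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{Z}) = 0.3\), using the lpSolve package directly:

```r
## Minimal illustrative sketch, not the findP() implementation.
## Unknowns: probabilities of the four basic conjunctions, all conditional on Z:
##   q1 = P(X & Y), q2 = P(X & !Y), q3 = P(!X & Y), q4 = P(!X & !Y)
library('lpSolve')

objective <- c(1, 0, 0, 0)          # the target P(X & Y | Z) is q1
constraints <- rbind(
  c(1, 1, 1, 1),                    # q1 + q2 + q3 + q4 = 1  (normalization)
  c(1, 1, 0, 0),                    # q1 + q2 = P(X | Z)
  c(1, 0, 1, 0)                     # q1 + q3 = P(Y | Z)
)
dirs <- rep('=', 3)
rhs <- c(1, 0.2, 0.3)

## lp() keeps all unknowns non-negative by default
lower <- lp('min', objective, constraints, dirs, rhs)$objval
upper <- lp('max', objective, constraints, dirs, rhs)$objval
c(min = lower, max = upper)
```

The two optimizations together should report the interval \([0, 0.2]\), matching the bounds stated earlier.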
You can find an implementation of the algorithm above in the R function findP(), which requires the lpSolve R package. This function takes as first argument the probability we want to find, and as remaining arguments the values of the given probabilities, or equalities among probabilities. We must use the following notation:
\(\lnot\) becomes !
\(\land\) becomes & or equivalently &&
\(\lor\) becomes | (sadly) or equivalently ||
\(=\) becomes ==
conditional bar \(\nonscript\:\vert\nonscript\:\mathopen{}\) becomes ~
The output consists of the minimum and maximum possible values of the target probability, or NA if the starting probabilities are inconsistent.
We can apply this function in the example above about the electronic component:
```r
library('lpSolve')
source('code/findP.R') # load the function

findP(
  ## target probability:
  P(!h ~ f & J),
  ## initial probabilities:
  P(!h || !s ~ f & J) == 1,
  P(!h & !s ~ f & J) == 0.1,
  P(h ~ f & J) == P(s ~ f & J)
)
```

```
 min  max 
0.55 0.55 
```
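findP() can also be used when the initial probabilities only bound the target, as in the “values \(v\), \(V\)” case listed above. For instance (an illustrative call, with made-up sentences \(\mathsfit{a}\), \(\mathsfit{b}\) and background information \(\mathsfit{K}\)):

```r
findP(
  ## target probability:
  P(a & b ~ K),
  ## initial probabilities:
  P(a ~ K) == 0.2,
  P(b ~ K) == 0.3
)
```

This call should report a minimum of 0 and a maximum of 0.2.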
Obviously the algorithm becomes more expensive, the larger the number of sentences in the inference problem. In inference problems involving continuous quantities, such as energy, such an algorithm cannot be applied in practice. Also, the way the algorithm above works cannot be represented as a sequence of “logical” steps with consecutive applications of the rules, as shown in the step-wise solution above. Later on in this course we shall focus on specific kinds of inferences for which other, less opaque inference algorithms are available.
Exercise 3
Play with the function findP(): test it in other inference problems, find problems where there is no solution and others where the min and max values are different.
What does this result tell us about “combining evidence” to prove a hypothesis?
8.7 Consequences of not following the rules
The fundamental rules of inference guarantee that the agent’s uncertain reasoning is self-consistent, and that it follows logic when there’s no uncertainty. Breaking the rules means that the resulting inference has some logical or irrational inconsistencies.
There are many examples of inconsistencies that appear when the rules are broken. Imagine for instance an agent that gives an 80% probability that it rains3 in the next hour; and it also gives a 90% probability that it rains and that the average wind is above 3 m/s in the next hour. This is clearly unreasonable, because the raining scenario alone covers both the case of wind above 3 m/s and the case of wind below 3 m/s – so it should be at least as probable as the scenario where it rains and the wind is above 3 m/s. And indeed the two given probabilities break the and-rule, showing that they are unreasonable or illogical.
3 to be precise, let’s say “it rains above 1 mm”
Exercise 4
Prove that the two probabilities in the example above break the and-rule. (Hint: you must use the fact that probabilities are numbers between 0 and 1, and that multiplying a number by something between 0 and 1 cannot yield a larger number.)
As you continue your studies, skim through chapters 4–8 of Hastie & Dawes: Rational Choice in an Uncertain World, just to get the main messages and an overview of curious psychological phenomena.
8.8 Remarks on terminology and notation
Likelihood
In everyday language, “likely” is often a synonym of “probable”, and “likelihood” of “probability”. But in technical writings about probability, inference, and decision-making, “likelihood” has a very different meaning. Beware of this important difference in definitions:
\(\mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{X})\) is:
the probability of \(\mathsfit{Y}\) given \(\mathsfit{X}\) (or conditional on \(\mathsfit{X}\)),
the likelihood of \(\mathsfit{X}\) in view of \(\mathsfit{Y}\).
We can also say:
the probability of \(\mathsfit{Y}\) given \(\mathsfit{X}\), is \(\mathrm{P}({\color[RGB]{68,119,170}\mathsfit{Y}}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{X})\).
the likelihood of \(\mathsfit{Y}\) in view of \(\mathsfit{X}\), is \(\mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{}{\color[RGB]{68,119,170}\mathsfit{Y}})\).
A priori there is no relation between the probability and the likelihood of a sentence \(\mathsfit{Y}\): this sentence could have very high probability and very low likelihood, and vice versa.
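For a simple illustration, consider the roll of an ordinary die (for an agent with the usual symmetric beliefs about dice). Take \(\mathsfit{Y}\coloneqq\textsf{\small`The die landed showing three pips'}\) and \(\mathsfit{X}\coloneqq\textsf{\small`The die landed showing an odd number of pips'}\). Then the probability of \(\mathsfit{Y}\) given \(\mathsfit{X}\) is \(\mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{X}) = 1/3\), while the likelihood of \(\mathsfit{Y}\) in view of \(\mathsfit{X}\) is \(\mathrm{P}(\mathsfit{X}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{Y}) = 1\), since a three is certainly odd.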
In these notes we’ll avoid the possibly confusing term “likelihood”. All we need to express can be phrased in terms of probability.
Omitting background information
In the analyses of the inference examples of § 7.1 and § 8.3 we defined sentences (\(\mathsfit{I}\) and \(\mathsfit{J}\)) expressing all background information, and always included these sentences in the conditionals of the inferences – because those inferences obviously depended on that specific background information.
In many concrete inference problems the background information usually stays in the conditional from beginning to end, while the other sentences jump around between conditional and proposal as we apply the rules of inference. For this reason the background information is often omitted from the notation, being implicitly understood. For instance, if the background information is denoted \(\mathsfit{I}\), one writes
“\(\mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{X})\)” instead of \(\mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{X}\land \mathsfit{I})\)
“\(\mathrm{P}(\mathsfit{Y})\)” instead of \(\mathrm{P}(\mathsfit{Y}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{I})\)
This is what’s happening in books where you see “\(P(x)\)” without any conditional.
Such practice may be convenient, but be wary of it, especially in particular situations:
In some inference problems we suddenly realize that we must distinguish between cases that depend on hypotheses, say \(\mathsfit{H}_1\) and \(\mathsfit{H}_2\), that were buried in the background information \(\mathsfit{I}\). If the background information \(\mathsfit{I}\) is explicitly reported in the notation, this is no problem: we can rewrite it as \(\mathsfit{I} = (\mathsfit{H}_1 \lor \mathsfit{H}_2) \land \mathsfit{I}'\), where \(\mathsfit{I}'\) collects the rest of the background information,
and then proceed as usual. If the background information was not explicitly written, this may lead to confusion and mistakes: there may suddenly appear two instances of \(\mathrm{P}(\mathsfit{X})\) with different values, just because one of them is invisibly conditional on \(\mathsfit{I}\), the other on \(\mathsfit{I}'\).
In some inference problems we are considering several different instances of background information – for example because more than one agent is involved. It’s then extremely important to write the background information explicitly, lest we mix up the degrees of belief of different agents.
This kind of confusion from poor notation happens more often than one thinks, and even appears in the scientific literature.
“Random variables”
Some texts speak of the probability of a “random variable”, or more precisely of the probability “that a random variable takes on a particular value”. As you notice, we have just expressed that idea by means of a sentence. The viewpoint and terminology of random variables is therefore a special case of that based on sentences, which we use here.
The dialect of “random variables” does not offer any advantages in concepts, notation, terminology, or calculations, and it has several shortcomings:
James Clerk Maxwell is one of the main founders of statistical mechanics and kinetic theory (and electromagnetism). Yet he never used the word “random” in his technical writings. Maxwell is known for being very clear and meticulous with explanations and terminology.
As discussed in § 8.1, in concrete applications it is important to know how a quantity “takes on” a value: for example it could be directly measured, indirectly reported, or purposely set to that specific value. Thinking and working in terms of sentences, rather than of random variables, allows us to account for these important differences.
We want a general AI agent to be able to deal with uncertainty and probability also in situations that do not involve mathematical sets.
Very often the object (proposal) of a probability is not a “variable”: it is actually a constant value that is simply unknown (simple example: we are uncertain about the mass of a particular block of concrete, so we speak of the probability of some mass value; this doesn’t mean that the mass of the block of concrete is changing).
What does “random” (or “chance”) mean? Good luck finding an understandable, non-circular definition in texts that use that word – strangely enough, they hardly ever define it at all. In these notes, if the word “random” is ever used, it stands for “unpredictable” or “unsystematic”.
It’s a question for sociology of science why some people keep on using less flexible points of view or terminologies. Probably they just memorize them as students and then a fossilization process sets in.
Finally, some texts speak of the probability of an “event”. For all purposes, an “event” is just what’s expressed in a sentence.