29 Inferences from frequencies

Published

2024-10-30

29.1 If the population frequencies were known

Let’s now see how the exchangeability of an agent’s degrees of belief allows it to calculate probabilities about the units of a population. We shall do this calculation in two steps: first, in the case where the agent knows the joint frequency distribution (§§21.2, 21.3, 23.2) for the full population; second, in the more general case where the agent lacks this population-frequency information.

When the full-population frequency distribution is known, the calculation of probabilities is very intuitive and analogous to the stereotypical “drawing balls from an urn”. We shall rely on this intuition; keep in mind, however, that the probabilities are not assigned “by intuition”, but are fully determined by two basic pieces of knowledge or assumptions: exchangeability and known population frequencies. Some simple proof sketches of this will also be given.

We consider an infinite population with any number of variates. For concreteness we assume these variates to have finite, discrete domains; but the formulae we obtain can be easily generalized to other kinds of variates. In this and the following chapters we shall often use the simplified income dataset (file income_data_nominal_nomissing.csv) and its underlying population as an example. This population has nine nominal variates. The variates, their domain sizes, and their possible values are listed at this link.

Notation recap

We shall mainly use the notation introduced in § 27.3:

All population variates, jointly, are denoted \({\color[RGB]{68,119,170}Z}\). In the case of the income dataset, for instance, the variate \({\color[RGB]{68,119,170}Z}\) stands for the joint variate with nine components:

\[ \begin{aligned} {\color[RGB]{68,119,170}Z}&\coloneqq(\color[RGB]{68,119,170} \mathit{workclass} \mathbin{\mkern-0.5mu,\mkern-0.5mu} \mathit{education} \mathbin{\mkern-0.5mu,\mkern-0.5mu} \mathit{marital\_status} \mathbin{\mkern-0.5mu,\mkern-0.5mu} \mathit{occupation} \mathbin{\mkern-0.5mu,\mkern-0.5mu} {}\\ &\qquad \color[RGB]{68,119,170}\mathit{relationship} \mathbin{\mkern-0.5mu,\mkern-0.5mu} \mathit{race} \mathbin{\mkern-0.5mu,\mkern-0.5mu} \mathit{sex} \mathbin{\mkern-0.5mu,\mkern-0.5mu} \mathit{native\_country} \mathbin{\mkern-0.5mu,\mkern-0.5mu} \mathit{income} \color[RGB]{0,0,0}) \end{aligned} \]

When we write \(\color[RGB]{68,119,170}Z \mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z\), the symbol \(\color[RGB]{68,119,170}z\) stands for some definite joint values, for instance \(({\color[RGB]{68,119,170}{\small\verb;Without-pay;} \mathbin{\mkern-0.5mu,\mkern-0.5mu}{\small\verb;Doctorate;} \mathbin{\mkern-0.5mu,\mkern-0.5mu}\dotsb \mathbin{\mkern-0.5mu,\mkern-0.5mu}{\small\verb;Ireland;} \mathbin{\mkern-0.5mu,\mkern-0.5mu}{\small\verb;>50K;}})\).

In applications where the agent wants to infer the values of some predictand variates, given the observation of predictor variates, the former are denoted \({\color[RGB]{68,119,170}Y}\), the latter \({\color[RGB]{68,119,170}X}\). In the income problem, for instance, the agent (some USA census agency) would like to infer the \(\color[RGB]{68,119,170}\mathit{income}\) variate of a person from the other eight demographic characteristics \(\color[RGB]{68,119,170}\mathit{workclass} \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathit{education} \mathbin{\mkern-0.5mu,\mkern-0.5mu}\dotsb\) of that person. So in this inference problem we define

\[ \begin{aligned} {\color[RGB]{68,119,170}Y}&\coloneqq{\color[RGB]{68,119,170}\mathit{income}} \\[1ex] {\color[RGB]{68,119,170}X}&\coloneqq({\color[RGB]{68,119,170}\mathit{workclass} \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathit{education} \mathbin{\mkern-0.5mu,\mkern-0.5mu}\dotsb \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathit{sex} \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathit{native\_country}}) \end{aligned} \]

We shall, however, also consider slightly different inference problems, for example with \(({\color[RGB]{68,119,170}\mathit{race} \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathit{sex}})\) as predictand and the remaining seven variates \(({\color[RGB]{68,119,170}\mathit{workclass} \mathbin{\mkern-0.5mu,\mkern-0.5mu}\dotsb \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathit{income}})\) as predictors.

Often we shall use red for quantities that are not known in the problem, and green for quantities that are known.

29.2 Knowing the full-population frequency distribution

Now suppose that the agent knows the full-population joint frequency distribution. Let’s make clearer what this means. In the income problem, for instance, consider these two different joint values for the joint variate \({\color[RGB]{68,119,170}Z}\):

\[ \begin{aligned} {\color[RGB]{68,119,170}z^{*}}&\coloneqq( {\small\verb;Private;} \mathbin{\mkern-0.5mu,\mkern-0.5mu}{\small\verb;HS-grad;} \mathbin{\mkern-0.5mu,\mkern-0.5mu}{\small\verb;Married-civ-spouse;} \mathbin{\mkern-0.5mu,\mkern-0.5mu}{\small\verb;Machine-op-inspct;} \mathbin{\mkern-0.5mu,\mkern-0.5mu}{} \\ &\qquad{\small\verb;Husband;} \mathbin{\mkern-0.5mu,\mkern-0.5mu}{\small\verb;White;} \mathbin{\mkern-0.5mu,\mkern-0.5mu}{\small\verb;Male;} \mathbin{\mkern-0.5mu,\mkern-0.5mu}{\small\verb;United-States;} \mathbin{\mkern-0.5mu,\mkern-0.5mu}{\small\verb;<=50K;} ) \\[2ex] {\color[RGB]{68,119,170}z^{**}}&\coloneqq( {\small\verb;Self-emp-not-inc;} \mathbin{\mkern-0.5mu,\mkern-0.5mu}{\small\verb;HS-grad;} \mathbin{\mkern-0.5mu,\mkern-0.5mu}{\small\verb;Married-civ-spouse;} \mathbin{\mkern-0.5mu,\mkern-0.5mu}{} \\ &\qquad {\small\verb;Farming-fishing;} \mathbin{\mkern-0.5mu,\mkern-0.5mu}{\small\verb;Husband;} \mathbin{\mkern-0.5mu,\mkern-0.5mu}{\small\verb;White;} \mathbin{\mkern-0.5mu,\mkern-0.5mu}{\small\verb;Male;} \mathbin{\mkern-0.5mu,\mkern-0.5mu}{\small\verb;United-States;} \mathbin{\mkern-0.5mu,\mkern-0.5mu}{\small\verb;<=50K;} ) \end{aligned} \]

The agent knows that the value \(\color[RGB]{68,119,170}Z\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z^{*}\) occurs in the full population of interest (in this case all 340 million or so USA citizens, considered in a short period of time) with a relative frequency of \(0.860 369\%\); it also knows that the value \(\color[RGB]{68,119,170}Z\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z^{**}\) occurs with a relative frequency of \(0.260 058\%\). We write this as follows:

\[ f({\color[RGB]{68,119,170}Z\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z^{*}}) = 0.860 369\% \ , \qquad f({\color[RGB]{68,119,170}Z\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z^{**}}) = 0.260 058\% \]

The agent knows the frequencies not only of the two particular joint values \(\color[RGB]{68,119,170}z^{*}\), \(\color[RGB]{68,119,170}z^{**}\), but of all possible joint values, that is, of all possible combinations of values of the single variates. In the income example there are 54 001 920 possible combinations, and therefore just as many relative frequencies. All these frequencies together form the full-population frequency distribution for \({\color[RGB]{68,119,170}Z}\), which we denote collectively with \(\boldsymbol{f}\) (note the boldface). Let’s introduce the quantity \(F\), denoting the full-population frequency distribution. Knowledge that the frequencies are \(\boldsymbol{f}\) is then expressed by the sentence \(F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\).
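As a quick check of the count quoted above: the number of possible joint values is the product of the domain sizes of the nine variates. The domain sizes in the sketch below are an assumption made for illustration (the text only points to the list of variates without stating them); with these assumed sizes the product reproduces the stated total.

```python
from math import prod

# Assumed domain sizes, for illustration only: the text does not state them.
domain_sizes = {
    "workclass": 7,
    "education": 16,
    "marital_status": 7,
    "occupation": 14,
    "relationship": 6,
    "race": 5,
    "sex": 2,
    "native_country": 41,
    "income": 2,
}

# One joint value of Z picks one value from each variate's domain,
# so the number of joint values is the product of the domain sizes.
n_joint_values = prod(domain_sizes.values())
print(n_joint_values)  # 54001920
```

The frequency distribution \(\boldsymbol{f}\) therefore consists of one relative frequency for each of these joint values, all of them summing up to \(1\).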

Don’t confuse the full population with a sample from it

Note that the frequencies reported above are not the ones found in the income_data_nominal_nomissing.csv dataset, because that dataset is only a sample from the full population, not the full population. The frequency values reported above are purely hypothetical (but not inconsistent with the frequencies observed in the sample).

In other cases, these hypothetically known frequencies would refer to the full population of units: maybe even past, present, future, if they span a possibly unlimited time range.
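The distinction between full-population frequencies and sample frequencies can be illustrated with a small sketch. The toy population below, with only two variates and made-up counts, is purely hypothetical and has nothing to do with the actual income data; the helper `relative_frequencies` is likewise introduced only for this illustration.

```python
import random
from collections import Counter

def relative_frequencies(units):
    """Return {joint value: relative frequency} for a list of units."""
    counts = Counter(units)
    total = len(units)
    return {value: count / total for value, count in counts.items()}

# Hypothetical toy population: each unit is a joint value (sex, income).
population = (
    [("Male", "<=50K")] * 400
    + [("Male", ">50K")] * 150
    + [("Female", "<=50K")] * 380
    + [("Female", ">50K")] * 70
)

rng = random.Random(0)                 # fixed seed, for reproducibility
sample = rng.sample(population, k=50)  # a sample, not the full population

f_population = relative_frequencies(population)
f_sample = relative_frequencies(sample)

# Print each joint value with its population and its sample frequency.
for value, freq in f_population.items():
    print(value, freq, f_sample.get(value, 0.0))
```

The sample frequencies will generally differ from the full-population ones, even though the sample is drawn from that very population.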

29.3 Inference about a single unit

Now imagine that the agent, given the information \(F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\) about the frequencies and some background information \(\mathsfit{I}\), must infer all \({\color[RGB]{68,119,170}Z}\) variates for a specific unit \(u\). In the income case, it would be an inference about a specific USA citizen. This unit \(u\) could have any particular combination of variate values; in the income case it could have any one of the 54 001 920 possible combined values. The agent must assign a probability to each of these possibilities.¹ Which probability values should it assign?

¹ Remember that this is what we mean when we say “drawing an inference”! (See chap. 5 and § 14.1.)

Intuitively we would say that the probability for a particular value \(\color[RGB]{68,119,170}Z\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z\) should be equal to the frequency of that value in the full population:

if \(\mathsfit{I}\) leads to an exchangeable probability distribution, then

\[ \mathrm{P}({\color[RGB]{68,119,170}Z}_{u}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{68,119,170}z} \nonscript\:\vert\nonscript\:\mathopen{} F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}) = f({\color[RGB]{68,119,170}Z}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{68,119,170}z}) \]

for any unit \(u\).

For instance, the probabilities that unit \(u\) has the values \(\color[RGB]{68,119,170}z^{*}\) or \(\color[RGB]{68,119,170}z^{**}\) above are

\[ \begin{aligned} &\mathrm{P}({\color[RGB]{68,119,170}Z}_{u}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{68,119,170}z^{*}} \nonscript\:\vert\nonscript\:\mathopen{} F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}) = f({\color[RGB]{68,119,170}Z}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{68,119,170}z^{*}}) = 0.860 369\% \\[1ex] &\mathrm{P}({\color[RGB]{68,119,170}Z}_{u}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{68,119,170}z^{**}} \nonscript\:\vert\nonscript\:\mathopen{} F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}) = f({\color[RGB]{68,119,170}Z}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{68,119,170}z^{**}}) = 0.260 058\% \end{aligned} \]

This intuition is the same as in drawing balls, which may have different sets of labels, from a collection, given that we know the proportion of balls with each possible label set.
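The urn intuition can be illustrated with a small simulation. This is only a sketch and not a proof: the urn content and its labels are hypothetical, and picking a ball uniformly at random is used here as a stand-in for the assumption that no ball is treated differently from any other.

```python
import random
from collections import Counter

# Hypothetical urn: each ball carries a set of labels (here a pair).
urn = [("red", "A")] * 5 + [("red", "B")] * 3 + [("blue", "A")] * 2

# Known proportion of balls with each possible label set.
frequencies = {labels: n / len(urn) for labels, n in Counter(urn).items()}

rng = random.Random(1)  # fixed seed, for reproducibility
n_draws = 100_000

# Draw one ball from the full urn, many times over, and count the label sets.
drawn = Counter(rng.choice(urn) for _ in range(n_draws))

for labels, f in frequencies.items():
    print(labels, "frequency:", f, "simulated:", drawn[labels] / n_draws)
```

With many repetitions the simulated proportions should come out close to the known frequencies, which is the numerical counterpart of the intuition just described.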

But the equality above can actually be proven mathematically in this specific case: it follows from the assumption of exchangeability. Let’s examine a very simple case to get an idea of how this proof works.

Exact calculation of the probabilities in a simple case

Suppose we have three rocks from our Mars-prospecting collection. They are marked #1, #2, #3. They look alike, but we know that two of them have haematite, so \(R\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\) for them, and one doesn’t, so \(R\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{204,187,68}{\small\verb;N;}}\) for that rock. This background information – let’s call it \(\mathsfit{K}_{\textsf{3}}\) – is a simple case of a finite population with three units and a binary variate \(R\). We know that the frequency distribution for this population is

\[f(R\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}) = 2/3 \qquad f(R\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{204,187,68}{\small\verb;N;}}) = 1/3\]

Our information \(F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\) about the frequencies corresponds to the following composite sentence:

\[ F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\ \Longleftrightarrow\ (R_{1}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu} R_{2}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu} R_{3}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{204,187,68}{\small\verb;N;}}) \lor (R_{1}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu} R_{2}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{204,187,68}{\small\verb;N;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu} R_{3}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}) \lor (R_{1}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{204,187,68}{\small\verb;N;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu} R_{2}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu} R_{3}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}) \]

Given \(F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\), we know that \(F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\) is true: \(\mathrm{P}( F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\nonscript\:\vert\nonscript\:\mathopen{} F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{K}_{\textsf{3}})=1\), which means

\[ \mathrm{P}\bigl[(R_{1}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu} R_{2}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu} R_{3}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{204,187,68}{\small\verb;N;}}) \lor (R_{1}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu} R_{2}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{204,187,68}{\small\verb;N;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu} R_{3}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}) \lor (R_{1}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{204,187,68}{\small\verb;N;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu} R_{2}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu} R_{3}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}) \nonscript\:\big\vert\nonscript\:\mathopen{} F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}, \mathsfit{K}_{\textsf{3}}\bigr] = 1 \]

Now use the or-rule, considering that the three ored sentences are mutually exclusive:

\[ \begin{aligned} 1&=\mathrm{P}\bigl[(R_{1}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu} R_{2}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu} R_{3}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{204,187,68}{\small\verb;N;}}) \lor (R_{1}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu} R_{2}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{204,187,68}{\small\verb;N;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu} R_{3}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}) \lor (R_{1}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{204,187,68}{\small\verb;N;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu} R_{2}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu} R_{3}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}) \nonscript\:\big\vert\nonscript\:\mathopen{} F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{K}_{\textsf{3}}\bigr] \\[2ex] &= \mathrm{P}(R_{1}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu} R_{2}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 
0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu} R_{3}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{204,187,68}{\small\verb;N;}}\nonscript\:\vert\nonscript\:\mathopen{} F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{K}_{\textsf{3}}) +{} \\&\qquad \mathrm{P}(R_{1}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu} R_{2}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{204,187,68}{\small\verb;N;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu} R_{3}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\nonscript\:\vert\nonscript\:\mathopen{} F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{K}_{\textsf{3}}) +{} \\&\qquad \mathrm{P}(R_{1}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{204,187,68}{\small\verb;N;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu} R_{2}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu} R_{3}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\nonscript\:\vert\nonscript\:\mathopen{} F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{K}_{\textsf{3}}) \end{aligned} \]

According to our background information \(\mathsfit{K}_{\textsf{3}}\), our degrees of belief are exchangeable. This means that the three probabilities summed up above must all have the same value, because in each of them \({\color[RGB]{102,204,238}{\small\verb;Y;}}\) appears twice and \({\color[RGB]{204,187,68}{\small\verb;N;}}\) once. But if we are summing the same value three times and the sum is \(1\), then that value must be \(1/3\). Hence we find that

\[ \begin{aligned} &\mathrm{P}(R_{1}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu} R_{2}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu} R_{3}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{204,187,68}{\small\verb;N;}}\nonscript\:\vert\nonscript\:\mathopen{} F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{K}_{\textsf{3}}) = 1/3 \\ &\mathrm{P}(R_{1}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu} R_{2}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{204,187,68}{\small\verb;N;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu} R_{3}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\nonscript\:\vert\nonscript\:\mathopen{} F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{K}_{\textsf{3}}) = 1/3 \\ &\mathrm{P}(R_{1}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{204,187,68}{\small\verb;N;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu} R_{2}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu} R_{3}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\nonscript\:\vert\nonscript\:\mathopen{} 
F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{K}_{\textsf{3}}) = 1/3 \\[1ex] &\text{all other probabilities are zero} \end{aligned} \]

Now let’s find the probability that a rock, say #1, has haematite (\({\color[RGB]{102,204,238}{\small\verb;Y;}}\)), given that we haven’t observed any other rocks: \(\mathrm{P}(R_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\nonscript\:\vert\nonscript\:\mathopen{} F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{K}_{\textsf{3}})\). This is a marginal probability (§ 16.1), so it’s given by the sum

\[ \begin{aligned} \mathrm{P}(R_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\nonscript\:\vert\nonscript\:\mathopen{} F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{K}_{\textsf{3}}) &= \sum_{i={\color[RGB]{102,204,238}{\small\verb;Y;}}}^{{\color[RGB]{204,187,68}{\small\verb;N;}}}\sum_{j={\color[RGB]{102,204,238}{\small\verb;Y;}}}^{{\color[RGB]{204,187,68}{\small\verb;N;}}} \mathrm{P}(R_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu}R_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}i \mathbin{\mkern-0.5mu,\mkern-0.5mu}R_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}j \nonscript\:\vert\nonscript\:\mathopen{} F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{K}_{\textsf{3}}) \\[1ex] &= \mathrm{P}(R_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu}R_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu}R_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\nonscript\:\vert\nonscript\:\mathopen{} F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{K}_{\textsf{3}}) + {} \\ &\qquad \mathrm{P}(R_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 
0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu}R_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu}R_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{204,187,68}{\small\verb;N;}}\nonscript\:\vert\nonscript\:\mathopen{} F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{K}_{\textsf{3}}) + {} \\ &\qquad \mathrm{P}(R_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu}R_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{204,187,68}{\small\verb;N;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu}R_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\nonscript\:\vert\nonscript\:\mathopen{} F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{K}_{\textsf{3}}) + {} \\ &\qquad \mathrm{P}(R_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu}R_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{204,187,68}{\small\verb;N;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu}R_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{204,187,68}{\small\verb;N;}}\nonscript\:\vert\nonscript\:\mathopen{} F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 
0mu}\mathopen{}\boldsymbol{f}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{K}_{\textsf{3}}) \\[1ex] &= 0 + 1/3 + 1/3 + 0 \\[1ex] &= 2/3 \end{aligned} \]

which is indeed equal to \(f(R\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}})\).

This simple example gives you an idea of why our intuition of equating, in specific circumstances, probabilities with full-population frequencies is actually a mathematical theorem: it follows from (1) knowledge of the full-population frequencies, and (2) exchangeability.
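The three-rock calculation can also be reproduced by brute-force enumeration. The following is a minimal sketch under the same two assumptions (known frequencies and exchangeability); the names `compatible`, `joint`, and `marginal` are introduced here only for illustration.

```python
from fractions import Fraction
from itertools import product

values = ("Y", "N")

# All assignments (R1, R2, R3) compatible with F = f: two Y and one N.
compatible = [r for r in product(values, repeat=3) if r.count("Y") == 2]

# Exchangeability: the compatible assignments are permutations of one another,
# so they get equal probability; all other assignments get probability 0.
joint = {
    r: (Fraction(1, len(compatible)) if r in compatible else Fraction(0))
    for r in product(values, repeat=3)
}

def marginal(unit, value):
    """P(R_unit = value | F = f, K_3): sum the joint over the other rocks."""
    return sum(p for r, p in joint.items() if r[unit - 1] == value)

print(marginal(1, "Y"))  # 2/3
```

Exact fractions are used instead of floating-point numbers so that the result can be compared directly with the frequency \(2/3\).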

Exercises

Calculate \(\mathrm{P}(R_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\nonscript\:\vert\nonscript\:\mathopen{} F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{K}_{\textsf{3}})\), that is, the probability that rock #2 has haematite, given that we don’t know the haematite content of any other rock. Is it different from \(\mathrm{P}(R_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\nonscript\:\vert\nonscript\:\mathopen{} F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{K}_{\textsf{3}})\), or not? Why?
Build a similar proof for a slightly different case; for example four rocks; or two units from a population with a variate having three possible values (instead of just the two \(\set{{\color[RGB]{102,204,238}{\small\verb;Y;}},{\color[RGB]{204,187,68}{\small\verb;N;}}}\)).
Consider the same calculation we did above, but in the case of background knowledge \(\mathsfit{K}_{\text{NE}}\) where our degrees of belief are \(\text{N}\)ot \(\text{E}\)xchangeable. For instance, give three different values to the probabilities

\[ \begin{gathered} \mathrm{P}(R_{1}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu} R_{2}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu} R_{3}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{204,187,68}{\small\verb;N;}}\nonscript\:\vert\nonscript\:\mathopen{} F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{K}_{\text{NE}}) \\ \mathrm{P}(R_{1}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu} R_{2}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{204,187,68}{\small\verb;N;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu} R_{3}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\nonscript\:\vert\nonscript\:\mathopen{} F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{K}_{\text{NE}}) \\ \mathrm{P}(R_{1}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{204,187,68}{\small\verb;N;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu} R_{2}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu} R_{3}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\nonscript\:\vert\nonscript\:\mathopen{} F\mathclose{}\mathord{\nonscript\mkern 
0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{K}_{\text{NE}}) \end{gathered} \]

in such a way that they still sum up to \(1\). Then find by marginalization the probability that rock #1 contains haematite (\({\color[RGB]{102,204,238}{\small\verb;Y;}}\)). Is this probability still equal to the relative frequency of \({\color[RGB]{102,204,238}{\small\verb;Y;}}\)?

29.4 Inference about several units

Let’s continue with the Mars-prospecting example of the previous section, with just three rocks. We found that the probability that rock #1 has haematite (\({\color[RGB]{102,204,238}{\small\verb;Y;}}\)) was \(2/3\), given that we haven’t observed any other rocks. This probability was equal to the frequency of \({\color[RGB]{102,204,238}{\small\verb;Y;}}\)-rocks in the urn.

Now suppose that we observe rock #2, and it turns out to have haematite (\(R_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\)). What is the probability that rock #1 has haematite?

The probability we are asking about is \(\mathrm{P}(R_{1}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\nonscript\:\vert\nonscript\:\mathopen{} R_{2}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu}F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{K}_{\textsf{3}})\), and it can be calculated with the usual rules. The result is again the same as the frequency of the \({\color[RGB]{102,204,238}{\small\verb;Y;}}\)-rocks, but with respect to the new situation: there are now two rocks left in front of us, and one must contain haematite, while the other doesn’t. The probability is therefore \(1/2\), a value different from that we found before, \(2/3\):

\[ \mathrm{P}(R_{1}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\nonscript\:\vert\nonscript\:\mathopen{} F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{K}_{\textsf{3}}) = 2/3 \qquad \mathrm{P}(R_{1}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\nonscript\:\vert\nonscript\:\mathopen{} R_{2}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu}F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{K}_{\textsf{3}}) = 1/2 \]
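Both values can be checked by brute-force enumeration. The following Python sketch assumes, as in the text, that exchangeability together with the known frequencies makes every arrangement of two `Y` rocks and one `N` rock equally probable; the helper `prob` and the tuple encoding of the arrangements are illustrative choices, not part of the original example.

```python
from fractions import Fraction
from itertools import permutations

# All distinct orderings (rock #1, rock #2, rock #3) of two 'Y' and one 'N'.
arrangements = sorted(set(permutations("YYN")))

# Assumption: exchangeability gives every arrangement the same probability.
p_arr = {a: Fraction(1, len(arrangements)) for a in arrangements}

def prob(event, given=lambda a: True):
    """P(event | given), obtained by summing over the arrangements."""
    num = sum(p for a, p in p_arr.items() if event(a) and given(a))
    den = sum(p for a, p in p_arr.items() if given(a))
    return num / den

# Probability that rock #1 has haematite, before and after observing rock #2.
p_r1 = prob(lambda a: a[0] == "Y")
p_r1_given_r2 = prob(lambda a: a[0] == "Y", given=lambda a: a[1] == "Y")
print(p_r1, p_r1_given_r2)  # 2/3 1/2
```

The conditional probability is computed as a ratio of two marginalized sums, which is the same calculation "with the usual rules" mentioned above.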

This situation is quite general: in a collection of many rocks, the probabilities for new observations change according to information about previous observations (and also subsequent ones, if already known).

But consider now the case \(\mathsfit{K}\) of a large collection of 3 000 000 rocks, 2 000 000 of which have haematite and the rest don’t.² The population’s relative frequencies are exactly as in the case with three rocks, and for the probability that rock #1 contains haematite we still have

² Note how this scenario becomes very similar to that of coin tosses.

\[ \mathrm{P}(R_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\nonscript\:\vert\nonscript\:\mathopen{} F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{K}) = f(R\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}) = \frac{2 000 000}{3 000 000} = 2/3 \]

Now suppose we examine rock #2 and find haematite. What is the probability that rock #1 also contains haematite? In this case we find

\[ \mathrm{P}(R_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\nonscript\:\vert\nonscript\:\mathopen{} R_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu}F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{K}) = \frac{1 999 999}{2 999 999} \approx 2/3 \]

with an absolute error of only \(0.000 000 1\). That is, the probability and frequency are almost the same as before examining rock #2. The reason is clear: the number of rocks is so large that observing some of them doesn’t practically change the content and proportions of the whole collection.
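The size of this error can be verified with exact rational arithmetic. This is a minimal Python sketch; the variable names are hypothetical, and the counting follows the urn-style reasoning used for the three-rock case.

```python
from fractions import Fraction

n_total, n_yes = 3_000_000, 2_000_000

# Relative frequency of Y-rocks in the full collection.
freq_yes = Fraction(n_yes, n_total)

# After one Y-rock has been observed, one fewer Y-rock and one fewer rock remain.
p_cond = Fraction(n_yes - 1, n_total - 1)

abs_error = freq_yes - p_cond
print(abs_error, float(abs_error))
```

The difference is exactly \(1/8 999 997\), which agrees with the order of magnitude quoted above.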

The joint probability that rock #2 and rock #1 both contain haematite is therefore, by the and-rule,

\[ \begin{aligned} \mathrm{P}(R_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu}R_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\nonscript\:\vert\nonscript\:\mathopen{} F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{K}) &= \mathrm{P}(R_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\nonscript\:\vert\nonscript\:\mathopen{} R_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu}F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{K}) \cdot \mathrm{P}(R_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\nonscript\:\vert\nonscript\:\mathopen{} F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{K}) \\[1ex] &\approx f(R\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}) \cdot f(R\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}) \end{aligned} \]

with the approximation improving as the collection of rocks grows larger.

It is easy to see that this will hold for more observations, and for different and more complex variate domains, as long as the number of units considered is small enough compared with the population size. For instance

\[ \mathrm{P}(R_4\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu}R_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{204,187,68}{\small\verb;N;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu}R_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{204,187,68}{\small\verb;N;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu}R_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\nonscript\:\vert\nonscript\:\mathopen{} F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{K}) \approx f(R\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}) \cdot f(R\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{204,187,68}{\small\verb;N;}}) \cdot f(R\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{204,187,68}{\small\verb;N;}}) \cdot f(R\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}) \]

where \(\boldsymbol{f}\) is the initial frequency distribution for the population.
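The quality of this approximation for the four-rock sentence can be checked numerically. The Python sketch below assumes the same urn-style counting as before (each observation removes one rock of the observed kind); the function names `exact_prob` and `product_prob` are hypothetical.

```python
from fractions import Fraction

def exact_prob(seq, counts):
    """Exact probability of an ordered sequence of values, urn-style."""
    remaining = dict(counts)
    total = sum(remaining.values())
    p = Fraction(1)
    for value in seq:
        p *= Fraction(remaining[value], total)
        remaining[value] -= 1   # one fewer rock of this kind
        total -= 1              # one fewer rock overall
    return p

def product_prob(seq, counts):
    """Approximate probability: product of the population frequencies."""
    total = sum(counts.values())
    p = Fraction(1)
    for value in seq:
        p *= Fraction(counts[value], total)
    return p

counts = {"Y": 2_000_000, "N": 1_000_000}
seq = ["Y", "N", "N", "Y"]          # R1=Y, R2=N, R3=N, R4=Y

exact = exact_prob(seq, counts)
approx = product_prob(seq, counts)  # (2/3) * (1/3) * (1/3) * (2/3)
rel_error = abs(exact - approx) / exact
print(float(exact), float(approx), float(rel_error))
```

Because the agent's beliefs are exchangeable, reordering `seq` leaves `exact` unchanged; only the counts of each value matter.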

This situation applies to more general populations: if the full-population frequencies are known, the agent’s beliefs are exchangeable, and the population is practically infinite, then the joint probability that some units have a particular set of values is equal to the product of the frequencies of those values.

If an agent:

- has background information \(\mathsfit{I}\) about a population saying that
  - beliefs about units are exchangeable
  - the population size is practically infinite
- has full information \(F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\) about the population frequencies \(\boldsymbol{f}\) for the variate \({\color[RGB]{68,119,170}Z}\)

then

\[ \mathrm{P}( {\color[RGB]{68,119,170}Z}_{u'}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{68,119,170}z'} \mathbin{\mkern-0.5mu,\mkern-0.5mu} {\color[RGB]{68,119,170}Z}_{u''}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{68,119,170}z''} \mathbin{\mkern-0.5mu,\mkern-0.5mu} {\color[RGB]{68,119,170}Z}_{u'''}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{68,119,170}z'''} \mathbin{\mkern-0.5mu,\mkern-0.5mu} \dotsb \nonscript\:\vert\nonscript\:\mathopen{} F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}) \approx f({\color[RGB]{68,119,170}Z}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{68,119,170}z'}) \cdot f({\color[RGB]{68,119,170}Z}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{68,119,170}z''}) \cdot f({\color[RGB]{68,119,170}Z}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{68,119,170}z'''}) \cdot \dotsb \]

for any (different) units \(u', u'', u''', \dotsc\) and any (even equal) values \(\color[RGB]{68,119,170}z', z'', z''', \dotsc\).

The formula above solves our initial problem, “How to calculate and encode the joint probability distribution for the full population?”, although only in the case where the full-population frequencies \(\boldsymbol{f}\) are known. In this case the joint distribution is encoded in \(\boldsymbol{f}\) itself (which can be represented as a multidimensional array), and any desired joint probability can be calculated simply by multiplication.

In the income example from § 29.2, the probability that two units (citizens) #\(a\), #\(c\) have value \(\color[RGB]{68,119,170}Z\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z^{**}\) and one unit #\(b\) has value \(\color[RGB]{68,119,170}Z \mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z^{*}\) is

\[ \begin{aligned} \mathrm{P}( {\color[RGB]{68,119,170}Z}_{a}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{68,119,170}z^{**}} \mathbin{\mkern-0.5mu,\mkern-0.5mu} {\color[RGB]{68,119,170}Z}_{b}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{68,119,170}z^{*}} \mathbin{\mkern-0.5mu,\mkern-0.5mu} {\color[RGB]{68,119,170}Z}_{c}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{68,119,170}z^{**}} \nonscript\:\vert\nonscript\:\mathopen{} F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}) &\approx f({\color[RGB]{68,119,170}Z}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{68,119,170}z^{**}}) \cdot f({\color[RGB]{68,119,170}Z}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{68,119,170}z^{*}}) \cdot f({\color[RGB]{68,119,170}Z}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{68,119,170}z^{**}}) \\[1ex] &= 0.260 058\% \cdot 0.860 369\% \cdot 0.260 058\% \\ &= 0.000 005 818 7\% \end{aligned} \]
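The arithmetic of this example can be reproduced in a few lines. This is only a sketch: the dictionary `f` holds just the two frequencies quoted above (not the full frequency array), and the keys `"z*"`, `"z**"` and the function `joint_prob` are hypothetical names.

```python
from math import prod

# Frequencies quoted in the text, converted from percentages to fractions.
f = {"z*": 0.860369 / 100, "z**": 0.260058 / 100}

def joint_prob(values, f):
    """Product formula: joint probability that distinct units have these values."""
    return prod(f[v] for v in values)

# Units a, b, c with values z**, z*, z** respectively.
p = joint_prob(["z**", "z*", "z**"], f)
print(f"{100 * p:.10f}%")
```

With the full population frequencies stored as a multidimensional array, the lookup `f[v]` would become an array index, while the multiplication stays the same.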

Always check whether the approximate formula above is appropriate

As we have seen, the product formula above is, strictly speaking, only approximate. When the full population has practically infinite size compared to (1) the number of units that the agent uses for learning, and (2) the number of units the agent will draw inferences about, the formula can be used as if it were exact.

But how large is “practically infinite”? No general answer is possible: it depends on the precision required in the specific problem. In some problems, even if learning and inference involve 10% of the units from the full population, the approximation might still be acceptable; but in other problems it might not be. If learning and inference involve 50% or more of the units from the full population, then the formula above is hardly acceptable.

The probability calculus and the four fundamental rules allow us to handle problems with any population size exactly (see the Study reading below), but the exact computation becomes involved and expensive. This is why the approximate product formula above is valuable, when it can be properly used.
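One way to perform such a check is to compare the product formula against the exact urn-style calculation for the fraction of the population actually involved. The Python sketch below uses a hypothetical population of 3000 rocks with the same 2:1 proportion as before, and sequences that keep that proportion; the function names and the choice of relative error as the measure are assumptions made for illustration.

```python
from fractions import Fraction

def exact_prob(seq, counts):
    """Exact probability of an ordered sequence of values, urn-style."""
    remaining = dict(counts)
    total = sum(remaining.values())
    p = Fraction(1)
    for value in seq:
        p *= Fraction(remaining[value], total)
        remaining[value] -= 1
        total -= 1
    return p

def product_prob(seq, counts):
    """Approximate probability: product of the population frequencies."""
    total = sum(counts.values())
    p = Fraction(1)
    for value in seq:
        p *= Fraction(counts[value], total)
    return p

def relative_error(seq, counts):
    exact = exact_prob(seq, counts)
    return float(abs(exact - product_prob(seq, counts)) / exact)

# Hypothetical population: 3000 rocks, 2000 of them with haematite.
counts = {"Y": 2000, "N": 1000}

errors = {}
for n_units in (3, 300, 1500):       # 0.1%, 10%, 50% of the population
    n_yes = 2 * n_units // 3         # keep the population's 2:1 proportion
    seq = ["Y"] * n_yes + ["N"] * (n_units - n_yes)
    errors[n_units] = relative_error(seq, counts)
    print(n_units, errors[n_units])
```

The relative error grows with the fraction of units involved. Whether a given error is tolerable still depends on the precision the specific problem requires.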

29.5 No learning when full-population frequencies are known

Imagine an agent with exchangeable beliefs \(\mathsfit{I}\) and knowledge \(F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\) of the full-population frequencies, who also has observed several units with values (possibly some identical) \(\color[RGB]{68,119,170}Z_{u'}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z' \mathbin{\mkern-0.5mu,\mkern-0.5mu}Z_{u''}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z'' \mathbin{\mkern-0.5mu,\mkern-0.5mu}Z_{u'''}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z''' \mathbin{\mkern-0.5mu,\mkern-0.5mu}\dotsb\). What is this agent’s degree of belief that a new unit #\(u\) has value \(\color[RGB]{68,119,170}Z\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z\)?

From our basic formula for this question,

\[ \begin{aligned} &\mathrm{P}(\color[RGB]{238,102,119} Z_{u}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z \color[RGB]{0,0,0}\nonscript\:\vert\nonscript\:\mathopen{}\color[RGB]{34,136,51} Z_{u'}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z' \mathbin{\mkern-0.5mu,\mkern-0.5mu} Z_{u''}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z'' \mathbin{\mkern-0.5mu,\mkern-0.5mu} Z_{u'''}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z''' \mathbin{\mkern-0.5mu,\mkern-0.5mu} \dotsb \color[RGB]{0,0,0}\mathbin{\mkern-0.5mu,\mkern-0.5mu}F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}) \\[2ex] &\qquad{}=\frac{ \mathrm{P}(\color[RGB]{238,102,119} Z_{u}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z \color[RGB]{0,0,0}\mathbin{\mkern-0.5mu,\mkern-0.5mu} \color[RGB]{34,136,51}Z_{u'}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z' \mathbin{\mkern-0.5mu,\mkern-0.5mu} Z_{u''}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z'' \mathbin{\mkern-0.5mu,\mkern-0.5mu} Z_{u'''}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z''' \mathbin{\mkern-0.5mu,\mkern-0.5mu} \dotsb \color[RGB]{0,0,0}\nonscript\:\vert\nonscript\:\mathopen{} F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}) }{ \sum_{\color[RGB]{170,51,119}z} \mathrm{P}( \color[RGB]{238,102,119}Z_{u}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\color[RGB]{170,51,119}z \color[RGB]{0,0,0}\mathbin{\mkern-0.5mu,\mkern-0.5mu} 
\color[RGB]{34,136,51}Z_{u'}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z' \mathbin{\mkern-0.5mu,\mkern-0.5mu} Z_{u''}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z'' \mathbin{\mkern-0.5mu,\mkern-0.5mu} Z_{u'''}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z''' \mathbin{\mkern-0.5mu,\mkern-0.5mu} \dotsb \color[RGB]{0,0,0}\nonscript\:\vert\nonscript\:\mathopen{} F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}) } \\[2ex] &\qquad{}\approx\frac{ f({\color[RGB]{238,102,119}Z\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z}) \cdot f({\color[RGB]{68,119,170}{\color[RGB]{34,136,51}Z\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z'}}) \cdot f({\color[RGB]{34,136,51}Z\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z''}) \cdot f({\color[RGB]{34,136,51}Z\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z'''}) \cdot \dotsb }{ \sum_{\color[RGB]{170,51,119}z} f({\color[RGB]{238,102,119}Z\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\color[RGB]{170,51,119}z}) \cdot f({\color[RGB]{34,136,51}Z\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z'}) \cdot f({\color[RGB]{34,136,51}Z\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z''}) \cdot f({\color[RGB]{34,136,51}Z\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z'''}) \cdot \dotsb } \\[2ex] &\qquad{}=\frac{ f({\color[RGB]{238,102,119}Z\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z}) \cdot 
\cancel{f({\color[RGB]{34,136,51}Z\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z'})} \cdot \cancel{f({\color[RGB]{34,136,51}Z\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z''})} \cdot \cancel{f({\color[RGB]{34,136,51}Z\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z'''})} \cdot \cancel{\dotsb} }{ \underbracket[0.2ex]{\sum_{\color[RGB]{170,51,119}z} f({\color[RGB]{238,102,119}Z\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\color[RGB]{170,51,119}z})}_{{}=1} \cdot \cancel{f({\color[RGB]{34,136,51}Z\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z'})} \cdot \cancel{f({\color[RGB]{34,136,51}Z\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z''})} \cdot \cancel{f({\color[RGB]{34,136,51}Z\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z'''})} \cdot \cancel{\dotsb} } \\[2ex] &\qquad{}= f({\color[RGB]{238,102,119}Z\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z}) \\[3ex] &\qquad{}\equiv \mathrm{P}(\color[RGB]{238,102,119}Z_{u}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z \color[RGB]{0,0,0}\nonscript\:\vert\nonscript\:\mathopen{} F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}) \end{aligned} \]

so the information from the units \(u'\), \(u''\), and so on is irrelevant to this agent. In other words, this agent’s inferences about some units are not affected by the observation of other units.
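The cancellation can also be seen numerically. The Python sketch below uses made-up frequencies over three hypothetical values `"a"`, `"b"`, `"c"` and takes the product formula as exact (the practically-infinite-population idealization); it then computes the conditional probability as the ratio shown above.

```python
from fractions import Fraction
from math import prod

# Hypothetical full-population frequencies; they sum to 1.
f = {"a": Fraction(1, 2), "b": Fraction(3, 10), "c": Fraction(1, 5)}

def joint(values, f):
    """Product formula for the joint probability of several units' values."""
    return prod(f[v] for v in values)

def predictive(z, observed, f):
    """P(Z_u = z | observed units, F = f, I) as a ratio of joint probabilities."""
    num = joint([z, *observed], f)
    den = sum(joint([zz, *observed], f) for zz in f)
    return num / den

observed = ["a", "a", "c", "b"]      # values of previously observed units
for z in f:
    print(z, predictive(z, observed, f), f[z])
```

Whatever list of observed values is supplied, the two printed numbers coincide, because the factors for the observed units appear in both numerator and denominator.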

The reason for this irrelevance is that the agent already knows the full-population frequencies. So the observation of the frequencies of some values provides no new information to the agent.

Obviously this is not what we desire. But it is not a problem: the crucial point is that knowledge of full-population frequencies is only a hypothetical, idealized situation. In the next chapter we shall see that learning occurs when we go beyond this idealization.

“Learning” about what?

In this and the following sections, and sometimes in the following chapters, when we say “the agent is learning” or “the agent is not learning” we refer specifically to changes in the agent’s beliefs about the variates of units that have not yet been observed.

Note that there is always learning about something whenever we put new information in the conditional of a probability. In the Mars-prospecting example above, for instance, we have

\[ \mathrm{P}(R_{1} \mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{K}) = 2/3 \qquad \mathrm{P}(R_{1} \mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\nonscript\:\vert\nonscript\:\mathopen{} R_{2} \mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{204,187,68}{\small\verb;N;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{K}) \approx 2/3 \]

and the agent has (practically) not learned anything about the sentence \(R_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\) from the sentence \(R_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{204,187,68}{\small\verb;N;}}\).
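The word “practically” can be made precise with the same counting used earlier, assuming that \(\mathsfit{K}\) still denotes the collection of 3 000 000 rocks of which 2 000 000 contain haematite. After one rock without haematite has been observed, all 2 000 000 haematite rocks are still among the remaining 2 999 999, so

\[ \mathrm{P}(R_{1} \mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\nonscript\:\vert\nonscript\:\mathopen{} R_{2} \mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{204,187,68}{\small\verb;N;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{K}) = \frac{2 000 000}{2 999 999} \approx 0.666 666 9 \]

which differs from \(2/3\) by about \(0.000 000 2\).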

But we also have

\[ \mathrm{P}(R_{2} \mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{K}) = 2/3 \qquad \mathrm{P}(R_{2} \mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\nonscript\:\vert\nonscript\:\mathopen{} R_{2} \mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{204,187,68}{\small\verb;N;}}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{K}) = 0 \]

Here the probability for the sentence \(R_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{102,204,238}{\small\verb;Y;}}\) has changed, so the agent has learned something: that rock #2 doesn’t contain haematite (\({\color[RGB]{204,187,68}{\small\verb;N;}}\)).

Study reading

Ch. 3 of Probability Theory. This chapter is extremely instructive for understanding how probability theory works in general.