17  Conditional probability and learning

Published 2023-11-07

17.1 Conditional probability: augmenting knowledge

When we introduced the notion of degree of belief – a.k.a. probability – in chapter  8, we stressed the fact that every probability is conditional on some state of knowledge or information. So the term “conditional probability” sounds like a pleonasm, just like saying “round circle”.

This term must be understood in a way analogous to “marginal probability”: it applies in situations where we have two or more sentences of interest. We speak of a “conditional probability” when we want to emphasize that additional sentences appear in the conditional (right side of \(\nonscript\:\vert\nonscript\:\mathopen{}\)) of that probability, as compared to other probabilities. For instance, in a scenario in which these two probabilities appear:

\[ \mathrm{P}(\mathsfit{A} \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{B} \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}) \qquad \mathrm{P}(\mathsfit{A} \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}) \]

we call the first the conditional probability of \(\mathsfit{A}\) (given \(\mathsfit{B}\)) to emphasize that its conditional includes an additional sentence (\(\mathsfit{B}\)), whereas the conditional of the second probability doesn’t.

Such emphasis is important because it also means that the “conditional” probability is based on some additional knowledge, information, or hypothesis with respect to the “non-conditional” one. This has obvious connections with the idea of “learning”. Indeed the calculation of “conditional” probabilities enters all situations (even hypothetical or counterfactual ones, see §  5.1) in which some knowledge is augmented by new knowledge. This can happen in several ways, which we now examine.

17.2 Conditional from joint probability: dissimilar quantities

Consider once more the next-patient arrival scenario of §  15.2, with joint quantity \((U,T)\) and an agent’s joint probability distribution as in table  15.1. Suppose that the agent must forecast whether the next patient will require \({\small\verb;urgent;}\) or \({\small\verb;non-urgent;}\) care, so it needs to calculate the probability distribution for \(U\) (that is, the probabilities for \(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\) and \(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\)).

In the first exercise of §  16.1 you found that the marginal probability that the next patient will need urgent care is

\[\mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{H}}) = 18\%\]

This is the agent’s degree of belief if it has the knowledge encoded in the sentence \(\mathsfit{I}_{\text{H}}\), nothing more and nothing less.

But now let’s imagine that the agent receives a new piece of information: it is told that the next patient is being transported by helicopter. In other words, the agent now knows that the sentence \(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\) is true. The agent’s complete knowledge is then encoded in the anded sentence

\[T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\land \mathsfit{I}_{\text{H}}\]

which should therefore appear in the conditional. The agent’s belief that the next patient requires urgent care is therefore

\[\mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{H}})\]

Calculation of this probability can be done by just one application of the and-rule:

\[ \begin{aligned} &\mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{H}}) = \mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{H}}) \cdot \mathrm{P}(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{H}}) \\[3ex] &\quad\implies\quad \mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{H}}) = \frac{ \mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{H}}) }{ \mathrm{P}(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{H}}) } \end{aligned} \]

We do have the joint probability for \(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\land T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\) that appears in the numerator of the fraction above. The probability in the denominator is just a marginal probability for \(T\), and we know how to calculate that too from §  16.1. Finally we find

\[ \mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{H}}) =\frac{ \mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{H}}) }{ \sum_u\mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}u \mathbin{\mkern-0.5mu,\mkern-0.5mu}T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{H}}) } \]

where it’s understood that the sum index \(u\) runs over the values \(\set{{\small\verb;urgent;}, {\small\verb;non-urgent;}}\).

This is called a conditional probability; in this case, the conditional probability of  \(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\)  given  \(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\).

The collection of probabilities for all possible values of the quantity \(U\), given a specific value of the quantity \(T\), say \({\small\verb;helicopter;}\):

\[ \mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{H}}) \ , \qquad \mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\nonscript\:\vert\nonscript\:\mathopen{} T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{H}}) \]

is called the conditional probability distribution for \(U\)  given  \(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\). It is indeed a probability distribution because the two probabilities sum up to \(1\).
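The computation just performed can be sketched in code. Table 15.1 is not reproduced in this chapter, so the joint probabilities below are only placeholders (for concreteness they are the joint probabilities implied by the background information \(\mathsfit{I}_{\text{S}}\) of §  17.5, which are slightly different from those of \(\mathsfit{I}_{\text{H}}\)); a minimal Python sketch:

```python
# Placeholder joint distribution for (U, T): table 15.1 is not reproduced
# here, so these are the joint probabilities implied by the background
# information I_S of sec. 17.5 (slightly different from those of I_H).
joint = {
    ("urgent", "ambulance"): 0.1098,
    ("urgent", "helicopter"): 0.0396,
    ("urgent", "other"): 0.0306,
    ("non-urgent", "ambulance"): 0.1722,
    ("non-urgent", "helicopter"): 0.0082,
    ("non-urgent", "other"): 0.6396,
}

# Marginal probability P(T=helicopter | I): sum the joint over all values of U
p_heli = sum(p for (u, t), p in joint.items() if t == "helicopter")

# Conditional distribution P(U=u | T=helicopter, I): joint / marginal
cond = {u: joint[(u, "helicopter")] / p_heli
        for u in ("urgent", "non-urgent")}

print(cond)                 # {'urgent': 0.828..., 'non-urgent': 0.171...}
print(sum(cond.values()))   # 1.0: indeed a probability distribution
```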

Important

Note that the collection of probabilities for, say, \(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\), but for different values of the conditional quantity \(T\), that is

\[ \begin{aligned} &\mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;ambulance;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{H}}) \ , \\[1ex] &\mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{H}}) \ , \\[1ex] &\mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;other;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{H}}) \end{aligned} \]

is not a probability distribution. Calculate the three probabilities above and check that indeed they do not sum up to one.

Exercise
  • Using the values from table 15.1 and the formula for conditional probability above, calculate:

    • The conditional probability that the next patient needs urgent care, given that the patient is being transported by helicopter.

    • The conditional probability that the next patient is being transported by helicopter, given that the patient needs urgent care.

  • Now discuss and find an intuitive explanation for these comparisons:

    • The two probabilities you obtained above. Are they equal? Why or why not?

    • The marginal probability that the next patient will be transported by helicopter, with the conditional probability that the patient will be transported by helicopter given that the case is urgent. Are they equal? If not, which is higher, and why?


17.3 Conditional from joint probability: similar quantities

In the previous section we examined how knowledge about one quantity of a particular kind can change an agent’s degree of belief about a quantity of a different kind, for example “transportation” about “urgency” or vice versa. This change is reflected in the value of the corresponding conditional probability.

This kind of change can also occur with quantities of a “similar kind”, that is, quantities that represent the same kind of phenomenon and have exactly the same domain. The maths and calculations are identical to those we have explored, but the interpretation and application can be somewhat different.

As an example, imagine a scenario similar to the next-patient one above, but now consider the next three patients to arrive, and their urgency. Define the following three quantities:

\(U_1\) : urgency of the next patient
\(U_2\) : urgency of the second future patient from now
\(U_3\) : urgency of the third future patient from now

Each of these quantities has the same domain: \(\set{{\small\verb;urgent;},{\small\verb;non-urgent;}}\).

The joint quantity \((U_1, U_2, U_3)\) has a domain with \(2^3 = 8\) possible values:

  • \(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\)
  • \(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\)
  • . . .
  • \(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\)
  • \(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\)

Suppose that an agent, with background information \(\mathsfit{I}\), has a joint probability distribution for the joint quantity \((U_1, U_2, U_3)\); the distribution is implicitly given as follows:

  • If \({\small\verb;urgent;}\) appears 0 times out of 3: probability = \(53.6\%\)
  • If \({\small\verb;urgent;}\) appears 1 time out of 3: probability = \(11.4\%\)
  • If \({\small\verb;urgent;}\) appears 2 times out of 3: probability = \(3.6\%\)
  • If \({\small\verb;urgent;}\) appears 3 times out of 3: probability = \(1.4\%\)

Examples:

\[ \begin{aligned} &\mathrm{P}(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}) = 0.036 \\[1ex] &\mathrm{P}(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}) = 0.114 \\[1ex] &\mathrm{P}(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}) = 0.036 \\[1ex] \end{aligned} \]
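Since the joint distribution is specified only through the number of \({\small\verb;urgent;}\) values, it can help to construct the full eight-value distribution explicitly. A minimal Python sketch, using the probabilities given above:

```python
from itertools import product

# Probability of each individual sequence, by the number of 'urgent'
# values appearing in it (as specified above)
prob_by_count = {0: 0.536, 1: 0.114, 2: 0.036, 3: 0.014}

# Full joint distribution for (U1, U2, U3): 2**3 = 8 values
joint = {
    seq: prob_by_count[seq.count("urgent")]
    for seq in product(("urgent", "non-urgent"), repeat=3)
}

for seq, p in joint.items():
    print(seq, p)
```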

Exercise
  • Check that the joint probability distribution as defined above indeed sums up to \(1\).

  • Calculate the marginal probability for \(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\), that is,  \(\mathrm{P}(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{I})\).

  • Calculate the marginal probability that the second and third patients are non-urgent cases, that is

\[\mathrm{P}(U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{I}) \ .\]

From this joint probability distribution the agent can calculate, among other things, its degree of belief that the third patient from now will require urgent care, regardless of the urgency of the preceding two patients. It’s the marginal probability

\[ \begin{aligned} \mathrm{P}(U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}) &= \sum_{u_1}\sum_{u_2} \mathrm{P}(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}u_1 \mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}u_2 \mathbin{\mkern-0.5mu,\mkern-0.5mu}U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}) \\[1ex] &= 0.114 + 0.036 + 0.036 + 0.014 \\[1ex] &= \boldsymbol{20.0\%} \end{aligned} \]

where the first term \(0.114\) in the sum corresponds to \(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\), the second term \(0.036\) to \(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\), and so on.

Therefore the agent, right now, has a \(20\%\) degree of belief that the third patient from now will require urgent care.


Now fast-forward in time, after two patients have arrived and been taken good care of. Suppose that both were non-urgent cases, and the agent knows this. The agent needs to forecast whether the next (third) patient will require urgent care.

It wouldn’t be sensible to use  \(\mathrm{P}(U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{I})\),  calculated above, because this degree of belief represents an agent having only the background knowledge \(\mathsfit{I}\). Now, instead, the agent has additional information about the first two patients, encoded in this anded sentence:

\[ U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;} \]

The relevant degree of belief is therefore the conditional probability

\[ \begin{aligned} &\mathrm{P}(U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}) \\[2ex] &\qquad{}= \frac{ \mathrm{P}(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}) }{ \mathrm{P}(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}) } \\[1ex] &\qquad{}=\frac{0.114}{0.65} \\[2ex] &\qquad{}\approx \boldsymbol{17.5\%} \end{aligned} \]

This conditional probability of \(17.5\%\) for \(U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\) is lower than the marginal one of \(20.0\%\) calculated previously. Observation of two patients has thus affected the agent’s degree of belief.


Let’s also check how the agent’s belief changes in the case where the first two patients are both urgent. The calculation is completely analogous:

\[ \begin{aligned} &\mathrm{P}(U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}) \\[2ex] &\qquad{}= \frac{ \mathrm{P}(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}) }{ \mathrm{P}(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}) } \\[1ex] &\qquad{}=\frac{0.014}{0.050} \\[2ex] &\qquad{}= \boldsymbol{28.0\%} \end{aligned} \]

In this case the conditional probability \(28.0\%\) for \(U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\) is higher than the marginal one \(20.0\%\).
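These numbers are easy to verify with a short computation. A self-contained Python sketch, rebuilding the joint distribution from the count-based specification above:

```python
from itertools import product

# Joint distribution for (U1, U2, U3), from the count-based rule above
prob_by_count = {0: 0.536, 1: 0.114, 2: 0.036, 3: 0.014}
joint = {seq: prob_by_count[seq.count("urgent")]
         for seq in product(("urgent", "non-urgent"), repeat=3)}

# Marginal P(U3=urgent | I): sum over all values of U1 and U2
p_u3 = sum(p for seq, p in joint.items() if seq[2] == "urgent")

# Conditional P(U3=urgent | U1=u1, U2=u2, I): joint divided by the
# marginal probability of the conditioning values
def cond_u3_urgent(u1, u2):
    denom = sum(p for seq, p in joint.items()
                if seq[0] == u1 and seq[1] == u2)
    return joint[(u1, u2, "urgent")] / denom

print(p_u3)                                         # 0.200
print(cond_u3_urgent("non-urgent", "non-urgent"))   # 0.175...
print(cond_u3_urgent("urgent", "urgent"))           # 0.280
```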

One possible intuitive explanation of these probability changes, in the present scenario, is that observation of two non-urgent cases makes the agent slightly more confident that “this is a day with few urgent cases”; whereas observation of two urgent cases makes the agent more confident that “this is a day with many urgent cases”.

Important

In general we cannot say that the probability of a particular value (such as \({\small\verb;urgent;}\) in the scenario above) will decrease or increase as similar or dissimilar values are observed, nor how much the increase or decrease will be.

In a different situation the probability of \({\small\verb;urgent;}\) could actually increase as more and more \({\small\verb;non-urgent;}\) cases are observed. Imagine, for instance, a scenario where the agent initially knows that there are 10 urgent and 90 non-urgent cases ahead. Having observed 90 non-urgent cases, the agent will give a much higher probability – 100% – that the next case will be an urgent one.
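To see this numerically: in that scenario, assuming (an assumption added here for illustration) that each remaining case is equally likely to arrive next, the probability of \({\small\verb;urgent;}\) after observing some number of \({\small\verb;non-urgent;}\) cases is simply the fraction of urgent cases left:

```python
# Known composition: 10 urgent and 90 non-urgent cases ahead.
# Assumption for illustration: each remaining case is equally likely
# to be the next one.
def p_next_urgent(observed_nonurgent, urgent=10, nonurgent=90):
    remaining = urgent + nonurgent - observed_nonurgent
    return urgent / remaining

print(p_next_urgent(0))    # 0.1: initial degree of belief
print(p_next_urgent(50))   # 0.2: increases as non-urgent cases are observed
print(p_next_urgent(90))   # 1.0: certainty, only urgent cases are left
```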

The differences among such scenarios are reflected in differences of the joint probabilities, from which the conditional probabilities are calculated.

All these situations are correctly handled with the four fundamental rules of inference and the formula for conditional probability derived from them.

Exercises
  1. Using the same joint distribution above, calculate

    \[\mathrm{P}(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I})\]

    that is, the probability that the next patient will require urgent care given that the agent knows the second and third patients will not require urgent care.

    • Why is the value obtained different from  \(\mathrm{P}(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{I})\) ?

    • Describe a scenario in which the calculation above makes sense (and patients 2 and 3 still arrive after patient 1).


  2. Do an analysis completely analogous to the previous, three-patient one, but with different background information \(\mathsfit{J}\) that gives a joint probability distribution for \((U_1, U_2, U_3)\) as follows:

    • If \({\small\verb;urgent;}\) appears 0 times out of 3: probability = \(0\%\)
    • If \({\small\verb;urgent;}\) appears 1 time out of 3: probability = \(24.5\%\)
    • If \({\small\verb;urgent;}\) appears 2 times out of 3: probability = \(7.8\%\)
    • If \({\small\verb;urgent;}\) appears 3 times out of 3: probability = \(3.1\%\)

    1. Calculate

      \[\mathrm{P}(U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{J})\]

      and

      \[\mathrm{P}(U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{J})\] and compare them.

    2. Explain why this particular change in degree of belief occurs, in this situation.

17.4 General case: conditional from joint

Take the time to review the two sections above, focusing on the application and meaning of the two scenarios and calculations, and noting the similarities and differences:

  • The calculations were completely analogous; in particular, the conditional probability was obtained as the quotient of a joint probability and a marginal one.

  • In the first (next-patient) scenario, information about one aspect of the situation changed the agent’s belief about another aspect; the two aspects were somewhat different (transportation and urgency). In the second (three-patient) scenario, by contrast, information about analogous occurrences of an aspect of the situation changed the agent’s belief about a further occurrence.


A third scenario is also possible, which combines the two above. Consider the case with three patients, where each patient can require \({\small\verb;urgent;}\) care or not, and can be transported by \({\small\verb;ambulance;}\), \({\small\verb;helicopter;}\), or \({\small\verb;other;}\) means. To describe this situation, introduce three pairs of quantities, which together form the joint quantity

\[ (U_1, T_1, \ U_2, T_2, \ U_3, T_3) \]

whose meaning should now be obvious. This joint quantity has \((2\cdot 3)^3 = 216\) possible values, corresponding to all urgency & transportation combinations for the three patients.
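The size of this domain can be confirmed by enumerating it; a short check in Python:

```python
from itertools import product

urgency = ("urgent", "non-urgent")
transport = ("ambulance", "helicopter", "other")

# Domain of (U1, T1, U2, T2, U3, T3): all combinations for three patients
domain = list(product(urgency, transport, repeat=3))
print(len(domain))   # 216 == (2*3)**3
```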

Given the joint probability distribution for this joint quantity, it is possible to calculate all kinds of conditional probabilities, which reflect the knowledge that the agent may have acquired. For instance, suppose the agent has observed that

  • the first two patients have not required urgent care
  • the first patient was transported by ambulance
  • the second patient was transported by other means
  • the third patient is arriving by ambulance

and needs to infer whether the third patient will require urgent care. The required probability is

\[ \begin{aligned} &\mathrm{P}(U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} T_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;ambulance;}\mathbin{\mkern-0.5mu,\mkern-0.5mu} U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}T_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;ambulance;}\mathbin{\mkern-0.5mu,\mkern-0.5mu} U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}T_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;other;}\mathbin{\mkern-0.5mu,\mkern-0.5mu} \mathsfit{I}) \\[2ex] &\qquad{}= \frac{ \mathrm{P}(U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}T_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;ambulance;}\mathbin{\mkern-0.5mu,\mkern-0.5mu} U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}T_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;ambulance;}\mathbin{\mkern-0.5mu,\mkern-0.5mu} U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}T_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;other;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}) }{ \mathrm{P}(T_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;ambulance;}\mathbin{\mkern-0.5mu,\mkern-0.5mu} U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}T_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;ambulance;}\mathbin{\mkern-0.5mu,\mkern-0.5mu} U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}T_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;other;}\mathbin{\mkern-0.5mu,\mkern-0.5mu} \mathsfit{I}) } \end{aligned} \]

and is calculated in a way completely analogous to the ones already seen.


All three kinds of inference scenarios frequently occur in data science and engineering. In machine learning, the second scenario is connected to “unsupervised learning”; the third, mixed one to “supervised learning”. As you just saw, the probability calculus “sees” all of them as analogous: information about something changes the agent’s belief about something else. And the handling of all three cases is perfectly covered by the four fundamental rules of inference.


Let’s now consider a more generic case of a joint quantity with component quantities \(\color[RGB]{34,136,51}X\) and \(\color[RGB]{238,102,119}Y\), whose joint probability distribution is given. The two quantities could be complicated joint quantities themselves. The conditional probability for \(\color[RGB]{238,102,119}Y\), given that \(\color[RGB]{34,136,51}X\) has some specific value \(\color[RGB]{34,136,51}x^*\), is then

\[ \mathrm{P}({\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}y}\nonscript\:\vert\nonscript\:\mathopen{} {\color[RGB]{34,136,51}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}x^*}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}) = \frac{ \mathrm{P}({\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}y} \mathbin{\mkern-0.5mu,\mkern-0.5mu}{\color[RGB]{34,136,51}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}x^*}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{I}) }{ \sum_{\color[RGB]{238,102,119}\upsilon}\mathrm{P}({\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\upsilon}\mathbin{\mkern-0.5mu,\mkern-0.5mu}{\color[RGB]{34,136,51}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}x^*}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{I}) } \tag{17.1}\]

for all possible values \(\color[RGB]{238,102,119}y\).
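Formula (17.1) translates directly into a generic routine. A sketch, assuming the joint distribution is stored as a dictionary mapping \((y, x)\) pairs to probabilities:

```python
def conditional(joint, x_star):
    """Conditional distribution P(Y=y | X=x*, I), eq. (17.1), from a joint
    distribution given as a dict mapping (y, x) pairs to probabilities."""
    # Denominator: the marginal P(X=x* | I), summing the joint over all y
    p_x = sum(p for (y, x), p in joint.items() if x == x_star)
    # One term per value y of Y, each divided by the marginal
    return {y: p / p_x for (y, x), p in joint.items() if x == x_star}
```

Applied to a \((U, T)\) joint distribution as in §  17.2, `conditional(joint, "helicopter")` would return the whole conditional distribution for \(U\) given \(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\).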


17.5 Conditional from conditional probability

As emphasized in §  5.2, probabilities are either obtained from other probabilities, or taken as given, perhaps determined by symmetry requirements. This is also true when we want to calculate conditional probabilities.

Up to now we have calculated conditional probabilities starting from the joint distribution as given, using the derived formula (17.1). In some situations, however, an agent may instead be given conditional probabilities together with marginal probabilities.

As an example let’s consider a variation of our next-patient scenario one more time. The agent has background information \(\mathsfit{I}_{\text{S}}\) that provides the following set of probabilities:

  • Two conditional probability distributions  \(\mathrm{P}(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso \nonscript\:\vert\nonscript\:\mathopen{} U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{S}})\) for transportation \(T\) given urgency \(U\), as reported in the following table:
Table 17.1: Probability distributions \(\mathrm{P}(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}t \nonscript\:\vert\nonscript\:\mathopen{} U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}u\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{S}})\) for transportation at arrival \(T\), given urgency \(U\)

given urgency \(u\)    ambulance   helicopter   other
urgent                0.61        0.22         0.17
non-urgent            0.21        0.01         0.78
Important

This table has two probability distributions: on the first row, one conditional on \(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\); on the second row, one conditional on \(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\). Check that the probabilities on each row indeed sum up to one.


  • Marginal probability distribution  \(\mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{S}})\) for urgency \(U\):

\[ \mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{S}}) = 0.18 \ , \quad \mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{S}}) = 0.82 \tag{17.2}\]


With this background information, the agent can also compute all joint probabilities simply using the and-rule. For instance

\[ \begin{aligned} &P(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{S}}) \\[1ex] &\quad{}= P(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\nonscript\:\vert\nonscript\:\mathopen{} U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{S}}) \cdot P(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{S}}) \\[1ex] &\quad{}= 0.22 \cdot 0.18 = \boldsymbol{3.96\%} \end{aligned} \]

Note that the joint probabilities are slightly different compared with those from the previous background information \(\mathsfit{I}_{\text{H}}\).

And from the joint probabilities, the marginal ones for transportation \(T\) can also be calculated. For instance

\[ \begin{aligned} &P(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{S}}) \\[1ex] &\quad{}= \sum_u P(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}u \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{S}}) \\[1ex] &\quad{}= \sum_u P(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\nonscript\:\vert\nonscript\:\mathopen{} U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}u \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{S}}) \cdot P(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}u \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{S}}) \\[1ex] &\quad{}= 0.22 \cdot 0.18 + 0.01 \cdot 0.82 \\[1ex] &\quad{}= \boldsymbol{4.78\%} \end{aligned} \]

Suppose that the agent knows that the next patient is being transported by \({\small\verb;helicopter;}\), and needs to forecast whether \({\small\verb;urgent;}\) care will be needed. This inference is the conditional probability  \(\mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{S}})\), which can also be rewritten in terms of the set of probabilities initially given:

\[ \begin{aligned} &\mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{S}}) \\[2ex] &\quad{}=\frac{ \mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{S}}) }{ \mathrm{P}(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{S}}) } \\[1ex] &\quad{}=\frac{ P(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\nonscript\:\vert\nonscript\:\mathopen{} U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{S}}) \cdot P(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{S}}) }{ \sum_u P(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\nonscript\:\vert\nonscript\:\mathopen{} U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}u \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{S}}) \cdot P(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}u \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{S}}) } \\[1ex] &\quad{}=\frac{0.0396}{0.0478} \\[2ex] &\quad{}=\boldsymbol{82.8\%} \end{aligned} \]

This calculation has been slightly more involved than the one in §  17.2 because the joint probabilities were not directly available. Our calculation involved the steps  “ \(T\nonscript\:\vert\nonscript\:\mathopen{}U \longrightarrow T\land U \longrightarrow U\nonscript\:\vert\nonscript\:\mathopen{}T\) ”.
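The three steps can be carried out with the probabilities given at the start of this section; a minimal Python sketch:

```python
# Given: conditional distributions P(T=t | U=u, I_S) from table 17.1 ...
cond_T_given_U = {
    "urgent":     {"ambulance": 0.61, "helicopter": 0.22, "other": 0.17},
    "non-urgent": {"ambulance": 0.21, "helicopter": 0.01, "other": 0.78},
}
# ... and the marginal distribution P(U=u | I_S) from eq. (17.2)
marg_U = {"urgent": 0.18, "non-urgent": 0.82}

# Step 1, "T|U -> T&U": joint probabilities by the and-rule
joint = {(u, t): cond_T_given_U[u][t] * marg_U[u]
         for u in marg_U for t in cond_T_given_U[u]}

# Step 2: marginal P(T=helicopter | I_S), summing the joint over U
p_heli = sum(joint[(u, "helicopter")] for u in marg_U)

# Step 3, "T&U -> U|T": the conditional probability sought
print(joint[("urgent", "helicopter")])           # 0.0396
print(p_heli)                                    # 0.0478
print(joint[("urgent", "helicopter")] / p_heli)  # 0.828..., i.e. 82.8%
```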


If the agent were instead interested, say, in forecasting the transportation means knowing that the next patient requires urgent care, then the relevant degree of belief  \(\mathrm{P}(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso \nonscript\:\vert\nonscript\:\mathopen{} U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{S}})\) would be immediately available and no calculations would be needed.

17.6 General case: conditional from conditional

The example from the previous section can be easily generalized. Consider a joint quantity with component quantities \(\color[RGB]{34,136,51}X\) and \(\color[RGB]{238,102,119}Y\). The probabilities  \(\mathrm{P}({\color[RGB]{34,136,51}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso} \nonscript\:\vert\nonscript\:\mathopen{} {\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso} \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I})\)  and  \(\mathrm{P}({\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso} \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I})\)  are given.

The conditional probability for \(\color[RGB]{238,102,119}Y\), given that \(\color[RGB]{34,136,51}X\) has some specific value \(\color[RGB]{34,136,51}x^*\), is then

\[ \mathrm{P}({\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}y}\nonscript\:\vert\nonscript\:\mathopen{} {\color[RGB]{34,136,51}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}x^*}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}) = \frac{ \mathrm{P}( {\color[RGB]{34,136,51}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}x^*}\nonscript\:\vert\nonscript\:\mathopen{} {\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}y} \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}) \cdot \mathrm{P}( {\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}y} \nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{I}) }{ \sum_{\color[RGB]{238,102,119}\upsilon} \mathrm{P}( {\color[RGB]{34,136,51}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}x^*}\nonscript\:\vert\nonscript\:\mathopen{} {\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\upsilon} \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}) \cdot \mathrm{P}( {\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\upsilon} \nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{I}) } \tag{17.3}\]

for all possible values \(\color[RGB]{238,102,119}y\).

In the above formula we recognize Bayes’s theorem from §  9.5.

This formula is often given exaggerated emphasis in the literature; some texts even present it as an “axiom” to be used in situations such as the present one. But we see that it is simply a by-product of the four fundamental rules of inference in a specific situation. An AI agent who knows the four fundamental inference rules, and doesn’t know what “Bayes’s theorem” is, will nevertheless arrive at this very formula.
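Like formula (17.1), formula (17.3) maps directly onto a generic routine. A sketch, assuming the given conditional probabilities are stored as a nested dict `cond[y][x]` and the given marginal ones as a dict `marg[y]`:

```python
def conditional_from_conditional(cond, marg, x_star):
    """P(Y=y | X=x*, I), eq. (17.3), from the given conditional
    probabilities P(X=x | Y=y, I) = cond[y][x] and the given marginal
    probabilities P(Y=y | I) = marg[y]."""
    # Numerator of eq. (17.3), one term per value y of Y
    num = {y: cond[y][x_star] * marg[y] for y in marg}
    # The denominator is the sum of all the numerators
    total = sum(num.values())
    return {y: n / total for y, n in num.items()}
```

With `cond_T_given_U` and `marg_U` from the sketch in §  17.5, `conditional_from_conditional(cond_T_given_U, marg_U, "helicopter")` reproduces the \(82.8\%\) found there.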

17.7 Conditional densities

The discussion so far about conditional probabilities extends to conditional probability densities, in the usual way explained in §§ 15.3 and 16.2.

If \(\color[RGB]{34,136,51}X\) and \(\color[RGB]{238,102,119}Y\) are continuous quantities, the notation

\[ \mathrm{p}({\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}y} \nonscript\:\vert\nonscript\:\mathopen{} {\color[RGB]{34,136,51}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}x} \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}) = {\color[RGB]{68,119,170}q} \]

means that, given background information \(\mathsfit{I}\) and given the sentence “\(\color[RGB]{34,136,51}X\) has a value between \(\color[RGB]{34,136,51}x-\delta/2\) and \(\color[RGB]{34,136,51}x+\delta/2\)”, the sentence “\(\color[RGB]{238,102,119}Y\) has a value between \(\color[RGB]{238,102,119}y-\epsilon/2\) and \(\color[RGB]{238,102,119}y+\epsilon/2\)” has probability \({\color[RGB]{68,119,170}q}\cdot{\color[RGB]{238,102,119}\epsilon}\), as long as \(\color[RGB]{34,136,51}\delta\) and \(\color[RGB]{238,102,119}\epsilon\) are small enough. Note that the small interval \(\color[RGB]{34,136,51}\delta\) for \(\color[RGB]{34,136,51}X\) is not multiplied by the density \(\color[RGB]{68,119,170}q\).

The relation between a conditional density and a joint density or a different conditional density is given by

\[ \begin{aligned} &\mathrm{p}({\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}y} \nonscript\:\vert\nonscript\:\mathopen{} {\color[RGB]{34,136,51}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}x} \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}) \\[1ex] &\quad{}= \frac{\displaystyle \mathrm{p}({\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}y} \mathbin{\mkern-0.5mu,\mkern-0.5mu}{\color[RGB]{34,136,51}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}x} \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}) }{\displaystyle \int_{\color[RGB]{238,102,119}\varUpsilon}\mathrm{p}({\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\upsilon} \mathbin{\mkern-0.5mu,\mkern-0.5mu}{\color[RGB]{34,136,51}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}x} \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}) \, \mathrm{d}{\color[RGB]{238,102,119}\upsilon} } \\[1ex] &\quad{}= \frac{\displaystyle \mathrm{p}({\color[RGB]{34,136,51}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}x} \nonscript\:\vert\nonscript\:\mathopen{} {\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}y} \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}) \cdot \mathrm{p}({\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}y} \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}) }{\displaystyle \int_{\color[RGB]{238,102,119}\varUpsilon} \mathrm{p}({\color[RGB]{34,136,51}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}x} \nonscript\:\vert\nonscript\:\mathopen{} {\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\upsilon} \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}) \cdot \mathrm{p}({\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\upsilon} \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I})\, \mathrm{d}{\color[RGB]{238,102,119}\upsilon} } \end{aligned} \]

where \(\color[RGB]{238,102,119}\varUpsilon\) is the domain of \(\color[RGB]{238,102,119}Y\).
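Numerically, the relation above can be approximated on a grid, replacing the integral with a quadrature rule. A sketch with an assumed joint density (a correlated standard bivariate Gaussian, chosen purely for illustration):

```python
import numpy as np

# Assumed joint density p(Y=y, X=x | I): standard bivariate Gaussian
# with correlation rho, chosen purely for illustration
def joint_density(y, x, rho=0.7):
    norm = 1.0 / (2 * np.pi * np.sqrt(1 - rho**2))
    return norm * np.exp(-(y**2 - 2*rho*x*y + x**2) / (2 * (1 - rho**2)))

y = np.linspace(-5.0, 5.0, 1001)   # grid over the domain of Y
x_star = 1.0                       # the given value of X

# Conditional density p(Y=y | X=x*, I): joint divided by the integral
# of the joint over y, approximated by the trapezoidal rule
num = joint_density(y, x_star)
cond = num / np.trapz(num, y)

print(np.trapz(cond, y))   # ~1.0: the conditional density is normalized
```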

17.8 Graphical representation of conditional probability distributions and densities

Conditional probability distributions and densities can be plotted in all the ways discussed in chapters 15 and 16. If we have two quantities \(A\) and \(B\), often we want to compare the different conditional probability distributions for \(A\) given different values of \(B\):

  • \(\mathrm{P}(A\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso \nonscript\:\vert\nonscript\:\mathopen{} B\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;one-value;} \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I})\),
  • \(\mathrm{P}(A\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso \nonscript\:\vert\nonscript\:\mathopen{} B\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;another-value;} \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I})\),
  • \(\dotsc\)

and so on. This can be achieved with overlapping line plots, side-by-side scatter plots, or similar means.
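For example, overlapping line plots of two conditional distributions could be drawn as follows; a sketch with made-up probabilities, assuming matplotlib is available:

```python
import matplotlib.pyplot as plt

a_values = ["low", "medium", "high"]   # made-up domain of A
# Made-up conditional distributions P(A=... | B=b, I), one per value of B
cond_dists = {
    "one-value":     [0.6, 0.3, 0.1],
    "another-value": [0.2, 0.3, 0.5],
}

for b, probs in cond_dists.items():
    plt.plot(a_values, probs, marker="o", label=f"P(A | B={b})")
plt.ylabel("probability")
plt.legend()
plt.show()
```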


In §  16.3 we saw that if we have the scatter plot for a joint probability density, then from its points we can often obtain a scatter plot for its marginal densities. Unfortunately no similar shortcut exists for the conditional densities that can be obtained from a joint density. In theory, a conditional density for \(Y\), given that a quantity \(X\) has a value in some small interval \(\delta\) around \(x\), could be obtained by considering only the scatter-plot points whose \(X\) coordinate lies between \(x-\delta/2\) and \(x+\delta/2\). But the number of such points is usually too small, and the resulting scatter plot could be very misleading.


Study reading