17  Conditional probability and learning

Published 2023-11-07

17.1 Conditional probability: augmenting knowledge

When we introduced the notion of degree of belief – a.k.a. probability – in chapter  8, we stressed the fact that every probability is conditional on some state of knowledge or information. So the term “conditional probability” sounds like a pleonasm, just like saying “round circle”.

This term must be understood in a way analogous to “marginal probability”: it applies in situations where we have two or more sentences of interest. We speak of a “conditional probability” when we want to emphasize that additional sentences appear in the conditional (right side of \(\nonscript\:\vert\nonscript\:\mathopen{}\)) of that probability, as compared to other probabilities. For instance, in a scenario in which these two probabilities appear:

\[ \mathrm{P}(\mathsfit{A} \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{B} \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}) \qquad \mathrm{P}(\mathsfit{A} \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}) \]

we call the first the conditional probability of \(\mathsfit{A}\) (given \(\mathsfit{B}\)) to emphasize that its conditional includes an additional sentence (\(\mathsfit{B}\)), whereas the conditional of the second probability doesn’t.

Such emphasis is important because it also means that the “conditional” probability is based on some additional knowledge, information, or hypothesis with respect to the “non-conditional” one. This has obvious connections with the idea of “learning”. Indeed the calculation of “conditional” probabilities enters all situations (even hypothetical or counterfactual ones, see §  5.1) in which some knowledge is augmented by new knowledge. This can happen in several ways, which we now examine.

17.2 Conditional from joint probability: dissimilar quantities

Consider once more the next-patient arrival scenario of §  15.2, with joint quantity \((U,T)\) and an agent’s joint probability distribution as in table  15.1. Suppose that the agent must forecast whether the next patient will require \({\small\verb;urgent;}\) or \({\small\verb;non-urgent;}\) care, so it needs to calculate the probability distribution for \(U\) (that is, the probabilities for \(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\) and \(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\)).

In the first exercise of §  16.1 you found that the marginal probability that the next patient will need urgent care is

\[\mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{H}}) = 18\%\]

This is the agent’s degree of belief if it has the knowledge encoded in the sentence \(\mathsfit{I}_{\text{H}}\), nothing more and nothing less.

But now let’s imagine that the agent receives a new piece of information: it is told that the next patient is being transported by helicopter. In other words, the agent now knows that the sentence \(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\) is true. The agent’s complete knowledge is then encoded in the anded sentence

\[T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\land \mathsfit{I}_{\text{H}}\]

which should therefore appear in the conditional. The agent’s belief that the next patient requires urgent care is therefore

\[\mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{H}})\]

Calculation of this probability can be done by just one application of the and-rule:

\[ \begin{aligned} &\mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{H}}) = \mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{H}}) \cdot \mathrm{P}(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{H}}) \\[3ex] &\quad\implies\quad \mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{H}}) = \frac{ \mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{H}}) }{ \mathrm{P}(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{H}}) } \end{aligned} \]

We do have the joint probability for \(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\land T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\) that appears in the numerator of the fraction above. The probability in the denominator is just a marginal probability for \(T\), and we know how to calculate that too from §  16.1. Finally we find

\[ \mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{H}}) =\frac{ \mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{H}}) }{ \sum_u\mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}u \mathbin{\mkern-0.5mu,\mkern-0.5mu}T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{H}}) } \]

where it’s understood that the sum index \(u\) runs over the values \(\set{{\small\verb;urgent;}, {\small\verb;non-urgent;}}\).

This is called a conditional probability; in this case, the conditional probability of  \(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\)  given  \(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\).

The collection of probabilities for all possible values of the quantity \(U\), given a specific value of the quantity \(T\), say \({\small\verb;helicopter;}\):

\[ \mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{H}}) \ , \qquad \mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\nonscript\:\vert\nonscript\:\mathopen{} T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{H}}) \]

is called the conditional probability distribution for \(U\)  given  \(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\). It is indeed a probability distribution because the two probabilities sum up to \(1\).
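The computation just performed can be sketched in code. Table 15.1 is not reproduced in this chapter, so the joint probabilities below are only placeholders (for concreteness they are the joint probabilities implied by the background information \(\mathsfit{I}_{\text{S}}\) of §  17.5, which are slightly different from those of \(\mathsfit{I}_{\text{H}}\)); a minimal Python sketch:

```python
# Placeholder joint distribution for (U, T): table 15.1 is not reproduced
# here, so these are the joint probabilities implied by the background
# information I_S of sec. 17.5 (slightly different from those of I_H).
joint = {
    ("urgent", "ambulance"): 0.1098,
    ("urgent", "helicopter"): 0.0396,
    ("urgent", "other"): 0.0306,
    ("non-urgent", "ambulance"): 0.1722,
    ("non-urgent", "helicopter"): 0.0082,
    ("non-urgent", "other"): 0.6396,
}

# Marginal probability P(T=helicopter | I): sum the joint over all values of U
p_heli = sum(p for (u, t), p in joint.items() if t == "helicopter")

# Conditional distribution P(U=u | T=helicopter, I): joint / marginal
cond = {u: joint[(u, "helicopter")] / p_heli
        for u in ("urgent", "non-urgent")}

print(cond)                 # {'urgent': 0.828..., 'non-urgent': 0.171...}
print(sum(cond.values()))   # 1.0: indeed a probability distribution
```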

Important

Note that the collection of probabilities for, say, \(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\), but for different values of the conditional quantity \(T\), that is

\[ \begin{aligned} &\mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;ambulance;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{H}}) \ , \\[1ex] &\mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{H}}) \ , \\[1ex] &\mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;other;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{H}}) \end{aligned} \]

is not a probability distribution. Calculate the three probabilities above and check that indeed they do not sum up to one.

Exercise
  • Using the values from table 15.1 and the formula for conditional probability above, calculate:

    • The conditional probability that the next patient needs urgent care, given that the patient is being transported by helicopter.

    • The conditional probability that the next patient is being transported by helicopter, given that the patient needs urgent care.

  • Now discuss and find an intuitive explanation for these comparisons:

    • The two probabilities you obtained above. Are they equal? Why or why not?

    • The marginal probability that the next patient will be transported by helicopter, with the conditional probability that the patient will be transported by helicopter given that the case is urgent. Are they equal? If not, which is higher, and why?


17.3 Conditional from joint probability: similar quantities

In the previous section we examined how knowledge about one quantity of a particular kind can change an agent’s degree of belief about a quantity of a different kind, for example “transportation” about “urgency” or vice versa. This change is reflected in the value of the corresponding conditional probability.

This kind of change can also occur with quantities of a “similar kind”, that is, quantities that represent the same kind of phenomenon and have exactly the same domain. The maths and calculations are identical to those we have explored, but the interpretation and application can be somewhat different.

As an example, imagine a scenario similar to the next-patient one above, but now consider the next three patients to arrive, and their urgency. Define the following three quantities:

\(U_1\) : urgency of the next patient
\(U_2\) : urgency of the second future patient from now
\(U_3\) : urgency of the third future patient from now

Each of these quantities has the same domain: \(\set{{\small\verb;urgent;},{\small\verb;non-urgent;}}\).

The joint quantity \((U_1, U_2, U_3)\) has a domain with \(2^3 = 8\) possible values:

  • \(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\)
  • \(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\)
  • . . .
  • \(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\)
  • \(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\)

Suppose that an agent, with background information \(\mathsfit{I}\), has a joint probability distribution for the joint quantity \((U_1, U_2, U_3)\); the distribution is implicitly given as follows:

  • If \({\small\verb;urgent;}\) appears 0 times out of 3: probability = \(53.6\%\)
  • If \({\small\verb;urgent;}\) appears 1 time out of 3: probability = \(11.4\%\)
  • If \({\small\verb;urgent;}\) appears 2 times out of 3: probability = \(3.6\%\)
  • If \({\small\verb;urgent;}\) appears 3 times out of 3: probability = \(1.4\%\)

Examples:

\[ \begin{aligned} &\mathrm{P}(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}) = 0.036 \\[1ex] &\mathrm{P}(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}) = 0.114 \\[1ex] &\mathrm{P}(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}) = 0.036 \\[1ex] \end{aligned} \]
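Since the joint distribution is specified only through the number of \({\small\verb;urgent;}\) values, it can help to construct the full eight-value distribution explicitly. A minimal Python sketch, using the probabilities given above:

```python
from itertools import product

# Probability of each individual sequence, by the number of 'urgent'
# values appearing in it (as specified above)
prob_by_count = {0: 0.536, 1: 0.114, 2: 0.036, 3: 0.014}

# Full joint distribution for (U1, U2, U3): 2**3 = 8 values
joint = {
    seq: prob_by_count[seq.count("urgent")]
    for seq in product(("urgent", "non-urgent"), repeat=3)
}

for seq, p in joint.items():
    print(seq, p)
```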

Exercise
  • Check that the joint probability distribution as defined above indeed sums up to \(1\).

  • Calculate the marginal probability for \(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\), that is,  \(\mathrm{P}(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{I})\).

  • Calculate the marginal probability that the second and third patients are non-urgent cases, that is

\[\mathrm{P}(U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{I}) \ .\]

From this joint probability distribution the agent can calculate, among other things, its degree of belief that the third patient from now will require urgent care, regardless of the urgency of the preceding two patients. It’s the marginal probability

\[ \begin{aligned} \mathrm{P}(U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}) &= \sum_{u_1}\sum_{u_2} \mathrm{P}(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}u_1 \mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}u_2 \mathbin{\mkern-0.5mu,\mkern-0.5mu}U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}) \\[1ex] &= 0.114 + 0.036 + 0.036 + 0.014 \\[1ex] &= \boldsymbol{20.0\%} \end{aligned} \]

where the first term \(0.114\) in the sum corresponds to \(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\), the second term \(0.036\) to \(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\), and so on.

Therefore the agent, right now, has a \(20\%\) degree of belief that the third patient from now will require urgent care.


Now fast-forward in time, after two patients have arrived and been taken good care of. Suppose that both were non-urgent cases, and the agent knows this. The agent needs to forecast whether the next (third) patient will require urgent care.

It wouldn’t be sensible to use  \(\mathrm{P}(U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{I})\),  calculated above, because this degree of belief represents an agent having only the background knowledge \(\mathsfit{I}\). Now, instead, the agent has additional information about the first two patients, encoded in this anded sentence:

\[ U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;} \]

The relevant degree of belief is therefore the conditional probability

\[ \begin{aligned} &\mathrm{P}(U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}) \\[2ex] &\qquad{}= \frac{ \mathrm{P}(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}) }{ \mathrm{P}(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}) } \\[1ex] &\qquad{}=\frac{0.114}{0.65} \\[2ex] &\qquad{}\approx \boldsymbol{17.5\%} \end{aligned} \]

This conditional probability of \(17.5\%\) for \(U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\) is lower than the marginal one of \(20.0\%\) calculated previously. Observation of two patients has thus affected the agent’s degree of belief.


Let’s also check how the agent’s belief changes in the case where the first two patients are both urgent. The calculation is completely analogous:

\[ \begin{aligned} &\mathrm{P}(U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}) \\[2ex] &\qquad{}= \frac{ \mathrm{P}(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}) }{ \mathrm{P}(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}) } \\[1ex] &\qquad{}=\frac{0.014}{0.050} \\[2ex] &\qquad{}= \boldsymbol{28.0\%} \end{aligned} \]

In this case the conditional probability \(28.0\%\) for \(U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\) is higher than the marginal one \(20.0\%\).
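These numbers are easy to verify with a short computation. A self-contained Python sketch, rebuilding the joint distribution from the count-based specification above:

```python
from itertools import product

# Joint distribution for (U1, U2, U3), from the count-based rule above
prob_by_count = {0: 0.536, 1: 0.114, 2: 0.036, 3: 0.014}
joint = {seq: prob_by_count[seq.count("urgent")]
         for seq in product(("urgent", "non-urgent"), repeat=3)}

# Marginal P(U3=urgent | I): sum over all values of U1 and U2
p_u3 = sum(p for seq, p in joint.items() if seq[2] == "urgent")

# Conditional P(U3=urgent | U1=u1, U2=u2, I): joint divided by the
# marginal probability of the conditioning values
def cond_u3_urgent(u1, u2):
    denom = sum(p for seq, p in joint.items()
                if seq[0] == u1 and seq[1] == u2)
    return joint[(u1, u2, "urgent")] / denom

print(p_u3)                                         # 0.200
print(cond_u3_urgent("non-urgent", "non-urgent"))   # 0.175...
print(cond_u3_urgent("urgent", "urgent"))           # 0.280
```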

One possible intuitive explanation of these probability changes, in the present scenario, is that observation of two non-urgent cases makes the agent slightly more confident that “this is a day with few urgent cases”; whereas observation of two urgent cases makes the agent more confident that “this is a day with many urgent cases”.

Important

In general we cannot say that the probability of a particular value (such as \({\small\verb;urgent;}\) in the scenario above) will decrease or increase as similar or dissimilar values are observed, nor how much the increase or decrease will be.

In a different situation the probability of \({\small\verb;urgent;}\) could actually increase as more and more \({\small\verb;non-urgent;}\) cases are observed. Imagine, for instance, a scenario where the agent initially knows that there are 10 urgent and 90 non-urgent cases ahead. Having observed 90 non-urgent cases, the agent will give a much higher probability – 100% – that the next case will be an urgent one.
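To see this numerically: in that scenario, assuming (an assumption added here for illustration) that each remaining case is equally likely to arrive next, the probability of \({\small\verb;urgent;}\) after observing some number of \({\small\verb;non-urgent;}\) cases is simply the fraction of urgent cases left:

```python
# Known composition: 10 urgent and 90 non-urgent cases ahead.
# Assumption for illustration: each remaining case is equally likely
# to be the next one.
def p_next_urgent(observed_nonurgent, urgent=10, nonurgent=90):
    remaining = urgent + nonurgent - observed_nonurgent
    return urgent / remaining

print(p_next_urgent(0))    # 0.1: initial degree of belief
print(p_next_urgent(50))   # 0.2: increases as non-urgent cases are observed
print(p_next_urgent(90))   # 1.0: certainty, only urgent cases are left
```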

The differences among such scenarios are reflected in differences of the joint probabilities, from which the conditional probabilities are calculated.

All these situations are correctly handled with the four fundamental rules of inference and the formula for conditional probability derived from them.

Exercises
  1. Using the same joint distribution above, calculate

    \[\mathrm{P}(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I})\]

    that is, the probability that the next patient will require urgent care given that the agent knows the second and third patients will not require urgent care.

    • Why is the value obtained different from  \(\mathrm{P}(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{I})\) ?

    • Describe a scenario in which the calculation above makes sense (and patients 2 and 3 still arrive after patient 1).


  2. Do an analysis completely analogous to the previous, three-patient one, but with different background information \(\mathsfit{J}\) that gives a joint probability distribution for \((U_1, U_2, U_3)\) as follows:

    • If \({\small\verb;urgent;}\) appears 0 times out of 3: probability = \(0\%\)
    • If \({\small\verb;urgent;}\) appears 1 time out of 3: probability = \(24.5\%\)
    • If \({\small\verb;urgent;}\) appears 2 times out of 3: probability = \(7.8\%\)
    • If \({\small\verb;urgent;}\) appears 3 times out of 3: probability = \(3.1\%\)

    1. Calculate

      \[\mathrm{P}(U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{J})\]

      and

      \[\mathrm{P}(U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{J})\] and compare them.

    2. Explain why this particular change in degree of belief occurs, in this situation.

17.4 General case: conditional from joint

Take the time to review the two sections above, focusing on the application and meaning of the two scenarios and calculations, and noting the similarities and differences:

  • The calculations were completely analogous; in particular, the conditional probability was obtained as the quotient of a joint probability and a marginal one.

  • In the first (next-patient) scenario, information about one aspect of the situation changed the agent’s belief about another aspect; the two aspects were somewhat different (transportation and urgency). In the second (three-patient) scenario, by contrast, information about analogous occurrences of an aspect of the situation changed the agent’s belief about a further occurrence.


A third scenario is also possible, which combines the two above. Consider the case with three patients, where each patient can require \({\small\verb;urgent;}\) care or not, and can be transported by \({\small\verb;ambulance;}\), \({\small\verb;helicopter;}\), or \({\small\verb;other;}\) means. To describe this situation, introduce three pairs of quantities, which together form the joint quantity

\[ (U_1, T_1, \ U_2, T_2, \ U_3, T_3) \]

whose meaning should now be obvious. This joint quantity has \((2\cdot 3)^3 = 216\) possible values, corresponding to all urgency & transportation combinations for the three patients.
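The size of this domain can be confirmed by enumerating it; a short check in Python:

```python
from itertools import product

urgency = ("urgent", "non-urgent")
transport = ("ambulance", "helicopter", "other")

# Domain of (U1, T1, U2, T2, U3, T3): all combinations for three patients
domain = list(product(urgency, transport, repeat=3))
print(len(domain))   # 216 == (2*3)**3
```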

Given the joint probability distribution for this joint quantity, it is possible to calculate all kinds of conditional probabilities, which reflect the knowledge that the agent may have acquired. For instance, suppose the agent has observed that

  • the first two patients have not required urgent care
  • the first patient was transported by ambulance
  • the second patient was transported by other means
  • the third patient is arriving by ambulance

and needs to infer whether the third patient will require urgent care. The required probability is

\[ \begin{aligned} &\mathrm{P}(U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} T_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;ambulance;}\mathbin{\mkern-0.5mu,\mkern-0.5mu} U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}T_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;ambulance;}\mathbin{\mkern-0.5mu,\mkern-0.5mu} U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}T_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;other;}\mathbin{\mkern-0.5mu,\mkern-0.5mu} \mathsfit{I}) \\[2ex] &\qquad{}= \frac{ \mathrm{P}(U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}T_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;ambulance;}\mathbin{\mkern-0.5mu,\mkern-0.5mu} U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}T_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;ambulance;}\mathbin{\mkern-0.5mu,\mkern-0.5mu} U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}T_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;other;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}) }{ \mathrm{P}(T_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;ambulance;}\mathbin{\mkern-0.5mu,\mkern-0.5mu} U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}T_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;ambulance;}\mathbin{\mkern-0.5mu,\mkern-0.5mu} U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}T_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;other;}\mathbin{\mkern-0.5mu,\mkern-0.5mu} \mathsfit{I}) } \end{aligned} \]

and is calculated in a way completely analogous to the ones already seen.


All three kinds of inference scenarios frequently occur in data science and engineering. In machine learning, the second scenario is connected to “unsupervised learning”; the third, mixed one to “supervised learning”. As you just saw, the probability calculus “sees” all of them as analogous: information about something changes the agent’s belief about something else. And the handling of all three cases is perfectly covered by the four fundamental rules of inference.


Let’s now consider a more generic case of a joint quantity with component quantities \(\color[RGB]{34,136,51}X\) and \(\color[RGB]{238,102,119}Y\), whose joint probability distribution is given. The two quantities could be complicated joint quantities themselves. The conditional probability for \(\color[RGB]{238,102,119}Y\), given that \(\color[RGB]{34,136,51}X\) has some specific value \(\color[RGB]{34,136,51}x^*\), is then

\[ \mathrm{P}({\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}y}\nonscript\:\vert\nonscript\:\mathopen{} {\color[RGB]{34,136,51}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}x^*}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}) = \frac{ \mathrm{P}({\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}y} \mathbin{\mkern-0.5mu,\mkern-0.5mu}{\color[RGB]{34,136,51}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}x^*}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{I}) }{ \sum_{\color[RGB]{238,102,119}\upsilon}\mathrm{P}({\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\upsilon}\mathbin{\mkern-0.5mu,\mkern-0.5mu}{\color[RGB]{34,136,51}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}x^*}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{I}) } \tag{17.1}\]

for all possible values \(\color[RGB]{238,102,119}y\).
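Formula (17.1) translates directly into a generic routine. A sketch, assuming the joint distribution is stored as a dictionary mapping \((y, x)\) pairs to probabilities:

```python
def conditional(joint, x_star):
    """Conditional distribution P(Y=y | X=x*, I), eq. (17.1), from a joint
    distribution given as a dict mapping (y, x) pairs to probabilities."""
    # Denominator: the marginal P(X=x* | I), summing the joint over all y
    p_x = sum(p for (y, x), p in joint.items() if x == x_star)
    # One term per value y of Y, each divided by the marginal
    return {y: p / p_x for (y, x), p in joint.items() if x == x_star}
```

Applied to a \((U, T)\) joint distribution as in §  17.2, `conditional(joint, "helicopter")` would return the whole conditional distribution for \(U\) given \(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\).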


17.5 Conditional from conditional probability

As emphasized in §  5.2, probabilities are either obtained from other probabilities, or taken as given, perhaps determined by symmetry requirements. This is also true when we want to calculate conditional probabilities.

Up to now we have calculated conditional probabilities starting from the joint distribution as given, using the derived formula (17.1). In some situations, however, an agent may instead be given conditional probabilities together with marginal probabilities.

As an example let’s consider a variation of our next-patient scenario one more time. The agent has background information \(\mathsfit{I}_{\text{S}}\) that provides the following set of probabilities:

  • Two conditional probability distributions  \(\mathrm{P}(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso \nonscript\:\vert\nonscript\:\mathopen{} U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{S}})\) for transportation \(T\) given urgency \(U\), as reported in the following table:
Table 17.1: Probability distributions \(\mathrm{P}(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}t \nonscript\:\vert\nonscript\:\mathopen{} U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}u\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{S}})\) for transportation at arrival \(T\), given urgency \(U\)

given urgency \(u\)    ambulance   helicopter   other
urgent                0.61        0.22         0.17
non-urgent            0.21        0.01         0.78
Important

This table has two probability distributions: on the first row, one conditional on \(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\); on the second row, one conditional on \(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\). Check that the probabilities on each row indeed sum up to one.


  • Marginal probability distribution  \(\mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{S}})\) for urgency \(U\):

\[ \mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{S}}) = 0.18 \ , \quad \mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{S}}) = 0.82 \tag{17.2}\]


With this background information, the agent can also compute all joint probabilities simply using the and-rule. For instance

\[ \begin{aligned} &P(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{S}}) \\[1ex] &\quad{}= P(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\nonscript\:\vert\nonscript\:\mathopen{} U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{S}}) \cdot P(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{S}}) \\[1ex] &\quad{}= 0.22 \cdot 0.18 = \boldsymbol{3.96\%} \end{aligned} \]

Note that the joint probabilities are slightly different compared with those from the previous background information \(\mathsfit{I}_{\text{H}}\).

And from the joint probabilities, the marginal ones for transportation \(T\) can also be calculated. For instance

\[ \begin{aligned} &P(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{S}}) \\[1ex] &\quad{}= \sum_u P(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}u \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{S}}) \\[1ex] &\quad{}= \sum_u P(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\nonscript\:\vert\nonscript\:\mathopen{} U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}u \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{S}}) \cdot P(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}u \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{S}}) \\[1ex] &\quad{}= 0.22 \cdot 0.18 + 0.01 \cdot 0.82 \\[1ex] &\quad{}= \boldsymbol{4.78\%} \end{aligned} \]

Suppose that the agent knows that the next patient is being transported by \({\small\verb;helicopter;}\), and needs to forecast whether \({\small\verb;urgent;}\) care will be needed. This inference is the conditional probability  \(\mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{S}})\), which can also be rewritten in terms of the set of probabilities initially given:

\[ \begin{aligned} &\mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{S}}) \\[2ex] &\quad{}=\frac{ \mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{S}}) }{ \mathrm{P}(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{S}}) } \\[1ex] &\quad{}=\frac{ P(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\nonscript\:\vert\nonscript\:\mathopen{} U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{S}}) \cdot P(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{S}}) }{ \sum_u P(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\nonscript\:\vert\nonscript\:\mathopen{} U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}u \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{S}}) \cdot P(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}u \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{S}}) } \\[1ex] &\quad{}=\frac{0.0396}{0.0478} \\[2ex] &\quad{}=\boldsymbol{82.8\%} \end{aligned} \]

This calculation has been slightly more involved than the one in §  17.2 because the joint probabilities were not directly available. Our calculation involved the steps  “ \(T\nonscript\:\vert\nonscript\:\mathopen{}U \longrightarrow T\land U \longrightarrow U\nonscript\:\vert\nonscript\:\mathopen{}T\) ”.
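The three steps can be carried out with the probabilities given at the start of this section; a minimal Python sketch:

```python
# Given: conditional distributions P(T=t | U=u, I_S) from table 17.1 ...
cond_T_given_U = {
    "urgent":     {"ambulance": 0.61, "helicopter": 0.22, "other": 0.17},
    "non-urgent": {"ambulance": 0.21, "helicopter": 0.01, "other": 0.78},
}
# ... and the marginal distribution P(U=u | I_S) from eq. (17.2)
marg_U = {"urgent": 0.18, "non-urgent": 0.82}

# Step 1, "T|U -> T&U": joint probabilities by the and-rule
joint = {(u, t): cond_T_given_U[u][t] * marg_U[u]
         for u in marg_U for t in cond_T_given_U[u]}

# Step 2: marginal P(T=helicopter | I_S), summing the joint over U
p_heli = sum(joint[(u, "helicopter")] for u in marg_U)

# Step 3, "T&U -> U|T": the conditional probability sought
print(joint[("urgent", "helicopter")])           # 0.0396
print(p_heli)                                    # 0.0478
print(joint[("urgent", "helicopter")] / p_heli)  # 0.828..., i.e. 82.8%
```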


If the agent were instead interested, say, in forecasting the transportation means knowing that the next patient requires urgent care, then the relevant degree of belief  \(\mathrm{P}(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso \nonscript\:\vert\nonscript\:\mathopen{} U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{S}})\) would be immediately available and no calculations would be needed.

17.6 General case: conditional from conditional

The example from the previous section can be easily generalized. Consider a joint quantity with component quantities \(\color[RGB]{34,136,51}X\) and \(\color[RGB]{238,102,119}Y\). The probabilities  \(\mathrm{P}({\color[RGB]{34,136,51}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso} \nonscript\:\vert\nonscript\:\mathopen{} {\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso} \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I})\)  and  \(\mathrm{P}({\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso} \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I})\)  are given.

The conditional probability for \(\color[RGB]{238,102,119}Y\), given that \(\color[RGB]{34,136,51}X\) has some specific value \(\color[RGB]{34,136,51}x^*\), is then

\[ \mathrm{P}({\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}y}\nonscript\:\vert\nonscript\:\mathopen{} {\color[RGB]{34,136,51}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}x^*}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}) = \frac{ \mathrm{P}( {\color[RGB]{34,136,51}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}x^*}\nonscript\:\vert\nonscript\:\mathopen{} {\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}y} \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}) \cdot \mathrm{P}( {\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}y} \nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{I}) }{ \sum_{\color[RGB]{238,102,119}\upsilon} \mathrm{P}( {\color[RGB]{34,136,51}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}x^*}\nonscript\:\vert\nonscript\:\mathopen{} {\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\upsilon} \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}) \cdot \mathrm{P}( {\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\upsilon} \nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{I}) } \tag{17.3}\]

for all possible values \(\color[RGB]{238,102,119}y\).

In the above formula we recognize Bayes’s theorem from §  9.5.

This formula is often given exaggerated emphasis in the literature; some texts even present it as an “axiom” to be used in situations such as the present one. But we see that it is simply a by-product of the four fundamental rules of inference in a specific situation. An AI agent who knows the four fundamental inference rules, and doesn’t know what “Bayes’s theorem” is, will nevertheless arrive at this very formula.
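Like formula (17.1), formula (17.3) maps directly onto a generic routine. A sketch, assuming the given conditional probabilities are stored as a nested dict `cond[y][x]` and the given marginal ones as a dict `marg[y]`:

```python
def conditional_from_conditional(cond, marg, x_star):
    """P(Y=y | X=x*, I), eq. (17.3), from the given conditional
    probabilities P(X=x | Y=y, I) = cond[y][x] and the given marginal
    probabilities P(Y=y | I) = marg[y]."""
    # Numerator of eq. (17.3), one term per value y of Y
    num = {y: cond[y][x_star] * marg[y] for y in marg}
    # The denominator is the sum of all the numerators
    total = sum(num.values())
    return {y: n / total for y, n in num.items()}
```

With `cond_T_given_U` and `marg_U` from the sketch in §  17.5, `conditional_from_conditional(cond_T_given_U, marg_U, "helicopter")` reproduces the \(82.8\%\) found there.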

17.7 Conditional densities

The discussion so far about conditional probabilities extends to conditional probability densities, in the usual way explained in §§ 15.3 and 16.2.

If \(\color[RGB]{34,136,51}X\) and \(\color[RGB]{238,102,119}Y\) are continuous quantities, the notation

\[ \mathrm{p}({\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}y} \nonscript\:\vert\nonscript\:\mathopen{} {\color[RGB]{34,136,51}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}x} \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}) = {\color[RGB]{68,119,170}q} \]

means that, given background information \(\mathsfit{I}\) and given the sentence “\(\color[RGB]{34,136,51}X\) has a value between \(\color[RGB]{34,136,51}x-\delta/2\) and \(\color[RGB]{34,136,51}x+\delta/2\)”, the sentence “\(\color[RGB]{238,102,119}Y\) has a value between \(\color[RGB]{238,102,119}y-\epsilon/2\) and \(\color[RGB]{238,102,119}y+\epsilon/2\)” has probability \({\color[RGB]{68,119,170}q}\cdot{\color[RGB]{238,102,119}\epsilon}\), as long as \(\color[RGB]{34,136,51}\delta\) and \(\color[RGB]{238,102,119}\epsilon\) are small enough. Note that the small interval \(\color[RGB]{34,136,51}\delta\) for \(\color[RGB]{34,136,51}X\) is not multiplied by the density \(\color[RGB]{68,119,170}q\).

The relation between a conditional density and a joint density or a different conditional density is given by

\[ \begin{aligned} &\mathrm{p}({\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}y} \nonscript\:\vert\nonscript\:\mathopen{} {\color[RGB]{34,136,51}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}x} \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}) \\[1ex] &\quad{}= \frac{\displaystyle \mathrm{p}({\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}y} \mathbin{\mkern-0.5mu,\mkern-0.5mu}{\color[RGB]{34,136,51}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}x} \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}) }{\displaystyle \int_{\color[RGB]{238,102,119}\varUpsilon}\mathrm{p}({\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\upsilon} \mathbin{\mkern-0.5mu,\mkern-0.5mu}{\color[RGB]{34,136,51}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}x} \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}) \, \mathrm{d}{\color[RGB]{238,102,119}\upsilon} } \\[1ex] &\quad{}= \frac{\displaystyle \mathrm{p}({\color[RGB]{34,136,51}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}x} \nonscript\:\vert\nonscript\:\mathopen{} {\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}y} \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}) \cdot \mathrm{p}({\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}y} \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}) }{\displaystyle \int_{\color[RGB]{238,102,119}\varUpsilon} \mathrm{p}({\color[RGB]{34,136,51}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}x} \nonscript\:\vert\nonscript\:\mathopen{} {\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\upsilon} \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}) \cdot \mathrm{p}({\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\upsilon} \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I})\, \mathrm{d}{\color[RGB]{238,102,119}\upsilon} } \end{aligned} \]

where \(\color[RGB]{238,102,119}\varUpsilon\) is the domain of \(\color[RGB]{238,102,119}Y\).
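Numerically, the relation above can be approximated on a grid, replacing the integral with a quadrature rule. A sketch with an assumed joint density (a correlated standard bivariate Gaussian, chosen purely for illustration):

```python
import numpy as np

# Assumed joint density p(Y=y, X=x | I): standard bivariate Gaussian
# with correlation rho, chosen purely for illustration
def joint_density(y, x, rho=0.7):
    norm = 1.0 / (2 * np.pi * np.sqrt(1 - rho**2))
    return norm * np.exp(-(y**2 - 2*rho*x*y + x**2) / (2 * (1 - rho**2)))

y = np.linspace(-5.0, 5.0, 1001)   # grid over the domain of Y
x_star = 1.0                       # the given value of X

# Conditional density p(Y=y | X=x*, I): joint divided by the integral
# of the joint over y, approximated by the trapezoidal rule
num = joint_density(y, x_star)
cond = num / np.trapz(num, y)

print(np.trapz(cond, y))   # ~1.0: the conditional density is normalized
```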

17.8 Graphical representation of conditional probability distributions and densities

Conditional probability distributions and densities can be plotted in all the ways discussed in chapters 15 and 16. If we have two quantities \(A\) and \(B\), often we want to compare the different conditional probability distributions for \(A\) given different values of \(B\):

  • \(\mathrm{P}(A\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso \nonscript\:\vert\nonscript\:\mathopen{} B\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;one-value;} \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I})\),
  • \(\mathrm{P}(A\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso \nonscript\:\vert\nonscript\:\mathopen{} B\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;another-value;} \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I})\),
  • \(\dotsc\)

and so on. This can be achieved with overlapping line plots, side-by-side scatter plots, or similar means.
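For example, overlapping line plots of two conditional distributions could be drawn as follows; a sketch with made-up probabilities, assuming matplotlib is available:

```python
import matplotlib.pyplot as plt

a_values = ["low", "medium", "high"]   # made-up domain of A
# Made-up conditional distributions P(A=... | B=b, I), one per value of B
cond_dists = {
    "one-value":     [0.6, 0.3, 0.1],
    "another-value": [0.2, 0.3, 0.5],
}

for b, probs in cond_dists.items():
    plt.plot(a_values, probs, marker="o", label=f"P(A | B={b})")
plt.ylabel("probability")
plt.legend()
plt.show()
```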


In §  16.3 we saw that if we have the scatter plot for a joint probability density, then from its points we can often obtain a scatter plot for its marginal densities. Unfortunately no similar shortcut exists for the conditional densities that can be obtained from a joint density. In theory, a conditional density for \(Y\), given that a quantity \(X\) has a value in some small interval \(\delta\) around \(x\), could be obtained by considering only the scatter-plot points whose \(X\) coordinate lies between \(x-\delta/2\) and \(x+\delta/2\). But the number of such points is usually too small, and the resulting scatter plot could be very misleading.


Study reading