17 Conditional probability and learning
\[ \DeclarePairedDelimiters{\set}{\{}{\}} \DeclareMathOperator*{\argmax}{argmax} \]
17.1 The meaning of the term “conditional probability”
When we introduced the notion of degree of belief – a.k.a. probability – in chapter 8, we emphasized that every probability is conditional on some state of knowledge or information. So the term “conditional probability” sounds like a pleonasm, just like saying “round circle”.
This term must be understood in a way analogous to “marginal probability”: it applies in situations where we have two or more sentences of interest. We speak of a “conditional probability” when we want to emphasize that additional sentences appear in the conditional (right side of “\(\nonscript\:\vert\nonscript\:\mathopen{}\)”) of that probability. For instance, in a scenario with these two probabilities:
\[ \mathrm{P}(\mathsfit{A} \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{\color[RGB]{204,187,68}B} \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}) \qquad \mathrm{P}(\mathsfit{A} \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}) \]
we call the first the conditional probability of \(\mathsfit{A}\) (given \(\mathsfit{\color[RGB]{204,187,68}B}\)), to emphasize or point out that its conditional includes the additional sentence \(\mathsfit{\color[RGB]{204,187,68}B}\), whereas the conditional of the second probability doesn’t include this sentence.
17.2 The relation between learning and conditional probability
Why do we need to emphasize that a particular degree of belief is conditional on an additional sentence? Because the additional sentence usually represents new information that the agent has learned.
Remember that the conditional of a probability usually contains all factual information known to the agent¹. Therefore, if an agent acquires new data or a new piece of information expressed by a sentence \(\color[RGB]{204,187,68}\mathsfit{D}\), it should draw inferences and make decisions using probabilities that include \(\color[RGB]{204,187,68}\mathsfit{D}\) in their conditional. In other words, the agent was previously drawing inferences and making decisions using some probabilities
1 Exceptions are, for instance, when the agent does counterfactual or hypothetical reasoning, as we discussed in § 5.1.
\[ \mathrm{P}(\dotso \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{K}) \]
where \(\mathsfit{K}\) is the agent’s knowledge until then. Now that the agent has acquired information or data \(\color[RGB]{204,187,68}\mathsfit{D}\), it will draw inferences and make decisions using probabilities
\[ \mathrm{P}(\dotso \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{\color[RGB]{204,187,68}D} \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{K}) \]
Vice versa, if we see that an agent is calculating new probabilities conditional on an additional sentence \(\color[RGB]{204,187,68}\mathsfit{D}\), then it means² that the agent has acquired that information or data \(\color[RGB]{204,187,68}\mathsfit{D}\).
2 But keep again in mind exceptions like counterfactual reasoning; see the previous side note.
Therefore conditional probabilities represent an agent’s learning and should be used when an agent has learned something.
This learning can be of many different kinds. Let’s examine two particular kinds by means of some examples.
17.3 Learning about a quantity from a different quantity
Consider once more the next-patient arrival scenario of § 15.2, with joint quantity \((U,T)\) and an agent’s joint probability distribution as in table 15.1, reproduced here:
| \(\mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}u \mathbin{\mkern-0.5mu,\mkern-0.5mu}T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}t\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{I}_{\text{H}})\) | transportation at arrival \(T\): ambulance | helicopter | other |
|---|---|---|---|
| urgency \(U\): urgent | 0.11 | 0.04 | 0.03 |
| urgency \(U\): non-urgent | 0.17 | 0.01 | 0.64 |
Suppose that the agent must forecast whether the next patient will require \({\small\verb;urgent;}\) or \({\small\verb;non-urgent;}\) care, so it needs to calculate the probability distribution for \(U\) (that is, the probabilities for \(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\) and \(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\)).
In the first exercise of § 16.1 you found that the marginal probability that the next patient will need urgent care is
\[\mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{H}}) = 18\%\]
This is the agent’s degree of belief if it has nothing more and nothing less than the knowledge encoded in the sentence \(\mathsfit{I}_{\text{H}}\).
But now let’s imagine that the agent receives a new piece of information: it is told that the next patient is being transported by helicopter. In other words, the agent has learned that the sentence \(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\) is true. The agent’s complete knowledge is therefore now encoded in the and-ed sentence
\[T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\ \land\ \mathsfit{I}_{\text{H}}\]
and this composite sentence should appear in the conditional. The agent’s belief that the next patient requires urgent care, given the new information, is therefore
\[\mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{H}})\]
Calculation of this probability can be done with just one application of the and-rule, leading to a formula connected with Bayes’s theorem (§ 9.4):
\[ \begin{aligned} &\mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{H}}) = \mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{H}}) \cdot \mathrm{P}(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{H}}) \\[3ex] &\quad\implies\quad \mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{H}}) = \frac{ \mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{H}}) }{ \mathrm{P}(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{H}}) } \end{aligned} \]
Let’s see how to calculate this. The agent already has the joint probability for \(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\land T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\) that appears in the numerator of the fraction above. The probability in the denominator is just a marginal probability for \(T\), and we know how to calculate that too from § 16.1. So we find
\[ \mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{H}}) =\frac{ \mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{H}}) }{ \sum_u\mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}u \mathbin{\mkern-0.5mu,\mkern-0.5mu}T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{H}}) } \]
where it’s understood that the sum index \(u\) runs over the values \(\set{{\small\verb;urgent;}, {\small\verb;non-urgent;}}\).
This is called a conditional probability; in this case, the conditional probability of \(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\) given \(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\).
The collection of probabilities for all possible values of the quantity \(U\), given a specific value of the quantity \(T\), say \({\small\verb;helicopter;}\):
\[ \mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{H}}) \ , \qquad \mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\nonscript\:\vert\nonscript\:\mathopen{} T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{H}}) \]
is called the conditional probability distribution for \(U\) given \(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\). It is indeed a probability distribution because the two probabilities sum up to 1.
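As a concrete check, here is a minimal Python sketch (the dictionary below simply transcribes table 15.1; all names are my own and purely illustrative) that computes this conditional distribution by dividing joint probabilities by the marginal one:

```python
# Joint probabilities P(U=u, T=t | I_H), transcribed from table 15.1
joint = {
    ("urgent", "ambulance"): 0.11, ("urgent", "helicopter"): 0.04, ("urgent", "other"): 0.03,
    ("non-urgent", "ambulance"): 0.17, ("non-urgent", "helicopter"): 0.01, ("non-urgent", "other"): 0.64,
}

# Marginal probability P(T=helicopter | I_H): sum the joint probabilities over all urgency values
p_helicopter = sum(p for (u, t), p in joint.items() if t == "helicopter")

# Conditional distribution P(U=u | T=helicopter, I_H) = joint / marginal
conditional = {u: joint[(u, "helicopter")] / p_helicopter for u in ("urgent", "non-urgent")}
print(conditional)   # approximately {'urgent': 0.8, 'non-urgent': 0.2}
```

The two conditional probabilities, roughly \(80\%\) and \(20\%\), indeed sum to 1.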
17.4 Learning about a quantity from instances of similar quantities
In the previous section we examined how learning about one quantity can change an agent’s degree of belief about a different quantity, for example knowledge about “transportation” affects beliefs about “urgency”, or vice versa. The agent’s learning and ensuing belief change are reflected in the value of the corresponding conditional probability.
This kind of change can also occur with “similar” quantities, that is, quantities that represent the same kind of phenomenon and have the same domain. The maths and calculations are identical to the ones we explored above, but the interpretation and application can be somewhat different.
As an example, imagine a scenario similar to the next-patient arrival above, but now consider the next three patients to arrive and their urgency. Define the following three quantities:
\(U_1\) : urgency of the next patient
\(U_2\) : urgency of the second future patient from now
\(U_3\) : urgency of the third future patient from now
Each of these quantities has the same domain: \(\set{{\small\verb;urgent;},{\small\verb;non-urgent;}}\).
The joint quantity \((U_1, U_2, U_3)\) has a domain with \(2^3 = 8\) possible values:
- \(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\)
- \(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\)
- . . .
- \(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\)
- \(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\)
Suppose that an agent, with background information \(\mathsfit{I}\), has a particular joint belief distribution for the joint quantity \((U_1, U_2, U_3)\). For example consider the joint distribution implicitly given as follows:
- If \({\small\verb;urgent;}\) appears 0 times out of 3 in the joint value: probability = \(53.6\%\)
- If \({\small\verb;urgent;}\) appears 1 time out of 3: probability = \(11.4\%\)
- If \({\small\verb;urgent;}\) appears 2 times out of 3: probability = \(3.6\%\)
- If \({\small\verb;urgent;}\) appears 3 times out of 3: probability = \(1.4\%\)
Here are some examples of how the probability values are determined by the description above:
\[ \begin{aligned} &\mathrm{P}(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}) = 0.036 \quad&&\text{\small(${\small\verb;urgent;}$ appears twice)} \\[1ex] &\mathrm{P}(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}) = 0.114 &&\text{\small(${\small\verb;urgent;}$ appears once)} \\[1ex] &\mathrm{P}(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}) = 0.036 &&\text{\small(${\small\verb;urgent;}$ appears twice)} \\[1ex] &\mathrm{P}(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}) = 0.536 &&\text{\small(${\small\verb;urgent;}$ doesn't appear)} \end{aligned} \]
From this joint probability distribution the agent can calculate, among other things, its degree of belief that the third patient will require urgent care, regardless of the urgency of the preceding two patients. It’s the marginal probability
\[ \begin{aligned} \mathrm{P}(U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}) &= \sum_{u_1}\sum_{u_2} \mathrm{P}(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}u_1 \mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}u_2 \mathbin{\mkern-0.5mu,\mkern-0.5mu}U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}) \\[1ex] &= 0.014 + 0.036 + 0.036 + 0.114 \\[1ex] &= \boldsymbol{20.0\%} \end{aligned} \]
where each index \(u_1\) and \(u_2\) runs over the values \(\set{{\small\verb;urgent;}, {\small\verb;non-urgent;}}\). This double sum therefore involves four terms. The first term in the sum corresponds to “\(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\)” and therefore has probability \(0.014\) . The second term corresponds to “\(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\)” and therefore has probability \(0.036\). And so on.
Therefore the agent, with its current knowledge, has a \(20\%\) degree of belief that the third patient will require urgent care.
Now fast-forward in time, after two patients have arrived and have been taken good care of; or maybe they haven’t arrived yet, but their urgency conditions have been ascertained and communicated to the agent. Suppose that both patients were or are non-urgent cases. The agent now knows this fact. The agent needs to forecast whether the third patient will require urgent care.
The relevant degree of belief is obviously not \(\mathrm{P}(U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{I})\), calculated above, because this belief represents an agent knowing only \(\mathsfit{I}\). Now, instead, the agent has additional information about the first two patients, encoded in this and-ed sentence:
\[ U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;} \]
The relevant degree of belief is therefore the conditional probability
\[ \mathrm{P}(U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}) \]
We can calculate it with the same procedure as in the previous section:
\[ \begin{aligned} &\mathrm{P}(U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}) \\[2ex] &\qquad{}= \frac{ \mathrm{P}(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}) }{ \mathrm{P}(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}) } \\[1ex] &\qquad{}=\frac{0.114}{0.65} \\[2ex] &\qquad{}\approx \boldsymbol{17.5\%} \end{aligned} \]
This conditional probability of \(17.5\%\) for \(U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\) is lower than the \(20.0\%\) calculated previously, which was based only on the knowledge \(\mathsfit{I}\). Learning about the first two patients has thus affected the agent’s degree of belief about the third.
Let’s also check how the agent’s belief changes in the case where the first two patients are both urgent instead. The calculation is completely analogous:
\[ \begin{aligned} &\mathrm{P}(U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}) \\[2ex] &\qquad{}= \frac{ \mathrm{P}(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}) }{ \mathrm{P}(U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}) } \\[1ex] &\qquad{}=\frac{0.014}{0.050} \\[2ex] &\qquad{}= \boldsymbol{28.0\%} \end{aligned} \]
In this case the conditional probability \(28.0\%\) for \(U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\) is higher than the \(20.0\%\), which was based only on knowledge \(\mathsfit{I}\).
One possible intuitive explanation of these probability changes, in the present scenario, is that observation of two non-urgent cases makes the agent slightly more confident that “this is a day with few urgent cases”. Whereas observation of two urgent cases makes the agent more confident that “this is a day with many urgent cases”.
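As a quick numerical check of the values found in this section, the following Python sketch (a rough illustration; the variable names are my own) rebuilds the eight joint probabilities from the description above and recomputes the marginal and conditional probabilities:

```python
from itertools import product

VALUES = ("urgent", "non-urgent")

# Probability of each joint value, determined only by how many times
# "urgent" appears among the three patients, as specified in the text
prob_by_count = {0: 0.536, 1: 0.114, 2: 0.036, 3: 0.014}
joint = {seq: prob_by_count[seq.count("urgent")] for seq in product(VALUES, repeat=3)}

# Marginal P(U3=urgent | I): sum over the values of U1 and U2
p_u3_urgent = sum(p for seq, p in joint.items() if seq[2] == "urgent")
print(round(p_u3_urgent, 3))    # 0.2

# Conditional P(U3=urgent | U1=non-urgent, U2=non-urgent, I)
numerator = joint[("non-urgent", "non-urgent", "urgent")]
denominator = sum(p for seq, p in joint.items() if seq[:2] == ("non-urgent", "non-urgent"))
print(round(numerator / denominator, 3))    # 0.175

# Conditional P(U3=urgent | U1=urgent, U2=urgent, I)
numerator = joint[("urgent", "urgent", "urgent")]
denominator = sum(p for seq, p in joint.items() if seq[:2] == ("urgent", "urgent"))
print(round(numerator / denominator, 3))    # 0.28
```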
17.5 Learning in the general case
Take the time to review the two sections above, focusing on the application and meaning of the two scenarios and calculations, and noting the similarities and differences:
- The calculations were completely analogous: in both cases the conditional probability was obtained as the quotient of a joint probability and a marginal one.
- In the first (urgency & transportation) scenario, information about one aspect of the situation changed the agent’s belief about a different aspect (transportation and urgency). In the second (three-patient) scenario, information about analogous occurrences of one aspect of the situation changed the agent’s belief about a further occurrence of that same aspect.
A third scenario is also possible, which combines the two above. Consider the case with three patients, where each patient can require \({\small\verb;urgent;}\) care or not, and can be transported by \({\small\verb;ambulance;}\), \({\small\verb;helicopter;}\), or \({\small\verb;other;}\) means. To describe this situation, introduce three pairs of quantities, which together form the joint quantity
\[ (U_1, T_1, \ U_2, T_2, \ U_3, T_3) \]
whose symbols should be obvious. This joint quantity has \((2\cdot 3)^3 = 216\) possible values, corresponding to all urgency & transportation combinations for the three patients.
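One quick way to verify this count is to enumerate the domain explicitly, for instance with a few lines of Python (purely illustrative; the value labels are just the ones used in this chapter):

```python
from itertools import product

urgency_values = ("urgent", "non-urgent")
transport_values = ("ambulance", "helicopter", "other")

# Domain of one (U_i, T_i) pair: 2 * 3 = 6 values
pair_domain = list(product(urgency_values, transport_values))

# Domain of the joint quantity (U1, T1, U2, T2, U3, T3): 6**3 values
joint_domain = list(product(pair_domain, repeat=3))
print(len(joint_domain))    # 216
```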
Given the joint probability distribution for this joint quantity, it is possible to calculate all kinds of conditional probabilities, and therefore consider all the possible ways the agent may learn new information. For instance, suppose the agent learns this:
- the first two patients have not required urgent care
- the first patient was transported by ambulance
- the second patient was transported by other means
- the third patient is arriving by ambulance
and with this learned knowledge, the agent needs to infer whether the third patient will require urgent care. The required conditional probability is
\[ \begin{aligned} &\mathrm{P}(U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} T_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;ambulance;}\mathbin{\mkern-0.5mu,\mkern-0.5mu} U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}T_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;ambulance;}\mathbin{\mkern-0.5mu,\mkern-0.5mu} U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}T_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;other;}\mathbin{\mkern-0.5mu,\mkern-0.5mu} \mathsfit{I}) \\[2ex] &\qquad{}= \frac{ \mathrm{P}(U_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}T_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;ambulance;}\mathbin{\mkern-0.5mu,\mkern-0.5mu} U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}T_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;ambulance;}\mathbin{\mkern-0.5mu,\mkern-0.5mu} U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}T_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;other;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}) }{ \mathrm{P}(T_3\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;ambulance;}\mathbin{\mkern-0.5mu,\mkern-0.5mu} U_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}T_1\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;ambulance;}\mathbin{\mkern-0.5mu,\mkern-0.5mu} U_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}T_2\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;other;}\mathbin{\mkern-0.5mu,\mkern-0.5mu} \mathsfit{I}) } \end{aligned} \]
and is calculated in a way completely analogous to the ones already seen.
All three kinds of inference scenarios that we have discussed occur in data science and engineering. In machine learning, the second scenario is connected to “unsupervised learning”; the third, mixed scenario to “supervised learning”. As you just saw, the probability calculus “sees” all of these scenarios as analogous: information about something changes the agent’s belief about something else. And the handling of all three cases is perfectly covered by the four fundamental rules of inference.
So let’s write down the general formula for all these cases of learning.
Consider a generic joint quantity with component quantities \(\color[RGB]{34,136,51}X\) and \(\color[RGB]{238,102,119}Y\), whose joint probability distribution is given. Each of these two quantities could itself be a complicated joint quantity.
The conditional probability for \(\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}y\), given that the agent has learned that \(\color[RGB]{34,136,51}X\) has some specific value \(\color[RGB]{34,136,51}x^*\), is then
\[ \mathrm{P}({\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}y}\nonscript\:\vert\nonscript\:\mathopen{} {\color[RGB]{34,136,51}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}x^*}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}) = \frac{ \mathrm{P}({\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}y} \mathbin{\mkern-0.5mu,\mkern-0.5mu}{\color[RGB]{34,136,51}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}x^*}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{I}) }{ \sum_{\color[RGB]{238,102,119}\upsilon}\mathrm{P}({\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\upsilon}\mathbin{\mkern-0.5mu,\mkern-0.5mu}{\color[RGB]{34,136,51}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}x^*}\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{I}) } \tag{17.1}\]
where the index \(\color[RGB]{238,102,119}\upsilon\) runs over all possible values in the domain of \(\color[RGB]{238,102,119}Y\).
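In code, formula (17.1) amounts to selecting the joint probabilities having \(\color[RGB]{34,136,51}X\) fixed at \(\color[RGB]{34,136,51}x^*\) and normalizing them. Here is a minimal Python sketch, under the assumption that the joint distribution is stored as a dictionary mapping \((y, x)\) pairs to probabilities (the function and variable names are hypothetical, not from the text):

```python
from itertools import product

def condition_on(joint, x_star):
    """Conditional distribution P(Y=y | X=x*, I) from a joint distribution
    given as a dict {(y, x): probability} -- a direct transcription of (17.1)."""
    numerators = {y: p for (y, x), p in joint.items() if x == x_star}
    marginal = sum(numerators.values())      # P(X=x* | I), by marginalization
    return {y: p / marginal for y, p in numerators.items()}

# X may itself be a joint quantity: here Y = U3 and X = (U1, U2),
# using the three-patient distribution of section 17.4
prob_by_count = {0: 0.536, 1: 0.114, 2: 0.036, 3: 0.014}
joint = {
    (u3, (u1, u2)): prob_by_count[(u1, u2, u3).count("urgent")]
    for u1, u2, u3 in product(("urgent", "non-urgent"), repeat=3)
}
print(condition_on(joint, ("non-urgent", "non-urgent")))
# {'urgent': 0.175..., 'non-urgent': 0.824...}
```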
17.6 Conditional probabilities as initial information
Up to now we have calculated conditional probabilities, using the derived formula (17.1), starting from the joint probability distribution, which we considered to be given. In some situations, however, an agent may initially possess not a joint probability distribution but conditional probabilities together with marginal probabilities.
As an example let’s consider a variation of our next-patient scenario one more time. The agent has background information \(\mathsfit{I}_{\text{S}}\) that provides the following set of probabilities:
- Two conditional probability distributions \(\mathrm{P}(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso \nonscript\:\vert\nonscript\:\mathopen{} U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{S}})\) for transportation \(T\) given urgency \(U\), as reported in the following table:
| \(\mathrm{P}(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}t \nonscript\:\vert\nonscript\:\mathopen{} U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}u\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{S}})\) | transportation at arrival \(T\): ambulance | helicopter | other |
|---|---|---|---|
| given urgency \(U\): urgent | 0.61 | 0.22 | 0.17 |
| given urgency \(U\): non-urgent | 0.21 | 0.01 | 0.78 |
- Marginal probability distribution \(\mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{S}})\) for urgency \(U\):
\[ \mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{S}}) = 0.18 \ , \quad \mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;non-urgent;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{S}}) = 0.82 \tag{17.2}\]
With this background information, the agent can also compute all joint probabilities, simply by using the and-rule. For instance, the joint probability for \(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\) is
\[ \begin{aligned} &P(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{S}}) \\[1ex] &\quad{}= P(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\nonscript\:\vert\nonscript\:\mathopen{} U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{S}}) \cdot P(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{S}}) \\[1ex] &\quad{}= 0.22 \cdot 0.18 = \boldsymbol{3.96\%} \end{aligned} \]
And from the joint probabilities, the marginal ones for transportation \(T\) can also be calculated. For instance
\[ \begin{aligned} &P(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{S}}) \\[1ex] &\quad{}= \sum_u P(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}u \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{S}}) \\[1ex] &\quad{}= \sum_u P(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\nonscript\:\vert\nonscript\:\mathopen{} U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}u \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{S}}) \cdot P(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}u \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{S}}) \\[1ex] &\quad{}= 0.22 \cdot 0.18 + 0.01 \cdot 0.82 \\[1ex] &\quad{}= \boldsymbol{4.78\%} \end{aligned} \]
Now suppose that the agent learns that the next patient is being transported by \({\small\verb;helicopter;}\), and needs to forecast whether \({\small\verb;urgent;}\) care will be needed. This inference is the conditional probability \(\mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{S}})\), which can also be rewritten in terms of the conditional probabilities given initially:
\[ \begin{aligned} &\mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{S}}) \\[2ex] &\quad{}=\frac{ \mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{S}}) }{ \mathrm{P}(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{S}}) } \\[1ex] &\quad{}=\frac{ \mathrm{P}(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\nonscript\:\vert\nonscript\:\mathopen{} U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{S}}) \cdot \mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{S}}) }{ \sum_u \mathrm{P}(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;helicopter;}\nonscript\:\vert\nonscript\:\mathopen{} U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}u \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{S}}) \cdot \mathrm{P}(U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}u \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}_{\text{S}}) } \\[1ex] &\quad{}=\frac{0.0396}{0.0478} \\[2ex] &\quad{}\approx\boldsymbol{82.8\%} \end{aligned} \]
This calculation was slightly more involved than the one in § 17.3, because in the present case the joint probabilities were not directly available. Our calculation involved the steps \(T\nonscript\:\vert\nonscript\:\mathopen{}U \enspace\longrightarrow\enspace T\land U \enspace\longrightarrow\enspace U\nonscript\:\vert\nonscript\:\mathopen{}T\) .
In this same scenario, note that if the agent were instead interested, say, in forecasting the transportation means knowing that the next patient requires urgent care, then the relevant degree of belief \(\mathrm{P}(T\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso \nonscript\:\vert\nonscript\:\mathopen{} U\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;urgent;}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\text{S}})\) would be immediately available and no calculations would be needed.
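For completeness, the calculation above can be written as a short Python sketch (the dictionaries simply transcribe the probabilities given in this section; all names are my own):

```python
# Conditional probabilities P(T=t | U=u, I_S) and marginal probabilities P(U=u | I_S)
p_T_given_U = {
    "urgent":     {"ambulance": 0.61, "helicopter": 0.22, "other": 0.17},
    "non-urgent": {"ambulance": 0.21, "helicopter": 0.01, "other": 0.78},
}
p_U = {"urgent": 0.18, "non-urgent": 0.82}

# Marginal P(T=helicopter | I_S), via the and-rule and marginalization
p_helicopter = sum(p_T_given_U[u]["helicopter"] * p_U[u] for u in p_U)
print(round(p_helicopter, 4))    # 0.0478

# Conditional P(U=urgent | T=helicopter, I_S): joint probability divided by marginal
p_urgent_given_helicopter = p_T_given_U["urgent"]["helicopter"] * p_U["urgent"] / p_helicopter
print(round(p_urgent_given_helicopter, 3))    # 0.828
```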
Let’s find the general formula for this case, where the agent’s background information is represented by conditional probabilities instead of joint probabilities.
Consider a joint quantity with component quantities \(\color[RGB]{34,136,51}X\) and \(\color[RGB]{238,102,119}Y\). The conditional probabilities \(\mathrm{P}({\color[RGB]{34,136,51}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso} \nonscript\:\vert\nonscript\:\mathopen{} {\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso} \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I})\) and \(\mathrm{P}({\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso} \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I})\) are encoded in the agent from the start.
The conditional probability for \(\color[RGB]{238,102,119}Y \mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}y\), given that the agent has learned that \(\color[RGB]{34,136,51}X \mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}x^*\), is then
\[ \mathrm{P}({\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}y}\nonscript\:\vert\nonscript\:\mathopen{} {\color[RGB]{34,136,51}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}x^*}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}) = \frac{ \mathrm{P}( {\color[RGB]{34,136,51}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}x^*}\nonscript\:\vert\nonscript\:\mathopen{} {\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}y} \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}) \cdot \mathrm{P}( {\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}y} \nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{I}) }{ \sum_{\color[RGB]{238,102,119}\upsilon} \mathrm{P}( {\color[RGB]{34,136,51}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}x^*}\nonscript\:\vert\nonscript\:\mathopen{} {\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\upsilon} \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}) \cdot \mathrm{P}( {\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\upsilon} \nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{I}) } \tag{17.3}\]
In the above formula we recognize Bayes’s theorem from § 9.4.
This formula is often given exaggerated emphasis in the literature; some texts even present it as an “axiom” to be used in situations such as the present one. But we see that it is simply a by-product of the four fundamental rules of inference, applied in a specific situation. An AI agent that knows the four fundamental inference rules, and doesn’t know what “Bayes’s theorem” is, will nevertheless arrive at this very formula.
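For illustration, formula (17.3) can also be written as a small generic function; a minimal sketch, assuming the conditional probabilities for \(\color[RGB]{34,136,51}X\) given \(\color[RGB]{238,102,119}Y\) are stored as a nested dictionary indexed first by the value of \(\color[RGB]{238,102,119}Y\) and then by the value of \(\color[RGB]{34,136,51}X\) (names are hypothetical):

```python
def bayes_condition(x_given_y, y_marginal, x_star):
    """P(Y=y | X=x*, I) via formula (17.3): multiply each conditional probability
    P(X=x* | Y=y, I) by the marginal P(Y=y | I), then normalize."""
    unnormalized = {y: x_given_y[y][x_star] * y_marginal[y] for y in y_marginal}
    total = sum(unnormalized.values())      # the denominator of (17.3)
    return {y: p / total for y, p in unnormalized.items()}
```

Applied to the dictionaries of the previous sketch, for instance `bayes_condition(p_T_given_U, p_U, "helicopter")`, it returns the same \(82.8\%\) conditional probability for \({\small\verb;urgent;}\).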
17.7 Conditional densities
The discussion so far about conditional probabilities extends to conditional probability densities, in the usual way explained in §§15.3 and 16.2.
If \(\color[RGB]{34,136,51}X\) and \(\color[RGB]{238,102,119}Y\) are continuous quantities, the notation
\[ \mathrm{p}({\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}y} \nonscript\:\vert\nonscript\:\mathopen{} {\color[RGB]{34,136,51}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}x} \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}) = {\color[RGB]{68,119,170}q} \]
means that, given background information \(\mathsfit{I}\) and given the sentence “\(\color[RGB]{34,136,51}X\) has value between \(\color[RGB]{34,136,51}x-\delta/2\) and \(\color[RGB]{34,136,51}x+\delta/2\)”, the sentence “\(\color[RGB]{238,102,119}Y\) has value between \(\color[RGB]{238,102,119}y-\epsilon/2\) and \(\color[RGB]{238,102,119}y+\epsilon/2\)” has probability \({\color[RGB]{68,119,170}q}\cdot{\color[RGB]{238,102,119}\epsilon}\), as long as \(\color[RGB]{34,136,51}\delta\) and \(\color[RGB]{238,102,119}\epsilon\) are small enough. Note that the small interval \(\color[RGB]{34,136,51}\delta\) for \(\color[RGB]{34,136,51}X\) is not multiplied by the density \(\color[RGB]{68,119,170}q\).
The relation between a conditional density and a joint density or a different conditional density is given by
\[ \begin{aligned} &\mathrm{p}({\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}y} \nonscript\:\vert\nonscript\:\mathopen{} {\color[RGB]{34,136,51}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}x} \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}) \\[1ex] &\quad{}= \frac{\displaystyle \mathrm{p}({\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}y} \mathbin{\mkern-0.5mu,\mkern-0.5mu}{\color[RGB]{34,136,51}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}x} \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}) }{\displaystyle \int_{\color[RGB]{238,102,119}\varUpsilon}\mathrm{p}({\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\upsilon} \mathbin{\mkern-0.5mu,\mkern-0.5mu}{\color[RGB]{34,136,51}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}x} \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}) \, \mathrm{d}{\color[RGB]{238,102,119}\upsilon} } \\[1ex] &\quad{}= \frac{\displaystyle \mathrm{p}({\color[RGB]{34,136,51}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}x} \nonscript\:\vert\nonscript\:\mathopen{} {\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}y} \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}) \cdot \mathrm{p}({\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}y} \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I}) }{\displaystyle \int_{\color[RGB]{238,102,119}\varUpsilon} \mathrm{p}({\color[RGB]{34,136,51}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}x} \nonscript\:\vert\nonscript\:\mathopen{} {\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\upsilon} \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}) \cdot \mathrm{p}({\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\upsilon} \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{I})\, \mathrm{d}{\color[RGB]{238,102,119}\upsilon} } \end{aligned} \]
where \(\color[RGB]{238,102,119}\varUpsilon\) is the domain of \(\color[RGB]{238,102,119}Y\).
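Numerically, the integrals above are often approximated on a grid. Here is a minimal NumPy sketch with a made-up joint density, evaluated at one fixed value of \(x\), just to illustrate the normalization step (nothing in it comes from the text):

```python
import numpy as np

# Grid for Y and an illustrative, made-up joint density p(Y=y, X=x | I)
# evaluated at one fixed value of x; any non-negative, integrable function would do
y = np.linspace(-5.0, 5.0, 2001)
dy = y[1] - y[0]
x = 1.0
joint_at_x = np.exp(-0.5 * (y - 0.8 * x) ** 2) / np.sqrt(2 * np.pi)

# Denominator of the formula above: the marginal density p(X=x | I),
# approximated by a Riemann sum over the Y grid
marginal_at_x = joint_at_x.sum() * dy

# Conditional density p(Y=y | X=x, I) on the grid
conditional_density = joint_at_x / marginal_at_x
print(round(conditional_density.sum() * dy, 6))    # 1.0: it integrates to one, as a density should
```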
17.8 Graphical representation of conditional probability distributions and densities
Conditional probability distributions and densities can be plotted in all the ways discussed in chapters 15 and 16. If we have two quantities \(A\) and \(B\), often we want to compare the different conditional probability distributions for \(A\), conditional on different values of \(B\):
- \(\mathrm{P}(A\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso \nonscript\:\vert\nonscript\:\mathopen{} B\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;one-value;} \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I})\),
- \(\mathrm{P}(A\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso \nonscript\:\vert\nonscript\:\mathopen{} B\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;another-value;} \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I})\),
- \(\dotsc\)
and so on. Such comparisons can be made with overlapping line plots, side-by-side scatter plots, or similar representations.
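For example, with the joint distribution of § 17.3 one could overlay the three conditional distributions for \(U\), one for each value of \(T\). A possible matplotlib sketch (purely illustrative; names are my own):

```python
import matplotlib.pyplot as plt

# Joint distribution of section 17.3 and the conditional distributions P(U=u | T=t, I_H) from it
joint = {
    ("urgent", "ambulance"): 0.11, ("urgent", "helicopter"): 0.04, ("urgent", "other"): 0.03,
    ("non-urgent", "ambulance"): 0.17, ("non-urgent", "helicopter"): 0.01, ("non-urgent", "other"): 0.64,
}
u_values = ("urgent", "non-urgent")
t_values = ("ambulance", "helicopter", "other")

for t in t_values:
    marginal_t = sum(joint[(u, t)] for u in u_values)
    conditional = [joint[(u, t)] / marginal_t for u in u_values]
    plt.plot(u_values, conditional, marker="o", label=f"T = {t}")   # one line per conditional distribution

plt.ylabel("P(U = u | T = t, I_H)")
plt.ylim(0, 1)
plt.legend()
plt.show()
```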
In § 16.3 we saw that if we have the scatter plot for a joint probability density, then from its points we can often obtain a scatter plot for its marginal densities. Unfortunately no similar shortcut exists for the conditional densities obtainable from a joint density. In theory, a conditional density for \(Y\), given that a quantity \(X\) has a value in some small interval of width \(\delta\) around \(x\), could be obtained by considering only the scatter-plot points whose \(X\) coordinate lies between \(x-\delta/2\) and \(x+\delta/2\). But the number of such points is usually too small, and the resulting scatter plot could be very misleading.