30 A look behind
In the previous chapters we have learned a lot of theory and fundamentals, illustrated with simple examples. Now we shall finally put the theory into practice! We shall build a prototype AI agent that implements the theory as closely as possible.
Before proceeding, let’s take a quick look back at the road behind us and recall the most important milestones:
In order to make decisions and act in an optimal way, an agent needs to maximize expected utility. The agent must quantify its “values” (utilities) and its degrees of belief (probabilities) about the problem’s possible consequences.
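To make this milestone concrete, here is a minimal Python sketch of expected-utility maximization for a made-up two-decision problem; the decisions, consequences, utilities, and probabilities below are all hypothetical illustrations, not part of the theory itself:

```python
# A minimal sketch of expected-utility maximization, assuming a toy
# two-decision problem. All decisions, consequences, utilities, and
# probabilities below are made-up illustrations.

# utilities[d][o]: the agent's utility for consequence o of decision d.
utilities = {
    "treat": {"healthy": 10, "ill": -2},
    "wait":  {"healthy": 12, "ill": -8},
}

# beliefs[d][o]: the agent's probability of consequence o given decision d.
beliefs = {
    "treat": {"healthy": 0.9, "ill": 0.1},
    "wait":  {"healthy": 0.6, "ill": 0.4},
}

def expected_utility(decision):
    """Probability-weighted sum of utilities over all consequences."""
    return sum(utilities[decision][o] * beliefs[decision][o]
               for o in utilities[decision])

# The optimal decision is the one maximizing expected utility.
best = max(utilities, key=expected_utility)
print({d: round(expected_utility(d), 2) for d in utilities}, "->", best)
```

With these made-up numbers the agent would choose "treat", since its expected utility (8.8) exceeds that of "wait" (4.0).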
In order to calculate its degrees of belief in a logically consistent way, an agent must use the four fundamental rules of inference:
\(\mathrm{P}(\lnot X \mid Z) = 1 - \mathrm{P}(X \mid Z)\)
\(\mathrm{P}(X, Y \mid Z) = \mathrm{P}(X \mid Y, Z) \cdot \mathrm{P}(Y \mid Z) = \mathrm{P}(Y \mid X, Z) \cdot \mathrm{P}(X \mid Z)\)
\(\mathrm{P}(X \lor Y \mid Z) = \mathrm{P}(X \mid Z) + \mathrm{P}(Y \mid Z) - \mathrm{P}(X, Y \mid Z)\)
\(\mathrm{P}(X \mid X, Z) = 1\)

Any departure from these rules will lead to small or large logical errors.
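To see the four rules in action, here is a minimal numerical sketch in Python; all probability values are made up purely for illustration:

```python
# A minimal sketch of the four fundamental rules, with made-up numbers.
# Given P(X|Z), P(Y|Z), and P(Y|X,Z), the other quantities follow
# without any further assumptions.

p_X = 0.3          # P(X | Z)
p_Y = 0.5          # P(Y | Z)
p_Y_given_X = 0.8  # P(Y | X, Z)

p_not_X = 1 - p_X                 # negation rule: P(not-X|Z) = 1 - P(X|Z)
p_X_and_Y = p_Y_given_X * p_X     # product rule:  P(X,Y|Z) = P(Y|X,Z) * P(X|Z)
p_X_or_Y = p_X + p_Y - p_X_and_Y  # disjunction rule: P(X or Y|Z)
p_X_given_X = 1.0                 # certainty rule: P(X|X,Z) = 1

# Bayes's theorem is just the product rule read in both directions:
# P(X|Y,Z) = P(Y|X,Z) * P(X|Z) / P(Y|Z)
p_X_given_Y = p_Y_given_X * p_X / p_Y
print(p_not_X, p_X_and_Y, p_X_or_Y, p_X_given_Y)
```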
When an agent must draw inferences about populations, having observed \(N\) units (training data) from the population, the four rules lead to the general formula
\(\mathrm{P}(Z_{N+1}=z \mid Z_{N}=z_{N}, \dotsc, Z_{1}=z_{1}, \mathsfit{I}) = \dfrac{ \mathrm{P}(Z_{N+1}=z, Z_{N}=z_{N}, \dotsc, Z_{1}=z_{1} \mid \mathsfit{I}) }{ \sum_z \mathrm{P}(Z_{N+1}=z, Z_{N}=z_{N}, \dotsc, Z_{1}=z_{1} \mid \mathsfit{I}) }\)
and some slight variations of it.
The probability distribution \(\mathrm{P}(Z_{N+1}=z, \dotsc, Z_{1}=z_{1} \mid \mathsfit{I})\) must be built into the agent.
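As a rough illustration of how this formula could be evaluated, here is a minimal Python sketch; the value set, the number of observed units, and the joint distribution `p_joint` are hypothetical stand-ins for whatever distribution is actually built into the agent:

```python
# A minimal sketch of the general predictive formula, assuming units
# take values in {0, 1}, N = 2 observed units, and a hypothetical
# built-in joint distribution over (Z_1, Z_2, Z_3). The numbers are
# made up for illustration; a real agent would have P(... | I) built in.

values = (0, 1)

# Built-in joint belief P(Z_1=z1, Z_2=z2, Z_3=z3 | I), normalized.
p_joint = {
    (0, 0, 0): 0.20, (0, 0, 1): 0.10,
    (0, 1, 0): 0.10, (0, 1, 1): 0.15,
    (1, 0, 0): 0.10, (1, 0, 1): 0.15,
    (1, 1, 0): 0.05, (1, 1, 1): 0.15,
}
assert abs(sum(p_joint.values()) - 1.0) < 1e-12

def predictive(z, observed):
    """P(Z_{N+1}=z | Z_N=z_N, ..., Z_1=z_1, I): the joint probability
    of (observed data, z), divided by its sum over all candidate z."""
    numer = p_joint[tuple(observed) + (z,)]
    denom = sum(p_joint[tuple(observed) + (c,)] for c in values)
    return numer / denom

print(predictive(1, observed=[0, 1]))  # 0.15 / (0.10 + 0.15) = 0.6
```

Note how the function body implements the formula directly: look up the joint probability of the data together with each candidate value, then normalize.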
When an agent must draw inferences about an approximately infinite population, having observed \(N\) units (training data) from the population, and the agent has exchangeable beliefs about the population, the four rules lead to the general formula
\(\mathrm{P}(Z_{N+1}=z \mid Z_{N}=z_{N}, \dotsc, Z_{1}=z_{1}, \mathsfit{I}) = \dfrac{ \sum_{\boldsymbol{f}} f(Z{=}z) \cdot f(Z{=}z_{N}) \dotsm f(Z{=}z_{1}) \cdot \mathrm{P}(F{=}\boldsymbol{f} \mid \mathsfit{I}) }{ \sum_{\boldsymbol{f}} f(Z{=}z_{N}) \dotsm f(Z{=}z_{1}) \cdot \mathrm{P}(F{=}\boldsymbol{f} \mid \mathsfit{I}) }\)
and some slight variations of it.
The probability distribution \(\mathrm{P}(F{=}\boldsymbol{f} \mid \mathsfit{I})\) must be built into the agent.
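Analogously, here is a minimal Python sketch of the exchangeable formula; the three candidate frequency distributions and the beliefs over them are hypothetical placeholders for the agent's actual built-in distribution \(\mathrm{P}(F{=}\boldsymbol{f} \mid \mathsfit{I})\):

```python
from math import prod

# A minimal sketch of the exchangeable predictive formula, assuming
# units take values in {0, 1} and a hypothetical built-in belief
# P(F=f | I) over just three candidate frequency distributions f.
# All numbers are made up for illustration.

# Pairs (f, P(F=f | I)); each f maps a value z to the frequency f(Z=z).
candidates = [
    ({0: 0.8, 1: 0.2}, 0.3),
    ({0: 0.5, 1: 0.5}, 0.4),
    ({0: 0.2, 1: 0.8}, 0.3),
]

def predictive(z, data):
    """P(Z_{N+1}=z | Z_N=z_N, ..., Z_1=z_1, I): each candidate f is
    weighted by its prior belief times the product of the frequencies
    of the observed data, exactly as in the formula above."""
    numer = sum(f[z] * prod(f[d] for d in data) * p for f, p in candidates)
    denom = sum(prod(f[d] for d in data) * p for f, p in candidates)
    return numer / denom

print(predictive(1, data=[1, 1, 0, 1]))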
Note that all inference formulae about populations came straight from the four fundamental rules of inference: we used neither intuition nor any “models”.
In the next chapter we address the practical question of implementing the formulae above in code.