We now try to build up a real prototype AI agent from basic principles, using the formulae summarized in ch. 29 and in the previous chapter. By design, this agent is as close to optimal as theoretically possible; so let’s call it an Optimal Predictor Machine, or OPM for short.
Before starting, let’s agree on some terminology so as not to get confused in the discussion below.
We design our Optimal Predictor Machine with the following specific characteristics:
It handles variates of nominal type (§ 12.2).
It handles inferences and decisions about approximately infinite populations, and its beliefs about the population are exchangeable (ch. 25).
Its initial beliefs about the population frequencies are represented by a Dirichlet-mixture distribution (ch. 28).
Before deployment, it learns from a set of \(N\) units.
Let’s give an example of what we want our agent to be able to do. Suppose we have a population having three nominal variates \(Z = (Y \mathbin{\mkern-0.5mu,\mkern-0.5mu}X \mathbin{\mkern-0.5mu,\mkern-0.5mu}W)\) (keep in mind that \(X\), \(Y\), \(W\) could each be a joint variate). Abbreviate the set of \(N\) training data as
\(\mathsfit{\color[RGB]{34,136,51}data}\coloneqq ( Z_{N}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z_{N} \mathbin{\mkern-0.5mu,\mkern-0.5mu}\dotsb \mathbin{\mkern-0.5mu,\mkern-0.5mu} Z_{2}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z_{2} \mathbin{\mkern-0.5mu,\mkern-0.5mu} Z_{1}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z_{1} )\)
where \(z_N, \dotsc, z_2, z_1\) are specific values, stored in some training dataset. To simplify things, we assume that no values are missing. (Recall that \(Z\) denotes all (nominal) variates of the population.)
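As a concrete, purely hypothetical illustration (the variate values below are made up, not taken from any dataset used later), such a training set with \(N=5\) units could be stored in R as a data frame with one row per unit and one column per variate:

traindata <- data.frame(   # hypothetical training data: one row per unit
  Y = c('low', 'high', 'low', 'high', 'low'),
  X = c('a',   'b',    'b',   'a',    'a'),
  W = c('yes', 'yes',  'no',  'no',   'yes')
)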
We want an agent that can draw inferences like the following ones, as often as required:
\(P(X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{\color[RGB]{34,136,51}data}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\textrm{d}})\), \(P(Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{\color[RGB]{34,136,51}data}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\textrm{d}})\), etc.: inference about a predictand variate, without knowledge of any predictors.
\(P(Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso \mathbin{\mkern-0.5mu,\mkern-0.5mu}W\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{\color[RGB]{34,136,51}data}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\textrm{d}})\): same but for any two predictand variates.
\(P(Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso \mathbin{\mkern-0.5mu,\mkern-0.5mu}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso \mathbin{\mkern-0.5mu,\mkern-0.5mu}W\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{\color[RGB]{34,136,51}data}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\textrm{d}})\): same but for all three variates.
\(P(X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso\nonscript\:\vert\nonscript\:\mathopen{}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{\color[RGB]{34,136,51}data}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\textrm{d}})\): inference about any one predictand variate, given information about one predictor variate.
\(P(X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso\nonscript\:\vert\nonscript\:\mathopen{} Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso \mathbin{\mkern-0.5mu,\mkern-0.5mu}W\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{\color[RGB]{34,136,51}data}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\textrm{d}})\): same, but given information about any pair of predictors.
\(P(Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso \mathbin{\mkern-0.5mu,\mkern-0.5mu}W\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso\nonscript\:\vert\nonscript\:\mathopen{}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{\color[RGB]{34,136,51}data}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\textrm{d}})\): inference about any two predictand variates, given information about one predictor.
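Keep in mind that all of these inferences follow from the joint probability in the third example above (all three variates, no predictors): variates we are not interested in are summed out, and predictors are conditioned on by taking a ratio of such joint probabilities. For instance

\[ P(X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}x\nonscript\:\vert\nonscript\:\mathopen{}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}y \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{\color[RGB]{34,136,51}data}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\textrm{d}}) = \frac{\sum_{w} P(Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}y \mathbin{\mkern-0.5mu,\mkern-0.5mu}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}x \mathbin{\mkern-0.5mu,\mkern-0.5mu}W\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}w\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{\color[RGB]{34,136,51}data}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\textrm{d}})}{\sum_{x',\,w} P(Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}y \mathbin{\mkern-0.5mu,\mkern-0.5mu}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}x' \mathbin{\mkern-0.5mu,\mkern-0.5mu}W\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}w\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{\color[RGB]{34,136,51}data}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\textrm{d}})} \ . \]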
Note that we are not fixing beforehand which variates are predictands and which are predictors. Once the agent has learnt from the training data, we want to be able to change on the fly, at each new application, what the predictands are, and what the predictors are (if any).
Pause for a second and ponder the flexibility that we are requesting from our prototype agent! Consider that virtually all present-day machine-learning algorithms only work one way: a machine-learning algorithm designed to guess a label from some features cannot guess features from a label. Will we really manage to build an agent with the amazing versatility illustrated above?
The examples above of requested inferences show that the OPM agent must essentially use formulae (28.3)–(28.5) from § 28.3, which we repeat here:
The values of \(\operatorname{aux}(k)\) can be calculated just once, when the OPM agent is built, and stored. Subsequently the agent will draw inferences by using (28.3) or (28.4) as needed. To use those formulae, the agent needs to store the counts \(\#(x,y,w,\dotsc)\), which it found in the training data, for all combinations of values \(x,y,w,\dotsc\).
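As a rough sketch of how these counts could be obtained with base-R tools (using the hypothetical traindata data frame illustrated earlier; the actual script OPM_nominal.R may organize this differently):

counts <- table(traindata)    # multidimensional array of counts #(y, x, w) for all value combinations
counts['low', 'a', 'yes']     # count of units with Y=low, X=a, W=yes: here 2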
This kind of storage and computation could be implemented in a straightforward way if we had unlimited storage and computation precision. But in a real implementation we must face several difficulties. Here are the main difficulties and their solutions:
Overflow and underflow: with finite numerical precision, very large numbers end up being represented as infinity, and small non-zero numbers as 0. For instance this is what happens if we directly compute something like \((2^{10})! / (2^{10})!\), obviously equal to \(1\):

factorial(2^10) / factorial(2^10)
[1] NaN
One way to bypass this problem is by rewriting the formulae in ways that are mathematically equivalent but less prone to over- and under-flow. For example we can use identities like
\[ x / y = \exp\bigl(\ln x - \ln y\bigr)\ ,\quad x, y > 0 \ . \]
Now indeed it works; note that lfactorial() is log(factorial()) in R:
exp( lfactorial(2^10) - lfactorial(2^10) )
[1] 1
Another useful identity that avoids over- and under-flow, if \(\pmb{x}\) is a vector of positive numbers, is the following:
\[ \frac{\pmb{x}}{\operatorname{\texttt{sum}}(\pmb{x})} = \frac{ \operatorname{\texttt{exp}}\bigl(\operatorname{\texttt{log}}(\pmb{x}) - \operatorname{\texttt{max}}(\operatorname{\texttt{log}}(\pmb{x}))\bigr) }{\operatorname{\texttt{sum}}\bigl( \operatorname{\texttt{exp}}\bigl(\operatorname{\texttt{log}}(\pmb{x}) - \operatorname{\texttt{max}}(\operatorname{\texttt{log}}(\pmb{x}))\bigr) \bigr)} \]
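Here is a quick R illustration of why this identity matters (the numbers are made up): exponentiating very negative logarithms directly underflows to zero, whereas subtracting the maximum logarithm first gives the correct normalized result:

logx <- c(-1000, -1001, -1002)   # logarithms of three extremely small positive numbers
exp(logx) / sum(exp(logx))       # direct computation underflows: NaN NaN NaN
w <- exp(logx - max(logx))       # subtract the maximum logarithm first
w / sum(w)                       # correct result: approximately 0.665 0.245 0.090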
Limited memory: the number of possible value combinations \((x,y,w,\dotsc)\), and therefore of counts to be stored, can be astronomically large. For instance a vector of \(4^{20}\) counts cannot even be allocated:

try( x <- integer(length = 4^20) )
Error : cannot allocate vector of size 4096.0 Gb
We can bypass this problem again by using smart mathematical manipulations. In the case of formulae (28.3)–(28.5), the product over all possible values \((x,y,w,\dotsc)\) can be rewritten as a product over the distinct values taken by the counts, which usually has far fewer terms. For example, if we have \(N=10000\) datapoints, then at most \(10000\) counts can be non-zero; if, say, one count is equal to \(9000\), another is equal to \(1000\), and the remaining \(4^{20}-2\) counts are equal to \(0\), then we only need to store these few distinct count values and their multiplicities, rather than \(4^{20}\) numbers!
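As a toy illustration of this compression (with a made-up vector of counts, far smaller than \(4^{20}\)), the distinct count values and their multiplicities can be extracted for instance with R's table():

toycounts <- c(rep(0, 1e6 - 2), 9000, 1000)  # hypothetical counts: almost all equal to 0
table(toycounts)   # distinct values 0, 1000, 9000 with multiplicities 999998, 1, 1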
You see that mathematical “tricks” become very important when we must implement formulae with finite precision and limited memory. Unfortunately, getting acquainted with such tricks requires a separate course.
Note that the solutions just discussed are not approximations. Even though we use different mathematical formulae, they are still equivalent to the original ones. The internal logic of our OPM agent is therefore still fully correct. In other situations the mathematical or computational solutions above may not be enough, and then we may need to resort to approximations, as often happens with machine-learning algorithms.
Try to prove some of the mathematical identities above.
We implement the OPM agent and its inferences through three main functions. These and other helper functions are defined in the script OPM_nominal.R:
buildagent(): We can use this function to build several agents, which differ in their background information or in the data they have learned from.
We write this function with an option savememory for storing the learned information in a memory-efficient way, if needed.
infer(): The formulae must be implemented in slightly different ways, depending on whether the learned information is stored in a memory-efficient way. For this reason we actually have two implementations of this function, called infer.agent() and infer.agentcompressed().
This function can be used as often as we please, and with any agents we please.
decide()

Open the script OPM_nominal.R and locate the functions buildagent() and infer.agent(). In each, identify the lines of code that implement the various calculations discussed above. Note that alpha stands for \(2^k\), and counts stands for the array or list of counts \(\#(y,x,\dotsc)\).
Besides the three main functions above, we can write other functions that help us handle the inference task and calculate other quantities of interest:
guessmetadata()
rF()
plotFsamples1D()
mutualinfo()
In the next chapter we give some more documentation on these functions and on how to use them, and in ch. 33 we use them in a concrete task.