34  Prototype code and workflow

Published 2023-11-16

This chapter gives a concise documentation of the prototype R functions designed in chapter 33 and described in § 33.2, together with an example workflow for their use.

The functions can be found in
https://github.com/pglpm/ADA511/tree/master/code/OPM-nominal

34.1 Function documentation

Optional arguments are written in the form name=value, where value is the default. Some additional optional arguments, mainly used for testing, are omitted from this documentation.

guessmetadata(data, file=NULL)
Arguments:
  • data: either a string with the file name of a dataset in .csv format (with header line), or a dataset given as a data.table object.
  • file: a string specifying the file name of the metadata file. If no file is given and data is a file name, then file will be the same name as data but with the prefix meta_. If no file is given and data is not a string, then the metadata are output to stdout.
Output:
  • either a .csv file containing the metadata, or a data.table object printed to stdout.
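
A hypothetical usage sketch (the file name mydata.csv is a placeholder; the functions are assumed to have been sourced from the repository files linked above):

    ## Draft a metadata file from a dataset on disk;
    ## by default the output is saved as 'meta_mydata.csv':
    guessmetadata(data = 'mydata.csv')

    ## With a data.table instead, the guessed metadata go to stdout:
    ## guessmetadata(data = data.table::fread('mydata.csv'))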


buildagent(metadata, data=NULL, kmi=0, kma=20)
Arguments:
  • metadata: either a string with the name of a metadata file in .csv format, or metadata given as a data.table.
  • data: either a string with the file name of a training dataset in .csv format (with header line), or a training dataset given as a data.table.
  • kmi: the \(k_{\text{mi}}\) parameter of formula (33.1).
  • kma: the \(k_{\text{ma}}\) parameter of formula (33.1).
Output:
  • an object of class agent, consisting of a list containing an array counts and three vectors alphas, auxalphas, palphas.
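
A hypothetical usage sketch, continuing with the placeholder file names above:

    ## Build an agent from a metadata file and a training dataset:
    agent <- buildagent(metadata = 'meta_mydata.csv', data = 'mydata.csv')

    ## Inspect its components (counts, alphas, auxalphas, palphas):
    str(agent)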


infer(agent, predictand=NULL, predictor=NULL)
Arguments:
  • agent: an agent object.
  • predictand: a vector of strings with the names of variates.
  • predictor: either a list of elements of the form variate=value, or a corresponding one-row data.table.
Output:
  • the joint probability distribution \(\mathrm{P}(\mathit{predictand} \mid \mathit{predictor}=\mathit{values}, \mathsf{data}, \mathsf{I}_{\textrm{d}})\) for all possible values of the predictands.
Notes:
  • If predictor is present, the agent acts as a “supervised-learning” algorithm; otherwise it acts as an “unsupervised-learning” algorithm. The obtained probabilities could be used to generate new units similar to the ones observed.
  • If predictand is missing, the predictands are taken to be all variates not listed among the predictors (hence all variates, if no predictors are given).
  • The variate names in the predictand and predictor inputs must match some variate names known to the agent. Unknown variate names are discarded. The function gives an error if predictand and predictor have variates in common.
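
A hypothetical usage sketch; the variate names A, B, C and the value 'yes' are placeholders:

    ## Joint distribution of variates A and B, given that variate C
    ## has been observed to have the value 'yes':
    probs <- infer(agent, predictand = c('A', 'B'),
                   predictor = list(C = 'yes'))

    ## With no arguments besides the agent: joint distribution
    ## of all variates known to the agent:
    alljoint <- infer(agent)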


decide(probs, utils=NULL)
Arguments:
  • probs: a probability distribution for one or more variates.
  • utils: a named matrix or array of utilities. The rows of the matrix correspond to the available decisions, the columns or remaining array dimensions correspond to the possible values of the predictand variates.
Output:
  • a list of two elements EUs and optimal:
    – EUs: a vector containing the expected utilities of all decisions, sorted from highest to lowest.
    – optimal: the decision having maximal expected utility; if several decisions are tied, one of them is selected with equal probability.
Notes:
  • If utils is missing or NULL, an identity matrix of the form \(\begin{bsmallmatrix}1&0&\dotso\\0&1&\dotso\\\dotso&\dotso&\dotso\end{bsmallmatrix}\) is assumed (which corresponds to using accuracy as the evaluation metric).
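
A hypothetical usage sketch continuing the infer() example above; the decision names and variate values are placeholders:

    ## Utility matrix: rows are the available decisions,
    ## columns the possible values of the predictand A:
    utils <- matrix(c(1, 0,
                      0, 1),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(c('choose_no', 'choose_yes'),
                                    c('no', 'yes')))

    probs <- infer(agent, predictand = 'A', predictor = list(C = 'yes'))
    result <- decide(probs = probs, utils = utils)
    result$optimal   # decision with maximal expected utility
    result$EUs       # expected utilities, sorted from highest to lowest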


( Further documentation will be added )

rF(n=1, agent, predictand=NULL, predictor=NULL)
(generate population-frequency samples)
mutualinfo(probs, A, B, base=2)
(calculate mutual information)
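
Pending the fuller documentation, a hypothetical sketch of how these functions might be called, with the same placeholder names as above; the interpretation of the A and B arguments of mutualinfo() as variate names is an assumption here:

    ## Draw 5 samples of possible population frequencies
    ## for variate A, given that C = 'yes':
    freqs <- rF(n = 5, agent = agent,
                predictand = 'A', predictor = list(C = 'yes'))

    ## Mutual information, in bits (base = 2), between variates A and B,
    ## from their joint distribution:
    probs <- infer(agent, predictand = c('A', 'B'))
    mutualinfo(probs = probs, A = 'A', B = 'B')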


34.2 Typical workflow

The workflow discussed here is just a guideline and a reminder of important steps to take when applying an optimal agent to a given task. It cannot be more than a guideline, because each data-science and engineering problem is unique: literally following some predefined, abstract workflow typically leads to sub-optimal results. Sub-optimal results may be acceptable in low-stakes problems, but are unacceptable in high-stakes ones, such as medical applications, where people’s lives can be involved.

We can roughly identify five main stages:

  1. Define the task
    In this stage we clarify what the task to be solved is – and why. Asking “why” often reveals the true needs and goals underlying the problem. If possible, the task is formalized. For example, the formal notions introduced in the parts Data I and Data II might be used: a specific statistical population is specified, with well-defined units and variates, and so on.
    We often have to come back to this initial stage, as the side arrows in the flow diagram above indicate. New findings in the data or unavoidable limitations in the algorithms used may lead us to re-examine our assumptions and facts, or to re-define our goals (and sometimes to give up!).


  2. Collect & prepare background info
    Background and metadata information, as well as auxiliary assumptions, are collected, examined, and prepared. Remember that this kind of information is required in order to make sense of the data (§ 27.4). In this stage we ask questions such as “Is our belief about the task exchangeable?”, “Can the statistical population be considered infinite?”, and similar questions that make clear which kinds of ready-made methods and approximations are acceptable. This stage also helps correct possible deficiencies in the training data used in the next stage. For instance, some possible variate values might not appear in the training data, owing to their rarity in the statistical population.

    In this stage it is especially important to specify:

    • definition of units (what counts as “unit” and can be used as training data?)
    • definition of variates and their domains
    • initial probabilities
    • possible decisions that may be required in the repeated task applications
    • utilities associated with the decisions above
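
    In practice, the metadata part of this specification can be drafted with the guessmetadata() function of § 34.1 and then revised by hand; a hypothetical sketch (file names are placeholders):

      ## Draft a metadata file from the available data:
      guessmetadata(data = 'mydata.csv', file = 'meta_mydata.csv')
      ## Then inspect and edit 'meta_mydata.csv' by hand, for instance
      ## adding variate values that are possible in the population
      ## but missing from the data because of their rarity.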


  3. Collect & prepare training data
    Units similar to the units of our future inferences, but of which we have more complete information, are collected and examined. These are the “training data”. They are used in the next stage to make the agent learn from examples. The problematic notion of similarity was discussed in § 20.2: what counts as “similar” is difficult to decide, and often we shall have to revise our decision. Sometimes no units satisfactorily similar to those of interest are available. In this case we must assess which of their informational relationships can be considered similar, which may lead us to use the agent in slightly different ways. We must also check whether training units with partially missing variates can be used by our agent or not.
How many training data?

“Data augmentation” is necessary with some machine-learning algorithms, as a way to compensate for or correct their internal background information; it does not apply to an optimal agent.

The question “how many training data should we use?” does not make sense for an optimal agent, which works according to probability theory and decision theory. The answer is simply “as many as you have available”.

If few or even no training data are available, the optimal agent will automatically absorb as much information as possible from whatever is available, and combine it with its background information to draw optimal inferences and make optimal decisions.

Giving artificial training data to the agent, just to increase the number of training data, is pointless and dangerous.

Pointless, because the agent is, in fact, automatically “simulating” artificial data internally as needed, from its background information and the real training data available. This is exactly what the belief distribution \(\mathrm{p}(F=\boldsymbol{f} \mid \mathsf{data}, \mathsf{I}_{\textrm{d}})\) is doing: remember that the agent is internally considering all possible populations of data (chapter 30).

Dangerous, because artificial data may contain incorrect information, leading the agent to sub-optimal and potentially disastrously misleading results.


  4. Prepare OPM agent
    The background information and training data (if any are available) are finally fed to the agent.


  5. Repeated application
    Inferences are drawn, and decisions made, for each new application instance. With our prototype agent, the inferences and the decisions can in principle differ from instance to instance.
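
A minimal sketch of stages 4 and 5 with the prototype functions, using the placeholder names of § 34.1 (the utility matrix utils is assumed to be defined as in the decide() example there):

    ## Stage 4: prepare the agent once, from background information
    ## (metadata) and training data:
    agent <- buildagent(metadata = 'meta_mydata.csv', data = 'mydata.csv')

    ## Stage 5: repeat for each new application instance:
    newunit <- list(C = 'yes')    # observed predictor values
    probs <- infer(agent, predictand = 'A', predictor = newunit)
    decision <- decide(probs, utils = utils)$optimal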


Every new application can be broken down into several steps:


( Remaining steps to be added soon )