11  Second connection with machine learning

Published 2023-11-16

In these first chapters we have been developing notions and methods about agents drawing inferences and making decisions, sentences expressing facts and information, and probabilities expressing uncertainty and certainty. Let’s draw some first qualitative connections between these notions and those typically used in machine learning.

A machine-learning algorithm is usually presented in textbooks as something that first “learns” from some training data, and thereafter performs some kind of task – typically yielding a response or outcome. More precisely, the training data are instances or examples of the task that the algorithm is expected to perform. These instances have a special status because their details are fully known, whereas new instances, to which the algorithm will be applied, have some uncertain elements: typically they have an ideal or optimal outcome, but this outcome is unknown beforehand. The response given by the algorithm in new instances depends on the algorithm’s internal architecture and parameters; for brevity we shall just use “architecture” to mean both.

Let’s try to rephrase this description from the point of view developed in the past chapters. A machine-learning algorithm is given known pieces of information (the training data), and then forms some kind of connection with another piece of information of a similar kind (the outcome in a new application) that was not known beforehand. The connection depends on the algorithm’s architecture.

11.1 “Learning” and “output” from the point of view of inference & decision

The remarks above reveal similarities with what an agent does when drawing an inference: it uses known pieces of information, expressed by sentences \({\color[RGB]{34,136,51}\mathsfit{D}_1}, {\color[RGB]{34,136,51}\mathsfit{D}_2}, {\color[RGB]{34,136,51}\dots}, {\color[RGB]{34,136,51}\mathsfit{D}_N}\), together with some background or built-in information \(\color[RGB]{204,187,68}\mathsfit{I}\), in order to calculate the probability of a piece of information of a similar kind, expressed by a sentence \(\color[RGB]{238,102,119}\mathsfit{D}_{N+1}\):

\[ \mathrm{P}(\color[RGB]{238,102,119}\mathsfit{D}_{N+1}\color[RGB]{0,0,0}\nonscript\:\vert\nonscript\:\mathopen{} \color[RGB]{34,136,51}\mathsfit{D}_{N} \land \dotsb \land \mathsfit{D}_2 \land \mathsfit{D}_1 \color[RGB]{0,0,0}\land {\color[RGB]{204,187,68}\mathsfit{I}}) \]

We can thus consider a first tentative correspondence:

\[ \mathrm{P}(\underbracket[0ex]{\color[RGB]{238,102,119}\mathsfit{D}_{N+1}}_{\mathclap{\color[RGB]{238,102,119}\text{outcome?}}} \nonscript\:\vert\nonscript\:\mathopen{} \color[RGB]{34,136,51}\underbracket[0ex]{\mathsfit{D}_N \land \dotsb \land \mathsfit{D}_2 \land \mathsfit{D}_1}_{\mathclap{\color[RGB]{34,136,51}\text{training data?}}} \color[RGB]{0,0,0}\land \underbracket[0ex]{\color[RGB]{204,187,68}\mathsfit{I}}_{\mathrlap{\color[RGB]{204,187,68}\uparrow\ \text{architecture?}}}) \]

This correspondence seems convincing with regard to architecture and training data: in both cases we’re speaking about the use of pre-existing or built-in information, combined with additional information.

But it is less convincing with regard to the outcome, because an agent gives the probabilities for several possible “outputs”: it doesn’t just yield one. This indicates that there must also be some decision involved among the possible outcomes.
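To make the missing decision step concrete, here is a minimal sketch in Python, with invented probabilities and utilities: the inference step reports a probability for each possible outcome, and a separate decision step, maximizing expected utility, turns those probabilities into the single output that a machine-learning algorithm typically yields.

```python
# Minimal sketch of the inference + decision split (all numbers invented).

# Inference step: P(outcome | training data, background information I)
probs = {"fault": 0.2, "no fault": 0.8}

# Utilities U(decision, outcome), also invented for illustration
utilities = {
    ("repair", "fault"): 0.0,   ("repair", "no fault"): -1.0,
    ("ignore", "fault"): -10.0, ("ignore", "no fault"): 0.0,
}

def expected_utility(decision):
    """Expected utility of a decision under the inferred probabilities."""
    return sum(p * utilities[(decision, outcome)]
               for outcome, p in probs.items())

# Decision step: the single "output" is the decision with maximal
# expected utility
decisions = ["repair", "ignore"]
print({d: expected_utility(d) for d in decisions})
# {'repair': -0.8, 'ignore': -2.0}
print(max(decisions, key=expected_utility))  # -> 'repair'
```

Note that the chosen output, “repair”, does not correspond to the most probable outcome: the utilities matter as much as the probabilities.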

We’ll return to this tentative connection later.

11.2 Why different outputs?

In the past chapters we have seen, over and over, what was claimed in the introduction to the present lecture notes: that an inference & decision problem has only one optimal solution. Once we specify the utilities and the initial probabilities of the problem, the fundamental rules of inference and the principle of maximal expected utility lead to one unique answer (unless, of course, there are several optimal answers with equal expected utilities).

Different machine-learning algorithms, trained with the same training data, often give different answers or outputs to the same problem. Where do these differences come from? From the point of view of decision theory there are three possibilities, which don’t exclude one another:

  • The initial probabilities given to the algorithms are different. Since the training data are the same, this means that the background information built into one machine-learning algorithm is different from that built into another (a toy demonstration is sketched after this list).

    It is therefore important to understand what the built-in background information and initial probabilities of different machine-learning algorithms are. The built-in assumptions of an algorithm must match those of the real problem as closely as possible, in order to avoid sub-optimal or even disastrously wrong answers and outputs.

  • The utilities built into one machine-learning algorithm are different from those built into another.

    It is therefore also important to understand what the built-in utilities of different machine-learning algorithms are. The built-in utilities must also match those of the real problem as closely as possible.

  • The calculations made by the algorithms are approximate, and different algorithms make different kinds of approximations. This means that the algorithms don’t arrive at the unique answer determined by decision theory, but at other answers, which may be close to the correct one and to one another – or not.

    It is therefore important to understand what approximations different machine-learning algorithms make in their calculations. Some approximations may be too crude for some real problems, and may again lead to sub-optimal or even disastrously wrong answers and outputs.
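Here is the toy demonstration of the first two possibilities, with invented numbers: two “algorithms” receive the same training data – say 7 successes in 10 trials of some phenomenon – but encode different background information as different Beta priors over the frequency of success. They therefore assign different predictive probabilities, and, combined with a decision threshold (a stand-in for built-in utilities), they can give different outputs.

```python
# Same training data, different built-in priors (all numbers invented).
successes, trials = 7, 10

def predictive_prob(a, b):
    """P(next trial is a success | data) under a Beta(a, b) prior
    on the frequency of success (Bernoulli model)."""
    return (successes + a) / (trials + a + b)

p_uniform = predictive_prob(1, 1)    # uniform prior:             8/12 ≈ 0.667
p_sceptic = predictive_prob(10, 10)  # prior concentrated at 1/2: 17/30 ≈ 0.567

# A decision rule with a built-in threshold: output "success" only if
# the predictive probability exceeds 0.6
for name, p in [("uniform prior", p_uniform), ("sceptical prior", p_sceptic)]:
    print(f"{name}: {p:.3f} -> {'success' if p > 0.6 else 'failure'}")
# uniform prior: 0.667 -> success
# sceptical prior: 0.567 -> failure
```

The same kind of divergence appears if we keep the prior fixed and change the threshold instead, that is, the built-in utilities.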

11.3 Data pre-processing and the data-processing inequality

“Data pre-processing” is a collective name for very different operations performed on data before they are used in some algorithm to solve a decision or inference problem. Some of these operations are often said to be “essential” or “crucial” for the solution of these problems. This statement is not completely true, and needs qualification.

We can divide pre-processing procedures into roughly three categories:

Inconsistency checks
Procedures in this category make sure that the data are what they were intended to be. For instance, if data should consist of the power outputs of several engines, but one datapoint is the physical weight of an engine, then that “datapoint” is actually no data at all for the present problem. It’s something included by mistake and should be removed. Similarly if, say, data about distances should be expressed in metres, but some turn out to be expressed in kilometres (this case overlaps with formatting procedures described below). Such procedures are necessary and useful, but they are just consistency checks and do not change the information contained in the proper data.
Important

In later chapters we shall say more about some often erroneous procedures, like “tail trimming”, that actually remove proper data and lead to sub-optimal or completely erroneous solutions.

Formatting
These procedures make sure that data are in the correct format to be inputted into the algorithm. They may also include rescaling of numerical values to avoid numerical overflow or underflow errors during computation. Such procedures are often necessary and useful, but they just change the way data are encoded, possibly including the zero and unit of measurement; they do not actually change the information contained in the data.
“Mutilation” or information-alteration
Procedures of this kind alter the content of proper data. For instance, such a procedure may replace, in a data set of temperatures, a datapoint having value 20 °C with one having value 25 °C; this is not just a simple rescaling. These procedures include “de-noising”, “de-biasing”, “de-trending”, “filtering”, “dimensionality reduction” and similar operations (often having noble-sounding names). We must state, clearly and strongly, that within Decision Theory and Probability Theory, such information-altering pre-processing is not necessary, and is in fact detrimental; this is why we call it “mutilation” here.

It is important to understand that such data pre-processing is not something one has to do in data science in general – quite the opposite: in principle you should not do it, because it is a destructive procedure, as the sketch below illustrates. Such pre-processing is done in order to correct deficiencies of the algorithms currently in use, as discussed below.
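A small sketch of the difference, with invented data: a formatting operation such as a change of units is invertible, so no information is lost; a “mutilation” such as clipping large values (“tail trimming”) is not invertible, so information is destroyed.

```python
# Formatting vs "mutilation" (data values invented for illustration).
distances_km = [1.2, 3.5, 0.9, 120.0]

# Formatting: change of units; the original values can be recovered
distances_m = [d * 1000 for d in distances_km]
recovered = [d / 1000 for d in distances_m]
assert all(abs(a - b) < 1e-9 for a, b in zip(recovered, distances_km))

# "Mutilation": clip values above 10 km; after this step it is impossible
# to tell whether the last datapoint was 120.0, 11.0, or anything else > 10
clipped = [min(d, 10.0) for d in distances_km]
print(clipped)  # [1.2, 3.5, 0.9, 10.0] -- information irreversibly lost
```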

If we build an “optimal predictor machine” that fully operates according to the four rules of inference (§ 8.4) and the maximization of expected utility, then the data fed into this machine should not be pre-processed with any information-altering procedures. The reason is that the four fundamental rules automatically take care of factors such as noise, bias, systematic errors, and redundancy in the optimal way. In § 9.3 we briefly discussed, and saw a simple example of, how redundancy is accounted for by the four rules.

If we have information about noise or other factors affecting the data, then we should include this information in the background information provided to the “optimal predictor machine”, rather than altering the data given to it. The reason, in intuitive terms, is that the machine makes its adjustments while fully exploring the data themselves, so it can more deeply “see” how to make optimal adjustments given the “inner structure” of the data. In the pre-processing phase – as the prefix “pre-” indicates – we don’t have the full picture of the data, so any adjustment risks eliminating actually useful information and is always sub-optimal.

More formally, this is the content of the data-processing inequality from information theory:

Data-processing inequality

“No clever manipulation of the data can improve the inferences that can be made from the data”
(Elements of Information Theory § 2.8)

or, from the complementary point of view:

“Data processing can only destroy information”
(Information Theory, Inference, and Learning Algorithms exercise 8.9)
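Formally: if \(X \to Y \to Z\) form a Markov chain – as happens when \(Z\) is computed from the data \(Y\) alone, which is exactly what pre-processing does – then \(\mathrm{I}(X ; Z) \le \mathrm{I}(X ; Y)\), where \(\mathrm{I}\) denotes the mutual information. Here is a toy numerical check, with an invented joint distribution: merging two data values, as a crude “de-noising” step might do, strictly decreases the mutual information between the data and the quantity \(X\) we want to infer.

```python
from math import log2
from collections import defaultdict

# Invented joint distribution p(X, Y): X is the quantity of interest,
# Y the raw data
p_xy = {(0, 0): 0.40, (0, 1): 0.05, (0, 2): 0.05,
        (1, 0): 0.05, (1, 1): 0.05, (1, 2): 0.40}

def mutual_information(joint):
    """I(A;B) in bits, for a joint distribution given as {(a, b): p}."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return sum(p * log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

# "Pre-processing": a deterministic map Z = f(Y) merging Y = 1 and Y = 2,
# so that X -> Y -> Z is a Markov chain
def f(y):
    return min(y, 1)

p_xz = defaultdict(float)
for (x, y), p in p_xy.items():
    p_xz[(x, f(y))] += p

print(f"I(X;Y) = {mutual_information(p_xy):.3f} bits")        # ≈ 0.447
print(f"I(X;Z) = {mutual_information(dict(p_xz)):.3f} bits")  # ≈ 0.397
```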

There are two main, partially connected reasons why one performs “mutilation” pre-processing of data:

  • The algorithm used is non-optimal: it is only using approximations of the four fundamental rules, and therefore cannot remove noise, bias, redundancies, and similar factors in an optimal way – or at all. In this case, pre-processing is an approximate way of correcting this deficiency of the non-optimal algorithm.

  • Full optimal processing is computationally too expensive. In this case we try to simplify the optimal calculation by doing in advance, in a cruder but faster way, some of the “cleaning” that the full calculation would otherwise spend time doing in an optimal way.