11 Second connection with machine learning
\[ \DeclarePairedDelimiters{\set}{\{}{\}} \DeclareMathOperator*{\argmax}{argmax} \]
In these first chapters we have been developing notions and methods about agents that draw inferences and make decisions, sentences expressing facts and information, and probabilities expressing uncertainty and certainty. Let’s draw some first qualitative connections between these notions and notions typically used in machine learning.
A machine-learning algorithm is usually presented in textbooks as something that first “learns” from some training data, and thereafter performs some kind of task – typically it yields a response or outcome, for example a label, of some kind. More precisely, the training data are instances or examples of the task that the algorithm is expected to perform. These instances have a special status because their details are fully known, whereas new instances, where the algorithm will be applied, have some uncertain aspects. A new instance typically has an ideal or optimal outcome, for example “choosing the correct label”, but this outcome is unknown beforehand. The response given by the algorithm in new instances depends on the algorithm’s internal architecture and parameters (for brevity we shall just use “architecture” to mean both).
Let’s try to rephrase this description from the point of view of the previous chapters. A machine-learning algorithm is given known pieces of information (the training data), and then forms some kind of connection with a new piece of information of similar kind (the outcome in a new application) that was not known beforehand. The connection depends on the algorithm’s architecture.
11.1 “Learning” and “output” from the point of view of inference & decision
The remarks above reveal similarities with what an agent does when drawing an inference: it uses known pieces of information, expressed by sentences \({\color[RGB]{34,136,51}\mathsfit{D}_1}, {\color[RGB]{34,136,51}\mathsfit{D}_2}, {\color[RGB]{34,136,51}\dots}, {\color[RGB]{34,136,51}\mathsfit{D}_N}\), together with some background or built-in information \(\color[RGB]{204,187,68}\mathsfit{I}\), in order to calculate the probability of a new piece of information of a similar kind, expressed by a sentence \(\color[RGB]{238,102,119}\mathsfit{D}_{N+1}\):
\[ \mathrm{P}(\color[RGB]{238,102,119}\mathsfit{D}_{N+1}\color[RGB]{0,0,0}\nonscript\:\vert\nonscript\:\mathopen{} \color[RGB]{34,136,51}\mathsfit{D}_{N} \land \dotsb \land \mathsfit{D}_2 \land \mathsfit{D}_1 \color[RGB]{0,0,0}\land {\color[RGB]{204,187,68}\mathsfit{I}}) \]
We can thus consider a first tentative correspondence:
\[ \mathrm{P}(\underbracket[0ex]{\color[RGB]{238,102,119}\mathsfit{D}_{N+1}}_{\mathclap{\color[RGB]{238,102,119}\text{outcome?}}} \nonscript\:\vert\nonscript\:\mathopen{} \color[RGB]{34,136,51}\underbracket[0ex]{\mathsfit{D}_N \land \dotsb \land \mathsfit{D}_2 \land \mathsfit{D}_1}_{\mathclap{\color[RGB]{34,136,51}\text{training data?}}} \color[RGB]{0,0,0}\land \underbracket[0ex]{\color[RGB]{204,187,68}\mathsfit{I}}_{\mathrlap{\color[RGB]{204,187,68}\uparrow\ \text{architecture?}}}) \]
This correspondence seems convincing for architecture and training data: in both cases we’re speaking about the use of pre-existing or built-in information, combined with additional information.
But the correspondence is less convincing with regard to the outcome. The “agents” that we have envisioned find the probabilities for several possible “outcomes” or “outputs”; they don’t yield only one output. This indicates that there must also be some decision involved among the possible outcomes.
We’ll return to this tentative connection later.
11.2 Why different outputs?
In the previous chapters we have seen, over and over, what was claimed at the beginning of these lecture notes: that an inference & decision problem has only one optimal solution. Once we specify the utilities and the initial probabilities of the problem, the fundamental rules of inference and the principle of maximal expected utility lead to one unique answer (unless, of course, there are several optimal ones with equal expected utilities).
Different machine-learning algorithms, trained with the same training data, often give different answers or outputs to the same problem. Where do these differences come from? From the point of view of decision theory there are three possibilities, which don’t exclude one another:
The initial probabilities given to the algorithms are different. Since the training data are the same, this means that the background information built into one machine-learning algorithm is different from those built into another.
It is therefore important to understand what are the built-in background information and initial probabilities of different machine-learning algorithms. The built-in assumptions of an algorithm must match those of the real problem as closely as possible, in order to avoid sub-optimal or even disastrously wrong answers and outputs.
The utilities built into one machine-learning algorithm are different from those built into another.
It is therefore also important to understand what are the built-in utilities of different machine-learning algorithms. The built-in utilities must also match those of the real problem as closely as possible.
The calculations made by the algorithms are approximate, and different algorithms use different approximations. This means that the algorithms don’t arrive at the unique answer determined by decision theory, but to some other answers which may be approximately close to the optimal one – or not!
It is therefore important to understand what are the calculation approximations made by different machine-learning algorithms. Some approximations may be too crude for some real problems, and may again lead to sub-optimal or even disastrously wrong answers and outputs.
11.3 Data pre-processing and the data-processing inequality
“Data pre-processing” is a collective name given to very different operations on data before they are used in some algorithm to solve a decision or inference problem. Some of these operations are often said to be essential for the solution of these problems. This statement in not completely true, and needs qualification.
We can divide pre-processing procedures in roughly three categories:
- Inconsistency checks
- Procedures in this category make sure that the data are what they were intended to be. For instance, if data should consist of the power outputs of several engines, but one datapoint is the physical weight of an engine, then that “datapoint” is actually no data at all for the present problem. It’s something included by mistake and should be removed. Such procedures are necessary and useful, but they are just consistency checks and do not change the information contained in the proper data.
- Formatting
- These procedures make sure that data are in the correct format to be inputted into the algorithm. They may also include rescaling of numerical values for avoiding numerical overflow or underflow errors during computation. Such procedures are often necessary and useful, but they just change the way data are encoded. They do not actually change the information contained in the data.
- “Mutilation” or information-alteration
- Procedures of this kind alter the content of data. For instance, such a procedure may replace, in a dataset of temperatures, a datapoint having value 20 °C with one having value 25 °C; this is not just a simple rescaling. Procedures of this kind include “de-noising”, “de-biasing”, “de-trending”, “filtering”, “dimensionality reduction” and similar ones (often having noble-sounding names). We must state, clearly and strongly, that within Decision Theory and Probability Theory, such information-altering pre-processing is not necessary , and is in fact detrimental. This is why we call it “mutilation” here.
It is important that you understand that such data pre-processing is not something that one, in principle, has to do in data science. Quite the opposite, in principle we should not do it, because it is a destructive procedure. Such pre-processing is done in order to correct deficiencies of the algorithms currently in use, as discussed below.
If we build an “optimal predictor machine” that fully operates according to the four rules of inference (§ 8.4) and of maximization of expected utility, then the data fed into this machine should not be pre-processed with any information-altering procedures. The reason is that the four fundamental rules automatically take care, in an optimal way, of factors such as noise, bias, systematic errors, redundancy. We briefly discussed this fact in § 9.7 and saw a simple example of how redundancy is accounted for by the four inference rules.
If we have information about noise or other factors affecting the data, then we should include this information in the background information provided to the “optimal predictor machine”, rather than altering the data given to it. The reason, in intuitive terms, is that the machine does the adjustments while fully exploring the data themselves; so it can “see” more deeply how to make optimal adjustments given the “inner structure” of the data. In the pre-processing phase – as the prefix “pre-” indicates – we don’t have the full picture about the data, so any adjustment risks to eliminate actually useful information.
More formally, this is the content of the data-processing inequality from information theory:
There are two main, partially connected reasons why one performs “mutilating” pre-processing of data:
The algorithm used is non-optimal: it’s only using approximations of the four fundamental rules, and therefore cannot remove noise, bias, redundancies, or similar factors in an optimal way, or at all. In this case, pre-processing is an approximate way of correcting the deficiency of the non-optimal algorithm.
Full optimal processing is computationally too expensive. In this case we try to simplify the optimal calculation by doing, in advance and in a cruder, faster way, some of the “cleaning” that the full calculation would otherwise spend time doing in an optimal way.