2 Framework
\[ \DeclarePairedDelimiter{\set}{\{}{\}} \DeclareMathOperator*{\argmax}{argmax} \]
2.1 What does the intro problem tell us?
Let’s approach the “accept or discard?” problem of the previous chapter in an intuitive way.
We’re jumping the gun here, because we haven’t learned the method to solve this problem yet!
First, what happens if we accept the component?
We must try to make sense of the \(10\%\) probability that the component fails within a year. For the moment let’s use an intuitive imagination trick to make sense of this. Imagine that this situation is repeated 100 times. In 10 of these repetitions the accepted electronic component is sold and fails within a year after selling. In the remaining 90 repetitions, the component is sold and works fine for at least a year. Later on we’ll approach this in a more rigorous way, where the idea of “imaginary repetitions” is not needed.
In each of the 10 imaginary repetitions in which the component fails early, the manufacturer loses \(\color[RGB]{238,102,119}11\$\). That’s a total loss of \({\color[RGB]{204,187,68}10} \cdot {\color[RGB]{238,102,119}11\$} = {\color[RGB]{238,102,119}110\$}\). In each of the 90 imaginary repetitions in which the component doesn’t fail early, the manufacturer gains \(\color[RGB]{34,136,51}1\$\). That’s a total gain of \({\color[RGB]{204,187,68}90} \cdot {\color[RGB]{34,136,51}1\$} = {\color[RGB]{34,136,51}90\$}\). So over all 100 imaginary repetitions the manufacturer gains
\[ {\color[RGB]{204,187,68}10}\cdot ({\color[RGB]{238,102,119}-11\$}) + {\color[RGB]{204,187,68}90}\cdot {\color[RGB]{34,136,51}1\$} = {\color[RGB]{238,102,119}-20\$} \]
that is, the manufacturer has not gained, but lost \(20\$\)! That’s an average of \(0.2\$\) lost per repetition.
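The arithmetic above can be sketched as a short computation. This is a minimal Python sketch using the values given in the text: a \(10\%\) failure probability, \(11\$\) lost on early failure, \(1\$\) gained otherwise.

```python
# Values from the intro problem.
p_fail = 0.10
loss_on_failure = -11.0   # $ if the sold component fails within a year
gain_on_success = 1.0     # $ if it works for at least a year

# Average gain per (imaginary) repetition when accepting the component;
# discarding always yields 0$.
expected_gain_accept = p_fail * loss_on_failure + (1 - p_fail) * gain_on_success
expected_gain_discard = 0.0

print(round(expected_gain_accept, 2))  # -0.2: accepting loses 0.2$ on average
```

The comparison `expected_gain_accept < expected_gain_discard` is exactly the “average per repetition” argument above, stripped of the imaginary repetitions.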
Now let’s examine the second choice: what happens if we discard the component instead?
In this case we don’t need to invoke imaginary repetitions, but even if we do, it’s clear that the manufacturer doesn’t gain or lose anything – that is, the “gain” is \(0\$\) – in each and all of the repetitions.
The conclusion is that if in a situation like this we accept the component, then we’ll lose \(0.2\$\) on average; whereas if we discard it, then we’ll lose (or gain) \(0\$\).
Obviously the best, or “least worst”, decision to make is to discard the component.
From the solution of the problem and from the exploring exercises, we gather some instructive points:
Is it enough if we simply know that the component is less likely to fail than not? In other words, is it enough to know that the probability of failure is less than 50%, without knowing its precise value?
Obviously not. We found that if the failure probability is \(10\%\) then it’s best to discard; but if it’s \(5\%\) then it’s best to accept. In both cases the probability of failure was less than 50%, but the decisions were different. Moreover, we found that the probability affected the amount of loss incurred by making the non-optimal decision. Therefore:
Knowledge of precise probabilities is absolutely necessary for making the best decision
Is it enough if we simply know that failure leads to a loss, and non-failure leads to a gain, without knowing the precise amounts of loss and gain?
Obviously not. In the exercise we found that if the failure cost is \(11\$\) then it’s best to discard; but if it’s \(5\$\) then it’s best to accept. It’s also best to accept if the failure cost is \(11\$\) but the non-failure gain is \(2\$\). Therefore:
Knowledge of the precise gains and losses is absolutely necessary for making the best decision
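The dependence of the best decision on both the probabilities and the amounts can be made explicit in one small function. The alternative values below (\(5\%\), \(5\$\), \(2\$\)) are the ones mentioned in the exercises discussed above.

```python
def expected_gain_accept(p_fail, loss, gain):
    """Average gain ($) per repetition if the component is accepted."""
    return p_fail * (-loss) + (1 - p_fail) * gain

def best_decision(p_fail, loss, gain):
    # Discarding always yields 0$, so accept only when accepting
    # has a positive expected gain.
    return "accept" if expected_gain_accept(p_fail, loss, gain) > 0 else "discard"

print(best_decision(0.10, 11, 1))  # "discard": the intro problem
print(best_decision(0.05, 11, 1))  # "accept": lower failure probability
print(best_decision(0.10, 5, 1))   # "accept": lower failure cost
print(best_decision(0.10, 11, 2))  # "accept": higher non-failure gain
```

Changing any one of the three inputs can flip the decision, which is precisely why all of them must be known.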
Is this kind of decision situation only relevant to assembly lines and sales?
Certainly not. We found a clinical situation that’s exactly analogous: there’s uncertainty, there are gains and losses (of lifetime rather than money), and the best decision depends on both.
2.2 Our focus: decision-making, inference, and data science
Every data-driven engineering project is unique, with its unique difficulties and problems. But there are also problems common to all engineering projects.
In the scenarios we explored above, we found an extremely important problem-pattern. There is a decision or choice to make (and “not deciding” is not an option, or it’s just another kind of choice). Making a particular decision will lead to some consequences, some leading to something desirable, others leading to something undesirable. The decision is difficult because its consequences are not known with certainty, given the information and data available in the problem. We may lack information and data about past or present details, about future events and responses, and so on. This is what we call a problem of decision-making under uncertainty or under risk1, or simply a “decision problem” for short.
1 We’ll avoid the word “risk” because it has several different technical meanings in the literature, some even contradictory.
This problem-pattern appears literally everywhere. But our explored scenarios also suggest that this problem-pattern has a sort of systematic solution method.
In this course we’re going to focus on decision problems and their systematic solution method. We’ll learn a framework and some abstract notions that allow us to frame and analyse this kind of problem, and we’ll learn a universal set of principles to solve it. This set of principles goes under the name of Decision Theory.
But what do decision-making under uncertainty and Decision Theory have to do with data and data science? The three are profoundly and tightly related on many different planes:
Data science is based on the laws of Decision Theory. These laws are similar to what the laws of physics are to a rocket engineer. Failure to account for these fundamental laws leads at best to sub-optimal solutions, at worst to disasters.
Machine-learning algorithms, in particular, are realizations or approximations of the rules of Decision Theory. This is clear, for instance, considering that a machine-learning classifier is actually choosing among possible output labels or classes.
The rules of Decision Theory are also the foundations upon which artificial-intelligence agents, which must make optimal inferences and decisions, are built.
We saw that probability values are essential to a decision problem. How do we find them? Data play an important part in their calculation. In our intro example, the failure probability must come from observations or experiments on similar electronic components.
We saw that also the values of gains and losses are essential. Data play an important part in their calculation as well.
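As a toy illustration of the last two planes, the failure probability in the intro example might be estimated as a relative frequency from records of similar components. The counts below are invented purely for illustration.

```python
# Hypothetical records of 1000 similar components:
# 1 = failed within a year, 0 = did not. Counts invented for illustration.
records = [1] * 100 + [0] * 900

# Relative-frequency estimate of the failure probability.
p_fail_estimate = sum(records) / len(records)
print(p_fail_estimate)  # 0.1
```

This estimate is then one of the inputs to the decision computation; later chapters treat the step from data to probabilities more rigorously.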
These five planes will constitute the major parts and motivations of the present course.
There are other important aspects in engineering problems, besides that of making decisions under uncertainty: for instance the discovery or the invention of new technologies and solutions. Aspects such as these can barely be planned or decided. Their drive and direction, however, rest on a striving for improvement and optimization – and the fundamental laws of decision theory tell us what’s optimal and what’s not.
Artificial intelligence is proving to be a valuable aid in these more creative aspects too. This kind of use of AI is outside the scope of the present notes. Some aspects of this creativity-assisting use, however, do fall within the domain of the present notes. A pattern-searching algorithm, for example, can be optimized by means of the method we are going to study.
2.3 Our goal: optimality, not “success”
What should we demand from a systematic method for solving decision problems?
By definition, in a decision problem under uncertainty there is generally no method to determine the decision that surely leads to the desired consequence – if such a method existed, then the problem would not have any uncertainty! Therefore, if there is a method to deal with decision problems, its goal cannot be the determination of the successful decision. This also means that a priori we cannot blame an engineer for making an unsuccessful decision in a situation of uncertainty. Then what should be the goal of such a method?
Imagine two persons, Henry and Tina, who must choose between a “heads-bet” or a “tails-bet” before a coin is tossed; the bets work as follows:
“heads-bet”: if the coin lands heads, the person wins a small amount of money; but if it lands tails, they lose a large amount of money
“tails-bet”: if the coin lands tails, the person wins a small amount of money; if it lands heads, they lose the same small amount of money
Henry chooses the heads-bet. Tina chooses the tails-bet. The coin comes down heads. So Henry wins the small amount of money, while Tina loses the same small amount.
What would we say about their decisions?
Henry’s decision was lucky, and yet irrational: he risked losing much more money than he could win. Tina’s decision was unlucky, and yet rational: she wasn’t risking losing more than she could win. Said otherwise, the two bets had the same winning prospects, but the heads-bet carried more risk of loss than the tails-bet.
We expect that any person making Henry’s decision in similar, future bets will eventually lose more money than any person making Tina’s decision.
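The text gives no concrete amounts, so as an assumption take “small” \(= 1\$\) and “large” \(= 10\$\); with a fair coin, the expected values of the two bets then come out as follows:

```python
p_heads = 0.5  # fair coin

# Stakes assumed purely for illustration: "small" = 1$, "large" = 10$.
small, large = 1.0, 10.0

# Heads-bet: win small on heads, lose large on tails.
ev_heads_bet = p_heads * small + (1 - p_heads) * (-large)
# Tails-bet: win small on tails, lose small on heads.
ev_tails_bet = (1 - p_heads) * small + p_heads * (-small)

print(ev_heads_bet)  # -4.5: Henry's bet loses money on average
print(ev_tails_bet)  #  0.0: Tina's bet breaks even on average
```

This matches the verdict above: both bets have the same chance of winning the same small amount, but the heads-bet risks a much larger loss.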
This example shows two points:
“success” is generally not a good criterion to judge a decision under uncertainty; success can be the pure outcome of luck, not of smarts
even if there is no method to determine which decision is successful, there is a method to determine which decision is rational or optimal, given the particular gains, losses, and uncertainties involved in the decision problem
We had a glimpse of this method in our introductory scenarios.
Let us emphasize, however, that we are not giving up on “success”, or trading it for “optimality”. Indeed we’ll find that Decision Theory automatically leads to the successful decision in problems where uncertainty is not present or is irrelevant. It’s a win-win, and it’s important to keep this point in mind.
We shall later witness this fact with our own eyes, and will take it up again in the discussion of some misleading techniques to evaluate machine-learning algorithms.
2.4 Decision Theory
So far we have mentioned that Decision Theory has the following features:
it tells us what’s optimal and, when possible, what’s successful
it takes into consideration decisions, consequences, costs and gains
it is able to deal with uncertainties
What other kinds of features should we demand from it, in order to be applied to as many kinds of decision problems as possible, and to be relevant for data science?
If we find an optimal decision with regard to some outcome, it may still happen that the decision can be realized in several ways that are equivalent with regard to the outcome, but inequivalent with regard to time or resources. In the assembly-line scenario, for example, the decision
discard
could be carried out by burning, recycling, and so on. We thus face a decision within a decision. In general, a decision problem may involve several decision sub-problems, in turn involving decision sub-sub-problems, and so on.

In data science, a common engineering goal is to design and build an automated or AI-based device capable of making optimal decisions in a specific kind of uncertain situation. Think for instance of an aeronautic engineer designing an autopilot system, or a software company designing an image classifier.
Decision Theory turns out to meet these two demands too, thanks to the following features:
it is susceptible to recursive, sequential, and modular application
it can be used not only for human decision-makers, but also for automated or AI devices
Decision Theory has a long history, going back to Leibniz in the 1600s and partly even to Aristotle in the 300s BCE, and appearing in its present form around 1920–1960. What’s remarkable about it is that it is not only a framework, but the framework we must use. A logico-mathematical theorem shows that any framework that does not break basic optimality and rationality criteria has to be equivalent to Decision Theory. In other words, any “alternative” framework may use different technical terminology and rewrite mathematical operations in a different way, but it boils down to the same notions and operations of Decision Theory. So if you wanted to invent and use another framework, then either (a) it would lead to some irrational or illogical consequences, or (b) it would lead to results identical to Decision Theory’s. Many frameworks that you are probably familiar with, such as optimization theory or Boolean logic, are just specific applications or particular cases of Decision Theory.
Thus we list one more important characteristic of Decision Theory:
it is normative
Normative contrasts with descriptive. The purpose of Decision Theory is not to describe, for example, how human decision-makers typically make decisions: human decision-makers typically make irrational, sub-optimal, or biased decisions, and that’s exactly what we want to avoid and improve on!