40  Evaluation practices and utilities

Published

2024-09-28

40.1 Confusion matrices and evaluation metrics

Machine-learning methodology uses a disconcerting variety of evaluation metrics to try to quantify, compare, rank the performances of one or more algorithms.

For machine-learning classifiers many of these metrics are constructed from the so-called “confusion matrix”. The basic idea behind it is to test the algorithms of interest on a test dataset (the same for all), and then count for how many units the algorithm outputs value \(\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\color[RGB]{238,102,119}y^{*}}\) when the true value of the unknown is \(\color[RGB]{238,102,119}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}y\), for all combinations of possible \({\color[RGB]{238,102,119}y^{*}}\) and \({\color[RGB]{238,102,119}y}\).

Imagine for instance a binary-classification task with classes \({\color[RGB]{238,102,119}\alpha}\) and \({\color[RGB]{238,102,119}\beta}\). It could be the electronic-component scenario we met in the first chapters, with class \({\color[RGB]{238,102,119}\alpha}\) being “the electronic component will function for at least a year”, and class \({\color[RGB]{238,102,119}\beta}\)the electronic component will fail within a year of use”.

Application of two algorithms \(\mathcal{A}\) and \(\mathcal{B}\) to the same 100 test units results in the following confusion matrices:

Table 40.1: Confusion matrix for \(\mathcal{A}\)
true \({\color[RGB]{238,102,119}\alpha}\) true \({\color[RGB]{238,102,119}\beta}\)
output \({\color[RGB]{238,102,119}\alpha}\) 27 15
output \({\color[RGB]{238,102,119}\beta}\) 23 35
Table 40.2: Confusion matrix for \(\mathcal{B}\)
true \({\color[RGB]{238,102,119}\alpha}\) true \({\color[RGB]{238,102,119}\beta}\)
output \({\color[RGB]{238,102,119}\alpha}\) 43 18
output \({\color[RGB]{238,102,119}\beta}\) 7 32

These matrices can also be normalized, dividing every entry by the total number of test units. Various aspects can be read from the confusion matrices above. Algorithm \(\mathcal{B}\), for example, seems better than \(\mathcal{A}\) at inferring class \({\color[RGB]{238,102,119}\alpha}\), but slightly worse at inferring class \({\color[RGB]{238,102,119}\beta}\).

Most evaluation metrics for classification combine the entries of a confusion matrix in a mathematical formula that yields a single number.

Such metrics and formulae can be quite opaque. Their definitions and their motivations are often arbitrary or very case-specific. It’s common to find works where several metrics are used because it’s unclear which single one should be used. And the only hope is that most or all of them will agree at least on the rankings they lead to. However,…

Majority votes are not a criterion for correctness

Here are the scores that some popular metrics assign to algorithms \(\mathcal{A}\) and \(\mathcal{B}\), based on their confusion matrices above. The better algorithm according to each metric is indicated in green bold. Almost all these evaluation metrics seem to agree that \(\mathcal{B}\) should be the best of the two:

metric \(\mathcal{A}\) \(\mathcal{B}\)
Accuracy 0.62 0.75
Precision 0.64 0.70
Balanced Accuracy 0.62 0.75
\(F_1\) measure 0.59 0.77
Matthews Correlation Coefficient 0.24 0.51
Fowlkes-Mallows index 0.59 0.78
True-positive rate (recall) 0.54 0.86
True-negative rate (specificity) 0.70 0.64

Yet, when put to actual use, it turns out that \(\mathcal{B}\) actually leads to a monetary loss of 3.5 $ per unit, whereas \(\mathcal{A}\) leads to a monetary gain of 3.5 $ per unit!

40.2 Decision theory and utilities as the basis for evaluation

How is the surprising result above possible? Decision Theory tells us how, and can even correctly select the best algorithm from the confusion matrices above.

As discussed in the preceding sections, each application to a new unit is a decision-making problem. Neither making nor evaluating a decision is possible unless we specify the utilities relevant to the problem. It turns out that the consequences in the present problem have the following monetary utilities:

\({\color[RGB]{238,102,119}\alpha}\) is true \({\color[RGB]{238,102,119}\beta}\) is true
\({\color[RGB]{238,102,119}\alpha}\) is chosen \(\color[RGB]{68,119,170}15 \$\) \(\color[RGB]{68,119,170}-335 \$\)
\({\color[RGB]{238,102,119}\beta}\) is chosen \(\color[RGB]{68,119,170}-35 \$\) \(\color[RGB]{68,119,170}165 \$\)

These utility values may be determined by the combination of sale revenue, disposal costs, warranty refunds, and similar factors.

The utility matrix above says that for every test unit of true class \({\color[RGB]{238,102,119}\alpha}\) (the unit will function for at least a year) that the algorithm classifies as \({\color[RGB]{238,102,119}\alpha}\) (and therefore sends for sale), the production company gains \(\color[RGB]{68,119,170}15 \$\); for every test unit of class \({\color[RGB]{238,102,119}\alpha}\) that the algorithm classifies as \({\color[RGB]{238,102,119}\beta}\) (and therefore discards), the production company gains \(\color[RGB]{68,119,170}-35 \$\) (so actually a loss); and so on. The total yield from each algorithm on the 100 test units can therefore be calculated by multiplying the four utilities above by the corresponding counts in the confusion matrix, and then taking the total. The average yield per unit is obtained dividing the total by the number of units, or directly using the normalized confusion matrix:

\[ \begin{aligned} \text{$\mathcal{A}$'s average yield } &= \bigl[27 \cdot {\color[RGB]{68,119,170}15 \$} + 23 \cdot ({\color[RGB]{68,119,170}-35 \$}) + 15 \cdot ({\color[RGB]{68,119,170}-335 \$}) + 35 \cdot {\color[RGB]{68,119,170}165 \$}\bigr]/100 \\[1ex] &= \boldsymbol{\color[RGB]{34,136,51}+3.5 \$} \\[3ex] \text{$\mathcal{B}$'s average yield } &= \bigl[43 \cdot {\color[RGB]{68,119,170}15 \$} + 7 \cdot ({\color[RGB]{68,119,170}-35 \$}) + 18 \cdot ({\color[RGB]{68,119,170}-335 \$}) + 32 \cdot {\color[RGB]{68,119,170}165 \$}\bigr]/100 \\[1ex] &= {\color[RGB]{170,51,119}-3.5 \$} \end{aligned} \]

This calculation tells us that \(\mathcal{A}\) is better, and also gives us an idea of the actual utilities that the two algorithms would yield.

Utilities as evaluation metric

The calculation above is exactly the one discussed in § 36.5, where we logically motivated the use of utilities as an evaluation metric. In the case of common machine-learning classifiers the logically correct evaluation metric and procedure are is thus straightforward:

  1. find the utilities relevant to the specific problem under consideration, collect them into a utility matrix \(\boldsymbol{\color[RGB]{68,119,170}U}\)
  2. apply the classifier to a test set and build the confusion matrix \(\boldsymbol{\color[RGB]{238,102,119}C}\)
  3. multiply confusion and utility matrices element-wise and take the total

What’s remarkable in this procedure is that it is not only logically well-founded, but also mathematically simple. The formula for the correct evaluation metric is just a linear combination of the confusion-matrix entries. This linearity is actually a subtle consequence of the axiom of independence discussed in § 36.6.

Note something very important:

Don’t do class balancing!

This procedure requires that the test set be a representative, unsystematic sample of the actual population of interest. In particular, the proportions of classes in the test set should reflect their frequencies in real application. No “class balancing” should be performed.

The evaluation based on decision theory automatically takes into account “class balance” and its interaction with the utilities relevant to the task.

There are problems for which the simple strategy “always choose class \(\dotso\)” is optimal, given the predictors available in the problem. So this strategy cannot be beaten by “improving” or finding a “better” algorithm: such endeavour is only a waste of time. The reason is that the maximal information that the predictors can give about the predictand is not enough to sharpen probabilities above the thresholds determined by the utilities. This maximal information is an intrinsic properties of predictors and predictands, so it cannot be improved by fiddling with algorithms. The only way for improvement is to find other predictors.

“Class balancing” does not solve this problem. It transforms the actual task into a different task, where maybe some algorithm can show improvement, simply because we have changed the population statistics and therefore the mutual information between predictors and predictands. But as soon as we get back to reality, to the actual task, the situation will be as before.

40.3 Popular evaluation metrics: the good and the bad

If an evaluation metric can be rewritten as a linear combination of the confusion-matrix entries, then it can be interpreted as arising from a set of utilities (although they might not be the ones appropriate to the problem).

This is the case, for instance, of

  • accuracy, which turns out to correspond to the utility matrix \(\begin{bsmallmatrix} 1&0\\0&1 \end{bsmallmatrix}\) or its non-binary analogues;
  • true-positive rate, which corresponds to \(\begin{bsmallmatrix} 1&0\\0&0 \end{bsmallmatrix}\)
  • true-negative rate, which corresponds to \(\begin{bsmallmatrix} 0&0\\0&1 \end{bsmallmatrix}\)

If an evaluation metric cannot be rewritten in such a linear form, then it is breaking the axioms of Decision Theory, and is therefore guaranteed to carry some form of cognitive bias. The axiom of independence is specifically broken, because non-linearities imply some functional dependence of utilities on probabilities or vice versa.

Some quite popular evaluation metrics turn out to break Decision Theory in this way:

  • precision
  • \(F_1\)-measure
  • Matthews correlation coefficient
  • Fowlkes-Mallows index
  • balanced accuracy

The area under the curve of the receiving operating characteristic (typically denoted “AUC”) is also an evaluation metric that breaks the axioms of Decision Theory, although it is not based on the confusion matrix.

This fact is actually funny, because the first papers (in the 1960s–1970s, referenced below) that discussed an evaluation method based on the receiver operating characteristic actually derived it from Decision Theory. The papers gave the correct procedure to use the receiver operating characteristic, and pointed out the “area under the curve” only as a quick but possibly erroneous heuristic procedure.

It goes without saying that you should stay away from the cognitive-biased metrics above.