20  Populations and variates

Published

2023-10-31

20.1 Collections of similar quantities: motivation

In the last few chapters we gradually narrowed our focus to a particular kind of inference: inferences that involve collections of similar quantities (each of which can be simple, joint, or complex). “Similar” means that all such quantities have the same domain – for instance, each of them has possible values \(\set{{\small\verb;urgent;}, {\small\verb;non-urgent;}}\); or each of them has possible values between \(0\,\mathrm{\textcelsius}\) and \(100\,\mathrm{\textcelsius}\) – and they have a similar meaning and measurement procedure. They can be considered different “instances” of the same quantity, so to speak. We saw an example in the three-patient hospital scenario of § 17.3, with the three “urgency” quantities \(U_1\), \(U_2\), \(U_3\), corresponding to the urgency of three consecutive patients. Here are other examples:

  • Stock exchange
    We are interested in the daily change in closing price of a stock, during 1000 days. Each day the change can be positive (or zero), or negative.
    The daily change on any day can clearly be considered as a binary quantity, say with domain \(\set{{\color[RGB]{34,136,51}{\small\verb;+;}}, {\color[RGB]{204,187,68}{\small\verb;-;}}}\). The daily changes in 1000 days are a set of 1000 binary quantities with exactly the same domain – even if each one can have a different value.


  • Mars prospecting
    Some robot examines 1000 similar-sized rocks in a large crater on Mars. Each rock either contains haematite, or it doesn’t.
    The haematite-content of any rock can be considered as a binary quantity, say with domain \(\set{{\color[RGB]{34,136,51}{\small\verb;Y;}}, {\color[RGB]{238,102,119}{\small\verb;N;}}}\). The haematite contents of the 1000 rocks are again a set of 1000 binary quantities with exactly the same domain – even if each one can have a different value.


  • Glass forensics
    A criminal forensics department has 215 glass fragments collected from many different crime scenes. Each fragment is characterized by a refractive index (between \(1\) and \(\infty\)), a percentage of Calcium (between \(0\%\) and \(100\%\)), a percentage of Silicon (ditto), and a type of origin (for example “from window of building”, “from window of car”, and similar).
    The refractive index, Calcium percentage, Silicon percentage, and type of the 215 fragments are a set of 215 joint quantities with exactly the same joint domain.

It is easy to think of many other and very diverse examples, with even more complex variates, such as images or words. We shall now try to abstract and generalize this similarity.

20.2 Units, variates, statistical populations

Consider a large collection of entities that are somehow similar to one another. We call these entities units. These units could be, for instance:

  • physical objects such as cars, windmills, planets, or rocks from a particular place;
  • creatures such as animals of a particular species, or human beings, maybe with something in common such as geographical region; or plants of a particular kind;
  • automatons having a particular application;
  • software objects such as photos;
  • abstract objects such as functions or graphs;
  • the rolls of a particular die or the tosses of a particular coin;
  • the weather conditions on several days.

These units are similar to one another in that they have some set of attributes common to all (the term features is frequently used for these in machine learning). These attributes can present themselves in a specific number of mutually-exclusive guises, which can be different from unit to unit. For instance, the attributes could be:

  • “colour”, each unit being, say, green, blue, or yellow;
  • “mass”, each unit having a mass between \(0.1\,\mathrm{kg}\) and \(10\,\mathrm{kg}\);
  • “health condition”, each unit (an animal or human in this case) being healthy or ill; or maybe being affected by one of a specific set of diseases;
  • containing something, for instance a particular chemical substance;
  • “having a label”, each unit having one of the labels A, B, C;
  • a complex combination of several simpler attributes like the ones above.

The units also have additional attributes (they must, otherwise we wouldn’t be able to distinguish each unit from all others), which we simply don’t consider or can’t measure. We’ll discuss this possibility later.

From this description it’s clear that the attributes of each unit are a (possibly joint) quantity, as defined in §  12.1.1. Once the units and their attributes are specified, we have a set of as many quantities as there are units. All these quantities have identical domains.

A quantity which is a collection of attributes of a set of units is called a variate. So when we speak about a variate it is understood that this is a quantity that appears, replicated, in some set of units.


We call a collection of units so defined a statistical population, or just population when there’s no ambiguity. The number of units is called the size of the population.

The notion of statistical population is extremely general and encompassing: many different things can be thought of as a population. In speaking of “data”, what is often meant is a particular statistical population. The specification of a population requires precision, especially when it is used to draw inferences, as we shall see later. A statistical population has not been properly specified until two things are precisely specified:

  • a way to determine whether something is a unit or not: inclusion and exclusion criteria, means of collection, and so on
  • a definition of the variate considered, its possible values, and how it is measured
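
The two requirements above can be made concrete with a minimal Python sketch, using the stock-exchange example of § 20.1. The names `Population`, `is_unit`, and `is_valid_observation`, as well as the window dates, are hypothetical and chosen only for illustration; treating every weekday as a trading day is a simplifying assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import Any, Callable

@dataclass(frozen=True)
class Population:
    """A statistical population: a unit criterion plus a variate definition."""
    is_unit: Callable[[Any], bool]  # inclusion and exclusion criteria
    variate_name: str               # what is observed on each unit
    domain: frozenset               # possible values of the variate
    measurement: str                # how the variate is measured

# Hypothetical observation window (illustrative dates, not from the text).
START, END = date(2020, 1, 1), date(2022, 12, 31)

def is_trading_day(entity: Any) -> bool:
    """A unit is a weekday inside the window (holidays are ignored here)."""
    return (isinstance(entity, date)
            and START <= entity <= END
            and entity.weekday() < 5)

stock_changes = Population(
    is_unit=is_trading_day,
    variate_name="daily change in closing price",
    domain=frozenset({"+", "-"}),
    measurement="sign of today's close minus the previous close; zero counts as '+'",
)

def is_valid_observation(pop: Population, entity: Any, value: Any) -> bool:
    """True if `entity` is a unit of `pop` and `value` lies in the variate's domain."""
    return pop.is_unit(entity) and value in pop.domain
```

With this sketch, a bare description such as "days" would fail to specify a population: it supplies neither a usable `is_unit` criterion nor a variate with a domain.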
Exercises
  • Which of the following descriptions properly define a statistical population? Explain why each one does or does not.

    1. People.

    2. Electronic components produced on a specific assembly line, from the time the line became operational until its discontinuation, and measured for their electric resistance, with possible values in \([0\,\mathrm{\Omega}, \infty\,\mathrm{\Omega}]\), and for their result on a shock test, with possible values \(\set{{\small\verb;pass;}, {\small\verb;fail;}}\).

    3. People born in Norway between 1st January 1990 and 31st December 2010.

    4. The words contained in all websites of the internet.

    5. Rocks, of volume between 1 cm³ and 1 m³, found in the Schiaparelli crater (as defined by contours on a map), and tested to contain haematite, with possible values \(\set{{\small\verb;Y;}, {\small\verb;N;}}\).

  • Browse some datasets at the UC Irvine Machine Learning repository. Each dataset is a statistical population. The variate in most of these populations is a joint variate (to be discussed soon), that is, a collection of several variates.

    Examine and discuss the specification of some of those datasets:

    • Is it well-specified what constitutes a “unit”? Are the criteria for including or excluding datapoints, their origin, and so on, well explained?
    • Are the variates well-defined? Is it explained what they mean, how they were measured, what their domain is, and so on?


Subtleties in the notion of statistical population
  • A statistical population is only a conceptual device for simplifying and approaching some decision or inference problem. There is no objectively-defined population “out there”.

    Any entity, object, person, and so on has some characteristics that make it completely unique (say, its space-time coordinates); otherwise we wouldn’t be able to distinguish it from others. From this point of view any entity is just a one-member population in itself. If we consider two or more entities as being “similar” and belonging to the same population, it’s because we have decided to disregard some characteristics and focus on some others. This decision is arbitrary, a matter of convention, and depends on the specific inference and decision problem.

    To “test” whether an entity belongs to a given population means to check whether that entity satisfies the agreed-upon definition of that population.

  • Any physical entity, object, person, etc. can represent different units in different and even non-overlapping statistical populations. For instance, a 100 cm³ rock found in the Schiaparelli crater on Mars could be a unit in these populations:

    a. Rocks, of volume between 1 cm³ and 1 m³, found in the Schiaparelli crater and tested for haematite
    b. Rocks, of volume between 10 cm³ and 200 cm³, found in the Schiaparelli crater and tested for haematite
    c. Rocks, of volume between 10 cm³ and 200 m³, found in any crater on any planet of the solar system, and tested for haematite
    d. Rocks, of volume between 1 cm³ and 1 m³, found in the Schiaparelli crater and measured for the magnitude of their magnetic field

    Populations a., b., and c. above have the same variate but differ in their definition of “unit”. Populations a. and d. have the same definition of unit but different variates. Population b. is a subset of population a.: they have the same variate, and any unit in b. is also a unit in a., but not every unit in a. is also a unit in b. Populations a. and c. overlap: they have the same variate, and some units of a. are also units of c., and vice versa.
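
The relations among the four rock populations above can be checked mechanically. The following Python sketch encodes each population as a unit criterion plus a variate name; the class names `Rock` and `Population`, and the second crater name used for the test rocks, are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass
from typing import Callable

CM3_PER_M3 = 1_000_000  # 1 m³ expressed in cm³

@dataclass(frozen=True)
class Rock:
    volume_cm3: float
    crater: str

@dataclass(frozen=True)
class Population:
    is_unit: Callable[[Rock], bool]  # definition of "unit"
    variate: str                     # what is measured on each unit

def in_schiaparelli(lo_cm3: float, hi_cm3: float) -> Callable[[Rock], bool]:
    """Unit criterion: rock in the Schiaparelli crater with volume in [lo, hi]."""
    return lambda r: r.crater == "Schiaparelli" and lo_cm3 <= r.volume_cm3 <= hi_cm3

pop_a = Population(in_schiaparelli(1, CM3_PER_M3), "haematite")
pop_b = Population(in_schiaparelli(10, 200), "haematite")
# Population c. accepts rocks from any crater.
pop_c = Population(lambda r: 10 <= r.volume_cm3 <= 200 * CM3_PER_M3, "haematite")
pop_d = Population(in_schiaparelli(1, CM3_PER_M3), "magnetic field magnitude")

rock = Rock(volume_cm3=100, crater="Schiaparelli")   # the rock from the text
small = Rock(volume_cm3=5, crater="Schiaparelli")    # in a. but not in b. or c.
elsewhere = Rock(volume_cm3=50, crater="Gale")       # in c. but not in a.

for name, pop in [("a", pop_a), ("b", pop_b), ("c", pop_c), ("d", pop_d)]:
    print(name, pop.variate, pop.is_unit(rock))
```

The same physical rock passes all four unit criteria, while `small` and `elsewhere` show that a. is not contained in b. and that a. and c. only partly overlap.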

20.3 Populations with joint variates

The definition of statistical population (§  20.2) makes it clear that the quantity associated with each unit can be of arbitrary complexity. In particular it could be a joint quantity (§  13.1), that is, a collection of quantities of a simpler type.

We saw an example at the beginning of this chapter, with a population relevant for glass forensics:

  • units: glass fragments (collected at specific locations)
  • variate: the joint variate \((\mathit{RI}, \mathit{Ca}, \mathit{Si}, \mathit{Type})\) consisting of four simple variates:
    • \(\mathit{R}\)efractive \(\mathit{I}\)ndex of the glass fragment (interval continuous variate), with domain from \(1\) (included) to \(+\infty\)
    • weight percent of \(\mathit{Ca}\)lcium in the fragment (interval discrete variate), with domain from \(0\) to \(100\) in steps of 0.01
    • weight percent of \(\mathit{Si}\)licon in the fragment (interval discrete variate), with domain from \(0\) to \(100\) in steps of 0.01
    • \(\mathit{Type}\) of glass fragment (nominal variate), with seven possible values building_windows_float_processed, building_windows_non_float_processed, vehicle_windows_float_processed, vehicle_windows_non_float_processed, containers, tableware, headlamps

Here is a table with the values of the joint variate \((\mathit{RI}, \mathit{Ca}, \mathit{Si}, \mathit{Type})\) for ten units:

Table 20.1: Glass fragments
unit \(\mathit{RI}\) \(\mathit{Ca}\) \(\mathit{Si}\) \(\mathit{Type}\)
1 \(1.51888\) \(9.95\) \(72.50\) tableware
2 \(1.51556\) \(9.41\) \(73.23\) headlamps
3 \(1.51645\) \(8.08\) \(72.65\) building_windows_non_float_processed
4 \(1.52247\) \(9.76\) \(70.26\) headlamps
5 \(1.51909\) \(8.78\) \(71.81\) building_windows_float_processed
6 \(1.51590\) \(8.22\) \(73.10\) building_windows_non_float_processed
7 \(1.51610\) \(8.32\) \(72.69\) vehicle_windows_float_processed
8 \(1.51673\) \(8.03\) \(72.53\) building_windows_non_float_processed
9 \(1.51915\) \(10.09\) \(72.69\) containers
10 \(1.51651\) \(9.76\) \(73.61\) headlamps

The joint-variate value for unit 4, for instance, is

\[ \mathit{RI}_{\color[RGB]{204,187,68}4}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}1.52247 \land \mathit{Ca}_{\color[RGB]{204,187,68}4}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}9.76 \land \mathit{Si}_{\color[RGB]{204,187,68}4}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}70.26 \land \mathit{Type}_{\color[RGB]{204,187,68}4}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;headlamps;} \]
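
As an illustration, the ten units of Table 20.1 can be stored as records of a joint variate and checked against the domains listed above. This is only a sketch: the names `Glass`, `in_domain`, and `table` are hypothetical, and the tolerance used for the 0.01 step check is an arbitrary choice to absorb floating-point rounding.

```python
from typing import NamedTuple

GLASS_TYPES = frozenset({
    "building_windows_float_processed", "building_windows_non_float_processed",
    "vehicle_windows_float_processed", "vehicle_windows_non_float_processed",
    "containers", "tableware", "headlamps",
})

class Glass(NamedTuple):
    """One value of the joint variate (RI, Ca, Si, Type)."""
    RI: float
    Ca: float
    Si: float
    Type: str

def on_grid(x: float, step: float = 0.01) -> bool:
    """True if x is a multiple of `step`, up to floating-point rounding."""
    return abs(x / step - round(x / step)) < 1e-6

def in_domain(g: Glass) -> bool:
    """Check each simple variate against its stated domain."""
    return (g.RI >= 1
            and 0 <= g.Ca <= 100 and on_grid(g.Ca)
            and 0 <= g.Si <= 100 and on_grid(g.Si)
            and g.Type in GLASS_TYPES)

# Values copied from Table 20.1, keyed by unit label.
table = {
    1: Glass(1.51888, 9.95, 72.50, "tableware"),
    2: Glass(1.51556, 9.41, 73.23, "headlamps"),
    3: Glass(1.51645, 8.08, 72.65, "building_windows_non_float_processed"),
    4: Glass(1.52247, 9.76, 70.26, "headlamps"),
    5: Glass(1.51909, 8.78, 71.81, "building_windows_float_processed"),
    6: Glass(1.51590, 8.22, 73.10, "building_windows_non_float_processed"),
    7: Glass(1.51610, 8.32, 72.69, "vehicle_windows_float_processed"),
    8: Glass(1.51673, 8.03, 72.53, "building_windows_non_float_processed"),
    9: Glass(1.51915, 10.09, 72.69, "containers"),
    10: Glass(1.51651, 9.76, 73.61, "headlamps"),
}

all_valid = all(in_domain(g) for g in table.values())
unit_4 = table[4]  # the joint-variate value for unit 4
print(all_valid, unit_4)
```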

Important
  • In the terminology we agreed upon, a variate is not a quantity. Remember that something is a quantity if it makes sense to ask “Does it have value . . . ?”. Now consider the variate \(\mathit{Ca}\) (Calcium percentage) above. I say \(\mathit{Ca}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}8.1\). Can you check if what I said is true? No, because you don’t know which unit I’m referring to; you don’t even know if \(\mathit{Ca}\) was referring to some unit.

    The variate for a specific unit, by contrast, is a quantity. We can indicate this by appending the unit label to the variate symbol, as we did with \(\mathit{Ca}_{\color[RGB]{204,187,68}4}\) above. If I tell you \(\mathit{Ca}_{\color[RGB]{204,187,68}4}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}8\), you can check in Table 20.1 that what I said is false (the recorded value is \(9.76\)); so \(\mathit{Ca}_{\color[RGB]{204,187,68}4}\) is a quantity.


  • The units’ IDs don’t need to be consecutive numbers; in fact they don’t even need to be numbers: any label that completely distinguishes all units will do.
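
Both points can be sketched in a few lines of Python. The unit labels below are hypothetical, non-numeric stand-ins for units 1 to 4 of Table 20.1, and the function name `statement_is_true` is likewise an illustrative choice.

```python
# Calcium percentages of units 1-4 of Table 20.1, keyed by arbitrary labels.
# Any labels work, as long as they distinguish all units (dict keys are unique).
Ca = {"shard-red": 9.95, "shard-blue": 9.41, "K7": 8.08, "unit-four": 9.76}

def statement_is_true(variate: dict, unit: str, claimed: float) -> bool:
    """Evaluate the statement 'variate_unit = claimed'; a unit is required."""
    return variate[unit] == claimed

# With a unit, the statement has a definite truth value: Ca_4 is a quantity.
ca4_equals_8 = statement_is_true(Ca, "unit-four", 8)
ca4_equals_976 = statement_is_true(Ca, "unit-four", 9.76)

# Without a unit, 'Ca = 8.1' cannot even be evaluated.
try:
    statement_is_true(Ca, claimed=8.1)
    checkable_without_unit = True
except TypeError:
    checkable_without_unit = False

print(ca4_equals_8, ca4_equals_976, checkable_without_unit)
```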
Exercises

2 This is an adapted version of the UCI “adult-income” dataset