Skip to contents

A crash course in population inference

Population inferences

A very common type of inference is the following. We have a large or potentially infinite group of entities, each having a set of characteristics. Having checked the characteristics of several entities from this group, we want to make guesses about the characteristics of new entities from the same group.

The description just given is very generic or even vague. Indeed it can be applied to a huge variety of situations.

  • The ‘entities’ could be: objects, like electronic components; or people, or animals, or plants, or events of some kind.
  • The characteristics could be: the material the object is made of, and the result of some test made on it; or the age, blood-test results, and disease condition of the person; or the species and body mass of the animal; or the length of the petals of the flower.
  • The group could be: electronic components coming out of a particular production line; or people of a given nationality and with given symptoms; or animals from a particular island; or flowers of a specific genus or family.

The possibilities are endless.

Also the kinds of guesses that we want to make can be very diverse. We might want to guess all characteristics of a new entity; or, having checked some characteristics of a new entity, we want to guess the ones that we haven’t or can’t check. Important examples of this kind of inference appear in medicine. For example we may have a group defined as follows:

  • Group: people from a given nationality, suffering from one of two possible diseases.

And we may consider these characteristics:

  • Characteristics: age; sex; weight; presence or absence of a particular genetic factor; symptoms from a specific set of possible ones; results from clinical tests taken at different times; kind of disease.

We observed as many as possible of these characteristics in a sample of people from this group. Now a new person from the same group appears in front of us. We check this person’s age, sex, weight, symptoms. We need to guess which of the two diseases affects this person.


The kind of inference summarily described above has one important aspect. Suppose you have collected a sample from your group of interest, and you want to use this sample to make guesses about new entities from the same group. If someone exchanged one entity in your sample with another one unsystematically chosen from the group, then you wouldn’t protest. After all, you still have the same number of samples from the same group. This aspect is called exchangeability: we say that this kind of inference is exchangeable. There are kinds of inference for which this aspect is not true. For example, suppose you’re given some stock data from four consecutive days, and you want to make guesses about the next day. If someone replaced any of your four datapoints with a datapoint an unsystematically chosen other day of the year, then you would protest: the time order of the datapoints matter. This is an example of non-exchangeable inference.

The inferences for which inferno is designed are exchangeable ones. We shall also call them population inferences.


Having discussed these simple examples, let’s agree on some more standard terminology. Let’s call:

  • Population: what we’ve called ‘group’.
  • Unit: what we’ve called ‘entity’ – the object, person, animal, etc.
  • Variate: what we’ve called ‘characteristic’.


Probability and population frequencies

We have been speaking about “making guesses”; but what does this mean?

When a unit is chosen unsystematically from a population, it’s only in rare situations that we may be sure about the variates of this unit before checking them. Suppose for instance that the population of interest is ‘all adults from a given country’; and the variates of interest are sex, weight, height. If a person is chosen from this population, we can’t be sure beforehand of how tall this person will be. We can exclude values like 4 m or 20 cm, but we’ll be uncertain about many other possible values. Even if someone tells us the sex and weight of this person, we’ll still be in doubt, but maybe we can consider some values more probable than others.

That’s the keyword: probability. Although we are unsure about a variate of a unit, we can still find some values more probable than others: we may consider it more probable that the person is 160 cm tall than 180 cm tall; or in a clinical inference we may consider it more probable that the patient has a particular virus than not.

Probability is our degree of belief about the value of the unknown variate.

Note that probability is not a physical property of the unit. For instance, the person in question may be exactly 158 cm tall; the probability of being 180 cm tall is not something we can “measure” from the person. Also, if we found out some other variate of the person – say we knew the weight, and now we know also the sex – then the probability we assign to 180 cm may increase or decrease; but the person is exactly the same as before. Probability is not a property of the population either. For instance, for one person from a population we may think 150 cm to be the most probable height, but for another person from the same population we may think 180 cm to be the most probable height instead.

Probability expresses the information we have about the variate of a unit. This is why another researcher may have a different probability about the same unit: because they may have different information about that unit.


There is a situation in which we would all agree about the probability about a variate of a unit: when we know the frequencies for that variate in the population. Suppose for instance that the population of interest is that of penguins who lived in particular locations in some particular years (penguins from other locations or other times are not part of this population). The variates are the penguins’ species and the island they lived in.

You’re told that, as a matter of fact, 43.7% of penguins from the whole population are of species Adelie, 20.0% of species Chinstrap, and 36.3% of species Gentoo. Now a penguin from that population is brought in front of you, but you can’t see any of its characteristics. What’s your degree of belief that this penguin is of species Adelie, or Chinstrap, or Gentoo? We’d all agree, given the frequency information about this population, to assign the probabilities of 43.7%, 20.0%, and 36.3% to the three possibilities, for this penguin. We write this as follows: Pr(𝚜𝚙𝚎𝚌𝚒𝚎𝚜=𝐴𝑑𝑒𝑙𝑖𝑒𝗉𝗈𝗉𝗎𝗅𝖺𝗍𝗂𝗈𝗇 𝖿𝗋𝖾𝗊𝗎𝖾𝗇𝖼𝗒)=0.437Pr(𝚜𝚙𝚎𝚌𝚒𝚎𝚜=𝐶h𝑖𝑛𝑠𝑡𝑟𝑎𝑝𝗉𝗈𝗉𝗎𝗅𝖺𝗍𝗂𝗈𝗇 𝖿𝗋𝖾𝗊𝗎𝖾𝗇𝖼𝗒)=0.200Pr(𝚜𝚙𝚎𝚌𝚒𝚎𝚜=𝐺𝑒𝑛𝑡𝑜𝑜𝗉𝗈𝗉𝗎𝗅𝖺𝗍𝗂𝗈𝗇 𝖿𝗋𝖾𝗊𝗎𝖾𝗇𝖼𝗒)=0.363 \begin{aligned} &\mathrm{Pr}(\texttt{species}\!=\!\mathit{Adelie} \mid \textsf{population frequency}) = 0.437 \\ &\mathrm{Pr}(\texttt{species}\!=\!\mathit{Chinstrap} \mid \textsf{population frequency}) = 0.200 \\ &\mathrm{Pr}(\texttt{species}\!=\!\mathit{Gentoo} \mid \textsf{population frequency}) = 0.363 \end{aligned} On the left side of the ‘\mid’ bar we write what we’re guessing or are unsure about. On the right side, we write the information that led to our probability assignment; in this case, the information about the full population. When the information behind a probability is understood, the ‘\mid’ bar and the right side are usually omitted.

In this case, the highest probability is for the Adelie species, but the other two possibilities cannot be excluded. The numerical values of these probabilities are extremely important, because they determine any kind of decision we may have to make about our unit. This is especially true in medical decision-making, where probabilities, combined with utilities, determine which is the best choice that a clinician can make. Medical and clinical decision-making, and the role of probabilities in them, are discussed for instance in the texts by Sox & al.: Medical Decision Making (2024), by Hunink & al.: Decision Making in Health and Medicine (2014), or by Weinstein & Fineberg Clinical Decision Analysis (1980).

The example with the penguin population can be analysed further. The penguins in this population come from three possible islands: Biscoe, Dream, and Torgersen. Therefore a penguin can be of one of three species and from one of three islands, for a total of 3 × 3 = 9 possibilities. You’re told that, as a matter of fact, the frequencies of these nine combinations in the whole population are as follows:

Adelie Chinstrap Gentoo
Biscoe 12.6% 1.8% 34.0%
Dream 17.4% 16.5% 1.4%
Torgersen 13.8% 1.6% 0.9%

Then we’d all agree that the probability that the penguin brought to us is of species Gentoo and from island Biscoe would be 34.0%: Pr(𝚜𝚙𝚎𝚌𝚒𝚎𝚜=𝐺𝑒𝑛𝑡𝑜𝑜,𝚒𝚜𝚕𝚊𝚗𝚍=𝐵𝑖𝑠𝑐𝑜𝑒𝗉𝗈𝗉𝗎𝗅𝖺𝗍𝗂𝗈𝗇 𝖿𝗋𝖾𝗊𝗎𝖾𝗇𝖼𝗒)=0.340 \mathrm{Pr}( \texttt{species}\!=\!\mathit{Gentoo} , \texttt{island}\!=\!\mathit{Biscoe} \mid \textsf{population frequency}) = 0.340


But there’s even more interesting information in the population frequencies above. Let’s focus on Biscoe island. With a quick sum we see that 48.4% of the whole population comes from Biscoe island (thus we’d assign a probability of 0.484 that our penguin comes from that island). Dividing the frequencies above, row-wise, by the frequencies of the respective islands (the sums of each row), we can find the frequency of each species for each particular island:

Adelie Chinstrap Gentoo
from Biscoe 26.03% 3.72% 70.25%
from Dream 49.29% 46.74% 3.97%
from Torgersen 84.66% 9.82% 5.52%

We call these conditional frequencies, and we call the group of penguins that come from island Biscoea subpopulation of the whole population. The table above reports the frequencies of the three species in each subpopulation.

Thus we also know, for instance, that 70.25% of penguins in the subpopulation from island Biscoe are of species Gentoo. This species is the majority in that subpopulation – contrast this with the majority in the whole population, which we saw was Adelie. Indeed, your most probable guess about the species of the penguin in front of you was Adelie, with 0.437 probability.

But suppose now someone tells you that this penguin comes from island Biscoe (so this variate is now known to you). Given this new piece of information, which probabilities do you assign to the species of this penguin? Obviously 0.2603 for Adelie, 0.0372 for Chinstrap, and 0.7025 for Gentoo. We write this as follows: Pr(𝚜𝚙𝚎𝚌𝚒𝚎𝚜=𝐴𝑑𝑒𝑙𝑖𝑒𝚒𝚜𝚕𝚊𝚗𝚍=𝐵𝑖𝑠𝑐𝑜𝑒,𝗉𝗈𝗉𝗎𝗅𝖺𝗍𝗂𝗈𝗇 𝖿𝗋𝖾𝗊𝗎𝖾𝗇𝖼𝗒)=0.2603Pr(𝚜𝚙𝚎𝚌𝚒𝚎𝚜=𝐶h𝑖𝑛𝑠𝑡𝑟𝑎𝑝𝚒𝚜𝚕𝚊𝚗𝚍=𝐵𝑖𝑠𝑐𝑜𝑒,𝗉𝗈𝗉𝗎𝗅𝖺𝗍𝗂𝗈𝗇 𝖿𝗋𝖾𝗊𝗎𝖾𝗇𝖼𝗒)=0.0372Pr(𝚜𝚙𝚎𝚌𝚒𝚎𝚜=𝐺𝑒𝑛𝑡𝑜𝑜𝚒𝚜𝚕𝚊𝚗𝚍=𝐵𝑖𝑠𝑐𝑜𝑒,𝗉𝗈𝗉𝗎𝗅𝖺𝗍𝗂𝗈𝗇 𝖿𝗋𝖾𝗊𝗎𝖾𝗇𝖼𝗒)=0.7025 \begin{aligned} &\mathrm{Pr}(\texttt{species}\!=\!\mathit{Adelie} \mid \texttt{island}\!=\!\mathit{Biscoe} , \textsf{population frequency}) = 0.2603 \\ &\mathrm{Pr}(\texttt{species}\!=\!\mathit{Chinstrap} \mid \texttt{island}\!=\!\mathit{Biscoe} , \textsf{population frequency}) = 0.0372 \\ &\mathrm{Pr}(\texttt{species}\!=\!\mathit{Gentoo} \mid \texttt{island}\!=\!\mathit{Biscoe} , \textsf{population frequency}) = 0.7025 \end{aligned} The right side of the ‘\mid’ bar now reports the extra information that the penguin comes from island Biscoe.

Learning about the penguin’s island not only made you change the highest probability assignment from Aelie to Gentoo, but it also increased the value of the highest probability. Without knowing the penguin’s island, Adelie had slightly less than 50% probability; it was as likely as not. After learning the island variate, Gentoo gets more than 70% probability.

This is exactly the kind of learning situation that happens in medicine and clinical inferences. A clinician searches for symptoms because they may change and increase the probabilities of different diseases or health conditions. The reason why clinicians research particular subgroups – particular demographics, or genetic factors, etc. – is that within these subgroups the probability that a medical condition exist or will occur can be drastically higher. In turn, this leads the clinicians to understand better what can be the biological relationships between the health condition and those factors.

Population samples and uncertainty

Inferences are therefore quite clear if we know the frequencies of the variates in a whole population. Our main problem is that we usually don’t know those population frequencies. For example, do you know the exact number of people, among those born in your country at any time and are alive today, who are exactly between 165 cm and 170 cm tall today? And it’s often impractical or impossible to measure those frequencies.

(TO BE CONTINUED)




(OLD TEXT)

We store the name of the datafile

dataset <- penguins

Here are the values for two subjects:

species island bill_len bill_dep flipper_len body_mass sex year
225 Gentoo Biscoe 48.2 15.6 221 5100 male 2008
127 Adelie Torgersen 38.8 17.6 191 3275 female 2009
315 Chinstrap Dream 46.9 16.6 192 2700 female 2008
189 Gentoo Biscoe 42.6 13.7 213 4950 female 2008
271 Gentoo Biscoe 47.2 13.7 214 4925 female 2009

“Natural” vs controlled or “biased” variates

Metadata

Let’s load the package:

## metadatatemplate(data = datafile, file = 'temp_metadata')
metadatafile <- 'meta_toydata.csv'

The metadata are as follows; NA indicate empty fields:

name type datastep domainmin domainmax minincluded maxincluded V1 V2
TreatmentGroup nominal NR Placebo
Sex nominal Female Male
Age continuous 1 0
Anamnestic.Loss.of.smell nominal No Yes
History.of.REM.Sleep.Behaviour.Disorder nominal No Yes
MDS.UPDRS..Subsection.III.V1 ordinal 1 0 132 yes yes
diff.MDS.UPRS.III ordinal 1 -132 132 yes yes

(More to be written here!)

Learning

Pr(Y=y|X=x,data) \mathrm{Pr}(Y = y \:\vert\: X = x, \mathrm{data}) As the notation above indicates, these probabilities also depend on the data\mathrm{data} already observed. They are usually called “posterior probabilities”.

We need to prepare the software to perform calculations of posterior probabilities given the observed data. In machine learning an analogous process is called “learning”. For this reason the function that achieves this is called learn(). It requires at least three arguments:

  • data, which can be given as a path to the csv file containing the data
  • metadata, which can also be given as a path to the metadata file
  • outputdir: the name of the directory where the output should be saved.

It may be useful to specify two more arguments:

  • seed: the seed for the random-number generator, to ensure reproducibility
  • parallel: the number of CPUs to use for the computation

Alternatively you can set the seed with set.seed(), and start a set of parallel workers with the parallel::makeCluster() and doParallel::registerDoParallel() commands.

The “learning” computation can take tens of minutes, or hours, or even days depending on the number of variates and data in your inference problem. The learn() function outputs various messages on how the computation is proceeding. As an example:

learnt <- learn(
    data = datafile,
    metadata = metadatafile,
    outputdir = 'parkinson_computations',
    seed = 16,
    parallel = 12)

#> Registered doParallelSNOW with 12 workers
#>
#> Using 30 datapoints
#> Calculating auxiliary metadata
#>
#> **************************
#> Saving output in directory
#> parkinson_computations
#> **************************
#> Starting Monte Carlo sampling of 3600 samples by 60 chains
#> in a space of 703 (effectively 6657) dimensions.
#> Using 12 cores: 60 samples per chain, 5 chains per core.
#> Core logs are being saved in individual files.
#>
#> C-compiling samplers appropriate to the variates (package Nimble)
#> this can take tens of minutes with many data or variates.
#> Please wait...

Drawing inferences

(TO BE CONTINUED)