A crash course in population inference
Population inferences
A very common type of inference is the following. We have a large or potentially infinite group of entities, each having a set of characteristics. Having checked the characteristics of several entities from this group, we want to make guesses about the characteristics of new entities from the same group.
The description just given is very generic or even vague. Indeed it can be applied to a huge variety of situations.
- The ‘entities’ could be: objects, like electronic components; or people, or animals, or plants, or events of some kind.
- The characteristics could be: the material the object is made of, and the result of some test made on it; or the age, blood-test results, and disease condition of the person; or the species and body mass of the animal; or the length of the petals of the flower.
- The group could be: electronic components coming out of a particular production line; or people of a given nationality and with given symptoms; or animals from a particular island; or flowers of a specific genus or family.
The possibilities are endless.
Also the kinds of guesses that we want to make can be very diverse. We might want to guess all characteristics of a new entity; or, having checked some characteristics of a new entity, we want to guess the ones that we haven’t or can’t check. Important examples of this kind of inference appear in medicine. For example we may have a group defined as follows:
- Group: people from a given nationality, suffering from one of two possible diseases.
And we may consider these characteristics:
- Characteristics: age; sex; weight; presence or absence of a particular genetic factor; symptoms from a specific set of possible ones; results from clinical tests taken at different times; kind of disease.
We observed as many as possible of these characteristics in a sample of people from this group. Now a new person from the same group appears in front of us. We check this person’s age, sex, weight, symptoms. We need to guess which of the two diseases affects this person.
The kind of inference summarily described above has one important
aspect. Suppose you have collected a sample from your group of interest,
and you want to use this sample to make guesses about new entities from
the same group. If someone exchanged one entity in your sample with
another one unsystematically chosen from the group, then you
wouldn’t protest. After all, you still have the same number of samples
from the same group. This aspect is called
exchangeability: we say that this kind of inference is
exchangeable. There are kinds of inference for which this aspect is not
true. For example, suppose you’re given some stock data from four
consecutive days, and you want to make guesses about the next day. If
someone replaced any of your four datapoints with a datapoint an
unsystematically chosen other day of the year, then you would protest:
the time order of the datapoints matter. This is an example of
non-exchangeable inference.
The inferences for which inferno is designed are exchangeable ones. We shall also call them population inferences.
Having discussed these simple examples, let’s agree on some more
standard terminology. Let’s call:
- Population: what we’ve called ‘group’.
- Unit: what we’ve called ‘entity’ – the object, person, animal, etc.
- Variate: what we’ve called ‘characteristic’.
Probability and population frequencies
We have been speaking about “making guesses”; but what does this mean?
When a unit is chosen unsystematically from a population, it’s only
in rare situations that we may be sure about the variates of this unit
before checking them. Suppose for instance that the population of
interest is ‘all adults from a given country’; and the variates of
interest are sex
, weight
, height
.
If a person is chosen from this population, we can’t be sure beforehand
of how tall this person will be. We can exclude values like 4 m or
20 cm, but we’ll be uncertain about many other possible values. Even if
someone tells us the sex and weight of this person, we’ll still be in
doubt, but maybe we can consider some values more probable than
others.
That’s the keyword: probability. Although we are unsure about a variate of a unit, we can still find some values more probable than others: we may consider it more probable that the person is 160 cm tall than 180 cm tall; or in a clinical inference we may consider it more probable that the patient has a particular virus than not.
Probability is our degree of belief about the value of the unknown variate.
Note that probability is not a physical property of the unit. For instance, the person in question may be exactly 158 cm tall; the probability of being 180 cm tall is not something we can “measure” from the person. Also, if we found out some other variate of the person – say we knew the weight, and now we know also the sex – then the probability we assign to 180 cm may increase or decrease; but the person is exactly the same as before. Probability is not a property of the population either. For instance, for one person from a population we may think 150 cm to be the most probable height, but for another person from the same population we may think 180 cm to be the most probable height instead.
Probability expresses the information we have about the variate of a unit. This is why another researcher may have a different probability about the same unit: because they may have different information about that unit.
There is a situation in which we would all agree about the probability
about a variate of a unit: when we know the frequencies
for that variate in the population. Suppose for instance that the
population of interest is that of penguins who lived in particular
locations in some particular years (penguins from other locations or
other times are not part of this population). The variates are the
penguins’ species
and the island
they lived
in.
You’re told that, as a matter of fact, 43.7% of penguins from the whole population are of species Adelie, 20.0% of species Chinstrap, and 36.3% of species Gentoo. Now a penguin from that population is brought in front of you, but you can’t see any of its characteristics. What’s your degree of belief that this penguin is of species Adelie, or Chinstrap, or Gentoo? We’d all agree, given the frequency information about this population, to assign the probabilities of 43.7%, 20.0%, and 36.3% to the three possibilities, for this penguin. We write this as follows: On the left side of the ‘’ bar we write what we’re guessing or are unsure about. On the right side, we write the information that led to our probability assignment; in this case, the information about the full population. When the information behind a probability is understood, the ‘’ bar and the right side are usually omitted.
In this case, the highest probability is for the Adelie species, but the other two possibilities cannot be excluded. The numerical values of these probabilities are extremely important, because they determine any kind of decision we may have to make about our unit. This is especially true in medical decision-making, where probabilities, combined with utilities, determine which is the best choice that a clinician can make. Medical and clinical decision-making, and the role of probabilities in them, are discussed for instance in the texts by Sox & al.: Medical Decision Making (2024), by Hunink & al.: Decision Making in Health and Medicine (2014), or by Weinstein & Fineberg Clinical Decision Analysis (1980).
The example with the penguin population can be analysed further. The
penguins in this population come from three possible
island
s: Biscoe, Dream, and
Torgersen. Therefore a penguin can be of one of three species
and from one of three islands, for a total of 3 × 3 = 9 possibilities.
You’re told that, as a matter of fact, the frequencies of these nine
combinations in the whole population are as follows:
Adelie | Chinstrap | Gentoo | |
Biscoe | 12.6% | 1.8% | 34.0% |
Dream | 17.4% | 16.5% | 1.4% |
Torgersen | 13.8% | 1.6% | 0.9% |
Then we’d all agree that the probability that the penguin brought to
us is of species
Gentoo and from
island
Biscoe would be 34.0%:
But there’s even more interesting information in the population frequencies above. Let’s focus on Biscoe island. With a quick sum we see that 48.4% of the whole population comes from Biscoe island (thus we’d assign a probability of 0.484 that our penguin comes from that island). Dividing the frequencies above, row-wise, by the frequencies of the respective islands (the sums of each row), we can find the frequency of each species for each particular island:
Adelie | Chinstrap | Gentoo | |
from Biscoe | 26.03% | 3.72% | 70.25% |
from Dream | 49.29% | 46.74% | 3.97% |
from Torgersen | 84.66% | 9.82% | 5.52% |
We call these conditional frequencies, and we call
the group of penguins that come from island
Biscoea subpopulation of the whole population.
The table above reports the frequencies of the three
species
in each subpopulation.
Thus we also know, for instance, that 70.25% of penguins in the
subpopulation from island
Biscoe are of
species
Gentoo. This species is the majority in
that subpopulation – contrast this with the majority in the whole
population, which we saw was Adelie. Indeed, your most probable
guess about the species
of the penguin in front of you was
Adelie, with 0.437 probability.
But suppose now someone tells you that this penguin comes from
island
Biscoe (so this variate is now known to
you). Given this new piece of information, which probabilities do you
assign to the species
of this penguin? Obviously 0.2603 for
Adelie, 0.0372 for Chinstrap, and 0.7025 for
Gentoo. We write this as follows:
The right side of the
‘’
bar now reports the extra information that the penguin comes from
island
Biscoe.
Learning about the penguin’s island
not only made you
change the highest probability assignment from Aelie to
Gentoo, but it also increased the value of the highest
probability. Without knowing the penguin’s island
,
Adelie had slightly less than 50% probability; it was as likely
as not. After learning the island
variate, Gentoo
gets more than 70% probability.
This is exactly the kind of learning situation that happens in medicine and clinical inferences. A clinician searches for symptoms because they may change and increase the probabilities of different diseases or health conditions. The reason why clinicians research particular subgroups – particular demographics, or genetic factors, etc. – is that within these subgroups the probability that a medical condition exist or will occur can be drastically higher. In turn, this leads the clinicians to understand better what can be the biological relationships between the health condition and those factors.
Population samples and uncertainty
Inferences are therefore quite clear if we know the frequencies of the variates in a whole population. Our main problem is that we usually don’t know those population frequencies. For example, do you know the exact number of people, among those born in your country at any time and are alive today, who are exactly between 165 cm and 170 cm tall today? And it’s often impractical or impossible to measure those frequencies.
(TO BE CONTINUED)
(OLD TEXT)
We store the name of the datafile
dataset <- penguins
Here are the values for two subjects:
species | island | bill_len | bill_dep | flipper_len | body_mass | sex | year | |
---|---|---|---|---|---|---|---|---|
225 | Gentoo | Biscoe | 48.2 | 15.6 | 221 | 5100 | male | 2008 |
127 | Adelie | Torgersen | 38.8 | 17.6 | 191 | 3275 | female | 2009 |
315 | Chinstrap | Dream | 46.9 | 16.6 | 192 | 2700 | female | 2008 |
189 | Gentoo | Biscoe | 42.6 | 13.7 | 213 | 4950 | female | 2008 |
271 | Gentoo | Biscoe | 47.2 | 13.7 | 214 | 4925 | female | 2009 |
Metadata
Let’s load the package:
## metadatatemplate(data = datafile, file = 'temp_metadata')
metadatafile <- 'meta_toydata.csv'
The metadata are as follows; NA
indicate empty
fields:
name | type | datastep | domainmin | domainmax | minincluded | maxincluded | V1 | V2 |
---|---|---|---|---|---|---|---|---|
TreatmentGroup | nominal | NR | Placebo | |||||
Sex | nominal | Female | Male | |||||
Age | continuous | 1 | 0 | |||||
Anamnestic.Loss.of.smell | nominal | No | Yes | |||||
History.of.REM.Sleep.Behaviour.Disorder | nominal | No | Yes | |||||
MDS.UPDRS..Subsection.III.V1 | ordinal | 1 | 0 | 132 | yes | yes | ||
diff.MDS.UPRS.III | ordinal | 1 | -132 | 132 | yes | yes |
(More to be written here!)
Learning
As the notation above indicates, these probabilities also depend on the already observed. They are usually called “posterior probabilities”.
We need to prepare the software to perform calculations of posterior
probabilities given the observed data. In machine learning an analogous
process is called “learning”. For this reason the function that achieves
this is called learn()
. It requires at least three
arguments:
-
data
, which can be given as a path to thecsv
file containing the data -
metadata
, which can also be given as a path to the metadata file -
outputdir
: the name of the directory where the output should be saved.
It may be useful to specify two more arguments:
-
seed
: the seed for the random-number generator, to ensure reproducibility -
parallel
: the number of CPUs to use for the computation
Alternatively you can set the seed with set.seed()
, and
start a set of parallel workers with the
parallel::makeCluster()
and
doParallel::registerDoParallel()
commands.
The “learning” computation can take tens of minutes, or hours, or
even days depending on the number of variates and data in your inference
problem. The learn()
function outputs various messages on
how the computation is proceeding. As an example:
learnt <- learn(
data = datafile,
metadata = metadatafile,
outputdir = 'parkinson_computations',
seed = 16,
parallel = 12)
#> Registered doParallelSNOW with 12 workers
#>
#> Using 30 datapoints
#> Calculating auxiliary metadata
#>
#> **************************
#> Saving output in directory
#> parkinson_computations
#> **************************
#> Starting Monte Carlo sampling of 3600 samples by 60 chains
#> in a space of 703 (effectively 6657) dimensions.
#> Using 12 cores: 60 samples per chain, 5 chains per core.
#> Core logs are being saved in individual files.
#>
#> C-compiling samplers appropriate to the variates (package Nimble)
#> this can take tens of minutes with many data or variates.
#> Please wait...