
Purpose

Here is a toy example showing some kinds of results that can be obtained with Bayesian nonparametric exchangeable inference. The example draws inferences about some investigations on Parkinson’s Disease described in the paper by Brakedal et al. (2022), which generously and professionally shares its data.

The investigations in the cited paper involve 30 subjects, for each of which various data are collected. Fifteen of these subjects belong to a control group, the other 15 to the treatment group, denoted NR. This group received a nicotinamide adenine dinucleotide (NAD) replenishment therapy, via oral intake of nicotinamide riboside (NR). In this toy example we do not follow or reproduce the paper’s investigations, nor do we use all the variates observed. We limit ourselves to 7 variates for each subject. Here is a list with a brief description of each:

  • Treatment group: Binary variate with values Placebo and NR.

  • Sex: Binary variate with values Female and Male.

  • Age: Integer variate, or continuous but rounded variate.

  • Anamnestic loss of smell: Binary variate with values No and Yes.

  • History of REM sleep behaviour disorder: Binary variate with values No and Yes.

  • MDS-UPDRS-III score at the first subject visit: The MDS-UPDRS-III (Movement Disorder Society’s Unified Parkinson’s Disease Rating Scale) is one of a set of ordinal scales to assess a subject’s degree of disability and monitor it over time. Scale III ranges from 0 (no disability) to 132 (highest degree of disability). Given that operations such as sum or average are performed on these scales in the literature, the scales should be considered of an interval kind rather than simply ordinal.

  • Difference in MDS-UPDRS-III scores between second and first visit: Ordinal variate, given by the difference of the previous variate between the subject’s second and first visits. Its range is therefore from -132 to +132, negative values representing an improvement.


We store the name of the datafile:

datafile <- 'toydata.csv'

Here are the values for two subjects:

Variate                                    Subject 1   Subject 2
TreatmentGroup                             Placebo     NR
Sex                                        Male        Female
Age                                        66          65
Anamnestic.Loss.of.smell                   No          Yes
History.of.REM.Sleep.Behaviour.Disorder    No          No
MDS.UPDRS..Subsection.III.V1               44          32
diff.MDS.UPRS.III                          -2          10

“Natural” vs controlled or “biased” variates

For the purpose of drawing scientific inferences, it is extremely important to be mindful of which variates in the dataset have “natural” statistics, and which instead have statistics controlled by the observer – and are therefore “biased”. The distinction matters because the dataset statistics of the controlled variates may not be representative of the naturally occurring ones, and therefore cannot be used for inference purposes.

In the present dataset, for instance, the numbers of Placebo and NR subjects were decided by the investigators to be in a 50%/50% proportion. Obviously this is not a naturally occurring proportion; in fact, it does not even make sense to speak of naturally occurring statistics for the TreatmentGroup variate.

For the Sex variate, the dataset has a proportion of around 17% Female and 83% Male subjects. This proportion also partly stems from the selection procedure of the investigators, so we cannot infer that roughly 17% of Parkinson patients in the national population are female. It is also possible that the proportions in the Age variate depend on the selection procedure.

On the other hand, the investigators did not choose the statistics of the two MDS.UPDRS variates within the groups having particular values of the other variates. These conditional or subpopulation statistics can be taken as representative of the naturally occurring ones (see Lindley & Novick (1980) for a brilliant discussion of this point).

(add short explanation of exchangeability)

Metadata

Let’s load the package:
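library(inferno)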

In order to draw inferences, we need metadata information, that is, information about the variates that is not present in the data. Such information mainly regards the kind and domain of each variate.

As an example of why metadata is necessary, suppose you are given the data 3, 7, 9. Can you say whether the value 5 could be observed in a new instance? and what about 4.7, or -10? Knowing whether these values are at all possible, or not, greatly affects the precision of the inferences we draw. Obviously a researcher must know such basic information about the data being collected – otherwise they wouldn’t know what they’re doing.

The software requires the metadata to be specified either as a data.frame object or as a .csv file. To help you prepare this file, the software offers the function metadatatemplate(), which writes a template that you can then modify. The template is filled with tentative values; these are wild guesses from the data, so they are often wrong. The function explains the heuristics it uses to pre-fill the fields of the metadata file, and warns about guessed values that are especially dubious. You should examine and correct all fields:

metadatatemplate(data = datafile, file = 'temp_metadata')
#> 
#> Analyzing 7 variates for 30 datapoints.
#> 
#> * "TreatmentGroup" variate:
#>   -  2 different  values detected:
#>  "NR", "Placebo" 
#>   which do not seem to refer to an ordered scale.
#>   Assuming variate to be NOMINAL.
#> 
#> * "Sex" variate:
#>   -  2 different  values detected:
#>  "Female", "Male" 
#>   which do not seem to refer to an ordered scale.
#>   Assuming variate to be NOMINAL.
#> 
#> * "Age" variate:
#>   - Numeric values between 29 and 75 
#>   Assuming variate to be CONTINUOUS.
#>   - Distance between datapoints is a multiple of 1 
#>   Assuming variate to be ROUNDED.
#>   - All values are positive
#>   Assuming "domainmin" to be 0
#> 
#> * "Anamnestic.Loss.of.smell" variate:
#>   -  2 different  values detected:
#>  "No", "Yes" 
#>   which do not seem to refer to an ordered scale.
#>   Assuming variate to be NOMINAL.
#> 
#> * "History.of.REM.Sleep.Behaviour.Disorder" variate:
#>   -  2 different  values detected:
#>  "No", "Yes" 
#>   which do not seem to refer to an ordered scale.
#>   Assuming variate to be NOMINAL.
#> 
#> * "MDS.UPDRS..Subsection.III.V1" variate:
#>   - Numeric values between 15 and 50 
#>   Assuming variate to be CONTINUOUS.
#>   - Distance between datapoints is a multiple of 1 
#>   Assuming variate to be ROUNDED.
#>   - All values are positive
#>   Assuming "domainmin" to be 0
#> 
#> * "diff.MDS.UPRS.III" variate:
#>   - Numeric values between -9 and 10 
#>   Assuming variate to be CONTINUOUS.
#>   - Distance between datapoints is a multiple of 1 
#>   Assuming variate to be ROUNDED.
#> =========
#> WARNINGS - please make sure to check these variates in the metadata file:
#> 
#> * "Age" variate appears to be continuous and rounded,
#> but it could also be an ordinal variate
#> 
#> * "MDS.UPDRS..Subsection.III.V1" variate appears to be continuous and rounded,
#> but it could also be an ordinal variate
#> 
#> * "diff.MDS.UPRS.III" variate appears to be continuous and rounded,
#> but it could also be an ordinal variate
#> =========
#> 
#> Saved proposal metadata file as "temp_metadata.csv"

We can open the temporary metadata file temp_metadata.csv and edit it with any software of our choice. Let’s save the corrected version as meta_toydata.csv:

metadatafile <- 'meta_toydata.csv'

The metadata are as follows; NA indicates an empty field:

name                                     type        datastep  domainmin  domainmax  minincluded  maxincluded  V1      V2
TreatmentGroup                           nominal     NA        NA         NA         NA           NA           NR      Placebo
Sex                                      nominal     NA        NA         NA         NA           NA           Female  Male
Age                                      continuous  1         0          NA         NA           NA           NA      NA
Anamnestic.Loss.of.smell                 nominal     NA        NA         NA         NA           NA           No      Yes
History.of.REM.Sleep.Behaviour.Disorder  nominal     NA        NA         NA         NA           NA           No      Yes
MDS.UPDRS..Subsection.III.V1             ordinal     1         0          132        yes          yes          NA      NA
diff.MDS.UPRS.III                        ordinal     1         -132       132        yes          yes          NA      NA

For a description of the metadata fields, see the help for metadata with ?inferno::metadata.
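If you prefer to inspect the edited metadata from within R, the file can be read like any other csv file; a minimal sketch using base R:

## read the corrected metadata into a data.frame and display it
metadata <- read.csv(metadatafile)
metadata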

(More to be written here!)

Learning

Our inferences will be based on the data already observed, and will mainly take the form of probabilities about the unknown values of various quantities Y, given the observed values of other quantities X:

\mathrm{Pr}(Y = y \:\vert\: X = x, \mathrm{data})

As the notation above indicates, these probabilities also depend on the data already observed. They are usually called “posterior probabilities”.

We need to prepare the software to perform calculations of posterior probabilities given the observed data. In machine learning an analogous process is called “learning”. For this reason the function that achieves this is called learn(). It requires at least three arguments:

  • data, which can be given as a path to the csv file containing the data
  • metadata, which can also be given as a path to the metadata file
  • outputdir: the name of the directory where the output should be saved.

It may be useful to specify two more arguments:

  • seed: the seed for the random-number generator, to ensure reproducibility
  • parallel: the number of CPUs to use for the computation

Alternatively you can set the seed with set.seed(), and start a set of parallel workers with the parallel::makeCluster() and doParallel::registerDoParallel() commands.
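For instance, here is a minimal sketch of this alternative setup (the number of workers is purely illustrative; adjust it to your machine):

## set the random-number seed and register parallel workers manually,
## instead of passing 'seed' and 'parallel' to learn()
library(parallel)
library(doParallel)

set.seed(16)
workers <- makeCluster(12)    # illustrative: use the number of CPUs available
registerDoParallel(workers)

## ... call learn() without the 'seed' and 'parallel' arguments ...

stopCluster(workers)          # release the workers when the computation is done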

The “learning” computation can take tens of minutes, or hours, or even days, depending on the number of variates and datapoints in your inference problem. The learn() function outputs various messages on how the computation is proceeding. As an example:

learnt <- learn(
    data = datafile,
    metadata = metadatafile,
    outputdir = 'parkinson_computations',
    seed = 16,
    parallel = 12)

#> Registered doParallelSNOW with 12 workers
#>
#> Using 30 datapoints
#> Calculating auxiliary metadata
#>
#> **************************
#> Saving output in directory
#> parkinson_computations
#> **************************
#> Starting Monte Carlo sampling of 3600 samples by 60 chains
#> in a space of 703 (effectively 6657) dimensions.
#> Using 12 cores: 60 samples per chain, 5 chains per core.
#> Core logs are being saved in individual files.
#>
#> C-compiling samplers appropriate to the variates (package Nimble)
#> this can take tens of minutes with many data or variates.
#> Please wait...

At the end of the computation the learn() function returns the name of the directory in which the results are saved. In the code snippet above, this name is saved in the variable learnt. The object containing the information needed to make inferences is the learnt.rds file within this directory.
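If the inferences are drawn in a later R session, the saved output directory can simply be referred to again by name; a minimal sketch, assuming the same outputdir used above:

## point 'learnt' at the previously saved output directory
learnt <- 'parkinson_computations'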

Drawing inferences

We are finally ready to use the software for drawing inferences, testing hypotheses, and similar scientific investigations. From the observed data, the software allows us to find the probability of any question or hypothesis of interest. It is important, though, that the hypothesis be stated in an unambiguous, quantitative form.

First example: effect on treated patient of female sex

For example, we might want to ask: “Does the NAD-replenishment therapy have an effect on females?”. At first this may sound like a valid scientific hypothesis. But if we carefully examine the statement we realize that it is too vague. First, what does “having an effect” mean exactly? how can this be quantified? Second, can we really say that the therapy either has or has not an effect? are there no degrees of efficacy involved? Third, whom are we asking this question about? the next patient that we may see? or the population of all the remaining patients? Let us make our initial question quantitatively clear.

First, let us translate “effect” into the difference in MDS-UPDRS-III scores between the first and second visits of a subject (visit 2 - visit 1), represented by the variate diff.MDS.UPRS.III. Note that other choices could be made, such as the difference in another MDS-UPDRS score (I–IV), or in the total of these scores, or as the ratio rather than the difference – or as some other variate altogether.

Second, instead of asking the unrealistic dichotomous question “has or has not?”, we ask how strong the effect is, as quantified by the value of the MDS-UPDRS-III decrease.

Third, let us ask this question about a possible next patient. We shall see that this question is also related to the population as a whole.

We can calculate the probability that, in the next patient, we shall observe any particular value y of the MDS-UPDRS-III difference, given that the patient is female and belongs to the NR treatment group, and given what has been observed in the clinical trials:

\mathrm{Pr}(\text{MDS-UPDRS-III difference} = y \:\vert\: \text{Sex} = \text{Female}, \text{Treatment group} = \text{NR}, \mathrm{data})

The function for this is Pr(Y, X, learnt), which accepts three arguments:

  • Y: the variates and values of which we want the probability
  • X: the variates and values which are known
  • learnt: the knowledge acquired from the dataset

Let us first ask this for a specific value of the MDS-UPDRS-III difference, let’s say 0:

probability0 <- Pr(
    Y = data.frame(diff.MDS.UPRS.III = 0),
    X = data.frame(Sex = 'Female', TreatmentGroup = 'NR'),
    learnt = learnt
)

The output of the Pr() function is a list of several objects, of which we discuss only two for the moment:

  • value: this is the value of the probability we wanted to know. In this case it is
probability0$value
#>       X
#> Y            [,1]
#>   [1,] 0.06420201

that is, there is approximately a 6.4% probability that the next treated female patient will have a MDS-UPDRS-III difference of exactly 0.

  • quantiles: these values tell us how much the probability could change, if our knowledge were updated by more clinical trials. They thus quantify the uncertainty of our present probabilistic conclusions. A future, updated probability has an 89% probability of lying between the two extreme (5.5% and 94.5%) quantile values, and a 50% probability of lying between the middle (25% and 75%) ones (these are typical intervals used in the Bayesian literature, see for instance Makowski et al.). In this case they are
probability0$quantiles
#> , ,  = 5.5%
#> 
#>       X
#> Y            [,1]
#>   [1,] 0.02893608
#> 
#> , ,  = 25%
#> 
#>       X
#> Y            [,1]
#>   [1,] 0.04481232
#> 
#> , ,  = 75%
#> 
#>       X
#> Y            [,1]
#>   [1,] 0.07756895
#> 
#> , ,  = 94.5%
#> 
#>       X
#> Y           [,1]
#>   [1,] 0.1119128

that is, if we had many more clinical trials, the probability above could change to a value between around 2.9% and 11%.


We can obviously also ask for the probabilities of a whole range of values – say from -30 to +30 – that the MDS-UPDRS-III difference could have in the next treated female patient:

probabilities <- Pr(
    Y = data.frame(diff.MDS.UPRS.III = (-30):30),
    X = data.frame(Sex = 'Female', TreatmentGroup = 'NR'),
    learnt = learnt
)

Each probability thus obtained comes with accompanying quantiles that quantify its uncertainty. Both can be visualized with the plot() function:

plot(probabilities)

Two aspects of this plot are relevant for our investigation:

  • There seems to be slightly more probability mass towards negative MDS-UPDRS-III differences – which means an improvement in the degree of disability. We shall quantify this further in a moment.

  • However, it is also clear that further clinical trials could lead to a quite different probability distribution. This is shown by the breadth of the quantile range.

Another way to represent graphically how the result could vary if more trials were available is to plot some examples – say 100 – of how the results could differ. Such examples are stored in the samples element returned by the Pr() function:

plot(probabilities, variability = 'samples',
    ylim = c(0, max(probabilities$quantiles)) # keep same y-range as before
    )


Let us ask a more interesting question, which can be a quantitative counterpart of our initial question “Does the NAD-replenishment therapy have an effect on females?”. We ask: “Will the next treated female patient experience an MDS-UPDRS-III difference equal to -1 or less?”. We therefore want the probability

\mathrm{Pr}(\text{MDS-UPDRS-III diff.} \le -1 \:\vert\: \text{Sex} = \text{Female}, \text{Treatment} = \text{NR}, \mathrm{data})

This probability can be calculated with the function tailPr(Y, X, learnt), which is analogous to the Pr() function, but calculates the probability of Y ≤ y rather than of Y = y.

probDecrease <- tailPr(
    Y = data.frame(diff.MDS.UPRS.III = -1),
    X = data.frame(Sex = 'Female', TreatmentGroup = 'NR'),
    learnt = learnt
)

The answer is that the probability of a decrease in this disability score is

probDecrease$value
#>       X
#> Y           [,1]
#>   [1,] 0.5497687

or approximately 55%; note that it is somewhat higher than 50%.

Further clinical trials, however, could update this probability to a new value between 34% and 75%:

probDecrease$quantiles[1,1,]
#>      5.5%       25%       75%     94.5% 
#> 0.3428941 0.4639758 0.6415482 0.7474421
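As a small sanity check, the probability of the complementary event – no decrease, that is, a difference of 0 or more – can be obtained directly from the value computed above:

## probability of no decrease in the next treated female patient
1 - probDecrease$value    # approximately 1 - 0.55 = 0.45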

Second example: comparison with non-treated patient of female sex

In the previous example we found that there is a 55% probability that a treated female patient will experience a decrease in MDS-UPDRS-III after the therapy. But can this decrease really be correlated with the therapy?

To preliminarily answer this question, we can find the analogous probabilities for a female patient who has not received the treatment. Again we use Pr(), but our X argument will specify the Placebo value for TreatmentGroup:

probabilitiesPlacebo <- Pr(
    Y = data.frame(diff.MDS.UPRS.III = (-30):30),
    X = data.frame(Sex = 'Female', TreatmentGroup = 'Placebo'),
    learnt = learnt
)

Let’s plot these probabilities, together with their variability (only the extreme 5.5% and 94.5% quantiles), on top of the ones previously obtained for the treated case:

plotquantiles(x = (-30):30, y = probabilities$quantiles[,1, c('5.5%', '94.5%')],
    ylim = c(0, NA),
    xlab = 'MDS-UPDRS-III difference', ylab = 'probability', col = 3)

plotquantiles(x = (-30):30, y = probabilitiesPlacebo$quantiles[,1, c('5.5%', '94.5%')],
    col = 2, add = TRUE)

flexiplot(x = (-30):30, y = probabilities$values[,1], col = 3, lwd = 2, add = TRUE)

flexiplot(x = (-30):30, y = probabilitiesPlacebo$values[,1],
    col = 2, lty = 2, add = TRUE)

## add 0-change line
abline(v = 0, lwd = 1, col = 5)

legend('topright', legend = c('NR, Female', 'Placebo, Female'),
    bty = 'n', lty = 1:2, col = 3:2)

There does not seem to be any difference that stands out between the treated and non-treated case.

As before, we can calculate the probability of a decrease in MDS-UPDRS-III:

probDecreasePlacebo <- tailPr(
    Y = data.frame(diff.MDS.UPRS.III = -1),
    X = data.frame(Sex = 'Female', TreatmentGroup = 'Placebo'),
    learnt = learnt
)

We find that the probability of a decrease in the disability score for a non-treated female patient is

probDecreasePlacebo$value
#>       X
#> Y           [,1]
#>   [1,] 0.5432515

or approximately 54%. Further clinical trials, however, could update this probability to a value between 33% and 74%:

probDecreasePlacebo$quantiles[1,1,]
#>      5.5%       25%       75%     94.5% 
#> 0.3332187 0.4538897 0.6325576 0.7434010

Therefore we see a similarly probable decrease in a treated as well as in a non-treated female patient. There is a very small difference in these two probabilities, in favour of the treated case; but this difference is too small compared with how much the probabilities could change if more trials were available.
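The difference between these two probabilities can also be computed directly from the objects obtained above; with the values reported here it amounts to less than one percentage point:

## difference between treated and non-treated probabilities of a decrease
probDecrease$value - probDecreasePlacebo$value    # approximately 0.0065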

Some preliminary conclusions and possible questions for future research:

  • There does not seem to be a difference between treated and non-treated female patients, regarding MDS-UPDRS-III decrease.

  • However, for both groups a decrease is slightly more probable than not. What could this decrease be correlated with?

Third example: considering age

Our results so far indicate practically no difference in the probabilities of MDS-UPDRS-III decrease of a treated vs a non-treated female patient. These probabilities, however, disregard the patient’s age. Is it possible that there are larger differences in different age groups?

We can answer this question by calculating the probabilities conditional on any particular Age value, that is,

\mathrm{Pr}(\text{MDS-UPDRS-III diff.} \le -1 \:\vert\: \text{Sex} = \text{Female}, \text{Treatment} = \text{NR}, \text{Age} = a, \mathrm{data})

for meaningful values a of the age, let’s say between 25 and 80. Again we can find them with the function tailPr(). This time our conditioning argument X will include a grid of Age values. First we do the calculation for a treated female patient, then for a non-treated one:

probDecreaseAge <- tailPr(
    Y = data.frame(diff.MDS.UPRS.III = -1),
    X = data.frame(Sex = 'Female', Age = 25:80, TreatmentGroup = 'NR'),
    learnt = learnt
)

probDecreaseAgePlacebo <- tailPr(
    Y = data.frame(diff.MDS.UPRS.III = -1),
    X = data.frame(Sex = 'Female', Age = 25:80, TreatmentGroup = 'Placebo'),
    learnt = learnt
)

We can plot the probabilities of an MDS-UPDRS-III decrease against the patient’s age, with one curve and quantile band for each treatment group:

plotquantiles(x = 25:80, y = probDecreaseAge$quantiles[1,, c('5.5%', '94.5%')],
    ylim = c(0, NA),
    xlab = 'age', ylab = 'probability of MDS-UPDRS-III decr.', col = 3)

plotquantiles(x = 25:80, y = probDecreaseAgePlacebo$quantiles[1,, c('5.5%', '94.5%')],
    col = 2, add = TRUE)

flexiplot(x = 25:80, y = probDecreaseAge$values[1,], col = 3, lwd = 2, add = TRUE)

flexiplot(x = 25:80, y = probDecreaseAgePlacebo$values[1,],
    col = 2, lwd = 2, lty = 2, add = TRUE)

## add 50%-probability line
abline(h = 0.5, lwd = 1, col = 5)

legend('topright', legend = c('NR, Female', 'Placebo, Female'),
    bty = 'n', lty = 1:2, col = 3:2)

This plot shows some interesting aspects:

  • There does not seem to be a large difference, between treated and non-treated females, in the probability of a MDS-UPDRS-III decrease.

  • In the 60–65 age group, an increase in MDS-UPDRS-III is more likely than a decrease, for both a treated and a non-treated female patient. What could be the reason for the increase in this narrow age group?

Again notice that these probabilities could be updated to very different values, if more clinical trials were available.

Fourth example: male patient

We can perform an analysis analogous to the previous examples, but for a male patient. Here are the corresponding function calls and plots.

probabilitiesMale <- Pr(
    Y = data.frame(diff.MDS.UPRS.III = (-30):30),
    X = data.frame(Sex = 'Male', TreatmentGroup = c('NR', 'Placebo')),
    learnt = learnt
)

plotquantiles(x = (-30):30, y = probabilitiesMale$quantiles[,1, c('5.5%', '94.5%')],
    ylim = c(0, NA),
    xlab = 'MDS-UPDRS-III difference', ylab = 'probability', col = 7)

plotquantiles(x = (-30):30, y = probabilitiesMale$quantiles[,2, c('5.5%', '94.5%')],
    col = 6, add = TRUE)

flexiplot(x = (-30):30, y = probabilitiesMale$values[,1],
    col = 7, lwd = 2, add = TRUE)

flexiplot(x = (-30):30, y = probabilitiesMale$values[,2],
    col = 6, lwd = 2, lty = 2, add = TRUE)

## add 0-change line
abline(v = 0, lwd = 1, col = 5)

legend('topright', legend = c('NR, Male', 'Placebo, Male'),
    bty = 'n', lty = 1:2, col = 7:6)

And considering all ages separately:

probMaleDecreaseAge <- tailPr(
    Y = data.frame(diff.MDS.UPRS.III = -1),
    X = data.frame(Sex = 'Male', Age = 25:80, TreatmentGroup = 'NR'),
    learnt = learnt
)

probMaleDecreaseAgePlacebo <- tailPr(
    Y = data.frame(diff.MDS.UPRS.III = -1),
    X = data.frame(Sex = 'Male', Age = 25:80, TreatmentGroup = 'Placebo'),
    learnt = learnt
)


plotquantiles(x = 25:80, y = probMaleDecreaseAge$quantiles[1,, c('5.5%', '94.5%')],
    ylim = c(0, NA),
    xlab = 'age', ylab = 'probability of MDS-UPDRS-III decr.', col = 7)

plotquantiles(x = 25:80, y = probMaleDecreaseAgePlacebo$quantiles[1,, c('5.5%', '94.5%')],
    col = 6, add = TRUE)

flexiplot(x = 25:80, y = probMaleDecreaseAge$values[1,],
    col = 7, lwd = 2, add = TRUE)

flexiplot(x = 25:80, y = probMaleDecreaseAgePlacebo$values[1,],
    col = 6, lwd = 2, lty = 2, add = TRUE)

## add 50%-probability line
abline(h = 0.5, lwd = 1, col = 5)

legend('topright', legend = c('NR, Male', 'Placebo, Male'),
    bty = 'n', lty = 1:2, col = 7:6)

Finally let’s compare treated female vs male patients, per age:

plotquantiles(x = 25:80, y = probDecreaseAge$quantiles[1,, c('5.5%', '94.5%')],
    ylim = c(0, NA),
    xlab = 'age', ylab = 'probability of MDS-UPDRS-III decr.', col = 3)

plotquantiles(x = 25:80, y = probMaleDecreaseAge$quantiles[1,, c('5.5%', '94.5%')],
    col = 7, add = TRUE)

flexiplot(x = 25:80, y = probDecreaseAge$values[1,],
    col = 3, lwd = 2, add = TRUE)

flexiplot(x = 25:80, y = probMaleDecreaseAge$values[1,],
    col = 7, lwd = 2, lty = 2, add = TRUE)

## add 50%-probability line
abline(h = 0.5, lwd = 1, col = 5)

legend('topright', legend = c('NR, Female', 'NR, Male'),
    bty = 'n', lty = 1:2, col = c(3, 7))

(TO BE CONTINUED)