Core function to compute the posterior probability distribution of the variates conditional on the given data.
Usage
learn(
data,
metadata,
auxdata = NULL,
outputdir = NULL,
nsamples = 3600,
nchains = 8,
nsamplesperchain = 450,
parallel = NULL,
seed = NULL,
cleanup = TRUE,
appendtimestamp = TRUE,
appendinfo = TRUE,
output = "directory",
subsampledata = NULL,
prior = missing(data) || is.null(data),
startupMCiterations = 3600,
minMCiterations = 0,
maxMCiterations = +Inf,
maxhours = +Inf,
ncheckpoints = 12,
maxrelMCSE = +Inf,
minESS = 450,
initES = 2,
thinning = NULL,
plottraces = !cleanup,
showKtraces = FALSE,
showAlphatraces = FALSE,
hyperparams = list(ncomponents = 64, minalpha = -4, maxalpha = 4, byalpha = 1,
Rshapelo = 0.5, Rshapehi = 0.5, Rvarm1 = 3^2, Cshapelo = 0.5, Cshapehi = 0.5,
Cvarm1 = 3^2, Dshapelo = 0.5, Dshapehi = 0.5, Dvarm1 = 3^2, Bshapelo = 1,
Bshapehi = 1, Dthreshold = 1, tscalefactor = 4.266, Oprior = "Hadamard",
Nprior = "Hadamard", avoidzeroW = NULL, initmethod = "datacentre",
Qerror = pnorm(c(-1, 1)))
)
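Two defaults in the signature above are expressions rather than constants: prior = missing(data) || is.null(data) and plottraces = !cleanup. R evaluates such defaults lazily inside the function, so they depend on the other arguments actually supplied. The toy function below (show_defaults is a hypothetical name, not part of the package) copies only these two defaults to illustrate how they resolve; it is a sketch of standard R semantics, not the package's code.

```r
# Hypothetical toy function: it reuses only the two default expressions
# from the usage above and returns the values they resolve to.
show_defaults <- function(data,
                          cleanup = TRUE,
                          prior = missing(data) || is.null(data),
                          plottraces = !cleanup) {
  list(prior = prior, plottraces = plottraces)
}

no_data   <- show_defaults()                      # data missing
null_data <- show_defaults(data = NULL)           # data explicitly NULL
with_data <- show_defaults(data = data.frame(x = 1))
keep_diag <- show_defaults(data = data.frame(x = 1), cleanup = FALSE)

no_data$prior         # TRUE: no data, so the prior is computed
null_data$prior       # TRUE
with_data$prior       # FALSE
with_data$plottraces  # FALSE, because cleanup defaults to TRUE
keep_diag$plottraces  # TRUE, because cleanup = FALSE
```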
Arguments
- data
A dataset, given as a base::data.frame() or as a file path to a CSV file.
- metadata
A metadata object, given either as a data.frame object or as a file path to a CSV file.
- auxdata
A larger dataset, given as a base::data.frame() or as a file path to a CSV file. Such a dataset would be too large to use in the Monte Carlo sampling, but can be used to calculate hyperparameters.
- outputdir
Character: path to the folder where the output should be saved. If omitted, a directory is created that has the same name as the data file but with suffix "_output_".
- nsamples
Integer: number of desired Monte Carlo samples. Default 3600.
- nchains
Integer: number of Monte Carlo chains. Default 8.
- nsamplesperchain
Integer: number of Monte Carlo samples per chain. Default 450.
- parallel
Logical, NULL, or positive integer. TRUE: use roughly half of the available cores; FALSE: use serial computation; NULL (default): don't do anything (use pre-registered condition); integer: use this many cores.
- seed
Integer: use this seed for the random number generator. If missing or NULL (default), do not set the seed.
- cleanup
Logical: remove diagnostic files at the end of the computation? Default TRUE.
- appendtimestamp
Logical: append a timestamp to the name of the output directory outputdir? Default TRUE.
- appendinfo
Logical: append information about dataset and Monte Carlo parameters to the name of the output directory outputdir? Default TRUE.
- output
Character: if 'directory', return the output directory name as VALUE; if 'learnt', return the learnt object containing the parameters obtained from the Monte Carlo computation. Any other value: VALUE is NULL.
- subsampledata
Integer: use only a subset of this many datapoints for the Monte Carlo computation.
- prior
Logical: calculate the prior distribution? Default: TRUE if data is missing or NULL, otherwise FALSE.
- startupMCiterations
Integer: number of initial Monte Carlo iterations. Default 3600.
- minMCiterations
Integer: minimum number of Monte Carlo iterations to be done by a chain. Default 0.
- maxMCiterations
Integer: do at most this many Monte Carlo iterations per chain. Default Inf.
- maxhours
Numeric: approximate time limit, in hours, for the Monte Carlo computation to last. Default Inf.
- ncheckpoints
Integer: number of datapoints to use for checking when the Monte Carlo computation should end. If NULL, this is equal to the number of variates + 2. If Inf, use all datapoints. Default 12.
- maxrelMCSE
Numeric positive: desired maximal relative Monte Carlo Standard Error of calculated probabilities with respect to their variability with new data. Default +Inf, so that minESS is used instead. maxrelMCSE is related to minESS by maxrelMCSE = 1/sqrt(minESS + initES).
- minESS
Numeric positive: desired minimal Monte Carlo Expected Sample Size. If NULL, it is equal to the final nsamplesperchain. Default 450. minESS is related to maxrelMCSE by minESS = 1/maxrelMCSE^2 - initES.
- initES
Numeric positive: number of initial Expected Samples to discard. Default 2.
- thinning
Integer: thin out the Monte Carlo samples by this value. If NULL (default): let the diagnostics decide the thinning value.
- plottraces
Logical: save plots of the Monte Carlo traces of diagnostic values? Default !cleanup, that is, FALSE unless cleanup is set to FALSE.
- showKtraces
Logical: save plots of the Monte Carlo traces of the K parameter? Default FALSE.
- showAlphatraces
Logical: save plots of the Monte Carlo traces of the Alpha parameter? Default FALSE.
- hyperparams
List: hyperparameters of the prior.
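The two formulas given for maxrelMCSE and minESS are inverses of each other, which can be checked numerically. The sketch below uses the documented defaults (minESS = 450, initES = 2); the helper names ess_to_relmcse and relmcse_to_ess are hypothetical and not part of the package.

```r
# Hypothetical helpers implementing the two documented relations:
#   maxrelMCSE = 1/sqrt(minESS + initES)
#   minESS     = 1/maxrelMCSE^2 - initES
ess_to_relmcse <- function(minESS, initES = 2) {
  1 / sqrt(minESS + initES)
}
relmcse_to_ess <- function(maxrelMCSE, initES = 2) {
  1 / maxrelMCSE^2 - initES
}

# Relative MCSE implied by the default minESS = 450 with initES = 2
relmcse <- ess_to_relmcse(450)
relmcse            # 1/sqrt(452), roughly 0.047

# Converting back recovers the original minESS (up to rounding error)
ess_back <- relmcse_to_ess(relmcse)
ess_back
```

With these defaults, asking for minESS = 450 corresponds to a relative Monte Carlo standard error of a little under 5%.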
Value
Name of the directory containing output files, or learnt object, or NULL, depending on argument output.
Details
This function takes as main inputs a set of data and metadata, and computes the probability distribution for new data. Its computation can also be interpreted as an estimation of the frequencies of the variates in the whole population, beyond the sample data. The probability distribution is not assumed to be Gaussian or of any other specific shape. The computation is done via Markov-chain Monte Carlo.
This function creates an object, contained in a learnt.rds file, which is used in all subsequent probabilistic computations. Other information about the computation is provided in logs and plots, saved in a directory specified by the user.
See vignette('inferno_start') for an introductory example.
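A call might look like the following sketch. It uses only arguments documented above. The file names and output directory are hypothetical placeholders, and the package name inferno is an assumption based on the vignette name. The call is guarded so that it runs only when the package and both files are available, since the Monte Carlo computation can take a long time.

```r
# Hypothetical file names: replace with your own data and metadata files.
datafile <- "mydata.csv"
metadatafile <- "mymetadata.csv"

learnt <- NULL
# Assumption: the package providing learn() is called 'inferno'.
if (requireNamespace("inferno", quietly = TRUE) &&
    file.exists(datafile) && file.exists(metadatafile)) {
  learnt <- inferno::learn(
    data = datafile,           # CSV path; a data.frame also works
    metadata = metadatafile,   # CSV path; a data.frame also works
    outputdir = "myoutput",    # hypothetical output folder name
    parallel = 2,              # use two cores
    seed = 16,                 # set the seed for reproducibility
    output = "learnt"          # return the learnt object, not the directory name
  )
}
```

With output = "directory" (the default) the call would instead return the name of the output directory; because appendtimestamp and appendinfo default to TRUE, that name generally differs from the outputdir string passed in.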