Metadata and helper function to create a template metadata file or object.
Usage
metadatatemplate(
data,
file = NULL,
includevrt = NULL,
excludevrt = NULL,
addsummary2metadata = FALSE,
backupfiles = FALSE,
verbose = TRUE
)
Arguments
- data
A dataset, given as a data.frame or as a file path to a csv file.
- file
String: name of the csv file where the metadata should be saved; if NULL, the metadata are output as VALUE.
- includevrt
Character or NULL: names of the variates in the dataset to be included.
- excludevrt
Character or NULL: names of the variates in the dataset to be excluded.
- addsummary2metadata
Logical: also output some diagnostic statistics in the metadata? Default FALSE.
- backupfiles
Logical: rename the previous metadata file if it exists? Default FALSE.
- verbose
Logical: output heuristics for each variate? Default TRUE.
Value
If file = NULL, VALUE is a data.frame containing the metadata; otherwise a preliminary metadata csv file is created and VALUE is NULL.
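A minimal usage sketch, assuming the package that provides metadatatemplate has been loaded. The toy dataset and its column names are invented for illustration; the call is guarded so that the snippet also runs when the function is not available.

```r
# Toy dataset; the variate names are made up for this illustration.
dat <- data.frame(
  age = c(25, 31, 47, 52),
  colour = c("red", "green", "blue", "red"),
  stringsAsFactors = FALSE
)

template <- NULL
# Call the function only if the package providing it has been loaded.
if (exists("metadatatemplate", mode = "function")) {
  # file = NULL: the template is output as the value, not written to disk
  template <- metadatatemplate(data = dat, file = NULL, verbose = FALSE)
  print(template)
}
```

The resulting template is only a starting point and still has to be checked and corrected by hand.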
Details
The learn function needs metadata about the variates present in the data. Such metadata can be provided either as a csv file or as a data.frame. The function metadatatemplate creates a template metadata csv file, or outputs a metadata data.frame, by trying to guess metadata information from the dataset. The guesses may be very incorrect: metadata are information not contained in the data, so no algorithm can extract them from the data. The user must modify and correct this template, using it as a starting point to prepare the correct metadata information.
Metadata information and format
In order to correctly learn from a dataset, the learn function needs information that is not contained in the data themselves; that is, it needs metadata. Metadata are provided either as a csv file or as a data.frame.
A metadata file or data.frame must contain one row for each simple variate in the given inference problem, and the following fields (columns), even if some of them may be empty:
name, type, domainmin, domainmax, datastep, minincluded, maxincluded, V1, V2, (possibly additional V-fields, sequentially numbered)
The type
field has three possible values: nominal
, ordinal
, continuous
. The remaining fields that must be filled in depend on the type
field. Here is a list of requirements:
- nominal and ordinal: require either the V1, V2, ... fields or the domainmin, domainmax, datastep fields (all three). No other fields are required.
- continuous: requires domainmin, domainmax, datastep, minincluded, maxincluded.
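As an illustration of the layout described above, the following sketch builds a metadata data.frame by hand in base R and checks the field requirements. The three variates are hypothetical, and representing empty fields as NA is an assumption of this sketch, not something the package is known to require.

```r
# Hand-written metadata for three hypothetical variates.
# Empty fields are represented as NA in this sketch.
metadata <- data.frame(
  name        = c("colour", "satisfaction", "length_m"),
  type        = c("nominal", "ordinal", "continuous"),
  domainmin   = c(NA, 1, 0),
  domainmax   = c(NA, 10, NA),      # empty maximum for length_m: +Inf
  datastep    = c(NA, 1, 0.01),     # length rounded to the nearest centimetre
  minincluded = c(NA, NA, "false"), # a length of exactly 0 cannot occur
  maxincluded = c(NA, NA, "false"),
  V1          = c("red", NA, NA),
  V2          = c("green", NA, NA),
  V3          = c("blue", NA, NA),
  stringsAsFactors = FALSE
)

# All mandatory columns must be present, even if some entries are empty.
required <- c("name", "type", "domainmin", "domainmax", "datastep",
              "minincluded", "maxincluded", "V1", "V2")
missing_fields <- setdiff(required, names(metadata))
stopifnot(length(missing_fields) == 0)
stopifnot(all(metadata$type %in% c("nominal", "ordinal", "continuous")))

# Nominal and ordinal variates need either V-fields or all three range fields.
vcols <- grep("^V[0-9]+$", names(metadata), value = TRUE)
has_v <- rowSums(!is.na(metadata[, vcols, drop = FALSE])) > 0
has_range <- !is.na(metadata$domainmin) & !is.na(metadata$domainmax) &
  !is.na(metadata$datastep)
discrete <- metadata$type %in% c("nominal", "ordinal")
stopifnot(all(has_v[discrete] | has_range[discrete]))
```

Such a data.frame could be saved with write.csv(metadata, "metadata.csv", row.names = FALSE, na = ""), which writes NA entries as empty fields; the file name is only an example.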
Here are the meanings and possible values of the fields:
name
: The name of the variate. This must be the same character string as it appears in the dataset (be careful about upper- and lower-case).
type
: The data type of variate name
. Possible values are nominal
, ordinal
, continuous
.
A nominal (also called categorical) variate has a discrete, finite number of possible values which have no intrinsic ordering. Examples could be a variate related to colour, with values "red", "green", "blue", and so on; or a variate related to cat breeds, with values "Siamese", "Abyssinian", "Persian", and so on. The possible values of the variate must be given in the fields V1, V2, and so on. It is important to include values that are possible but are not present in the dataset. A variate having only two possible values (binary variate), for example "yes" and "no", can be specified as nominal.
An ordinal variate has a discrete, finite number of possible values which do have an intrinsic ordering. Examples could be a Likert-scaled variate for the results of a survey, with values "very dissatisfied", "dissatisfied", "satisfied", "very satisfied"; or a variate related to the levels of some quantities, with values "low", "medium", "high"; or a variate having a numeric scale with values from 1 to 10. Whether a variate is nominal or ordinal often depends on the context. The possible values of the variate must be given in one (but not both) of two ways: (1) in the fields V1, V2, ..., as for nominal variates; (2) through the fields domainmin, domainmax, datastep. Option (2) only works with numeric, equally spaced values: it assumes that the first value is domainmin, the second is domainmin + datastep, the third is domainmin + 2*datastep, and so on up to the last value, domainmax.
A continuous variate has a continuum of values with an intrinsic ordering. Examples could be a variate related to the width of an object; or to the age of a person; or one coordinate of an object in a particular reference system. A continuous variate requires specification of the fields domainmin, domainmax, datastep, minincluded, maxincluded. Some naturally continuous variates are often rounded to a given precision; for instance, the age of a person might be reported as rounded to the nearest year (25 years, 26 years, and so on); or the length of an object might be reported to the nearest centimetre (1 m, 1.01 m, 1.02 m, and so on). The minimum distance between such rounded values must be reported in the datastep field; this would be 1 in the age example and 0.01 in the length example above. See below for further explanation of why reporting such rounding is important.
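The second way of specifying an ordinal variate, through domainmin, domainmax and datastep, amounts to an equally spaced sequence. The sketch below shows the implied set of values in base R; ordinal_values is a hypothetical helper written for this illustration, not a function of the package.

```r
# Hypothetical helper: list the values implied by the three range fields.
ordinal_values <- function(domainmin, domainmax, datastep) {
  stopifnot(is.finite(domainmin), is.finite(domainmax), datastep > 0)
  # first value domainmin, then domainmin + datastep, ..., up to domainmax
  seq(from = domainmin, to = domainmax, by = datastep)
}

scale_values <- ordinal_values(domainmin = 1, domainmax = 10, datastep = 1)
print(scale_values)
```

For a 1-to-10 numeric scale this yields the ten values 1, 2, ..., 10, the same set that would otherwise be listed one by one in the fields V1 to V10.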
domainmin
: The minimum value that the variate (ordinal or continuous) can take on. Possible values are a real number or an empty value, which is then interpreted as -Inf
(explicit values like -Inf
, -inf
, -infinity
should also work). Some continuous variates, like age or distance or temperature, are naturally positive, and therefore have domainmin
equal to 0
. But in other contexts the minimum value could be different. For instance, if a given inference problem only involves people of age 18 or more, then domainmin
would be set to 18
.
domainmax
: The maximum value that the variate (ordinal or continuous) can take on. Possible values are a real number, or an empty value, which is then interpreted as +Inf
(explicit values like Inf
, inf
, infinity
should also work). As with domainmin
, the maximum value depends on the context. An age-related variate could theoretically have domainmax
equal to infinity (empty value in the metadata file); but if a given study categorizes some people as "90 years old or older", then domainmax
should be set to 90
.
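The conventions for empty and infinite bounds can be summarized in a small sketch. parse_bound is a hypothetical helper that mimics the interpretation described above for a single field value; it is not the package's own parser, and the exact set of strings the package accepts may differ.

```r
# Hypothetical helper: interpret one domainmin or domainmax field.
# empty_value is -Inf for domainmin and +Inf for domainmax.
parse_bound <- function(x, empty_value) {
  x <- trimws(tolower(as.character(x)))
  if (is.na(x) || x == "") return(empty_value)
  if (x %in% c("inf", "+inf", "infinity", "+infinity")) return(Inf)
  if (x %in% c("-inf", "-infinity")) return(-Inf)
  as.numeric(x)
}

parse_bound("", empty_value = -Inf)    # empty domainmin: minus infinity
parse_bound("18", empty_value = -Inf)  # study restricted to adults
parse_bound("90", empty_value = Inf)   # "90 years old or older"
```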
datastep
: The minimum distance between the values of a variate (ordinal or continuous). Possible values are a positive real number or an empty value, which is then interpreted as 0 (the explicit value 0
is also accepted). For a numeric ordinal variate, datastep
is the step between consecutive values. For a continuous rounded variate, datastep
is the minimum distance between different values that occurs because of rounding; see the examples given above. The function metadatatemplate
has some heuristics to determine whether the variate is rounded or not. See further details under the section Rounding below.
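To make the notion of datastep concrete, here is a naive way to obtain a candidate value from data: the smallest gap between distinct observed values. This is only an illustrative sketch with a hypothetical helper name; it is not the heuristic used by the function, which is not described in this section, and its result must still be checked against how the data were actually recorded.

```r
# Hypothetical helper: smallest gap between distinct observed values.
candidate_datastep <- function(x) {
  distinct <- sort(unique(x[!is.na(x)]))
  if (length(distinct) < 2) return(NA_real_)
  # round to suppress floating-point noise in the differences
  round(min(diff(distinct)), digits = 10)
}

ages <- c(25, 26, 26, 30, 41)          # rounded to the nearest year
lengths <- c(1, 1.01, 1.02, 1.5)       # rounded to the nearest centimetre
candidate_datastep(ages)
candidate_datastep(lengths)
```

A small sample can easily overestimate the step (for example if no two observed ages happen to differ by exactly one year), which is one reason why the guessed value should be treated as a template entry to be corrected.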
minincluded
, maxincluded
: Whether the minimum (domainmin
) and maximum (domainmax
) values of a continuous variate can really appear in the data or not. Possible values are true
(or t
or yes
) or false
(or f
, no
, or an empty field); upper- or lower-case is irrelevant. Here are some examples of the meaning of these fields. (a) A continuous unrounded variate such as absolute temperature (in kelvin) has 0 as its minimum possible value domainmin
, but this value itself is physically impossible and can never appear in data; in this case minincluded
is empty (or set to false
or no
). (b) A variate related to the unrounded length, in metres, of some objects may take on any positive real value; but suppose that all objects of length 5 or less are grouped together under the value 5
. It is then possible for several datapoints to have value 5
: one such datapoint could originally have the value 3.782341...; another the value 4.929673..., and so on. In this case domainmin
is set to 5
, and minincluded
is set to true
(or yes
). Similarly for the maximum value of a variate and maxincluded
. Note that if domainmin
is minus-infinity (empty value in the metadata file), then minincluded
is automatically empty (that is, false
), and similarly for maxincluded
if domainmax
is infinity.
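The accepted spellings of minincluded and maxincluded, and the rule that an infinite bound is never included, can be sketched as follows. Both helpers are hypothetical and written only for this illustration; they are not functions of the package.

```r
# Hypothetical helper: interpret one minincluded or maxincluded field.
parse_included <- function(x) {
  x <- trimws(tolower(as.character(x)))
  if (is.na(x) || x == "") return(FALSE)          # empty field means false
  if (x %in% c("true", "t", "yes")) return(TRUE)
  if (x %in% c("false", "f", "no")) return(FALSE)
  stop("unrecognised value: ", x)
}

# An infinite bound can never appear in the data, whatever the field says.
effective_included <- function(bound, included) {
  is.finite(bound) && parse_included(included)
}

effective_included(5, "Yes")     # lengths of 5 or less grouped under 5
effective_included(0, "")        # a bound that cannot itself occur
effective_included(-Inf, "true") # infinite bound: treated as not included
```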