The dataset package provides tools to create semantically rich and interoperable datasets in R. It improves metadata handling by introducing new S3 classes (defined(), dataset_df(), and bibrecord()) that enhance the behaviour of labelled, tibble, and bibentry objects to meet the requirements of:
Many tools exist to help document, describe, or publish datasets in R, but most separate the metadata from the data itself. This separation increases the risk of losing metadata, misaligning it with the data, or making documentation hard to maintain.
The dataset package addresses this by storing all metadata directly in R object attributes. This preserves semantic information as data is transformed, combined, or exported, preventing the loss of vital documentation and improving reproducibility.
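To see why a dedicated S3 class helps, consider plain base R attributes: they attach metadata to a vector, but ordinary subsetting silently drops them. The sketch below is not the package's implementation; `new_annotated()` and the `annotated` class are hypothetical names used only to illustrate how a class with its own `[` method can carry a label and a unit through a transformation.

```r
# Plain attributes: the metadata is attached, but subsetting drops it.
age <- c(118, 484, 664)
attr(age, "label") <- "The age of the tree"
attr(age, "unit") <- "days since 1968/12/31"
attributes(age[1:2]) # NULL: label and unit are gone

# Hypothetical constructor for an illustrative S3 class (not part of dataset).
new_annotated <- function(x, label, unit) {
  structure(x, label = label, unit = unit, class = "annotated")
}

# A `[` method that subsets the underlying values and re-attaches the metadata.
`[.annotated` <- function(x, i) {
  new_annotated(unclass(x)[i], attr(x, "label"), attr(x, "unit"))
}

tree_age <- new_annotated(
  c(118, 484, 664),
  label = "The age of the tree",
  unit = "days since 1968/12/31"
)
subset_age <- tree_age[2:3]
attr(subset_age, "unit") # the unit survives the subsetting
```

The classes in this package follow the same general idea of keeping metadata on the object itself, with considerably richer semantics than this two-attribute sketch.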
defined()
An extended version of labelled() vectors. Adds support for:
library(dataset)
data(orange_df)
print(orange_df$age)
#> orange_df$age: The age of the tree
#> Measured in days since 1968/12/31
#> [1] 118 484 664 1004 1231 1372 1582 118 484 664 1004 1231 1372 1582 118
#> [16] 484 664 1004 1231 1372 1582 118 484 664 1004 1231 1372 1582 118 484
#> [31] 664 1004 1231 1372 1582
This ensures that, for example, “GDP” is always associated with a precise concept and unit, avoiding ambiguity across analyses and publications. See Semantically Enriched Vectors with defined().
bibrecord()
An extension of R’s built-in bibentry() class, with support for:
dcterms)
as_dublincore(orange_df)
#> Dublin Core Metadata Record
#> --------------------------
#> Title: Growth of Orange Trees
#> Creator(s): N.R. Draper [cre] (http://viaf.org/viaf/84585260); H Smith [cre]
#> Contributor(s): :unas
#> Publisher: Wiley
#> Year: 1998
#> Language: en
#> Description: The Orange data frame has 35 rows and 3 columns of records of the growth of orange trees.
This makes it easier to produce citations and metadata suitable for repositories like Zenodo or Dataverse. See more in Modernising Citation Metadata in R: Introducing bibrecord.
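For comparison, this is roughly what base R's bibentry() (from the utils package) records on its own. It is a minimal sketch that reuses the fields of the Orange record shown above; the choice of `bibtype = "Misc"` and passing `publisher` as an extra field are assumptions made for illustration, and nothing here calls the bibrecord() API.

```r
library(utils)

# A base R citation object built from the fields of the Orange record.
# "Misc" is an illustrative choice: this entry type has no mandatory fields.
orange_citation <- bibentry(
  bibtype   = "Misc",
  title     = "Growth of Orange Trees",
  author    = c(person("N.R.", "Draper"), person("H", "Smith")),
  publisher = "Wiley",
  year      = "1998"
)

# Human-readable citation text
print(orange_citation, style = "text")

# The same entry rendered as BibTeX
toBibtex(orange_citation)
```

A plain bibentry is oriented toward BibTeX entry types, which is the starting point that bibrecord() builds on for repository-style metadata.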
dataset_df()
A semantic wrapper around data.frame or tibble, aligning with SDMX’s data cube model:
See more in Why Semantics Matter for R Data Frames.
my_data <- dataset_df(
  country = defined(
    c("AD", "LI"),
    concept = "http://data.europa.eu/bna/c_6c2bb82d"
  ),
  gdp = defined(
    c(3897, 7365),
    label = "GDP",
    unit = "million euros"
  ),
  dataset_bibentry = datacite(
    Title = "GDP Data for Small Countries",
    Description = "Example Dataset for the dataset package",
    Creator = person("Jane", "Doe"),
    Publisher = "Open Data Institute",
    Rights = "CC0",
    Language = "en"
  )
)
head(my_data)
#>
#>
#> rowid country gdp
#> <hvn_lbl_> <hvn_lbl_> <hvn_lbl_>
#> 1 eg:1 AD 3897
#> 2 eg:2 LI 7365
as_datacite(my_data)
#> DataCite Metadata Record
#> --------------------------
#> Title: GDP Data for Small Countries
#> Creator(s): Jane Doe
#> Contributor(s): :unas
#> Identifier: :tba
#> Publisher: Open Data Institute
#> Year: :tba
#> Language: en
#> Description: Example Dataset for the dataset package
We welcome contributions and discussion!
This project adheres to the rOpenSci Code of Conduct. By participating, you are expected to uphold these guidelines.