“A dataset is an identifiable collection of data available for access or download in one or more formats.” ISO/IEC 20546
The dataset
package is designed to enhance the
semantic richness of R datasets, particularly those
structured as data.frame
or tibble
objects.
Its goal is to promote reusability and interoperability by enabling tidy
datasets to carry rich metadata — from the start.
Through practical iterations, it became evident that the structure of
a dataset cannot be separated from its purpose. As a
result, the dataset
package has evolved into a family of
interrelated tools that support semantic clarity at both the column and
dataset level.
data.frame(
geo = c("LI", "SM"),
CPI = c("0.8", "0.9"),
GNI = c("8976", "9672")
)
#> geo CPI GNI
#> 1 LI 0.8 8976
#> 2 SM 0.9 9672
At first glance, this dataset appears tidy — columns are variables, rows are observations. But is it interpretable?
geo
: This might refer to geography — but is “LI”
Liechtenstein? What about “SM”? Is this ISO alpha-2? Eurostat’s codes?
World Bank’s?
CPI
: Is this the Consumer Price Index, or the
Corruption Perceptions Index? Something else entirely?
GNI
: It could mean Gross National Income or Global
Nutrition Index — and even if it’s GNI, is it measured in dollars,
euros, or something else?
A tidy structure doesn’t guarantee semantic clarity. Without metadata, every column is open to misinterpretation.
library(dataset)
options(scipen = 4)
small_country_dataset <- dataset_df(
country_name = defined(
c("AD", "LI"),
concept = "http://data.europa.eu/bna/c_6c2bb82d",
namespace = "https://www.geonames.org/countries/$1/"
),
gdp = defined(
c(3897, 7365),
label = "Gross Domestic Product",
unit = "million dollars",
concept = "http://data.europa.eu/83i/aa/GDP"
),
population = defined(
c(77543, 40015),
label = "Population",
concept = "http://data.europa.eu/bna/c_f2b50efd"
),
dataset_bibentry = dublincore(
creator = person(given = "Jane", family = "Doe"),
title = "Small Country Dataset",
publisher = "Reprex"
)
)
small_country_dataset$gdp_capita <- defined(
small_country_dataset$gdp * 1e6 / small_country_dataset$population,
unit = "dollar",
label = "GDP Per Capita"
)
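As a quick check (a base-R sketch; the attribute names "label" and "unit" reflect an assumption about how defined() stores its metadata internally), the derived column still carries its label and unit of measure:
# Inspect the metadata attached to the derived gdp_capita column.
# The attribute names "label" and "unit" are assumed here for illustration.
attributes(small_country_dataset$gdp_capita)[c("label", "unit")]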
The interoperability and future
reusability of data depends on the quality and presence of
metadata — not just the structure of the data itself. The
dataset
package captures metadata as seamlessly and
non-intrusively as possible.
It covers:
Variable-level metadata (labels, units of measure, concepts, and namespaces) attached with defined()
Dataset-level bibliographic metadata created with dublincore() or datacite()
Provenance information recorded in the object’s attributes
Let’s take a look at the limitations of two tidy datasets from an interoperability and reusability point of view:
library(tibble)
library(dplyr)
# Dataset D (GDP in billions of USD)
D <- tibble(
year = c(2020, 2020, 2021, 2021, 2022, 2022),
geocode = c("USA", "CAN", "USA", "CAN", "USA", "CAN"),
country_name = c("United States", "Canada", "United States", "Canada", "United States", "Canada"),
GDP = c(21000, 2000, 22000, 2100, 23000, 2200) # GDP in billions of USD
)
# Dataset E (GDP in billions of EUR)
E <- tibble(
year = c(2020, 2020, 2021, 2021, 2022, 2022),
geocode = c("USA", "FRA", "USA", "FRA", "USA", "FRA"),
country_name = c("United States", "France", "United States", "France", "United States", "France"),
GDP = c(18000, 2500, 19000, 2600, 20000, 2700) # GDP in billions of EUR
)
These datasets are tidy, and from a technical perspective, we can easily combine them:
full_join(D, E)
#> Joining with `by = join_by(year, geocode, country_name, GDP)`
#> # A tibble: 12 × 4
#> year geocode country_name GDP
#> <dbl> <chr> <chr> <dbl>
#> 1 2020 USA United States 21000
#> 2 2020 CAN Canada 2000
#> 3 2021 USA United States 22000
#> 4 2021 CAN Canada 2100
#> 5 2022 USA United States 23000
#> 6 2022 CAN Canada 2200
#> 7 2020 USA United States 18000
#> 8 2020 FRA France 2500
#> 9 2021 USA United States 19000
#> 10 2021 FRA France 2600
#> 11 2022 USA United States 20000
#> 12 2022 FRA France 2700
This join is syntactically valid — but semantically misleading.
One column contains GDP values in USD, the other in EUR.
There’s no indication of this difference in the dataset itself.
Anyone reusing the dataset (especially if it’s saved to CSV or shared without code comments) could misinterpret it — perhaps fatally in an analysis.
Even in R, the metadata (e.g. currency) is detached from the values.
This problem worsens when datasets are consumed by users who do not read
R code or access the original documentation. Worse still, variables like
CPI
might refer to the Consumer Price Index or the
Corruption Perceptions Index — and nothing in the dataset
guarantees which one was meant.
The dataset
package introduces defined()
vectors, which carry metadata like units, variable labels, namespaces,
and definitions. When combining such vectors, this metadata is checked
for compatibility.
Our strictly defined vectors check whether the variable label, unit, namespace, and definition match. If these attributes are missing, concatenation is possible; when they are present, they must agree.
For example, you cannot concatenate GDP vectors denominated in different currencies, such as USD and EUR values:
GDP_D <- defined(c(21000, 2000, 22000, 2100, 23000, 2200), unit = "M_USD")
GDP_E <- defined(c(18000, 2500, 19000, 2600, 20000, 2700), unit = "M_EUR")
c(GDP_D, GDP_E)
#> Error in c.haven_labelled_defined(GDP_D, GDP_E) :
#>   c.haven_labelled_defined(x,y): x,y must have no unit or the same unit
Concatenation is only allowed when the metadata matches; in this case, the unit of measure (more precisely, the monetary unit) must be the same:
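A minimal counter-example (a sketch; the EUR figures for dataset D below are made up purely for illustration) shows that concatenation succeeds when both vectors carry the same unit:
# Both vectors share the unit "M_EUR", so their metadata is compatible
# and c() succeeds. The GDP_D_eur values are illustrative, not real data.
GDP_D_eur <- defined(c(19500, 1850, 20400, 1950, 21300, 2050), unit = "M_EUR")
c(GDP_D_eur, GDP_E)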
Many metadata packages in R aim to enrich datasets after
they’re built — when preparing them for publication, export, or
external documentation. In contrast, the dataset
package
embeds semantics at the moment of dataset creation.
This early intervention:
Adds value during exploratory and analytical work
Avoids costly ambiguity later
Makes R-native datasets more robust and shareable
Enables integration with packages like frictionless, dataspice, or rdflib (the latter two are intended for deep future integration)
This design aligns with the European Interoperability Framework (EIF), which outlines four levels of interoperability:
Legal
Organisational
Semantic
Technical
While many R packages address the technical layer
(e.g. through standards-based file formats), and some touch on legal
metadata, the dataset
family places emphasis on the
organisational and semantic layers — capturing
meaning, context, and responsibility from the inside out.
While tidy tabular data and graph-based data may share similar metadata needs, their workflows — and the assumptions they carry — differ considerably.
Carl Boettiger and others have shown how tidy data frames can be represented as RDF triples. In fact, serialising long-format tabular data into RDF is now a common approach for semantic data exchange. However, the origin of the data — whether from a graph, tabular database, or API — strongly influences its semantic shape and metadata requirements.
To support this diversity of reuse, the dataset
package
is complemented by two specialised extensions:
datacube: Statistical Data and Metadata Exchange
This planned package focuses on interoperability with SDMX (Statistical Data and Metadata eXchange), a standard adopted by statistical offices such as Eurostat and the World Bank.
Target user: Analysts working with structured tabular indicators and hierarchies
Goal: Enable analysis and transformation within R, and export to SDMX-compatible formats
Design: Enforces strict data.frame
structure, linked
concepts, units, and classifications aligned to SDMX
wbdataset: Knowledge Graph Compatibility
This evolving package supports interoperability with Wikidata and the Wikibase Data Model, which underpin the world’s largest open knowledge graph.
Target user: Analysts working with Wikidata or custom Wikibase instances
Goal: Enable two-way workflows — from SPARQL query to enriched analysis and back to triple-contributions
Design: Facilitates conversion between tidy data and semantic triples
rdflib: Native Graph Integration
For graph-first use cases, the excellent rdflib package provides RDF parsing and serialisation within R. dataset is fully compatible and designed to work with rdflib, allowing metadata-enriched dataset_df objects to serve as staging grounds for RDF generation.
The dataset
package is based on a simple principle:
Data and metadata should live together.
This is achieved by using R’s powerful (but underused)
attributes()
mechanism to store metadata directly
on vectors and data frames. Attributes are preserved in
.rds
and .rda
formats, ensuring metadata
travels with the data through most R workflows.
D_enriched <- dataset_df(
year = c(2020, 2020, 2021, 2021, 2022, 2022),
geocode = c("USA", "CAN", "USA", "CAN", "USA", "CAN"),
country_name = c("United States", "Canada", "United States", "Canada", "United States", "Canada"),
GDP = c(21000, 2000, 22000, 2100, 23000, 2200),
dataset_bibentry = dataset::dublincore(
title = "North American GDP Dataset",
description = "Dataset containing GDP data for North American countries.",
creator = person("Daniel", "Antal"), # Replace with the actual creator
publisher = "Reprex",
dataset_date = Sys.Date(),
subject = "GDP"
)
)
D_enriched
#> Antal (2025): North American GDP Dataset [dataset]
#> rowid year geocode country_name GDP
#> <defined> <dbl> <chr> <chr> <dbl>
#> 1 eg:1 2020 USA United States 21000
#> 2 eg:2 2020 CAN Canada 2000
#> 3 eg:3 2021 USA United States 22000
#> 4 eg:4 2021 CAN Canada 2100
#> 5 eg:5 2022 USA United States 23000
#> 6 eg:6 2022 CAN Canada 2200
Other packages (like dataspice
) also aim to improve
metadata capture. But many of them:
Store metadata in separate CSV files
Focus on publication-time workflows
Leave room for detachment or mismatch between data and metadata
In contrast, dataset
stores metadata in the
object itself, alongside the data, from the moment of creation.
This ensures:
Tighter integration between metadata and data
Metadata survives transformations and joins
Continuous improvement of metadata during analysis
The enriched dataset gives us more information about the organisation of the dataset as a whole, but still suffers from a loose definition of its structural elements, i.e., the rows and the columns. We will come back to this in the following sections.
The dataset
package provides functions with sensible
defaults and structures compliant with international metadata standards.
The dublincore() and datacite() functions create enhanced bibentry objects, and the package offers further functions to change the elements of these objects (e.g., the publication year or the names of the contributors). These functions make it easy to add metadata about the dataset as a whole (i.e., the overall organization of the data, not individual rows or columns). They also establish a foundation for adding metadata about the internal structure of the dataset, such as the meaning and intended use of rows and columns.
Currently, the dataset package’s approach to row/column-level metadata differs depending on whether the data is intended for the datacube or the wbdataset workflow. This is because datasets reused in tabular or graph forms have distinct semantic requirements, which necessitate different function interfaces even if the underlying metadata needs are similar. Future development may address these differences more uniformly.
The dataset
package shares similar goals with the
rOpenSci dataspice
package, particularly concerning tabular
data. dataspice
focuses on making datasets more
discoverable online by adding Schema.org metadata, a lightweight
ontology used by web search engines. A key interoperability goal for
dataset
is to capture all the metadata that
dataspice
utilizes throughout the analytical
workflow and store it directly within the R data.frame
’s
attributes.
While dataspice encourages users to provide this metadata separately in a CSV file (a convenient and simple approach), this method has two critical weaknesses:
The metadata file can become detached from, or fall out of sync with, the data it describes.
The metadata is typically added only at publication time, so it cannot support, or be checked during, the analytical workflow itself.
Knowledge graphs offer a more robust solution by updating data and
metadata simultaneously, ensuring consistency and enforcing strict
metadata definitions. The dataset
package aims to promote
continuous metadata collection and storage within the R object itself,
saving it alongside the data. R’s ability to store rich metadata in
attributes makes saving in .rda
or .rds
files
a compelling option, even if these formats aren’t universally
interoperable.
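A minimal base-R sketch of this principle (the "unit" attribute below is hypothetical): any attribute attached to a vector survives a round trip through an .rds file, so metadata stored this way travels with the data.
# Attach a (hypothetical) unit attribute, save to .rds, and read it back:
# the attribute is still there after the round trip.
x <- c(3897, 7365)
attr(x, "unit") <- "million dollars"
tmp <- tempfile(fileext = ".rds")
saveRDS(x, tmp)
attributes(readRDS(tmp))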
Inspired by knowledge graphs, dataset
extends the use of
attributes to include graph-like metadata, such as recording the
dataset’s provenance: who downloaded the original data, who performed
manipulations, which R packages (“software agents”) were used, and so
on.
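To see what is stored, one can inspect the attributes of the enriched object created earlier; the exact attribute names are an implementation detail of the package and may differ between versions.
# List the attributes that hold the dataset-level metadata and provenance
# (names may vary between versions of the dataset package).
names(attributes(D_enriched))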
Tidy data principles, well-established among R users, promote data reusability through a simple structure: rows represent observations, and columns represent variables with meaningful names. This aligns with third normal form (3NF) in relational databases and, as Carl Boettiger illustrates, can also represent graph entries in a 3-column long format (subject–predicate–object). However, two key limitations require going beyond standard tidy principles:
Column names and labels alone do not carry enough meaning: without definitions, units, and namespaces they remain ambiguous.
Row identifiers are only unique within a single dataset, so combining data from many sources requires globally unique identifiers and a record of provenance.
While the haven
and labelled
packages in
the tidyverse provide enhanced column labels (similar to SPSS and
STATA), they are insufficient for complex workflows involving numerous
datasets. dataset
addresses these limitations in two
ways:
The defined() class, which inherits from labelled, adds a definition and a namespace to labels, enhancing their meaning and meeting the requirements for linked open data.
dataset records data provenance (who downloaded, manipulated, etc.) in the attributes, using a simplified version of the PROV model and PROV-O ontology. (Further development of this feature is planned for a separate package.)
The defined() class also allows extending tidyverse row IDs with a namespace or prefix, enabling conversion to globally unique identifiers (GUIDs), critical when working with data from many sources.
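As a small illustration (a sketch reusing the concept URI and geonames namespace pattern from the earlier example), a defined() vector can carry a label, a concept, and a namespace that turns plain codes into resolvable identifiers:
# Country codes that carry their own label, concept, and namespace.
# The concept URI and namespace are reused from the earlier example for illustration.
geo <- defined(
  c("AD", "LI"),
  label = "Country code",
  concept = "http://data.europa.eu/bna/c_6c2bb82d",
  namespace = "https://www.geonames.org/countries/$1/"
)
attributes(geo)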
Beyond row and column identification, dataset
addresses
the question of when to define a new dataset, a challenge only
partially answered in earlier versions and now delegated to the
datacube
package. This question hinges on the concept of a
datacube, a structural enhancement of tidy data. Datacubes
categorize variables into:
Dimensions, which identify the observations (for example the year and the country in the datasets below)
Measures, the observed values themselves (for example GDP)
Attributes, which qualify the measures (for example the unit of measure or currency)
library(tibble)
# Dataset D (GDP in billions of USD)
D <- tibble(
year = c(2020, 2020, 2021, 2021, 2022, 2022),
geocode = c("USA", "CAN", "USA", "CAN", "USA", "CAN"),
country_name = c(
"United States", "Canada", "United States",
"Canada", "United States", "Canada"
),
GDP = c(21000, 2000, 22000, 2100, 23000, 2200) # GDP in billions of USD
)
# Dataset E (GDP in billions of EUR)
E <- tibble(
year = c(2020, 2020, 2021, 2021, 2022, 2022),
geocode = c("USA", "FRA", "USA", "FRA", "USA", "FRA"),
country_name = c("United States", "France", "United States", "France", "United States", "France"),
GDP = c(18000, 2500, 19000, 2600, 20000, 2700) # GDP in billions of EUR
)
joined_data <- dplyr::full_join(D, E)
#> Joining with `by = join_by(year, geocode, country_name, GDP)`
joined_data
#> # A tibble: 12 × 4
#> year geocode country_name GDP
#> <dbl> <chr> <chr> <dbl>
#> 1 2020 USA United States 21000
#> 2 2020 CAN Canada 2000
#> 3 2021 USA United States 22000
#> 4 2021 CAN Canada 2100
#> 5 2022 USA United States 23000
#> 6 2022 CAN Canada 2200
#> 7 2020 USA United States 18000
#> 8 2020 FRA France 2500
#> 9 2021 USA United States 19000
#> 10 2021 FRA France 2600
#> 11 2022 USA United States 20000
#> 12 2022 FRA France 2700
The datacube
package emphasizes the distinction between
these variable types. For example, joining datasets with shared row IDs
and columns isn’t meaningful without considering units of measure (e.g.,
euros vs. dollars). Dimensions determine when a new dataset is needed.
dataset
, through the defined()
class, allows
specifying units of measure, while datacube
handles the
broader needs of attributes, measures, and dimensions.
The rOpenSci rdflib
package, inspired by the Python library of the same name, provides powerful tools for reading and
writing graph data. We believe dataset
and
rdflib
are complementary and should be used together. While
there’s a small overlap (inspired by an internal rdflib
function for working with NQuads, a form of RDF), we’ve avoided making
rdflib
a direct dependency of dataset
.
However, for users working with RDF-annotated graphs, we strongly
recommend using dataset
in conjunction with
rdflib
for importing, exporting, and exchanging data.
First, let us see how we would solve the ambiguity problems without
dataset
, relying only on tidyverse and
rdflib.
library(tibble)
library(dplyr)
library(tidyr)
# Dataset D (GDP in billions of USD)
D <- tibble(
year = c(2020, 2020, 2021, 2021, 2022, 2022),
geocode = c("USA", "CAN", "USA", "CAN", "USA", "CAN"),
country_name = c(
"United States", "Canada", "United States",
"Canada", "United States", "Canada"
),
GDP = c(21000, 2000, 22000, 2100, 23000, 2200) # GDP in billions of USD
)
# Dataset E (GDP in billions of EUR)
E <- tibble(
year = c(2020, 2020, 2021, 2021, 2022, 2022),
geocode = c("USA", "FRA", "USA", "FRA", "USA", "FRA"),
country_name = c(
"United States", "France", "United States",
"France", "United States", "France"
),
GDP = c(18000, 2500, 19000, 2600, 20000, 2700) # GDP in billions of EUR
)
# Combine datasets
combined <- bind_rows(D, E)
# Convert all to character *before* pivoting
combined_char <- combined %>%
mutate(across(everything(), as.character))
# Rename geocode to subject
combined_subject <- combined_char %>%
rename(subject = geocode)
# Pivot longer, excluding the subject (triples are named, but predicates are not renamed yet)
named_triples <- combined_subject %>%
pivot_longer(
cols = c(year, country_name, GDP),
names_to = "predicate",
values_to = "object"
)
# Replace geocodes with Geonames IDs and add datatype annotations
geonames_mapping <- tibble(
subject = c("USA", "CAN", "FRA"),
geonames_id = c("6252001", "6251999", "2988317")
)
rdf_triples <- named_triples %>%
left_join(geonames_mapping, by = "subject") %>%
mutate(
subject = paste0("geonames:", geonames_id),
predicate = case_when(
predicate == "year" ~ "sdmx:TIME_PERIOD",
predicate == "country_name" ~ "schema:name",
predicate == "GDP" ~ "sdmx:OBS_VALUE",
TRUE ~ predicate
),
object = case_when(
predicate == "sdmx:TIME_PERIOD" ~ paste0("\"", object, "\"^^<http://www.w3.org/2001/XMLSchema#gYear>"),
predicate == "sdmx:OBS_VALUE" ~ paste0("\"", object, "\"^^<http://www.w3.org/2001/XMLSchema#double>"),
TRUE ~ object
)
) %>%
select(subject, predicate, object)
print("RDF Triples (subject, predicate, object):")
#> [1] "RDF Triples (subject, predicate, object):"
print(rdf_triples)
#> # A tibble: 36 × 3
#> subject predicate object
#> <chr> <chr> <chr>
#> 1 geonames:6252001 sdmx:TIME_PERIOD "\"2020\"^^<http://www.w3.org/2001/XMLSche…
#> 2 geonames:6252001 schema:name "United States"
#> 3 geonames:6252001 sdmx:OBS_VALUE "\"21000\"^^<http://www.w3.org/2001/XMLSch…
#> 4 geonames:6251999 sdmx:TIME_PERIOD "\"2020\"^^<http://www.w3.org/2001/XMLSche…
#> 5 geonames:6251999 schema:name "Canada"
#> 6 geonames:6251999 sdmx:OBS_VALUE "\"2000\"^^<http://www.w3.org/2001/XMLSche…
#> 7 geonames:6252001 sdmx:TIME_PERIOD "\"2021\"^^<http://www.w3.org/2001/XMLSche…
#> 8 geonames:6252001 schema:name "United States"
#> 9 geonames:6252001 sdmx:OBS_VALUE "\"22000\"^^<http://www.w3.org/2001/XMLSch…
#> 10 geonames:6251999 sdmx:TIME_PERIOD "\"2021\"^^<http://www.w3.org/2001/XMLSche…
#> # ℹ 26 more rows
The frictionless package has a somewhat similar approach, but it does not build on a formal semantic metadata standard or a strict serialisation standard.
Both rdflib
and frictionless
provide strong
foundations for exporting datasets, but they differ in scope and
assumptions. The rdflib
package utilises the W3C RDF standard, which is also used by SDMX and many open science
repositories. It supports multiple serialisation forms, including
JSON-LD. By contrast, frictionless
relies on JSON
containers but lacks a formal ontology layer.
Both packages can preserve technical metadata, but neither guides users toward a consistent semantic model. They provide the vehicle, but not the fuel.
The dataset
package — especially through its downstream
extensions like datacube
and wbdataset
— aims
to fill this gap, offering metadata curation aligned with high levels of
organisational and semantic interoperability. datacube
focuses on SDMX-style statistical datasets, while wbdataset
helps users adopt the Wikibase data model and workflow.
Unlike rdflib
, which requires RDF/OWL expertise, and
frictionless
, which delegates semantics to arbitrary field
descriptions, dataset
provides structure and guidance
without requiring users to leave the R environment.
Let us return to the ISO/IEC 20546 definition used as a motto:
“A dataset is an identifiable collection of data available for access or
download in one or more formats.”
Datasets should be published with their future reuse in mind. Several R packages support dataset publication, each targeting different use cases and levels of interoperability:
The rdflib package envisions reuse in fully semantic environments, using RDF as a standard.
The dataspice package focuses on enhancing discoverability by adding Schema.org metadata to HTML documentation.
The frictionless package promotes packaging metadata (in JSON) and data (in CSV) for structured publication.
The wbdataset package (in development) supports publication to Wikibase, which powers Wikidata.
The dataset package is designed to be compatible with all of these tools. It enables seamless export to:
rdflib for RDF-based graph publication
dataspice for search engine–optimised publishing
wbdataset for contribution to the Wikibase ecosystem
Although frictionless is currently not a direct target, it can be supported if user demand arises.
Wikibase is an especially interesting target. It supports RDF exports
and shares much with triplestores conceptually, but its design allows
users to work with graph-structured data without requiring deep
knowledge of RDF or SPARQL. While the WikibaseR
package is
no longer maintained, we intend to co-develop wbdataset
and
a new WikibaseR
to restore seamless integration with
Wikibase.
Below is a dataset of musical artists from small countries. It
demonstrates how dataset
handles dataset-level metadata
(e.g. rights, authorship, provenance) and row-level annotations without
requiring external formats or technologies.
small_country_musicians <- data.frame(
qid = c("Q275912", "Q116196078"),
artist_name = defined(
c("Marta Roure", "wavvyboi"),
concept = "https://www.wikidata.org/wiki/Property:P2093"
),
location = defined(
c("Andorra", "Lichtenstein"),
concept = "https://www.wikidata.org/wiki/Property:P276"
),
date_of_birth = defined(
c(as.Date("1981-01-16"), as.Date("1998-04-28")),
concept = "https://www.wikidata.org/wiki/Property:P569"
)
)
small_country_musicians$age <- defined(
2024 - as.integer(
substr(
as.character(small_country_musicians$date_of_birth),
1, 4
)
),
label = "Age in 2024"
)
This dataset includes semantic definitions (linked to Wikidata
properties), automatically tracks provenance, and can be exported using
any of the supported tools. By embedding rich metadata directly in R
objects, the dataset
package ensures that
publication-readiness and reuse potential are built in from the
start.