Motivation

“A dataset is an identifiable collection of data available for access or download in one or more formats.” ISO/IEC 20546

library(dataset)

The dataset package is designed to enhance the semantic richness of R datasets, particularly those structured as data.frame or tibble objects. Its goal is to promote reusability and interoperability by enabling tidy datasets to carry rich metadata — from the start.

Through practical iterations, it became evident that the structure of a dataset cannot be separated from its purpose. As a result, the dataset package has evolved into a family of interrelated tools that support semantic clarity at both the column and dataset level.

Problem statement

data.frame(
  geo = c("LI", "SM"),
  CPI = c("0.8", "0.9"),
  GNI = c("8976", "9672")
)
#>   geo CPI  GNI
#> 1  LI 0.8 8976
#> 2  SM 0.9 9672

At first glance, this dataset appears tidy — columns are variables, rows are observations. But is it interpretable?

A tidy structure doesn’t guarantee semantic clarity. Without metadata, every column is open to misinterpretation.

options(sciphen = 4)
small_country_dataset <- dataset_df(
  country_name = defined(
    c("AD", "LI"),
    concept = "http://data.europa.eu/bna/c_6c2bb82d",
    namespace = "https://www.geonames.org/countries/$1/"
  ),
  gdp = defined(
    c(3897, 7365),
    label = "Gross Domestic Product",
    unit = "million dollars",
    concept = "http://data.europa.eu/83i/aa/GDP"
  ),
  population = defined(
    c(77543, 40015),
    label = "Population",
    concept = "http://data.europa.eu/bna/c_f2b50efd"
  ),
  dataset_bibentry = dublincore(
    creator = person(given = "Jane", family = "Doe"),
    title = "Small Country Dataset",
    publisher = "Reprex"
  )
)
small_country_dataset$gdp_capita <- defined(
  small_country_dataset$gdp * 1e6 / small_country_dataset$population,
  unit = "dollar",
  label = "GDP Per Capita"
)

The interoperability and future reusability of data depends on the quality and presence of metadata — not just the structure of the data itself. The dataset package captures metadata as seamlessly and non-intrusively as possible.

It covers:

Limits on reusability

Let’s take a look at the limitations of two, tidy datasets from an interoperability and reusability point of view:

library(tibble)

# Dataset D (GDP in billions of USD)
D <- tibble(
  year = c(2020, 2020, 2021, 2021, 2022, 2022),
  geocode = c("USA", "CAN", "USA", "CAN", "USA", "CAN"),
  country_name = c("United States", "Canada", "United States", "Canada", "United States", "Canada"),
  GDP = c(21000, 2000, 22000, 2100, 23000, 2200) # GDP in billions of USD
)

# Dataset E (GDP in billions of EUR)
E <- tibble(
  year = c(2020, 2020, 2021, 2021, 2022, 2022),
  geocode = c("USA", "FRA", "USA", "FRA", "USA", "FRA"),
  country_name = c("United States", "France", "United States", "France", "United States", "France"),
  GDP = c(18000, 2500, 19000, 2600, 20000, 2700) # GDP in billions of EUR
)

These datasets are tidy, and from a technical perspective, we can easily combine them:

full_join(D, E)
#> Joining with `by = join_by(year, geocode, country_name, GDP)`
#> # A tibble: 12 × 4
#>     year geocode country_name    GDP
#>    <dbl> <chr>   <chr>         <dbl>
#>  1  2020 USA     United States 21000
#>  2  2020 CAN     Canada         2000
#>  3  2021 USA     United States 22000
#>  4  2021 CAN     Canada         2100
#>  5  2022 USA     United States 23000
#>  6  2022 CAN     Canada         2200
#>  7  2020 USA     United States 18000
#>  8  2020 FRA     France         2500
#>  9  2021 USA     United States 19000
#> 10  2021 FRA     France         2600
#> 11  2022 USA     United States 20000
#> 12  2022 FRA     France         2700

This join is syntactically valid — but semantically misleading.

Anyone reusing the dataset (especially if it’s saved to CSV or shared without code comments) could misinterpret it — perhaps fatally in an analysis.

Even in R, the metadata (e.g. currency) is detached from the values. This problem worsens when datasets are consumed by users who do not read R code or access the original documentation. Worse still, variables like CPI might refer to the Consumer Price Index or the Corruption Perceptions Index — and nothing in the dataset guarantees which one was meant.

Enforcing Semantics

The dataset package introduces defined() vectors, which carry metadata like units, variable labels, namespaces, and definitions. When combining such vectors, this metadata is checked for compatibility.

You cannot concatenate GDP vectors that are denominated in different currencies:

Our strictly defined vectors check if the variable label, unit, namespace and definition match. If they are missing, the concatenation is possible, but when they are present, they must match.

You cannot concatenate USD and EUR denominated values:

GDP_D <- defined(c(21000, 2000, 22000, 2100, 23000, 2200), unit = "M_USD")
GDP_E <- defined(c(18000, 2500, 19000, 2600, 20000, 2700), unit = "M_EUR")
c(GDP_D, GDP_E)
message("Error in c.haven_labelled_defined(GDP_D, GDP_E) :
  c.haven_labelled_defined(x,y): x,y must have no unit or the same unit")
#> Error in c.haven_labelled_defined(GDP_D, GDP_E) :
#>   c.haven_labelled_defined(x,y): x,y must have no unit or the same unit

Concatenation is only allowed when metadata , in this case, the unit of measure, or more precisely, the monetary unit matches:

GDP_F <- defined(c(21000, 2000, 22000, 2100, 23000, 2200), unit = "M_EUR")
GDP_E <- defined(c(18000, 2500, 19000, 2600, 20000, 2700), unit = "M_EUR")
c(GDP_F, GDP_E)
#> x
#> Measured in M_EUR 
#>  [1] 21000  2000 22000  2100 23000  2200 18000  2500 19000  2600 20000  2700

Reusability from the Start

Many metadata packages in R aim to enrich datasets after they’re built — when preparing them for publication, export, or external documentation. In contrast, the dataset package embeds semantics at the moment of dataset creation.

This early intervention:

This design aligns with the European Interoperability Framework (EIF), which outlines four levels of interoperability:

  1. Legal

  2. Organisational

  3. Semantic

  4. Technical

While many R packages address the technical layer (e.g. through standards-based file formats), and some touch on legal metadata, the dataset family places emphasis on the organisational and semantic layers — capturing meaning, context, and responsibility from the inside out.

Tidy Data and Graphs

While tidy tabular data and graph-based data may share similar metadata needs, their workflows — and the assumptions they carry — differ considerably.

Carl Boettiger and others have shown how tidy data frames can be represented as RDF triples. In fact, serialising long-format tabular data into RDF is now a common approach for semantic data exchange. However, the origin of the data — whether from a graph, tabular database, or API — strongly influences its semantic shape and metadata requirements.

To support this diversity of reuse, the dataset package is complemented by two specialised extensions:

datacube: Statistical Data and Metadata Exchange

This planned package focuses on interoperability with SDMX (Statistical Data and Metadata eXchange), a standard adopted by statistical offices such as Eurostat and the World Bank.

wbdataset: Knowledge Graph Compatibility

This evolving package supports interoperability with Wikidata and the Wikibase Data Model, which underpin the world’s largest open knowledge graph.

rdflib: Native Graph Integration

For graph-first use cases, the excellent rdflib package provides RDF parsing and serialisation within R. dataset is fully compatible and designed to work with rdflib, allowing metadata-enriched dataset_df objects to serve as staging grounds for RDF generation.

Our Approach: Metadata from the Start

The dataset package is based on a simple principle:

Data and metadata should live together.

This is achieved by using R’s powerful (but underused) attributes() mechanism to store metadata directly on vectors and data frames. Attributes are preserved in .rds and .rda formats, ensuring metadata travels with the data through most R workflows.

D_enriched <- dataset_df(
  year = c(2020, 2020, 2021, 2021, 2022, 2022),
  geocode = c("USA", "CAN", "USA", "CAN", "USA", "CAN"),
  country_name = c("United States", "Canada", "United States", "Canada", "United States", "Canada"),
  GDP = c(21000, 2000, 22000, 2100, 23000, 2200),
  dataset_bibentry = dataset::dublincore(
    title = "North American GDP Dataset",
    description = "Dataset containing GDP data for North American countries.",
    creator = person("Daniel", "Antal"), # Replace with the actual creator
    publisher = "Reprex",
    dataset_date = Sys.Date(),
    subject = "GDP"
  )
)

D_enriched
#> Antal (2025): North American GDP Dataset [dataset]
#>   rowid      year geocode country_name    GDP 
#>   <defined> <dbl> <chr>   <chr>         <dbl>
#> 1 eg:1       2020 USA     United States 21000
#> 2 eg:2       2020 CAN     Canada         2000
#> 3 eg:3       2021 USA     United States 22000
#> 4 eg:4       2021 CAN     Canada         2100
#> 5 eg:5       2022 USA     United States 23000
#> 6 eg:6       2022 CAN     Canada         2200

Other packages (like dataspice) also aim to improve metadata capture. But many of them:

In contrast, dataset stores metadata in the object itself, alongside the data, from the moment of creation. This ensures:

The enriched dataset gives us more information about the organisation of the dataset as a whole, but still suffers from a loose definition of its structural elements, i.e., the rows and the columns. We will come back to this in the following sections.

The dataset package provides functions with sensible defaults and structures compliant with international metadata standards. The dublincore() and datacite() functions create enhanced bibentry objects and offer many functions to change the elements of these objects (i.e., to change the publication year, the names of the contributors, etc.) These functions facilitate adding metadata about the dataset as a whole (i.e., the overall organization of the data, not individual rows or columns). It also establishes a foundation for adding metadata about the internal structure of the dataset, such as the meaning and intended use of rows and columns.

Currently, the dataset package’s approach to row/column-level metadata differs depending on whether the data is intended for a “datacube” or “wbdataset” (likely referring to a specific data format or use case). This is because datasets reused in tabular or graph forms have distinct semantic requirements, which necessitate different function interfaces even if the underlying metadata needs are similar. Future development may address these differences more uniformly.

Keep data and metadata together

The dataset package shares similar goals with the rOpenSci dataspice package, particularly concerning tabular data. dataspice focuses on making datasets more discoverable online by adding Schema.org metadata, a lightweight ontology used by web search engines. A key interoperability goal for dataset is to capture all the metadata that dataspice utilizes throughout the analytical workflow and store it directly within the R data.frame’s attributes.

While dataspice encourages users to provide this metadata separately in a CSV file (a convenient and simple approach), this method has two critical weaknesses:

  1. User error: Users may not complete the CSV file or might enter incorrect information.
  2. Synchronization issues: The released dataset and its metadata can become detached, lost, or fall out of sync due to later updates.

Knowledge graphs offer a more robust solution by updating data and metadata simultaneously, ensuring consistency and enforcing strict metadata definitions. The dataset package aims to promote continuous metadata collection and storage within the R object itself, saving it alongside the data. R’s ability to store rich metadata in attributes makes saving in .rda or .rds files a compelling option, even if these formats aren’t universally interoperable.

Inspired by knowledge graphs, dataset extends the use of attributes to include graph-like metadata, such as recording the dataset’s provenance: who downloaded the original data, who performed manipulations, which R packages (“software agents”) were used, and so on.

From tidy to tidier

Tidy data principles, well-established among R users, promote data reusability through a simple structure: rows represent observations, and columns represent variables with meaningful names. This aligns with the 3NF normal form in relational databases and, as Carl Boettiger illustrates, can also represent graph entries in a 3-column long format (subject–predicate–object). However, two key limitations require going beyond standard tidy principles:

While the haven and labelled packages in the tidyverse provide enhanced column labels (similar to SPSS and STATA), they are insufficient for complex workflows involving numerous datasets. dataset addresses these limitations in two ways:

The defined() class also allows extending tidyverse row IDs with a namespace or prefix, enabling conversion to globally unique identifiers (GUIDs), critical when working with data from many sources.

Beyond row and column identification, dataset addresses the question of when to define a new dataset, a challenge only partially answered in earlier versions and now delegated to the datacube package. This question hinges on the concept of a datacube, a structural enhancement of tidy data. Datacubes categorize variables into:

library(tibble)

# Dataset D (GDP in billions of USD)
D <- tibble(
  year = c(2020, 2020, 2021, 2021, 2022, 2022),
  geocode = c("USA", "CAN", "USA", "CAN", "USA", "CAN"),
  country_name = c(
    "United States", "Canada", "United States",
    "Canada", "United States", "Canada"
  ),
  GDP = c(21000, 2000, 22000, 2100, 23000, 2200) # GDP in billions of USD
)

# Dataset E (GDP in billions of EUR)
E <- tibble(
  year = c(2020, 2020, 2021, 2021, 2022, 2022),
  geocode = c("USA", "FRA", "USA", "FRA", "USA", "FRA"),
  country_name = c("United States", "France", "United States", "France", "United States", "France"),
  GDP = c(18000, 2500, 19000, 2600, 20000, 2700) # GDP in billions of EUR
)

joined_data <- dplyr::full_join(D, E)
#> Joining with `by = join_by(year, geocode, country_name, GDP)`
joined_data
#> # A tibble: 12 × 4
#>     year geocode country_name    GDP
#>    <dbl> <chr>   <chr>         <dbl>
#>  1  2020 USA     United States 21000
#>  2  2020 CAN     Canada         2000
#>  3  2021 USA     United States 22000
#>  4  2021 CAN     Canada         2100
#>  5  2022 USA     United States 23000
#>  6  2022 CAN     Canada         2200
#>  7  2020 USA     United States 18000
#>  8  2020 FRA     France         2500
#>  9  2021 USA     United States 19000
#> 10  2021 FRA     France         2600
#> 11  2022 USA     United States 20000
#> 12  2022 FRA     France         2700

The datacube package emphasizes the distinction between these variable types. For example, joining datasets with shared row IDs and columns isn’t meaningful without considering units of measure (e.g., euros vs. dollars). Dimensions determine when a new dataset is needed. dataset, through the defined() class, allows specifying units of measure, while datacube handles the broader needs of attributes, measures, and dimensions.

From tabular to graph data

The rOpenSci rdflib package, a wrapper for the Python library of the same name, provides powerful tools for reading and writing graph data. We believe dataset and rdflib are complementary and should be used together. While there’s a small overlap (inspired by an internal rdflib function for working with NQuads, a form of RDF), we’ve avoided making rdflib a direct dependency of dataset. However, for users working with RDF-annotated graphs, we strongly recommend using dataset in conjunction with rdflib for importing, exporting, and exchanging data.

First, let us see how we would solve the ambiguity problems without dataset, relying only on tidyverse and rdflib.

library(tibble)

# Dataset D (GDP in billions of USD)
D <- tibble(
  year = c(2020, 2020, 2021, 2021, 2022, 2022),
  geocode = c("USA", "CAN", "USA", "CAN", "USA", "CAN"),
  country_name = c(
    "United States", "Canada", "United States",
    "Canada", "United States", "Canada"
  ),
  GDP = c(21000, 2000, 22000, 2100, 23000, 2200) # GDP in billions of USD
)

# Dataset E (GDP in billions of EUR)
E <- tibble(
  year = c(2020, 2020, 2021, 2021, 2022, 2022),
  geocode = c("USA", "FRA", "USA", "FRA", "USA", "FRA"),
  country_name = c(
    "United States", "France", "United States",
    "France", "United States", "France"
  ),
  GDP = c(18000, 2500, 19000, 2600, 20000, 2700) # GDP in billions of EUR
)

# Combine datasets
combined <- bind_rows(D, E)

# Convert all to character *before* pivoting
combined_char <- combined %>%
  mutate(across(everything(), as.character))

# Rename geocode to subject
combined_subject <- combined_char %>%
  rename(subject = geocode)

# Pivot longer, excluding the subject (triples are named, but predicates are not renamed yet)
named_triples <- combined_subject %>%
  pivot_longer(
    cols = c(year, country_name, GDP),
    names_to = "predicate",
    values_to = "object"
  )

# Replace geocodes with Geonames IDs and add datatype annotations
geonames_mapping <- tibble(
  subject = c("USA", "CAN", "FRA"),
  geonames_id = c("6252001", "6251999", "2988317")
)

rdf_triples <- named_triples %>%
  left_join(geonames_mapping, by = "subject") %>%
  mutate(
    subject = paste0("geonames:", geonames_id),
    predicate = case_when(
      predicate == "year" ~ "sdmx:TIME_PERIOD",
      predicate == "country_name" ~ "schema:name",
      predicate == "GDP" ~ "sdmx:OBS_VALUE",
      TRUE ~ predicate
    ),
    object = case_when(
      predicate == "sdmx:TIME_PERIOD" ~ paste0("\"", object, "\"^^<http://www.w3.org/2001/XMLSchema#gYear>"),
      predicate == "sdmx:OBS_VALUE" ~ paste0("\"", object, "\"^^<http://www.w3.org/2001/XMLSchema#double>"),
      TRUE ~ object
    )
  ) %>%
  select(subject, predicate, object)

print("RDF Triples (subject, predicate, object):")
#> [1] "RDF Triples (subject, predicate, object):"
print(rdf_triples)
#> # A tibble: 36 × 3
#>    subject          predicate        object                                     
#>    <chr>            <chr>            <chr>                                      
#>  1 geonames:6252001 sdmx:TIME_PERIOD "\"2020\"^^<http://www.w3.org/2001/XMLSche…
#>  2 geonames:6252001 schema:name      "United States"                            
#>  3 geonames:6252001 sdmx:OBS_VALUE   "\"21000\"^^<http://www.w3.org/2001/XMLSch…
#>  4 geonames:6251999 sdmx:TIME_PERIOD "\"2020\"^^<http://www.w3.org/2001/XMLSche…
#>  5 geonames:6251999 schema:name      "Canada"                                   
#>  6 geonames:6251999 sdmx:OBS_VALUE   "\"2000\"^^<http://www.w3.org/2001/XMLSche…
#>  7 geonames:6252001 sdmx:TIME_PERIOD "\"2021\"^^<http://www.w3.org/2001/XMLSche…
#>  8 geonames:6252001 schema:name      "United States"                            
#>  9 geonames:6252001 sdmx:OBS_VALUE   "\"22000\"^^<http://www.w3.org/2001/XMLSch…
#> 10 geonames:6251999 sdmx:TIME_PERIOD "\"2021\"^^<http://www.w3.org/2001/XMLSche…
#> # ℹ 26 more rows

The frictionless package has a somewhat similar approach, but it does not support either a metadata standard or a strict serialisation standard.

Both rdflib and frictionless provide strong foundations for exporting datasets, but they differ in scope and assumptions. The rdflib package utilises the World Wide Web RDF standard, which is also used by SDMX and many open science repositories. It supports multiple serialisation forms, including JSON-LD. By contrast, frictionless relies on JSON containers but lacks a formal ontology layer.

Both packages can preserve technical metadata, but neither guides users toward a consistent semantic model. They provide the vehicle, but not the fuel.

The dataset package — especially through its downstream extensions like datacube and wbdataset — aims to fill this gap, offering metadata curation aligned with high levels of organisational and semantic interoperability. datacube focuses on SDMX-style statistical datasets, while wbdataset helps users adopt the Wikibase data model and workflow.

Unlike rdflib, which requires RDF/OWL expertise, and frictionless, which delegates semantics to arbitrary field descriptions, dataset provides structure and guidance without requiring users to leave the R environment.

Publishing the data

Let us return to the ISO/IEC 20546 definition used as a motto:
“A dataset is an identifiable collection of data available for access or download in one or more formats.”

Datasets should be published with their future reuse in mind. Several R packages support dataset publication, each targeting different use cases and levels of interoperability:

The dataset package is designed to be compatible with all of these tools. It enables seamless export to:

Although frictionless is currently not a direct target, it can be supported if user demand arises.

Wikibase is an especially interesting target. It supports RDF exports and shares much with triplestores conceptually, but its design allows users to work with graph-structured data without requiring deep knowledge of RDF or SPARQL. While the WikibaseR package is no longer maintained, we intend to co-develop wbdataset and a new WikibaseR to restore seamless integration with Wikibase.

Below is a dataset of musical artists from small countries. It demonstrates how dataset handles dataset-level metadata (e.g. rights, authorship, provenance) and row-level annotations without requiring external formats or technologies.

small_country_musicians <- data.frame(
  qid = c("Q275912", "Q116196078"),
  artist_name = defined(
    c("Marta Roure", "wavvyboi"),
    concept = "https://www.wikidata.org/wiki/Property:P2093"
  ),
  location = defined(
    c("Andorra", "Lichtenstein"),
    concept = "https://www.wikidata.org/wiki/Property:P276"
  ),
  date_of_birth = defined(
    c(as.Date("1981-01-16"), as.Date("1998-04-28")),
    concept = "https://www.wikidata.org/wiki/Property:P569"
  )
)

small_country_musicians$age <- defined(
  2024 - as.integer(
    substr(
      as.character(small_country_musicians$date_of_birth),
      1, 4
    )
  ),
  label = "Age in 2024"
)

This dataset includes semantic definitions (linked to Wikidata properties), automatically tracks provenance, and can be exported using any of the supported tools. By embedding rich metadata directly in R objects, the dataset package ensures that publication-readiness and reuse potential are built in from the start.

mirror server hosted at Truenetwork, Russian Federation.