The first CRAN release and rOpenSci review brought very useful experience and feedback. The dataset package had been defined with a very broad requirement. While the very general requirement setting has advantages, a clear disadvantage is that without a specific use case, it is difficult to raise enough user and co-developer interest.
In line with some of the legitimate criticism of version 0.1.0—0.3.1, I envision a dataset package that has more specific inheritance packages that work with datasets in a more specific use case.
It is probably too wide of a claim to create a package that will
bring the base R data.frame object in line with any disciplinary
requirements of datasets. The original concept closely followed the SDMX
(statistical) dataset definition to the extent that one reviewer
recommended the datacube
name for the package. In light of
some further use experiences, this is a valid criticism because datasets
used in digital humanities, for example, have slightly different
specification needs.
R is primarily a statistical environment and language; therefore, broad conformity with the SDMX statistical metadata standards is desirable. However, the dataset package should remain generic enough to support non-statistical datasets. Along these lines, the current aim is to triangulate three packages:
datacube
package, which follows more closely the SDMX
definition of a dataset and the more general, multi-dimensional datacube
definitionwbdataset
package, which follows more closely the
Wikibase Data Model that is increasingly used for digital collections
management and other scenarios for statistically not aggregated
datasets.dataset
package that is sufficiently generic that
both the datacube
and the wbdataset
package
rely on it as a joint dependency. Therefore, the plan is to relax some
SMDX definitions of datasets that are not very useful in non-statistical
applications. Such functionality can be removed from a later-developed
datacube
package.At the same time, I would like to co-develop the dataset package with
the wbdataset
package because the Wikibase Data Model is a
very well-defined semantic data model that could potentially create a
large enough user base and use case for the entire project.
Another important lesson was that the first version of the dataset package wanted to be so generally usable that it aimed for compatibility for base R data.frames, the tidyverse tibble modernisations of such data frames, and the data.table objects, which have their own user base and dependencies in many statistical applications. While such broad appeal and ambition should not excluded for the future, it would be a too significant undertaking to ensure that all functionality works with data.frames, tibble and data.tables. Whenever this is possible, this should remain so, but new developments should only follow the modern tidyverse tibbles.
Most R-user data scientists are familiar with the term tidy data. It is data ready for tabular analysis and easy to understand. It places measurements (variables) in columns, dividing each observation into a row. Column names convey the meaning of the variable.
You should wrangle your data in a tidy format because that is the prerequisite of most statistical analysis or visualisation with various algorithms, and it gives a logical place to each data point.
Making your data tidy is like hoovering your room. It will make your environment neat, but if you leave for months, it will be full of dustballs, insects and spiders, and you will not remember where you left your slippers. To allow us to return to work after months or years or to pass on our work to others, we need to provide more meaning (semantics) about a tidy dataset.
Consider the following simple data.frame:
data.frame(
geo = c("LI", "SM"),
CPI = c("0.8", "0.9"),
GNI = c("8976", "9672")
)
#> geo CPI GNI
#> 1 LI 0.8 8976
#> 2 SM 0.9 9672
This dataset is tidy. But it certainly could be improved!
geo
: you may figure out that geo means something to
do with geography (maybe countries, “LI” standing for Lichtenstein and
“SM” for San Marino), but even that, who knows? If you add Greece,
should use use “EL” like Eurostat or “GR” like the World Bank?
CPI
can stand for “consumer price index” or
“corruption perceptions index”. Or anything else!
GNI
can stand for “Gross National Income” and
“Global Nutrition Index”, often mentioned in the same context. If there
are numbers, do they mean a physical quantity or a currency? Is the
currency dollars or euros?
Adding such definitions to the dataset makes them far more accessible, reusable, and even findable. Many people may be looking for GNI in US dollars; using a currency cue for searching may bring in a new user.
Most researchers are familiar with the semantic needs to make a dataset findable; they may know how libraries or scientific repositories store authorship data or titles. Our package handles these types of metadata, too, but goes into the semantics of what’s inside the dataset. What does each column mean? What is the identity of the observations in the rows (if they are not anonymous)? Adding such information in a machine-readable way allows far greater findability because people can search for data expressed in euros or dollars only or any datasets that use CPI in the sense of the Corruption Perception Index instead of finding thousands of search hits for CPI as consumer price inflation.
Further enriching the semantics of an already tidy dataset can increase findability and reusability. Even the same user may struggle months or years later with a saved tidy dataset that is missing the definitions or variables of observations, or the unit of measure.
Our dataset package focuses on an R user who would like to work on maintaining, extending or improving the same dataset for many months or years, potentially with several data managers and data sources under joint stewardship.
The magic abbreviation FAIR data stands for findable, accessible, interoperable and reusable; F, A, and R mainly boil down to adding the correct metadata to the data. The European Interoperability Framework as a standard describes four layers of interoperability, so it is a far more elusive concept. In this standard, interoperability has a legal, organisational, semantic, and technical layers.
Technical interoperability can be increased by releasing the dataset in a system-independent format such as JSON. Legal interoperability can be increased by adding a standard re-use rights statement and authorship information about the dataset itself to the release. But considering the R user’s workflows (organisational aspect) and communicating it in a meaningful way (semantic aspect) ensures that a new user can effectively improve the dataset (similarly to adding to the code base of an R package), or use the dataset in an entirely different job. Making sure that the “GNI” is about nutrition and not national income, the “CPI” is about inflation and not corruption is essential to place your dataset into a public health or a macroeconomic analysis workflow.
dataset
package into the R ecosystemMany R packages try to help FAIR qualities with different mindsets and tools. The dataset package differs from them and can work with many of them synergistically, magnifying the value added by other metadata packages.
The frictionless
family is a bit of an alternative to
dataspace
and rdflib
: it supports one
serialisation format, JSON, it heavily relies on Schema.org, and it
provides interoperability within the frictionless ecosystem. The frictionless R
package. It works well in the ecosystems of the Open Knowledge
Foundations data ecosystem.
These packages help release or publish data and make it more findable
and reusable for new users. Our dataset
package can be seen
as boosting the performance or usability of these packages, but it has a
different interoperability platform in sight, the exchange of datasets
themselves either among R users or with platforms that publish datasets
themselves and not text about the contents of the dataset.
While the aforementioned packages mainly help with exporting datasets
from R to a different format, and a different system, and perhaps
outside of the statistical community in HTML, JSON, or XML. The dataset
package focuses on the R user’s data.frames, tibbles, potentially saved
as rds or Rda files, and to collect and add as much well-designed
metadata for interoperability and reuse as possible; of course, such
metadata can be passed on to dataspice
, rdflib
or frictionless
when leaving the R ecosytem.
The dataset
package does not want to make these packages
explicit dependencies, so it offers NQuad serialisation besides CSV to
export data outside of the R system; these Nquads can be translated to
other standard formats with rdflib
. We also recommend the
use of rdflib
for heavy-duty serialisation; but we offer a
lightweight solution to avoid a hard dependency. The rdflib
package is superior in exporting data from the R ecosystem to another
statistical or other scientific use with the W3C standard RDF markup.
For communicating among R users, it is an unnecessary translator that
requires quiet advanced understanding of the RDF metadata language. It
is unnecessary within the R ecosystem if we can convey the same
information within an .rds
or .Rda
file.
Similarly, if you want to communicate the contents of your dataset to
non-R users or using visualisations and text, the dataspice
package is a very good way to that. However, it is inferior for passing
on an R object to another user, because it requires translation at
exporting and re-translation at importing back to R; furthermore, a lot
can be lost in translation as it hardcodes the Schema.org vocabulary
that is not designed for statistical use, but for web publishing.
The dataset
package focuses on the R ecosystem and R
data.frames, and the R user who downloads data from different data
sources like Eurostat, IMF, and World Bank and needs to join such data
meaningful, add further derived own work, and save the result in .rda or
.rds files.
The dataspice
and frictionless
ecosystem
focuses on data users; they are organised around Schema.org, which helps
web content, such as a description of a dataset findable on a usual
browser search.
The dataset
package focuses on data producers and is
optimised for web queries on the web of data layer of the internet that
connects databases. Therefore our default settings and the function
interface focuses on the language of SDMX, the statistical data and
metadata exchange standard, and DCTERMS, which is the main language of
libraries and repositories. (We also support DataCite that is often
preferred in European data repositories instead of the more generic
DCTERMS.) These standards are not so good to find a description and a
visualisation of a dataset with an web text search (Schema.org was
designed for that purpose), but they are far superior for communicating
the contents of a dataset to an other data producer, or somebody who may
have a matching component to the contents of your dataset. Our package
is not intended to those who want to effectively communicate about the
interpretation of a dataset to fellow biologists or economist, but to
users who want to communicate the contents of the dataset to fellow data
collectors or producers.
We are planning a family of dataset packages that help you with data exchanges (receiving, joining, republishing) in different data interoperability scenarios. h dataset package mainly changes the attributes system of R objects. e R users actively use the attributes; we created a reference function with sensible defaults for most statistical and open science standards.
The planned datacube
package will extend the
dataset
package to the full SDMX Datacube specification and
allow you to exchange information or publish statistical data that are
fully interoperable with Eurostat or IMF datasets on statistical data
repositories, for example, on the EU Open Data Portal. On such exchanges
we mean that the users download both the GDP dataset and the population
dataset and send back their own GDP/population calculations in a new or
updated dataset.
The wbdataset
package enables the exchange and
publication of data on Wikidata or Wikibase instances, which offer the
widest-used open data exchange platform for non-statistical data. It
adds the main elements of the Wikibase Data Model that allow you to
exchange information or publish on many open graphs (platforms that will
enable connecting interoperable data sources.) A likely exchange
scenario is that you download the biographical data of famous singers
from a country and add back to some biographies that were not there, but
you have collected them from non-interoperable sources.
A similar extension to the frictionless
ecosystem is
possible, too; should there be a user demand to improve the R-side data
production workflow of data primarily intended into this
ecosystem.
The new dataset package would be streamlined to provide a tidier version of the tidy data definition. “Tidy datasets provide a standardised way to link the structure of a dataset (its physical layout) with its semantics (its meaning).” The aim of the dataset package is to improve the semantic infrastructure of tidy datasets beyond the current capabilities of the tidyverse packages, relaxing the exclusive use of the semantic definitions of the SDMX statistical metadata standards.
The dataset package and its dependencies (or, more specifically, new extensions) will only work with tidy datasets. In the context of tidy data, “A dataset is a collection of values, usually either numbers (if quantitative) or strings (if qualitative). Values are organised in two ways. Every value belongs to a variable and an observation.”
Our aim is to improve the semantic description of the observations and the variables to the extent that they can be serialised to any standard format that follows the RDF W3C metadata annotation standard. Such an improvement stipulates the following requirements:
For example, when working with this small dataset:
options(sciphen = 4)
small_country_dataset <- dataset_df(
country_name = defined(
c("AD", "LI"),
concept = "http://data.europa.eu/bna/c_6c2bb82d",
namespace = "https://www.geonames.org/countries/$1/"
),
gdp = defined(
c(3897, 7365),
label = "Gross Domestic Product",
unit = "million dollars",
concept = "http://data.europa.eu/83i/aa/GDP"
),
population = defined(
c(77543, 40015),
label = "Population",
concept = "http://data.europa.eu/bna/c_f2b50efd"
),
dataset_bibentry = dublincore(
creator = person(given = "Jane", family = "Doe"),
title = "Small Country Dataset",
publisher = "Reprex"
)
)
If you want to work with the data in the data.frame, you must know if the GDP is expressed in euros, dollars, thousands, millions or billions, in order to get to a meaningful GDP/capita figure.
summary(small_country_dataset$gdp)
#> Gross Domestic Product (million dollars)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 3897 4764 5631 5631 6498 7365
Because the GDP is expressed in millions of dollars, we have to adjust the nominator to dollars to get the GDP per capita express in dollars.
small_country_dataset$gdp_capita <- defined(
small_country_dataset$gdp * 1e6 / small_country_dataset$population,
unit = "dollar",
label = "GDP Per Capita"
)
The var_unit(small_country_dataset$gdp_capita)
command
returns dollar and
var_label(small_country_dataset$gdp_capita)
returns GDP Per
Capita.
In this case,
The metadata definitions of SDMX or open science are helpful, because they differentiate between metadata statements about the entire dataset as a structured information source, and metadata related to specific row/column intersections, where only the row, the column, and potentially the intersection cell value needs a formal definition.
For example, we may want to add as metadata to the entire dataset the time frame, the usage rights, the creator.
We must remain aware that in the absence of an all-encompassing, general ontology of all concepts that may go into a dataset, all datasets will be used in some kind of disciplinary context. A dataset about GDP or population will be read in a socio-economic context; a dataset containing data from sound recordings of musical works will use definitions from a musicological context. While music from Andorra may share some semantic definitions of our tiny socio-economic dataset that observes Andorra’s GDP and population, it is unlikely that the observation and variable definition will lack a broad context.
small_country_musicians <- data.frame(
qid = c("Q275912", "Q116196078"),
artist_name = defined(
c("Marta Roure", "wavvyboi"),
concept = "https://www.wikidata.org/wiki/Property:P2093"
),
location = defined(
c("Andorra", "Lichtenstein"),
concept = "https://www.wikidata.org/wiki/Property:P276"
),
date_of_birth = defined(
c(as.Date("1981-01-16"), as.Date("1998-04-28")),
concept = "https://www.wikidata.org/wiki/Property:P569"
)
)
small_country_musicians$age <- defined(
2024 - as.integer(substr(as.character(small_country_musicians$date_of_birth), 1, 4)),
label = "Age in 2024"
)
Specifically, our first dataset’s general semantic context is about countries as observational units and statistical variables related to them, while the second contains individual musical recordings and observed musicological values of such recordings, a general property of the dataset that can be applied to the entire dataset. Similarly, important provenance information, authorship, use rights, and so on, apply to the dataset, not to its specific cells.
summary(small_country_musicians)
#> Age in 2024
#> qid artist_name location date_of_birth
#> Length:2 Length:2 Length:2 Min. :1981-01-16
#> Class :character Class :character Class :character 1st Qu.:1985-05-12
#> Mode :character Mode :character Mode :character Median :1989-09-06
#> Mean :1989-09-06
#> 3rd Qu.:1994-01-01
#> Max. :1998-04-28
#> age
#> Min. :26.00
#> 1st Qu.:30.25
#> Median :34.50
#> Mean :34.50
#> 3rd Qu.:38.75
#> Max. :43.00
The current version of dataset made efforts to record such information with applying two globally used metadata standards, the Dublin Core terms used by most librarians and perhaps the majority of scientific data repositories, and the DataCite standard preferred by some European repositories (which is more specific to data publication than generally all kinds of publication like the Dublin Core library standards.) The dataset package and the dataset class add such metadata with the help of interface functions to the base R attributes part of the data.frame (or tibble, data.table) object. This is a useful feature that may go through code or interface refinement, but works well and should not significantly change.
The current version of dataset is too specific because it tries to map the SDMX definitions of the data structure, which goes far beyond the needs of a well-described tidy dataset. The datacube statistical vocabulary and standard provides further structural information about how we can subset or slice meaningfully a statistical dataset. While such structural information is indispensable for certain statistical datasets, it may be completely useless when it contains statistically not aggregated data (“microdata”), or collections data as a primary statistical data source.
To demonstrate the long-term ambition, we want to develop the following functionality:
wbdataset
package for a correct format for
collecting data from Wikibase, a semantic database widely used to share
micro-data. This package can later be generalised to work with further
data models that have a similar aim.datacube
package should support the semantic
needs of datasets that contain statistically processed information from
a dataset that was collected via the wbdataset
function.
The semantic needs are different, because the observational units will
change after statistical aggregation.The Wikibase Data Model is a relatively simple and flexible data model. It works with concepts and properties as relationships among concepts. A tidy dataset that applies the Wikibase Data Model can be described (using the definitions of Wikidata, the world’s largest public database created with Wikibase) the following:
The key column defines the data subject or statistical unit with a
QID identifier. This identifier is denoted with a capital Q followed by
an integer number. The QID is unique in one instance of a Wikibase
database. The full identifier contains the URL of this database and the
QID. For example, https://www.wikidata.org/wiki/Q228 unambiguously defines
the small country of Andorra in many natural languages, and with the
connection of various actionable, persistent identifiers. (Check out on
the linked page for example: sovereign microstate between France and
Spain, in Western Europe, or https://www.geonames.org/3041565/ or
https://isni.org/isni/000000012150090X/
.
library(wbdataset)
get_item(
qid = c("Q228", "Q347"),
language = c("en", "nl"),
creator = person("Jane Doe"),
title = "Small Countries"
)
With a wbdataset
object, it is important that the key
column is a QID. Following the notations of tidyverse, instead of the
tibble::rowid_to_colum, we create a`wbdataset::qid_to_column function
that creates an identifier for each row.
The variables or columns bring the observational unit (data subject or statistical subject) into a pre-defined relationship with the cell value. These pre-defined relationships are identified in the Wikibase Data Model with a property or PID identifier. The PID is denoted with a capital P followed by an integer number. The PID is unique in one Wikibase-created database. For example, https://www.wikidata.org/wiki/Property:P569 or simply P569 refers to the date of birth, and P276 denotes a location (of the data subject), and P3629 the age of subject at event.
small_country_musicians <- data.frame(
qid = c("Q275912", "Q116196078"),
label = c("Marta Roure", "wavvyboi"),
P276 = c("Andorra", "Lichtenstein"),
P569 = c(as.Date("1981-01-16"), as.Date("1998-04-28"))
)
## And the age
small_country_musicians$P3629 <- 2024 - as.integer(
substr(small_country_musicians$P569, 1, 4)
)
small_country_musicians
#> qid label P276 P569 P3629
#> 1 Q275912 Marta Roure Andorra 1981-01-16 43
#> 2 Q116196078 wavvyboi Lichtenstein 1998-04-28 26
Recalling the tidy data definition, the cell values may be numbers or “strings for qualitative information”. We extend the possibilities to further options following the RDF standards:
Therefore, the column types should have an unambiguous mapping to RDF:
small_country_musicians <- data.frame(
qid = c("Q275912", "Q116196078"),
label = c("Marta Roure", "wavvyboi"),
P276 = c("Q228", "Q347"),
P569 = c(as.Date("1981-01-16"), as.Date("1998-04-28"))
)
## And the age
small_country_musicians$P3629 <- 2024 - as.integer(
substr(small_country_musicians$P569, 1, 4)
)
small_country_musicians$P569 <- dataset::xsd_convert(small_country_musicians$P569)
small_country_musicians$P3629 <- dataset::xsd_convert(small_country_musicians$P3629)
small_country_musicians
#> qid label P276 P569 P3629
#> 1 Q275912 Marta Roure Q228 "1981-01-16"^^<xs:date> "43"^^<xs:decimal>
#> 2 Q116196078 wavvyboi Q347 "1998-04-28"^^<xs:date> "26"^^<xs:decimal>
We can also add the prefix to any concepts that are well-defined:
small_country_musicians$P276 <- paste0("https://www.wikidata.org/wiki/", small_country_musicians$P276)
small_country_musicians[, 2:4]
#> label P276 P569
#> 1 Marta Roure https://www.wikidata.org/wiki/Q228 "1981-01-16"^^<xs:date>
#> 2 wavvyboi https://www.wikidata.org/wiki/Q347 "1998-04-28"^^<xs:date>
The current dataset
class and functionality should relax
the modelling of the DataStructureDefinition of SDMX because it makes
only sense for statistically aggregated data. Currently, the dataset
package does not support the concept of a slice, and it is a bit awkward
to support multi-dimensional datacubes. Such functionality should be
moved to a low-priority new package.
The specifications of the wbdataset
package should all
be placed into the dataset package whenever it is not specific to the
Wikibase Data Model. They should be co-developed with
wbdataset
to provide a well-defined interface towards a
global data system (Wikidata and its “private clones” of Wikibase
instances.) The wbdataset
package should allow a simple,
natural way to import data from Wikidata or a Wikibase instance, and it
should also provide a simple interface to send data back to such a
semantic database with ease.
The distinction between wb-dataset and dataset is justified because a stripped-down dataset package can still work well in many SDMX or other contexts, albeit without the full functionality of supporting statistical slicing or API support to a specific SDMX-compatible web service. An R package that allows the creation of semantically rich SDMX-compatible datasets with only manual downloading or uploading functionality would still be a great improvement in implementing open science interoperability and reusability for such datasets.