quak provides convenient utilities for using DuckDB with
datasets stored in Azure Data Lake Storage Gen2 (abfss://).
It opens connections configured for Azure-backed Delta Lake data,
registers Azure credentials as DuckDB secrets, and works with both SQL
via DBI and lazy table queries via dplyr and dbplyr.
quak is not yet on CRAN. Install the development version
from GitHub:
# install.packages("remotes")
remotes::install_github("pedrobtz/quak")Start by opening an Azure-ready DuckDB connection. Then register one Azure secret on that connection.
library(quak)
conn <- az_conn()Use a credential chain when you want DuckDB to resolve credentials itself, for example from the Azure CLI:
library(quak)
conn <- az_conn()
az_set_chain_secret(conn, chain = "cli")Use an access token when another package obtains the token for you:
library(quak)
az_creds <- azr::DefaultCredential$new(
scope = az_default_scope()
)
conn <- az_conn()
az_set_token_secret(conn, token = az_creds$get_token()$access_token)Use a service principal when you want to pass the application credentials directly:
library(quak)
conn <- az_conn()
az_set_sp_secret(
conn,
tenant_id = Sys.getenv("AZURE_TENANT_ID"),
client_id = Sys.getenv("AZURE_CLIENT_ID"),
client_secret = Sys.getenv("AZURE_CLIENT_SECRET")
)All secret helpers also accept
account = "storageaccount" to scope the secret to one
storage account.
Use load_delta() when you want to register a Delta table
in DuckDB and query it with SQL.
library(quak)
conn <- az_conn()
az_set_chain_secret(conn, chain = "cli")
load_delta(
conn = conn,
url = "abfss://container@account/path/sales",
name = "sales",
method = "view"
)
DBI::dbGetQuery(
conn,
"SELECT COUNT(*) AS n, AVG(amount) AS avg_amount
FROM sales
WHERE amount > 100"
)
DBI::dbDisconnect(conn, shutdown = TRUE)Use tbl_delta() when you want to work with a Delta table
through dplyr/dbplyr.
library(quak)
conn <- az_conn()
az_set_chain_secret(conn, chain = "cli")
sales <- tbl_delta(conn, "abfss://container@account/path/sales")
sales |>
dplyr::filter(amount > 100) |>
dplyr::summarise(avg_amount = mean(amount, na.rm = TRUE)) |>
dplyr::collect()
DBI::dbDisconnect(conn, shutdown = TRUE)