---
title: "Onboarding a new PUMF"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Onboarding a new PUMF}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment  = "#>",
  eval     = nzchar(Sys.getenv("COMPILE_VIG_CANPUMF"))
)
```

This vignette explains how to bring a Statistics Canada PUMF that `canpumf`
does not yet know about into the package: a brand-new release year of a survey
already covered (e.g. the next cycle of the Canadian Housing Survey the day it
ships), or an entirely new survey series.

The happy path is "drop the files in the right place and call `get_pumf()`".
When that is not enough, the second half of this vignette walks through the
deliberate, low-risk workflow for figuring out the configuration a survey needs
— parse the metadata first, borrow a registry entry from a related survey, tweak
until the parse is clean, and only then build the full table.

```{r setup}
library(canpumf)
library(dplyr)
```

---

## Naming conventions and where to put the files

Everything lives under a single cache directory. Set it once, ideally in your
`.Rprofile`, so data persists across sessions:

```{r cache-path, eval = FALSE}
options(canpumf.cache_path = "~/data/pumf.data")
```

Without this option the cache falls back to `tempdir()` and is discarded at the
end of the session.

The cache is organised strictly by **series** and **version**:

```
<cache_path>/
  <series>/
    <version>/
      <original>.zip              # the StatCan download, retained
      <series>_<version>.duckdb   # built by canpumf
      metadata/                   # canonical CSVs, written by canpumf
        variables.csv
        codes.csv
        layout.csv                # only for fixed-width data
```

The two names you choose — `<series>` and `<version>` — are the identifiers you
will pass to `get_pumf(series, version)`. A few conventions matter:

* **`series`** is the survey acronym used throughout the package: `"SHS"`,
  `"SFS"`, `"CHS"`, `"GSS"`, `"ITS"`, `"SGVP"`, `"Census"`, `"LFS"`, …
  Use the existing acronym for a new year of an existing survey so it joins its
  siblings. For a genuinely new survey pick a short, stable acronym.
* **`version`** is normally the four-digit reference year (`"2023"`). Sticking to
  a bare year is what unlocks the smart fallback described below. Census uses
  multi-part versions (`"2021 (individuals)"`) and LFS accumulates all versions
  into one shared database — those are special-cased and documented in their own
  vignettes.

To deposit a PUMF manually, create the version directory and drop the StatCan
zip into it (extracted files are fine too — `canpumf` will unzip nested archives
on first use):

```{r deposit, eval = FALSE}
dir.create("~/data/pumf.data/CHS/2025", recursive = TRUE)
file.copy("~/Downloads/2025.zip", "~/data/pumf.data/CHS/2025/")
```

For surveys that `canpumf` knows how to download, you can skip the manual copy
entirely and let `get_pumf()` fetch the zip (see
`list_canpumf_collection()` / `list_pumf_registry()`). Manual deposit is for
the case where the file is brand-new, EFT-only, or otherwise not yet wired into
the collection table.

---

## Smart defaults and the newest-sibling fallback

A survey's configuration lives in the **registry** (`R/registry.R`): which file
in the directory is the data file (`file_mask`), how the bootstrap-weight file
joins (`bsw_*`), encodings, and per-variable fixups. Surveys with no registry
entry fall back to **auto-detection**, which inspects the directory and picks the
data file and metadata parser heuristically.

Two design choices make new years onboard with little or no work:

1. **Generic year masks.** Recent entries use a `\d{4}` placeholder instead of a
   hard-coded year, e.g. the CHS data file mask is `chs\d{4}ecl_pumf\.csv` rather
   than `chs2022ecl_pumf\.csv`. Because each version directory contains exactly one
   year's files, the generic pattern resolves unambiguously and a new year needs
   no edit to the mask.

2. **Newest-sibling config inheritance.** When you request a bare-year version
   that has *no* registry entry, and the same series has at least one other
   year-keyed entry, `pumf_registry_lookup()` automatically reuses the config of
   the newest sibling whose year is at or before the one you asked for (or the
   oldest sibling if your year predates them all). It prints a one-time message so
   the reuse is visible:

```{r inherit, eval = FALSE}
# Suppose CHS/2025 has just been released and is not in the registry yet.
tbl <- get_pumf("CHS", "2025")
#> No CHS/2025 registry entry; inheriting config from CHS/2022. Verify the new
#> release matches (file layout, codes, BSW join) and add an explicit entry if
#> it differs.
```

If the 2025 release follows the same naming and layout as 2022 — which is the
common case — this just works: the generic mask finds `chs2025ecl_pumf.csv`, the
BSW join is wired the same way, and the metadata is parsed from the command files
shipped *inside the 2025 download itself* (metadata is always per-version; it is
never borrowed from a sibling, because variable positions and codes drift between
cycles).

Inheritance is deliberately **not** silent and **not** a substitute for an
explicit entry. Treat the message as a prompt to confirm the new release really
matches, and to add a proper `CHS/2025` entry once you have. Inheritance is
skipped for multi-part Census versions and for LFS.

So the first thing to try with any new bare-year release is simply:

```{r try-it, eval = FALSE}
tbl <- get_pumf("CHS", "2025")
tbl |> label_pumf_columns() |> head() |> collect()
```

If the columns look right and the labels resolve, you are done — go to
[section 4](#promote) to make it permanent.

---

## When the automatic import fails

If `get_pumf()` errors, picks the wrong file, leaves columns unlabeled, or the
weights do not join, stop and work the problem in stages. The golden rule is to
**get the metadata parsing right before building the full table**. Parsing is
cheap and idempotent; building scans the whole data file.

### See what is actually in the directory

```{r inspect, eval = FALSE}
vdir <- file.path(getOption("canpumf.cache_path"), "NEWSURVEY", "2025")
list.files(vdir, recursive = TRUE)
```

Look for: the data file (`.txt`/`.dat` for fixed-width, `.csv` for delimited),
the command/codebook files that describe it (`.sps`, `.sas`, `.lay`/`.lbe`,
`*codebook.csv`, `*Dictionary.pdf`, `.sav`), and any bootstrap-weight file
(often `*bsw*`). The names tell you which parser will fire and what `file_mask`
needs to select.

### Parse the metadata in isolation

`pumf_metadata()` runs only the locate + parse stages and returns the canonical
metadata (`variables`, `codes`, `layout`) without building the DuckDB table. This
is your fast feedback loop:

```{r meta, eval = FALSE}
meta <- pumf_metadata("NEWSURVEY", "2025")
str(meta$variables)   # one row per variable: name, label_en, label_fr, type, ...
str(meta$codes)       # one row per code value: name, val, label_en, label_fr
str(meta$layout)      # fixed-width column ranges (absent for CSV data)
```

Good signs: every data column appears in `variables`; categorical variables have
rows in `codes`; for fixed-width data, `layout` has sensible `start`/`end`
positions. Common problems and the registry field that fixes each:

* **Wrong file chosen / "no data file found"** → set `file_mask`.
* **Several command files, parser picks the wrong one** → set `layout_mask` to
  disambiguate the SPSS/SAS files.
* **Garbled accents in French labels / read errors** → set `metadata_encoding`
  (default `"CP1252"`; older DOS-era files sometimes need `"CP850"`, some recent
  ones `"UTF-8"`).
* **Labels missing entirely** (DATA LIST-only SPSS) → supply a
  `*Dictionary.pdf` in the directory so the PDF parser can fill them in.

### Start from an existing entry as a template

You rarely start from a blank slate. Inspect a related entry — an earlier year of
the same survey, or a different survey with the same file format — and use it as
the starting point:

```{r template, eval = FALSE}
pumf_registry("CHS", "2022")   # inspect a known entry
list_pumf_registry()           # see everything that is registered
```

Build a candidate configuration with `pumf_registry_entry()`. Only the fields you
supply are recorded; the rest fall back to pipeline defaults. You pass it to
`pumf_metadata()` / `get_pumf()` via the `registry =` argument **without touching
`R/registry.R`** — perfect for iterating:

```{r entry, eval = FALSE}
entry <- pumf_registry_entry(
  file_mask     = "PUMF_NEWSURVEY_\\d{4}\\.txt",  # generic year from the start
  bsw_file_mask = "bsw_flatfile\\.txt",
  bsw_join_key  = "CASEID"
)

# Re-parse with the candidate config until the metadata looks right:
meta <- pumf_metadata("NEWSURVEY", "2025", registry = entry, refresh = TRUE)
```

Note `refresh = TRUE`: parsing is idempotent, so once `metadata/` exists a new
`registry` has no effect until you force a re-parse.

If you are adapting a survey whose format matches an existing one, literally copy
that survey's entry fields into `pumf_registry_entry()` and adjust the masks.
This is exactly how a new CHS year reuses the CHS/2022 shape, and how a new
survey might start from the configuration of an existing one that shares its
file format.

### Tweak fixups for data-level issues

Once the structure parses, the build stage may still need per-variable
adjustments. These go in `data_fixups` (see `?pumf_registry_entry` for the full
list):

* `force_numeric` — a continuous variable carrying top-code/boundary labels;
  drops the spurious codes but first converts true-missing sentinels to an `NA`
  range.
* `force_character` / `force_integer` / `force_bigint` — override the DuckDB
  storage type so geographic codes keep leading zeros, or out-of-range IDs are
  not truncated.
* `na_values` — raw string values that should become `NA` across all columns
  (undeclared sentinels, SAS-style `"."`).
* `codes_supplement` / `missing_supplement` — inject code rows or missing ranges
  that the command files omit.

```{r fixups, eval = FALSE}
entry <- pumf_registry_entry(
  file_mask   = "PUMF_NEWSURVEY_\\d{4}\\.txt",
  data_fixups = list(
    force_numeric  = "INCOME",
    force_character = "GEOCODE"
  )
)
```

### Build the full table

When the metadata is clean and the fixups are in place, do the full build:

```{r build, eval = FALSE}
tbl <- get_pumf("NEWSURVEY", "2025", registry = entry, refresh = TRUE)

tbl |> label_pumf_columns() |> head() |> collect()   # spot-check labels
tbl |> count() |> collect()                           # row count sanity check
bsw_info(tbl)                                         # confirm weights joined
```

Verify a few known values against the official documentation: a categorical
variable's levels, a continuous variable's range, the total row count, and that
the bootstrap weights are present and join 1:1.

---

## Promote the configuration into the registry {#promote}

A `registry =` patch only lasts for the session. Once it works, make it permanent
by cloning the **canpumf** repo, adding an entry to `R/registry.R` so plain `get_pumf("NEWSURVEY", "2025")`
works for everyone and making a pull request to merge this into the official package:

```{r promote, eval = FALSE}
# In R/registry.R, inside the .pumf_registry list:
newsurvey.2025  = .make_entry("NEWSURVEY", "2025",
  file_mask     = "PUMF_NEWSURVEY_\\d{4}\\.txt",   # keep the generic year
  bsw_file_mask = "bsw_flatfile\\.txt",
  bsw_join_key  = "CASEID")
```

Alternatively [open an issue](https://github.com/mountainMath/canpumf/issues) and document your successful registry modification and the package maintainers will add it to the package.

Use the generic `\d{4}` year mask so the next release year inherits cleanly via
the newest-sibling fallback. When you add the entry, keep the three sources of
truth in sync: the registry, the test suite (`tests/testthat/`,
`tests/TEST_COVERAGE.md`), and the README verified-datasets table.

Finally, **every manual override must be verified against the survey's official
documentation and recorded** in `tests/testthat/override_verification.csv`. The
`test-override-verification.R` test fails if an override is missing from the
ledger or marked `pending`/`mismatch`. The workflow for confirming overrides
against the PDF codebook is described in the package's `CLAUDE.md` and driven by
`tools/verify_overrides.R`.

---

## Summary

* Put the PUMF at `<cache_path>/<series>/<version>/`, using the survey acronym
  and a bare four-digit year.
* Try `get_pumf(series, version)` first — generic year masks plus newest-sibling
  inheritance often handle a new year with no code change (watch for the
  inheritance message).
* If it fails, iterate with `pumf_metadata(..., registry = pumf_registry_entry(...))`
  starting from a related entry, fixing the parse before building.
* Build with `get_pumf(..., registry = ...)`, spot-check labels/rows/weights.
* Promote the working config into `R/registry.R`, sync tests + README, and record
  any overrides in the verification ledger.
