Investigate: bufdir-barnefattigdom — indicator_api_id stability under workbook renames
IMPLEMENTATION RULES: Before implementing any plan from this investigation, read and follow:
- WORKFLOW.md — The implementation process
- PLANS.md — Plan structure and best practices
Status
Completed (2026-05-07) — recommendation (d) accepted and shipped.
Outcome
Recommendation (d) — number-prefix surrogate id (bf_zip_ind_<N>) plus a curated alias seed — landed in PR #71 (PLAN-bufdir-surrogate-id-migration.md, now in completed/). Implementation summary:
parse.ts:surrogateIndicatorApiId()rewritten: parsesIndikator_(\d+[a-z]?)from filename, emitsbf_zip_ind_<N>; SHA-256 path retained as defensive fallback with a warn log if Bufdir ever ships a non-numbered workbook.- Alias seed shipped at
atlas-data/dbt/seeds/sources/bufdir_indicator_alias.csvwithhistorical_id, canonical_id, notecolumns and pre-populatedIndikator 9 → 9a/9bandIndikator 10retirement entries from the observed history. - dbt model lives under
atlas-data/dbt/models/marts/api/bufdir_indicator_alias.sql, soregenerate-api-v1.shauto-emitsapi_v1.bufdir_indicator_alias. Once UIS lands the PostgRESTmarts.*schema-list extension (PLAN-007 Phase 1), the alias surface is queryable on the public API without per-source wiring. - Live state on main:
marts.indicators__bufdir_barnefattigdomcarries 22 distinctbf_zip_ind_<N>ids;marts.bufdir_indicator_aliascarries the historical bridge rows. - README updated with the consumer-pattern note (when to join on the alias for cross-time continuity).
Open questions (from this INVESTIGATE) — resolutions
- Q1 (
5vs5bsame id?) — Resolved as recommended: kept different by default; alias seed bridges editorial-decision continuity. PR #71 alias rows reflect this. - Q2 (alias seed location) — Resolved: lives under
seeds/sources/, alongside_sources_manifest.csv. - Q3 (expose via
api_v1.*?) — Resolved: yes, auto-exposed via the model's placement undermodels/marts/api/. The "auto-exposure over per-source decisions" principle in this INVESTIGATE became repo-wide convention; documented in PLAN-004 generator notes. - Q4 (does this generalise to other ZIP sources?) — Resolved: yes — convention is "alias seed under
seeds/sources/<source>_indicator_alias.csv+ dbt model undermodels/marts/api/". Future ZIP sources copy the pattern; no per-source INVESTIGATE required unless the source has a workbook-stable id of its own.
Related work
- PR #77 (
raw_tables:field onmanifest.yml) lets the catalogue surface bufdir's raw → mart lineage explicitly; this is the metadata trail that lets external consumers see "bf_zip_ind_5 came fromIndikator_5_…xlsx" without needing to re-derive the surrogate-id rule. - PR #67 (parse.ts split + 29 golden-file tests) pinned the parse output before PR #71's id rewrite; the id-strategy change rode on green tests instead of bespoke validation.
Next steps section from this INVESTIGATE
All four next-step items are done:
- ✅ User reviewed and accepted recommendation (d).
- ✅
PLAN-bufdir-surrogate-id-migration.mddrafted, executed, and moved tocompleted/. - n/a — accepted, not deferred.
- ✅ This INVESTIGATE moves backlog/ → completed/ in this housekeeping pass.
Status (original): Backlog
Goal: Decide a stable identity strategy for indicators__bufdir_barnefattigdom.indicator_api_id so consumers of the public API don't see indicators "disappear" and "reappear" when Bufdir renames or splits a workbook upstream. Output: a recommended strategy + scope of any code/data migration, not the implementation itself (a follow-up PLAN-* would do that if the recommendation is more invasive than the current scheme).
Last Updated: 2026-05-05
Origin: PR #60 landed Cursor BG / Composer-2's ZIP-based ingest with indicator_api_id = "bf_zip_" + SHA-256(filename_stem)[:24]. PR #67 hardened discovery + added golden-file tests. The post-merge audit on those two PRs flagged surrogate-id stability as the third (and last) open hardening point — Composer-2 explicitly deferred it. This investigation closes that thread.
The problem in one paragraph
Bufdir's barnefattigdom ZIP exposes one workbook per indicator (Indikator_<N>_<slug>.xlsx). The XLSX export does not carry Bufdir's internal Strapi indicator ids. Atlas's current ingest derives indicator_api_id deterministically from the full filename stem via SHA-256:
function surrogateIndicatorApiId(workbookStem: string): string {
return "bf_zip_" + sha256(workbookStem).slice(0, 24);
}
// "Indikator_5_barn_i_hush_..." → bf_zip_<hash_A>
// "Indikator_5b_barn_i_hush_..." → bf_zip_<hash_B> (different! same indicator, renamed)
This means any change to the filename — adding a year, fixing a typo, renumbering, splitting — changes the id. To consumers of api_v1.indicators_bufdir_barnefattigdom, the old indicator vanishes and a new indicator appears. Cross-time series comparisons break.
We have already observed the failure shape upstream: Indikator 9 was split into 9a and 9b at some point (the bundle currently contains Indikator_9a_… and Indikator_9b_… but no Indikator_9_…). The 1..8, 9a, 9b, 11..22 numbering also tells us Indikator_10 used to exist and was retired — a real-world rename / consolidation event.
Strapi-API-flavored ingests are immune (Strapi keeps stable hex ids across renames). The ZIP path traded that for operational simplicity; this investigation decides what the right substitute is.
What gets compared — the ID's job
indicator_api_id is the join key consumers use to:
- Track an indicator's value series across years (
SELECT … WHERE indicator_api_id = X ORDER BY year). - Cross-reference Atlas catalogue rows in
mart_meta_sources/mart_meta_dimensions(Phase 3 of PLAN-007). - Build derived marts that join bufdir indicators to other sources.
So the strategy must answer: is this id the same indicator as last year, or is it a new indicator? Whatever is upstream-stable enough to anchor that answer.
What's actually inside the workbook
Quick spike against the live ZIP — every workbook's Data sheet has a title block above the header:
| Workbook | Title in pre-header row 0 |
|---|---|
Indikator_17_…leier_bolig_kun_pers.xlsx | Tab. 17: Barn 0-5 år i husholdninger som leier bolig. Kun personer. |
Indikator_4_…sosialhjelp_ila_året.xlsx | Tab. 4: Barn i husholdninger som har mottatt sosialhjelp i løpet av året. |
So the workbook does contain Tab. <N> — the same indicator number as in the filename. Bufdir keeps two redundant copies of the number (filename + sheet body) but no separate stable id.
The only other stable thing inside the workbook is the title text, which Bufdir occasionally rewords (e.g. methodology footnotes, definition refinements) — less stable than the number.
Options
(a) Status quo: bf_zip_<hash(full_filename_stem)>
Atlas's current implementation. Survives nothing — every cosmetic filename change breaks every consumer's joins.
- Pro: Already shipped. No migration required if accepted.
- Pro: Hash collisions effectively impossible (24 hex chars = 96 bits).
- Con: Breaks on renames (
5→5b), typo-fixes, year-suffix additions, slug overhauls. - Con: Consumers cannot see "this is the same indicator, just renamed" — the old id orphans, the new id has no history.
- Verdict: Insufficient for production cross-year analysis.
(b) Number-prefix: bf_zip_ind_<N> derived from the filename's Indikator_<N> portion
Parse the leading Indikator_<N> (where N matches \d+[a-z]?) and use that as the id surrogate. The rest of the filename becomes display-only metadata in indicator_slug.
"Indikator_5_barn_i_hush_..." → bf_zip_ind_5
"Indikator_5b_barn_i_hush_..." → bf_zip_ind_5b (same as 5? deliberately different.)
"Indikator_9a_..." → bf_zip_ind_9a
"Indikator_22_..." → bf_zip_ind_22
- Pro: Survives the most common upstream change — slug refinements, year-suffix additions, methodology footnote rewrites — that don't change the indicator number.
- Pro: Aligns with the inside-workbook
Tab. <N>identifier (consistent with Bufdir's own convention). - Pro: Trivial to implement; one regex change in
parse.ts:surrogateIndicatorApiId(). - Pro: Backwards-compatible mart-schema change;
indicator_api_idis still text. - Con: A genuine rename like
5→5bis treated as a different indicator (which it might be —5bis a refinement, not a renumber). - Con: Renumbering events (the kind that produced the
10 → 9a/9bhistory) still appear as discontinuity. Needs an alias table to bridge. - Verdict: Strong default. Covers ~80 % of real upstream changes at zero ongoing cost.
(c) Title-hash: bf_zip_<hash(workbook_title_text)>
Use the Tab. <N>: title text from the workbook's row 0 instead of the filename.
- Pro: Survives filename-only changes (path moves, slug rewrites).
- Con: Bufdir occasionally rewords titles (e.g. "Barn i …" → "Personer 0-17 i …"); each reword breaks ids.
- Con: Norwegian Unicode normalisation, whitespace handling, em-dash variants — many subtle ways to produce non-deterministic hashes.
- Con: Titles are less stable than numbers in observed Bufdir history.
- Verdict: Weaker than (b). Don't pursue.
(d) Number-prefix + alias seed: (b) + a tiny human-curated bufdir_indicator_alias.csv
(b) plus a hand-authored seed file that maps historical → canonical ids when Atlas observes a renumber:
historical_id,canonical_id,note
bf_zip_ind_9,bf_zip_ind_9a,Indikator 9 was split into 9a (innvandrerbakgrunn Afrika etc) and 9b (EU etc) ~2024; consumers joining on the old id should use 9a as the closest successor
bf_zip_ind_10,,Indikator 10 retired by Bufdir (no successor in the bundle as of 2026-05-04)
The alias seed lives at atlas-data/dbt/seeds/sources/bufdir_indicator_alias.csv and feeds a marts.bufdir_indicator_alias dbt model placed under models/marts/api/ so the PLAN-004 generator auto-emits the api_v1.bufdir_indicator_alias wrapper. Consumers explicitly join on it when they want historical continuity; the default indicator_api_id join still works without it.
- Pro: Handles the real renumbering events (b) can't, while keeping the common-case clean.
- Pro: Human-curated — the editorial decision ("is 9a the canonical successor of 9, or is the old 9 fully retired?") is captured per row, not invented by code.
- Con: Maintenance overhead — every new bufdir release needs a diff check ("did Bufdir add a new indicator number? rename one?").
- Con: Adds one seed table + a per-source maintenance ritual.
- Verdict: Right answer for production. Recommended.
(e) Strapi/CMS direct lookup — query Bufdir's Strapi API for the canonical indicator id
Hit Bufdir's CMS (Strapi /api/...) instead of inferring from the workbook. The CMS does have stable hex ids.
- Pro: Bufdir's own canonical id; never invented.
- Pro: Survives filename changes by definition.
- Con: Re-introduces the live-API dependency that PR #60 deliberately removed for "operational simplicity."
- Con: Strapi API shape may not be public / documented; reverse-engineering required.
- Con: Adds an HTTP roundtrip per indicator (not just one for the ZIP).
- Con: Has its own brittleness (Strapi schema changes, auth requirements, rate limits).
- Verdict: Only worth pursuing if (b)+(d) prove insufficient in practice. Defer.
Recommended outcome
(d) — number-prefix + alias seed. Specifically:
- Change
parse.ts:surrogateIndicatorApiId()to parseIndikator_(\d+[a-z]?)from the filename stem and emitbf_zip_ind_<N>(e.g.bf_zip_ind_9a). Filename-stem hashing falls away. - Add a fallback path: if the filename doesn't match
Indikator_<N>(defensive — maybe Bufdir adds a non-numbered workbook), fall back to the current SHA-256-of-stem to keep ingest from throwing. Log a warn so the operator notices. - Add seed
atlas-data/dbt/seeds/sources/bufdir_indicator_alias.csvwith columnshistorical_id, canonical_id, note. Pre-populate the9→9aand the10→ null entries from the existing observed history. - Add the dbt model at
atlas-data/dbt/models/marts/api/bufdir_indicator_alias.sql(loads from the seed) + aschema.ymlentry with descriptions for all four columns (historical_id,canonical_id,note,source_id). Placement undermodels/marts/api/means the next./regenerate-api-v1.shrun auto-emitsapi_v1.bufdir_indicator_alias— no per-source generator wiring. PostgREST'smarts.*exposure (PLAN-007 Phase 1) will serveGET /bufdir_indicator_aliasautomatically once UIS lands the schema-list extension. - Document the consumer pattern in
bufdir-barnefattigdom/README.mdand the upcoming Phase 4/datapage so external developers know to use it for historical continuity. - Refresh
bufdir_indicator_alias.csvas part of every new bufdir refresh — add to the maintenance checklist in the source README.
Migration concern: existing indicator_api_id values in marts.indicators__bufdir_barnefattigdom are bf_zip_<24-char-hex>. After this change they become bf_zip_ind_<N>. This is a breaking change for any consumer that has cached the old ids. The bufdir source has been live for less than a week and the public API for it has not been advertised, so the breaking-change cost is essentially zero today. Land before any external integration starts depending on the current id shape.
Don't recommend (a): leaves the production gap. Don't recommend (b) alone: doesn't handle renumbering; alias is the cheap completion. Don't recommend (c): title-hashing is weaker than number-hashing in observed Bufdir behavior. Don't recommend (e): re-introduces the live-API dependency PR #60 removed; revisit only on signal.
Open questions
- [Q1] Should
bf_zip_ind_5andbf_zip_ind_5bbe the same id (under the assumption that5bis a refinement of5) or different (under the assumption that5is retired and5bis a separate indicator)? Recommendation: keep them different by default; let the alias seed bridge when an editorial decision says so. Conservative — code never claims continuity it can't prove. - [Q2] Where does the alias seed live in the cluster — under
seeds/sources/(alongside_sources_manifest.csv) or underseeds/(top-level)? Recommendation:seeds/sources/— same shape as the manifest seed, single seed-rebuilder script discovers everything in one directory. - [Q3] Should the alias table be exposed via
api_v1.bufdir_indicator_aliasso external consumers can join on it? Recommendation: yes, auto-exposed via both surfaces. Place the dbt model atatlas-data/dbt/models/marts/api/bufdir_indicator_alias.sqlso the existing PLAN-004 generator (regenerate-api-v1.sh) emits theapi_v1.*wrapper without per-source code, and so PostgREST'smarts.*schema serves it directly once PLAN-007 Phase 1 lands the schema-list extension. Auto-exposure over per-source decisions — see the architectural note below. - [Q4] Should the same id-strategy apply to other ZIP-shaped sources Atlas might onboard later (DSB workbooks, Bufdir-barnevern XLSX exports, etc.)? Recommendation: yes by default. The
Indikator_<N>parsing rule is bufdir-specific, but the convention "small alias seed inseeds/sources/<source>_indicator_alias.csv+ dbt model undermodels/marts/api/" generalises trivially. Each future ZIP source can copy the pattern; no per-source INVESTIGATE required unless a source genuinely deviates (e.g. its own version-stable id baked into the workbook).
Architectural principle: auto-exposure over per-source decisions
This INVESTIGATE's Q3/Q4 answers lean on a broader principle worth recording explicitly so future ZIP/alias work doesn't re-litigate:
New data should be served automatically. Atlas already has two mechanisms that auto-expose data without per-source code:
- PostgREST schema-list extension (PLAN-007 Phase 1, UIS-side, pending) — every new
marts.*table is queryable asGET /<table>the moment it ships, no separate exposure step.regenerate-api-v1.shgenerator (PLAN-004) — every dbt model undermodels/marts/api/auto-emits anapi_v1.<name>wrapper view.Together, the right design move for any new lookup / dim / fact / mart is to drop it under
models/marts/api/so both surfaces pick it up for free. No "should we expose this?" decision per source. No opt-in flag. No maintenance ritual gating on whether someone remembered to wire it up.Reverse-direction: an
api_v1.*wrapper is not a meaningful commitment for a small reference / alias table that Atlas controls end-to-end (we'd never break its column shape arbitrarily anyway). The wrapper layer's value is uniform discovery, not a separate stability tier.
This is the lens through which every future alias / lookup / catalogue artefact gets placed. Codified here so the next agent reading this INVESTIGATE doesn't re-debate "should the alias be in api_v1?" for whatever source comes after bufdir.
What this INVESTIGATE explicitly does NOT decide
- The implementation timing. This is research-only. A follow-up
PLAN-bufdir-surrogate-id-migration.mdwould sequence the migration (parse change → seed creation → backfill notes → consumer messaging) if the recommendation is accepted. - Whether the alias seed should also retroactively add entries for the (yet-unobserved)
Indikator_Xdeprecations. We add entries when we observe a deprecation — pre-populating speculative entries is wasted maintenance. - The rename-detection automation. A future ingest could diff today's filenames against last week's
_sources_dimensionsseed and flag suspect renames. That's a hardening step, not part of this scope.
Cross-references
atlas-data/ingest/src/sources/bufdir-barnefattigdom/parse.ts— currentsurrogateIndicatorApiId()lives here; option (b) implementation lands here.atlas-data/ingest/src/sources/bufdir-barnefattigdom/README.md— Known-quirks line on surrogate id (added in PR #68) is the user-facing flag this investigation closes.INVESTIGATE-bufdir-barnefattigdom-zip-ingest.md— Composer-2's original investigation that flagged this question and deferred it.- PR #60 — the streaming ingest that introduced the surrogate-id strategy.
- PR #67 — the
parse.tssplit + golden-file tests; covers the filename-derivation logic this INVESTIGATE proposes changing. - PLAN-007 Phase 3 —
mart_meta_dimensions— when Phase 3 ships, the alias seed shape may want to be reflected in the catalogue's metadata views.
Next steps
- User reviews + accepts (or refines) the recommended outcome (d).
- If accepted, draft
PLAN-bufdir-surrogate-id-migration.mdwith the parse change, the alias seed, the dbt model, and the consumer-doc update sequenced. - If deferred (acceptable: bufdir is new, no external consumers yet), capture the decision + the trigger condition in this file.
- Move this INVESTIGATE backlog/ → active/ when the migration PLAN starts; active/ → completed/ when it ships.
— signed, the Atlas implementation team (via Claude Code agent), 2026-05-05