Plan 003: Tests and naming-conventions for the new vocabulary
IMPLEMENTATION RULES: Before implementing this plan, read and follow:
- WORKFLOW.md - The implementation process
- PLANS.md - Plan structure and best practices
Status: Completed
Goal: Lock in the new canonical vocabulary introduced by PLAN-001 + PLAN-002 — add accepted_values tests for the decoded enum columns, accepted_range tests for the parsed integer columns, relationships tests from indicator codes back to their ref_* seeds, and extend docs/stack/naming-conventions.md with every new field.
Last Updated: 2026-04-22
Completed: 2026-04-22
Investigation: INVESTIGATE-code-label-mapping.md
Prerequisites: PLAN-001 (seeds) ✓ and PLAN-002 (indicator models) ✓, both completed 2026-04-22.
Priority: Medium (closes the investigation).
Overview
PLAN-002 added new columns but only fixed the schema.yml entries that broke dbt test (renames of sex_code → sex, education_level → parents_education). This plan adds tests for everything new, so future changes can't silently drift the vocabulary, and it updates the canonical naming table so contributors choose the right names without asking.
After this plan:
- Every code column with a corresponding seed has a relationships test back to that seed (single source of truth: when the seed changes, the test follows).
- Decoded enum columns without a seed (sex, housing_status, grade) have an accepted_values test pinned to the small known set.
- Every _label_no / _label_en column derived from a seed has a not_null test (left join + relationships means nothing should slip through as null).
- Every parsed integer column (period_start_year, age_group_min, age_int, etc.) has an accepted_range test.
- naming-conventions.md lists every new canonical field, plus the FHI raw column names that must never leak into marts.
One test per concept, not two. Where a relationships test to a seed already catches drift, do not also add accepted_values listing the same codes — the seed is the source of truth and accepted_values would be a duplicate to hand-edit.
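As a minimal sketch of this rule, the schema.yml entry for a seeded code column might look like the fragment below. The model, column, and seed names come from task 1.1. The ref('ref_ssb_family_type') form and the classic tests: key are assumptions about how the project is set up (newer dbt versions also accept data_tests:).

```yaml
# Sketch only: assumes the seed is addressable as ref('ref_ssb_family_type').
version: 2

models:
  - name: indicators__ssb_06083
    columns:
      - name: family_type
        tests:
          # The seed is the single code list, so no accepted_values duplicate.
          - relationships:
              to: ref('ref_ssb_family_type')
              field: code
      - name: family_type_label_no
        tests:
          - not_null
      - name: family_type_label_en
        tests:
          - not_null
```

If a code is added to or removed from the seed CSV, the relationships test follows automatically and nothing in schema.yml needs hand-editing.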
The seed-README task from the original investigation checklist ("Document each seed in dbt/seeds/README.md with source and update policy") was already done as part of PLAN-001 — no-op here.
Phase 1: schema.yml — accepted_values, ranges, relationships
Walk the 9 indicator models with new columns and pin their values.
Tasks
- 1.1 indicators__ssb_06083: family_type relationships to ref_ssb_family_type.code. family_type_label_no + _label_en not_null.
- 1.2 indicators__ssb_06944: household_type relationships to ref_ssb_household_type. Label columns not_null.
- 1.3 indicators__ssb_07459: sex accepted_values ['male','female','all'] (no seed for sex). age_int accepted_range 0–110 (nullable). age_min accepted_range 0–110.
- 1.4 indicators__ssb_09429: education_level relationships to ref_ssb_nivaa. education_level_label_no / _en not_null. Verify that the sex accepted_values test is already present from the PLAN-002 schema.yml; add it if not.
- 1.5 indicators__ssb_12944: period_start_year / period_end_year accepted_range 2000–2050 (nullable for 999A-style rows). age_group_min / age_group_max accepted_range 0–120 (nullable).
- 1.6 indicators__fhi_bor_alene: period_*_year and age_group_min / age_group_max accepted_range as above.
- 1.7 indicators__fhi_mobbing: sex accepted_values; period_*_year accepted_range.
- 1.8 indicators__fhi_trangbodd: parents_education relationships to ref_fhi_utdann. parents_education_label_no not_null. period_*_year and age_group_* accepted_range.
- 1.9 indicators__fhi_vgs_gjennomforing: sex accepted_values; parents_education relationships + label not_null; immigration_category relationships to ref_fhi_innvkat + label not_null; period_*_year accepted_range.
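For the columns without a seed, a sketch of task 1.3 could look like the following. It assumes accepted_range is the dbt_utils generic test (dbt_utils.accepted_range) and that the package is installed. The bounds are the ones listed in the task.

```yaml
# Sketch only: assumes the dbt_utils package provides accepted_range.
version: 2

models:
  - name: indicators__ssb_07459
    columns:
      - name: sex
        tests:
          - accepted_values:
              values: ['male', 'female', 'all']
      - name: age_int
        tests:
          # Null values are not flagged by this test, which suits
          # the nullable open-ended rows.
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 110
      - name: age_min
        tests:
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 110
```

Because the range test ignores nulls, nullable columns need no extra configuration. A column that must always be populated would need a separate not_null test.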
Validation
cd atlas-data/dbt
uv run --env-file ../ingest/.env dbt test --select indicators
User confirms: every previously-passing test still passes; the count grows by ~30 new tests; zero new errors.
Phase 2: schema.yml — seed tables themselves
Add a schema.yml entry per ref_* seed with not_null and primary-key uniqueness tests on code, plus checks on sort_order. This catches anyone editing a CSV in a way that breaks downstream joins.
Tasks
- 2.1 Create atlas-data/dbt/seeds/schema.yml with one seed: entry per CSV. For each: code not_null + unique; label_no not_null; label_en no test (blank for FHI); sort_order not_null + unique + accepted_range: 1..N (where N is the row count of that seed). The seed CSVs themselves are the canonical code list, so no accepted_values echo is needed.
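A minimal sketch of one such seed entry is shown below, using ref_ssb_family_type as the example. The max_value of 9 is a placeholder for N and should be replaced with the actual row count of the CSV. dbt_utils.accepted_range is assumed to be the range test in use.

```yaml
# Sketch only: max_value is a hypothetical N, not a confirmed row count.
version: 2

seeds:
  - name: ref_ssb_family_type
    columns:
      - name: code
        tests:
          - not_null
          - unique
      - name: label_no
        tests:
          - not_null
      # label_en is deliberately left untested (blank for the FHI seeds).
      - name: sort_order
        tests:
          - not_null
          - unique
          - dbt_utils.accepted_range:
              min_value: 1
              max_value: 9  # hypothetical N; set to the seed's row count
```

Taken together, not_null + unique + a 1..N range force sort_order to be a gap-free permutation of 1..N, so a deleted or duplicated row in the CSV fails at least one test.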
Validation
cd atlas-data/dbt
uv run --env-file ../ingest/.env dbt test --select source:atlas seeds.*
# Or simply:
uv run --env-file ../ingest/.env dbt test
User confirms 5 new seed-level tests pass.
Phase 3: Update naming-conventions.md
Add every new canonical name to the vocabulary table; add raw FHI column names to the "Never in marts" forbidden list. Update the existing sex row to allow the third value (all) introduced by decode_sex.
Tasks
- 3.1 In docs/stack/naming-conventions.md, update the sex row: One of "male", "female", "all". Mention the {{ decode_sex(col) }} macro.
- 3.2 Add new vocabulary rows to the canonical table:

| Concept | Canonical name | Type | Rules |
| --- | --- | --- | --- |
| Period start year | period_start_year | integer | parsed from period; null if not parseable |
| Period end year | period_end_year | integer | same |
| Single-year age as int | age_int | integer | null for open-ended (105+); use age_min for sortable floor |
| Floor of single-year age (incl. open-ended) | age_min | integer | 105 for 105+ |
| Age band lower bound | age_group_min | integer | parsed from age_group; null for cryptic codes (999A) |
| Age band upper bound | age_group_max | integer | same |
| Family type (SSB FamilieType) | family_type | text | code 001–009; must exist in ref_ssb_family_type |
| Family type label (Norwegian) | family_type_label_no | text | from ref_ssb_family_type |
| Family type label (English) | family_type_label_en | text | from ref_ssb_family_type |
| Household type (SSB HusholdType) | household_type | text | code 0000–0004; must exist in ref_ssb_household_type |
| Household type label | household_type_label_no / _label_en | text | from ref_ssb_household_type |
| Education level, subject's own (SSB Nivaa NUS2000) | education_level | text | codes from ref_ssb_nivaa; only when source measures the subject's own level |
| Education level label | education_level_label_no / _label_en | text | from ref_ssb_nivaa |
| Education level, parents' (FHI UTDANN) | parents_education | text | codes 0–4; must exist in ref_fhi_utdann. Use this when the source stratifies a child outcome by parental education (FHI 360, 794) |
| Parents' education label | parents_education_label_no | text | from ref_fhi_utdann (no English) |
| Immigration category (FHI INNVKAT) | immigration_category | text | from ref_fhi_innvkat |
| Immigration category label | immigration_category_label_no | text | (no English) |
| Housing status (FHI BODD) | housing_status | text | "trangt" / "uoppgitt"; readable as-is, no seed |
| School grade (FHI TRINN) | grade | text | "7" or "10"; readable as-is |

- 3.3 Add to the "Never in marts" forbidden list:

| Seen upstream | Never in marts; use this instead |
| --- | --- |
| kjonn_code, sex_code | sex (decoded via decode_sex) |
| aar_code | period (text) and/or period_start_year / period_end_year (int) |
| alder_code | age_group (text) and/or age_group_min / age_group_max (int) |
| utdann_code (FHI parents' education) | parents_education (+ parents_education_label_no) |
| innvkat_code | immigration_category (+ immigration_category_label_no) |
| bodd_code | housing_status |
| trinn_code | grade |

- 3.4 Add a short subsection "Decoding strategy reference" at the end pointing to dbt/macros/parse_codes.sql, the marts.ref_* seeds, and the completed investigation.
Validation
User reviews docs/stack/naming-conventions.md. Vocabulary entries match what's actually in marts.indicators__* (cross-reference any column from \d marts.indicators__ssb_06083 against the table).
Phase 4: Final full-suite verification
Tasks
- 4.1 dbt build --full-refresh: clean. ✓
- 4.2 Test count: PLAN-002 ended at 290 PASS / 305 TOTAL (dbt test). The final PLAN-003 dbt build reports PASS=406, WARN=15, ERROR=0, TOTAL=421. (Note: dbt build includes seed and source tests that dbt test skips, so the absolute jump is larger than just the new tests added by this plan; the indicator-only test count grew from 290 → 326.) Same 15 warns as baseline. ✓
Validation
cd atlas-data/dbt
uv run --env-file ../ingest/.env dbt build --full-refresh
User confirms the suite is green.
Acceptance Criteria
- Every code column with a corresponding seed has a relationships test back to it (no duplicate accepted_values).
- Decoded enum columns without a seed (sex, housing_status, grade) have an accepted_values test.
- Every _label_no / _label_en column derived from a seed has a not_null test.
- Every parsed integer column (period_*_year, age_int, age_min, age_group_min / _max) has an accepted_range test.
- All five ref_* seeds have schema.yml entries with not_null + unique on code, and tests on sort_order.
- docs/stack/naming-conventions.md lists every new canonical field and every forbidden FHI raw name.
- dbt build runs clean (PASS grows by ~25, ERROR=0, WARN=15 unchanged).
Implementation Notes
- Why relationships and not accepted_values for code columns with a seed. The seed is the source of truth. relationships tests against the seed catch every drift mode (added/removed/renamed codes); accepted_values would be a duplicate code list to hand-edit. One test per concept.
- Why accepted_values for sex / housing_status / grade. These have no seed, so an enumerated list in schema.yml is the only place to assert what the values can be.
- Why label columns are not_null even though the join is left. They're only null when the underlying code isn't in the seed. The relationships test ensures every code is in the seed. So not_null on the label closes the loop: if a not_null label fails, it means the code passed relationships but the join still produced null, a bug worth seeing immediately.
- Why no test that period_start_year <= period_end_year. dbt-utils has expression_is_true that can express this. Worth adding if it's a one-liner, but not blocking.
- What this plan does not do. No model changes (PLAN-002 closed that). No new seeds. No frontend changes. The seeds README from the investigation checklist was done in PLAN-001.
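The optional period-ordering check mentioned in the notes above could be expressed roughly as follows. This is a sketch, not part of the completed plan: it assumes dbt_utils is installed and attaches the test at model level on indicators__ssb_12944 purely as an example.

```yaml
# Sketch only: optional check, not implemented by this plan.
version: 2

models:
  - name: indicators__ssb_12944
    tests:
      # Rows where either year is null are not flagged, because the
      # comparison evaluates to null rather than false.
      - dbt_utils.expression_is_true:
          expression: "period_start_year <= period_end_year"
```

The same entry would need repeating on each model that carries both period columns.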
Files to Modify
Edit:
- atlas-data/dbt/models/indicators/schema.yml: add tests on the 9 touched models
- docs/stack/naming-conventions.md: vocabulary expansion + forbidden list
New:
- atlas-data/dbt/seeds/schema.yml: one seed: entry per ref_*