PLAN: Generate indicators__* schema.yml from manifest dimensions
IMPLEMENTATION RULES: Before implementing this plan, read and follow:
- WORKFLOW.md - The implementation process
- PLANS.md - Plan structure and best practices
Status: Backlog
Goal: Replace hand-edited models/indicators/schema.yml (currently 25 % column coverage) with a generated artefact derived from each source's manifest.yml dimensions + a shared conformed-columns dict + period-pattern templates. After this PLAN ships, adding a new ingest source means authoring dimensions: in manifest.yml once — every column description in marts/api_v1/PostgREST/dbt-docs/MCP is filled automatically.
Investigation: INVESTIGATE-indicators-schema-coverage.md — settles the (b)+(e) decision, resolves Q1-Q5, and captures the 2026-05-10 spike that refined the manifest schema to use derives: + derived_via: + enriched_columns: instead of the original simpler column_map: shape.
Last Updated: 2026-05-10
Prerequisites:
- ✅ +persist_docs enabled in dbt_project.yml (PR #89): what makes the descriptions actually reach pg_description once schema.yml has them.
- ✅ npm run bootstrap Phase 8 (docs) live (PR #89): refreshes target/catalog.json for the dbt docs UI after the schema.yml regenerates.
- ✅ manifest.yml dimensions: block convention shipped (PLAN-007 Phase 2.11) for all 41 ingest sources.
- ✅ apply-api-v1.sh re-emits api_v1 wrapper COMMENTs on every cycle (PLAN-004): unaffected by this work but sits in the same pipeline.
Blocks: None remaining.
Problem summary
PR #89 made marts.* + raw.* PostgREST endpoints visible to AI/MCP agents (open-by-default), and +persist_docs made schema.yml descriptions flow into pg_description for those endpoints. But indicators__* schema.yml is only 25 % covered (72 of 288 columns documented), so most of the indicator surface still reads as naked column names to agents.
INVESTIGATE-indicators-schema-coverage chose option (b) — generate descriptions from manifest.yml dimensions + a shared conformed-columns dict + period-pattern templates. The 2026-05-10 spike verified this works for three indicator-shape examples (SSB / FHI rolling-period / bufdir join-enriched) and refined the manifest schema accordingly.
This PLAN executes that decision in six phases. Each phase has clear validation; the last phase wires the generator into the existing dbt:rebuild and bootstrap flows so future model edits stay consistent automatically.
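The coverage figures above can be checked with a few lines of arithmetic. This is only a worked example using the numbers quoted in this section; the ~317 baseline is the approximate marts.* count cited in Phase 4, so the projected total is approximate too.

```python
# Numbers quoted in the problem summary and Phase 4 of this PLAN.
documented = 72        # indicators__* columns with a description today
total = 288            # all indicators__* columns
baseline_marts = 317   # approximate count of described marts.* columns today

coverage_pct = 100 * documented / total   # 25.0
gap = total - documented                  # 216 undocumented columns
expected_marts = baseline_marts + gap     # 533, i.e. the "~530+" target

print(f"coverage: {coverage_pct:.0f} %, gap: {gap}, expected marts total: ~{expected_marts}")
```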
Phase 1: Author the conformed-columns + period-templates dict
The single hand-authored editorial input that gets reused for every source forever.
Tasks
- 1.1 Create atlas-data/dbt/conformed-column-descriptions.yml with two top-level keys.
- 1.2 Author conformed_columns: with 7 entries, ~1 paragraph each:
  - source_id: the source's primary key in this row, always '<provider>-<table-id>' (e.g. 'ssb-08764'). Useful when consuming fact_kommune_indicators to filter to a specific ingest source.
  - kommune_nr: 4-digit zero-padded kommune code. FK target on dim_kommune.kommune_nr. Filter dim_kommune.is_active = true when joining (per project_dim-kommune-is-active-filter.md memory rule).
  - fylke_nr: 2-digit zero-padded fylke code. FK target on dim_fylke.fylke_nr.
  - contents_label: human-readable label for contents_code (e.g. 'Personer' for Personer). Sourced from upstream classification metadata where available, blank otherwise.
  - value: the indicator's numeric value (count, percentage, ratio, etc.). Unit conventions vary per source; cross-source aggregation requires reading each source's dimensions: notes.
  - status: upstream suppression marker (e.g. '..' for SSB-style suppressed cells, null for normal rows).
  - updated_at: Atlas-side timestamp of when the row was loaded (provenance), not the upstream's "data updated" timestamp.
- 1.3 Author period_templates: with 2 entries for v1 (annual, rolling_period_3yr); future quarterly, monthly, etc. are added incrementally as sources of those shapes land.

    period_templates:
      annual:
        year: "Calendar year, 4-digit integer."
        period: "Same as year (single-year sources). Pass-through for compatibility with rolling-period sources."
        period_start_year: "Equals year (single-year period)."
        period_end_year: "Equals year (single-year period)."
      rolling_period_3yr:
        period: "Raw 3-year period string from upstream (e.g. '2022_2024')."
        period_start_year: "First year of the rolling 3-year window."
        period_end_year: "Last year of the rolling 3-year window."
        year: "Midpoint year of the window. Use this for cross-source temporal joins; preserves the join-on-year invariant Atlas's fact tables rely on."
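To make the rolling_period_3yr semantics concrete, here is a small sketch of how the four documented columns relate to one upstream period string. The real derivation lives in each indicator model's SQL; the function name, the '_' separator and the integer-mean midpoint are assumptions made for illustration.

```python
from typing import NamedTuple


class RollingPeriod(NamedTuple):
    period: str              # raw upstream string, e.g. "2022_2024"
    period_start_year: int   # first year of the window
    period_end_year: int     # last year of the window
    year: int                # midpoint, used for cross-source joins


def parse_rolling_period(period: str) -> RollingPeriod:
    """Split a 'YYYY_YYYY' period string into the four documented columns.

    Assumption: the separator is '_' and the midpoint is the integer mean
    of the start and end years (exact for a 3-year window).
    """
    start_text, end_text = period.split("_")
    start, end = int(start_text), int(end_text)
    if end < start:
        raise ValueError(f"period ends before it starts: {period!r}")
    return RollingPeriod(period, start, end, (start + end) // 2)


window = parse_rolling_period("2022_2024")
print(window.period_start_year, window.period_end_year, window.year)
```

For the window '2022_2024' this yields start 2022, end 2024 and midpoint year 2023, which is the value a join on year against an annual source would use.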
Validation
# File parses as valid YAML
uv run python -c "import yaml; print(yaml.safe_load(open('atlas-data/dbt/conformed-column-descriptions.yml')).keys())"
# → dict_keys(['conformed_columns', 'period_templates'])
# Conformed columns present (7 entries)
uv run python -c "import yaml; print(len(yaml.safe_load(open('atlas-data/dbt/conformed-column-descriptions.yml'))['conformed_columns']))"
# → 7
# Period templates present (2 for v1)
uv run python -c "import yaml; print(list(yaml.safe_load(open('atlas-data/dbt/conformed-column-descriptions.yml'))['period_templates'].keys()))"
# → ['annual', 'rolling_period_3yr']
Done when
- File exists, parses, contains both keys, has the 7 conformed entries + 2 period templates.
Phase 2: Build the generator with model-family registry
The generator is the executable form of the decisions in INVESTIGATE-indicators-schema-coverage.
Tasks
- 2.1 Create atlas-data/dbt/scripts/generate-indicator-descriptions.py. Single entry point: python scripts/generate-indicator-descriptions.py [--check]. The --check flag runs in dry-run-diff mode (used by the CI gate in Phase 5).
- 2.2 Implement the model-family registry. Initially registers indicators__* only, but the abstraction supports adding supply__* later as a one-line config change:

    MODEL_FAMILIES = {
        "indicators": {
            "model_prefix": "indicators__",
            "schema_yml_path": "models/indicators/schema.yml",
            "conformed_dict_key": "conformed_columns",  # what to read from the dict
        },
        # Future: "supply": { ... }
    }
- 2.3 Read inputs:
  - Every atlas-data/ingest/src/sources/*/manifest.yml (parse dimensions: with derives: / derived_via: per dim, plus enriched_columns: at top level if present).
  - atlas-data/dbt/conformed-column-descriptions.yml (Phase 1 output).
- 2.4 For each source, expand each dim's column list:
  - derives: { col: ~ } → use the dim's "meaning. Value format: <value_format>. <notes>" as the column's description.
  - derives: { col: "..." } → use the explicit string.
  - derived_via: <template> → look up period_templates[<template>] and emit one entry per derived column with the template's description.
- 2.5 For each source, emit conformed-column descriptions from the shared dict (one entry per conformed-col that the indicator model actually has).
- 2.6 For each source, emit enriched_columns: descriptions verbatim.
- 2.7 Write the generated content to models/indicators/schema.yml as a single consolidated file (Atlas's dbt-osmosis convention). Include a comment header marking it generated:

    # GENERATED FILE: do not edit by hand.
    # Source of truth: atlas-data/ingest/src/sources/<id>/manifest.yml
    # Generator: atlas-data/dbt/scripts/generate-indicator-descriptions.py
    # Run `npm run dbt:rebuild` (or bootstrap Phase schema-gen) to regenerate.
- 2.8 Implement --check mode: regenerate to a temp file, diff against the committed models/indicators/schema.yml, exit non-zero if they differ. This is the deterministic-output property the CI gate (Phase 5) relies on.
- 2.9 Preserve existing data_tests: and data_type: blocks per column. The generator only writes description: lines; tests and types stay in the committed schema.yml. (See [Q1] in the INVESTIGATE: descriptions are the only generator output.)
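The expansion rules in task 2.4 can be sketched as follows. This is an illustrative outline under stated assumptions, not the generator itself: the function names are hypothetical, the dim fields (meaning, value_format, notes) follow the manifest convention referenced above, and PERIOD_TEMPLATES is an excerpt standing in for the Phase 1 dict.

```python
from typing import Any, Optional

# Excerpt of period_templates from Phase 1, inlined so the sketch is self-contained.
PERIOD_TEMPLATES: dict[str, dict[str, str]] = {
    "annual": {
        "year": "Calendar year, 4-digit integer.",
        "period": "Same as year (single-year sources). Pass-through for "
                  "compatibility with rolling-period sources.",
        "period_start_year": "Equals year (single-year period).",
        "period_end_year": "Equals year (single-year period).",
    },
}


def describe_dim(dim: dict[str, Any]) -> str:
    """Build '<meaning>. Value format: <value_format>. <notes>' from one dim."""
    parts = [str(dim["meaning"]).rstrip(". ") + "."]
    if dim.get("value_format"):
        parts.append(f"Value format: {dim['value_format']}.")
    if dim.get("notes"):
        parts.append(str(dim["notes"]))
    return " ".join(parts)


def expand_dim(dim: dict[str, Any],
               templates: dict[str, dict[str, str]]) -> dict[str, str]:
    """Return {column: description} for one manifest dim."""
    columns: dict[str, str] = {}
    explicit: Optional[str]
    # derives: { col: ~ } falls back to the dim's own text; a string wins.
    for column, explicit in (dim.get("derives") or {}).items():
        columns[column] = explicit if explicit is not None else describe_dim(dim)
    # derived_via: <template> contributes one entry per template column.
    template_name = dim.get("derived_via")
    if template_name is not None:
        if template_name not in templates:
            raise KeyError(f"undefined derived_via template: {template_name}")
        columns.update(templates[template_name])
    return columns


region = {"code": "Region", "meaning": "Kommune", "value_format": "4-digit code",
          "derives": {"kommune_nr": None}}
print(expand_dim(region, PERIOD_TEMPLATES))
```

Raising on an unknown template here mirrors the Phase 3.4 rule that an undefined derived_via: reference should stop the pipeline rather than emit a blank description.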
Validation
cd atlas-data/dbt
uv run python scripts/generate-indicator-descriptions.py
# Should emit: models/indicators/schema.yml (regenerated)
uv run python scripts/generate-indicator-descriptions.py --check
# After fresh generation, --check should be a no-op (exit 0).
# Idempotent — re-running produces the same output
uv run python scripts/generate-indicator-descriptions.py
git diff models/indicators/schema.yml
# → no diff
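A minimal sketch of the comparison behind --check, assuming the generator can produce its output as a string. It compares in memory rather than through a temp file, which is a simplification of task 2.8; check_in_sync is a hypothetical name.

```python
import difflib
import sys
import tempfile
from pathlib import Path


def check_in_sync(generated: str, committed_path: Path) -> int:
    """Return 0 if the committed file equals the generated text, else 1.

    On drift, a unified diff is written to stderr so CI logs show what changed.
    """
    committed = ""
    if committed_path.exists():
        committed = committed_path.read_text(encoding="utf-8")
    if committed == generated:
        return 0
    diff = difflib.unified_diff(
        committed.splitlines(keepends=True),
        generated.splitlines(keepends=True),
        fromfile=str(committed_path),
        tofile="generated",
    )
    sys.stderr.writelines(diff)
    return 1


# Demonstration against a throwaway file instead of the real schema.yml.
with tempfile.TemporaryDirectory() as tmp:
    schema = Path(tmp) / "schema.yml"
    schema.write_text("version: 2\n", encoding="utf-8")
    print(check_in_sync("version: 2\n", schema))              # 0: in sync
    print(check_in_sync("version: 2\nmodels: []\n", schema))  # 1: drift
```

The return value is meant to be passed to sys.exit() by the script's entry point, so the Phase 5 gate only has to look at the exit code.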
Done when
- Generator runs against the existing 23 manifests + the conformed dict; produces a valid schema.yml.
- --check mode is deterministic (re-running on unchanged inputs is a no-op).
- Generator preserves all data_tests: and data_type: blocks; only writes description: lines.
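The "only writes description: lines" rule can be sketched as a merge over one parsed schema.yml model entry. This assumes the usual dbt layout (a model dict with a columns list of dicts keyed by name); merge_descriptions is a hypothetical helper, and appending columns that are missing from the entry is an assumption rather than something this PLAN specifies.

```python
import copy
from typing import Any


def merge_descriptions(existing_model: dict[str, Any],
                       generated: dict[str, str]) -> dict[str, Any]:
    """Return a copy of one model entry with description values replaced.

    Every other key on a column (data_tests, data_type, ...) is left untouched.
    Columns present in `generated` but absent from the entry are appended.
    """
    merged = copy.deepcopy(existing_model)
    columns = merged.setdefault("columns", [])
    by_name = {column["name"]: column for column in columns}
    for name, description in generated.items():
        if name in by_name:
            by_name[name]["description"] = description
        else:
            columns.append({"name": name, "description": description})
    return merged


model = {
    "name": "indicators__ssb_08764",
    "columns": [
        {"name": "kommune_nr", "data_type": "text", "data_tests": ["not_null"]},
    ],
}
merged = merge_descriptions(model, {"kommune_nr": "4-digit zero-padded kommune code."})
print(merged["columns"][0])
```

Working on a deep copy keeps the parsed input intact, which makes it easy to compare before and after when debugging a drift report.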
Phase 3: Migrate all 23 manifests to the new schema
Bulk authorship. Most mechanical; per-source effort is low because manifests already have dimensions: blocks (Phase 2.11 of PLAN-007).
Tasks
- 3.1 Author a per-source migration helper script if needed (or do it by hand). For each source:
  - Inspect the indicator model's SQL (models/indicators/indicators__<id>.sql) to identify which schema columns derive from which manifest dim.
  - Add derives: to each dim that maps cleanly 1:1 (derives: { <col>: ~ }).
  - For period dims (AAR, Tid), use derived_via: rolling_period_3yr or derived_via: annual instead.
  - For dims that derive into multiple columns NOT covered by a template, use derives: { col1: "...", col2: "..." } inline.
- 3.2 Migration sweep: work through all 23 sources. Estimated time per source: 2-5 min for the SSB pattern (clean 1:1), 5-10 min for FHI rolling-period sources, 15-20 min for bufdir-barnefattigdom (the bespoke shape).
- 3.3 For bufdir-barnefattigdom specifically, add enriched_columns: at top level for indicator_slug, indicator_group_slug, indicator_name, indicator_title, link_text (the join-derived columns from the bufdir_indicator_alias seed).
- 3.4 Update atlas-data/dbt/scripts/build_sources_seed.py to validate derives: / derived_via: / enriched_columns: shape on read; refuse to emit the seed if a manifest references an undefined derived_via: template or has malformed derives:.
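The shape validation in task 3.4 might look roughly like this. It is a sketch with hypothetical names, operating on an already-parsed manifest dict; treating enriched_columns: as a mapping from column name to description string is an assumption about the manifest schema.

```python
from typing import Any


def validate_manifest(manifest: dict[str, Any], templates: set[str]) -> list[str]:
    """Return human-readable errors for one parsed manifest (empty list = valid)."""
    errors: list[str] = []
    for dim in manifest.get("dimensions") or []:
        code = dim.get("code", "<unknown>")
        derives = dim.get("derives")
        derived_via = dim.get("derived_via")
        if derives is None and derived_via is None:
            errors.append(f"dim {code}: needs derives: or derived_via:")
        if derives is not None:
            if not isinstance(derives, dict) or not derives:
                errors.append(f"dim {code}: derives: must be a non-empty mapping")
            else:
                for column, text in derives.items():
                    # ~ (None) means "use the dim's own text"; otherwise a string.
                    if text is not None and not isinstance(text, str):
                        errors.append(f"dim {code}: derives.{column} must be ~ or a string")
        if derived_via is not None and derived_via not in templates:
            errors.append(f"dim {code}: undefined derived_via template {derived_via!r}")
    enriched = manifest.get("enriched_columns")
    if enriched is not None:
        well_formed = isinstance(enriched, dict) and all(
            isinstance(text, str) and text.strip() for text in enriched.values()
        )
        if not well_formed:
            errors.append("enriched_columns: must map column names to non-empty strings")
    return errors


known_templates = {"annual", "rolling_period_3yr"}
bad = {"dimensions": [{"code": "Tid", "derived_via": "quarterly"}, {"code": "Region"}]}
for message in validate_manifest(bad, known_templates):
    print(message)
```

Collecting all errors before failing lets a contributor fix a whole manifest in one pass instead of rerunning the seed build once per mistake.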
Validation
# All 23 manifests parse + validate against the schema (subshell keeps the working directory unchanged)
(cd atlas-data/dbt && uv run python scripts/build_sources_seed.py)
# → no errors
# Spot-check, run from the repo root: every source has either `derives:` or `derived_via:` on every dim
python3 -c "
import yaml, glob
for path in glob.glob('atlas-data/ingest/src/sources/*/manifest.yml'):
    with open(path) as f:
        m = yaml.safe_load(f)
    for dim in m.get('dimensions', []):
        if 'derives' not in dim and 'derived_via' not in dim:
            print(f'MISSING: {path}: dim {dim[\"code\"]}')
"
# → no MISSING output
Done when
- All 23 manifests have derives: or derived_via: on every dim.
- bufdir-barnefattigdom's manifest has enriched_columns: for the 5 join-derived columns.
- build_sources_seed.py validates the new fields and refuses malformed input.
Phase 4: Regenerate models/indicators/schema.yml + verify
The payoff phase — schema.yml gets regenerated against the now-complete manifests, descriptions flow through to PostgREST/MCP.
Tasks
- 4.1 Run python scripts/generate-indicator-descriptions.py to regenerate models/indicators/schema.yml. Commit the result.
- 4.2 Run npm run dbt:rebuild to materialise models with the new schema.yml content; +persist_docs issues COMMENT ON COLUMN for every newly-described column.
- 4.3 Verify pg_description coverage went up by curling the marts.* spec. Expect: ~317 → ~530+ (the 216-column gap closes).

    curl -sS -H 'Accept-Profile: marts' "http://api-atlas.localhost/" | jq '
      [.definitions[] | .properties // {} | to_entries[] |
       select(.value.description != null and .value.description != "")] | length
    '
- 4.4 Spot-check three sources end-to-end:
  - marts.indicators__ssb_08764: every column has a description; derived columns from Region / Tid / ContentsCode show meaning + value_format.
  - marts.indicators__fhi_mobbing: period_start_year / period_end_year / year show the rolling_period_3yr template content.
  - marts.indicators__bufdir_barnefattigdom: indicator_slug, indicator_name, etc. show the enriched-column descriptions.
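For scripted checks, the jq filter from task 4.3 translates directly to Python. The sketch below counts described columns in an already-fetched spec dict; count_described_columns is a hypothetical helper and the sample spec is made up for illustration, not real Atlas output.

```python
from typing import Any


def count_described_columns(spec: dict[str, Any]) -> int:
    """Count properties with a non-null, non-empty description.

    Mirrors the jq filter: every entry under definitions.*.properties whose
    description is neither null nor "".
    """
    total = 0
    for definition in spec.get("definitions", {}).values():
        for prop in (definition.get("properties") or {}).values():
            if prop.get("description"):
                total += 1
    return total


# Made-up sample shaped like the PostgREST OpenAPI root document.
sample_spec = {
    "definitions": {
        "indicators__ssb_08764": {
            "properties": {
                "kommune_nr": {"type": "string", "description": "4-digit zero-padded kommune code."},
                "value": {"type": "number", "description": ""},
                "status": {"type": "string"},
            }
        },
        "no_properties_here": {},
    }
}
print(count_described_columns(sample_spec))  # 1
```

Running it before and after Phase 4 on the real spec gives the same before/after pair the curl command reports.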
Validation
# Coverage curl (target: ~530+ descriptions in marts.*)
curl -sS -H 'Accept-Profile: marts' "http://api-atlas.localhost/" | jq '[.definitions[] | .properties // {} | to_entries[] | select(.value.description != null and .value.description != "")] | length'
# Spot-check ssb-08764 column kommune_nr (was: blank)
curl -sS -H 'Accept-Profile: marts' "http://api-atlas.localhost/" | jq -r '.definitions.indicators__ssb_08764.properties.kommune_nr.description'
# → "4-digit zero-padded kommune code. FK target on dim_kommune.kommune_nr..."
# Spot-check fhi-mobbing year (was: blank, now: rolling-period midpoint)
curl -sS -H 'Accept-Profile: marts' "http://api-atlas.localhost/" | jq -r '.definitions.indicators__fhi_mobbing.properties.year.description'
# → "Midpoint year of the window. Use this for cross-source temporal joins..."
# Spot-check bufdir indicator_slug (was: missing entirely)
curl -sS -H 'Accept-Profile: marts' "http://api-atlas.localhost/" | jq -r '.definitions.indicators__bufdir_barnefattigdom.properties.indicator_slug.description'
# → "Stable slug from bufdir_indicator_alias seed..."
Done when
- Marts.* description count rises from ~317 to ~530+ (the 216-column gap closes).
- Spot-checks across three source-shapes return the expected descriptions.
- No regression in api_v1.* coverage (still 104 columns documented).
Phase 5: Extend check-osmosis.sh with the diff gate
Locks in the constraint that committed schema.yml must match the generator's output — prevents future drift.
Tasks
- 5.1 Decide: extend check-osmosis.sh itself, or add a sibling check-indicators-schema.sh? Recommendation: sibling, which keeps responsibilities separate. Osmosis checks "every column has a description"; the new gate checks "the indicator schema.yml matches what the generator would emit".
- 5.2 Implement the sibling. ~30 lines of bash. Runs python scripts/generate-indicator-descriptions.py --check and exits non-zero if there's a diff.
- 5.3 Wire it into the same CI / pre-commit path that runs check-osmosis.sh today.
Validation
# After a clean Phase 4: check should pass
./check-indicators-schema.sh
# → ✓ models/indicators/schema.yml matches generator output
# Simulate drift by tampering with one description in models/indicators/schema.yml
sed -i.bak 's/Calendar year, 4-digit integer\./tampered/' models/indicators/schema.yml
# Check should fail
./check-indicators-schema.sh
# → ✗ models/indicators/schema.yml is out of sync with the generator
# → diff: ...
# Restore the committed file afterwards
mv models/indicators/schema.yml.bak models/indicators/schema.yml
Done when
- check-indicators-schema.sh exists, exits 0 on a clean state and non-zero on drift.
- Wired into the same CI / pre-commit path as check-osmosis.sh.
Phase 6: Wire schema-gen phase into bootstrap + dbt:rebuild
The "fully automatic" guarantee — every model rebuild regenerates the schema, no contributor checklist item.
Tasks
- 6.1 Add a new schema-gen PhaseId to atlas-data/ingest/scripts/bootstrap.ts. Implement phaseSchemaGen() that spawns python scripts/generate-indicator-descriptions.py in DBT_DIR.
- 6.2 Insert schema-gen between seed and run in ALL_PHASES (since dbt run parses schema.yml as part of model compilation). Bootstrap is now 9 phases:

    const ALL_PHASES: PhaseId[] = [
      "migrate", "refresh", "ingest", "seed", "schema-gen", "run", "api", "test", "docs",
    ];
- 6.3 Update the package.json dbt:rebuild alias to include schema-gen:

    "dbt:rebuild": "tsx --env-file=.env scripts/bootstrap.ts --only seed,schema-gen,run,api,test,docs"
- 6.4 Update the setup.md phase table: 8 → 9 rows; row 5 is the new schema-gen (it sits between seed and run).
- 6.5 Update the ingest-modules.md trigger matrix: every "edited a manifest dim" / "edited a model SQL" row now includes schema-gen as part of dbt:rebuild. Brief description-flow paragraph extension: "After PLAN-indicators-schema-generator ships, the schema.yml regenerates as Phase schema-gen on every dbt:rebuild; contributors never edit it directly."
Validation
# Dry-run shows 9 phases
npm run bootstrap -- --dry-run | grep "phases"
# → "9/9 phases (migrate → refresh → ingest → seed → schema-gen → run → api → test → docs)"
# --only schema-gen works in isolation
npm run bootstrap -- --only schema-gen
# → regenerates models/indicators/schema.yml; no other phase fires
# Full dbt:rebuild includes the schema-gen step
npm run dbt:rebuild
# → completes green; schema.yml regenerated; pg_description refreshed
Done when
- bootstrap lists 9 phases including schema-gen.
- dbt:rebuild runs schema-gen between seed and run.
- setup.md + ingest-modules.md reflect the new flow.
- No regression: all existing tests + the new diff gate stay green.
Acceptance criteria
- atlas-data/dbt/conformed-column-descriptions.yml exists, parses, has 7 conformed columns + 2 period templates.
- atlas-data/dbt/scripts/generate-indicator-descriptions.py runs deterministically, supports --check, has model-family registry.
- All 23 indicator manifests have derives: or derived_via: on every dim; bufdir-barnefattigdom has enriched_columns:.
- models/indicators/schema.yml regenerates from those inputs without manual edits.
- Marts.* PostgREST spec gains ~216 column descriptions (the kommune_nr description on marts.indicators__ssb_08764 etc. become non-null).
- check-indicators-schema.sh runs in CI and gates against drift.
- bootstrap Phase 5 (schema-gen) wired in; dbt:rebuild updated.
- Contributor adding a new ingest source authors only the manifest.yml dimensions: block + (if needed) derived_via: / enriched_columns:. Never touches models/indicators/schema.yml.
Out of scope
- Generalising to supply__*, dim_*, fact_*, mart_*: INVESTIGATE Q5 punted to v2; the model-family registry from Phase 2 keeps the door open.
- Per-column overrides for already-documented sources: INVESTIGATE Q4 chose merge-into-manifest over preserve-hand-edits. Hand-tuning a single column's wording happens by editing manifest.yml.
- Documenting Atlas's period-merging policy outside the templates: the period templates ARE the policy; if extra prose framing is needed later, it's a small contributor-doc PR, not a separate INVESTIGATE.
- Making conformed-column descriptions queryable via PostgREST (the v1 trade-off in Q2). If wanted later, the generator emits a derived seed CSV as a side-effect.
Files to modify
New:
- atlas-data/dbt/conformed-column-descriptions.yml: Phase 1.
- atlas-data/dbt/scripts/generate-indicator-descriptions.py: Phase 2.
- atlas-data/dbt/check-indicators-schema.sh: Phase 5.
Updated (atlas-data):
- atlas-data/ingest/src/sources/<id>/manifest.yml: 23 source files, Phase 3 manifest migration.
- atlas-data/dbt/models/indicators/schema.yml: Phase 4 regeneration. Becomes a generated artefact.
- atlas-data/dbt/scripts/build_sources_seed.py: Phase 3.4 validation update.
- atlas-data/ingest/scripts/bootstrap.ts: Phase 6.1/6.2 (new schema-gen phase).
- atlas-data/ingest/package.json: Phase 6.3 (dbt:rebuild alias updated).
Updated docs:
- website/docs/contributors/setup.md: Phase 6.4 (phase table 8 → 9 rows).
- website/docs/contributors/ingest-modules.md: Phase 6.5 (trigger matrix + description-flow paragraph).
Cross-references
- INVESTIGATE-indicators-schema-coverage.md — the parent INVESTIGATE; this PLAN executes its (b)+(e) decision.
- PR #89 — bootstrap Phase 8 + persist_docs — the prerequisite that makes pg_description the actual sink for these descriptions.
- PLAN-007 Phase 2.11: introduced the manifest.yml dimensions: block; this PLAN consumes it.
- PLAN-004 (postgrest api_v1 wrapper): the api_v1 wrapper generator; the pattern this PLAN's generator mirrors (deterministic regeneration, CI gate).
- atlas-data/dbt/check-osmosis.sh: the existing description-coverage gate; the new check-indicators-schema.sh sits alongside.
- atlas-data/dbt/dbt_project.yml: +persist_docs config (PR #89) that completes the chain.
Implementation notes
- Phase order is a hard constraint. Phase 1 must precede Phase 2 (the generator reads the dict). Phases 2 and 3 can interleave, but Phase 4's verification needs both done. Phase 5 needs Phase 4's clean state to baseline against. Phase 6 needs everything before it to be stable.
- Single PR vs split — the natural shape is one PR for Phases 1+2+5 (the tooling), one PR for Phase 3+4 (the migration + verification), one PR for Phase 6 (the integration). Smaller PRs are reviewable; the tooling PR is the load-bearing one.
- Phase 3's manifest migration is the bulk of the work. Don't underestimate it. Plan for ~3-5 hours wall-time across 23 sources, more for bufdir-barnefattigdom.
- Don't break api_v1 coverage. apply-api-v1.sh emits its own COMMENT ON for api_v1.* views; this PLAN doesn't touch that. Sanity-check after Phase 4 that the 104 api_v1 column descriptions are still in place.
- The CI gate (Phase 5) is what makes the system self-healing. Without it, a contributor could hand-edit models/indicators/schema.yml and the file silently drifts. With it, the next CI run fails loudly with a diff.
- Test the generator's idempotency early. Phase 2's --check mode is the deterministic-output property the gate relies on. If two consecutive runs produce different output (e.g. dict iteration order, YAML emit order), the gate is unreliable.
- Forkability: atlas-frontend/ is unaffected; this PLAN is entirely on the data side. No customer-frontend changes.
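On the idempotency note above: one way to rule out ordering drift is to sort at emit time. The sketch below is an assumption-laden illustration, not the planned emitter. It uses only the standard library, sorts models and columns alphabetically (the real generator may prefer each model's own column order), and quotes descriptions with json.dumps, relying on JSON strings being valid YAML double-quoted scalars.

```python
import json


def render_schema(models: dict[str, dict[str, str]]) -> str:
    """Render {model: {column: description}} as schema.yml-style text.

    Output depends only on the contents of `models`, never on insertion order,
    because both levels are iterated in sorted order.
    """
    lines = ["version: 2", "models:"]
    for model_name in sorted(models):
        lines.append(f"  - name: {model_name}")
        lines.append("    columns:")
        for column_name in sorted(models[model_name]):
            description = json.dumps(models[model_name][column_name], ensure_ascii=False)
            lines.append(f"      - name: {column_name}")
            lines.append(f"        description: {description}")
    return "\n".join(lines) + "\n"


first = {"indicators__b": {"year": "Calendar year.", "value": "Numeric value."},
         "indicators__a": {"value": "Numeric value."}}
# Same content, different insertion order.
second = {"indicators__a": {"value": "Numeric value."},
          "indicators__b": {"value": "Numeric value.", "year": "Calendar year."}}
print(render_schema(first) == render_schema(second))  # True
```

A test like the final comparison is cheap to add in Phase 2 and catches the dict-order and emit-order failure modes before the CI gate starts depending on them.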