Investigate: closing the schema.yml description gap on indicators__* models
IMPLEMENTATION RULES: Before implementing this plan, read and follow:
- WORKFLOW.md - The implementation process
- PLANS.md - Plan structure and best practices
Status: Backlog
Goal: Decide a sustainable shape for column-level descriptions across the 23 models/indicators/indicators__<source>.sql per-source pass-through models — currently 25 % covered (72 of 288 columns documented), hard to fill by hand, repetitive across sources, but increasingly load-bearing now that +persist_docs (PR #89) pushes schema.yml descriptions into pg_description and PostgREST → MCP agents see them.
Last Updated: 2026-05-10 (spike findings + manifest schema refined to derives: + period-pattern templates; original 2026-05-09 framing preserved)
Origin: PR #89 enabled +persist_docs: { relation: true, columns: true } on dbt models + seeds. Live verification surfaced that only 267 of ~566 marts.* columns gained descriptions — the missing 249 are mostly indicators__* per-source pass-throughs. Hand-counted: of 288 total column entries in models/indicators/schema.yml, only 72 have a description: line (25 %). The marts.* API surface is now visible to AI agents but most of the indicators layer reads as naked column names.
The desired end-state — single editorial input, every consumer fed automatically
The intent of this work, plain: adding a new dataset should mean adding the source's editorial content in exactly one place — manifest.yml — and having that flow automatically into every system that needs it. No per-surface re-authorship. No "did you remember to update models/indicators/schema.yml too?" checklist item.
The contributor's workflow before vs after:
| BEFORE (today) | AFTER (this INVESTIGATE's outcome) |
|---|---|
| 1. Scaffold folder + index.ts | 1. Scaffold folder + index.ts |
| 2. Write prose README.md | 2. Write prose README.md |
| 3. sources:bootstrap-manifest | 3. sources:bootstrap-manifest |
| 4. sources:fill-manifest-todos | 4. sources:fill-manifest-todos |
| 5. Hand-author dimensions: in manifest.yml | 5. Hand-author dimensions: in manifest.yml ← SAME (still editorial) |
| 6. Add ingest:<id> to package.json | 6. Add ingest:<id> to package.json |
| 7. Hand-edit schema.yml column descriptions (skipped 75 % of the time today) | (no equivalent step) |
| 8. ingest:<id> + dbt:rebuild | 7. ingest:<id> + dbt:rebuild ← schema.yml regenerates inside dbt:rebuild |
The integration point: a new schema-gen phase inside npm run dbt:rebuild (and npm run bootstrap), running after seed and before run (since dbt run parses schema.yml as part of model compilation):
seed → schema-gen → run → api → test → docs
↑ NEW
Generator inputs (refined post-spike, see "Spike findings" below):
- Every atlas-data/ingest/src/sources/<id>/manifest.yml. Three sections feed the generator:
  - dimensions: blocks with derives: per dim (the dim's editorial content fanned out across the schema columns it produces; handles 1:1 and 1:N derivations cleanly)
  - derived_via: <template_name>, which optionally replaces derives: on a dim to reference a shared period-pattern template
  - enriched_columns: at the top level, for join-derived columns that don't trace back to an upstream dim (e.g. bufdir's indicator_slug from the bufdir_indicator_alias seed)
- One shared atlas-data/dbt/conformed-column-descriptions.yml carrying:
  - conformed_columns: (~7 entries: source_id, kommune_nr, fylke_nr, contents_label, value, status, updated_at)
  - period_templates: reusable definitions for common period shapes (annual, rolling_period_3yr, future quarterly, monthly). Each template defines the description for every derived column it produces. Multiple sources reference the same template via derived_via:.
Generator output: models/indicators/schema.yml, regenerated deterministically.
Editorial work that stays human, vs work that disappears:
| Surface | Editorial input today | Editorial input after this PLAN |
|---|---|---|
| /data/sources/<id> catalog card (description + dimensions) | manifest.yml (description + dimensions:) | Same (one place) |
| mart_meta_dimensions editorial pass-through | manifest.yml dimensions: (via Phase 2.11 _sources_dimensions seed) | Same (one place) |
| models/indicators/schema.yml column descriptions | Hand-edited per source (skipped 75 % of the time) | Generated from manifest + conformed dict |
| pg_description (Postgres COMMENT ON) | Already auto via +persist_docs (PR #89), but only for descriptions that exist in schema.yml | Now actually populated: generated schema.yml feeds it |
| PostgREST OpenAPI spec | Auto from pg_description | Auto from pg_description |
| atlas-frontend /data/[schema]/[table] column descriptions | Auto from PostgREST spec | Auto from PostgREST spec |
| dbt docs catalog.json | Auto from dbt docs generate (Phase 8 of bootstrap) | Auto from dbt docs generate |
| MCP tool definitions for AI agents | Auto from pg_description / dbt manifest | Auto from pg_description / dbt manifest |
| Future developer-atlas.helpers.no API reference | Auto from PostgREST spec via Scalar (PLAN-008 Phase 1) | Auto from PostgREST spec via Scalar |
Net editorial input per new dataset: one dimensions: block in manifest.yml (3-5 entries, written from upstream API knowledge — the contributor needs to know the data anyway). That single input feeds eight downstream surfaces. The generator + persist_docs + apply-api-v1.sh + dbt docs generate are the plumbing that fans it out.
Runtime fan-out (post-generator, post-PR-#89):
manifest.yml dimensions + conformed-cols dict
  → generator (runs in dbt:rebuild phase "schema-gen")
  → models/indicators/schema.yml (regenerated, committed)
  → dbt run (+persist_docs)
  → pg_description (Postgres COMMENT ON)
  → fans out to:
    - PostgREST OpenAPI spec
    - /data catalog UI
    - dbt docs catalog JSON
    - MCP tools for AI agents
    - api-types.ts (codegen)
Five real-world surfaces, one editorial input, one generator run.
What this still doesn't auto-fix (worth being explicit so the "fully automatic" claim doesn't oversell):
- The dimensions: block still needs human authorship. A generator can't invent what EUskala60 means or what region-code prefixes signify. But it's once, in one file, in the contributor's natural catalog-authorship workflow.
- The conformed-columns dict needs first-time authorship (~7 entries × 1 paragraph each = ~30 min, one time).
- New atlas-conformed columns added in the future (rare; Atlas's mart shape has been stable since PLAN-001) need an entry in the dict.
- Models OUTSIDE indicators (dim_*, fact_*, mart_*, supply__* for now; see [Q3]) still have hand-written schema.yml. Generator scope is indicators__* for v1.
The actual gap
Numbers as of 2026-05-09:
$ wc -l atlas-data/dbt/models/indicators/schema.yml
1535 lines
$ grep -c "^ - name:" indicators/schema.yml # column entries
288
$ grep -c " description:" indicators/schema.yml # column-level descriptions
72
→ 25 % column coverage
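The same count can be broken down per model once the file is parsed. The sketch below is illustrative only: it assumes schema.yml has already been loaded into a dict (for example with PyYAML's `yaml.safe_load`) and follows dbt's standard `models:` → `columns:` → `description:` layout. The function name `column_coverage` and the sample data are hypothetical.

```python
from __future__ import annotations

from typing import Any


def column_coverage(schema: dict[str, Any]) -> dict[str, tuple[int, int]]:
    """Return {model_name: (documented, total)} for a parsed dbt schema.yml."""
    coverage: dict[str, tuple[int, int]] = {}
    for model in schema.get("models", []):
        columns = model.get("columns", [])
        # A column counts as documented only if its description is non-blank.
        documented = sum(
            1 for col in columns if (col.get("description") or "").strip()
        )
        coverage[model["name"]] = (documented, len(columns))
    return coverage


# Hypothetical excerpt, shaped like the parsed output of a schema.yml.
sample_schema = {
    "models": [
        {
            "name": "indicators__ssb_08764",
            "columns": [
                {"name": "region_code", "description": "Upstream region code."},
                {"name": "year", "description": "Calendar year."},
                {"name": "value"},
            ],
        },
        {
            "name": "indicators__ssb_06913",
            "columns": [{"name": "value"}, {"name": "status", "description": ""}],
        },
    ]
}

per_model = column_coverage(sample_schema)
documented = sum(d for d, _ in per_model.values())
total = sum(t for _, t in per_model.values())
print(f"{documented}/{total} columns documented ({documented / total:.0%})")
```

Treating an empty `description:` as undocumented is a choice made here; the grep above would count it as documented.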
Per-model breakdown for the SSB indicator models (23 indicators models total across SSB / FHI / Bufdir; 9 SSB ones shown):
indicators__ssb_08764 3/10 documented ← the only one with any coverage
indicators__ssb_06083 0/13
indicators__ssb_06913 0/10
indicators__ssb_06944 0/13
indicators__ssb_09429 0/14
indicators__ssb_12063 0/10
indicators__ssb_12131 0/10
indicators__ssb_12132 0/10
indicators__ssb_13995 0/10
FHI / Bufdir / SSB-crime indicators — same pattern, mostly zero.
The 3 columns that ARE documented on indicators__ssb_08764 are region_code, year, contents_code — the source-specific ones (mapped from upstream Region, Tid, ContentsCode dimensions). The 7 atlas-conformed columns (source_id, kommune_nr, fylke_nr, contents_label, value, status, updated_at) are undocumented even there. Why this matters: it confirms the manifest-dimensions → indicator-columns mapping is the right shape — every existing description ALREADY corresponds 1:1 to a manifest dimension entry. Migration concern in [Q4] is therefore well-bounded.
Spike findings (2026-05-10) — manifest schema needs derives: + enriched_columns:
Did a 30-min thought-experiment spike against three indicator-shape examples to verify the original Q1 column_map: design held up. It didn't fully — three real issues surfaced. Findings below; the resolved Q1 was updated to absorb them.
| Source | Manifest dims | Schema columns | 1:1 mapping works? |
|---|---|---|---|
| indicators__ssb_08764 | 3 (Region, ContentsCode, Tid) | 10 (3 src-specific + 7 conformed) | ✅ Clean: every src-specific column maps to one dim |
| indicators__fhi_mobbing | 6 (GEO, AAR, KJONN, TRINN, SPM_ID, MEASURE_TYPE) | 16 columns | ❌ AAR derives 4 columns (period, period_start_year, period_end_year, year); simple column_map: covers only 1 |
| indicators__bufdir_barnefattigdom | 6 (indicator_api_id, region_code, category_unit, category_format, year, values_json) | 8+ columns including indicator_slug, indicator_group_slug, indicator_name, indicator_title, link_text | ❌ Schema has join-enriched columns (from bufdir_indicator_alias seed) that don't exist as upstream dims at all |
Three issues
1. Multi-column derivation from one dim (fhi-mobbing). Atlas's data model deliberately explodes a 3-year period like AAR='2022_2024' into four schema columns: period (raw string), period_start_year (2022), period_end_year (2024), year (midpoint = 2023). One upstream dim → four schema columns. The simple column_map: { AAR: period } only covers one. This is intentional architecture: Atlas's "midpoint-year + period evidence" pattern is what makes cross-source temporal joins possible (mobbing's 3-year rolling × ssb's annual income data both align on year); it can't be removed without crippling the join story.
2. Join-enriched columns (bufdir). indicators__bufdir_barnefattigdom joins with the bufdir_indicator_alias seed to add indicator_slug, indicator_name, link_text, etc. These columns aren't upstream; they're Atlas-side enrichments produced by the indicator model's SQL JOIN. The manifest's dimensions: block can't represent them because they have no upstream-dim source.
3. Conformed/source-specific gray zone. region_code looks conformed (atlas-side standardisation, FK target shape) but actually carries the upstream raw region code with semantics that differ per source (SSB's 6-digit bydel codes vs FHI's mixed-length codes). The simple two-dict split (manifest + conformed) doesn't capture this. Resolution: treat region_code as derived-from-dim per source rather than conformed; the conformed-cols dict no longer claims it.
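The period explosion in the first issue can be illustrated in a few lines. This is a sketch of the idea, not Atlas's actual SQL: the function name `explode_period` is hypothetical, and flooring the midpoint for even-length windows is an assumption the source does not address.

```python
from __future__ import annotations


def explode_period(raw: str) -> dict[str, object]:
    """Split a 'YYYY' or 'YYYY_YYYY' period string into the four period columns."""
    parts = raw.split("_")
    if len(parts) == 1:
        start = end = int(parts[0])  # annual source: single-year period
    elif len(parts) == 2:
        start, end = int(parts[0]), int(parts[1])
    else:
        raise ValueError(f"unrecognised period string: {raw!r}")
    if end < start:
        raise ValueError(f"period ends before it starts: {raw!r}")
    return {
        "period": raw,  # raw string passed through
        "period_start_year": start,
        "period_end_year": end,
        # Midpoint year; floor division is an assumption for even-length windows.
        "year": (start + end) // 2,
    }


rolling = explode_period("2022_2024")
annual = explode_period("2023")
print(rolling["year"], annual["year"])
```

Both calls yield `year = 2023`, which is the join-on-year alignment between a 3-year rolling source and an annual one that the pattern exists to provide.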
How the schema extension absorbs all three
Manifest schema gains two optional fields. Worked example for fhi-mobbing:
dimensions:
  - code: AAR
    meaning: 3-year rolling period
    value_format: '"YYYY_YYYY" range string'
    notes: "7 periods, e.g. 2016_2018 through 2022_2024"
    derived_via: rolling_period_3yr   # ← references shared template (covers issue 1)
  - code: KJONN
    meaning: Sex
    derives:
      sex: ~                          # ← `~` = "use the dim's meaning + value_format + notes"
  - code: SPM_ID
    meaning: Question id (bullying composite)
    derives:
      question_id: ~
  # (other dims similar)
For bufdir's join-enriched columns:
enriched_columns:                     # ← top-level (covers issue 2)
  - name: indicator_slug
    description: "Stable slug from bufdir_indicator_alias seed; renumbering events bridge through the alias historical_id → canonical_id mapping."
  - name: indicator_name
    description: "Display name from bufdir_indicator_alias seed."
  - name: link_text
    description: "Anchor text for the upstream-page deep link, from bufdir_indicator_alias."
Period-pattern templates living in conformed-column-descriptions.yml:
period_templates:
  annual:
    year: "Calendar year, 4-digit integer."
    period: "Same as year (single-year sources)."
    period_start_year: "Equals year (single-year period)."
    period_end_year: "Equals year (single-year period)."
  rolling_period_3yr:
    period: "Raw 3-year period string passed through (e.g. '2022_2024')."
    period_start_year: "First year of the rolling 3-year window."
    period_end_year: "Last year of the rolling 3-year window."
    year: "Midpoint year of the window. Use this for cross-source temporal joins; preserves the join-on-year invariant Atlas's fact tables rely on."
Future quarterly / monthly sources add quarterly, monthly templates the same way. Each template is the period-merging policy in template form — the shared library where Atlas's "midpoint year + period evidence" pattern is documented as code. No separate "Atlas period-merging policy" INVESTIGATE needed — the templates are the policy.
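How the three mechanisms could combine into one column-to-description map is sketched below. It is a minimal illustration under stated assumptions, not the planned generator: `resolve_columns` and `describe_dim` are hypothetical names, `conformed_columns:` is assumed to be a plain name-to-description mapping, and both inputs are assumed to be already-parsed YAML (where `~` becomes `None`). The composed text follows the `<meaning>. Value format: <value_format>. <notes>` format given under [Q4].

```python
from __future__ import annotations

from typing import Any


def describe_dim(dim: dict[str, Any]) -> str:
    """Compose '<meaning>. Value format: <value_format>. <notes>', skipping absent fields."""
    parts = [dim["meaning"].rstrip(".") + "."]
    if dim.get("value_format"):
        parts.append(f"Value format: {dim['value_format'].rstrip('.')}.")
    if dim.get("notes"):
        parts.append(dim["notes"].rstrip(".") + ".")
    return " ".join(parts)


def resolve_columns(manifest: dict[str, Any], shared: dict[str, Any]) -> dict[str, str]:
    """Merge conformed, dim-derived, template-derived and enriched descriptions."""
    columns: dict[str, str] = dict(shared.get("conformed_columns", {}))
    templates = shared.get("period_templates", {})
    for dim in manifest.get("dimensions", []):
        if "derived_via" in dim:
            # An unknown template name raises KeyError rather than passing silently.
            columns.update(templates[dim["derived_via"]])
        for column, override in (dim.get("derives") or {}).items():
            # None (YAML `~`) inherits the dim's text; a string overrides it.
            columns[column] = describe_dim(dim) if override is None else override
    for entry in manifest.get("enriched_columns", []):
        columns[entry["name"]] = entry["description"]
    return columns


# Hypothetical inputs mixing the fhi-mobbing and bufdir examples above.
shared = {
    "conformed_columns": {"value": "Numeric value (placeholder wording)."},
    "period_templates": {
        "rolling_period_3yr": {
            "period": "Raw 3-year period string passed through.",
            "period_start_year": "First year of the rolling 3-year window.",
            "period_end_year": "Last year of the rolling 3-year window.",
            "year": "Midpoint year of the window.",
        }
    },
}
manifest = {
    "dimensions": [
        {"code": "AAR", "meaning": "3-year rolling period", "derived_via": "rolling_period_3yr"},
        {"code": "KJONN", "meaning": "Sex", "derives": {"sex": None}},
    ],
    "enriched_columns": [
        {"name": "indicator_slug", "description": "Stable slug (placeholder wording)."}
    ],
}
resolved = resolve_columns(manifest, shared)
for name, text in resolved.items():
    print(f"{name}: {text}")
```

Failing loudly on an unknown `derived_via:` name is a choice made for this sketch; a typo in a manifest would otherwise drop four descriptions without warning.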
Coverage after the refinement
For the three spike-tested sources:
- ssb-08764: 10/10 documented (3 src-specific dims with derives: { col: ~ } + 7 conformed).
- fhi-mobbing: 16/16 documented (AAR uses derived_via: rolling_period_3yr for 4 cols, 5 other dims with derives:, 7 conformed).
- bufdir-barnefattigdom: 8+/8+ documented (6 dims with derives:, some 1:1 and some 1:N, plus 5 enriched columns at top level).
Net: every column gets a description after one editorial pass per source.
Why hand-filling is the wrong shape
indicators__* models follow a near-identical column shape across sources. Verified against indicators__ssb_08764 and indicators__ssb_06913:
| Atlas-conformed columns (consistent across indicators models) | Source-specific columns (from upstream dimensions) |
|---|---|
| source_id (always '<provider>-<table-id>') | region_code (mapped from upstream Region-like dim) |
| kommune_nr (FK to dim_kommune) | year (mapped from upstream Tid/year dim) |
| fylke_nr (FK to dim_fylke) | contents_code (mapped from upstream ContentsCode-like dim) |
| contents_label (decoded label) | (some sources have additional source-specific dims) |
| value (numeric) | |
| status (e.g. '..' for suppressed) | |
| updated_at (loaded timestamp) | |
That's ~7 atlas-conformed columns × 23 sources ≈ 161 conformed-column descriptions, plus ~3 source-specific × 23 ≈ 69 source-specific: ~230 schema.yml descriptions in the fully documented state (the file actually holds 288 column entries, since some sources have additional source-specific columns). Today only 72 are written. Hand-filling the remaining ~160 is real labor and creates the wrong incentive: if kommune_nr semantics change, the 23 per-source copies of its description become the place to update, all at once, instead of the canonical dim/fact models that own the concept.
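The two ways of counting the gap (the per-source estimate versus the measured column entries) are worth laying side by side, since both figures recur in the options below. A short worked calculation using only the numbers stated in this document:

```python
SOURCES = 23
CONFORMED_PER_SOURCE = 7        # approximate, per the table above
SOURCE_SPECIFIC_PER_SOURCE = 3  # approximate, per the table above
COLUMN_ENTRIES = 288            # measured with grep on 2026-05-09
DOCUMENTED = 72                 # measured with grep on 2026-05-09

conformed_total = SOURCES * CONFORMED_PER_SOURCE                # 161
source_specific_total = SOURCES * SOURCE_SPECIFIC_PER_SOURCE    # 69
estimated_total = conformed_total + source_specific_total       # 230
estimated_missing = estimated_total - DOCUMENTED                # 158
measured_missing = COLUMN_ENTRIES - DOCUMENTED                  # 216
coverage = DOCUMENTED / COLUMN_ENTRIES                          # 0.25

print(f"estimate: {estimated_missing} missing; measured: {measured_missing} missing")
print(f"coverage: {coverage:.0%}")
```

The estimate gives the lower bound (~158) and the measured count gives the upper bound (216).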
Equally, dbt-osmosis (already in use) propagates descriptions across the dbt graph but only when there's a parent description to propagate. Today most indicators__* parents (raw.* sources via models/sources/sources.yml) have table-level descriptions but few column-level ones. So osmosis has nothing to lift.
The interesting observation: Atlas already captures the source-specific semantic content in manifest.yml's dimensions: block (PLAN-007 Phase 2.11). Each upstream dimension gets code, meaning, value_format, notes. That's exactly the editorial content that would belong on the corresponding indicators__* column.
So the question this INVESTIGATE settles is: how do we close this gap without writing 250 redundant descriptions by hand, and without losing the ability to refine them per-source where it matters?
Options
(a) Hand-fill all 216 missing indicator-column descriptions
Just write them. 23 sources × ~7 conformed + ~3 source-specific = ~230 descriptions, of which 72 are already written, leaving ~158-216 (counting depends on which models have which columns). ~1-2 days of mechanical work.
- Pro: simple, no new tooling, every column gets a tailored description.
- Pro: easy to refine post-hoc — no generator to re-run.
- Con: ~160 conformed-column descriptions are near-copies of each other. Drift risk: ssb-12063's kommune_nr description and ssb-12131's kommune_nr description will diverge over time if no canonical source exists.
- Con: every new ingest source repeats the work. As the cloud-agent pipeline (opened 2026-05-04) onboards more sources, this becomes a recurring tax.
- Verdict: works in the short term; fragile as the source count grows. Skip if a generator path is viable.
(b) Generate indicators__* descriptions from manifest.yml dimensions + a shared atlas-conformed-columns dictionary
A small generator (Python or TypeScript; lives at atlas-data/dbt/scripts/generate-indicator-descriptions.py or .ts) does this on every run:
- Read every atlas-data/ingest/src/sources/<source_id>/manifest.yml.
- For each source, walk its dimensions: block and map each upstream dimension code to the corresponding indicator column name (e.g. Region → region_code; Tid → year; ContentsCode → contents_code). A per-source override map handles nonstandard mappings.
- Combine with a single shared atlas-data/dbt/conformed-column-descriptions.yml (~7 entries) that documents the atlas-conformed columns once: source_id, kommune_nr, fylke_nr, contents_label, value, status, updated_at. Hand-authored once; reused 23 times.
- Rewrite the consolidated models/indicators/schema.yml (Atlas's dbt-osmosis convention is one schema.yml per directory; the generator preserves that shape).
- Wire as a pre-commit hook or CI step that fails if the generated output doesn't match the committed schema.yml, analogous to check-osmosis.sh's existing gate.
- Pro: ~230 descriptions become 1 conformed-columns file (7 entries) + 23 manifest dimensions: blocks (which Atlas already maintains for the catalogue per Phase 2.11).
- Pro: new ingest sources auto-document themselves once the contributor authors dimensions: (a Phase 2.11 step they already do).
- Pro: dbt-osmosis stays compatible. The generator emits standard schema.yml; osmosis still propagates descriptions downstream from indicators to facts/marts.
- Con: new tooling (one generator script + one conformed-columns YAML + one CI gate). ~150 lines of code.
- Con: per-source overrides for column-name mismatches need a small mapping table (e.g. some sources call it region_code, others call it region, others geo_code). Mapping lives in the generator config.
- Verdict: the investment pays back quickly. Write once, every new source benefits. Strongest option.
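The "rewrite schema.yml" step only works as a CI gate if the output is byte-for-byte reproducible. A minimal sketch of deterministic rendering follows, assuming the descriptions have already been resolved into a `{model: {column: description}}` mapping. The name `render_schema` is hypothetical, and the sketch emits only names and descriptions; the real file would presumably also carry model-level descriptions and tests.

```python
from __future__ import annotations

import json


def render_schema(models: dict[str, dict[str, str]]) -> str:
    """Render a dbt-style schema.yml with models in sorted order."""
    lines = ["version: 2", "", "models:"]
    for model in sorted(models):  # sorted, so input dict order never matters
        lines.append(f"  - name: {model}")
        lines.append("    columns:")
        # Column order is preserved as given (assumed to follow the model's SQL).
        for column, description in models[model].items():
            lines.append(f"      - name: {column}")
            # A JSON string literal is also a valid YAML double-quoted scalar,
            # which sidesteps quoting issues with colons and quotes in prose.
            quoted = json.dumps(description, ensure_ascii=False)
            lines.append(f"        description: {quoted}")
    return "\n".join(lines) + "\n"


# Hypothetical resolved descriptions for two models.
rendered = render_schema(
    {
        "indicators__ssb_08764": {"year": "Calendar year."},
        "indicators__fhi_mobbing": {"sex": 'Sex. Value format: "0/1/2" code.'},
    }
)
print(rendered)
```

Quoting every description via `json.dumps` is a shortcut chosen to keep the sketch dependency-free; a YAML library with a fixed key order and line width would serve the same purpose.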
(c) Don't fill indicators__* — defer to api_v1.* wrappers as the documented surface
Lean on the fact that api_v1.* views (Atlas's "stable contract" surface) are 100 % documented. External developers and AI agents are pointed at api_v1; marts.* and especially indicators__* are presented as "internal — descriptions optional."
- Pro: zero work today. The 25 % coverage stays as-is.
- Pro: consistent with the "api_v1 is the contract" framing in the developer docs.
- Con: contradicts the recent open-by-default posture (PLAN-007). Atlas's /data catalog now exposes marts.* as a first-class surface; AI agents reaching marts.* see the gap.
- Con: future-Atlas pressure. As more consumers (innovators, LLMs, journalists) hit marts.* directly, the cost of "no description" rises. Deferring just postpones a larger fix.
- Verdict: short-term escape valve, not a real answer. Don't choose unless the team explicitly accepts that marts.* is a tier-2 surface for documentation purposes.
(d) Hybrid — generate the conformed columns, hand-write the source-specific ones
Splits (b)'s scope:
- The ~7 atlas-conformed columns generated from a single shared dictionary (covers ~70 % of the volume: 161 of ~230 total).
- The ~3 source-specific columns hand-written in each indicators__<source> schema.yml (covers the remaining ~30 %).

- Pro: smaller initial generator (no manifest.yml integration; just a shared dict and a list of sources).
- Pro: source-specific columns get hand-tailored descriptions where the editorial work has high payoff.
- Con: contributor still has to write something per new source, a regression vs (b) where dimensions:-block authoring covers source-specific automatically.
- Verdict: plausible if (b)'s manifest-dimension mapping turns out to be too brittle. Keep as Plan B.
(e) Enforce going forward via check-osmosis.sh, accept the 216-column debt
Bump check-osmosis.sh (or add a sister gate) to fail when an indicators__* model has a column without a description: line. Don't fix the 216 today; require all NEW columns to be documented; over time the debt clears as sources are touched.
- Pro: stops the bleeding. Zero immediate fixing required.
- Pro: the gate is cheap (one grep).
- Con: doesn't help today's AI / MCP consumers. The 216 stay missing for months/years.
- Con: makes adding a new ingest source slightly more painful (the contributor now has to author column descriptions by hand, which is the labor (b) explicitly avoids).
- Verdict: a reasonable complement to (b) — generator fills today's gap, the gate prevents tomorrow's. Not a standalone answer.
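Option (e) implies the gate must tell existing debt apart from new violations, which a bare grep cannot do. One possible shape is sketched below, assuming an already-parsed schema.yml and a hypothetical baseline set of `model.column` names recording the accepted debt. Neither the baseline nor the function names exist in check-osmosis.sh today.

```python
from __future__ import annotations

from typing import Any


def undocumented_columns(schema: dict[str, Any], prefix: str = "indicators__") -> set[str]:
    """Return 'model.column' for every column lacking a non-blank description."""
    missing: set[str] = set()
    for model in schema.get("models", []):
        if not model["name"].startswith(prefix):
            continue  # other model families are outside this gate's scope
        for column in model.get("columns", []):
            if not (column.get("description") or "").strip():
                missing.add(f"{model['name']}.{column['name']}")
    return missing


def new_violations(schema: dict[str, Any], baseline: set[str]) -> list[str]:
    """Undocumented columns that are NOT part of the accepted-debt baseline."""
    return sorted(undocumented_columns(schema) - baseline)


# Hypothetical data: one known-debt column, one newly added undocumented column.
schema = {
    "models": [
        {"name": "indicators__ssb_06913", "columns": [{"name": "value"}, {"name": "status"}]},
        {"name": "dim_kommune", "columns": [{"name": "kommune_nr"}]},
    ]
}
baseline = {"indicators__ssb_06913.value"}
violations = new_violations(schema, baseline)
exit_code = 1 if violations else 0
print(violations, exit_code)
```

As sources are touched and documented, entries drop out of the baseline, so the debt can only shrink.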
Decision
(b) + (e) — generator + CI gate. This is the only option that delivers the "single editorial input → every consumer fed" end-state described above; the others either keep contributor labor proportional to source count (a, d) or punt on the open-by-default gap (c) or stop the bleeding without healing it (e alone).
Specifically:
- Author atlas-data/dbt/conformed-column-descriptions.yml: one entry per atlas-conformed column (~7 columns × ~1 paragraph each). One-time editorial pass; ~30 min.
- Build atlas-data/dbt/scripts/generate-indicator-descriptions.py: reads manifest.yml dimensions + the conformed dict, emits a regenerated models/indicators/schema.yml. ~150 lines.
- Add per-source override map in the script for column-name mismatches (Region → region_code etc). ~15 entries to start (Q1 settles the shape).
- Run the generator; commit the regenerated schema.yml; verify with npm run dbt:rebuild that pg_description gains the missing descriptions.
- Extend check-osmosis.sh (or add a sibling check) to fail if models/indicators/schema.yml doesn't match the regenerated output (deterministic, like dbt parse checks).
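The last step, the generator-diff gate, amounts to comparing two strings and failing on any difference. A minimal sketch using the standard library's `difflib` follows; the function names are hypothetical, and how the two texts are obtained (reading the committed file, running the generator in memory) is left out.

```python
from __future__ import annotations

import difflib


def schema_drift(committed: str, regenerated: str) -> list[str]:
    """Return unified-diff lines; an empty list means the files match."""
    return list(
        difflib.unified_diff(
            committed.splitlines(),
            regenerated.splitlines(),
            fromfile="schema.yml (committed)",
            tofile="schema.yml (regenerated)",
            lineterm="",
        )
    )


def gate_exit_code(committed: str, regenerated: str) -> int:
    """0 when the committed file is up to date, 1 when it has drifted."""
    drift = schema_drift(committed, regenerated)
    for line in drift:
        print(line)  # show the contributor exactly what to regenerate
    return 1 if drift else 0


# Hypothetical file contents.
committed_text = 'columns:\n  - name: year\n    description: "Year."\n'
regenerated_text = 'columns:\n  - name: year\n    description: "Calendar year."\n'

clean = gate_exit_code(committed_text, committed_text)
drifted = gate_exit_code(committed_text, regenerated_text)
```

Printing the diff rather than just failing keeps the fix obvious: rerun the generator and commit the result.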
Why not (c): the open-by-default posture is recently load-bearing; lifting marts.* to a first-class API surface and then declining to document it is incoherent.
Why not (a): the conformed-column tax compounds with each new ingest source. As the cloud-agent pipeline (opened 2026-05-04) onboards more sources, (a) keeps contributor labor proportional to source count forever — exactly the lock-in this INVESTIGATE is set up to avoid.
Why not (d) only: misses the auto-population of source-specific columns. Contributor still has to author per-column descriptions every time, which is the cost (b) eliminates.
Resolved sub-decisions (Q1-Q5)
All five settled 2026-05-10. Original options + analysis preserved for historical record; the Resolved → line at the top of each is the firm answer.
- [Q1] Column-name → manifest-dimension mapping shape.
  - Resolved (initial 2026-05-10) → (ii) explicit column_map: field on manifest.yml.
  - Refined post-spike (2026-05-10) → derives: field per dim + enriched_columns: top-level + period-pattern templates in the conformed-cols dict. The original column_map: design assumed a 1:1 dim→column relationship that doesn't hold across all sources (fhi-mobbing's AAR derives 4 columns; bufdir has join-enriched columns with no upstream dim). The refined shape:
    - Each dim's derives: block maps the dim to one or more schema columns; derives: { col: ~ } inherits the dim's meaning+value_format+notes (the 1:1 case); derives: { col: "..." } overrides with a column-specific description.
    - derived_via: <template> references a shared period-pattern template (e.g. rolling_period_3yr, annual) that defines descriptions for all derived columns the pattern produces. Avoids re-authoring identical period descriptions across 3-year-rolling FHI sources.
    - Top-level enriched_columns: declares descriptions for join-derived columns that don't trace back to an upstream dim.
  - Decided over (i) auto-derive (fragile against CTEs/macros) and (iii) central override file (drift across files). The derives: shape keeps the dim's editorial content next to its column-list; templates absorb the cross-source repetition without forcing a centralised override file.
  - Loss accepted: contributors author derives: per dim + derived_via: references + occasionally enriched_columns:. ~5-8 manifest fields per source on average vs the original "zero-touch" promise. Reliability + full coverage win.
- [Q2] Conformed-columns dictionary location.
  - Resolved → (i) standalone YAML at atlas-data/dbt/conformed-column-descriptions.yml. Decided over (ii) seed CSV (multi-line prose descriptions are awful in CSV) and (iii) inline-with-marker (generator-preserved fragments tend to drift).
  - Loss accepted: the conformed columns aren't queryable via PostgREST in v1 (which a seed-CSV would have given for free, useful for MCP agents discovering Atlas's standard column semantics). Recoverable later: the generator can emit a derived seed CSV as a side-effect of the YAML; not blocking; not v1.
- [Q3] Cover supply__* per-NGO models?
  - Resolved → No, scope to indicators__* for v1, but build the generator with a model-family registry so adding supply__* later is a config change, not a refactor. The generator has an enum of registered model families (initially: indicators); each family points at its own conformed-cols dict path. v1 ships with only indicators registered. When folkehjelp ships and supply has 2+ models, registering supply__* is one PR away. Estimated extra cost: ~30 LOC vs hard-coding indicators__* everywhere.
  - Loss accepted: supply__redcross_branches (today) + supply__folkehjelp_chapters (incoming) keep the description gap until v2 of this generator. Real, but bounded: the architectural cost of supply registration is small thanks to the registry shape, so v2 is fast.
- [Q4] Migration of indicators__ssb_08764's 3 hand-written descriptions.
  - Resolved → (iii) merge into manifest, regenerate. Verified 2026-05-10 that the 3 hand-written descriptions (region_code, year, contents_code) are functionally equivalent to ssb-08764's manifest dimensions (Region, Tid, ContentsCode), which actually have richer content (value_format + notes fields the hand-written versions lack). Generator output format: <meaning>. Value format: <value_format>. <notes> → strictly richer than today's hand-written. No information loss; net gain.
  - Loss accepted: hand-tuning a single column's description is no longer possible; any future "this phrasing is better" change goes in manifest.yml. Tiny loss; usually a feature (manifest descriptions flow to meta_dimensions, sources detail page, and the catalog all at once).
- [Q5] Generalise beyond indicators__* to dim_*, fact_*, mart_*?
  - Resolved → No, scope to indicators__* for v1. dim_* / fact_* / mart_* are already well-documented because check-osmosis.sh enforces hygiene there. The gap was specifically the auto-generated-style indicators__* pass-throughs that escaped the gate. Generalising would either replace hand-written content (lossy) or duplicate it (drift). The model-family registry from [Q3] does keep the door open if the calculus changes.
  - Loss accepted: smallest loss of the five. Universal coverage isn't valuable when the universal already exists.
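The model-family registry from [Q3] could be as small as a lookup table. The sketch below is one possible shape, not the decided design: `ModelFamily`, `REGISTRY` and `family_for` are hypothetical names, the indicators paths are taken from this document, and the supply entry in the usage note is invented purely to show what "a config change, not a refactor" might look like.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ModelFamily:
    prefix: str            # model-name prefix the family owns
    schema_path: str       # schema.yml the generator rewrites
    shared_dict_path: str  # conformed-cols dict for this family


# v1 registers only the indicators family.
REGISTRY: dict[str, ModelFamily] = {
    "indicators": ModelFamily(
        prefix="indicators__",
        schema_path="atlas-data/dbt/models/indicators/schema.yml",
        shared_dict_path="atlas-data/dbt/conformed-column-descriptions.yml",
    ),
}


def family_for(
    model_name: str, registry: dict[str, ModelFamily] = REGISTRY
) -> ModelFamily | None:
    """Return the registered family owning this model, or None if out of scope."""
    for family in registry.values():
        if model_name.startswith(family.prefix):
            return family
    return None


in_scope = family_for("indicators__ssb_08764")
out_of_scope = family_for("supply__redcross_branches")
print(in_scope is not None, out_of_scope is None)
```

Registering supply later would then be one more dictionary entry, for example `ModelFamily("supply__", "<supply schema.yml path>", "<supply dict path>")` with whatever paths that family ends up using, while the generator loop stays untouched.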
Out of scope
- Editing manifest.yml dimensions to fill gaps. If a source's dimensions: block is incomplete, that's a separate authorship gap covered by ingest-modules.md's contributor checklist. This INVESTIGATE assumes manifest dimensions are reasonably complete (they are for the 38+ sources currently shipped).
- dbt-osmosis itself. The propagation pipeline is fine; the input upstream of it (column descriptions on indicators__*) is what's missing. This INVESTIGATE doesn't touch osmosis config.
- Raw schema descriptions. raw.* descriptions come from migration SQL COMMENT ON COLUMN statements, not from dbt sources YAML. Out of scope here; separate concern.
- api_v1.* descriptions. Already 100 % covered by the apply-api-v1.sh generator; no work needed.
- Documenting marts._* private seeds (_sources_manifest, _sources_dimensions, lineage, eu_data_theme). Already covered by seeds/sources/schema.yml; the persist_docs change in PR #89 already pushes those to pg_description after dbt seed.
Cross-references
- PLAN-007 Phase 2.11: introduced the dimensions: block on manifest.yml that this generator would consume.
- PLAN-008 (Atlas-native developer discovery): Phase 2's lineage panel reads meta_endpoints; the descriptions surfaced via persist_docs feed every endpoint card.
- atlas-data/dbt/check-osmosis.sh: the existing description-coverage gate this generator would extend.
- atlas-data/dbt/dbt_project.yml: +persist_docs config (PR #89) that makes column descriptions show up in pg_description.
- PR #89: landed +persist_docs; the live verification of that PR is what surfaced the 249-column gap.
- atlas-data/dbt/models/indicators/schema.yml: the file this generator would regenerate.
Next steps
- (b) + (e) decision settled (2026-05-10).
- Q1-Q5 resolved (2026-05-10) — see "Resolved sub-decisions" above.
- Spike done (2026-05-10): three real issues found; manifest schema refined to derives: + enriched_columns: + period-pattern templates. See "Spike findings" section above. Q1's resolution updated to absorb the changes.
- Draft PLAN-indicators-schema-generator.md against the refined design. Phases:
  - Author atlas-data/dbt/conformed-column-descriptions.yml: conformed_columns: (~7 entries) + period_templates: (annual + rolling_period_3yr for v1; future quarterly / monthly added incrementally).
  - Build generator with model-family registry: indicators__* registered; reads dims (with derives: / derived_via:), enriched_columns, and the conformed-cols + templates dict.
  - Migrate all 23 manifests: add derives: per dim (often ~ for 1:1) and derived_via: for period-rolling sources; add enriched_columns: to bufdir-barnefattigdom.
  - Regenerate models/indicators/schema.yml; verify via npm run dbt:rebuild that pg_description gains the missing descriptions.
  - Extend check-osmosis.sh with a generator-diff gate (committed schema.yml must match regenerated output).
  - Wire schema-gen phase into bootstrap + dbt:rebuild (between seed and run).
- On PLAN ship, this INVESTIGATE moves backlog/ → completed/.
— signed, the Atlas implementation team (via Claude Code agent), 2026-05-09