Plan: /data shows everything that isn't gated, organised by tags
IMPLEMENTATION RULES: Before implementing this plan, read and follow:
- WORKFLOW.md - The implementation process
- PLANS.md - Plan structure and best practices
Status: Completed 2026-05-07 (Phases 1-5 shipped)
Last Updated: 2026-05-07 (Phase 1 closed: UIS PR #140 merged + GHCR republished; Phase 4 task 4.3 partial via PR #79; this update reflects the actual ship state vs the original plan).
What's done
- Phase 1 — UIS schema exposure: ✅ closed 2026-05-07. UIS shipped the `--schemas api_v1,marts,raw` flag (per-app explicit opt-in, not the original "global default" framing). UIS PR #140 merged as `f377fef`; `ghcr.io/helpers-no/uis-provision-host:latest@sha256:42cd40d5f66916a6f6071ab4d69fcf0080a2915b1cf93295bd3b169b8af42f31`. Atlas's `setup.md` updated via PR #76. Schema-list flag thread documented end-to-end in `talk.md` (Messages 1-4 atlas + 1-3 uis). The reconfigure-already-deployed step is user-managed via the UIS tester CLI.
- Phase 2 — manifest registry: ✅ shipped via PR #36 (catalogue 21 → 38 sources; manifest schema includes `eu_theme`, `attribution`, and the `dimensions:` block; `recordIngestRun()` lifecycle wrapper landed; `_sources_manifest.csv` + `_sources_dimensions.csv` seeds materialised at `marts._sources_manifest` / `marts._sources_dimensions`). Catalogue is now 41 sources after subsequent FHI / SSB / Bufdir / Cursor BG additions.
- Phase 3 — meta marts + auto-wrap: ✅ shipped via PR #73 (3 new marts under `models/marts/api/`) + PR #77 (override-map → manifest.yml `raw_tables:` field refactor). `api_v1.meta_sources` (41 rows), `api_v1.meta_endpoints` (121 rows after refresh: 13 api_v1 + 61 marts + 47 raw), `api_v1.meta_dimensions` (215 rows). Lineage seed via the new `scripts/extract_lineage.py` (129 edges). Tag inheritance uses union semantics; `fact_kommune_indicators` picks up 18 tags from its many indicator sources. `mart_meta_dimensions` cardinality enrichment deferred to a follow-up — see INVESTIGATE-mart-meta-dimensions-cardinality.md (PR #78) for the design.
- Phase 4 — frontend rewrite: ✅ closed 2026-05-07. PR #79 shipped the sources index. PR #85 shipped task 4.1 (full `/data` rewrite with a tag-filter sidebar against `meta_endpoints`, 119 cards, 6 namespace-grouped facets, faceted-search counts) + the `/data/[endpoint]/` → `/data/[schema]/[table]/` route restructure (Accept-Profile dispatch) + the cache-no-store fix for first-load-empty + the homepage copy update (4.4). PR #88 (this PR) shipped task 4.3 (per-source detail page at `/data/sources/[source_id]/page.tsx` rendering source metadata + freshness + raw ingest link + derived endpoints joined live against the `marts.lineage` seed) + task 4.5 (atlas-frontend README refresh covering every route, the lib's `acceptProfile` option, and the bookmarkable tag URL patterns). Task 4.2 closed as a no-op — the tag-driven catalog reads dynamic schemas from `meta_endpoints`; the typed `api-types.ts` union is no longer load-bearing for discovery, so regen is now an optional contributor maintenance step.
- Phase 5 — docs: ✅ closed 2026-05-07 in PR #88. Four files updated: `setup.md` (manifest convention + corrected customer-frontend section), `ingest-modules.md` (expanded manifest workflow with heuristic warnings), `developers/index.md` (open-by-default + Accept-Profile + tag-filter URL pattern), `atlas-data/ingest/src/sources/README.md` (added a programmatic-access callout pointing at `api_v1.meta_sources`).
Goal (unchanged)
Execute INVESTIGATE-customer-frontend-data-display.md. After this PLAN, the customer frontend's /data page shows every queryable endpoint across api_v1, marts, and raw schemas (everything that isn't private_marts), each tagged with provider, topic, geo, cadence, eu_theme, and layer. A filter sidebar lets users slice the catalogue by any combination of tags. A first-class sources list (/data/sources + api_v1.meta_sources) carries provider, upstream URL, last-ingested timestamp, and downstream-model count for every Atlas ingest source — currently 41, growing as the cloud-agent pipeline drains the backlog.
Investigation
INVESTIGATE-customer-frontend-data-display.md — settled the open-by-default principle, the per-source manifest.yml shape ([Q2]), the dbt-model-as-substrate path ([Q3]), and the multi-namespace tag UX ([Q4]). Phase 2.10 + 2.11 extended the namespace set with eu_theme: (DCAT-AP alignment) and the editorial dimensions: block.
Prerequisites
- ✅ PostgREST live with `api_v1.*` (PLAN-004 + UIS PLAN-002 — verified 2026-04-30).
- ✅ PostgREST also serves `marts.*` and `raw.*` via the `Accept-Profile` header (UIS PR #140 — Phase 1 of this PLAN).
- ✅ Customer frontend with `/data`, `/data/[endpoint]`, `/data/[endpoint]/spec` (PLAN-005 — shipped at `2266f21`).
- ✅ `raw.ingest_runs` populated by every ingest module (lifecycle wrapper from Phase 2.8).
- ✅ `api_v1.meta_sources` / `meta_endpoints` / `meta_dimensions` live (Phase 3).
Blocks
- None remaining — UIS Phase 1 dependency closed 2026-05-07.
The manifest.yml shape
Phase 2's deliverable. One file per source folder. All structured catalogue metadata lives here; per-source READMEs are prose-only (what the script does, quirks, TODOs, references). After commit, the manifest is human-authored — ingest runs do NOT modify it.
```yaml
# atlas-data/ingest/src/sources/ssb-08764/manifest.yml
source_id: ssb-08764
upstream_id: "08764"
upstream_url: https://www.ssb.no/statbank/table/08764
upstream_title: "08764: Personer under 18 år i husholdninger med lavinntekt (EU- og OECD-skala). Antall og prosent (K) (B) 2005-2024"
description: "Ingestion module for SSB statistikkbanktabell 08764 — Personer under 18 år i husholdninger med lavinntekt (EU- og OECD-skala)."
publisher: Statistisk sentralbyrå
license: NLOD
license_url: https://data.norge.no/nlod/no/2.0
periodicity: P1Y
eu_theme: SOCI
attribution: "Kilde: Statistisk sentralbyrå, tabell 08764"
tags:
  provider: ssb
  topic: income
  geo: kommune
  cadence: annual
dimensions:
  - code: Region
    meaning: Region (national / fylke / kommune / bydel / historical)
    value_format: "Numeric code: 0 national, 2-digit fylke, 4-digit kommune, 6-digit bydel"
    notes: "~1036 codes when pulling full range"
  - code: ContentsCode
    meaning: Statistic measure
    value_format: 5 codes
    notes: "Personer (count under 18), EUskala50/EUskala60 (% under 18 below 50%/60% of median, EU scale), OECDskala50/OECDskala60 (same, OECD scale)."
  - code: Tid
    meaning: Year
    value_format: 4-digit year as text
    notes: "2005–2024 (20 years); default v2-beta response is latest year only"
```
Required top-level fields:

| Field | Purpose |
|---|---|
| `source_id` | Folder name; primary key; e.g. `ssb-08764`. |
| `upstream_id` | The upstream's own identifier (SSB table number, FHI dataset slug, etc.) so external developers can reconcile against upstream catalogues. |
| `upstream_url` | Canonical link to the source on the upstream's own site. |
| `upstream_title` | The source's authoritative title — usually Norwegian, sometimes bilingual. Gives developers something to search for in upstream tooling. |
| `description` | One paragraph framing the dataset for the customer-facing catalogue. |
| `publisher` | Institution that publishes the data (often = provider but sometimes different — e.g. an SSB table published on behalf of another body). |
| `license` + `license_url` | Critical for external developers building products. Default `NLOD` (Norwegian Licence for Open Government Data) for Norwegian public-sector sources; declare explicitly so consumers don't guess. |
| `periodicity` | ISO 8601 — `P1Y` annual, `P3M` quarterly, `P1M` monthly, `P1D` daily, `irregular` for ad-hoc / one-shot. More precise than the `cadence:` tag. |
| `eu_theme` | EU Publications Office Data Theme code (one of: AGRI, ECON, EDUC, ENER, ENVI, GOVE, HEAL, INTR, JUST, REGI, SOCI, TECH, TRAN). Aligns Atlas with Felles datakatalog (DCAT-AP `dcat:theme`). Auto-derived from `tags.topic` by `fill-manifest-todos.ts`; lookup table at `seeds/sources/eu_data_theme.csv`. |
| `attribution` | Citation string for academic / legal compliance (e.g. "Kilde: Statistisk sentralbyrå, tabell 08764"). External developers must use this when republishing or citing data. |
The `tags:` map carries the four declared namespaces (`provider`, `topic`, `geo`, `cadence`) — exactly one value per namespace per source. The `cadence:` tag is the human-readable shorthand of `periodicity` (so users can filter by `cadence:annual` without writing ISO 8601).
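The periodicity → cadence shorthand is a fixed mapping; a minimal sketch in Python (the helper name `cadence_for` is hypothetical — the shipped logic is TypeScript in `fill-manifest-todos.ts`):

```python
# Illustrative sketch of the periodicity -> cadence shorthand from this plan's
# vocabulary; the shipped implementation lives in fill-manifest-todos.ts.
PERIODICITY_TO_CADENCE = {
    "P1Y": "annual",
    "P3M": "quarterly",
    "P1M": "monthly",
}

def cadence_for(periodicity: str) -> str:
    # Anything unrecognised (ad-hoc pulls, one-shots) falls back to "irregular".
    return PERIODICITY_TO_CADENCE.get(periodicity, "irregular")
```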
The `dimensions:` list carries one entry per upstream dimension, with `code` (upstream's own dimension name), `meaning` (short human-readable interpretation), `value_format` (encoding of values), and `notes` (cardinality, gotchas). Hand-authored — this is editorial semantic content the catalogue can't compute. Phase 3's `mart_meta_dimensions` joins this with computed cardinality + example values from `raw.*` tables.
The `layer:` namespace is not declared here; it's derived per-endpoint in Phase 3 from the schema + dbt model path.
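The computed enrichment Phase 3 layers onto the editorial seed can be illustrated in-memory; a minimal Python sketch (the shipped computation is SQL in dbt, and `profile_dimension` is a hypothetical name):

```python
from collections import Counter

def profile_dimension(values: list) -> dict:
    """Sketch of the per-dimension stats mart_meta_dimensions adds to the
    editorial seed: distinct-value count, up-to-10 example values (frequency
    desc, then alpha), and a null count."""
    non_null = [v for v in values if v is not None]
    freq = Counter(non_null)
    examples = [v for v, _ in sorted(freq.items(), key=lambda kv: (-kv[1], str(kv[0])))[:10]]
    return {
        "cardinality": len(freq),
        "example_values": examples,
        "null_count": len(values) - len(non_null),
    }
```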
Plus a sibling change to capture upstream freshness: a new nullable column on `raw.ingest_runs` named `upstream_updated_at timestamptz`. The shared `recordIngestRun()` wrapper at `lib/ingest_run.ts` extracts the upstream's own "updated" timestamp from the JSON-stat2 response and writes it. The lag between `MAX(finished_at)` (we ingested) and `MAX(upstream_updated_at)` (they published) is a meaningful signal in `mart_meta_sources`.
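For illustration, that lag reads as follows in a Python sketch (the real signal is SQL over `raw.ingest_runs`; `ingest_lag_days` is a hypothetical name):

```python
from datetime import datetime, timezone

def ingest_lag_days(finished_at, upstream_updated_at):
    """Sketch of the mart_meta_sources freshness signal: MAX(finished_at)
    minus MAX(upstream_updated_at). Positive means we ingested after the
    upstream last published; None mirrors the nullable column."""
    if not finished_at or not upstream_updated_at:
        return None
    return (max(finished_at) - max(upstream_updated_at)).total_seconds() / 86400.0
```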
How a manifest.yml gets created — bootstrap once, human-authored after
Three-stage workflow:
(1) Skeleton — automatic. `npm run sources:bootstrap-manifest -- <source_id>` fetches upstream metadata and writes a starter manifest with `source_id`, `upstream_id`, `upstream_url`, `upstream_title`, `publisher`, `periodicity`, and `license` (NLOD default for SSB / FHI / KOSTRA) populated. Other fields are left as TODO.
(2) Auto-fillable fields — automatic. `npm run sources:fill-manifest-todos` parses the per-source README and fills `description`, `attribution`, and the four `tags:` namespaces (topic via regex, first-match-wins; cadence derived from periodicity; geo via priority kommune > fylke > bydel; eu_theme derived from topic via a static map). Idempotent — only fills TODO/empty fields.
(3) Editorial — hand-authored. The contributor authors the `dimensions:` list by hand (semantic content the catalogue can't derive), reviews the auto-filled fields, and commits.
After commit the manifest is human-authored — ingest runs do NOT modify it. `npm run ingest:<source_id>` reads upstream data, writes rows to `raw.<source_id>`, and captures `upstream_updated_at` to `raw.ingest_runs` — but does not touch manifest.yml. This avoids "PR diff has mystery edits from a CI run."
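The stage-2 topic heuristic can be sketched as an ordered rule list. The patterns below are illustrative assumptions — only the first-match-wins ordering and the deliberate `helse` exclusion are taken from this plan; the shipped rules live in `fill-manifest-todos.ts`:

```python
import re

TOPIC_RULES = [
    ("ngo-supply", re.compile(r"røde kors|lokallag|frivillig", re.IGNORECASE)),
    ("reference", re.compile(r"klassifikasjon|kodeverk", re.IGNORECASE)),
    ("income", re.compile(r"inntekt", re.IGNORECASE)),  # also matches "lavinntekt"
    ("education", re.compile(r"utdanning|skole", re.IGNORECASE)),
    # Matches English "health" but deliberately NOT Norwegian "helse", so FHI's
    # bureau name "Folkehelsestatistikk" can't misclassify every FHI source.
    ("health", re.compile(r"\bhealth\b|sykdom", re.IGNORECASE)),
    ("social", re.compile(r"sosial", re.IGNORECASE)),
    ("demographics", re.compile(r"befolkning|husholdning", re.IGNORECASE)),
]

def classify_topic(text: str):
    for topic, pattern in TOPIC_RULES:  # order is significant: first match wins
        if pattern.search(text):
            return topic
    return None  # left as TODO for the human editorial pass
```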
For the 21 existing sources (Phase 2.3): the same flow in batch. SSB (14) + FHI (4) are covered by the bootstrap script — 18 sources auto-bootstrapped. The 3 outliers (`redcross-branches`, `frr`, and `ssb-klass-*` if treated separately from SSB) use `MANUAL_OVERRIDES` in `fill-manifest-todos.ts`. Dimension blocks were hand-authored in ~30 minutes by reading each README's `## Response shape` section.
Phase 1: UIS-side schema exposure
Cross-repo coordination with the UIS contributor. Atlas's atlas-postgrest instance starts serving marts.* and raw.* alongside api_v1.*. UIS-side change is a one-time configure-postgrest patch; after it lands, every new table in those schemas is queryable automatically (the existing ALTER DEFAULT PRIVILEGES clause already auto-grants SELECT on new tables).
Tasks
- 1.1 Open a new round of cross-repo coordination via `talk.md`. The inaugural message from atlas to uis lays out the change asked for: extend `PGRST_DB_SCHEMAS` from `api_v1` to `api_v1,marts,raw`; add matching `GRANT USAGE ON SCHEMA marts, raw TO <app>_web_anon` and `GRANT SELECT ON ALL TABLES IN SCHEMA marts, raw TO <app>_web_anon` plus `ALTER DEFAULT PRIVILEGES IN SCHEMA marts, raw GRANT SELECT ON TABLES TO <app>_web_anon` to `configure-postgrest.sh`. `private_marts` stays excluded.
- 1.2 UIS contributor responded + shipped. A six-message thread in `talk.md` settled the design (UIS pushed back on the global-default framing in their Message 1; atlas accepted in Message 3 — the per-app `--schemas` flag avoids the GRANT-failure trap for non-Atlas consumers and keeps dbt-isms out of the platform tool). UIS PR #140 merged as `f377fef` on 2026-05-07; State Matrix dispatch with 5 reconcile paths; `--schema` (singular) removed entirely; `PGRST_DB_SCHEMAS` lives on the per-app secret + is read by the deploy template via `secretKeyRef` so configure/deploy can't drift.
- 1.3 Atlas-side validation passed against the contributor's local-image deployment (`talk.md` Message 4) — six spot-checks across api_v1 / marts / raw, plus the privacy-boundary check confirming `private_marts.frr_resources` returns 404 by default and 406 with `Accept-Profile: private_marts`. Atlas's `setup.md` updated via PR #76 (configure line gains `--schemas api_v1,marts,raw`). The user's `./uis pull` + reconfigure step is the final ack — it runs through their UIS tester CLI; expected a `"status": "already_configured"` no-op since the contributor's local image had identical semantics.
Outcome (Phase 1 — closed 2026-05-07): schema-list extension landed end-to-end. Single-day round-trip from atlas Message 4 (validation) to UIS Message 3 (PR + GHCR rebuild). PostgREST now serves marts.* and raw.* via Accept-Profile in addition to the default api_v1; private schemas (private_raw, private_marts) stay excluded by design. GHCR :latest SHA: 42cd40d5f66916a6f6071ab4d69fcf0080a2915b1cf93295bd3b169b8af42f31.
Operational gotcha logged for Phase 4 (talk40 Round 4 closeout): PostgREST routes header-less requests to the default schema only — the first one in `--schemas`, i.e. api_v1. To reach `marts.*` or `raw.*`, callers send `Accept-Profile: <schema>` on each request. A naive `curl /dim_kommune` returns 404 because PostgREST resolves it as `api_v1.dim_kommune` (which doesn't exist) — that's correct routing, not a misconfiguration. Symmetric for the OpenAPI document: `curl /` advertises only the default schema's ~14 paths; `curl -H 'Accept-Profile: marts' /` advertises ~64 marts paths. The sum across profiles (≈123) roughly matches what `api_v1.meta_endpoints` carries (119 rows after the latest regen; counts drift as the catalogue grows). The Phase 4.1 frontend rewrite must send Accept-Profile per row, keyed off `meta_endpoints.schema`.
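A client-side sketch of that per-request dispatch, using only the Python stdlib (`build_request` is a hypothetical helper, not the frontend code):

```python
import urllib.request

def build_request(base: str, path: str, schema: str = "api_v1") -> urllib.request.Request:
    """PostgREST routes header-less requests to the default schema (api_v1),
    so any marts.* or raw.* call must carry an Accept-Profile header."""
    req = urllib.request.Request(f"{base}/{path}")
    if schema != "api_v1":  # the default schema needs no header
        req.add_header("Accept-Profile", schema)
    return req

# e.g. urllib.request.urlopen(build_request("http://api-atlas.localhost", "dim_kommune?limit=3", "marts"))
```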
Validation
```shell
# Marts table is reachable via Accept-Profile
curl -fsS -H 'Accept-Profile: marts' "http://api-atlas.localhost/dim_kommune?limit=3" | jq 'length'   # → 3

# Raw table is reachable via Accept-Profile
curl -fsS -H 'Accept-Profile: raw' "http://api-atlas.localhost/ssb_08764?limit=2" | jq 'length'   # → 2

# private_marts.* is NOT reachable, even with an explicit profile
curl -sS -o /dev/null -w "%{http_code}\n" "http://api-atlas.localhost/frr_resources"   # → 404
curl -sS -o /dev/null -w "%{http_code}\n" -H 'Accept-Profile: private_marts' "http://api-atlas.localhost/frr_resources"   # → 406

# OpenAPI: default profile (api_v1) advertises ~14 paths; the multi-schema sum is exposed via meta_endpoints
curl -sS "http://api-atlas.localhost/" | jq '.paths | keys | length'   # → 14 (api_v1 default)
curl -sS -i "http://api-atlas.localhost/meta_endpoints?limit=0" -H 'Prefer: count=exact' | grep -i 'content-range'   # → */119
```
Done when
- `marts.*` and `raw.*` tables are queryable via `api-atlas.localhost`.
- `private_marts.*` returns 404 (still gated).
- Customer frontend's existing `/data` catalog auto-discovers the new endpoints on next page load (no code change needed; the introspection-driven design from PLAN-005 handles it).
Phase 2: Per-source manifest.yml registry
Promote the existing Markdown table at atlas-data/ingest/src/sources/README.md and the per-source READMEs into structured per-source manifest.yml files. First-pass tag curation across the 21 existing sources.
Tasks
- 2.1 Document the `manifest.yml` schema in `atlas-data/ingest/src/sources/README.md`'s "Conventions" section: the nine required top-level fields (`source_id`, `upstream_id`, `upstream_url`, `upstream_title`, `description`, `publisher`, `license`, `license_url`, `periodicity`), the required `tags:` map with the four declared namespaces, and the allowed values per namespace.
- 2.2 Define the initial vocabulary for each tag namespace:
  - `provider`: `ssb` / `fhi` / `redcross` / `brreg` (extend as new providers land)
  - `topic`: `demographics` / `income` / `education` / `health` / `social` / `ngo-supply` / `reference` (initial coarse set; refine as needed)
  - `geo`: `kommune` / `fylke` / `national` / `bydel`
  - `cadence`: `annual` / `quarterly` / `monthly` / `irregular` / `one-shot`
  - License values: `NLOD` (Norwegian Licence for Open Government Data — the default for SSB / FHI / KOSTRA), or specific names for non-NLOD sources. Always declare the URL.
- 2.3 Build the bootstrap script at `atlas-data/ingest/scripts/bootstrap-manifest.ts` (TypeScript so it reuses Atlas's existing ingest-side fetch helpers). CLI: `npm run sources:bootstrap-manifest -- <source_id>`. Provider-specific extractors:
  - SSB (PxWebAPI): GET the table metadata endpoint; map `title` / `source` / `updated` / `variables[*].label` onto `upstream_title` / `publisher` / a periodicity heuristic from the variables (`Tid` value cardinality + spacing). Default `license: NLOD`.
  - FHI (Norgeshelsa json-stat2): same shape; the metadata block has title + last-modified.
  - Default fallback (no provider extractor): writes a template manifest.yml with TODO placeholders + the `source_id` / `upstream_id` / `upstream_url` from CLI args. Used for `redcross-branches`, `frr`, and anything without a structured upstream API.
  - Output: writes `atlas-data/ingest/src/sources/<source_id>/manifest.yml` with as much pre-filled as possible; leaves `description` and `tags` as `# TODO` placeholders for human review. Refuses to overwrite an existing manifest unless `--force` is passed.
- 2.4a Bootstrap the 21 existing sources — run `npm run sources:bootstrap-manifest` for each. The SSB extractor handles 14 SSB tables + 2 ssb-klass sources; the FHI extractor handles 4; the fallback template handles `redcross-branches` + `frr` (no provider API). Output: 21 skeleton YAMLs with upstream metadata pre-filled, `description` + `tags` left as `# TODO`.
- 2.4b Build the auto-fill helper at `atlas-data/ingest/scripts/fill-manifest-todos.ts` (extension to the original plan — replaces the manual ~1-hour editorial pass). CLI: `npm run sources:fill-manifest-todos` (no per-source arg; runs across all sources idempotently). Reads each source's `README.md` and applies:
  - `description` — first descriptive paragraph after the H1, with markdown emphasis/links/code stripped, ~400-char cap.
  - `upstream_id`, `upstream_title`, `license`, `license_url` — parsed out of the README's `## Upstream` markdown table when present.
  - `tags.topic` — first-match-wins regex over title + description. Order is significant: `ngo-supply` before `reference` before `income` / `education` / `health` / `social` / `demographics`. The `health` rule deliberately excludes the Norwegian word "helse" (because "Folkehelsestatistikk" — FHI's bureau name — would otherwise misclassify every FHI source). The `ngo-supply` rule requires explicit NGO vocabulary (Røde Kors, lokallag, frivillig) — generic "tjeneste" or "aktivitet" alone are too broad.
  - `tags.geo` — kommune > fylke > bydel priority. KOSTRA `(K)` markers count as kommune.
  - `tags.cadence` — derived from `periodicity` (P1Y → annual, P3M → quarterly, etc.).
  - `MANUAL_OVERRIDES` dict — hardcoded values for `redcross-branches` and `frr`, whose READMEs don't follow the SSB/FHI Upstream-table format.
  - Only fills TODO/empty fields; never overwrites human-authored content. After commit, the manifest is human-authored and ingest runs do NOT modify it (the discipline from Phase 2's preamble).
- 2.4c Run + verify — `npm run sources:bootstrap-manifest` against each source folder, then `npm run sources:fill-manifest-todos` (no per-source arg; runs across all 21). Spot-check the outputs; fix the topic/geo regexes when classifications drift (e.g. ssb-12292 omsorgstjenester → health, not ngo-supply; fhi-bor-alene → demographics, not health). Commit the 21 manifests + both scripts as a batch.
- 2.5 Add the seed-build helper at `atlas-data/dbt/scripts/build_sources_seed.py` that:
  - Scans `atlas-data/ingest/src/sources/*/manifest.yml`.
  - Validates each against the required-field list (fails loudly if any required field is missing or any tag namespace is absent — TODO placeholders also fail).
  - Emits a dbt seed CSV at `atlas-data/dbt/seeds/sources/_sources_manifest.csv` with columns `source_id, upstream_id, upstream_url, upstream_title, description, publisher, license, license_url, periodicity, tags`. The `tags` column is a comma-separated `namespace:value` string (e.g. `provider:ssb,topic:income,geo:kommune,cadence:annual`).
  - Deviation (2026-05-01): the plan said `seeds/sources/manifest.csv` + an alias to `_sources_manifest`. Implemented as `seeds/sources/_sources_manifest.csv` directly — no alias needed; the file's basename is both the dbt resource name and the relation name (`_sources_manifest`). Cleaner than the alias indirection.
- 2.6 Update `atlas-data/dbt/dbt_project.yml`'s seeds config so `seeds/sources/_sources_manifest.csv` lands in `marts._sources_manifest` (private internal name; not user-facing). Implemented as: extend the `seeds.atlas` `+column_types` map with the nine manifest columns + `tags`; `+schema: marts` already inherits from the parent. Added a separate `seeds/sources/schema.yml` documenting all ten columns + carrying `not_null` / `unique` data tests. `dbt seed --select _sources_manifest` loads 21 rows; `dbt test --select _sources_manifest` passes 11/11.
- 2.7 Migration: add `atlas-data/migrations/NNN_raw_ingest_runs_upstream_updated.sql` adding `upstream_updated_at timestamptz` to `raw.ingest_runs` (nullable; idempotent via `ADD COLUMN IF NOT EXISTS`). Run `npm run migrate` to apply. Landed as `atlas-data/migrations/028_raw_ingest_runs_upstream_updated.sql`.
- 2.8 Update the SSB + FHI ingest modules (the easy wave) to populate `upstream_updated_at`. SSB's PxWebAPI metadata returns an `updated` field at the table level; FHI's json-stat2 has an equivalent. The bootstrap script in 2.3 already extracts these — wire the same extraction into the runtime ingest path (a one-place change in the run-record helper at `atlas-data/ingest/src/lib/ingest-runs.ts` or equivalent). Red Cross / Brreg can adopt the same convention later — leaving them null is fine; the column is nullable. Outcome (2026-05-01): scope was bigger than the plan implied — the existing SSB/FHI ingest modules didn't write to `raw.ingest_runs` at all; the start/finish helpers were only used by the NGO scraping infrastructure. Built a new shared wrapper at `atlas-data/ingest/src/lib/ingest_run.ts` (`recordIngestRun(sourceId, work)`) that owns the start/finish + SQL lifecycle, then wired all 21 source modules through it. Per-source delta is ~10 lines: `return recordIngestRun(SOURCE_ID, async () => { ... return { output, record: { rowsParsed, upstreamUpdatedAt: new Date(resp.updated) } }; })`. SSB (14) + FHI (4) populate `upstreamUpdatedAt` from `resp.updated`; KLASS (2) + redcross/frr (2) pass null or a derived timestamp where the upstream concept exists. Live test: `npm run ingest:ssb-08764` returned `upstream_updated_at: "2026-01-16T07:00:00.000Z"` on `run_id` 2.
- 2.9 Update `atlas-data/ingest/src/sources/README.md`: either (a) auto-generate the table from the YAMLs via `build_sources_seed.py` with a markdown-table emission flag (one-way duplication; single source of truth in the YAMLs), or (b) replace the table with a pointer at `api_v1.meta_sources`. Recommendation: (a) — contributors browsing the repo without the API still see a readable index, and the table can never go stale. Implemented option (a): `build_sources_seed.py` now accepts `--readme [PATH]` (defaults to `atlas-data/ingest/src/sources/README.md`). It replaces content between `<!-- BEGIN auto-generated source table -->` / `<!-- END auto-generated source table -->` markers with a 7-column table (Source, Provider, What it is, Topic, EU theme, Geo, Cadence). Idempotent — re-running on an unchanged manifest set is a no-op. The legacy Notes column is dropped; per-source READMEs already capture editorial commentary.
- 2.11 Invert the manifest/README contract (extension to the original plan, prompted by a user observation that the per-source README carried more structured info than `manifest.yml`). After this step, all structured catalogue metadata lives in `manifest.yml`; the README is reduced to prose-only contributor notes (what the script does, quirks, TODOs, references). Outcome: the `manifest.yml` schema gains two more required fields — `attribution` (citation string for academic/legal compliance, parsed from each README's `## Upstream` table or its `Attribution: *Kilde …*` prose fallback) and `dimensions:` (a list of `{code, meaning, value_format, notes}` per upstream dimension — semantic interpretation that a computed `mart_meta_dimensions` from `raw.*` can't produce on its own). `fill-manifest-todos.ts` extracts `attribution` automatically; `dimensions:` is hand-authored once per source (cost: ~30 min for the 21-source backfill). `build_sources_seed.py` now validates `attribution` + the `dimensions:` shape and emits a second seed at `seeds/sources/_sources_dimensions.csv` (90 rows across the 21 sources). Lands as `marts._sources_dimensions`. Phase 3's `mart_meta_dimensions` will join this editorial seed with computed cardinality + example values from `raw.*`. `seeds/sources/schema.yml` gains `_sources_dimensions` with a `relationships` test on `source_id → _sources_manifest.source_id` and `not_null` on `code` / `meaning`. dbt test: 24/24 passing.
  - Contributor guide at `website/docs/contributors/ingest-modules.md` rewritten — README sections are required to be prose-only (the Markdown `## Upstream`, `## Response shape`, `## Row shape emitted`, and `## How to run locally` requirements are dropped). The adding-a-source workflow now points contributors at `npm run sources:bootstrap-manifest` + `npm run sources:fill-manifest-todos` + hand-authoring the `dimensions:` block. Source of truth for new entries: manifest.yml only — never duplicated in Markdown.
  - The 21 existing READMEs were slimmed by a one-off script: dropped the now-redundant structured sections; kept Title + What the script does + Known quirks + Known issues + References. Average: ~60% line reduction (fhi-bor-alene went 93 → 30 lines; ssb-08764 went 118 → 44 lines).
- 2.10 Add the `eu_theme` field + `eu_data_theme` lookup seed (extension to the original plan, per INVESTIGATE-felles-datakatalog-classification.md). Aligns Atlas with Felles datakatalog's EU-tema facet (DCAT-AP `dcat:theme`) without giving up the domain-precise `topic` for our own UX.
  - The `manifest.yml` schema gains an `eu_theme:` top-level field — one of the 13 EU Publications Office Data Theme codes (AGRI / ECON / EDUC / ENER / ENVI / GOVE / HEAL / INTR / JUST / REGI / SOCI / TECH / TRAN). Required (validated by `build_sources_seed.py`); auto-derived from `tags.topic` by `fill-manifest-todos.ts` via a static `TOPIC_TO_EU_THEME` map.
  - New seed at `atlas-data/dbt/seeds/sources/eu_data_theme.csv` — 13 rows × 4 columns (`code`, `uri`, `label_en`, `label_no`). URIs are stable EU IRIs (`http://publications.europa.eu/resource/authority/data-theme/{CODE}`). Lands as `marts.eu_data_theme`.
  - `seeds/sources/schema.yml` gains the new seed's column tests + a `relationships` test on `_sources_manifest.eu_theme → eu_data_theme.code` (broken eu_theme values fail the gate).
  - Backfill: re-ran `npm run sources:fill-manifest-todos` to add `eu_theme:` to all 21 manifests. Distribution: 14 SOCI (income/social/demographics/ngo-supply collapse), 4 EDUC, 2 GOVE (reference data), 1 HEAL.
  - The customer frontend in Phase 4 can render an "EU theme" filter alongside Atlas's domain `topic`; later, a DCAT-AP-NO catalogue endpoint can re-emit these as `dcat:theme` URIs for federated discovery — see INVESTIGATE-felles-datakatalog-classification.md for the open question on DCAT-AP-NO publishing as a separate later PLAN.
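The validation gate and seed emission from 2.5 + 2.10 can be condensed into a Python sketch. The field list and the 13-code theme vocabulary come from this plan; function names and exact error strings are illustrative, not the shipped `build_sources_seed.py`:

```python
REQUIRED_FIELDS = [
    "source_id", "upstream_id", "upstream_url", "upstream_title", "description",
    "publisher", "license", "license_url", "periodicity", "eu_theme", "attribution",
]
TAG_NAMESPACES = ("provider", "topic", "geo", "cadence")
EU_THEMES = {"AGRI", "ECON", "EDUC", "ENER", "ENVI", "GOVE", "HEAL",
             "INTR", "JUST", "REGI", "SOCI", "TECH", "TRAN"}

def validate_manifest(m: dict) -> list:
    """Fail loudly: missing fields, TODO placeholders, absent tag namespaces,
    and eu_theme codes outside the 13-code vocabulary all produce errors."""
    errors = [f"{f}: missing or TODO" for f in REQUIRED_FIELDS
              if not m.get(f) or "TODO" in str(m[f])]
    tags = m.get("tags") or {}
    errors += [f"tags.{ns}: missing" for ns in TAG_NAMESPACES if not tags.get(ns)]
    if m.get("eu_theme") not in EU_THEMES:
        errors.append(f"eu_theme: {m.get('eu_theme')!r} is not an EU Data Theme code")
    return errors

def tags_column(tags: dict) -> str:
    # The seed CSV carries tags as one comma-separated namespace:value string.
    return ",".join(f"{ns}:{tags[ns]}" for ns in TAG_NAMESPACES)
```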
Validation
```shell
# Every source folder has a manifest.yml (live count)
ls atlas-data/ingest/src/sources/*/manifest.yml | wc -l   # → live count from current catalogue

# No remaining TODOs after the auto-fill pass
grep -l "TODO" atlas-data/ingest/src/sources/*/manifest.yml | wc -l   # → 0

# Topic distribution looks plausible (no "ngo-supply" misclassifications across SSB/FHI)
grep -h " topic:" atlas-data/ingest/src/sources/*/manifest.yml | sort | uniq -c

# Re-running fill-manifest-todos is a no-op (idempotent — it only fills TODO/empty fields)
npm run sources:fill-manifest-todos   # → "filled 0 of 21"

# Build the seed CSV; validation fails loudly if any required field is missing
cd atlas-data/dbt && python scripts/build_sources_seed.py
ls -la seeds/sources/_sources_manifest.csv   # exists

# All four declared tag namespaces present per row
python -c "import csv; rows=list(csv.DictReader(open('seeds/sources/_sources_manifest.csv'))); print(all(set(t.split(':')[0] for t in r['tags'].split(',')) >= {'provider','topic','geo','cadence'} for r in rows))"   # → True

# Migration applied
psql "$DATABASE_URL" -c "\d raw.ingest_runs" | grep upstream_updated_at   # column visible

# SSB ingest modules write upstream_updated_at on next run
npm run ingest:ssb-08764
psql "$DATABASE_URL" -c "select source_slug, upstream_updated_at from raw.ingest_runs where source_slug='ssb-08764' order by run_id desc limit 1"   # non-null

# dbt seed loads the manifest
uv run --env-file ../ingest/.env dbt seed --select _sources_manifest   # success
```
Done when
- All 21 source folders contain a valid `manifest.yml` with all required top-level fields and all four tag namespaces.
- No `TODO` placeholders remain in any manifest.
- `bootstrap-manifest.ts` + `fill-manifest-todos.ts` are both idempotent (re-running them is a no-op against a fully-populated state).
- `build_sources_seed.py` produces a clean CSV; validation rejects missing fields.
- The `raw.ingest_runs.upstream_updated_at` migration is applied; the column is nullable.
- SSB ingest modules populate `upstream_updated_at` on runs (14 sources).
- `dbt seed` loads the manifest into `marts._sources_manifest`.
- The legacy Markdown table at `atlas-data/ingest/src/sources/README.md` is either auto-generated from the YAMLs (preferred) or replaced with a pointer.
Phase 3: marts.meta_sources + marts.meta_endpoints + marts.meta_dimensions dbt models
The joins. After this phase, three new mart_* views exist (and via the PLAN-004 generator, three new api_v1.meta_* wrappers) that carry the full tagged catalogue: per-source metadata + freshness, per-endpoint inventory + tag inheritance, and per-dimension editorial semantics joined with computed cardinality.
Tasks
- 3.1 Add
atlas-data/dbt/models/marts/api/mart_meta_sources.sql:- From:
marts._sources_manifest(Phase 2 seed; currently 38 rows, growing) - Left-join to
raw.ingest_runsaggregates per source:last_ingested_at:MAX(finished_at) WHERE exit_code = 0last_upstream_update_at:MAX(upstream_updated_at) WHERE exit_code = 0(nullable — only populated for sources whose ingest module captures it)latest_row_count:rows_parsedfrom the most recent successful runtotal_runs:COUNT(*) FILTER (WHERE exit_code = 0)
- Add
downstream_model_count: count of distinct downstream models from the lineage seed (Phase 3.3). - Output columns:
source_id,upstream_id,upstream_url,upstream_landing_page,upstream_title,description,publisher,license,license_url,periodicity,eu_theme,attribution,tags(text[]),last_ingested_at,last_upstream_update_at,latest_row_count,total_runs,downstream_model_count. - Add full
schema.ymldescription per column (PLAN-001's gate enforces this).
- From:
- 3.2 Add
atlas-data/dbt/models/marts/api/mart_meta_endpoints.sql:- From:
information_schema.tablesfiltered totable_schema in ('api_v1','marts','raw')(andnot in ('private_marts')defensively). Skipmarts._*private seeds (_sources_manifest,_sources_dimensions,eu_data_theme,lineage). - Output columns:
endpoint,schema,table,tags(text[]),row_count(via dynamic SQL or a daily-refreshed snapshot — see 3.3 for lineage),is_public_api(boolean: schema='api_v1') - Tag derivation:
layer:<schema>from the schema; union of allprovider:/topic:/geo:/cadence:/eu_theme:tags from the source(s) the endpoint derives from (via the lineage seed in 3.3). Union over intersection: amart_*derived from 17 indicator sources picks up every source's tag — easier to filter, "this mart involves something annual" is a more useful signal than "this mart is purely annual." Decision recorded inline so 3.2 doesn't re-litigate it. - Add full
schema.ymldescription per column.
- 3.3 Add `atlas-data/dbt/scripts/extract_lineage.py` that reads `target/manifest.json` after `dbt parse`, walks the dependency graph from each `api_v1.*` and `marts.*` model up to its root `raw.*` ancestors, and emits a dbt seed CSV at `seeds/sources/lineage.csv` with rows `(model_name, source_id)` — one row per (model, source) edge. Multiple rows per model when it derives from multiple sources (e.g. `fact_kommune_indicators` → many indicator sources). The hardcoded multi-table override map shipped via PR #73; PR #77 moved it into manifest.yml's `raw_tables:` field, so the script is now generic.
- [~] 3.4 Add `atlas-data/dbt/models/marts/api/mart_meta_dimensions.sql`. The editorial pass-through shipped (PR #73); the `cardinality` / `example_values` / `null_count` columns are deferred — design in INVESTIGATE-mart-meta-dimensions-cardinality.md (PR #78).
  - From: `marts._sources_dimensions` (the Phase 2.11 seed; ~198 rows = the sum of dimensions across all 38 sources), left-joined to per-(source, dimension) introspection of the corresponding `raw.*` table.
  - For every (source_id, dim_code) pair, compute against the raw table:
    - `cardinality`: `COUNT(DISTINCT <dim_column>)` — how many unique values appear.
    - `example_values`: an array of up to ~10 distinct values (sorted by frequency desc, then alpha) so users can see what the dimension actually contains.
    - `null_count`: rows where the dim value is null (should be 0 for non-degenerate dims).
  - Output columns: `source_id`, `code` (upstream dim name), `meaning`, `value_format`, `notes` (from the seed), `cardinality`, `example_values` (text[]), `null_count`. The frontend renders "what each column means × what values it actually contains" in one card.
  - Implementation note: introspecting `raw.*` tables means generating one SELECT per (source × dim) pair via dbt Jinja iteration over the seed contents. Use `run_query()` at parse time to read the seed; build a per-source UNION ALL. Keep an eye on dbt Core's parse-time query budget — if it slows, fall back to a static CTE per source that the seed-gen script emits.
  - Add full `schema.yml` description per column.
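The three deferred 3.4 computations are small per (source, dimension) pair. A Python sketch over invented raw values (the INVESTIGATE doc proposes doing this in an extract script rather than in dbt Jinja; the dimension and its values here are hypothetical):

```python
from collections import Counter

# Hypothetical raw column values for one (source, dimension) pair,
# e.g. an SSB Region dimension — data invented for illustration.
raw_region_values = ["0301", "0301", "1103", "5001", "0301", None, "1103"]

non_null = [v for v in raw_region_values if v is not None]
cardinality = len(set(non_null))

# up to ~10 distinct values, frequency desc then alphabetical — the plan's order
freq = Counter(non_null)
example_values = [v for v, _ in sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))][:10]

null_count = raw_region_values.count(None)

print(cardinality, example_values, null_count)  # 3 ['0301', '1103', '5001'] 1
```

A non-zero `null_count` is the signal the column is degenerate for that dimension — exactly the sanity check the output column is meant to surface.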
- 3.5 Run `./regenerate-api-v1.sh` + `./apply-api-v1.sh`. The PLAN-004 generator picks up `mart_meta_sources`, `mart_meta_endpoints`, and `mart_meta_dimensions`, emits `api_v1.meta_sources` / `api_v1.meta_endpoints` / `api_v1.meta_dimensions` wrappers, and all five validation gates pass. Wrapper count went 10 → 13.
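Task 3.3's graph walk can be sketched as follows. The `parent_map` shape (node → list of parent unique_ids) mirrors what dbt writes to `target/manifest.json`, but the node names below are invented:

```python
# Toy parent_map in the manifest.json shape — model and source names invented.
parent_map = {
    "model.atlas.fact_kommune_indicators": [
        "model.atlas.stg_ssb_08764", "model.atlas.stg_fhi_income"],
    "model.atlas.stg_ssb_08764": ["source.atlas.raw.ssb_08764"],
    "model.atlas.stg_fhi_income": ["source.atlas.raw.fhi_income"],
}

def root_sources(node: str) -> set[str]:
    """Walk ancestors until we hit source.* nodes (the raw.* tables)."""
    stack, seen, roots = [node], set(), set()
    while stack:
        for parent in parent_map.get(stack.pop(), []):
            if parent in seen:
                continue
            seen.add(parent)
            if parent.startswith("source."):
                roots.add(parent.rsplit(".", 1)[-1])  # raw table name
            else:
                stack.append(parent)
    return roots

# One (model_name, source_id) row per edge — the lineage.csv shape.
edges = [("fact_kommune_indicators", s)
         for s in sorted(root_sources("model.atlas.fact_kommune_indicators"))]
print(edges)
```

The recursion lives entirely at extract time, matching the implementation note below: the seed itself stays a flat edge list.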
Validation
Counts assume the catalogue at the moment of running. Substitute the live count from `select count(*) from marts._sources_manifest;` for any "X rows" assertion below — the catalogue grows continuously.

```bash
cd atlas-data/dbt
uv run --env-file ../ingest/.env dbt seed --select sources
uv run --env-file ../ingest/.env dbt run --select mart_meta_sources mart_meta_endpoints mart_meta_dimensions
./regenerate-api-v1.sh && ./apply-api-v1.sh

# meta_sources row count matches the manifest seed
N=$(psql "$DATABASE_URL" -tAc 'select count(*) from marts._sources_manifest;')
curl -sS "http://api-atlas.localhost/meta_sources" | jq 'length'  # → $N

# Every row has the required fields
curl -sS "http://api-atlas.localhost/meta_sources" | jq '[.[] | select(.license == null or .publisher == null or .upstream_title == null or .periodicity == null or .eu_theme == null)] | length'  # → 0

# Every row has all five declared tag namespaces (provider/topic/geo/cadence/eu_theme)
curl -sS "http://api-atlas.localhost/meta_sources" | jq '[.[] | select(.tags | length < 5)] | length'  # → 0

# Filter by tag
curl -sS "http://api-atlas.localhost/meta_sources?tags=cs.{provider:ssb}" | jq 'length'  # > 0

# SSB sources have last_upstream_update_at populated
curl -sS "http://api-atlas.localhost/meta_sources?tags=cs.{provider:ssb}" | jq '[.[] | select(.last_upstream_update_at != null)] | length'  # > 0 after a full ingest cycle

# meta_endpoints row count: roughly N indicator marts + dims + facts + supply marts + api_v1 wrappers
curl -sS "http://api-atlas.localhost/meta_endpoints" | jq 'length'  # → ~80+ at 38 sources, grows linearly

# Endpoints inherit tags from sources
curl -sS "http://api-atlas.localhost/meta_endpoints?tags=cs.{topic:income}" | jq 'length'  # > 0

# meta_dimensions has one row per (source × upstream-dimension); ~198 rows at 38 sources
curl -sS "http://api-atlas.localhost/meta_dimensions" | jq 'length'  # > 0
curl -sS "http://api-atlas.localhost/meta_dimensions?source_id=eq.ssb-08764" | jq 'length'  # → 3 (Region, ContentsCode, Tid)

# Every dimension row has cardinality + example_values populated
curl -sS "http://api-atlas.localhost/meta_dimensions" | jq '[.[] | select(.cardinality == null or (.example_values | length) == 0)] | length'  # → 0
```
Done when
- `marts.meta_sources` exists; each row has `provider`, `topic`, `geo`, `cadence`, `eu_theme` tags + `latest_run_at` from `raw.ingest_runs`. Row count = `_sources_manifest` row count.
- `marts.meta_endpoints` exists with one row per public endpoint (skipping `marts._*` private seeds + `private_marts.*`); each has a `layer:` tag plus inherited source tags via the union rule.
- `marts.meta_dimensions` exists with one row per (source_id × upstream-dim); each has hand-authored `meaning` / `value_format` / `notes` joined with computed `cardinality` and `example_values` from `raw.*`.
- `api_v1.meta_sources`, `api_v1.meta_endpoints`, and `api_v1.meta_dimensions` wrap them; all PLAN-004 validation gates pass.
- PostgREST returns the same row counts under `Prefer: count=exact` for the three `meta_*` endpoints.
Phase 4: Customer frontend /data rewrite + per-source detail
Replace the existing flat catalogue with the tag-filter sidebar layout. Add a per-source detail page.
Tasks
- 4.1 Rewrite `atlas-frontend/src/app/data/page.tsx`:
  - ✅ Fetches `api_v1.meta_endpoints` directly via a server component (`fetch()` with `next: { revalidate: 60 }`).
  - ✅ Reads `searchParams.tag` (string | string[]) for active filters; URL state `?tag=topic:income&tag=geo:kommune&q=oslo`.
  - ✅ Filtering happens in node, not via PostgREST `?tags=cs.{...}` (the original plan said server-side via PostgREST). Pivot rationale: meta_endpoints is 119 rows; a node-side filter is trivially fast and supports the full faceted-search semantics (AND across namespaces, OR within) without composing complex PostgREST `or=()` clauses. Pure helpers extracted to `src/lib/catalog-filter.ts` for testability.
  - ✅ Two-column layout: a sidebar (18rem fixed) with 6 namespace-grouped checkboxes + faceted-search counts (counts re-compute as filters apply); endpoint cards on the right.
  - ✅ Cards show a layer-coloured badge (api_v1 emerald, marts sky, raw zinc), `table_name` in mono, layer-stripped tag pills (clickable to add), and right-aligned "View as table" + "View spec" links to `/data/{schema}/{table}` and `/data/{schema}/{table}/spec`.
  - ✅ Pure URL-driven, no client JS, no React state.
  - Bundled scope (extension to the original plan, atlas Phase 4.1 PR): the `/data/[endpoint]/` route was restructured to `/data/[schema]/[table]/` so the table + spec viewers know which schema to send `Accept-Profile` for. Without this, marts/raw cards on the new catalog would 404 on click. The route is hard-cut (no back-compat redirect from `/data/[endpoint]`); old URLs aren't externally linked yet. The viewers' `fetchSpec` / `fetchRows` / `fetchCount` calls all gain `{ acceptProfile: schema }`; the lib drops the header for `api_v1` so default-schema requests stay header-less (matches the talk40 gotcha note above).
- 4.2 Update `npm run api:types` so the new `meta_sources` and `meta_endpoints` endpoint types appear in `api-types.ts`. Closed as a no-op — Phase 4.1's catalog rewrite reads `meta_endpoints` dynamically (the typed `api-types.ts` union is no longer load-bearing for catalog discovery), so the regen is a routine maintenance step a contributor runs whenever they want IDE autocompletion refreshed. No explicit task.
- 4.3 Per-source detail page shipped at `atlas-frontend/src/app/data/sources/[source_id]/page.tsx`. Three parallel live PostgREST fetches (meta_sources filtered by source_id, the full meta_endpoints list, marts.lineage filtered by source_id via `Accept-Profile: marts`); 404 when the source_id is missing. Renders: a source-metadata card (provider / license / periodicity / EU theme / upstream id / attribution), a freshness card (last_ingested_at / last_upstream_update_at / total_runs / latest_row_count), tags as click-throughs to `/data?tag=...`, the upstream link, a raw-ingest-table card, and a derived-endpoints list with `View as table` + `View spec` per row. PR #79's sources index now links source ids to this detail page; the prior interim direct-to-raw link is preserved in the action row as `Raw data →`.
- 4.4 Homepage copy updated in PR #85: the primary button reads "Browse all endpoints →" (was "Browse the data"), and a sibling "Sources →" button + caption distinguish the two surfaces.
- 4.5 `atlas-frontend/README.md` refreshed: lists every route (homepage `/`, `/data`, `/data/[schema]/[table]`, `/data/[schema]/[table]/spec`, `/data/sources`, `/data/sources/[source_id]`); documents `acceptProfile` on the lib helpers; adds a "Tag URLs" section with five bookmarkable example query strings; cross-links to PLAN-005 (the initial split) and PLAN-007 (this PLAN's open-by-default rewrite).
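Task 4.1's node-side filter semantics (OR within a tag namespace, AND across namespaces) can be sketched in Python. The shipped helper lives in TypeScript at `src/lib/catalog-filter.ts`; the endpoint rows and tag values below are invented:

```python
# Invented endpoint rows in the meta_endpoints shape.
endpoints = [
    {"table": "fact_income", "tags": ["layer:marts", "topic:income", "geo:kommune"]},
    {"table": "fact_population", "tags": ["layer:marts", "topic:population", "geo:kommune"]},
]

def matches(tags: list[str], active: list[str]) -> bool:
    # Group the active filters by namespace (the part before the colon).
    by_ns: dict[str, set[str]] = {}
    for t in active:
        by_ns.setdefault(t.split(":", 1)[0], set()).add(t)
    # Every active namespace must hit (AND); any tag within it suffices (OR).
    return all(wanted & set(tags) for wanted in by_ns.values())

active = ["topic:income", "topic:population", "geo:kommune"]
hits = [e["table"] for e in endpoints if matches(e["tags"], active)]
print(hits)  # both rows: the two topic filters are OR'd, geo must also match
```

Expressing the same thing in PostgREST would require composing `or=()` groups per namespace and AND-ing them — the complexity the 4.1 pivot avoids at 119 rows.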
Validation
```bash
cd atlas-frontend && npm run typecheck && npm run lint && npm run build  # all clean
npm run dev  # boots on :3001

# /data renders with the sidebar
curl -sS http://localhost:3001/data | grep -c "namespace-group\|filter-sidebar"  # > 0

# Tag filter URL works
curl -sS "http://localhost:3001/data?tag=provider:ssb" | grep -oE "ssb-[0-9]+" | sort -u  # 14 entries

# Per-source detail works
curl -sS -o /dev/null -w "%{http_code}\n" "http://localhost:3001/data/sources/ssb-08764"  # → 200
curl -sS -o /dev/null -w "%{http_code}\n" "http://localhost:3001/data/sources/notreal"  # → 404
```
Done when
- `/data` renders the tag-filter sidebar + cards layout against the live API.
- Every public endpoint is visible when no filter is active; filtering by any tag combination works via URL params.
- `/data/sources/[source_id]` renders for valid source IDs; 404 for invalid.
- All PLAN-005 routes (the `/data/[endpoint]` table viewer + `/data/[endpoint]/spec` viewer) carry forward unchanged.
Phase 5: Docs
Tasks
- 5.1 `setup.md`: a new "Per-source `manifest.yml`" subsection after "Set up the ingest layer", documenting the 11 required top-level fields (incl. `eu_theme`, `attribution`), the four `tags:` namespaces with allowed values, and the `dimensions:` block with an example. Plus the "Set up the customer frontend" section refreshed to list the actual shipped routes (`/data`, `/data/sources`, `/data/sources/[source_id]`, `/data/[schema]/[table]`) instead of the stale "lists every api_v1.* endpoint" copy.
- 5.2 `ingest-modules.md`: the manifest workflow at step 4 expanded into 4 explicit sub-steps (bootstrap → fill → review-with-heuristic-warnings → commit). Names the topic-regex first-match-wins behaviour and the geo priority (kommune > fylke > bydel). Documents `MANUAL_OVERRIDES` for sources without an `## Upstream` README table. Closes with the "manifest is human-authored after commit; ingest runs don't modify it" rule.
- 5.3 `developers/index.md`: a new "Open by default" section explaining the three-schema posture (api_v1 / marts / raw), how to reach non-default schemas via `Accept-Profile`, the catalog-as-queryable-endpoint pattern, and a tag-filter URL pattern table (7 examples) external developers can use to deep-link filtered views. The customer-app section's description updated to reflect the multi-schema dispatch + the lib's `acceptProfile` option.
- 5.4 `atlas-data/ingest/src/sources/README.md`: the auto-generated table from `build_sources_seed.py --readme` was already in place (BEGIN/END markers since Phase 2.9). Added a "Programmatic access" callout above the table pointing at `api_v1.meta_sources` as the canonical live view, with a `curl` example, so external developers see the API path alongside the offline Markdown table.
Validation
User reads the updated docs and confirms a new contributor could (a) author a manifest.yml for a new source and (b) understand the tag-driven catalogue without consulting this PLAN.
Done when
All four doc files reflect the new shape; no stale references to "the 9 endpoints" remain in contributor or developer surfaces.
Acceptance criteria
- PostgREST serves `api_v1.*` + `marts.*` + `raw.*` (private_marts stays excluded). Verified via `curl api-atlas.localhost/dim_kommune` and `curl api-atlas.localhost/ssb_08764`.
- Every source folder in `atlas-data/ingest/src/sources/` contains a valid `manifest.yml` with the required top-level fields (`source_id`, `upstream_id`, `upstream_url`, `upstream_title`, `description`, `publisher`, `license`, `license_url`, `periodicity`, `eu_theme`, `attribution`) + the four declared tag namespaces (`provider`, `topic`, `geo`, `cadence`) + a hand-authored `dimensions:` block. (Phase 2 — currently 38 sources.)
- The `raw.ingest_runs.upstream_updated_at` column exists; the `recordIngestRun()` wrapper populates it for SSB / FHI sources. (Phase 2.)
- `marts.meta_sources` (and its `api_v1.meta_sources` wrapper) carries one row per source in `_sources_manifest`, each with full tags + license + publisher + periodicity + eu_theme + `last_ingested_at` + `last_upstream_update_at` (where the source supports it).
- `marts.meta_endpoints` (and its `api_v1.meta_endpoints` wrapper) carries one row per public endpoint, each with `layer:` + inherited source tags via the union rule.
- `marts.meta_dimensions` (and its `api_v1.meta_dimensions` wrapper) carries one row per (source × upstream-dimension), with hand-authored `meaning` / `value_format` / `notes` joined with computed `cardinality` and `example_values` from `raw.*` introspection.
- All five PLAN-004 validation gates still pass (drift, coverage, static description, runtime description, row-count parity).
- The customer frontend `/data` renders the tag-filter sidebar + cards layout; URL state is bookmarkable; every public endpoint is visible.
- The `/data/sources/[source_id]` per-source detail renders for every source in `_sources_manifest`; 404 otherwise.
- Contributor docs (`setup.md`, `ingest-modules.md`) describe the manifest.yml convention + the tag namespaces + the `dimensions:` block.
- Developer docs (`developers/index.md`) describe the open-by-default principle + the tag-filter URL pattern.
Files to modify
New (atlas-data):
- `atlas-data/ingest/src/sources/<id>/manifest.yml` — one per source, currently 38 (auto-bootstrapped + auto-filled + hand-authored `dimensions:` block)
- `atlas-data/ingest/scripts/bootstrap-manifest.ts` — provider-aware bootstrap (SSB PxWebAPI extractor + FHI extractor + fallback template); npm alias `sources:bootstrap-manifest`
- `atlas-data/ingest/scripts/fill-manifest-todos.ts` — README-parsing TODO-filler (description, upstream_id, upstream_title, license, tags) with topic/geo regex rules + `MANUAL_OVERRIDES` for redcross-branches/frr; npm alias `sources:fill-manifest-todos`
- `atlas-data/ingest/src/lib/ingest_run.ts` — shared `recordIngestRun(sourceId, work)` wrapper that owns the start/finish + sql lifecycle; replaces the original "one-place change" plan
- `atlas-data/migrations/028_raw_ingest_runs_upstream_updated.sql` — adds the `upstream_updated_at` column
- `atlas-data/dbt/scripts/build_sources_seed.py` — YAML scanner → dbt seed CSV (validates required fields, refuses TODO placeholders)
- `atlas-data/dbt/scripts/extract_lineage.py` — `manifest.json` → lineage seed CSV
- `atlas-data/dbt/seeds/sources/_sources_manifest.csv` — generated, committed
- `atlas-data/dbt/seeds/sources/_sources_dimensions.csv` — 90-row editorial dimension reference (one row per source × dimension)
- `atlas-data/dbt/seeds/sources/eu_data_theme.csv` — 13-row EU Data Theme lookup
- `atlas-data/dbt/seeds/sources/schema.yml` — column descriptions + tests for all three seeds (incl. the eu_theme → eu_data_theme.code and dimensions.source_id → _sources_manifest.source_id relationships)
- `atlas-data/dbt/seeds/sources/lineage.csv` — generated, committed
- `atlas-data/dbt/models/marts/api/mart_meta_sources.sql`
- `atlas-data/dbt/models/marts/api/mart_meta_endpoints.sql`
- `atlas-data/dbt/models/marts/api/mart_meta_dimensions.sql`
- `atlas-data/dbt/models/marts/api/schema.yml` — descriptions for all three new models
Updated (atlas-data):
- `atlas-data/dbt/dbt_project.yml` — seed config for `seeds/sources/`
- `atlas-data/ingest/src/sources/README.md` — auto-generated from the YAMLs (or a pointer)
- `atlas-data/ingest/src/sources/<id>/index.ts` — SSB modules updated to capture the upstream `updated` field and pass it to the run-record helper (one shared code path)
- `atlas-data/ingest/src/lib/ingest-runs.ts` (or wherever the run-record write lives) — accepts `upstream_updated_at` from the caller, writes to the new column
- Generated: `atlas-data/dbt/api_v1_generated.sql` + `api_v1_state.json` (PLAN-004 generator output)
Updated (atlas-frontend):
- `atlas-frontend/src/app/data/page.tsx` — rewritten as the tag-filter sidebar
- `atlas-frontend/src/app/data/sources/[source_id]/page.tsx` — new per-source detail route
- `atlas-frontend/src/app/page.tsx` — minor copy update
- `atlas-frontend/README.md` — mention the tag-driven catalogue
- Regenerated: `atlas-frontend/src/lib/api-types.ts` (via `npm run api:types`)
Updated docs:
- `website/docs/contributors/setup.md`
- `website/docs/contributors/ingest-modules.md`
- `website/docs/developers/index.md`
UIS-side (cross-repo):
- `urbalurba-infrastructure/provision-host/uis/lib/configure-postgrest.sh` — extend `PGRST_DB_SCHEMAS` + grants
- `urbalurba-infrastructure/website/docs/services/integration/postgrest.md` — document the new schema-set defaults
Out of scope
- The auth story for `private_marts.*` — covered by INVESTIGATE-private-atlas-deployments.md.
- Column-level descriptions on `raw.*` tables — they remain undocumented; external consumers see `meta_sources.upstream_url` for canonical docs.
- Lineage visualisation (mermaid graphs) — `meta_endpoints` carries the data; rendering a graph is a v2 polish.
- Tag governance / curation tooling — manual for v1 (a quarterly review by whoever's stewarding source ingests). Automate later if it gets messy.
- Search-relevance scoring across sources — keep the existing free-text search; tags are the structured navigation, search is the unstructured complement.
Related work shipped during PLAN-007 execution
Captured here so the PLAN serves as project documentation, not just an aspirational checklist. Each entry is work that surfaced during the PLAN's execution but didn't fit a numbered task.
Catalogue growth + cloud-agent pipeline (parallel to Phase 2/3)
- Catalogue grew 21 → 41 sources during the FHI / Bufdir / SSB-crime onboarding waves (2026-04-30 → 2026-05-06). 17 FHI sources from human-driven onboarding; 4 ssb-crime tables + bufdir-barnefattigdom + ssb-10826 from Cursor BG cloud-agent runs.
- Cloud-agent runbook (`AGENT-onboard-source.md` + `.cursor/rules/onboard-source.mdc`) shipped via PR #36, refined via subsequent commits to support both queue-mode (issue-claim) and named-candidate-mode invocations.
- `npm run ingest:all` catch-up script + raw.ingest_runs validation shipped via PR #80 — discovers every `ingest:*` script in package.json, runs them sequentially, validates each via the `recordIngestRun()` lifecycle wrapper, and prints per-source row count + duration. Closes the post-reset workflow's "ingest 36 sources by hand" gap.
Bufdir hardening track (PRs #67, #68, #69, #70, #71)
- PR #67: split `bufdir-barnefattigdom/index.ts` into a pure `parse.ts` + 29 golden-file tests; multi-tier ZIP-URL discovery (canonical → loose-date-format → loose-monitor → loose-bare, with the fallback tier logged).
- PR #71: surrogate `indicator_api_id` migration — `bf_zip_<24-hex>` → `bf_zip_ind_<N>` (number-prefix); alias seed `bufdir_indicator_alias.csv` for renumbering events (Indikator 9 → 9a/9b split, Indikator 10 retired). Wrapped via the PLAN-004 generator as `api_v1.bufdir_indicator_alias`.
- The `lib/output.ts` per-line streaming refactor (`writeNdjson` + the new `ndjsonStreamingWriter`) shipped along the way to fix a V8 `Invalid string length` crash on bufdir's 395k-row output.
Cluster rebuild + setup workflow hardening (PRs #62, #65, #66)
- Postgres + UIS cluster wiped + rebuilt 2026-05-05 (rancher-desktop reset). Surfaced gaps in the post-reset workflow:
  - PR #62: `setup.md` gains the docker-psql fallback for hosts without `libpq` + an explicit "After a cluster reset / fresh start" recovery sequence.
  - PR #66: Klass dim-spine ingests (ssb-klass-kommuner + ssb-klass-fylker) made mandatory in step 6 — without them, every `relationships → dim_kommune` test fails by definition (the dim builds but is empty).
  - PR #65: dbt-osmosis canonicalisation fix for the YAML-style drift introduced by Cursor BG's bufdir descriptions.
Frontend scaffolding (PR #79; partial Phase 4 task 4.3)
- `/data/sources/page.tsx` — sources index reading `api_v1.meta_sources` live. Same introspection-driven pattern as PLAN-005's `/data` catalog. Grouped by provider (a pragmatic v1; the full tag-filter sidebar is task 4.1).
Doc / process improvements (PRs #58, #59, #66, #76, #77)
- PR #59: validated the INVESTIGATE + PLAN, added `mart_meta_dimensions` to the Phase 3 task list (it was missing despite the seed being built for it).
- PR #76: `setup.md` — the `--schemas api_v1,marts,raw` flag added to the `./uis configure postgrest` line (paired with UIS PR #140's flag landing).
- PR #77: moved `extract_lineage.py`'s hardcoded multi-table override map into manifest.yml's `raw_tables:` field. Closes a follow-up flagged in PR #73's outcome notes.
- PR #78: INVESTIGATE for `mart_meta_dimensions` cardinality enrichment (deferred from Phase 3.4).
Open follow-ups (tracked outside PLAN-007)
- `mart_meta_dimensions` cardinality + example_values + null_count — design in INVESTIGATE-mart-meta-dimensions-cardinality.md. Recommends a Python extract script (analogous to `extract_lineage.py`) + an optional `column_name:` field on each dim entry. Estimated at half a day's implementation once accepted.
Cross-references
- INVESTIGATE-customer-frontend-data-display.md — the architectural commitments this PLAN executes.
- PLAN-004-postgrest-api-v1-wrapper.md — the auto-generator wraps `mart_meta_*` into `api_v1.meta_*`. This PLAN reuses that pipeline unchanged.
- PLAN-005-frontend-split-and-rebuild.md — built the introspection-driven `/data` catalogue this PLAN extends. The existing `/data/[endpoint]` table viewer + `/data/[endpoint]/spec` viewer are unaffected.
- INVESTIGATE-frontend-data-access-architecture.md — established forkability + no-DB-role for the customer frontend; this PLAN preserves both.
- `atlas-data/ingest/src/sources/README.md` — the legacy Markdown registry that becomes the structured `manifest.yml` set in Phase 2.
- `raw.ingest_runs` — the run-history substrate `meta_sources` joins to.
- `talk.md` — empty placeholder; Phase 1 opens a new round here.
Implementation notes
- Phase 1 had cross-repo asynchrony — and the parallel sequencing worked as intended. Atlas Phase 2 + 3 + Phase 4 task 4.3 (sources index) all shipped against the existing single-schema PostgREST while UIS's PR was in flight. UIS Message 1 pushed back on the original "global default" framing in favour of an explicit per-app `--schemas` flag; atlas accepted in Message 3; UIS's PR #140 merged 2026-05-07 (a single-day round-trip from atlas Message 4 validation to UIS Message 3 close-out). Total elapsed: 8 days from atlas Message 1 to Phase 1 close. Lesson for future cross-repo asks: validate against the contributor's local-image deployment before they push the PR — it saved a CI round-trip here.
- Tag inheritance — union, not intersection. Recorded inline at Phase 3.2. A `mart_*` derived from many sources picks up the union of source tags, so filters like `topic:income` surface every mart that involves income data, not just marts where every source happens to be income-shaped. Don't re-litigate.
- `marts._*` private seeds stay out of `mart_meta_endpoints`. `_sources_manifest`, `_sources_dimensions`, `eu_data_theme`, and the future `lineage` seed are dbt internals — they live in `marts` (so models can `ref()` them) but the underscore prefix marks them not-for-API. The auto-generator at `regenerate-api-v1.sh` already skips them by convention. `mart_meta_endpoints`'s `information_schema.tables` query needs an explicit `WHERE table_name NOT LIKE '\_%'` filter to match.
- Editorial vs computed in `mart_meta_dimensions`. The `_sources_dimensions` seed carries hand-authored editorial content (`meaning`, `value_format`, `notes` — what the dimension is). The mart joins it with introspection of `raw.*` (`cardinality`, `example_values` — what the dimension actually contains). Both are valuable; one without the other gives only half the picture. The seed is deliberately the only source of editorial truth — don't add computed fields to the seed itself, and don't add hand-authored fields to the introspection layer.
- Don't over-engineer the lineage extraction. A flat `(model_name, source_id)` seed is enough — recursive walks of the dbt graph happen at extract time, not at query time. PostgREST consumers see `meta_endpoints.tags` as an already-flattened array.
- The catalogue grows continuously. Every Cursor BG run lands a new source. The PLAN's validation gates are expressed as live `count(*)` queries against `_sources_manifest` rather than fixed numbers — this keeps the doc maintainable as the catalogue moves from 38 → 50 → 100+.