# Adding a new data source

This is the end-to-end workflow for adding a new upstream data source to Atlas — from cataloguing the source to making it queryable through the public API. It is written for an external contributor who hasn't seen the codebase before; experienced contributors can skim.
If you've never seen Atlas's data pipeline before, read the data journey walkthrough first — it traces one source (SSB 08764) end-to-end so you understand what the pieces are. Then come back here for the procedural steps.
What "adding a source" means in Atlas
Atlas's data side is two layers:
- Ingest — TypeScript modules under
atlas-data/ingest/src/sources/that fetch upstream data (SSB, FHI, Brreg, NGO HTML scrapes) and write to theraw.*Postgres schema. Verbatim shape — no renaming or reshaping. - dbt — SQL models under
atlas-data/dbt/models/that transformraw.*intomarts.*(the internal data layer; the external public API is PostgREST againstapi_v1.*wrapper views — see api-v1.md).
Adding a source means both: write the ingest module that lands rows in `raw.<source>`, and the dbt model that maps `raw.<source>` to a clean per-source mart in `marts.indicators__<source>`.
A typical SSB source takes ~30 minutes for the ingest side, ~20 minutes for the dbt side. Scraping sources (HTML, no API) take longer — see ingest-modules.md § scraping convention.
## Prerequisites

Before you start:

- Read `docs/stack/naming-conventions.md` — the canonical vocabulary (`kommune_nr`, `fylke_nr`, `orgnr`, etc.). Atlas does not invent variants; rule #5 says every column must have a description in `schema.yml` (enforced by the check-osmosis gate).
- Skim the data journey walkthrough for SSB 08764 if you haven't already.
- Read `atlas-data/ingest/src/sources/ssb-08764/` as the template — you'll copy and adapt this for SSB-style sources.
- Have your dev environment set up — see setup.md.
If you're adding a scraping source (HTML, no API), the workflow below covers the API-source baseline; you also need:
- INVESTIGATE-ngo-scraping-infrastructure.md — design rationale (Crawlee, robots, `sitemap_log`, `record_hash`)
- ingest-modules.md § scraping convention — the extended folder layout (`discover.ts`, `parse.ts`, `__tests__/fixtures/`)
- PLAN-001-scraping-infrastructure.md — what shipped in `src/lib/scraping/`
If you're covering a new NGO with Brreg data (legal-entity metadata), you do not add a new ingest source — the generic `refresh:brreg-enheter` already handles every NGO. Add a `brreg_query` block to the NGO's entry in `landscape.json`, then re-run `npm run refresh:brreg-enheter`. See ingest/src/lib/brreg/README.md.
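The exact shape of the `brreg_query` block is documented in ingest/src/lib/brreg/README.md; purely as an illustration (the field names inside the block are assumptions, not the real schema), an entry might look like:

```json
{
  "id": "example-ngo",
  "brreg_query": {
    "orgnr": "999999999"
  }
}
```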
## The 11-step workflow
Execute these steps in order. Later steps depend on earlier ones — don't skip ahead.
### Step 1 — Catalogue entry

File: `docs/research/samfunnspuls/data-sources.md` (or the broader `docs/research/data-sources.md`).

Required fields (minimum): `id`, `provider`, `kind`, `title_no`, `what_it_is`, `use_cases` (≥1), `questions_answered` (≥2), `endpoint`, `auth`, `atlas_decision`, `verified_on`.
Done when: entry exists and is syntactically valid per the schema file.
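As a sketch only, assuming YAML-style catalogue entries (the schema file is authoritative for the real format), a minimal entry for a hypothetical source `ssb-99999` might look like:

```yaml
id: ssb-99999                      # hypothetical id used throughout this page
provider: SSB
kind: statbank_table
title_no: 'Eksempeltabell'
what_it_is: 'One-line description of what the upstream table contains.'
use_cases:
  - 'Which kommuner score highest on the example indicator?'
questions_answered:
  - 'How has the indicator developed per kommune since 2015?'
  - 'Where are the national outliers in the latest year?'
endpoint: 'https://data.ssb.no/api/v0/no/table/99999'
auth: none
atlas_decision: ingest             # illustrative value; use the catalogue's real vocabulary
verified_on: 2025-01-15
```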
### Step 2 — Investigate the upstream
Fetch upstream metadata and record it in a scratch note (which doesn't need committing):
- Dimensions and sizes
- Whether any dimensions require explicit selection (in SSB Klass: `elimination=false`)
- Row count estimate for the default query
- Update cadence
- Region code format — bare 4-digit, prefixed (`K_0301`), or alphanumeric (`030101a`)?
Done when: a test curl or fetch returns valid data with the filters you plan to use.
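As a concrete probe, here is a minimal sketch against SSB's PxWeb API (table number hypothetical; an empty query requests the whole table, which only works under the API's cell limit):

```bash
curl -s -X POST 'https://data.ssb.no/api/v0/no/table/99999' \
  -H 'Content-Type: application/json' \
  -d '{"query": [], "response": {"format": "json-stat2"}}' \
  | head -c 400   # eyeball the dimension ids and the first values
```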
### Step 3 — Raw landing table migration

File: `atlas-data/migrations/NNN_raw_<source_id_with_underscores>.sql` (NNN = next free 3-digit number, zero-padded).
Required content:
- `create table if not exists raw.<source_id>` — underscores in SQL identifiers, not hyphens
- Columns match upstream shape — no renaming at this layer
- Composite `primary key` on all dimension columns
- `loaded_at timestamptz not null default now()`
- `comment on table` describing the source in one sentence
- `comment on column` for any non-obvious column (prefix codes, suppression markers, etc.)
Done when: `npm run migrate` applies the file without error.
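A minimal sketch for the hypothetical `ssb-99999` source; the column names mirror an invented upstream shape, yours come verbatim from step 2:

```sql
-- migrations/NNN_raw_ssb_99999.sql (hypothetical example)
create table if not exists raw.ssb_99999 (
  region    text not null,   -- upstream region code, verbatim (may carry a prefix)
  aar       text not null,   -- upstream year dimension, verbatim
  value     numeric,         -- null where upstream suppresses the cell
  loaded_at timestamptz not null default now(),
  primary key (region, aar)
);

comment on table raw.ssb_99999 is
  'Verbatim landing table for the hypothetical SSB table 99999.';
comment on column raw.ssb_99999.region is
  'Upstream region code as delivered; any K_ prefix is stripped in dbt, not here.';
```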
### Step 4 — Ingest module

Folder: `atlas-data/ingest/src/sources/<source-id>/` — hyphens in folder names.

Two files: `index.ts` and `README.md`. The full template is in ingest-modules.md. Key constraints:

- `index.ts` uses shared helpers from `../../lib/*` (no inline `writeNdjson`, no inline Postgres client, no hard-coded credentials).
- Row type declared inline; do not add it to `lib/types.ts`.
- `SOURCE_ID` constant matches the catalogue `id` exactly.
- `README.md` has all 9 required sections (see ingest-modules.md § README structure).

Done when: `npm run typecheck` passes and `npm run ingest:<source-id>` writes rows to `raw.<source_id>`.
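For shape only: a sketch of `index.ts` in which the helper names (`writeRows`, `decodeJsonStat2`) are illustrative stand-ins, not the real exports of `../../lib/*`; copy the actual template from ingest-modules.md.

```typescript
import { writeRows } from '../../lib/db'; // hypothetical helper; use the real shared libs

export const SOURCE_ID = 'ssb-99999'; // must match the catalogue id exactly

// Row type declared inline, mirroring raw.ssb_99999; never added to lib/types.ts.
type Row = { region: string; aar: string; value: number | null };

// Stand-in for the json-stat2 decoding real modules do dimension by dimension.
declare function decodeJsonStat2(payload: unknown): Row[];

async function main(): Promise<void> {
  const res = await fetch('https://data.ssb.no/api/v0/no/table/99999', {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({ query: [], response: { format: 'json-stat2' } }),
  });
  if (!res.ok) throw new Error(`upstream returned ${res.status}`); // surface 429/5xx loudly
  const rows = decodeJsonStat2(await res.json());
  await writeRows('raw.ssb_99999', rows); // verbatim shape into the landing table
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```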
### Step 5 — npm script

File: `atlas-data/ingest/package.json`.

Add to the scripts block, alphabetically among the other `ingest:*` entries:

```json
"ingest:<source-id>": "tsx --env-file=.env src/sources/<source-id>/index.ts"
```
### Step 6 — Source-list entry

File: `atlas-data/ingest/src/sources/README.md`.

Add one row to the "Implemented sources" table, alphabetically. Include: link to the per-source folder, provider, one-sentence description, the `npm run` command, and one important quirk (<120 chars).
### Step 7 — dbt source declaration

File: `atlas-data/dbt/models/indicators/sources.yml` (or `dimensions/sources.yml` for dimensions).

Add a `tables[]` entry under the raw source with: `name`, `description` (2–3 sentences), `loaded_at_field: loaded_at`, a `freshness` block (`warn_after` + `error_after`), and `columns:` with `not_null` tests on PK columns plus `accepted_values` where the value set is bounded.
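An illustrative entry for the hypothetical `raw.ssb_99999` table (freshness thresholds invented; pick values that match the upstream cadence you found in step 2):

```yaml
# fragment; the file already declares the raw source
sources:
  - name: raw
    tables:
      - name: ssb_99999
        description: >
          Verbatim landing table for the hypothetical SSB table 99999.
          One row per upstream region and year; no renaming at this layer.
        loaded_at_field: loaded_at
        freshness:
          warn_after: {count: 14, period: day}
          error_after: {count: 35, period: day}
        columns:
          - name: region
            tests: [not_null]
          - name: aar
            tests: [not_null]
```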
### Step 8 — dbt per-source model

File: `atlas-data/dbt/models/indicators/indicators__<source_id_with_underscores>.sql`.

Required content:

- `{{ config(materialized='table', schema='marts') }}` at top
- `select 'ssb-XXXXX'::text as source_id,` as the first column — hard-coded literal matching the catalogue `id`
- Canonical column names per naming-conventions.md — rename upstream names as needed
- For sources with prefixed region codes (like `ssb-06913`), strip the prefix here
- For mixed-level sources (kommune + fylke + nasjon rows), add a computed `kommune_nr`: `case when length(region_code) = 4 then region_code end as kommune_nr`
- `loaded_at as updated_at` (never expose `loaded_at` in marts)
Done when: `dbt run --select indicators__<source_id>` succeeds.
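Continuing the hypothetical example; the canonical target names used here are illustrative, take the real ones from naming-conventions.md:

```sql
-- models/indicators/indicators__ssb_99999.sql (hypothetical example)
{{ config(materialized='table', schema='marts') }}

select
    'ssb-99999'::text as source_id,      -- hard-coded catalogue id, first column
    case when length(region) = 4
         then region end as kommune_nr,  -- mixed-level guard: kommune rows only
    aar::int            as year,
    value               as indicator_value,
    loaded_at           as updated_at    -- loaded_at itself never reaches marts
from {{ source('raw', 'ssb_99999') }}
```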
### Step 9 — dbt schema.yml entry

File: `atlas-data/dbt/models/indicators/schema.yml`.

Add a `models[]` entry with:

- Model-level `description`
- Per-column `description` for every column (MUST — see check-osmosis.md)
- `not_null` on every PK column
- `accepted_values` where the set is bounded
- `dbt_utils.accepted_range` on year-like columns
- `dbt_utils.unique_combination_of_columns` on the model-level PK
- `relationships` test on `kommune_nr` → `ref('dim_kommune')` if present
- `relationships` test on `fylke_nr` → `ref('dim_fylke')` if present
- `relationships` test on `orgnr` → `ref('dim_ngo')` if present

Done when: `dbt-osmosis yaml document --dry-run --check` exits 0. See dbt-osmosis.md for the description-propagation flow.
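A matching `models[]` sketch for the hypothetical model (column set and descriptions invented):

```yaml
models:
  - name: indicators__ssb_99999
    description: 'One row per kommune and year for the hypothetical SSB table 99999.'
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns: [kommune_nr, year]
    columns:
      - name: source_id
        description: 'Catalogue id literal; always ssb-99999.'
        tests: [not_null]
      - name: kommune_nr
        description: 'Four-digit kommune code; null on fylke/nasjon rows.'
        tests:
          - relationships:
              to: ref('dim_kommune')
              field: kommune_nr
      - name: year
        description: 'Statistics year from the upstream aar dimension.'
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 2000
              max_value: 2100
```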
### Step 9b — Regenerate api_v1 if your model is under models/marts/api/

If (and only if) your new model is a `mart_<feature>` view living under `models/marts/api/` — i.e. you're adding to the public API surface — re-run the generator:

```bash
cd atlas-data/dbt
./regenerate-api-v1.sh
```
This updates `atlas-data/dbt/api_v1_generated.sql` and `api_v1_state.json`. Inspect the diff: a new `CREATE OR REPLACE VIEW api_v1.<your_view>` block plus per-column `COMMENT ON COLUMN` lines should appear. Also update `atlas-data/dbt/tests/api_v1_rowcount_matches_marts.sql` — add a `union all` line for the new view pair, as sketched below (the test is hand-maintained today; a future iteration may auto-generate it).
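The test file is hand-maintained, so check its actual shape before copying; one plausible pairing block, assuming the test compares row counts per view:

```sql
union all
select
    'mart_new_view' as view_name,  -- hypothetical view added in this PR
    (select count(*) from api_v1.mart_new_view) as api_rows,
    (select count(*) from marts.mart_new_view)  as mart_rows
```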
If your model is in `models/marts/` but NOT under `models/marts/api/` (e.g. a fact or internal mart consumed by other dbt models), skip this step — the model is internal-only.

The drift gate at `./check-api-v1.sh` enforces this; naming-conventions rule #9 codifies the convention. See api-v1.md for the layer's design rationale.
### Step 10 — Verify end-to-end

Run, in order, all of:

```bash
cd atlas-data/ingest
npm run typecheck           # must pass
npm run migrate             # must succeed (idempotent)
npm run ingest:<source-id>  # must succeed
cd ../dbt
uv run --env-file ../ingest/.env dbt run --select indicators__<source_id>   # must succeed
uv run --env-file ../ingest/.env dbt test --select indicators__<source_id>  # must pass 100%
./check-osmosis.sh          # must pass
./check-api-v1.sh           # must pass (only if step 9b applies)
./apply-api-v1.sh           # only if step 9b applies; idempotent
uv run --env-file ../ingest/.env dbt test --select api_v1_descriptions_complete api_v1_rowcount_matches_marts  # only if step 9b applies
```
If any command fails or any test warns, fix before committing. No "we'll clean up later."
### Step 11 — Commit

Commit message format: `Add <source-id> (<one-line summary>)`.
## Workflow: add a new dim_* (conformed dimension)

Similar to the source workflow, with these differences:

- Step 8 puts the model in `dbt/models/dimensions/`, not `indicators/`.
- The dim's source declaration in `sources.yml` lives in the same folder.
- The model's schema.yml entry lives in `dbt/models/dimensions/schema.yml`.
- Every existing indicator that references the dim MUST add a `relationships:` test in the same PR.
- If the dim introduces a new canonical column name, update `docs/stack/naming-conventions.md` in the same PR.
## Workflow: modify an existing source

- Raw column changes: add a new migration file — never edit the existing one.
- Model column changes: edit the model file directly; update `schema.yml` in the same change. If the model is under `models/marts/api/`, also re-run `regenerate-api-v1.sh` and commit the regenerated artefacts (rule #9 in naming-conventions.md).
- Renaming a column: do it in the dbt passthrough; raw stays as-is. For api/ models, the rename surfaces in `api_v1.<view>` after regenerate — that's a breaking change for external consumers, so treat it carefully.
- Retiring a source: add `deprecation_date: YYYY-MM-DD` to the model config (see the sketch below); delete after the date.
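One possible placement for the deprecation marker, assuming it lives in the model's config `meta` (check an already-retired model for the repo's actual convention):

```sql
{{ config(materialized='table', schema='marts', meta={'deprecation_date': '2026-06-30'}) }}
```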
## Workflow: deprecate then remove a mart_<feature> (api_v1 view)
A two-phase process — the api_v1 wrapper has external consumers and can't be dropped abruptly. See api-v1.md § Deprecating then removing for the full flow.
Phase A — Deprecate. Add a `meta:` block to the model's `schema.yml` entry indicating deprecation:

```yaml
- name: mart_old_view
  meta:
    deprecated_until: '2026-12-01'
    deprecated_reason: 'Replaced by mart_new_view; see PLAN-XXX'
```
Wrapper still serves traffic; consumers are notified via release notes. Grace period: at least one consumer-notice cycle.
Phase B — Remove. After the grace period and confirmed no traffic: delete the model from `models/marts/api/`. Run `./regenerate-api-v1.sh` — the generator notices the view is in `api_v1_state.json` but not in the current manifest, and emits `DROP VIEW IF EXISTS api_v1.<name> CASCADE`. Apply via `./apply-api-v1.sh`. Remove the corresponding line from `tests/api_v1_rowcount_matches_marts.sql`.
## PR checklist

Before opening a PR, verify every box. A reviewer (human or LLM) should reject a PR that fails any item.

### Catalogue + upstream

- [ ] Source has a catalogue entry with all required fields
- [ ] `atlas_decision` is set (not `evaluate_later` for production work)
- [ ] `verified_on` is today's date
### Raw layer

- [ ] Migration file added with sequential number
- [ ] Raw table has `primary key` on dimension columns
- [ ] Raw table has `loaded_at timestamptz`
- [ ] Non-obvious columns have `comment on column`
### Ingest

- [ ] `index.ts` uses shared libs, not inline helpers
- [ ] Row type declared inline, not in `lib/types.ts`
- [ ] `SOURCE_ID` constant matches catalogue `id` exactly
- [ ] All five `PxWebAPI` error paths considered (429, 5xx, schema change, rate limit, network)
- [ ] README has all 9 required sections
- [ ] `npm run typecheck` passes with zero errors
### dbt

- [ ] Source declared in `sources.yml` with `freshness` block
- [ ] Per-source model materialized as `table` in `marts` schema
- [ ] First column of the model is `source_id` literal, matching catalogue `id`
- [ ] All marts column names comply with naming-conventions.md
- [ ] No upstream name leaked into marts (checked against the "Never in marts" table)
- [ ] Every column has a `description` in `schema.yml`
- [ ] PK columns have `not_null` tests
- [ ] Model has `dbt_utils.unique_combination_of_columns` on PK
- [ ] `relationships` test added for every `*_nr`, `*_code`, `orgnr` column that references a dim
- [ ] `accepted_values` test on columns with bounded, known values
- [ ] `accepted_range` test on year columns
- [ ] `dbt run` succeeds
- [ ] `dbt test` passes 100% (zero failures, zero warnings)
- [ ] `./check-osmosis.sh` passes (strict ✓, TOTAL = 0)
### Housekeeping

- [ ] npm script added and alphabetically sorted
- [ ] `ingest/src/sources/README.md` updated
- [ ] If new canonical vocabulary needed: `docs/stack/naming-conventions.md` updated
- [ ] Commit message follows the format
## Rules you MUST NOT break

- MUST NOT let a raw column name leak into marts without deliberate renaming.
- MUST NOT create a mart that isn't consumed by at least one feature or planned feature — YAGNI applies to tables.
- MUST NOT skip `relationships` tests. If a FK exists conceptually, it must be declared.
- MUST NOT use abbreviations outside the canonical vocabulary in naming-conventions.md.
- MUST NOT mix multiple entity levels (kommune + fylke + nasjon) in the same mart column without an explicit `level` column.
- MUST NOT commit without running `dbt run` and `dbt test` — the "it's a trivial change" excuse is the start of every 1800-table mess.
- MUST NOT hand-edit a past migration file. Add a new one.
- MUST NOT expose `raw.*` to any consumer. The public contract is `marts.*` only.
- MUST NOT create per-team or per-source variants of `dim_*` — dims are conformed.
- MUST NOT leave `[TBD]` placeholders in catalogue entries merged to `main` — fill them in or remove the row.
## Cross-references

- data-journey.md — end-to-end SSB 08764 walkthrough
- ingest-modules.md — ingest-side template (`index.ts` shape, README structure, scraping convention)
- dbt-osmosis.md — schema.yml description propagation
- check-osmosis.md — the description gate
- setup.md — dev environment
- `docs/stack/naming-conventions.md` — canonical vocabulary
- `atlas-data/ingest/src/sources/README.md` — implemented-sources reference (per-source examples + planned-sources catalogue)