Adding a new data source

This is the end-to-end workflow for adding a new upstream data source to Atlas — from cataloguing the source to making it queryable through the public API. Written for an external contributor who hasn't seen the codebase before; experienced contributors can skim.

If you've never seen Atlas's data pipeline before, read the data journey walkthrough first — it traces one source (SSB 08764) end-to-end so you understand what the pieces are. Then come back here for the procedural steps.


What "adding a source" means in Atlas

Atlas's data side is two layers:

  1. Ingest — TypeScript modules under atlas-data/ingest/src/sources/ that fetch upstream data (SSB, FHI, Brreg, NGO HTML scrapes) and write to the raw.* Postgres schema. Verbatim shape — no renaming or reshaping.
  2. dbt — SQL models under atlas-data/dbt/models/ that transform raw.* into marts.* (the internal data layer; the external public API is PostgREST against api_v1.* wrapper views — see api-v1.md).

Adding a source means both: write the ingest module that lands rows in raw.<source>, and the dbt model that maps raw.<source> to a clean per-source mart in marts.indicators__<source>.
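Concretely, for a hypothetical source ssb-xxxxx, the two deliverables pair up like this:

atlas-data/ingest/src/sources/ssb-xxxxx/index.ts → writes raw.ssb_xxxxx
atlas-data/dbt/models/indicators/indicators__ssb_xxxxx.sql → builds marts.indicators__ssb_xxxxx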

A typical SSB source takes ~30 minutes for the ingest side, ~20 minutes for the dbt side. Scraping sources (HTML, no API) take longer — see ingest-modules.md § scraping convention.


Prerequisites

Before you start:

  1. Read docs/stack/naming-conventions.md — the canonical vocabulary (kommune_nr, fylke_nr, orgnr, etc.). Atlas does not invent variants; rule #5 says every column must have a description in schema.yml (enforced by the check-osmosis gate).
  2. Skim the data journey walkthrough for SSB 08764 if you haven't already.
  3. Read atlas-data/ingest/src/sources/ssb-08764/ as the template — you'll copy and adapt this for SSB-style sources.
  4. Have your dev environment set up — see setup.md.

If you're adding a scraping source (HTML, no API), the workflow below covers the API-source baseline; you also need the scraping-specific additions described in ingest-modules.md § scraping convention.

If you're covering a new NGO with Brreg data (legal-entity metadata), you do not add a new ingest source — the generic refresh:brreg-enheter already handles every NGO. Add a brreg_query block to the NGO's entry in landscape.json, then re-run npm run refresh:brreg-enheter. See ingest/src/lib/brreg/README.md.


The 11-step workflow

Execute these steps in order. Later steps depend on earlier ones — don't skip ahead.

Step 1 — Catalogue entry

File: docs/research/samfunnspuls/data-sources.md (or the broader docs/research/data-sources.md).

Required fields (minimum): id, provider, kind, title_no, what_it_is, use_cases (≥1), questions_answered (≥2), endpoint, auth, atlas_decision, verified_on.

Done when: entry exists and is syntactically valid per the schema file.

Step 2 — Investigate the upstream

Fetch upstream metadata and record in a scratch note (doesn't need committing):

  • Dimensions and sizes
  • Whether any dimensions require explicit selection (in SSB PxWeb metadata: elimination=false)
  • Row count estimate for the default query
  • Update cadence
  • Region code format — bare 4-digit, prefixed (K_0301), or alphanumeric (030101a)?
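For an SSB PxWeb table, most of this can be read straight off the table's metadata endpoint. A minimal probe — assuming the table lives on SSB's PxWeb v0 API and jq is installed:

curl -s https://data.ssb.no/api/v0/no/table/08764 \
  | jq '.variables[] | {code, elimination, size: (.values | length)}'

This lists every dimension, its size, and whether it can be eliminated; any variable with elimination=false must be selected explicitly in your query.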

Done when: a test curl or fetch returns valid data with the filters you plan to use.

Step 3 — Raw landing table migration

File: atlas-data/migrations/NNN_raw_<source_id_with_underscores>.sql (NNN = next free 3-digit number, zero-padded).

Required content:

  • create table if not exists raw.<source_id> — underscores in SQL identifiers, not hyphens
  • Columns match upstream shape — no renaming at this layer
  • Composite primary key on all dimension columns
  • loaded_at timestamptz not null default now()
  • comment on table describing the source in one sentence
  • comment on column for any non-obvious column (prefix codes, suppression markers, etc.)
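As a minimal sketch — a hypothetical SSB source ssb_xxxxx with three dimension columns and one measure (column names illustrative only):

create table if not exists raw.ssb_xxxxx (
    region        text not null,
    contents_code text not null,
    year          text not null,
    value         numeric,
    loaded_at     timestamptz not null default now(),
    primary key (region, contents_code, year)
);

comment on table raw.ssb_xxxxx is 'Verbatim rows from SSB table XXXXX.';
comment on column raw.ssb_xxxxx.region is 'Upstream region code; may carry a K_ prefix.';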

Done when: npm run migrate applies the file without error.

Step 4 — Ingest module

Folder: atlas-data/ingest/src/sources/<source-id>/ — hyphens in folder names.

Two files: index.ts and README.md. The full template is in ingest-modules.md. Key constraints:

  • index.ts uses shared helpers from ../../lib/* (no inline writeNdjson, no inline Postgres client, no hard-coded credentials).
  • Row type declared inline; do not add it to lib/types.ts.
  • SOURCE_ID constant matches the catalogue id exactly.
  • README.md has all 9 required sections (see ingest-modules.md § README structure).
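A skeletal index.ts — the helper names (fetchJson, insertRaw) and lib paths here are hypothetical; copy the real imports from the ssb-08764 template:

// Sketch only — see ingest-modules.md for the canonical template.
import { fetchJson } from '../../lib/http'; // hypothetical shared helper
import { insertRaw } from '../../lib/db';   // hypothetical shared helper

const SOURCE_ID = 'ssb-xxxxx'; // must match the catalogue id exactly

// Row type stays inline — never added to lib/types.ts.
type Row = {
  region: string;
  contents_code: string;
  year: string;
  value: number | null;
};

function reshape(payload: unknown): Row[] {
  // Source-specific parsing goes here — keep the upstream shape verbatim.
  return [];
}

async function main(): Promise<void> {
  const payload = await fetchJson('https://data.ssb.no/api/v0/no/table/XXXXX');
  const rows = reshape(payload);
  await insertRaw('raw.ssb_xxxxx', rows);
}

main();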

Done when: npm run typecheck passes and npm run ingest:<source-id> writes rows to raw.<source_id>.

Step 5 — npm script

File: atlas-data/ingest/package.json.

Add to the scripts block, alphabetically among the other ingest:* entries:

"ingest:<source-id>": "tsx --env-file=.env src/sources/<source-id>/index.ts"

Step 6 — Source-list entry

File: atlas-data/ingest/src/sources/README.md.

Add one row to the "Implemented sources" table, alphabetically. Include: link to per-source folder, provider, one-sentence description, the npm run command, an important quirk (<120 chars).

Step 7 — dbt source declaration

File: atlas-data/dbt/models/indicators/sources.yml (or dimensions/sources.yml for dimensions).

Add a tables[] entry under the raw source with:

  • name
  • description (2–3 sentences)
  • loaded_at_field: loaded_at
  • freshness block (warn_after + error_after)
  • columns: with not_null tests on PK columns, and accepted_values where the value set is bounded
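A sketch of the shape, with hypothetical names and thresholds:

sources:
  - name: raw
    schema: raw
    tables:
      - name: ssb_xxxxx
        description: >
          Verbatim rows from SSB table XXXXX, landed by
          npm run ingest:ssb-xxxxx. One row per region, measure and year.
        loaded_at_field: loaded_at
        freshness:
          warn_after: { count: 35, period: day }
          error_after: { count: 70, period: day }
        columns:
          - name: region
            description: Upstream region code.
            tests:
              - not_null
          - name: year
            description: Statistical year as delivered upstream.
            tests:
              - not_null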

Step 8 — dbt per-source model

File: atlas-data/dbt/models/indicators/indicators__<source_id_with_underscores>.sql.

Required content:

  • {{ config(materialized='table', schema='marts') }} at top
  • 'ssb-XXXXX'::text as source_id as the first column — a hard-coded literal matching the catalogue id
  • Canonical column names per naming-conventions.md — rename upstream names as needed
  • For sources with prefixed region codes (like ssb-06913), strip the prefix here
  • For mixed-level sources (kommune + fylke + nasjon rows), add a computed kommune_nr: case when length(region_code) = 4 then region_code end as kommune_nr
  • loaded_at as updated_at (never expose loaded_at in marts)
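Putting those rules together — a sketch for a hypothetical mixed-level source with K_-prefixed region codes (upstream column names invented for illustration):

{{ config(materialized='table', schema='marts') }}

select
    'ssb-xxxxx'::text as source_id,
    replace(region, 'K_', '') as region_code,
    -- kommune rows carry 4-digit codes; fylke/nasjon rows fall through to null
    case
        when length(replace(region, 'K_', '')) = 4
            then replace(region, 'K_', '')
    end as kommune_nr,
    year::int as year,
    value,
    loaded_at as updated_at
from {{ source('raw', 'ssb_xxxxx') }}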

Done when: dbt run --select indicators__<source_id> succeeds.

Step 9 — dbt schema.yml entry

File: atlas-data/dbt/models/indicators/schema.yml.

Add a models[] entry with:

  • Model-level description
  • Per-column description for every column (MUST — see check-osmosis.md)
  • not_null on every PK column
  • accepted_values where the set is bounded
  • dbt_utils.accepted_range on year-like columns
  • dbt_utils.unique_combination_of_columns on the model-level PK
  • relationships test on kommune_nr → ref('dim_kommune') if present
  • relationships test on fylke_nr → ref('dim_fylke') if present
  • relationships test on orgnr → ref('dim_ngo') if present
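Put together, an entry might look like this (names, values and ranges hypothetical):

models:
  - name: indicators__ssb_xxxxx
    description: One row per kommune and year from SSB table XXXXX.
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - kommune_nr
            - year
    columns:
      - name: source_id
        description: Hard-coded catalogue id ('ssb-xxxxx').
        tests:
          - not_null
      - name: kommune_nr
        description: Four-digit municipality code.
        tests:
          - not_null
          - relationships:
              to: ref('dim_kommune')
              field: kommune_nr
      - name: year
        description: Statistical year.
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 2000
              max_value: 2100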

Done when: dbt-osmosis yaml document --dry-run --check exits 0. See dbt-osmosis.md for the description-propagation flow.

Step 9b — Regenerate api_v1 if your model is under models/marts/api/

If (and only if) your new model is a mart_<feature> view living under models/marts/api/ — i.e. you're adding to the public API surface — re-run the generator:

cd atlas-data/dbt
./regenerate-api-v1.sh

This updates atlas-data/dbt/api_v1_generated.sql and api_v1_state.json. Inspect the diff: a new CREATE OR REPLACE VIEW api_v1.<your_view> block plus per-column COMMENT ON COLUMN lines should appear. Also update atlas-data/dbt/tests/api_v1_rowcount_matches_marts.sql — add a union all line for the new view pair (the test is hand-maintained today; future iteration may auto-generate it).
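The generated block for a hypothetical view is roughly of this shape (the exact SQL is the generator's to emit — inspect the diff rather than writing it by hand):

create or replace view api_v1.mart_your_feature as
select * from marts.mart_your_feature;

comment on column api_v1.mart_your_feature.kommune_nr is
    'Four-digit municipality code.';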

If your model is in models/marts/ but NOT under models/marts/api/ (e.g. a fact or internal mart consumed by other dbt models), skip this step — that view is internal-only.

The drift gate at ./check-api-v1.sh enforces this; naming-conventions rule #9 codifies the convention. See api-v1.md for the layer's design rationale.

Step 10 — Verify end-to-end

Run, in order, all of:

cd atlas-data/ingest
npm run typecheck # must pass
npm run migrate # must succeed (idempotent)
npm run ingest:<source-id> # must succeed
cd ../dbt
uv run --env-file ../ingest/.env dbt run --select indicators__<source_id> # must succeed
uv run --env-file ../ingest/.env dbt test --select indicators__<source_id> # must pass 100%
./check-osmosis.sh # must pass
./check-api-v1.sh # must pass (only if step 9b applies)
./apply-api-v1.sh # only if step 9b applies; idempotent
uv run --env-file ../ingest/.env dbt test --select api_v1_descriptions_complete api_v1_rowcount_matches_marts # only if step 9b applies

If any command fails or any test warns, fix before committing. No "we'll clean up later."

Step 11 — Commit

Commit message format: Add <source-id> (<one-line summary>).


Workflow: add a new dim_* (conformed dimension)

Similar to the source workflow, with these differences:

  • Step 8 puts the model in dbt/models/dimensions/, not indicators/.
  • The dim's source declaration in sources.yml lives in the same folder.
  • The model's schema.yml entry lives in dbt/models/dimensions/schema.yml.
  • Every existing indicator that references the dim MUST add a relationships: test in the same PR.
  • If the dim introduces a new canonical column name, update docs/stack/naming-conventions.md in the same PR.

Workflow: modify an existing source

  • Raw column changes: add a new migration file — never edit the existing one.
  • Model column changes: edit the model file directly; update schema.yml in the same change. If the model is under models/marts/api/, also re-run regenerate-api-v1.sh and commit the regenerated artefacts (rule #9 in naming-conventions.md).
  • Renaming a column: do it in the dbt passthrough; raw stays as-is. For api/ models, the rename surfaces in api_v1.<view> after regenerate — that's a breaking change for external consumers, treat carefully.
  • Retiring a source: add deprecation_date: YYYY-MM-DD to the model config; delete after the date.

Workflow: deprecate then remove a mart_<feature> (api_v1 view)

A two-phase process — the api_v1 wrapper has external consumers and can't be dropped abruptly. See api-v1.md § Deprecating then removing for the full flow.

Phase A — Deprecate. Add a meta: block to the model's schema.yml entry indicating deprecation:

- name: mart_old_view
  meta:
    deprecated_until: '2026-12-01'
    deprecated_reason: 'Replaced by mart_new_view; see PLAN-XXX'

Wrapper still serves traffic; consumers are notified via release notes. Grace period: at least one consumer-notice cycle.

Phase B — Remove. After grace period and confirmed no traffic: delete the model from models/marts/api/. Run ./regenerate-api-v1.sh — the generator notices the view is in api_v1_state.json but not in the current manifest, and emits DROP VIEW IF EXISTS api_v1.<name> CASCADE. Apply via ./apply-api-v1.sh. Remove the corresponding line from tests/api_v1_rowcount_matches_marts.sql.


PR checklist

Before opening a PR, verify every box. A reviewer (human or LLM) should reject a PR that fails any item.

Catalogue + upstream

  • Source has a catalogue entry with all required fields
  • atlas_decision is set (not evaluate_later for production work)
  • verified_on is today's date

Raw layer

  • Migration file added with sequential number
  • Raw table has primary key on dimension columns
  • Raw table has loaded_at timestamptz
  • Non-obvious columns have comment on column

Ingest

  • index.ts uses shared libs, not inline helpers
  • Row type declared inline, not in lib/types.ts
  • SOURCE_ID constant matches catalogue id exactly
  • All five PxWebAPI error paths considered (429, 5xx, schema change, rate limit, network)
  • README has all 9 required sections
  • npm run typecheck passes with zero errors

dbt

  • Source declared in sources.yml with freshness block
  • Per-source model materialized as table in marts schema
  • First column of the model is source_id literal, matching catalogue id
  • All marts column names comply with naming-conventions.md
  • No upstream name leaked into marts (checked against the "Never in marts" table)
  • Every column has a description in schema.yml
  • PK columns have not_null tests
  • Model has dbt_utils.unique_combination_of_columns on PK
  • relationships test added for every *_nr, *_code, orgnr column that references a dim
  • accepted_values test on columns with bounded, known values
  • accepted_range test on year columns
  • dbt run succeeds
  • dbt test passes 100% (zero failures, zero warnings)
  • ./check-osmosis.sh passes (strict ✓, TOTAL = 0)

Housekeeping

  • npm script added and alphabetically sorted
  • ingest/src/sources/README.md updated
  • If new canonical vocabulary needed: docs/stack/naming-conventions.md updated
  • Commit message follows the format

Rules you MUST NOT break

  1. MUST NOT let a raw column name leak into marts without deliberate renaming.
  2. MUST NOT create a mart that isn't consumed by at least one feature or planned feature — YAGNI applies to tables.
  3. MUST NOT skip relationships tests. If a FK exists conceptually, it must be declared.
  4. MUST NOT use abbreviations outside the canonical vocabulary in naming-conventions.md.
  5. MUST NOT mix multiple entity levels (kommune + fylke + nasjon) in the same mart column without an explicit level column.
  6. MUST NOT commit without running dbt run and dbt test — the "it's a trivial change" excuse is the start of every 1800-table mess.
  7. MUST NOT hand-edit a past migration file. Add a new one.
  8. MUST NOT expose raw.* to any consumer. The public contract is marts.* only.
  9. MUST NOT create per-team or per-source variants of dim_* — dims are conformed.
  10. MUST NOT leave [TBD] placeholders in catalogue entries merged to main — fill them in or remove the row.

Cross-references