PLAN: Install Dagster in atlas — polyglot code-location image
IMPLEMENTATION RULES: Before implementing this plan, read and follow:
- WORKFLOW.md - The implementation process
- PLANS.md - Plan structure and best practices
Status: Completed (2026-05-12) — image at ghcr.io/terchris/atlas-data
Goal: Build Atlas's Dagster code-location artefact — a single polyglot Docker image (Python + Node + dbt + dagster + dagster-pipes + the ingest TS + the dbt project) that, when registered as a Dagster code location, lets Dagster orchestrate every Atlas ingest as a Python @asset. End state of this PLAN: image built, pushed to ghcr.io/helpers-no/atlas-data:vX.Y.Z, one source (ssb-08764) Pipes-enabled and verified end-to-end against dagster dev locally.
Last Updated: 2026-05-12
Investigation: INVESTIGATE-deployment-pipeline.md — the Atlas view of the deploy. The Dagster architecture itself (Helm install, K8s topology, resource sizing, two-pod-types model, language-agnostic Pipes pattern) is owned by UIS in urbalurba-infrastructure/.../INVESTIGATE-dagster.md. This PLAN implements the Atlas-side counterpart of UIS PLAN-002 (image build + first-source-Pipes-enabled) without the UIS Helm-values-edit step — that's a separate follow-up PLAN that fires when UIS PLAN-001 (Dagster install) is live.
Sequencing: Atlas's contributor confirmed Atlas can ship this PLAN's work before UIS PLAN-001 lands. Everything in scope here can be validated locally via dagster dev — no live Dagster platform required. The dependency on UIS only fires for the code-location registration (next PLAN).
Prerequisites: Local Atlas Postgres reachable on localhost:35432 (UIS port-forward or kubectl port-forward svc/postgresql 35432:5432) for dagster dev + asset materialisation testing.
Problem Summary
Atlas's atlas-data/ingest/src/sources/ directory has 40+ TypeScript source modules that today are invoked manually via npm run ingest:<source>. No scheduling, no observability, no lineage, no freshness policies, no run history. dbt sits downstream as a separate manual command. The whole pipeline is one operator's terminal.
Per the deployment-pipeline INVESTIGATE and the UIS-side INVESTIGATE-dagster.md, the v1 orchestrator is Dagster — UIS deploys the platform (webserver, daemon, metadata DB), Atlas contributes a code-location image that Dagster talks to over gRPC. The image runs as two pod types: a long-lived "describe-only" code-location pod and ephemeral "execute" run pods (spawned per materialisation).
The Atlas-side work splits naturally into two units:
- Build the image; prove the pattern with one source. This PLAN.
- Roll out the remaining 40+ sources, wire `dagster-dbt`, add schedules, and register as a real code location. A follow-up PLAN after UIS PLAN-001 ships.
We split at "image built + one source Pipes-enabled" because that's the natural validation point — once one source works end-to-end via `dagster dev`, the remaining sources are mechanical (~5 lines of Pipes integration each).
Phase 1: Python scaffolding under atlas-data/dagster/
Get a minimal Python package that boots dagster dev and shows an empty workspace. No assets yet; we just want the module to load cleanly.
Tasks
- 1.1 Create `atlas-data/dagster/` with this layout:

```
atlas-data/dagster/
├── pyproject.toml        # Python package metadata + deps
├── atlas_data/
│   ├── __init__.py
│   ├── definitions.py    # The Dagster entry point — `defs = Definitions(...)`
│   └── assets/
│       └── __init__.py
└── README.md             # What this directory is, how to run dagster dev
```
- 1.2 `pyproject.toml` pins the Python-side deps. The UIS doc names `dagster`, `dagster-dbt`, `dagster-k8s`, `dagster-pipes`; for Phase 1 we only need `dagster` itself (the others come in Phase 2 + future PLANs). Pin to `dagster~=1.13` (Apr 2026 stable). Python `>=3.11,<3.13` (matches Atlas's dbt env). Manage with `uv` (Atlas's existing pattern — see `atlas-data/dbt/requirements.txt`).
- 1.3 `definitions.py` is just `from dagster import Definitions; defs = Definitions(assets=[])`. Empty workspace. Critical discipline (per the UIS doc): no DB connections at module scope, no eager I/O, no heavy imports. Every Dagster run pod cold-starts by importing this module; expensive imports → slow runs. A minimal sketch follows this task list.
- 1.4 Run `uv run dagster dev` from `atlas-data/dagster/`. Verify:
  - The Dagster webserver boots on `localhost:3000` (or another port; pass `--port` to override, since Docusaurus dev already uses 3000).
  - The workspace shows the `atlas_data` code location with zero assets.
  - The UI renders cleanly.
- 1.5 Write `atlas-data/dagster/README.md` covering: what this folder is, how to install (`uv pip install -e .`), how to run (`uv run dagster dev`), and the cheap-import discipline.
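A minimal sketch of the Phase 1 entry point, assuming nothing beyond what tasks 1.2–1.3 describe (package name `atlas_data`, empty asset list, cheap module scope):

```python
# atlas-data/dagster/atlas_data/definitions.py (Phase 1 sketch)
# Keep module scope cheap: every Dagster run pod cold-starts by importing this
# module, so no DB connections, no file I/O, no heavy imports up here.
from dagster import Definitions

# Empty workspace for Phase 1; the first asset arrives in Phase 2.
defs = Definitions(assets=[])
```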
Validation
`cd atlas-data/dagster && uv run dagster dev` boots with no errors and shows an empty workspace in the UI. `dagster definitions validate` (or whatever the equivalent is in 1.13) passes.
Phase 2: Wire ssb-08764 via Dagster Pipes
Add the TypeScript Pipes SDK to ingest, wire one source's index.ts, declare the matching Python @asset, verify end-to-end materialisation.
Tasks
- 2.1 Add `@dagster-io/dagster-pipes` to `atlas-data/ingest/package.json` dependencies. Pin to a specific version (currently 0.1.0; the UIS doc flags that this package is on a slower release cadence than the Python SDK — explicit pin, deliberate bumps). Then run `npm install`.
- 2.2 Modify `atlas-data/ingest/src/sources/ssb-08764/index.ts`:
  - Add `import * as dagster_pipes from '@dagster-io/dagster-pipes'` at the top.
  - At the start of `run()`, call `using context = dagster_pipes.openDagsterPipes()` — this is a no-op when the Dagster env vars are absent, so local `npm run ingest:ssb-08764` still works exactly as today.
  - After the ingest completes successfully, call `context.reportAssetMaterialization({ row_count, source_updated, fetch_duration_ms })` with whatever metadata the source already tracks.
  - Verify local `npm run ingest:ssb-08764` still works (no regressions).
- 2.3 Add `atlas-data/dagster/atlas_data/assets/raw_ssb.py` (note: `PipesSubprocessClient` lives in the `dagster` package, not `dagster_pipes`):

```python
import os

from dagster import AssetExecutionContext, PipesSubprocessClient, asset


@asset(key=["raw", "ssb_08764"], group_name="raw_ssb")
def raw_ssb_08764(
    context: AssetExecutionContext,
    pipes_subprocess_client: PipesSubprocessClient,
):
    return pipes_subprocess_client.run(
        command=["npm", "run", "ingest:ssb-08764"],
        context=context,
        # Path inside the polyglot image; for `dagster dev` locally, parameterise to atlas-data/ingest.
        cwd="/app/ingest",
        env={"DATABASE_URL": os.environ["ATLAS_DATABASE_URL"]},
    ).get_materialize_result()
```

  - The `cwd` is hardcoded to the in-image path; locally it needs to be `<repo>/atlas-data/ingest`. Pass it via an env var or compute it from `__file__`.
  - Wire `PipesSubprocessClient` as a resource in `definitions.py` (see the sketch after this task list).
- 2.4 Verify end-to-end via `dagster dev`:
  - Start `dagster dev` from `atlas-data/dagster/`.
  - In the UI, navigate to the `raw.ssb_08764` asset and click "Materialize."
  - Dagster spawns the subprocess, which runs `npm run ingest:ssb-08764`, which hits PostgREST/Postgres, upserts to `raw.ssb_08764`, and calls `reportAssetMaterialization`.
  - Confirm: a row appears in Dagster's event log for the asset, row count and metadata are visible, and a downstream `dbt run` would now see fresh data.
- 2.5 Confirm local `npm run ingest:ssb-08764` still works (Pipes is a no-op when the env vars are absent). Test in both directions (Dagster path + manual path).
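A sketch of the `definitions.py` wiring that 2.3 calls for, assuming the asset module above and a hypothetical `ATLAS_INGEST_DIR` env var for the local-vs-image `cwd` question; the asset would read this value instead of the hardcoded `/app/ingest`:

```python
# atlas-data/dagster/atlas_data/definitions.py (Phase 2 sketch, assumptions as noted above)
import os
from pathlib import Path

from dagster import Definitions, PipesSubprocessClient

from atlas_data.assets import raw_ssb

# Resolve the ingest directory once, cheaply: the hypothetical ATLAS_INGEST_DIR
# env var wins (set to /app/ingest in the image); otherwise fall back to the
# repo-relative path computed from this file's location for `dagster dev`.
INGEST_DIR = os.getenv(
    "ATLAS_INGEST_DIR",
    str(Path(__file__).resolve().parents[2] / "ingest"),
)

defs = Definitions(
    assets=[raw_ssb.raw_ssb_08764],
    resources={"pipes_subprocess_client": PipesSubprocessClient()},
)
```

The resource key `pipes_subprocess_client` has to match the parameter name in the asset's signature, which is how Dagster injects the client at run time.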
Validation
A dagster dev instance shows one asset (raw.ssb_08764). Clicking Materialize runs the existing TypeScript ingest as a subprocess, the database gets new rows, and the run-event-log records the materialisation with metadata. The same npm run ingest:ssb-08764 command works unchanged outside Dagster.
Phase 3: Polyglot Dockerfile + GHA image push
Wrap everything into the polyglot image and ship it to GHCR. Once UIS PLAN-001 is up, that image is what gets registered as a Dagster code location.
Tasks
- 3.1 Create `atlas-data/deploy/Dockerfile`, multi-stage:
  - Stage 1 — node-deps: `node:20-slim`, copy `atlas-data/ingest/`, run `npm ci` (installs `@dagster-io/dagster-pipes` + existing deps + `tsx`).
  - Stage 2 — python-deps: `python:3.11-slim`, install `uv`, `uv pip install` dagster + dagster-dbt + dagster-k8s + dbt-postgres + the Pipes Python SDK. (We add `dagster-dbt` and `dagster-k8s` here even though Phase 2 doesn't use them — they're cheap to install and the next PLAN needs them. Avoids two image rebuilds.)
  - Stage 3 — runtime: combine. Copy node + npm modules + ingest source from stage 1, python + site-packages from stage 2, the `atlas-data/dagster/` package, and the `atlas-data/dbt/` project.
  - Entrypoint: `dagster api grpc --module-name atlas_data.definitions --port 4000` (matches the UIS doc's expected gRPC server pattern).
  - Total expected image size: ~1.5–2 GiB (UIS doc estimate). Worth optimising later but not blocking.
- 3.2 Create `atlas-data/.dockerignore`: exclude `node_modules/` (we re-install in-image), `dbt/target/`, `dbt/dbt_packages/`, `output/`, `.env`, and anything else build-incidental.
- 3.3 Build locally: `docker build -t atlas-data:dev atlas-data/`. Verify:
  - The build succeeds.
  - `docker run --rm -e ATLAS_DATABASE_URL=... atlas-data:dev` boots the gRPC server (listens on `:4000`, log says "Started Dagster code server").
- 3.4 New GHA workflow `.github/workflows/atlas-data-image.yml`:
  - Triggers: `pull_request` + `push` to `main`, path-filtered to `atlas-data/**` + the workflow file.
  - PR: build the image (cache via `docker/build-push-action@v5` + `cache-from`/`cache-to: type=gha`). No push.
  - Main push: build + push to `ghcr.io/helpers-no/atlas-data:sha-<short-commit>` + `:vYYYYMMDD-<sha-short>` (date-prefixed for human readability; never `:latest` per the UIS doc — Helm needs unique tags to roll). Permissions: `contents: read, packages: write`.
- 3.5 Verify a push to `main` produces an image in GHCR. `docker pull ghcr.io/helpers-no/atlas-data:<tag>` works from a fresh shell.
Validation
ghcr.io/helpers-no/atlas-data:<tag> exists and is pullable. docker run -e ATLAS_DATABASE_URL=... ghcr.io/helpers-no/atlas-data:<tag> boots and stays up. The image carries everything UIS needs to register it as a Dagster code location.
Acceptance Criteria
- `atlas-data/dagster/` exists as a working Python package; `uv run dagster dev` boots cleanly.
- `definitions.py` imports cheaply (no DB connections, no eager I/O at module scope).
- One source (`ssb-08764`) materialises end-to-end via Dagster: TypeScript subprocess → Postgres write → asset materialisation event.
- Local `npm run ingest:ssb-08764` still works exactly as before (Pipes no-ops when env vars absent).
- `atlas-data/deploy/Dockerfile` builds a polyglot image that boots `dagster api grpc`.
- GHA workflow publishes the image to GHCR on main commits with unique tags.
Implementation Notes
Why we ship one source not 40+
Per the UIS doc's phasing (UIS PLAN-002 → Atlas registers one source first, UIS PLAN-003 → Atlas rolls out the rest), the validation cycle is much tighter at one source. Once one source works:
- The image structure is proven.
- The TypeScript Pipes integration pattern (5 lines per source) is proven.
- `definitions.py` import-time cost is measurable.
- The "no-op when Dagster env vars absent" property is verified, so local dev keeps working.
The remaining ~40 sources, plus dagster-dbt for the dbt half, plus schedules and freshness policies, become a separate PLAN that ships after this one. That PLAN is mostly mechanical — same Pipes pattern repeated per source — but is appropriately gated by "the pattern works at all."
Why definitions.py discipline matters
Per the UIS doc: every Dagster run pod cold-starts by importing atlas_data.definitions. Heavy module-scope code → slow runs.
Rules to bake in from day one:
- No DB connections at module scope (open them inside `@asset` function bodies).
- No expensive file I/O at module scope (no eager catalog loads, no parsing of big JSON files at import time).
- No environment-variable lookups that fail loudly (use `os.getenv()` with defaults, not `os.environ[...]`).
- `dagster-dbt`'s manifest parsing is acceptable — it's the one expensive thing the architecture needs.
Worth codifying as a Phase 4 / follow-up: a CI check that times `python -c 'import atlas_data.definitions'` and fails if it takes >2s.
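A sketch of what that check could look like, assuming a plain script invoked from CI (the script name and location are placeholders; the 2s budget and the module name come from this section):

```python
# scripts/check_import_time.py (hypothetical path): fail CI when importing the
# Dagster entry point exceeds the 2-second budget.
import importlib
import sys
import time

BUDGET_SECONDS = 2.0

start = time.perf_counter()
importlib.import_module("atlas_data.definitions")
elapsed = time.perf_counter() - start

print(f"import atlas_data.definitions: {elapsed:.2f}s (budget {BUDGET_SECONDS:.0f}s)")
sys.exit(1 if elapsed > BUDGET_SECONDS else 0)
```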
Sequencing — Atlas-first or UIS-first?
Atlas's contributor explicitly authorised "Atlas can ship the dagster SW first" (see talk thread). This PLAN ships entirely without UIS PLAN-001 being live:
- Phase 1 + 2 validate against `dagster dev` (a local Dagster instance on the developer's machine, no UIS at all).
- Phase 3 builds + pushes an image to GHCR. The image just sits there until UIS PLAN-001 is up.
The register-as-code-location step (UIS Helm values edit + helm upgrade) is a separate PLAN that fires whenever UIS PLAN-001 lands. If UIS PLAN-001 has already shipped by the time this PLAN finishes, that follow-up could be bundled in — but the work cleanly separates.
Image-size trade-offs
Polyglot Python + Node + dbt + dagster + uv venv → ~1.5–2 GiB on disk. Affects:
- Registry storage (modest).
- Image-pull time on first deploy to a fresh K8s node (a few minutes; subsequent pods reuse the cached image).
- Run-pod cold-start (file system cache is hot once first pod runs; subsequent pods boot in 5–15s).
Not blocking but worth keeping an eye on. Future optimisations: switch from python:3.11-slim to a distroless base, use uv pip install --no-compile, prune node_modules of dev-only deps. Defer until size becomes a real problem.
What this PLAN doesn't do (deferred)
- Register the image as a Dagster code location in UIS Helm values. A follow-up Atlas-side PLAN that depends on UIS PLAN-001 being live. Two-file change to UIS values + `helm upgrade`. Could be a single co-ordinated PR with UIS at the time.
- Roll out the remaining 40+ sources via Pipes. Same ~5-line pattern as `ssb-08764`. One PR per few sources (or bulk if there's appetite). Not gated on UIS; happens entirely in Atlas.
- Wire `dagster-dbt`. Auto-loads dbt models as Dagster assets from `manifest.json`. Adds the marts side of the asset graph. Probably its own PLAN to keep this one small.
- Schedules (`@schedule` declarations per source).
- Freshness policies + alerts.
- Run-pod resource overrides per asset (for heavy ingest like `redcross-branches`).
- `max_concurrent_runs` config in `dagster.yaml`.
The UIS-side PLAN-003 doc enumerates several of these; from Atlas's side, each is a small follow-up after the first source proves the pattern.
Files to Modify / Create
Create:
- `atlas-data/dagster/pyproject.toml`
- `atlas-data/dagster/atlas_data/__init__.py`
- `atlas-data/dagster/atlas_data/definitions.py`
- `atlas-data/dagster/atlas_data/assets/__init__.py`
- `atlas-data/dagster/atlas_data/assets/raw_ssb.py`
- `atlas-data/dagster/README.md`
- `atlas-data/deploy/Dockerfile`
- `atlas-data/.dockerignore`
- `.github/workflows/atlas-data-image.yml`
Modify:
- `atlas-data/ingest/package.json` — add `@dagster-io/dagster-pipes` dep (pinned).
- `atlas-data/ingest/package-lock.json` — regenerated.
- `atlas-data/ingest/src/sources/ssb-08764/index.ts` — add Pipes calls.
- `atlas-data/README.md` — point at the new `dagster/` subdir; describe the polyglot image.
Do not touch:
- `urbalurba-infrastructure/**` — Atlas-side only. Code-location registration is a follow-up coordinated with UIS.
- Other ingest source modules — only `ssb-08764` gets Pipes in this PLAN.
- `atlas-data/dbt/` — unchanged; `dagster-dbt` wiring is a follow-up.