# Investigate: Cloud agents that onboard new ingest sources autonomously
IMPLEMENTATION RULES: Before implementing this plan, read and follow:
- WORKFLOW.md - The implementation process
- PLANS.md - Plan structure and best practices
Status: Backlog
Goal: Set up an asynchronous cloud-agent pipeline that picks one candidate from INVESTIGATE-new-norwegian-public-sources.md, runs the source-onboarding workflow documented in website/docs/contributors/ingest-modules.md, and opens a PR — all without a human at the keyboard. Single-agent by design: only one Cursor Background Agent runs at a time, eliminating concurrent-pick coordination from scope.
Last Updated: 2026-05-04 (rev 2: collapsed to single-agent / Cursor-only design — drops two-worker race-handling, migration-number-conflict logic, and parallel-PR INVESTIGATE-drift handling per user direction)
Origin: Atlas's catalogue grew from 21 to 38 sources in four days through tight human-driven loops (the user pastes a portal URL → Claude in the keyboard onboards it). INVESTIGATE-new-norwegian-public-sources.md currently lists 26 more candidates ready to ingest. The user has a Cursor ($25/mo) subscription with Background Agents that runs in cloud sandboxes — under-utilised on the user's machine, well-suited to async source onboarding. (The Claude Max $100/mo subscription stays at the keyboard for review + ingest work.) The user asked: "can we get an agent running in the cloud to read the INVESTIGATE-new-norwegian-public-sources.md and pick one from the list. create a feature branch. Then create the folder for it and write the code. Then write a PR for it." This investigation scopes that pipeline. Concurrency constraint (per user, 2026-05-04 rev 2): only one Cursor agent runs at a time — eliminates the entire double-pick / migration-collision / parallel-INVESTIGATE-drift problem class. Implementation is a follow-up PLAN-*.
What "agent in the cloud" needs to do
Per-candidate, the agent's contract is:
- Claim one entry from the candidate queue (trivially race-safe with a single agent — no double-claim possible).
- Branch from `main` as `feat/onboard-<source-id>`.
- Onboard the source per `contributors/ingest-modules.md` steps 1–7: folder, `index.ts`, prose-only README, `manifest.yml` (with a hand-authored `dimensions:` block), npm script, regenerate seed, refresh the reports INVESTIGATE.
- Pass quality gates locally inside the cloud sandbox: typecheck, vitest, `fill-manifest-todos` idempotency, `build_sources_seed.py` validation, dbt parse + osmosis check.
- Open a PR with `Closes #<issue>` linkage and a tight description of what the agent did and what it couldn't do (any TODOs left for the human).
- Stop — the agent does not run live ingest, does not apply migrations, does not merge.
Atlas's local Claude (i.e. me, paired with terje at the keyboard) takes over from there: review the PR, run `npm run migrate` + `npm run ingest:<source-id>` against the live DB, verify row counts + `upstream_updated_at`, merge. That split keeps DB-write authority with the human, and keeps the agent's failure modes bounded to "open a PR that doesn't pass review."
## Questions to Answer
- Issue queue: GitHub Issues is the natural primitive — one issue per candidate, agent picks the first unassigned one. With single-agent concurrency, no atomicity machinery is needed beyond `gh issue edit --add-assignee @me`. Confirm this is the right shape vs alternatives (Project board, in-repo TODO file).
- DB access: does the agent get a database connection to run live ingest, or does the human do that post-merge?
- Stuck-agent escalation: agent encounters auth-walled API / weird upstream / typecheck error it can't fix — how does it surface that without burning hours of compute?
- Cost: how much of the $25/mo Cursor budget does each candidate consume, and is that economical?
- Trigger model: agent runs on a schedule (cron-like), on-demand (user fires off a session), or on issue-creation (webhook)?
## Current state
What we have today that this builds on:
- 38 ingested sources following a uniform pattern (`atlas-data/ingest/src/sources/<id>/{index.ts, README.md, manifest.yml}`) + a paired `atlas-data/migrations/NNN_raw_<id>.sql` each.
- Bootstrap toolchain that's already mostly self-driving:
  - `npm run sources:bootstrap-manifest -- <id>` — skeleton manifest
  - `npm run sources:fill-manifest-todos` — auto-fills description, attribution, eu_theme, tags from the README
  - `cd atlas-data/dbt && uv run python scripts/build_sources_seed.py --readme` — regenerates seeds + the auto-table
- Validation gates that already run dry (no DB):
  - `npm run typecheck` (TypeScript)
  - `npm test` (vitest, 49 tests)
  - `build_sources_seed.py` validates required fields + EU-theme allowlist + dimensions shape
  - `uv run dbt parse` (dbt project well-formed)
- A documented contributor workflow — `contributors/ingest-modules.md` step-by-step is the agent's runbook.
- The candidate queue — `INVESTIGATE-new-norwegian-public-sources.md` holds 26 entries, each with URL, format, auth, licence, geo, cadence, provider, eu_theme, plus per-candidate quirks (`[Q*]` open questions).
What's missing:
- A claim mechanism preventing two agents from picking the same candidate.
- An agent runbook that says: read this file, do these steps, open a PR with this template.
- Wiring of one or both cloud-agent platforms (Cursor / Claude) to the queue.
- A post-merge runbook for the human (apply migration, run ingest, verify) — already implicit in the workflow but not collected as a checklist.
## Proposed architecture
┌─────────────────────────────────────────┐
│ GitHub: atlas repo │
│ │
│ Issues (label: new-source) │
│ ├── #N: Onboard bufdir-barnefattigdom │ ◄── work queue
│ ├── #N+1: Onboard nav-uforetrygd │
│ ├── #N+2: Onboard ssb-10826 │
│ └── ... (26 issues) │
│ │
│ PRs (one per onboarding) │
└────────┬──────────────┬─────────────────┘
│ poll │ open PR
┌─────────┴──────────────┴────────┐
│ Cursor BG Agent (sandbox VM) │ ◄── single worker
│ │
│ - pick first unassigned issue │
│ - assign self │
│ - branch + write code │
│ - run quality gates │
│ - open PR │
│ - stop │
└─────────────────────────────────┘
▲
│ runbook
┌─────────┴──────────────────────────┐
│ AGENT-onboard-source.md (+ │ ◄── single source-of-truth
│ .cursor/rules/onboard-source.mdc) │
└────────────────────────────────────┘
▲
│ review + merge + run ingest
┌─────────┴────────────────────────────┐
│ Human + local Claude (this session) │ ◄── reviewer
└───────────────────────────────────────┘
Three pieces in scope: the queue, the worker, the runbook. Single-agent concurrency keeps every piece simple.
## The queue: GitHub Issues with the `new-source` label
GitHub Issues is the right primitive: free, integrates with PRs (`Closes #N`), and visible to both human and agent through the same `gh` CLI. With one agent running at a time, the assignment is just a "this one is in flight, don't redo it on the next run" marker — no race protection needed.
Pick protocol:
1. `gh issue list --state open --label new-source --search "no:assignee" --json number,title --limit 1`
2. If empty → nothing to do; exit cleanly.
3. Otherwise take that issue's number as `ISSUE_NUM`.
4. `gh issue edit ISSUE_NUM --add-assignee @me`
5. Proceed with onboarding.
(If a previous agent run was interrupted mid-flight without opening a PR, the issue stays assigned to the agent's handle but with no PR link. The human un-assigns it on next review pass and the agent re-picks it on next run. Self-correcting on a single-day cadence.)
One-time bootstrap: convert the 26 candidates in INVESTIGATE-new-norwegian-public-sources.md into 26 GitHub Issues. Suggested issue body template:
**Source**: <source-id> (e.g. nav-uforetrygd)
**Provider**: <provider-tag>
**Tier**: 1 / 2 / 3
**Reports unlocked / extended**: #5, #8 (per INVESTIGATE-reports-and-indicators)
**URL**: <upstream URL>
**Format / Auth / Licence**: <as captured in INVESTIGATE-new-norwegian-public-sources>
**Cadence**: <P1Y / P3M / P1M / etc.>
**Per-source quirks**: <any [Q*] notes from the backlog>
---
Open this issue with `gh issue create --title "Onboard <source-id>" --label new-source --body-file body.md`.
The agent runbook lives at [path]. The agent will assign itself, branch,
write code, run gates, open a PR with `Closes #<this>`, and stop.
A small one-off Python script can generate all 26 from the markdown source.
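A hedged sketch of that script's core, assuming the candidate entries have already been parsed into dicts — the field names and the trimmed template are illustrative, not the final issue body:

```python
# Hypothetical bootstrap helper: render one parsed backlog entry into an
# issue body and the gh argv that would create the queue issue.

ISSUE_TEMPLATE = """\
**Source**: {source_id}
**Provider**: {provider}
**Tier**: {tier}
**URL**: {url}
**Format / Auth / Licence**: {format} / {auth} / {licence}
**Cadence**: {cadence}
**Per-source quirks**: {quirks}
"""


def render_issue_body(candidate: dict) -> str:
    """Fill the issue-body template from one candidate dict."""
    return ISSUE_TEMPLATE.format(**candidate)


def gh_create_args(candidate: dict) -> list[str]:
    """argv for creating one queue issue with the new-source label."""
    return [
        "gh", "issue", "create",
        "--title", f"Onboard {candidate['source_id']}",
        "--label", "new-source",
        "--body", render_issue_body(candidate),
    ]
```

Looping this over all 26 parsed entries (and passing each argv to `subprocess.run`) is the whole bootstrap.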
## The worker: Cursor Background Agent (single instance)
Cursor BG Agents fit the workflow well: sandbox VM, full repo clone, GitHub-integrated PR flow, triggered on demand or on-schedule. Single instance — at any moment either zero or one agent is running. The Cursor $25/mo subscription is already paid and underused on the user's machine, so this is essentially free capacity.
Why not Claude Code Cloud as a second worker (the initial design proposed both; collapsed in rev 2 per user direction): a second worker brings concurrent-pick coordination back into scope (sleep + re-check protocol, migration-number races, parallel INVESTIGATE-doc drift). The single-agent constraint deletes all of that. If throughput later becomes a bottleneck, the Claude Code Cloud worker can join by re-introducing the race-protected claim protocol from rev 1 of this doc — though at the projected 4–8 candidates per day, the 26-candidate backlog clears in 3–6 calendar days, which doesn't seem to need parallelism.
## The runbook: one markdown file the agent reads
Lives at `website/docs/ai-developer/AGENT-onboard-source.md` (loaded into the agent's context at session start) and at `.cursor/rules/onboard-source.mdc` for Cursor-specific config. Both reference each other so the source of truth doesn't fork. Contents:
- Claim protocol (above).
- Workflow steps: pointer to `contributors/ingest-modules.md` + clarifications:
  - Migration number = `MAX(existing migration numbers) + 1` at branch time.
  - Source ID prefix matches the provider's tag (`bufdir-…`, `nav-…`).
  - Author the `dimensions:` block by reading the upstream's `/query` endpoint and walking each dimension.
- Quality gates the agent must pass:
  - `npm run typecheck` clean
  - `npm test` passes
  - `npm run sources:fill-manifest-todos` is a no-op on second run (idempotency)
  - `cd atlas-data/dbt && uv run python scripts/build_sources_seed.py --readme` validates and emits cleanly
  - `uv run dbt parse` (no DB needed) succeeds
  - All TODO markers are resolved in the manifest
- PR template — title `Add <source-id> — <one-line>`; body sections: What landed, Cell budget, Filters chosen + why, Live-test commands the human runs (migrate + ingest + dbt seed), Known TODOs left.
- Escalation: if any of {auth-walled API, opaque upstream slug, dimension fingerprint mismatches multiple tables, typecheck error after 3 attempts}: open the PR as draft, label it `needs-human`, comment on the issue with the blocking question, and stop. The human triages.
- What NOT to touch: never write to the database, never run `npm run migrate`, never run `npm run ingest:*`, never `git push --force`, never close issues directly (the merging human's `Closes #N` does that).
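The migration-number rule is mechanical enough to compute rather than eyeball. A sketch (hypothetical helper, assuming migration filenames shaped like `NNN_raw_<id>.sql` as described above):

```python
import re


def next_migration_number(migration_filenames: list[str]) -> int:
    """MAX(existing migration numbers) + 1, per the runbook rule.

    Filenames that don't start with a numeric prefix are ignored.
    """
    numbers = [
        int(m.group(1))
        for name in migration_filenames
        if (m := re.match(r"(\d+)_", name))
    ]
    return max(numbers, default=0) + 1
```

Computed at branch time; with sequential single-agent runs this never collides.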
## Coordination — what's not a problem (and what is)
The single-agent design eliminates every concurrent-PR coordination problem the rev-1 design had to handle: no double-pick (only one agent picks at a time), no migration-number conflicts (sequential), no parallel INVESTIGATE-doc drift (one PR per cycle merges before the next agent run starts). Documented here for completeness in case throughput pressure ever justifies a second worker.
What does need handling even with one agent:
- Stale claim from interrupted run: the agent crashes mid-flight after assigning itself but before opening a PR. Resolution: human-driven. On next review pass, an issue assigned to the agent for >24h with no linked PR gets un-assigned by the human; the agent re-picks it next run. No machinery; just hygiene.
- Bad PR: the agent opens a PR that fails review. Resolution: PR comments + close-don't-merge; the agent re-reads the runbook + the PR feedback in its next session. Standard PR review loop.
- Stuck mid-run (auth wall, opaque upstream slug, typecheck error after retries): the agent opens a draft PR, labels it `needs-human`, comments on the issue with the blocker, and stops. The human triages.
## DB access — the security boundary
Recommendation: agents do NOT have a database connection. They operate purely on the typecheck + dbt-parse + seed-validation gates, all of which run dry.
This keeps the failure mode of any cloud agent bounded: at worst, a bad PR. The human running `npm run ingest:<source-id>` after merge is the gate that touches `raw.*`. That's a 30-second extra step per merged source — completely worth it for the safety property.
A future optimisation could give agents a read-only role on a sandbox database to verify their dbt model parses without running it. Not on the critical path.
## Cost & operational considerations
Per-candidate token estimate (from the keyboard work I've been doing): ~150k tokens for a typical FHI Ungdata-shape source, ~250k for a structurally novel source (KPR-1aar, Selvmord). Across 26 candidates: ~5M tokens. Cursor's BG-agent allocation in the $25 plan accommodates this comfortably.
Per-run wall-clock: ~10–15 minutes per candidate end-to-end.
Throughput at one agent: 4–8 candidates per day if the human reviews promptly. Full backlog (26 candidates) clears in ~3–6 calendar days of low human attention. Throughput is bounded by human review pace, not agent runtime — which is the right bottleneck for an unsupervised pipeline.
Failure cost: a stuck agent burns ~30 minutes of compute before it gives up and labels needs-human. Bounded; not expensive.
## Open questions for decision
- DB access for the agent: dry-only forever, or eventually a sandbox? Recommendation: dry-only for v1. Reconsider if dbt-model-parse coverage isn't catching schema mistakes that ingest would catch immediately.
- Issue body content: just the candidate's URL + tier, or the full embedded INVESTIGATE-new-norwegian-public-sources entry? Recommendation: full embedded (the `[Q*]` questions matter for the agent's choices); the issue is self-contained so the agent doesn't need to re-read 492 lines of INVESTIGATE.
- `needs-human` escalation: how does the agent communicate? PR comment, issue comment, or both? Recommendation: both; PR draft + issue comment with `Stuck: <reason>`. Keeps the issue queue clean.
- Trigger model: how does the agent start a run? Cron-like schedule (e.g. once a day), on-demand (the user fires off a Cursor session), or webhook on issue-creation? Recommendation: on-demand for v1 — the user kicks off the agent when they want progress; review pace is the throughput bottleneck anyway. Schedule it later if review starts catching up.
- Auto-merge: never. The PR review remains the human's quality gate. (Assert explicitly in the runbook; don't let the agent enable GitHub auto-merge.)
## Sequencing recommendation

### Phase 0 — pilot infrastructure (1–2 hours, blocking)
- Bootstrap script: convert the 26 candidates in `INVESTIGATE-new-norwegian-public-sources.md` → 26 GitHub Issues with label `new-source`.
- Author the runbook at `website/docs/ai-developer/AGENT-onboard-source.md` + `.cursor/rules/onboard-source.mdc`.
- Configure the Cursor Background Agent against the runbook.
### Phase 1 — pilot one end-to-end (1 candidate, agent-to-merge)
- Pick a low-risk candidate (suggestion: `ssb-10826` bydel population — well-understood SSB shape, no auth, kommune-resolved). The agent claims it on its first run.
- Watch the run. Read the PR. Address whatever the agent stumbled on (likely: dimension semantics, attribution string, eu_theme guess).
- Update the runbook based on the lessons.
- Merge. Run migrate + ingest + dbt test locally. Verify catalogue grew 38→39, dim count grew, the maintenance ritual fired correctly.
### Phase 2 — drain the backlog (3–6 calendar days)
- Trigger the agent on demand (or on a daily schedule) — it picks the next unassigned issue, opens a PR, stops.
- Human reviews + merges + runs live ingest after each merge.
- Triage `needs-human` PRs as they appear — these are the candidates with actual quirks (auth, weird upstream shape) that need keyboard-Claude pairing.
- Capture per-candidate lessons in the runbook for future similar shapes.
### Phase 3 — only if throughput pressure justifies it (defer)
- If 4–8 candidates/day proves too slow once review catches up, re-introduce the rev-1 race-protected claim protocol and add Claude Code Cloud as a second worker. The 26-candidate backlog at one-agent throughput is unlikely to need this.
## What this investigation does NOT cover
- The PLAN that actually wires this up — a separate `PLAN-008-cloud-agent-onboarding.md` once decisions on the open questions land.
- CI/CD for the agent's PRs — Atlas already runs typecheck + tests + dbt parse on PRs (or should). Confirming that's wired is a follow-up.
- Multi-repo agents — UIS coordination etc. is out of scope; this investigation is single-repo (atlas).
- Agents picking from non-onboarding queues — a generic Atlas agent that does anything labelled `agent-ok` is overscoped; one queue, one task type.
- Deep ingestion of structurally novel sources — sources that need new ingest-lib code (a new HTML scraper, a new auth-flow client) are outside what an agent can do reliably; flag those as `needs-human` from the start. Only Tier-1-style FHI/SSB-shape sources are agent-friendly.
- Multi-agent parallelism — the single-agent constraint is a deliberate scope choice; reintroducing a second worker means resurrecting the rev-1 race-handling design (claim re-check, migration-number reservation, parallel INVESTIGATE-drift handling) and isn't justified at the 26-candidate backlog size.
## Cross-references
- INVESTIGATE-new-norwegian-public-sources.md — the 26-candidate queue this pipeline drains.
- INVESTIGATE-reports-and-indicators-from-catalogue.md — the maintenance ritual the agent must execute (step 7 in `contributors/ingest-modules.md`).
- PLAN-007-data-display-open-by-default.md — the catalogue plumbing that every onboarded source feeds into.
- `website/docs/contributors/ingest-modules.md` — the 7-step adding-a-source workflow that is the agent's runbook.
- `feedback_reports-investigate-stays-current.md` — the memory that codifies the maintenance ritual; the agent will be held to the same bar.
- WORKTREE.md, GIT.md — multi-agent / multi-branch hygiene that becomes load-bearing once agents push concurrently.