Plan 001: Scraping infrastructure
IMPLEMENTATION RULES: Before implementing this plan, read and follow:
- WORKFLOW.md — The implementation process
- PLANS.md — Plan structure and best practices
Status: Completed
Goal: Ship the shared scraping toolkit defined in INVESTIGATE-ngo-scraping-infrastructure.md — Crawlee dependency, two raw tables (`raw.ingest_runs`, `raw.sitemap_log`), the shared TypeScript library under `ingest/src/lib/scraping/` with test coverage, the minimal `mart_ingest_health` dbt view, env-var conventions, and the per-source folder convention. After this plan, per-NGO scrape PLANs (Folkehjelp first) can implement sources against a stable foundation without re-litigating infrastructure.
Last Updated: 2026-04-24
Completed: 2026-04-24 — all 6 phases done in one session. Crawlee + `fast-json-stable-stringify` deps added; `raw.ingest_runs` and `raw.sitemap_log` migrations (023 + 024) applied with the partial unique index enforcing the concurrent-run lock; shared library under `src/lib/scraping/` with 8 modules (`ua`, `record_hash`, `html_raw_hash`, `robots`, `sitemap_log`, `ingest_runs`, `upsert_record`, `kv`) and 49 pure-function tests; `mart_ingest_health` shipped as a 3-column dbt view with 5 passing data tests; README, naming-conventions, and CONTRIBUTING all cross-referenced. Final gates: `npm run typecheck` clean, `npm test` 49/49, `npm run migrate` idempotent, `dbt build` PASS=526 WARN=19 ERROR=0 TOTAL=545.
Investigation: INVESTIGATE-ngo-scraping-infrastructure.md — 25 resolved questions, zero open. Prerequisites: none (all foundation already in place — Postgres, dbt, ingest repo, migrate runner). Blocks: INVESTIGATE-folkehjelp-supply.md's scrape PLAN — the first per-NGO consumer of this infrastructure. Priority: Medium
Overview
Six phases, estimated ~10–12 h. The investigation's 6–8h estimate pre-dates Q19/Q20/Q22/Q23/Q25, which added sitemap_log, mandatory raw columns, the concurrent-run lock, the test suite, and the file-responsibilities convention. The toolkit surface roughly doubled; adjust accordingly.
Built in PLAN-001:
- Dependencies: `crawlee`, `fast-json-stable-stringify` added to `atlas-data/ingest/package.json`. Dev deps: `vitest`.
- Migrations: `023_raw_ingest_runs.sql`, `024_raw_sitemap_log.sql`.
- Shared library under `atlas-data/ingest/src/lib/scraping/`:
  - `ua.ts` — env-driven UA builder, hard-fail on missing `ATLAS_SCRAPE_CONTACT_EMAIL`.
  - `record_hash.ts` — canonical JSON + NFC + sha256 hasher.
  - `html_raw_hash.ts` — body-level hasher (audit-only).
  - `robots.ts` — `robots.txt` fetcher and per-URL allow/deny verifier.
  - `sitemap_log.ts` — sitemap-log reader/writer, orphan detection, fetch-skip decisions.
  - `ingest_runs.ts` — run-lifecycle writer with concurrent-run lock.
  - `upsert_record.ts` — generic `upsertRecord()` helper for the §C.5 mandatory columns.
  - `kv.ts` — thin wrapper around Crawlee's `KeyValueStore` for cache reads/writes.
  - `index.ts` — module re-exports.
- Tests (`vitest`): `__tests__/` folder co-located under `src/lib/scraping/`. Pure-function coverage: hashers, UA builder, robots parser, `decideFetch` reason codes, `upsertRecord` input validation. DB-touching code is verified end-to-end (see Implementation Notes).
- dbt model: `mart_ingest_health` (3-column minimal view per Q14).
- `.gitignore` entry for `.crawlee-cache/` under `atlas-data/ingest/`.
- Documentation:
  - New `atlas-data/ingest/README.md` listing the three env vars (§F).
  - New `atlas-data/ingest/src/sources/README.md` documenting the per-source folder convention (§B.3).
  - Extended `docs/stack/naming-conventions.md` with `source_slug`, `record_hash`, `html_raw_hash`, `url`, `raw.ingest_runs`, `raw.sitemap_log`.
NOT built in PLAN-001:
- Any per-NGO scraper (Folkehjelp, NKS, Frelsesarmeen, …) — each is a follow-up PLAN citing this infrastructure.
- The `raw.html_archive` upgrade-path table from §C.1.2 — out of scope for v1 per the investigation.
- Dagster integration — Dagster is a separate platform-service initiative; this PLAN's app-layer lock is designed to coexist with it (see §E.3.1 of the investigation).
- CI pipeline changes — the tests run locally via `npm test`; integrating with the repo's existing CI is a separate small PR if warranted.
Phase 1: Dependencies and scaffolding — DONE
Tasks
- 1.1 Installed runtime deps: `crawlee ^3.16.0`, `fast-json-stable-stringify ^2.1.0`. Major-pinned via caret per existing project convention.
- 1.2 Installed dev deps: `vitest ^4.1.5`.
- 1.3 Added `"test": "vitest run --passWithNoTests"` and `"test:watch": "vitest"` to `package.json` scripts. The `--passWithNoTests` flag makes the empty-suite interim state (pre-Phase 3) a clean exit-0 instead of exit-1.
- 1.4 Created `atlas-data/ingest/vitest.config.ts` with the include glob from the PLAN (a sketch follows this list).
- 1.5 Created `src/lib/scraping/` + `__tests__/` subfolder + `index.ts` stub. The stub contains only a module-level comment listing the Phase 3/4 exports to land here, plus `export {};` to keep TypeScript happy until real exports land.
- 1.6 Added `.crawlee-cache/` to `atlas-data/ingest/.gitignore` (file already existed; appended the new entry).
- 1.7 Extended the existing `atlas-data/ingest/README.md` with an "Environment variables" section (inserted between "Install" and "Run one source"). Deviation from plan: the PLAN said "create", but the README already existed with substantive content describing the current ingest flow — extending it was the correct action. The env-var table matches §F of the investigation, with a clarifying note that non-scraper modules (SSB, FHI, Brreg, Red Cross API) don't read these variables.
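A minimal sketch of what the 1.4 config might look like — the PLAN's exact include glob isn't reproduced in this document, so the glob below is an assumption:

```ts
// atlas-data/ingest/vitest.config.ts — sketch; the include glob is assumed.
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    // Co-located __tests__ folders per 1.5. --passWithNoTests is supplied by
    // the npm "test" script (1.3) rather than duplicated here.
    include: ['src/**/__tests__/**/*.test.ts'],
  },
});
```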
Validation
```sh
cd atlas-data/ingest
npm run typecheck
npm test   # should run zero tests cleanly — "No test files found" is acceptable
```
User confirms phase is complete.
Phase 2: Migrations for raw.ingest_runs and raw.sitemap_log — DONE
Tasks
- 2.1 Created `023_raw_ingest_runs.sql` with the schema plus the partial unique index `raw_ingest_runs_one_inprogress_per_source` on `(source_slug) WHERE finished_at IS NULL` (sketched below). Made `finished_at` and `exit_code` nullable so the row can be inserted at run-start with `finished_at = NULL` (in-progress marker).
- 2.2 Created `024_raw_sitemap_log.sql` with composite PK `(source_slug, url)` and the sitemap_log schema from §C.2. A table comment notes that HTML-index sources also use this table with `lastmod = NULL`.
- 2.3 `npm run migrate` applied both (023 in 10ms, 024 in 4ms). `file_count: 24` total.
- 2.4 Created new `atlas-data/dbt/models/shared/sources.yml` with both tables. dbt now sees 23 sources (was 21); `dbt parse` clean.
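A sketch of the shape of the 023 migration. Only `source_slug`, the nullable `finished_at` and `exit_code`, and the index name/predicate are confirmed above; the `id` key, `started_at`, and defaults are assumptions:

```sql
-- Sketch of 023_raw_ingest_runs.sql under stated assumptions.
create table if not exists raw.ingest_runs (
    id          bigint generated always as identity primary key,
    source_slug text        not null,
    started_at  timestamptz not null default now(),
    finished_at timestamptz,          -- NULL while the run is in progress
    exit_code   integer               -- NULL until the run finishes
);

-- The concurrent-run lock: at most one unfinished row per source.
create unique index if not exists raw_ingest_runs_one_inprogress_per_source
    on raw.ingest_runs (source_slug)
    where finished_at is null;
```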
Validation
```sh
cd atlas-data/ingest
npm run migrate
psql $DATABASE_URL -c "\d raw.ingest_runs"
psql $DATABASE_URL -c "\d raw.sitemap_log"
psql $DATABASE_URL -c "\di raw.raw_ingest_runs_one_inprogress_per_source"
```
Both tables and the partial unique index exist. User confirms phase is complete.
Phase 3: Shared library — UA, hashers, robots — DONE
Tasks
- 3.1 `ua.ts` — exports `buildUserAgent()` (a function, not a cached constant — it reads env lazily so tests can exercise the missing-env case without module-reset gymnastics) plus a named `MissingContactEmailError` class. Throws on unset/empty/whitespace-only env, with a descriptive message citing §D.1.
- 3.2 `record_hash.ts` — exports `recordHash(record: unknown): string`. Uses `fast-json-stable-stringify` + node:crypto sha256. An explicit comment documents that NFC normalization is the caller's responsibility (Q21 contract).
- 3.3 `html_raw_hash.ts` — exports `htmlRawHash(body)` plus `canonicalizeHtmlBody(body)` (exported for tests). Canonicalization: strip `<head>...</head>`, strip CSRF meta tags and nonce attributes (both quote styles), collapse whitespace runs.
- 3.4 `robots.ts` — hand-rolled parser (per [P1S.Q2] — kept the dep list small; robots.txt syntax is simple). Supports User-agent blocks + Disallow + Allow + Crawl-Delay + `*` path wildcards + `$` end-anchor. Longest-match-wins semantics with Allow breaking ties.
- 3.5 Tests for all four modules under `__tests__/`. Three real bugs caught during test runs:
  - robots.ts: the `*` wildcard regex conversion was broken (the escape-then-un-escape dance never fired because `*` wasn't in the first regex-special class). Fixed — it now escapes specials, then converts `*` → `.*` in a clean second pass (see the sketch after this list).
  - record_hash test: the initial test used `ø`, which is atomic in Unicode (no canonical decomposition). Replaced with `å` (U+00E5 ↔ U+0061 U+030A), which actually has an NFC/NFD split.
  - html_raw_hash test: the initial whitespace test expected identical output after canonicalizing inputs with different tag-adjacent whitespace. Tightened the test to cover the actual contract (inter-token whitespace runs collapse to a single space).
- 3.6 `src/lib/scraping/index.ts` updated with re-exports for all four modules.
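Minimal sketches of three of these surfaces, under stated assumptions: the UA string format is a placeholder, `ruleToRegExp` is a hypothetical internal name, and the exact error wording is invented.

```ts
import { createHash } from 'node:crypto';
import stringify from 'fast-json-stable-stringify';

export class MissingContactEmailError extends Error {}

// ua.ts: reads env lazily on each call. The UA shape below is a placeholder.
export function buildUserAgent(): string {
  const email = process.env.ATLAS_SCRAPE_CONTACT_EMAIL?.trim();
  if (!email) {
    throw new MissingContactEmailError(
      'ATLAS_SCRAPE_CONTACT_EMAIL is unset or empty — see §D.1 of the investigation.',
    );
  }
  return `atlas-ingest (+mailto:${email})`;
}

// record_hash.ts: canonical JSON via fast-json-stable-stringify, then sha256.
// NFC normalization is the caller's responsibility (Q21).
export function recordHash(record: unknown): string {
  return createHash('sha256').update(stringify(record), 'utf8').digest('hex');
}

// robots.ts, the 3.5 fix: escape every regex special (including "*"), then
// convert the escaped wildcard to ".*" and a trailing escaped "$" back to a
// real end-anchor in a clean second pass.
export function ruleToRegExp(rule: string): RegExp {
  const escaped = rule.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const pattern = escaped.replace(/\\\*/g, '.*').replace(/\\\$$/, '$');
  return new RegExp('^' + pattern);
}
```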
Validation
```sh
cd atlas-data/ingest
npm run typecheck
npm test -- src/lib/scraping/__tests__/
```
All tests pass; `npm run typecheck` clean. User confirms.
Phase 4: Shared library — sitemap_log, ingest_runs, upsert helper, KV wrapper — DONE
Tasks
- 4.1 `sitemap_log.ts` — `readPriorState`, pure `decideFetch` (four-condition skip rule with NULL handling and reason codes), `upsertDiscovered` (uses the shared `upsert()` bulk helper from `lib/postgres.ts`), `detectOrphans`.
- 4.2 `ingest_runs.ts` — `startRun` with a SELECT-then-INSERT check (the DB-layer partial unique index from Phase 2 catches the concurrent-race edge case; see the sketch after this list), `finishRun`, and an `IngestInProgressError` class with clear recovery SQL in the message. No `sql.begin()` transaction wrapper: the DB-layer index is the correctness guarantee, and wrapping in a transaction added adapter-compatibility surface without buying anything.
- 4.3 `upsert_record.ts` — `upsertRecord(sql, { tableName, row, columns })` returns `'inserted' | 'updated' | 'skipped'`. SELECT the current `record_hash` → compare → skip-or-upsert. Uses inline `sql.unsafe(template, params)` for the single-row INSERT/UPDATE rather than delegating to the shared bulk `upsert()` helper — clearer for the single-row case, and it avoids the unnecessary bulk-value-helper detour.
- 4.4 `kv.ts` — `getSourceKv`, `setCached`, `getCachedBody`, `getCachedMetadata`, `hasCached`. A thin Crawlee `KeyValueStore` wrapper scoped per source. Exercised end-to-end by the first per-source PLAN (Folkehjelp); no isolated unit tests (they would mostly test that Crawlee works).
- 4.5 Pure tests only: `sitemap_log.test.ts` covers every `decideFetch` reason code; `upsert_record.test.ts` covers the input-validation branches. DB-touching behavior (`readPriorState`, `upsertDiscovered`, `detectOrphans`, `startRun`, `finishRun`, the `upsertRecord` DB path) is not unit-tested: at Atlas's scale the code is short and mostly delegates to `postgres.js`, so unit tests against a mocked DB would mostly test that `postgres.js` works. Full verification happens through Phase 2's migration run, Phase 5's `dbt build`, and the first per-source PLAN's end-to-end smoke test. If a real bug escapes, the right next step is a discrete "DB integration test harness" PLAN using testcontainers-postgres or similar.
  - Running total: 49 passing, 0 skipped, 0 failing across 6 test files.
- 4.6 `index.ts` re-exports the `sitemap_log`, `ingest_runs`, `upsert_record`, and `kv` public surfaces.
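A sketch of the `startRun` flow described in 4.2, assuming a `postgres.js` client; the `id` and `started_at` columns and the recovery-SQL wording are guesses beyond what's confirmed above:

```ts
import type { Sql } from 'postgres';

export class IngestInProgressError extends Error {}

// SELECT-then-INSERT: the SELECT exists for a friendly error message; the
// Phase 2 partial unique index is the actual correctness guarantee.
export async function startRun(sql: Sql, sourceSlug: string): Promise<number> {
  const [open] = await sql`
    select id, started_at from raw.ingest_runs
    where source_slug = ${sourceSlug} and finished_at is null`;
  if (open) {
    throw new IngestInProgressError(
      `Run ${open.id} for "${sourceSlug}" is still open (started ${open.started_at}). ` +
        `If it crashed, close it manually: update raw.ingest_runs ` +
        `set finished_at = now(), exit_code = 1 where id = ${open.id};`,
    );
  }
  try {
    const [row] = await sql`
      insert into raw.ingest_runs (source_slug) values (${sourceSlug})
      returning id`;
    return row.id as number;
  } catch (err) {
    // If two processes race past the SELECT, the second INSERT violates the
    // partial unique index and Postgres raises unique_violation (23505).
    if ((err as { code?: string }).code === '23505') {
      throw new IngestInProgressError(`Concurrent run detected for "${sourceSlug}".`);
    }
    throw err;
  }
}
```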
Validation
```sh
cd atlas-data/ingest
npm run typecheck
npm test
```
All tests pass; `npm run typecheck` clean. User confirms.
Phase 5: mart_ingest_health dbt model — DONE
Tasks
- 5.1 `mart_ingest_health.sql` created as a view materialized in `marts.*`. Selects `distinct on (source_slug)` ordered by `finished_at desc`, with a `where finished_at is not null` clause so in-progress rows never leak through (a sketch follows this list). A SQL comment at the top documents the empty-state expectation per [P1S.Q4].
- 5.2 schema.yml entry added. Five data tests: `not_null` + `unique` on `source_slug`, `not_null` on `last_run_at`, `not_null` + `accepted_values: ['ok', 'fail']` on `last_status`. Went slightly beyond the PLAN's "+3" estimate to cover the presence of all three columns, not just three assertions.
- 5.3 ERD regenerated via `atlas-data/dbt/regenerate-erd.sh`. `docs/stack/erd.md` now contains `MODEL.ATLAS.MART_INGEST_HEALTH` with all three columns. ERD entity count: 36 → 37.
- 5.4 `dbt build` clean: PASS=526 WARN=19 ERROR=0 SKIP=0 TOTAL=545 (was 520 baseline from PLAN-002 + 1 model + 5 tests = 526 ✓). `dbt show --inline "select * from marts.mart_ingest_health"` returns empty — expected.
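A sketch of the view's likely shape. The `distinct on`, ordering, and not-null filter are confirmed above; the dbt `source()` name and the underlying status column are assumptions, and the `source_id` rename reflects the Phase 6.2 fix:

```sql
-- Sketch of mart_ingest_health.sql under stated assumptions.
-- NOTE: empty output is expected until the first scraper writes to
-- raw.ingest_runs — see [P1S.Q4].
select distinct on (source_slug)
    source_slug as source_id,   -- renamed at the view boundary (Phase 6.2)
    finished_at as last_run_at,
    status      as last_status  -- 'ok' | 'fail'; raw column name assumed
from {{ source('raw', 'ingest_runs') }}
where finished_at is not null
order by source_slug, finished_at desc
```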
Validation
```sh
cd atlas-data/dbt
uv run --env-file ../ingest/.env dbt build
uv run --env-file ../ingest/.env dbt show --inline "select * from marts.mart_ingest_health"
```
Empty result is expected (no scrapers have written to `raw.ingest_runs` yet). `dbt build` passes. User confirms.
Phase 6: Per-source documentation and wrap-up — DONE
Tasks
- 6.1 Extended the existing `atlas-data/ingest/src/sources/README.md` with a new "Scraping sources — additional convention" section covering the extended folder layout (`discover.ts` / `parse.ts` / `overrides.json` / `types.ts` / `__tests__/fixtures/`), file responsibilities (Q25), the §C.5 mandatory raw-table columns, migration naming, env vars, and a 7-step new-scraper checklist. Also added the missing `redcross-branches` row to the implemented-sources table.
- 6.2 Extended `docs/stack/naming-conventions.md` with a new "Raw-schema scraper conventions" section covering `source_slug`, `url`, `record_hash`, `html_raw_hash`, `is_active`, `lastmod`, and the two shared cross-source tables. Noted on the existing `source_id` canonical row that it's renamed from `raw.*.source_slug` at the dbt passthrough — which caught a small inconsistency in `mart_ingest_health` that was then fixed (the mart now exposes `source_id`, renamed at the view boundary).
- 6.3 Added a scraper-specific sub-block to the "Prerequisites reading" list in `atlas-data/CONTRIBUTING.md` pointing at the investigation, the sources/README scraper section, and this PLAN's completed copy.
- 6.4 All four gates green:
  - `npm run typecheck` — clean
  - `npm test` — 49 passing, 0 skipped, 0 failing (487ms)
  - `npm run migrate` — idempotent (24 files, no-op on re-run)
  - `dbt build` — PASS=526 WARN=19 ERROR=0 SKIP=0 TOTAL=545
- 6.5 Plan moved to `plans/completed/`; status set to `Completed` with completion date and one-line summary.
Validation
All gates pass. User confirms by reading the three new README/docs files and agreeing they're useful entry points for the next per-source PLAN.
Acceptance Criteria
- `npm run typecheck` clean in `atlas-data/ingest/`.
- `npm test` passes (vitest; shared-lib pure tests).
- `npm run migrate` applies both new migrations cleanly (and is idempotent on re-run).
- `dbt build` passes; `marts.mart_ingest_health` exists and is queryable (empty until the first scraper runs).
- The partial unique index on `raw.ingest_runs (source_slug) WHERE finished_at IS NULL` exists and rejects concurrent inserts.
- Three env vars are documented in `atlas-data/ingest/README.md`. Running any script with `ATLAS_SCRAPE_CONTACT_EMAIL` unset yields a clear startup error.
- `.crawlee-cache/` is gitignored under `atlas-data/ingest/`.
- The per-source folder convention is documented in `atlas-data/ingest/src/sources/README.md`.
- `docs/stack/naming-conventions.md` covers the new columns and tables.
- No per-NGO scraper is implemented as part of this PLAN (that's a follow-up).
Implementation Notes
- Postgres client: the existing `ingest/src/lib/postgres.ts` module presumably exports a client factory; reuse it for the sitemap_log / ingest_runs / upsert code rather than constructing new clients. Match the existing convention (pooled vs single connection, transaction style).
- NFC normalization (Q21): callers of `recordHash` are responsible for normalizing strings at the parser boundary. `record_hash.ts` itself does not normalize — if it did, the record object in memory would still contain non-NFC strings and could diverge from what's written to the DB. Put the normalization in `parse.ts` for each source, tested via the record_hash unit tests (see the sketch after this list).
- Partial unique index (Q22): the investigation describes the lock as an application-layer check, but backing it with a database-enforced partial unique index on `(source_slug) WHERE finished_at IS NULL` makes it race-free even if two processes run the SELECT simultaneously. The SELECT-then-informative-error code path still exists for good error messaging; the DB index is the correctness guarantee.
- Crawlee version pinning: lock to a specific major version in `package.json` (e.g., `"crawlee": "^3.x.y"`). Crawlee has had breaking changes at major versions; unpinning risks silent regressions.
- Test scope: unit tests cover the pure-function surface (hashers, UA builder, robots parser, `decideFetch`, input validation). DB-touching code in `sitemap_log` / `ingest_runs` / `upsert_record` is short and delegates to `postgres.js`; it's verified end-to-end via Phase 2's `npm run migrate`, Phase 5's `dbt build`, and the first per-source PLAN (Folkehjelp) smoke test. If a real bug escapes, add a discrete "DB integration test harness" PLAN (likely testcontainers-postgres) rather than mocking the DB.
- dbt sources.yml placement (Phase 2.4): the investigation doesn't specify, and PLAN-002 put `redcross_branches` under the `supply/` folder's sources. Whether `ingest_runs` and `sitemap_log` belong under `supply/sources.yml` or a new `shared/sources.yml` is a small call — they're infrastructure, not supply. Pick whichever reads more naturally; `shared/` feels slightly better.
- Crawlee `KeyValueStore` init: Crawlee reads `CRAWLEE_STORAGE_DIR` on first import and caches the path. Make sure the env var is set before the first `import` — typically via `dotenv` at process start or via `--env-file=.env` on the tsx command (the existing convention in `package.json` scripts already does this).
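To illustrate the Q21 contract, a hypothetical parse-boundary helper — the name `nfcDeep` and its placement are invented for illustration, not part of the shipped library:

```ts
// Hypothetical parse.ts boundary step: normalize every string in the parsed
// record to NFC *before* it is hashed or written, so the in-memory object
// cannot diverge from what lands in the DB.
export function nfcDeep<T>(value: T): T {
  if (typeof value === 'string') return value.normalize('NFC') as T;
  if (Array.isArray(value)) return value.map(nfcDeep) as unknown as T;
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>).map(([k, v]) => [k, nfcDeep(v)]),
    ) as T;
  }
  return value;
}

// Usage at the parser boundary: const record = nfcDeep(parseRow(raw));
// recordHash(record) and the DB write then see identical NFC-normalized data.
```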
Files to Modify
New files:
- `atlas-data/ingest/src/lib/scraping/ua.ts`
- `atlas-data/ingest/src/lib/scraping/record_hash.ts`
- `atlas-data/ingest/src/lib/scraping/html_raw_hash.ts`
- `atlas-data/ingest/src/lib/scraping/robots.ts`
- `atlas-data/ingest/src/lib/scraping/sitemap_log.ts`
- `atlas-data/ingest/src/lib/scraping/ingest_runs.ts`
- `atlas-data/ingest/src/lib/scraping/upsert_record.ts`
- `atlas-data/ingest/src/lib/scraping/kv.ts`
- `atlas-data/ingest/src/lib/scraping/index.ts`
- `atlas-data/ingest/src/lib/scraping/__tests__/ua.test.ts`
- `atlas-data/ingest/src/lib/scraping/__tests__/record_hash.test.ts`
- `atlas-data/ingest/src/lib/scraping/__tests__/html_raw_hash.test.ts`
- `atlas-data/ingest/src/lib/scraping/__tests__/robots.test.ts`
- `atlas-data/ingest/src/lib/scraping/__tests__/sitemap_log.test.ts` — `decideFetch` pure-logic tests
- `atlas-data/ingest/src/lib/scraping/__tests__/upsert_record.test.ts` — input-validation tests
- `atlas-data/ingest/vitest.config.ts`
- `atlas-data/ingest/README.md`
- `atlas-data/ingest/src/sources/README.md`
- `atlas-data/migrations/023_raw_ingest_runs.sql`
- `atlas-data/migrations/024_raw_sitemap_log.sql`
- `atlas-data/dbt/models/marts/mart_ingest_health.sql`
Modified files:
- `atlas-data/ingest/package.json` — new deps, new scripts.
- `atlas-data/ingest/.gitignore` — add `.crawlee-cache/` (create file if absent).
- `atlas-data/dbt/models/marts/schema.yml` — entry for `mart_ingest_health`.
- `atlas-data/dbt/models/shared/sources.yml` — entries for `raw.ingest_runs` and `raw.sitemap_log` (new folder per [P1S.Q1]).
- `atlas-data/CONTRIBUTING.md` — cross-reference to this PLAN and its investigation.
- `docs/stack/naming-conventions.md` — new sections per §6.2.
- `docs/stack/erd.md` — add `mart_ingest_health`, `raw.ingest_runs`, `raw.sitemap_log`.
Decision-points specific to PLAN-001-scraping (per PLANS.md)
All four items were implementation-level choices; all resolved before handing the plan to implementation.
- [P1S.Q1] `sources.yml` placement for the shared tables → new `atlas-data/dbt/models/shared/sources.yml`. Infrastructure tables aren't supply or indicators; they get their own folder. Decided 2026-04-24.
- [P1S.Q2] `robots-parser` dependency vs hand-rolled parser → check npm for active maintenance and bundle size during Phase 3.4; prefer the dep if healthy, hand-roll (~30 lines of regex) if stale. Implementer's call. Decided 2026-04-24.
- [P1S.Q3] DB integration test harness → not pursued in this PLAN. At Atlas's scale the DB-touching shared-lib code is short and mostly delegates to `postgres.js`; unit tests against a mocked DB would mostly test that `postgres.js` works. The DB path is verified end-to-end via Phase 2 migrate + Phase 5 dbt build + the first per-source PLAN's smoke test. Revisit with a discrete PLAN (likely testcontainers-postgres) if real bugs escape that coverage. Decided 2026-04-24.
- [P1S.Q4] Seed `raw.ingest_runs` with a dummy row → no. An empty `mart_ingest_health` is the correct reflection of "no scrapers have run yet", and seed rows tend to become permanent mystery fixtures. Instead, a SQL comment at the top of `mart_ingest_health.sql` explains that empty output is expected (see Phase 5.1). Decided 2026-04-24.