Investigate: Norsk Folkehjelp supply ingest

IMPLEMENTATION RULES: Before implementing any plan from this investigation, read and follow:

Status: Backlog

Goal: Decide how to ingest Norsk Folkehjelp's lokallag and per-chapter activities into Atlas. Unlike Red Cross, NF exposes no public API — sources are server-rendered HTML at folkehjelp.no/lokallag/{slug} plus the Brreg open Enhetsregister. Settle the NF-specific source surface, chapter-identity model (web vs Brreg), and the activity → service_category mapping for NF's 6-bin taxonomy.

Last Updated: 2026-04-24


Companion investigations

This investigation focuses only on what's specific to Norsk Folkehjelp. Generic concerns are addressed in:

Both companion infrastructure PLANs are prerequisites for the Folkehjelp scrape PLAN.


Scope

In scope:

  • The Folkehjelp source surface (HTML structure, robots.txt, sitemap, ethics) — concretely, not the generic doctrine.
  • NF-specific Craft GraphQL probe + outreach.
  • Chapter identity: reconciling Brreg orgnr with web slug; non-geographic chapters.
  • Activity extraction (CSS selectors) and 6 → 22 NF → Atlas category mapping.
  • NF-specific raw.folkehjelp_* schemas. Brreg-side data uses the shared cross-NGO raw.brreg_enheter (see §B.2 + PLAN-001-brreg-enheter), filtered by navn ILIKE 'Norsk Folkehjelp%'.

Out of scope (covered elsewhere — see Companion investigations above):

  • Generic scraper toolkit, cache, change detection, ethical defaults, failure modes.
  • The data-model extensions (source_url, chapter_subtype, chapter_kommune_coverage).
  • Sub-activity granularity from NF's minisites / localBranchEvents.
  • Coverage-gap mart consumption.
  • Folkehjelp's national-only programmes (minerydding, utviklingssamarbeid).

Existing research and live-data context

This investigation builds on three already-completed research notes — read these first:

  • norskfolkehjelp-activities.md — 6-bin activity catalogue, slug convention, Brreg-vs-web gap (108 web ↔ 121 Brreg), comparison to Red Cross. Sample of 46 chapter pages already parsed (42 with usable activity sections, 4 empty).
  • norskfolkehjelp-activity-indicator-matrix.md — already maps NF activities to demand-side indicators; reuse 100% from Red Cross.
  • ngo-landscape.md — high-level org info: NORSK FOLKEHJELP, orgnr 871033552, ~16 000 members, ~100 lokallag, NOK 1.8 bn income (2024).

Already in the live data (shipped by PLAN-001 + PLAN-002):

  • dim_ngo row for norskfolkehjelp (orgnr 871033552, slug norskfolkehjelp, chapter_data_shape='cms_bins').
  • ref_atlas_service_category — 22 cross-NGO categories.
  • dim_chapter, dim_activity, fact_chapter_activities — built for Red Cross. Folkehjelp's staging (supply__folkehjelp_*) will UNION ALL into the same downstream marts.
  • dim_postnummer + crosswalk_kommune_name — used by the slug → kommune resolution step.

Section A — Folkehjelp source surface

A.1 What folkehjelp.no exposes

Confirmed Craft CMS (verified 2026-04-23): robots.txt disallows /cpresources/, the canonical Craft asset directory — definitive evidence. See A.4 for the API consequences.

  • Index page: https://folkehjelp.no/lokallag — server-rendered HTML. Lists ~108 lokallag grouped under fylke headings.

  • Per-chapter page: https://folkehjelp.no/lokallag/{slug} — slug is kebab-cased kommune/region name with diacritic transliteration (asker-og-baerum, loerenskog, soer-varanger, aalesund-og-omegn). The "Aktivitetsområder" section is rendered from the Craft template; structure verified in Appendix B.

  • Sitemap index: https://folkehjelp.no/sitemaps-1-sitemap.xml — root sitemap-index pointing at 29 sub-sitemaps. Relevant ones for chapter discovery:

    • localBranch — chapter detail pages
    • localBranches — chapter index pages (likely fylke-level grouping)
    • localBranchSites — chapter sub-sites
    • localBranchEvents — per-chapter event listings (deferred to INVESTIGATE-ngo-events-and-minisites.md)
    • minisites — programme-specific landing pages (deferred)
    • Other sections (news, pages, countries, vacancies, donate, mainNavigation, etc.) are out of scope.

    Each sub-sitemap is XML with <loc> + <lastmod> per URL.

  • No structured data: chapter pages don't expose JSON-LD Organization / Place markup. Activity extraction is HTML selectors.

A.2 robots.txt — verified content

[Q1] Verified 2026-04-23: https://folkehjelp.no/robots.txt allows crawling under /lokallag/*. Actual contents:

User-agent: *
Disallow: /cpresources/
Disallow: /vendor/
Disallow: /.env
Disallow: /cache/
Disallow: /fagbevegelsen/1-mai
Disallow: /*?token*

User-agent: MJ12bot
Crawl-Delay: 15

User-agent: AhrefsBot
Crawl-Delay: 10

Sitemap: https://folkehjelp.no/sitemaps-1-sitemap.xml
Sitemap: https://www.npaid.org/sitemaps-1-sitemap.xml

Notes:

  • /lokallag/* is not in any disallow rule — clear to crawl.
  • npaid.org is Folkehjelp's English-language sister site (Norwegian People's Aid International). Out of scope for v1.
  • No bot-specific rule for our Atlas/0.1 UA — the wildcard rules apply, no Crawl-Delay enforced. The default 1 req/sec from the scraping infrastructure is fine.
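The "clear to crawl" reading above can be checked mechanically. The following is a minimal sketch, assuming a hypothetical helper `isAllowed` that is not part of the scraping infra. It only understands `Disallow` lines with `*` wildcards for the `User-agent: *` group, which is all that group uses in A.2; it is not a full robots.txt parser (no `Allow`, no `$` end anchors).

```typescript
// Hypothetical helper for illustration only. The rule list is copied from the
// wildcard (User-agent: *) group quoted in A.2.
const WILDCARD_DISALLOWS: string[] = [
  '/cpresources/',
  '/vendor/',
  '/.env',
  '/cache/',
  '/fagbevegelsen/1-mai',
  '/*?token*',
];

// Turn a robots.txt path pattern into a prefix-anchored regex:
// literal parts are escaped, each `*` becomes `.*`.
function ruleToRegex(rule: string): RegExp {
  const escaped = rule
    .split('*')
    .map(part => part.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'))
    .join('.*');
  return new RegExp('^' + escaped);
}

// A path (including any query string) is allowed when no Disallow rule matches it.
function isAllowed(path: string, disallows: string[] = WILDCARD_DISALLOWS): boolean {
  return !disallows.some(rule => ruleToRegex(rule).test(path));
}
```

Under these assumptions `isAllowed('/lokallag/asker-og-baerum')` is true, while any URL carrying a `?token` query parameter is rejected by the last rule.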

A.3 What can break for NF specifically

  • Template change: Craft CMS upgrades or content-team rework can change the Aktivitetsområder section structure. Mitigation per the scraping infra failure-mode policy: warn-and-continue per chapter; the index page failing is what hard-fails the run.
  • Sub-activity drift: if NF adds a 7th activity bin to the CMS template, our staging mapping needs an update. Surface as a WARN: unmapped activity_label "X" — add to supply__folkehjelp_chapter_activities CASE.
  • NF-specific naming oddities: solidaritetsungdom-*, studentgruppe-*, sanitet-haukeland, sentralt, svalbard — handled via the override file (Section B.5).

A.4 Craft CMS GraphQL probe — [Q2] outreach worth pursuing

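Per the scraping infra "ask before scrape" doctrine, we checked for an official API before committing to HTML parsing. Craft CMS ships with a built-in GraphQL API at /actions/graphql/api (often aliased to /api), but it is opt-in: the site owner has to enable a public schema.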

Probe results 2026-04-23:

  • https://folkehjelp.no/api → 404
  • https://folkehjelp.no/actions/graphql/api → 404

So NF has not enabled the public GraphQL endpoint (or has routed it to a non-default path we haven't discovered).

Pre-PLAN action: write to NF (likely post@folkehjelp.no or their digital team) asking whether they would expose a read-only Public Schema scoped to localBranch / localBranchSites / activity entry types. Cost to them: ~30 min config; benefit to Atlas: massively cleaner data pipeline (typed queries, no HTML parsing, native fields for orgnr / contact / postal code, native change detection per entry).

Decision rule: the scrape PLAN does not block on the NF response. We design around the sitemap + HTML structure (per Sections B and C) so we can ship regardless. If NF later enables GraphQL, it becomes a separate refactor PLAN — raw.folkehjelp_* schema is independent of the source mechanism.


Section B — Chapter discovery and identity

B.1 Two sources of truth — [Q3] scrape both

| Source | Authoritative for | API/access |
|---|---|---|
| Brreg Enhetsregister | orgnr, legal name, organisasjonsform=FLI, founding date, register flags (i_frivillighetsreg), employee count | Open API, no key. data.brreg.no/enhetsregisteret/api/enheter?navn=norsk+folkehjelp&organisasjonsform=FLI |
| folkehjelp.no | Public URL, display name, active activities, fylke grouping | Scrape. Sitemap + chapter pages. |

Both required. Brreg gives us the legal-entity backbone (121 lokallag, 88% in Frivillighetsregisteret per the research); web gives us the URL + activities + active/inactive signal. Match on normalised name.

B.2 Brreg ingest — use the generic brreg-enheter source

Superseded by PLAN-001-brreg-enheter (2026-04-24). What this investigation originally proposed here — an NF-specific brreg-folkehjelp-units seed-source writing to an NF-specific raw.brreg_folkehjelp_units table — is not what shipped. During implementation the scope was pulled forward to cross-NGO from day one (see [Q4] below). For Folkehjelp, the concrete shape is:

  • Use the existing generic ingest at atlas-data/ingest/src/seed-sources/brreg-enheter/.
  • Run npm run refresh:brreg-enheter.
  • It reads landscape.json, iterates every NGO with a brreg_query block, fetches each via the shared typed Brreg client (src/lib/brreg/), and upserts all matches into the shared raw.brreg_enheter table.
  • Folkehjelp's brreg_query block (committed 2026-04-24) is { navn: "norsk folkehjelp", organisasjonsform: "FLI", nameStartsWith: "Norsk Folkehjelp" }. Verified: 122 rows, 108 in Frivillighetsregisteret (88.5%), consistent with this investigation's baseline of 121 lokallag and 88% registration (B.1).

Adding a new NGO's Brreg data is a landscape.json edit, not a new script/migration. No NF-specific code path anywhere.
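To make the brreg_query shape concrete, here is a minimal sketch of how such a block could be turned into a request URL and a post-filter. The helper names (`buildBrregUrl`, `keepByPrefix`) are hypothetical and are not the shipped src/lib/brreg/ client; applying nameStartsWith client-side after the API search is also an assumption.

```typescript
// Shape of the brreg_query block as quoted above.
interface BrregQuery {
  navn: string;
  organisasjonsform: string;
  nameStartsWith: string;
}

const BRREG_BASE = 'https://data.brreg.no/enhetsregisteret/api/enheter';

// Hypothetical: navn and organisasjonsform go to the API as query parameters.
// encodeURIComponent writes the space as %20, equivalent to the + form in B.1.
function buildBrregUrl(q: BrregQuery): string {
  return (
    BRREG_BASE +
    '?navn=' + encodeURIComponent(q.navn) +
    '&organisasjonsform=' + encodeURIComponent(q.organisasjonsform)
  );
}

// Hypothetical (assumption): nameStartsWith is applied to the returned rows,
// case-insensitively, to drop units that merely mention the search words.
function keepByPrefix<T extends { navn: string }>(rows: T[], q: BrregQuery): T[] {
  const prefix = q.nameStartsWith.toLowerCase();
  return rows.filter(r => r.navn.toLowerCase().startsWith(prefix));
}

const folkehjelpQuery: BrregQuery = {
  navn: 'norsk folkehjelp',
  organisasjonsform: 'FLI',
  nameStartsWith: 'Norsk Folkehjelp',
};
```

The design point this illustrates is the one made above: a new NGO needs only a new query object, with no new code path.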

[Q4] Should this be brreg-folkehjelp-units (NF-specific) or a more general parameterised script? Recommendation was NF-specific for v1. Resolved 2026-04-24: generic from day one. The NF-specific recommendation in this investigation was reversed during PLAN-001 execution; the generic cross-NGO pattern shipped instead. See PLAN-001-brreg-enheter for the landed shape.

B.3 Sitemap-driven chapter discovery

Per the scraping infra sitemap-first doctrine:

// Stage 1: fetch the sitemap index
const indexXml = await fetchText('https://folkehjelp.no/sitemaps-1-sitemap.xml');
const subSitemaps = parseSitemapIndex(indexXml);
// Match the localBranch section only: a plain includes('localBranch') would also
// hit localBranches, localBranchSites and localBranchEvents.
const localBranchSitemapUrl = subSitemaps.find(u => /localBranch(?![A-Za-z])/.test(u));
if (!localBranchSitemapUrl) {
  throw new Error('localBranch sub-sitemap not found in sitemap index');
}

// Stage 2: fetch the localBranch sub-sitemap → all chapter URLs + lastmod
const branchXml = await fetchText(localBranchSitemapUrl);
const entries = parseSitemap(branchXml);
// each entry: { loc: 'https://folkehjelp.no/lokallag/asker-og-baerum', lastmod: '2026-03-15' }

// Stage 3: filter to /lokallag/* slugs (the sitemap may also include redirects/legacy URLs)
const chapterUrls = entries
  .filter(e => /^https:\/\/folkehjelp\.no\/lokallag\/[a-z0-9-]+$/.test(e.loc))
  .map(e => ({ url: e.loc, slug: e.loc.split('/').pop()!, lastmod: e.lastmod }));

Fylke grouping is not in the sitemap. We still fetch the HTML index page (/lokallag) once per scrape run to capture fylke→slug grouping for the chapter_kommune_coverage 'inferred' source. If the HTML index breaks but the sitemap still works, we degrade gracefully — chapters get ingested, regional coverage falls back to the override file.
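The snippet above relies on a `parseSitemap` helper that this document does not define. As a rough idea of what it needs to do, here is a minimal sketch, assuming flat `<url><loc>…</loc><lastmod>…</lastmod></url>` entries as described in A.1. It uses regex extraction rather than a real XML parser, so treat it as a stand-in and not as the shared helper's actual implementation.

```typescript
interface SitemapEntry {
  loc: string;
  lastmod: string | null; // null when the entry carries no <lastmod>
}

// Hypothetical stand-in for the shared parseSitemap helper.
function parseSitemap(xml: string): SitemapEntry[] {
  const entries: SitemapEntry[] = [];
  const urlBlock = /<url>([\s\S]*?)<\/url>/g;
  let m: RegExpExecArray | null;
  while ((m = urlBlock.exec(xml)) !== null) {
    const body = m[1];
    const loc = /<loc>\s*([^<\s]+)\s*<\/loc>/.exec(body);
    const lastmod = /<lastmod>\s*([^<\s]+)\s*<\/lastmod>/.exec(body);
    // Entries without a <loc> are skipped rather than treated as errors.
    if (loc) {
      entries.push({ loc: loc[1], lastmod: lastmod ? lastmod[1] : null });
    }
  }
  return entries;
}
```

Returning `lastmod: null` instead of dropping the entry keeps the URL visible to discovery; the skip-vs-fetch logic can then decide what a missing lastmod means.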

Inherited from the scraping infra §C.2: discover.ts reads/writes the shared raw.sitemap_log table (one row per (source_slug, url) across all sources). This drives:

  • Skip-vs-fetch decision per URL (skip when lastmod_now <= stored_lastmod and a raw.folkehjelp_chapters row exists; otherwise fetch).
  • Orphan detection: any raw.sitemap_log row for source_slug='folkehjelp-chapters' whose last_seen_at is older than this run's started_at → mark the corresponding raw.folkehjelp_chapters row is_active=false (preserves history per scraping infra §E.1).
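The two rules above are small enough to express as pure functions. This is a sketch only: the real logic lives in the shared helpers under ingest/src/lib/scraping/, and the names `shouldFetch`, `findOrphans` and the row shape below are hypothetical. Fetching when either lastmod is missing is an assumption, chosen because it errs on the side of fresh data.

```typescript
// Hypothetical projection of a raw.sitemap_log row.
interface SitemapLogRow {
  url: string;
  lastSeenAt: Date;
}

// Skip only when the sitemap's lastmod is not newer than the stored one
// AND a raw.folkehjelp_chapters row already exists for the URL.
function shouldFetch(
  lastmodNow: string | null,
  storedLastmod: string | null,
  hasChapterRow: boolean,
): boolean {
  if (!hasChapterRow) return true;
  // Assumption: with no lastmod to compare, fetch rather than risk stale data.
  if (lastmodNow === null || storedLastmod === null) return true;
  return Date.parse(lastmodNow) > Date.parse(storedLastmod);
}

// URLs not seen during this run are orphans: their chapter rows get is_active=false.
function findOrphans(rows: SitemapLogRow[], runStartedAt: Date): string[] {
  return rows
    .filter(r => r.lastSeenAt.getTime() < runStartedAt.getTime())
    .map(r => r.url);
}
```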

Folkehjelp's source-specific code does not implement sitemap-log logic — it calls the shared helpers from ingest/src/lib/scraping/.

B.4 Name normalisation and matching — [Q5] Brreg ↔ web

Normalisation function for both sides:

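function norm(s: string): string {
  return s
    .normalize('NFC')
    .trim()
    .toLowerCase()
    .replace(/^norsk folkehjelp\s+/, '') // strip the Brreg legal-name prefix
    // transliterate the way the web slugs do (A.1): å → aa, ø → oe, æ → ae
    .replace(/å/g, 'aa').replace(/ø/g, 'oe').replace(/æ/g, 'ae')
    .replace(/ä/g, 'a').replace(/ö/g, 'o')
    .replace(/-/g, ' ') // slug hyphens are word breaks, not noise
    .replace(/[^a-z0-9 ]/g, '')
    .trim() // trim before joining, so stray whitespace cannot become a hyphen
    .replace(/\s+/g, '-');
}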

Match strategy (in order):

  1. Exact normalised match.
  2. Levenshtein distance ≤ 2 against unmatched candidates (handles minor typos; logs every fuzzy match for review).
  3. Manual override file (see B.5).
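Steps 1 and 2 of the match order above could look roughly like the sketch below. It assumes both inputs have already been passed through `norm`. The function names are hypothetical, and returning no match when two candidates tie at the same distance is an added assumption, on the reasoning that ambiguous cases belong in the override file and should not be guessed.

```typescript
// Classic dynamic-programming Levenshtein distance, keeping two rows in memory.
function levenshtein(a: string, b: string): number {
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

type SlugMatch = { slug: string; kind: 'exact' | 'fuzzy' } | null;

// normName: normalised Brreg name; slugs: normalised, still-unmatched web slugs.
function matchSlug(normName: string, slugs: string[], maxDistance: number = 2): SlugMatch {
  if (slugs.indexOf(normName) !== -1) return { slug: normName, kind: 'exact' };
  let best: { slug: string; d: number } | null = null;
  let tie = false;
  for (const slug of slugs) {
    const d = levenshtein(normName, slug);
    if (d > maxDistance) continue;
    if (best === null || d < best.d) {
      best = { slug, d };
      tie = false;
    } else if (d === best.d) {
      tie = true;
    }
  }
  // Assumption: an ambiguous fuzzy hit is left for the override file (step 3).
  return best !== null && !tie ? { slug: best.slug, kind: 'fuzzy' } : null;
}
```

Callers would log every `kind: 'fuzzy'` result for review, as step 2 requires.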

Expected result: ~108 of 121 Brreg rows match a web slug. ~13 Brreg-only rows are dormant or recently registered (per norskfolkehjelp-activities.md, one was registered 2024-11-07).

B.5 Manual override file — [Q6]

For names that don't match (Brreg uses formal long-form, web uses short display), a small JSON file in the source folder:

// atlas-data/ingest/src/sources/folkehjelp-chapters/overrides.json
{
  "brreg_to_slug": {
    "871234567": "asker-og-baerum",        // Brreg orgnr → web slug
    "871234568": null                      // explicitly Brreg-only (no web page)
  },
  "slug_to_kommune_nr": {
    "asker-og-baerum": null,               // multi-kommune; chapter_kommune_coverage holds the list
    "solidaritetsungdom-bergen": "4601",   // non-place slug; assigned manually
    "svalbard": "2100",                    // Svalbard kommune code (not a real fylke)
    "sentralt": null                       // umbrella entity, no kommune
  },
  "slug_to_chapter_subtype": {
    "solidaritetsungdom-bergen": "youth-political",
    "studentgruppe-blindern": "student",
    "sanitet-haukeland": "hospital",
    "sentralt": "umbrella"
  }
}

The override file is committed (small, hand-curated). It's the escape hatch when the algorithmic match is wrong, the slug isn't a kommune name, or the chapter has a non-default chapter_subtype (vocabulary defined in INVESTIGATE-multi-ngo-supply-model-extensions.md).
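One subtlety in the file shape is that a `null` value and a missing key mean different things: `null` says "explicitly Brreg-only", a missing key says "no override, trust the algorithm". A minimal sketch of that precedence follows, with a hypothetical `resolveSlug` helper and the placeholder orgnrs from the example file.

```typescript
// Mirrors the three maps in overrides.json.
interface Overrides {
  brreg_to_slug: Record<string, string | null>;
  slug_to_kommune_nr: Record<string, string | null>;
  slug_to_chapter_subtype: Record<string, string>;
}

// Hypothetical resolver: an override entry always beats the algorithmic match.
function resolveSlug(
  orgnr: string,
  algorithmicMatch: string | null,
  overrides: Overrides,
): string | null {
  // hasOwnProperty distinguishes "key present with null" from "key absent".
  if (Object.prototype.hasOwnProperty.call(overrides.brreg_to_slug, orgnr)) {
    return overrides.brreg_to_slug[orgnr];
  }
  return algorithmicMatch;
}

const exampleOverrides: Overrides = {
  brreg_to_slug: { '871234567': 'asker-og-baerum', '871234568': null },
  slug_to_kommune_nr: { sentralt: null },
  slug_to_chapter_subtype: { sentralt: 'umbrella' },
};
```

With this shape, a wrong fuzzy match is corrected by adding one line to the file, with no code change.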

B.6 Brreg-only rows — [Q7] what to do with the ~13 unmatched

For each Brreg row with no matching web slug:

  • Insert a dim_chapter row with is_active=false, source_url=NULL, chapter_orgnr set.
  • Activity rows: none.
  • These show up in coverage queries as "registered but dormant" — useful signal for funders, not a Coverage-gap supply contributor.

This matches the "is_active inheritance" pattern from PLAN-002 (Red Cross): chapters that exist in the source register but have no public activity page are inactive.

B.7 Coverage population for NF

Per INVESTIGATE-multi-ngo-supply-model-extensions.md, chapter_kommune_coverage rows for NF's 14 regional chapters get populated as source='inferred' from membership: union the kommune_nrs of each region's child lokallag. Implemented in supply__folkehjelp_chapter_kommune_coverage.sql as part of the scrape PLAN.
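The "union the kommune_nrs of each region's child lokallag" rule is implemented in SQL, but the logic is easy to state in TypeScript for review purposes. The sketch below is illustrative only; the row shape and field names (`regionSlug`, `kommuneNr`) are hypothetical.

```typescript
// Hypothetical projection of a lokallag row after slug → kommune resolution.
interface LokallagRow {
  slug: string;
  regionSlug: string;
  kommuneNr: string | null; // null for multi-kommune or non-place chapters
}

// For each region, collect the distinct kommune_nrs of its child lokallag.
function inferRegionCoverage(rows: LokallagRow[]): Map<string, string[]> {
  const byRegion = new Map<string, Set<string>>();
  rows.forEach(row => {
    if (row.kommuneNr === null) return; // nothing to contribute
    const set = byRegion.get(row.regionSlug) ?? new Set<string>();
    set.add(row.kommuneNr);
    byRegion.set(row.regionSlug, set);
  });
  const coverage = new Map<string, string[]>();
  byRegion.forEach((set, region) => coverage.set(region, Array.from(set).sort()));
  return coverage;
}
```

Each resulting (region, kommune_nr) pair would become one chapter_kommune_coverage row with source='inferred'.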


Section C — Activity extraction

C.1 The "Aktivitetsområder" section — selector

Verified 2026-04-23 by fetching https://folkehjelp.no/lokallag/asker-og-baerum — the section uses semantic HTML with no CSS classes:

<h3>Aktivitetsområder</h3>
<h4>Førstehjelp og redningstjeneste</h4>
<p>Som frivillig innen førstehjelp...</p>
<h4>Sanitetsungdom</h4>
<p>Sanitetsungdomsgruppene er...</p>
<!-- … -->

Parser walks siblings rather than querying a class:

const labels: string[] = [];
// Compare and store NFC-normalised text (scraping infra §C.3), so an NFD 'å'
// in the heading does not trigger a false "missing heading" warning.
const heading = $('h3').filter((_, el) =>
  $(el).text().normalize('NFC').trim() === 'Aktivitetsområder'
).first();

if (heading.length === 0) {
  warn(`No 'Aktivitetsområder' heading found for ${slug}`);
} else {
  heading.nextUntil('h2, h3').filter('h4').each((_, el) => {
    labels.push($(el).text().normalize('NFC').trim());
  });
}
// labels: ['Førstehjelp og redningstjeneste', 'Sanitetsungdom', ...]

Empty section = heading exists but labels.length === 0 → write zero activity rows per [Q8]. Missing heading altogether = WARN (template change signal).

C.2 The 6 NF activity bins — fixed vocabulary

NF uses a closed set of 6 areas across all chapter pages:

  1. Førstehjelp og redningstjeneste
  2. Sanitetsungdom
  3. Samfunnsarbeid
  4. Flyktning og inkludering
  5. Internasjonale spørsmål
  6. Solidaritetsungdom

The staging model maps these verbatim. New labels appearing in scraped HTML are a WARN (template / vocabulary change) — see scraping infra E.1.

C.3 Empty Aktivitetsområder section — [Q8] Resolved

4 of 46 sample pages had an empty section. Decision: chapter is is_active=true, just write zero rows in raw.folkehjelp_chapter_activities for that slug. The chapter exists, has a public page, and is registered — the absence of listed activities means "we don't have activity data" not "this chapter does nothing". Coverage-gap queries that filter by service_category_code will simply skip these chapters; queries that count chapters per kommune still include them.

C.4 Mapping NF bins → ref_atlas_service_category — [Q9] three options

NF's 6 bins are coarser than Red Cross's 50 named activities. Each NF bin covers 1–N Atlas service categories. Three approaches considered:

Option A — 1:1 mapping (chosen)

Each NF bin maps to one dominant Atlas category. We lose visibility into the secondary categories an NF bin implicitly covers.

| NF bin | → service_category_code | What we lose |
|---|---|---|
| Førstehjelp og redningstjeneste | rescue_corps | first_aid_standby, first_aid_training (NF chapters do these but we don't surface them per-chapter) |
| Sanitetsungdom | youth_activity_groups | (none — direct match) |
| Samfunnsarbeid | youth_drop_in (closest) | crisis_shelter, elderly_visiting (umbrella term collapses several specifics) |
| Flyktning og inkludering | language_practice | migrant_mentoring, homework_help (folded into one bin) |
| Internasjonale spørsmål | international_solidarity (NEW row in ref_atlas_service_category) | |
| Solidaritetsungdom | youth_political_action (NEW row in ref_atlas_service_category) | |

Pros: clean cross-NGO sums (one fact row per bin; no double-counting). Honest about what the NGO actually publishes. Cleanest implementation.

Cons: chapters listed in "Førstehjelp og redningstjeneste" can't be retrieved by a service_category_code='first_aid_training' filter. We're representing the NGO's published shape, not the operational reality.

Note: Adding international_solidarity and youth_political_action to ref_atlas_service_category is itself a decision, because these categories are NF-unique today. The threshold rule "category exists if 2+ NGOs offer it" justifies the two new rows only if Amnesty / Natur og Ungdom / Redd Barna also count as offering them. Keep the rows; document the threshold concession in the seed CSV's comment.

Option B — 1:N fan-out (rejected)

Each NF chapter listing "Førstehjelp og redningstjeneste" generates 3 fact rows (rescue_corps + first_aid_standby + first_aid_training).

Pros: more accurate to the chapter's operational scope.

Cons: inflates row counts (a 6-bin chapter generates 12+ fact rows). Cross-NGO counts become misleading: NF chapters look like they have 2× the supply they actually do. The chapter doesn't actually run all three sub-activities equally — we'd be inventing data.

Option C — "umbrella" categories (rejected)

Add a parallel umbrella code (first_aid_general) that covers rescue + standby + training. NF maps to umbrella; Red Cross maps to specifics.

Pros: NF gets a single mapping; Red Cross keeps detail.

Cons: cross-NGO joins become "first_aid_general OR rescue_corps OR first_aid_standby OR first_aid_training" — ugly. Vocabulary bloat. No clean meaning.

C.5 Decision — [Q10] Option A

Map 1:1, accept the loss, document it. Surface the verbatim NF bin label as dim_activity.canonical_name so users browsing per-NGO see "Førstehjelp og redningstjeneste"; cross-NGO filters use service_category_code='rescue_corps' and pick up the chapter via the strongest signal. Add international_solidarity and youth_political_action to ref_atlas_service_category.

The mapping table goes in Appendix A.


Section D — NF-specific raw schema

Generic conventions (column patterns, hashes, raw.ingest_runs, raw.sitemap_log) are in the scraping infra; below is the NF-specific shape.

Convention recap. Scraping infra §C.5 requires every scraper-sourced parent-entity raw table to carry: url, record_hash, html_raw_hash (audit-only, nullable), is_active, loaded_at. Of the three tables below:

  • raw.folkehjelp_chapters is scraper-sourced + parent → carries the §C.5 column set.
  • raw.folkehjelp_chapter_activities is a scraper-sourced child → does not carry §C.5 columns; deletes-and-reinserts when the parent's record_hash changes.
  • raw.brreg_enheter (shared, shipped by PLAN-001-brreg-enheter) is API-sourced (Brreg JSON API, not HTML scrape) → §C.5 explicitly does not apply; follows existing raw.<source> conventions. Not NF-specific — every NGO's Brreg data lands here.

D.1 raw.folkehjelp_chapters — scraper parent

-- migration NNN_raw_folkehjelp_chapters.sql (NNN allocated after the shared
-- raw.sitemap_log + raw.ingest_runs migrations from the infra PLAN)
CREATE SCHEMA IF NOT EXISTS raw;

CREATE TABLE raw.folkehjelp_chapters (
  -- Source-specific identity
  slug             TEXT PRIMARY KEY,                  -- NF's stable kebab-cased identifier
  display_name     TEXT NOT NULL,
  fylke_group_name TEXT NOT NULL,                     -- from index page grouping (NF-defined region)
  brreg_orgnr      TEXT,                              -- nullable until Brreg match found

  -- §C.5 mandatory scraper columns
  url              TEXT NOT NULL UNIQUE,              -- public URL, joins to raw.sitemap_log.url verbatim
  record_hash      TEXT NOT NULL,                     -- sha256 of canonical JSON of extracted record (skip signal)
  html_raw_hash    TEXT,                              -- sha256 of canonical HTML body (audit-only, nullable)
  is_active        BOOLEAN NOT NULL DEFAULT true,
  loaded_at        TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX folkehjelp_chapters_brreg_orgnr_idx
  ON raw.folkehjelp_chapters(brreg_orgnr) WHERE brreg_orgnr IS NOT NULL;

For Folkehjelp specifically, url always equals 'https://folkehjelp.no/lokallag/' || slug — but it lives as its own column per §C.5 (the join key against raw.sitemap_log must be verbatim what the sitemap published, not reconstructed).
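The record_hash column is described as a sha256 over the "canonical JSON" of the extracted record. The shared helper owns the real implementation; the sketch below only illustrates one plausible meaning of "canonical", namely key-sorted serialisation, so that property order in the parser's output cannot change the hash. The function name is hypothetical and the hashing step itself is left out.

```typescript
// Hypothetical canonicaliser: object keys are emitted in sorted order at every
// depth; arrays keep their order (activity order is meaningful, see display_order).
function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) {
    return '[' + value.map(canonicalJson).join(',') + ']';
  }
  if (value !== null && typeof value === 'object') {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map(key => JSON.stringify(key) + ':' + canonicalJson(obj[key]))
      .join(',');
    return '{' + body + '}';
  }
  // Strings, numbers, booleans and null serialise as plain JSON.
  return JSON.stringify(value);
}
```

Two records that differ only in key order then serialise identically, so a refactor of parse.ts that reorders fields would not force a re-ingest of every chapter.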

D.2 raw.folkehjelp_chapter_activities — scraper child

Child of raw.folkehjelp_chapters. Per §C.5 child-table rule, no mandatory columns — children are owned by the parent and delete-and-reinserted when the parent's record_hash changes.

CREATE TABLE raw.folkehjelp_chapter_activities (
  slug           TEXT NOT NULL REFERENCES raw.folkehjelp_chapters(slug) ON DELETE CASCADE,
  activity_label TEXT NOT NULL,                     -- one of the 6 NF bins, verbatim, NFC-normalised
  display_order  INT NOT NULL,
  loaded_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
  PRIMARY KEY (slug, activity_label)
);

activity_label must be UTF-8 NFC-normalised at the parser boundary per scraping infra §C.3 — cheerio sometimes returns Norwegian characters in NFD form, which would silently flip the parent's record_hash between runs.
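A short demonstration of why this matters, using the bin label "Internasjonale spørsmål". Of the three Norwegian letters, only å has a canonical decomposition (a + U+030A combining ring); ø and æ are single code points in both forms. The helper name `nfcLabel` is hypothetical.

```typescript
// Hypothetical parser-boundary helper: normalise to NFC, then trim.
function nfcLabel(raw: string): string {
  return raw.normalize('NFC').trim();
}

// The same label in composed (NFC) and decomposed (NFD) form.
const composedLabel = 'Internasjonale sp\u00F8rsm\u00E5l'; // å as U+00E5
const decomposedLabel = 'Internasjonale sp\u00F8rsma\u030Al'; // å as a + U+030A

// The two render identically but are different strings, so they would
// produce different hashes and different primary-key values.
const looksEqualButIsNot = composedLabel !== decomposedLabel;

// After normalisation they are the same string.
const equalAfterNfc = nfcLabel(decomposedLabel) === composedLabel;
```

Because activity_label is part of the child table's primary key, skipping this step could also produce two rows for what is visually the same label.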

D.3 raw.brreg_enheter — API-sourced, cross-NGO shared

Superseded shape landed 2026-04-24 via PLAN-001-brreg-enheter. What this investigation originally proposed — an NF-specific raw.brreg_folkehjelp_units table — shipped instead as a shared cross-NGO raw.brreg_enheter table. Same API source (Brreg Enhetsregister via data.brreg.no); same JSON-API-not-scrape classification (§C.5 scraper column conventions don't apply); but one table serves every NGO, discriminated by navn prefix (navn ILIKE 'Norsk Folkehjelp%' for NF's rows).

Final shipped shape in atlas-data/migrations/025_raw_brreg_enheter.sql:

CREATE TABLE raw.brreg_enheter (
  orgnr                 TEXT PRIMARY KEY,
  navn                  TEXT NOT NULL,
  organisasjonsform     TEXT NOT NULL,                  -- 'FLI' for NGO foreninger
  registrert_dato       DATE,
  i_frivillighetsreg    BOOLEAN NOT NULL DEFAULT false,
  antall_ansatte        INTEGER,
  konkurs               BOOLEAN NOT NULL DEFAULT false,
  under_avvikling       BOOLEAN NOT NULL DEFAULT false,
  under_tvangsavvikling BOOLEAN NOT NULL DEFAULT false, -- underTvangsavviklingEllerTvangsopplosning
  raw_payload           JSONB NOT NULL,                 -- full Brreg Enhet entity
  loaded_at             TIMESTAMPTZ NOT NULL DEFAULT now()
);

Three additions vs this investigation's original D.3: konkurs, under_avvikling, under_tvangsavvikling. They come verbatim from the Brreg Enhet response and replace any synthetic "is_active" we'd have had to build — Brreg owns the concept, we just carry the flags.

For PLAN-002 (Folkehjelp scrape): when joining raw chapters to Brreg, filter via where navn ilike 'Norsk Folkehjelp%' (or whichever prefix matches the NGO's brreg_query.nameStartsWith in landscape.json).

D.4 Source-specific scraper config

Per scraping infra §E.2, every source declares an absolute floor for the discovery URL count. NF's index has ~108 chapters today; the floor is set at ~50 (roughly half), tuned upward when the source consistently exceeds 100.

// ingest/src/sources/folkehjelp-chapters/index.ts
const MIN_DISCOVERED_URLS = 50;
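How the floor might be enforced, as a minimal sketch: the check name `assertDiscoveryFloor` is hypothetical, and the rationale in the comment (stopping before orphan detection so a truncated sitemap cannot flip most chapters to is_active=false) is an inference from B.3, not something §E.2 is quoted as saying here.

```typescript
const DISCOVERY_FLOOR = 50; // same value as MIN_DISCOVERED_URLS above

// Hypothetical guard, meant to run right after discovery and before any
// skip-vs-fetch or orphan-detection work.
function assertDiscoveryFloor(discovered: number, floor: number = DISCOVERY_FLOOR): void {
  if (discovered < floor) {
    // Presumed intent: a half-empty sitemap is treated as a source failure,
    // not as evidence that most chapters disappeared.
    throw new Error(
      `Discovery returned ${discovered} URLs, below the floor of ${floor}; aborting run`,
    );
  }
}
```

With ~108 chapters today, a run that discovers 108 URLs passes and a run that discovers 12 hard-fails.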

D.5 Golden-file test fixtures

Per scraping infra §G.3, parse.ts is tested via golden-file fixtures. Target 2–3 fixtures spanning the page variants seen in the wild:

ingest/src/sources/folkehjelp-chapters/__tests__/
├── parse.test.ts
└── fixtures/
    ├── asker-og-baerum.html — typical chapter, all 6 activity bins listed
    ├── asker-og-baerum.expected.json
    ├── alta.html — empty Aktivitetsområder section (per [Q8])
    ├── alta.expected.json
    ├── solidaritetsungdom-bergen.html — non-geographic chapter
    └── solidaritetsungdom-bergen.expected.json

Each test reads the fixture HTML, calls parse(html, url), deep-equals against the expected JSON. Template changes that legitimately alter the record shape update the HTML and the JSON together in the same commit.


Decisions resolved during planning

  1. [Q1] robots.txt verification — Resolved 2026-04-23: allows /lokallag/*. Full disallow list documented in A.2.
  2. [Q3] Scrape both Brreg + folkehjelp.no, match on normalised name. Resolved 2026-04-23.
  3. [Q7] ~13 Brreg-only rows → dim_chapter with is_active=false. Resolved 2026-04-23.
  4. [Q8] Empty Aktivitetsområder → is_active=true, zero activity rows. Resolved 2026-04-23.
  5. [Q9]/[Q10] Activity mapping: Option A (1:1, lossy). Add international_solidarity + youth_political_action to ref_atlas_service_category. Resolved 2026-04-23.

Open Questions

  1. [Q2] Outreach to NF asking for a read-only Public GraphQL schema (see A.4). Email proposed; track response. Scrape PLAN ships regardless; positive response would replace the parser with a thinner GraphQL client.
  2. [Q4] brreg-folkehjelp-units (NF-specific) vs parameterised. Resolved 2026-04-24 during PLAN-001-brreg-enheter implementation: generic from day one. See §B.2 above.
  3. [Q5] Levenshtein threshold for fuzzy name match (proposed: ≤ 2). Tune against actual data during PLAN.
  4. [Q6] Initial contents of the override file. Discoverable only by running the algorithmic match against real data.
  5. [Q11] Scrape cadence — manual / weekly / monthly? Recommendation: manual via npm script for v1. NF chapter changes are slow (founding dates suggest ~1–2 new chapters/year); weekly is overkill. Add a CronCreate-based schedule when Atlas has a job-runner.

Next Steps

Pre-PLAN — outreach to NF (parallel, non-blocking):

  • Send email to NF asking whether they would enable a read-only Public GraphQL schema for localBranch / localBranchSites / activity entry types. See A.4 for rationale. Track response; if positive, the scrape PLAN may be replaced by a thinner GraphQL-client PLAN.

Prerequisites (other investigations' PLANs that must ship first):

Folkehjelp-specific PLANs:

  • PLAN-001-brreg-enheter.md — shipped 2026-04-24. Generic cross-NGO Brreg ingest (shared raw.brreg_enheter table, refresh:brreg-enheter script, per-NGO query config in landscape.json). Folkehjelp row: 122 enheter, 108 in Frivillighetsregisteret. Scope pulled forward from the originally-proposed NF-specific "brreg-folkehjelp-units" to cross-NGO from day one (see §B.2 + [Q4]).
  • PLAN-002-folkehjelp-scrape-and-ingest.md (~7–10h)
    • Migration NNN_raw_folkehjelp_chapters.sql + raw.folkehjelp_chapter_activities carrying the §C.5 mandatory columns on the parent (D.1) and child shape from D.2.
    • Scraper at ingest/src/sources/folkehjelp-chapters/ following the per-source folder convention: index.ts, discover.ts, parse.ts, overrides.json, types.ts, README.md, __tests__/.
    • parse.ts is a pure function (html, url) → Record with NFC normalisation on every string field (scraping infra §C.3).
    • discover.ts calls the shared sitemap_log reader/writer; no per-source orphan logic.
    • index.ts calls the shared ua, record_hash, ingest_runs (with concurrent-run lock per §E.3.1), and upsertRecord() helpers from ingest/src/lib/scraping/.
    • MIN_DISCOVERED_URLS = 50 in index.ts (per D.4).
    • 2–3 golden-file test fixtures under __tests__/fixtures/ covering: typical chapter, empty-activities chapter, non-geographic chapter (per D.5).
    • dbt staging: supply__folkehjelp_chapters.sql, supply__folkehjelp_chapter_activities.sql, supply__folkehjelp_chapter_kommune_coverage.sql.
    • UNION ALL into dim_chapter, dim_activity, fact_chapter_activities, chapter_kommune_coverage.
    • Two new ref_atlas_service_category seed rows (international_solidarity, youth_political_action).
    • End-of-run validation: raw.ingest_runs row written; mart_ingest_health shows the source as ok; row counts match expected chapter total; dbt test pass.

Each PLAN ends with dbt run && dbt test per the always-run-tests rule in project-atlas.md.


Files this investigation will produce

Reused from earlier work:

  • dim_ngo — Folkehjelp row already seeded.
  • ref_atlas_service_category — extended by 2 rows (international_solidarity, youth_political_action).
  • dim_chapter, dim_activity, fact_chapter_activities, chapter_kommune_coverage — UNION ALL adds the Folkehjelp source.
  • dim_kommune, dim_postnummer, crosswalk_kommune_name — used during slug → kommune resolution.

Inherited from companion infrastructure (no new work in Folkehjelp PLANs):

  • raw.sitemap_log (shared, populated by Folkehjelp's discover.ts via shared lib)
  • raw.ingest_runs (shared, populated by Folkehjelp's index.ts via shared lib)
  • mart_ingest_health (shared)

New scraper-sourced tables (carry §C.5 mandatory columns where applicable):

  • raw.folkehjelp_chapters — parent, full §C.5 column set (PLAN-002)
  • raw.folkehjelp_chapter_activities — child, no §C.5 columns (PLAN-002)

Shared cross-NGO API-sourced tables (§C.5 not applicable):

  • raw.brreg_enheter — shipped by PLAN-001-brreg-enheter; Folkehjelp rows filter via navn ILIKE 'Norsk Folkehjelp%'.

New dbt models:

  • marts.supply__folkehjelp_chapters, marts.supply__folkehjelp_chapter_activities, marts.supply__folkehjelp_chapter_kommune_coverage (PLAN-002)

New ingest folders:

  • atlas-data/ingest/src/sources/folkehjelp-chapters/ — PLAN-002; new scraper with __tests__/fixtures/.
  • atlas-data/ingest/src/seed-sources/brreg-folkehjelp-units/ — not created; PLAN-001 shipped the generic src/seed-sources/brreg-enheter/ instead, driven by landscape.json.

Appendix A — NF activity → ref_atlas_service_category mapping (Option A, 1:1)

| NF bin (verbatim) | service_category_code | dim_activity.canonical_name | Notes |
|---|---|---|---|
| Førstehjelp og redningstjeneste | rescue_corps | Førstehjelp og redningstjeneste | Most distinctive of the 3 it covers (rescue + standby + training). |
| Sanitetsungdom | youth_activity_groups | Sanitetsungdom | Direct match with Red Cross RØFF. |
| Samfunnsarbeid | youth_drop_in | Samfunnsarbeid | Closest single match for the umbrella; lossy. |
| Flyktning og inkludering | language_practice | Flyktning og inkludering | Strongest signal (språkkafé is the most common sub-activity per research). |
| Internasjonale spørsmål | international_solidarity | Internasjonale spørsmål | NEW row in ref_atlas_service_category (PLAN-002). |
| Solidaritetsungdom | youth_political_action | Solidaritetsungdom | NEW row in ref_atlas_service_category (PLAN-002). |

Each chapter gets one fact_chapter_activities row per listed NF bin. dim_activity has 6 NF rows (one per bin), each with ngo_orgnr=871033552.
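The mapping itself is meant to live in the dbt staging model (the supply__folkehjelp_chapter_activities CASE mentioned in A.3). For reviewers who prefer to see it as data, here is the same table as a TypeScript lookup with the unmapped-label warning. This is a sketch; `mapBin` and the injected `warn` callback are hypothetical.

```typescript
// The six rows of the table above, keyed by the verbatim NF bin label.
const NF_BIN_TO_CATEGORY: Record<string, string> = {
  'Førstehjelp og redningstjeneste': 'rescue_corps',
  'Sanitetsungdom': 'youth_activity_groups',
  'Samfunnsarbeid': 'youth_drop_in',
  'Flyktning og inkludering': 'language_practice',
  'Internasjonale spørsmål': 'international_solidarity',
  'Solidaritetsungdom': 'youth_political_action',
};

// Hypothetical lookup: returns the service_category_code, or null plus a
// warning when the label is outside the closed 6-bin vocabulary.
function mapBin(label: string, warn: (message: string) => void): string | null {
  const key = label.normalize('NFC').trim();
  if (Object.prototype.hasOwnProperty.call(NF_BIN_TO_CATEGORY, key)) {
    return NF_BIN_TO_CATEGORY[key];
  }
  warn(`unmapped activity_label "${key}"`);
  return null;
}
```

Returning null instead of throwing matches the warn-and-continue policy in A.3: a seventh bin appearing in the CMS should surface in logs without failing the chapter.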


Appendix B — Verified chapter-page HTML structure

Verified 2026-04-23 by fetching https://folkehjelp.no/lokallag/asker-og-baerum. The Aktivitetsområder section uses semantic HTML with no CSS classes:

<h3>Aktivitetsområder</h3>

<h4>Førstehjelp og redningstjeneste</h4>
<p>Som frivillig innen førstehjelp og redningstjeneste...</p>

<h4>Sanitetsungdom</h4>
<p>Sanitetsungdomsgruppene er...</p>

<!-- repeats for each of the 0–6 activity bins listed -->

Implication: the parser cannot rely on a wrapping <section class="…"> selector. Instead, find the <h3> whose text equals "Aktivitetsområder" and walk forward through siblings (<h4> items) until the next <h2> or <h3> or end-of-document. See C.1 for the cheerio implementation.

If a future Craft template change wraps the section in a class, the parser still works — but a follow-up should add a class-based selector path that's more robust to non-Aktivitetsområder <h3> insertions.