# `proportions_CA_table.csv`
The input table to the `proportions_CA` correspondence analysis (Swedish municipalities, 2022 cross-section), with one row per municipality and one column per variable. Values have already been normalised (see "How it was built" below). The companion file `proportions_CA_table_columns.csv` lists, for every column, its **role** (active or supplementary) and its **block**.
## Rows
290 Swedish municipalities, identified by `code` (4-digit SCB code, zero-padded) and `municipality` name.
## Columns
Every other column is a variable used in the CA. Variables are grouped into eight thematic *blocks*. Six are **active** (they shape the CA axes); two are **supplementary** (projected onto the axes for interpretation but not used to build them).
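As a minimal sketch of how the companion file can be used to separate active from supplementary columns, the snippet below reads a small inline excerpt. The header names (`column`, `role`, `block`) and the example rows other than `Upper secondary` are assumptions made for illustration; check the actual header of `proportions_CA_table_columns.csv` before reusing it.

```python
import csv
import io

# Hypothetical excerpt of proportions_CA_table_columns.csv.
# Header names and the `Rented` / `Preschool total` rows are assumptions.
COLUMNS_CSV = """column,role,block
Upper secondary,active,education
Rented,active,housing
Preschool total,supplementary,provision
"""


def split_by_role(columns_file):
    """Group column names by their role (active or supplementary)."""
    roles = {"active": [], "supplementary": []}
    for row in csv.DictReader(columns_file):
        roles[row["role"]].append(row["column"])
    return roles


roles = split_by_role(io.StringIO(COLUMNS_CSV))
print(roles["active"])         # columns that shape the CA axes
print(roles["supplementary"])  # columns projected onto the axes afterwards
```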
### Active blocks (6)
| Block | Variables (n) | Content |
| -------------- | ------------- | ----------------------------------------------------------------------------------------------- |
| `education` | 4 | Four levels of educational attainment, primary/lower-secondary through post-graduate. |
| `employment` | 16 | 16 activity sectors (NACE-style), agriculture through arts & recreation. |
| `housing` | 3 | Rented, tenant-owned, and owner-occupied dwellings (sum across building types). |
| `workplace` | 3 | Commuters in, commuters out, working and living in the same municipality. |
| `migration` | 2 | In-migrations and out-migrations. |
| `demography` | 2 | Number of retirees, number of localities. Pooled into one block so block normalisation can run. |
#### A note on the demography block
Pooling *retirees* and *localities* is partly a technical workaround. Block normalisation needs at least two variables in a block, otherwise the rescaling collapses every row to the same constant and the variable carries no information.
Both variables describe **how the population is distributed**: retirees say something about *who lives there* (ageing concentration), localities say something about *where they live* (how many separate settlements the municipality has). The contrast the block ends up encoding (retirees relative to localities) is in effect "people concentrated per settlement" versus "people spread thinly across settlements". A municipality with many retirees per locality reads as an ageing population concentrated in a few settlements; one with many localities per retiree reads as a population spread thinly across small ones. That contrast lines up with the urban-rural gradient the analysis is built to detect.
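To see why a one-variable block fails under block normalisation, here is a small sketch with made-up counts. The function name `normalise_block` and all numbers are illustrative assumptions, not taken from the pipeline.

```python
def normalise_block(values, total=1000):
    """Rescale one municipality's values within a block so they sum to `total`."""
    block_sum = sum(values)
    return [v * total / block_sum for v in values]


# One variable per block: every municipality collapses to the same constant.
retirees_only = [[1200], [45000], [310]]  # made-up retiree counts
collapsed = [normalise_block(row) for row in retirees_only]
print(collapsed)

# Two variables per block: the retirees/localities contrast survives.
retirees_and_localities = [[1200, 12], [45000, 9], [310, 4]]  # made-up counts
pooled = [normalise_block(row) for row in retirees_and_localities]
print(pooled)
```

In the first case each row becomes `[1000.0]` regardless of the raw count, so the column has no variance left for the CA to use.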
### Supplementary blocks (2)
| Block | Variables (n) | What it captures |
| ------------- | ------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `provision` | up to 33 | Counts of educational institution units by type (preschool, primary, secondary, adult, higher education) × {total, public, private}. Some columns with no observations anywhere are dropped automatically. |
| `opinion` | 9 | Survey-based satisfaction with preschool, elementary school and high school (bad / mid / good shares). |
## How it was built
The pipeline lives in `src/municipalities/04-proportions_CA.R`. The table you see in this CSV is the result of four steps.
### 1. Backfilled 2022 cross-section
For every variable, the value used is the 2022 figure if available; otherwise the most recent earlier figure for that municipality (priority: 2020-2021 window, then census-closest, then discontinued-last).
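The basic fallback rule can be sketched as follows. This is a simplified illustration, not the R pipeline: it implements only "2022 if available, otherwise the most recent earlier year" and does not model the priority tiers (2020-2021 window, census-closest, discontinued-last). The function name `backfill_2022` and the sample series are hypothetical.

```python
def backfill_2022(series, target_year=2022):
    """Pick the target-year value, else the most recent earlier one.

    `series` maps year -> value (None when missing) for one variable in one
    municipality. Returns (year_used, value), or (None, None) when no value
    exists at or before the target year.
    """
    candidates = [
        year for year, value in series.items()
        if value is not None and year <= target_year
    ]
    if not candidates:
        return None, None
    year_used = max(candidates)
    return year_used, series[year_used]


# Made-up series for illustration.
print(backfill_2022({2020: 310, 2021: 325, 2022: 330}))    # 2022 is present
print(backfill_2022({2020: 310, 2021: 325, 2022: None}))   # falls back to 2021
print(backfill_2022({2018: 290, 2021: None, 2022: None}))  # falls back to 2018
```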
### 2. Block normalisation
Inside each *active* block, every municipality's row is rescaled so the block total is the same constant (1000); the opinion block is rescaled the same way within each of its preschool/elementary/high school triples. After this step:
- every municipality contributes the same row mass to the CA (no size effect), and
- every block contributes the same total weight (no block dominates because its raw counts are bigger).
So the value in, say, the `Upper secondary` column for Upplands Väsby (435.2) reads as "435.2 of 1000 within the education block for that municipality"; i.e. ~43.5% of the municipality's educated population has attained upper-secondary level.
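A sketch of block normalisation for one municipality across several blocks is given below. The raw counts are invented so that the `Upper secondary` share comes out at 435.2; every column name except `Upper secondary` and the helper name `block_normalise` are assumptions for illustration.

```python
def block_normalise(row, blocks, total=1000):
    """Rescale each block of `row` so that the block's values sum to `total`."""
    out = {}
    for block, cols in blocks.items():
        block_sum = sum(row[c] for c in cols)
        if block_sum == 0:
            raise ValueError(f"block {block!r} has no observations")
        for c in cols:
            out[c] = row[c] * total / block_sum
    return out


# Hypothetical column labels and made-up raw counts.
blocks = {
    "education": ["Primary", "Upper secondary", "Post-secondary", "Post-graduate"],
    "migration": ["Inmigrations", "Outmigrations"],
}
raw = {
    "Primary": 1500, "Upper secondary": 4352,
    "Post-secondary": 4000, "Post-graduate": 148,
    "Inmigrations": 900, "Outmigrations": 600,
}

normalised = block_normalise(raw, blocks)
print(normalised["Upper secondary"])  # per-mille share within education
print(normalised["Inmigrations"])     # per-mille share within migration
```

Because each block is rescaled separately, the education and migration values both sum to 1000 even though their raw totals differ by almost an order of magnitude.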
### 3. Provision rescaled as a per-capita rate
The supplementary `provision` columns are *not* block-normalised. Instead each count is divided by the row sum of the education block (a population proxy) and multiplied by 100 000, giving an "institutions per 100 000 inhabitants" rate. Block normalisation here would assign artificially high "shares" to rare institution types in small municipalities that happen to host one, throwing their supplementary projection far outside the active cloud.
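The difference between the two rescalings can be illustrated with a small municipality that hosts a single institution of a rare type. The sketch assumes that the education row sum used as the population proxy is the raw sum before block normalisation; the function name `per_100k` and all counts are made up.

```python
def per_100k(count, population_proxy):
    """Institutions per 100 000 inhabitants, given a population proxy."""
    return count * 100_000 / population_proxy


# Made-up small municipality: one unit of a rare institution type, nothing else.
provision_counts = [0, 0, 1]
education_row_sum = 2000  # assumed raw (pre-normalisation) sum, used as proxy

# Block normalisation would give the rare type the whole block.
shares = [c * 1000 / sum(provision_counts) for c in provision_counts]
print(shares)

# The per-capita rate keeps the value on an interpretable scale.
rate = per_100k(provision_counts[2], education_row_sum)
print(rate)
```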
### 4. Renaming, drop-empty
Columns are renamed to short readable labels (see `proportions_CA_table_columns.csv` for the mapping). Any column that is zero in every municipality (e.g. an institution type with no private units anywhere) is dropped.
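The drop-empty part of this step could look like the sketch below (renaming is left out). The helper name `drop_empty_columns`, the column labels, and the values are hypothetical.

```python
def drop_empty_columns(rows):
    """Remove every column whose value is zero in all rows."""
    keep = [col for col in rows[0] if any(row[col] != 0 for row in rows)]
    return [{col: row[col] for col in keep} for row in rows]


# Made-up rows; `HE private` is zero everywhere and should be dropped.
rows = [
    {"code": "0114", "Preschool public": 12.0, "HE private": 0},
    {"code": "0115", "Preschool public": 7.5, "HE private": 0},
]

cleaned = drop_empty_columns(rows)
print(list(cleaned[0]))
```

Identifier columns such as `code` are strings, so they never compare equal to zero and are always kept.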
## How to read a row
A row is a municipality's *profile* across all blocks. Within each active block the values sum to 1000 and can be read as per-mille shares; provision values are rates per 100 000 inhabitants; opinion values are also normalised to a per-1000 share within their preschool/elementary/high school triples. Across blocks the values aren't comparable as raw numbers; each block is comparable to *itself* across municipalities, not to the other blocks.
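One way to sanity-check this reading is to confirm that each active block in a row sums to 1000. The sketch below uses a tolerance because values in the CSV appear to be rounded to one decimal; apart from `Upper secondary` at 435.2, the column names and values are invented, and `check_block_sums` is a hypothetical helper.

```python
import math


def check_block_sums(row, blocks, total=1000, tol=0.5):
    """Return {block: actual_sum} for blocks that do not sum to `total`."""
    off = {}
    for block, cols in blocks.items():
        block_sum = sum(row[c] for c in cols)
        if not math.isclose(block_sum, total, abs_tol=tol):
            off[block] = block_sum
    return off


# Hypothetical column labels and values for a single municipality.
blocks = {
    "education": ["Primary", "Upper secondary", "Post-secondary", "Post-graduate"],
    "migration": ["Inmigrations", "Outmigrations"],
}
row = {
    "Primary": 150.0, "Upper secondary": 435.2,
    "Post-secondary": 400.0, "Post-graduate": 14.8,
    "Inmigrations": 600.0, "Outmigrations": 400.0,
}

problems = check_block_sums(row, blocks)
print(problems)  # an empty dict means every block sums to 1000
```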