---
title: "An empirical urban--rural typology of Swedish municipalities"
subtitle: "Correspondence analysis and hierarchical clustering on the 2022 sampling"
date: today
format:
  pdf:
    toc: true
    number-sections: true
    fig-pos: "H"
    geometry: margin=2.5cm
    fontsize: 11pt
execute:
  echo: false
  warning: false
  message: false
---
```{r rootdir, include=FALSE}
knitr::opts_knit$set(root.dir = rprojroot::find_root(rprojroot::has_dir("data")))
```
```{r setup}
library(tidyverse)
library(FactoMineR)
library(factoextra)
library(ggrepel)
library(knitr)
library(kableExtra)
library(showtext)
font_add_google("Source Sans 3", "source_sans_3")
showtext_auto()
theme_set(theme_minimal(base_size = 10, base_family = "source_sans_3"))
afc <- read_rds("data/processed/proportions_CA.rds")
hcpc <- read_rds("data/processed/proportions_HCPC.rds")
clusters <- read_csv("data/processed/cluster_assignment.csv", show_col_types = FALSE)
# County labels for context
county_names <- c(
"01" = "Stockholm", "03" = "Uppsala", "04" = "Södermanland",
"05" = "Östergötland", "06" = "Jönköping", "07" = "Kronoberg",
"08" = "Kalmar", "09" = "Gotland", "10" = "Blekinge",
"12" = "Skåne", "13" = "Halland", "14" = "Västra Götaland",
"17" = "Värmland", "18" = "Örebro", "19" = "Västmanland",
"20" = "Dalarna", "21" = "Gävleborg", "22" = "Västernorrland",
"23" = "Jämtland", "24" = "Västerbotten", "25" = "Norrbotten"
)
# Custom cluster labels and palette derived from the analysis below
cluster_labels <- c(
"1" = "Industrial-rural towns",
"2" = "Peri-rural commuter belt",
"3" = "Regional service centres",
"4" = "Metropolitan & knowledge hubs"
)
cluster_palette <- c(
"1" = "#D7A86E",
"2" = "#7CB07C",
"3" = "#6FA8DC",
"4" = "#C2738B"
)
clusters <- clusters |>
mutate(cluster_label = cluster_labels[as.character(cluster)])
```
# What the analysis does
Sweden's 290 municipalities are routinely sorted into urban and rural by administrative or population-size rules. These categories are convenient but external to the data. The work reported here builds a typology *from* the data: we look at how municipalities differ on a wide set of structural variables (education, employment, housing, mobility, demography) and let those differences group municipalities together.
There are two steps. First, a correspondence analysis (CA) places every municipality in a low-dimensional space where distance reflects how similar the municipalities are on the active variables. Second, a hierarchical clustering on that space (Ward's method) cuts the cloud into a small number of groups. The (preliminary) result is a four-cluster typology that runs from industrial and rural municipalities at one end to wealthy metropolitan suburbs and knowledge hubs at the other.
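
The two steps can be illustrated on toy data with base R alone. This is a minimal sketch, not the project's pipeline: the table `N`, its dimensions and the names `muni_*` and `var_*` are invented for illustration, and the clustering here is unweighted, whereas the report's results come from the `FactoMineR` objects loaded in the setup chunk.

```r
# Toy contingency table: 12 hypothetical municipalities x 5 hypothetical variables.
set.seed(1)
N <- matrix(rpois(60, lambda = 30) + 1, nrow = 12,
            dimnames = list(paste0("muni_", 1:12), paste0("var_", 1:5)))

# Step 1: correspondence analysis via the SVD of the standardised residuals.
P  <- N / sum(N)                       # correspondence matrix
r  <- rowSums(P)                       # row masses
cm <- colSums(P)                       # column masses
S  <- (P - outer(r, cm)) / sqrt(outer(r, cm))
sv <- svd(S)
inertia <- sv$d^2                      # principal inertias (eigenvalues)
pct     <- 100 * inertia / sum(inertia)  # share of inertia per dimension

# Row principal coordinates on the first two dimensions.
row_coord <- sweep(sv$u[, 1:2], 1, sqrt(r), "/") %*% diag(sv$d[1:2])
rownames(row_coord) <- rownames(N)

# Step 2: Ward clustering on those coordinates, cut into four groups.
tree   <- hclust(dist(row_coord), method = "ward.D2")
groups <- cutree(tree, k = 4)
table(groups)
```

On random toy counts the four groups carry no meaning; the point is only the order of operations: masses, residuals, SVD, coordinates, then Ward on the coordinates.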
# Inputs and preparation
The data come from Statistics Sweden via the project's 2022 sampling. Six variable blocks are used as **active**:
- **Education:** Four levels of attainment (primary/lower secondary through post-graduate)
- **Employment:** 16 activity sectors (agriculture through arts & recreation)
- **Housing:** Rented, tenant-owned, and owner-occupied dwellings
- **Workplace:** Commuters in, commuters out, working & living in the same municipality
- **Migration:** In-migration and out-migration
- **Demography:** Retirees and number of localities
Two further blocks are projected as **supplementary**:
- **Educational provision:** Counts of preschool, primary, secondary, adult and higher-education units by ownership (total, public, private)
- **Opinion:** Survey-based satisfaction on preschool, elementary and high school
## Two adjustments to the previous CA
A plain CA on the raw counts puts Stockholm, Göteborg and Malmö far out on the first axis simply because they are large. The axis becomes a size axis and tells us little about *how* municipalities differ. Two adjustments fix this:
1. **Block normalisation.** Within each variable block, every municipality is rescaled so that its block-total is the same constant. After this step every municipality contributes the same row mass to the CA, and every block contributes the same total weight. The axes are no longer dominated by absolute counts.
2. **Provision as a per-capita rate.** The supplementary provision counts are divided by a population proxy (the row sum of the education block) so that "five universities in Stockholm" and "one university in Skellefteå" are compared as rates rather than raw counts. Without this step, rare types like *university* or *Sami school* are pulled to extreme positions and clutter the biplot; with it they are still pulled outwards, but less so.
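
Both adjustments are simple row operations. The sketch below shows them on invented numbers; the municipality names, the block constant of 100, and the vectors `provision` and `education_total` are assumptions made for illustration, not values from the project data.

```r
# Toy counts for three hypothetical municipalities and two blocks.
counts <- data.frame(
  rented = c(500, 40, 10), tenant_owned = c(300, 20, 5),
  owner_occupied = c(200, 140, 85),
  in_migration = c(90, 12, 3), out_migration = c(60, 8, 7),
  row.names = c("Large", "Medium", "Small")
)
blocks <- list(
  housing   = c("rented", "tenant_owned", "owner_occupied"),
  migration = c("in_migration", "out_migration")
)
block_total <- 100  # hypothetical constant; any fixed value would do

# Adjustment 1: rescale each row so that every block sums to block_total.
normalise_block <- function(df, vars, total) {
  x <- as.matrix(df[, vars, drop = FALSE])
  x / rowSums(x) * total
}
normalised <- do.call(
  cbind,
  unname(lapply(blocks, function(v) normalise_block(counts, v, block_total)))
)
rowSums(normalised)  # identical for every municipality: equal row mass

# Adjustment 2: supplementary provision counts as a per-capita rate.
provision       <- c(Large = 5, Medium = 1, Small = 0)        # toy unit counts
education_total <- c(Large = 1000, Medium = 200, Small = 100) # toy population proxy
provision_rate  <- provision / education_total
provision_rate
```

After the first adjustment the large municipality no longer outweighs the small one: all three rows sum to the same total, and each block contributes the same weight.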
## A note on the demography block
Two of the variables we wanted to use, *Number of retirees* and *Number of localities*, do not belong to any of the natural topics described above. They are stand-alone counts that, on their own, would simply scale with municipality size.
The normalisation step we use needs at least **two variables** in a block. With only one variable, the rescaling "divide each row by its own block total" turns every value into the same constant, and the variable carries no information into the CA. So a one-variable block disappears from the analysis.
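
A short demonstration of why a one-variable block vanishes, using invented counts for three hypothetical municipalities (`A`, `B`, `C`) and an arbitrary block constant of 100:

```r
# Hypothetical counts for three municipalities.
retirees   <- c(A = 1200, B = 300, C = 45)
localities <- c(A = 4, B = 6, C = 9)

# Rescale each row so that the block total equals a constant (100 here).
normalise_block <- function(x, total = 100) x / rowSums(x) * total

one_var  <- normalise_block(cbind(retirees))
two_vars <- normalise_block(cbind(retirees, localities))

one_var   # every row equals the constant: no information left
two_vars  # rows now differ: the block carries a contrast
```

With a single column, each value is divided by itself, so the column is constant across municipalities. With two columns the rescaled values differ from row to row, which is what the CA needs.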
We had three options for retirees and localities:
- **Drop them.** They would not influence the result at all. We would lose information that is meaningful to the urban--rural question (an older population, a denser locality structure).
- **Attach each to an existing topic.** There is no natural home: retirees are not "employment" and localities are not "housing".
- **Group the two together into their own block** ("Demography"). Each municipality's retirees and localities are rescaled so that the two of them add up to the same constant, just like every other block.
We went with the third option for now. The price is that the resulting share, "retirees as a share of (retirees + localities)", mixes units (people and places), so it is not a tidy proportion in the way "share of housing that is rented" is. The benefit is that the *contrast* it captures is meaningful: a municipality with many retirees per locality (an ageing population concentrated in a few settlements) sits on one side, and one with many localities per retiree (a population spread thinly across many small settlements) on the other. That contrast lines up with the urban--rural gradient we care about, and so it is kept (for now).
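
To make the contrast explicit, write $R_i$ for the number of retirees and $L_i$ for the number of localities in municipality $i$, and $c$ for the block constant (this notation is introduced here for illustration only). The rescaled retiree value is then

$$
s_i \;=\; c\,\frac{R_i}{R_i + L_i} \;=\; \frac{c}{1 + L_i / R_i},
$$

so $s_i$ depends on the two counts only through the ratio $R_i / L_i$ and increases with it. The block therefore encodes "retirees per locality" rather than either count on its own, which is why it no longer scales with municipality size.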
# The space of municipalities
```{r tbl-eig}
#| tbl-cap: "Eigenvalues of the first five CA dimensions. Dimensions 1 and 2 together account for two-thirds of the variability."
eig <- as.data.frame(afc$eig[1:5, ]) |>
rownames_to_column("Dimension") |>
transmute(
Dimension = str_replace(Dimension, "^dim ", "Dim "),
Eigenvalue = round(eigenvalue, 4),
`% inertia` = round(`percentage of variance`, 1),
`Cumulative %` = round(`cumulative percentage of variance`, 1)
)
kbl(eig, booktabs = TRUE, linesep = "") |>
kable_styling(font_size = 9, latex_options = "hold_position")
```
```{r fig-biplot}
#| fig-width: 9
#| fig-height: 6.5
#| fig-cap: "CA biplot, first two dimensions. Municipalities (dots) are coloured by their cluster (see next section). The most contributive active variables are shown in red."
row_df <- as.data.frame(afc$row$coord[, 1:2]) |>
rownames_to_column("municipality") |>
left_join(clusters, by = "municipality")
contribs <- as.data.frame(afc$col$contrib) |>
rownames_to_column("variable") |>
mutate(total = `Dim 1` + `Dim 2`) |>
arrange(desc(total)) |>
head(15)
col_df <- as.data.frame(afc$col$coord[, 1:2]) |>
rownames_to_column("variable") |>
filter(variable %in% contribs$variable)
label_munis <- c("Stockholm", "Göteborg", "Malmö", "Uppsala", "Lund", "Umeå",
"Linköping", "Solna", "Danderyd", "Kiruna", "Gotland",
"Knivsta", "Falköping", "Tomelilla")
ggplot() +
geom_hline(yintercept = 0, linetype = "dashed", colour = "grey60") +
geom_vline(xintercept = 0, linetype = "dashed", colour = "grey60") +
geom_point(data = row_df, aes(`Dim 1`, `Dim 2`, colour = cluster_label),
alpha = 0.7, size = 1.8) +
geom_text_repel(data = row_df |> filter(municipality %in% label_munis),
aes(`Dim 1`, `Dim 2`, label = municipality),
size = 2.7, colour = "grey20", family = "source_sans_3",
max.overlaps = 30, segment.size = 0.2) +
geom_point(data = col_df, aes(`Dim 1`, `Dim 2`),
shape = 17, colour = "firebrick", size = 2) +
geom_text_repel(data = col_df, aes(`Dim 1`, `Dim 2`, label = variable),
colour = "firebrick", size = 2.7, family = "source_sans_3",
max.overlaps = 30, segment.size = 0.2) +
scale_colour_manual(values = cluster_palette |> set_names(cluster_labels),
name = "Cluster") +
labs(
x = paste0("Dim 1 (", round(afc$eig[1, 2], 1), "%)"),
y = paste0("Dim 2 (", round(afc$eig[2, 2], 1), "%)")
) +
theme(legend.position = "bottom")
```
**Dim 1 (`r round(afc$eig[1, 2], 1)`%)** is a rural-to-urban axis. To the left sit modalities tied to extractive and traditional employment (agriculture, mining & manufacturing), owner-occupied housing, and primary or upper-secondary education; the municipalities there are northern, industrial or sparsely populated (e.g. Kiruna, Gotland, Falköping). To the right sit modalities tied to the knowledge economy (IT & communication, finance, professional services), post-secondary and post-graduate attainment, and tenant-owned (apartment) housing; the right pole is anchored by Solna, Stockholm, Sundbyberg, Lund and Uppsala.
**Dim 2 (`r round(afc$eig[2, 2], 1)`%)** distinguishes municipalities by how their residents relate to the labour market. The top of the axis is anchored by "Working & living in the municipality": self-contained labour markets, often regional centres like Göteborg, Malmö, Umeå and Uppsala. The bottom is dominated by "Commuters out": residential municipalities whose population works elsewhere, typically metropolitan suburbs such as Knivsta and Salem near Stockholm or Staffanstorp near Malmö and Lund.
# A first attempt at an empirical typology: four clusters
Hierarchical clustering (Ward) on the CA row coordinates yields four clusters: in the inertia profile, going from three to four clusters brings a large gain, while a fifth cluster adds only a small refinement.
```{r fig-dendrogram}
#| fig-width: 9
#| fig-height: 4
#| fig-cap: "Ward dendrogram on the CA coordinates. The cut into four clusters falls where the merge cost rises sharply."
tree <- hcpc$call$t$tree
h_max <- max(tree$height)
fviz_dend(tree, k = 4, show_labels = FALSE,
rect = TRUE, rect_border = unname(cluster_palette),
rect_fill = TRUE, k_colors = unname(cluster_palette),
main = "", ylab = "Merge distance (Ward)") +
coord_cartesian(ylim = c(0, h_max * 1.05)) +
guides(linewidth = "none") +
theme(plot.title = element_blank())
```
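
The "sharp rise" criterion can also be read off numerically from the merge heights of a Ward tree. The sketch below does this on simulated coordinates with four well-separated groups; the data, the centres and the object names are invented for illustration, and `hclust()` is used as a stand-in for the tree stored in the `HCPC` object.

```r
# Simulated 2-D coordinates: four hypothetical groups of 15 points each.
set.seed(42)
centres <- rbind(c(-3, 0), c(-1, -2), c(1, 2), c(3, -1))
coords  <- centres[rep(1:4, each = 15), ] + matrix(rnorm(120, sd = 0.3), ncol = 2)

tree <- hclust(dist(coords), method = "ward.D2")

# rev(tree$height)[j] is the cost of the merge that takes j + 1 clusters to j,
# i.e. what is gained by cutting at j + 1 clusters instead of j.
merge_cost <- rev(tree$height)
profile <- data.frame(
  clusters = 2:7,
  gain     = round(merge_cost[1:6], 2)
)
profile

# The gain for the fourth cluster is large; the fifth adds comparatively little.
merge_cost[3] / merge_cost[4]
```

A large ratio between the gain at four clusters and the gain at five is the numerical counterpart of the visual gap in the dendrogram.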
```{r tbl-sizes}
#| tbl-cap: "Cluster sizes and labels."
clusters |>
count(cluster, cluster_label, name = "n") |>
mutate(`%` = round(100 * n / sum(n), 1)) |>
transmute(Cluster = cluster, Label = cluster_label,
`Municipalities` = n, `Share (%)` = `%`) |>
kbl(booktabs = TRUE, linesep = "") |>
kable_styling(font_size = 9, latex_options = "hold_position")
```
## Cluster profiles
For every cluster we list the variables whose mean inside the cluster departs most strongly from the national mean (`v.test` statistic from `HCPC`). A high positive value means the variable is *over*-represented in the cluster.
```{r}
#| results: asis
desc_table <- function(cluster_id, n_show = 6) {
d <- hcpc$desc.var[[as.character(cluster_id)]] |>
as.data.frame(check.names = FALSE) |>
rownames_to_column("Variable") |>
arrange(desc(v.test)) |>
head(n_show) |>
transmute(
Variable = str_replace_all(Variable, "\\.", " "),
`Share in cluster (%)` = round(`Intern %`, 2),
`Share overall (%)` = round(`glob %`, 2),
`v.test` = round(v.test, 1)
)
d
}
paragon_list <- function(cluster_id, n_show = 8) {
paste(names(head(hcpc$desc.ind$para[[as.character(cluster_id)]], n_show)),
collapse = ", ")
}
for (k in 1:4) {
cat("\n\n### Cluster ", k, ": ", cluster_labels[as.character(k)], " \n\n", sep = "")
cat("**Typical municipalities** (closest to the cluster centre): *",
paragon_list(k), ".*\n\n", sep = "")
print(
kbl(desc_table(k), booktabs = TRUE, linesep = "",
caption = paste0("Top over-represented variables, cluster ", k, ".")) |>
kable_styling(font_size = 9, latex_options = c("hold_position", "scale_down"))
)
}
```
## What each cluster looks like
```{r fig-cluster-biplot}
#| fig-width: 9
#| fig-height: 6.5
#| fig-cap: "Cluster centroids on the CA plane. The arrows give an at-a-glance view of where each cluster sits in the rural-urban (Dim 1) and self-contained-vs-commuter (Dim 2) plane."
centroids <- row_df |>
group_by(cluster_label) |>
summarise(
`Dim 1` = mean(`Dim 1`),
`Dim 2` = mean(`Dim 2`),
.groups = "drop"
)
ggplot(row_df, aes(`Dim 1`, `Dim 2`, colour = cluster_label)) +
geom_hline(yintercept = 0, linetype = "dashed", colour = "grey70") +
geom_vline(xintercept = 0, linetype = "dashed", colour = "grey70") +
geom_point(alpha = 0.45, size = 1.6) +
stat_ellipse(level = 0.7, linewidth = 0.6) +
geom_point(data = centroids, size = 4, shape = 18, colour = "black") +
geom_label_repel(data = centroids, aes(label = cluster_label),
fill = "white", colour = "black",
family = "source_sans_3", size = 3.2,
label.size = 0.2) +
scale_colour_manual(values = cluster_palette |> set_names(cluster_labels),
guide = "none") +
labs(x = paste0("Dim 1 (", round(afc$eig[1, 2], 1), "%)"),
y = paste0("Dim 2 (", round(afc$eig[2, 2], 1), "%)"))
```
#### Cluster 1: Industrial-rural towns. {.unnumbered}
The largest cluster (`r sum(clusters$cluster == 1)` municipalities). These are self-contained, often Norrland or central-Sweden municipalities. Residents tend to work and live in the same place, employment is concentrated in mining, manufacturing and agriculture, housing is overwhelmingly owner-occupied, and the typical educational ceiling is upper secondary. Kiruna and Gotland sit at the cluster's far edge.
#### Cluster 2: Peri-rural commuter belt. {.unnumbered}
`r sum(clusters$cluster == 2)` municipalities. Smaller southern and central rural municipalities (Tomelilla, Osby, Klippan, Sölvesborg) where many residents commute out to a nearby labour market. Owner-occupied housing dominates, construction and agriculture are visible employment sectors, and survey opinions of local high schools tend to be *worse* than the national average, which is consistent with a thin local educational offer that pushes choices elsewhere.
#### Cluster 3: Regional service centres. {.unnumbered}
`r sum(clusters$cluster == 3)` municipalities, including Göteborg, Malmö, Umeå, Uppsala, Linköping, Kalmar, Borås, Växjö and Gävle. They share the markers of a mid-sized regional city: substantial rented and tenant-owned housing, residents who work where they live, public administration, post-secondary attainment, and average satisfaction scores on local services. These are Sweden's "self-contained" mid-tier cities.
#### Cluster 4: Metropolitan and knowledge hubs. {.unnumbered}
`r sum(clusters$cluster == 4)` municipalities, mostly Stockholm-region suburbs (Huddinge, Tyresö, Knivsta, Sundbyberg, Solna, Danderyd, Lidingö, Täby, Nacka) plus Stockholm and Lund. Outbound commuting, finance, IT, professional services, post-secondary and post-graduate attainment all surge above the national mean; tenant-owned (apartment) housing is the norm. This cluster captures both the inner-city core of the Stockholm region and the wealthy residential suburbs that depend on it.
## Geography of the clusters
```{r fig-county}
#| fig-width: 9
#| fig-height: 5.5
#| fig-cap: "Cluster composition by county. Counties are ordered roughly south to north. Cluster 1 (industrial-rural) and 2 (peri-rural) dominate outside the metropolitan areas; cluster 4 (metro & knowledge hubs) is overwhelmingly Stockholm-, Uppsala- and Skåne-based; cluster 3 (regional centres) is spread thinly across most counties."
county_order <- c(
"Skåne", "Blekinge", "Halland", "Kronoberg", "Kalmar", "Gotland",
"Jönköping", "Östergötland", "Södermanland", "Västra Götaland",
"Örebro", "Västmanland", "Stockholm", "Uppsala", "Dalarna", "Värmland",
"Gävleborg", "Västernorrland", "Jämtland", "Västerbotten", "Norrbotten"
)
panel_raw <- readxl::read_excel("data/Municipalities_db_2.xlsx",
col_types = "text", n_max = 290) |>
transmute(municipality, code = str_pad(code, 4, "left", "0"))
clusters_geo <- clusters |>
left_join(panel_raw, by = "municipality") |>
mutate(county = county_names[str_sub(code, 1, 2)]) |>
filter(!is.na(county))
clusters_geo |>
count(county, cluster_label) |>
mutate(county = factor(county, levels = county_order)) |>
ggplot(aes(county, n, fill = cluster_label)) +
geom_col(position = "fill") +
scale_fill_manual(values = cluster_palette |> set_names(cluster_labels),
name = "Cluster") +
scale_y_continuous(labels = scales::percent, expand = c(0, 0)) +
labs(x = NULL, y = "Share of municipalities") +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
legend.position = "bottom")
```
# Takeaways
The four clusters lay out a coherent rural-to-urban gradient that the standard "urban / rural" dichotomy hides. The biggest empirical break is not between metro and non-metro but between two kinds of rural area: industrial-rural towns with their own labour market (Cluster 1) versus peri-rural municipalities that depend on outbound commuting (Cluster 2). At the urban end, the data also separate two kinds of city: regional service centres that work as self-contained labour markets (Cluster 3) and metropolitan-region municipalities tied to the knowledge economy (Cluster 4).
A second observation is that the second axis--self-contained labour market vs. outbound-commuter base--matters as much for cluster membership as the first. Sweden's "wealthy suburb" municipalities and its mid-sized university cities sit on the same (right) side of the rural-urban axis but on opposite sides of the labour-market axis, and the clustering picks that distinction up cleanly.
The typology is a starting point: a four-cluster cut over-aggregates Stockholm, Solna and Sundbyberg with the surrounding affluent suburbs, and it pools the remote Norrland municipalities with the central industrial towns. Cutting the same tree at six clusters splits both of those merges; the next section shows that finer view.
# An optional finer cut: six clusters
Cutting the Ward tree at six clusters keeps the two middle clusters from the four-cluster cut (the peri-rural commuter belt and the regional service centres) and splits the other two:
- The **industrial-rural towns** divide into **northern/remote** municipalities (Kiruna, Gotland, Piteå, Skellefteå) and **central/southern industrial** towns (Falköping, Lindesberg, Hedemora).
- The **metropolitan & knowledge hubs** divide into the **affluent suburbs and university satellites** (Lund, Mölndal, Partille, Huddinge, Knivsta, Tyresö, Kungsbacka, ...) and a small **inner Stockholm core** (Stockholm, Solna, Sundbyberg, Danderyd, Lidingö, Täby, Nacka, Sollentuna).
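
Because both cuts come from the same tree, the six-cluster partition is nested in the four-cluster one: a finer cut can only split existing clusters, never reshuffle them. A minimal check on simulated coordinates (the data and names are invented for illustration, and `hclust()` stands in for the stored Ward tree):

```r
# Simulated coordinates for 40 hypothetical municipalities.
set.seed(7)
coords <- matrix(rnorm(80), ncol = 2,
                 dimnames = list(paste0("m", 1:40), c("Dim 1", "Dim 2")))
tree <- hclust(dist(coords), method = "ward.D2")

k4 <- cutree(tree, k = 4)
k6 <- cutree(tree, k = 6)

# Cross-tabulate the two cuts: rows are four-cluster labels, columns six-cluster labels.
xtab <- table(four = k4, six = k6)
xtab

# Nesting: every six-cluster group falls inside exactly one four-cluster group.
all(colSums(xtab > 0) == 1)

# Number of sub-clusters each four-cluster group is divided into.
rowSums(xtab > 0)
```

Running the same cross-tabulation on the two cluster-assignment files would show which four-cluster groups stay intact and which are divided.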
```{r setup-6}
#| include: false
hcpc6 <- read_rds("data/processed/proportions_HCPC_6.rds")
clusters6 <- read_csv("data/processed/cluster_assignment_6.csv",
show_col_types = FALSE) |>
rename(cluster = c6)
cluster_labels_6 <- c(
"1" = "Northern & remote",
"2" = "Central industrial towns",
"3" = "Peri-rural commuter belt",
"4" = "Regional service centres",
"5" = "Affluent suburbs & university satellites",
"6" = "Inner Stockholm core"
)
cluster_palette_6 <- c(
"1" = "#B07F4F",
"2" = "#D7A86E",
"3" = "#7CB07C",
"4" = "#6FA8DC",
"5" = "#C2738B",
"6" = "#7A3A4F"
)
clusters6 <- clusters6 |>
mutate(cluster_label = cluster_labels_6[as.character(cluster)])
```
```{r tbl-sizes-6}
#| tbl-cap: "Six-cluster cut: cluster sizes and labels."
clusters6 |>
count(cluster, cluster_label, name = "n") |>
mutate(`%` = round(100 * n / sum(n), 1)) |>
transmute(Cluster = cluster, Label = cluster_label,
`Municipalities` = n, `Share (%)` = `%`) |>
kbl(booktabs = TRUE, linesep = "") |>
kable_styling(font_size = 9, latex_options = "hold_position")
```
```{r fig-biplot-6}
#| fig-width: 9
#| fig-height: 6.5
#| fig-cap: "CA biplot, six-cluster cut. Relative to the four-cluster view, the two splits both cut the cloud along Dim 1: the northern/remote group sits further left than the central industrial towns, and the inner Stockholm core sits further right than the suburban ring."
row_df6 <- as.data.frame(afc$row$coord[, 1:2]) |>
rownames_to_column("municipality") |>
left_join(clusters6, by = "municipality")
label_munis_6 <- c("Stockholm", "Solna", "Sundbyberg", "Danderyd", "Lidingö",
"Lund", "Uppsala", "Göteborg", "Malmö", "Umeå", "Linköping",
"Knivsta", "Partille", "Mölndal",
"Kiruna", "Gotland", "Piteå", "Skellefteå",
"Falköping", "Lindesberg", "Tomelilla")
ggplot(row_df6, aes(`Dim 1`, `Dim 2`, colour = cluster_label)) +
geom_hline(yintercept = 0, linetype = "dashed", colour = "grey60") +
geom_vline(xintercept = 0, linetype = "dashed", colour = "grey60") +
geom_point(alpha = 0.75, size = 1.8) +
geom_text_repel(data = row_df6 |> filter(municipality %in% label_munis_6),
aes(label = municipality),
size = 2.7, colour = "grey20", family = "source_sans_3",
max.overlaps = 30, segment.size = 0.2) +
scale_colour_manual(values = cluster_palette_6 |> set_names(cluster_labels_6),
name = "Cluster") +
labs(
x = paste0("Dim 1 (", round(afc$eig[1, 2], 1), "%)"),
y = paste0("Dim 2 (", round(afc$eig[2, 2], 1), "%)")
) +
theme(legend.position = "bottom")
```
## What each new sub-cluster looks like
```{r}
#| results: asis
desc_table_6 <- function(cluster_id, n_show = 6) {
hcpc6$desc.var[[as.character(cluster_id)]] |>
as.data.frame(check.names = FALSE) |>
rownames_to_column("Variable") |>
arrange(desc(v.test)) |>
head(n_show) |>
transmute(
Variable = str_replace_all(Variable, "\\.", " "),
`Share in cluster (%)` = round(`Intern %`, 2),
`Share overall (%)` = round(`glob %`, 2),
`v.test` = round(v.test, 1)
)
}
paragon_list_6 <- function(cluster_id, n_show = 8) {
paste(names(head(hcpc6$desc.ind$para[[as.character(cluster_id)]], n_show)),
collapse = ", ")
}
for (k in 1:6) {
cat("\n\n### Cluster ", k, ": ", cluster_labels_6[as.character(k)], " \n\n", sep = "")
cat("**Typical municipalities**: *", paragon_list_6(k), ".*\n\n", sep = "")
print(
kbl(desc_table_6(k), booktabs = TRUE, linesep = "",
caption = paste0("Top over-represented variables, six-cluster cut, cluster ", k, ".")) |>
kable_styling(font_size = 9, latex_options = c("hold_position", "scale_down"))
)
}
```
## Reading the two splits
#### North vs. central industrial. {.unnumbered}
Both subgroups inherit the rural-industrial profile of the four-cluster cut (owner-occupied housing, upper-secondary education, working and living in the same municipality). The northern group is distinguished by an even stronger weight of agriculture/forestry/fishing, adult education (komvux) and preschool-class infrastructure. These are sparsely populated municipalities that have to provide a full educational service on their own at every age. The central group keeps mining & manufacturing as its single strongest difference. These are old industrial towns rather than remote rural ones.
#### Suburban ring vs. inner Stockholm core. {.unnumbered}
The suburban/university group (31 municipalities) is dominated by outbound commuters, post-secondary attainment and tenant-owned housing. These are residential satellites whose labour markets sit elsewhere. The inner core (8 municipalities) tightens the same profile much further: apartment housing, IT and finance employment, and inbound commuting all jump to extreme levels (several `v.test = Inf`). The split is meaningful because Stockholm, Solna, Sundbyberg and Lidingö are *destinations* in the commuting network, while Knivsta, Partille and Staffanstorp are *origins*. The four-cluster cut treats them as one population because they are similar relative to the rest of Sweden, but they sit at the two ends of the metropolitan labour market.
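
A note on the `v.test = Inf` entries. The sketch below assumes that the v.test is obtained by converting a two-sided p-value into a standard-normal quantile, which is our understanding of how `FactoMineR`'s descriptive routines work; the p-values are invented for illustration.

```r
# Hypothetical two-sided p-values, from ordinary to numerically zero.
p_values <- c(0.05, 1e-10, 1e-300, 0)

# Convert each p-value to the corresponding normal quantile.
v_test <- qnorm(p_values / 2, lower.tail = FALSE)

data.frame(p_value = p_values, v_test = v_test)
```

Under that assumption, an infinite v.test does not signal an error: it means the p-value was so small that it was stored as exactly zero, so the over-representation is extreme but its magnitude is no longer resolved by the statistic.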