ruralitic-qrm/txt/Clara/CA_municipalities_2.qmd

---
title: "CAs for the municipalities dataset"
author: "Clara Comte"
lang: en
format:
  pdf:
    include-in-header:
      text: |
        \usepackage{xcolor}
        \usepackage{makecell}
        \usepackage[most]{tcolorbox}
        \definecolor{bleupastel}{RGB}{200,220,235}
        \definecolor{jaunepale}{RGB}{255, 255, 200}
        \definecolor{violetpastel}{RGB}{230, 210, 250}
        \newtcolorbox{fluobox}{colback=bleupastel, colframe=black, boxrule=0.5mm, arc=2mm, left=2mm, right=2mm, top=1mm, bottom=1mm}
        \newtcolorbox{zone4box}{colback=jaunepale, colframe=black, boxrule=0.5mm, arc=2mm, left=2mm, right=2mm, top=1mm, bottom=1mm}
        \newtcolorbox{violetbox}{colback=violetpastel, colframe=black, boxrule=0.5mm, arc=2mm, left=2mm, right=2mm, top=1mm, bottom=1mm}
editor: visual
fig-width: 10
fig-height: 8
---

```{r}
#| echo: false
#| results: hide
#| message: false
#| warning: false

library(readxl)
library(writexl)
library(tibble)
library(FactoMineR)
library(factoextra)
library(showtext)
library(sysfonts)
library(tidyverse)
library(ggrepel)
library(knitr)

col_types <- c("numeric", "numeric", "text", rep("numeric", 284))
municipalities <- read_excel("Municipalities_db.xlsx", col_types = col_types)
```

## CA n°1: Municipalities and educational levels

Let's do a CA with municipalities and educational levels. In this case, I chose to display two biplots: one with every municipality, without labels for more visibility, and one with labels, but with only the 20 municipalities that contribute the most to the CA.

```{=latex}
\noindent\
```

```{r}
#| echo: false
#| message: false
#| warning: false

sysfonts::font_add("CMU Serif", regular = "/home/clara/fonts/cmu/cmunrm.ttf", italic = "/home/clara/fonts/cmu/cmunoti.ttf", bold = "/home/clara/fonts/cmu/cmunrb.ttf", bolditalic = "/home/clara/fonts/cmu/cmunbi.ttf")
showtext::showtext_auto()

muni_2022 <- municipalities[municipalities$year == 2022, ]
tab <- muni_2022[, c("municipality", "Education level_primary_secondary", "Education level_upper_secondary", "Education level_post-secondary", "Education level_post-graduate")]
tab <- column_to_rownames(tab, var = "municipality")
colnames(tab) <- c("Primary", "Upper secondary", "Post-secondary", "Post-graduate")
tab <- as.matrix(tab)
afc <- CA(tab, graph = FALSE)

showtext::showtext_auto()
showtext::showtext_opts(dpi = 300)
theme_set(theme_gray(base_family = "CMU Serif"))

fviz_ca_biplot(afc, repel = TRUE, col.col = "red", label = "col", font.family = "CMU Serif", title = "CA of Municipalities n°1 (2022): Biplot n°1") +
  theme_bw(base_family = "CMU Serif") +
  theme(text = element_text(family = "CMU Serif"), axis.title = element_text(family = "CMU Serif"),
        axis.text = element_text(family = "CMU Serif"),
        plot.title = element_text(family = "CMU Serif", hjust = 0.5))

fviz_ca_biplot(afc, repel = TRUE, select.row = list(contrib = 20), col.col = "red",
               font.family = "CMU Serif", title = "CA of Municipalities n°1 (2022): Biplot n°2") +
  theme_bw(base_family = "CMU Serif") +
  theme(text = element_text(family = "CMU Serif"), axis.title = element_text(family = "CMU Serif"),
    axis.text = element_text(family = "CMU Serif"),
    plot.title = element_text(family = "CMU Serif", hjust = 0.5))
```

The first axis explains 91 % of the total inertia, which makes it highly important in this CA. We can understand it as an educational gradient as it opposes primary and upper secondary levels on the left of the axis, and post-secondary and post-graduate levels to the right. Cities like Stockholm, Uppsala or Lund are highly educated while Eskilstuna or Gislaved are less.

The second axis, that explains 5 % of the inertia, seems highly structured by the post-graduate modalities. At the top of the biplot, we find cities such as Uppsala or Lund, known as Sweden's two major university cities, that stand out clearly from all other municipalities. While at the bottom, we find large cities like Stockholm or Malmö that have high post-secondary levels but relatively few post-graduates.

Overall, this graph firstly suggests that the axis 1 explains 91 % of the variance shows how the educational gradient is the dominant structural feature of the data. Then, this graph suggests significant inequalities between a minority of highly educated municipalities (Lund, Uppsala, Solna...) and poorly educated cities (Gislaved, Eskilstuna, Norrtälje...).

NB: post-secondary = bachelor/master while ; post-graduate = doctoral/licentiate.

```{=latex}
\newpage
```

## CA n°2: A multiple variables municipalities CA (housing, employment, education)

For the previous CA, I used the dataset as it is because the variable I used – educational level – had data for the 2022 year. Now, to carry out a more complete CA with many active indicators, I will use Pablo's sampling script that aims to produce a single clean cross-section anchored to 2022.

```{r}
#| echo: false
#| message: false
#| warning: false

write_rds(municipalities, "m_raw.rds")

# =============================================================================
	#  01-sampling.R  ·  2022 cross-section for municipalities CA
	# =============================================================================
	#
	#  Takes the full dataset (1968–2026, 290 municipalities) and produces a single
	#  clean cross-section anchored to 2022.
	#
	#  Each variable is assigned one of five actions:
	#
	#    keep                 complete (or near-complete) in 2022; used as-is.
	#                         Variables with a handful of NAs and no backfill data
	#                         (e.g. election variables, 3-4 missing per cycle)
	#                         are also kept; residual NAs are noted in the audit.
	#
	#    backfill_window      missing in 2022 (fully or partially), but data exist
	#                         in [2020, 2021]: fill each municipality from its most
	#                         recent non-NA value in that window.
	#
	#    backfill_census      100% missing in 2022, but the series is still active
	#                         (data exists AFTER 2022, i.e. a periodic/census
	#                         variable such as EU parliament elections).
	#                         Strategy: use the year closest to 2022, looking both
	#                         backwards and forwards.
	#
	#    backfill_disc        100% missing in 2022, and the series was discontinued
	#                         before 2022 (data only exists in the past).
	#                         Strategy: use the most recent available year for each
	#                         municipality, regardless of how long ago it was.
	#
	#    drop                 no data at all in the full panel; excluded.
	#
	#  Outputs:
	#    data/processed/m_sample.rds       : 290 municipalities × retained variables
	#    data/processed/sampling_audit.csv : variable-level audit log

	library(tidyverse)

	# 00-Load ----------------------------------------------------------------------
	municipalities_raw <- read_rds("m_raw.rds")
	# Panel: 17 110 rows (290 municipalities × ~59 years), 257 columns.

	# 01-Characterise each variable's availability ---------------------------------
	# 1a. Full dataset: first and last year with any non-NA value.
	availability_full <- municipalities_raw |>
	  pivot_longer(
	    cols = -c(year, code, municipality),
	    names_to = "variable",
	    values_to = "value"
	  ) |>
	  filter(!is.na(value)) |>
	  summarise(
	    .by = variable,
	    first_year = min(year),
	    last_year = max(year)
	  )

	# 1b. Backfill window [2020, 2021]: the most recent year with any non-NA value.
	#     If a variable has nothing here, the window-backfill path is unavailable.
	availability_window <- municipalities_raw |>
	  filter(between(year, 2020, 2021)) |>
	  pivot_longer(
	    cols = -c(year, code, municipality),
	    names_to = "variable",
	    values_to = "value"
	  ) |>
	  filter(!is.na(value)) |>
	  summarise(
	    .by = variable,
	    window_last_year = max(year)
	  )

	# 02-Extract the 2022 slice and assess NAs -------------------------------------
	m_2022 <- municipalities_raw |> filter(year == 2022)
	n_munic <- nrow(m_2022) # 290

	na_2022 <- m_2022 |>
	  summarise(across(-c(year, code, municipality), \(x) sum(is.na(x)))) |>
	  pivot_longer(everything(), names_to = "variable", values_to = "n_na") |>
	  mutate(pct_na = n_na / n_munic)

	# 03-Classify every variable ---------------------------------------------------
	variable_plan <- na_2022 |>
	  left_join(availability_full, by = "variable") |>
	  left_join(availability_window, by = "variable") |>
	  mutate(
	    action = case_when(
	      # Already fine in 2022 (including partial NAs with no window fill available)
	      n_na == 0 ~ "keep",
	      n_na > 0 & n_na < n_munic & is.na(window_last_year) ~ "keep",

	      # Gap in 2022 but window data available; standard window backfill
	      n_na > 0 & !is.na(window_last_year) ~ "backfill_window",

	      # Fully missing in 2022 series is still active (data after 2022); periodic
	      n_na == n_munic &
	        is.na(window_last_year) &
	        !is.na(last_year) &
	        last_year > 2022 ~ "backfill_census",

	      # Fully missing in 2022; series ended before 2022; discontinued
	      n_na == n_munic &
	        is.na(window_last_year) &
	        !is.na(last_year) &
	        last_year <= 2022 ~ "backfill_disc",

	      # No data anywhere in the dataset
	      is.na(last_year) ~ "drop"
	    ),
	    # Sub-type for reporting
	    fill_type = case_when(
	      action == "backfill_window" & n_na == n_munic ~ "full_column",
	      action == "backfill_window" & n_na < n_munic ~ "partial",
	      action == "keep" & n_na > 0 ~ "residual_na",
	      action == "backfill_census" ~ "census_closest",
	      action == "backfill_disc" ~ "discontinued_last",
	      TRUE ~ NA_character_
	    )
	  )

	# 04-Backfill A: window [2020, 2021] Per-municipality, most recent non-NA value.
	vars_window <- variable_plan |>
	  filter(action == "backfill_window") |>
	  pull(variable)

	window_long <- municipalities_raw |>
	  filter(between(year, 2020, 2021)) |>
	  select(year, code, all_of(vars_window)) |>
	  pivot_longer(
	    all_of(vars_window),
	    names_to = "variable",
	    values_to = "value"
	  ) |>
	  filter(!is.na(value)) |>
	  group_by(code, variable) |>
	  slice_max(year, n = 1, with_ties = FALSE) |>
	  ungroup()

	window_source_year <- window_long |>
	  group_by(variable) |>
	  summarise(source_year = max(year), .groups = "drop")

	window_wide <- window_long |>
	  select(-year) |>
	  pivot_wider(names_from = variable, values_from = value)

	# 05-Backfill B: census / periodic Per-municipality, year closest to 2022 in
	# either direction.
	# (Applies to EU parliament election variables: 2024 is 2 years away,
	# 2019 is 3 years away, so 2024 will be selected for all municipalities.)
	# ------------------------------------------------------------------------------
	vars_census <- variable_plan |>
	  filter(action == "backfill_census") |>
	  pull(variable)

	census_long <- municipalities_raw |>
	  filter(year != 2022) |>
	  select(year, code, all_of(vars_census)) |>
	  pivot_longer(
	    all_of(vars_census),
	    names_to = "variable",
	    values_to = "value"
	  ) |>
	  filter(!is.na(value)) |>
	  mutate(distance = abs(year - 2022)) |>
	  group_by(code, variable) |>
	  slice_min(distance, n = 1, with_ties = FALSE) |>
	  ungroup()

	census_source_year <- census_long |>
	  group_by(variable) |>
	  summarise(source_year = max(year), .groups = "drop")

	census_wide <- census_long |>
	  select(-year, -distance) |>
	  pivot_wider(names_from = variable, values_from = value)

	# 06-Backfill C: discontinued --------------------------------------------------
	vars_disc <- variable_plan |>
	  filter(action == "backfill_disc") |>
	  pull(variable)

	disc_long <- municipalities_raw |>
	  filter(year < 2022) |>
	  select(year, code, all_of(vars_disc)) |>
	  pivot_longer(all_of(vars_disc), names_to = "variable", values_to = "value") |>
	  filter(!is.na(value)) |>
	  group_by(code, variable) |>
	  slice_max(year, n = 1, with_ties = FALSE) |>
	  ungroup()

	disc_source_year <- disc_long |>
	  group_by(variable) |>
	  summarise(source_year = max(year), .groups = "drop")

	disc_wide <- disc_long |>
	  select(-year) |>
	  pivot_wider(names_from = variable, values_from = value)

	# 07-Apply all fills (window first, then census and disc which only touch the
	# still-NA cells, i.e. the 100%-missing variables ------------------------------
	m_2022_filled <- m_2022 |>
	  rows_patch(window_wide, by = "code", unmatched = "ignore") |>
	  rows_patch(census_wide, by = "code", unmatched = "ignore") |>
	  rows_patch(disc_wide, by = "code", unmatched = "ignore")

	# 08-Remove truly empty variables and the redundant year column ----------------
	vars_drop <- variable_plan |> filter(action == "drop") |> pull(variable)

	m_sample <- m_2022_filled |>
	  select(-all_of(vars_drop), -year)

	# Check residual NAs
	na_remaining <- m_sample |>
	  summarise(across(-c(code, municipality), \(x) sum(is.na(x)))) |>
	  pivot_longer(everything(), names_to = "variable", values_to = "n_na_final") |>
	  filter(n_na_final > 0)

	# 09-Audit ---------------------------------------------------------------------
	# Combine source-year info from all three fill paths
	source_years <- bind_rows(
	  window_source_year,
	  census_source_year,
	  disc_source_year
	)

	sampling_audit <- variable_plan |>
	  left_join(source_years, by = "variable") |>
	  left_join(na_remaining, by = "variable") |>
	  mutate(
	    n_na_final = replace_na(n_na_final, 0L),
	    years_from_2022 = if_else(
	      !is.na(source_year),
	      abs(source_year - 2022L),
	      NA_integer_
	    ),
	    note = case_when(
	      action == "keep" & is.na(fill_type) ~
	        "Complete in 2022",

	      fill_type == "residual_na" ~
	        paste0(
	          n_na,
	          " municipalities (",
	          round(pct_na * 100, 1),
	          "%) have no 2022 value ",
	          "and no [2020, 2021] data (variable measured only in specific years, e.g. elections)"
	        ),

	      action == "drop" ~
	        "Dropped — no data anywhere in the panel",

	      fill_type == "full_column" ~
	        paste0(
	          "Entire column backfilled from ",
	          source_year,
	          " (100% missing in 2022)"
	        ),

	      fill_type == "partial" ~
	        paste0(
	          round(pct_na * 100, 1),
	          "% missing in 2022; ",
	          "per-municipality fill from ≤",
	          source_year,
	          if_else(
	            n_na_final > 0,
	            paste0("; ", n_na_final, " municipalities remain NA"),
	            "; fully resolved"
	          )
	        ),

	      fill_type == "census_closest" ~
	        paste0(
	          "Periodic variable — no data in 2022; backfilled from ",
	          source_year,
	          " (",
	          years_from_2022,
	          " years from anchor)"
	        ),

	      fill_type == "discontinued_last" ~
	        paste0(
	          "Series discontinued; backfilled from most recent available year (",
	          source_year,
	          ", ",
	          years_from_2022,
	          " years from anchor)"
	        )
	    )
	  ) |>
	  select(
	    variable,
	    action,
	    fill_type,
	    n_na_2022 = n_na,
	    pct_na_2022 = pct_na,
	    n_na_final,
	    first_year,
	    last_year,
	    source_year,
	    years_from_2022,
	    note
	  ) |>
	  arrange(action, fill_type, desc(years_from_2022))


	# 10-Save ----------------------------------------------------------------------
	write_rds(m_sample, "m_sample.rds")
	write_csv(sampling_audit, "sampling_audit.csv")
```

Here is a more complete CA, with education, employment and housing variables as active, and with opinion variables about education as supplementary. We show two biplots, each time with the two first dimensions of the space. The first biplots displays the 30 most important municipalities, with their labels, and the 10 most important modalities, with their labels as well. This helps identify the municipalities and how they position themselves in comparison with the modalities. The second biplot shows every municipality, but without their label, and the 20 most contributive modalities. This helps visualize the distribution of municipalities as a whole, which leaves room to see more modalities than the previous biplot (20 instead of 10).

```{=latex}
\noindent\
```

```{r}
#| echo: false
#| message: false
#| warning: false

sysfonts::font_add("CMU Serif", regular = "/home/clara/fonts/cmu/cmunrm.ttf", italic = "/home/clara/fonts/cmu/cmunoti.ttf", bold = "/home/clara/fonts/cmu/cmunrb.ttf", bolditalic = "/home/clara/fonts/cmu/cmunbi.ttf")
showtext::showtext_auto()
showtext::showtext_opts(dpi = 300)
theme_set(theme_gray(base_family = "CMU Serif"))

# Use of the sampling
muni_2022 <- m_sample

# Recoding for the housing variables
muni_2022 <- muni_2022 |>
  mutate(`Rented dwellings_total` = `Rented dwellings_one- or two-dwelling buildings` + `Rented dwellings_multi-dwelling buildings` + `Rented dwellings_other buildings` + `Rented dwellings_special housing`, `Tenant-owned dwellings_total` = `Tenant-owned dwellings_one- or two-dwelling buildings` + `Tenant-owned dwellings_multi-dwelling buildings` + `Tenant-owned dwellings_other buildings` + `Tenant-owned dwellings_special housing`, `Owner-occupied dwellings_total` = `Owner-occupied dwellings_one- or two-dwelling buildings` + `Owner-occupied dwellings_multi-dwelling buildings` + `Owner-occupied dwellings_other buildings` + `Owner-occupied dwellings_special housing`)

# Active variables
vars_active <- c(
  # Education
  "Education level_primary_secondary", "Education level_upper_secondary",
  "Education level_post-secondary", "Education level_post-graduate",
  # Employment
  "Employment_Agriculture, forestry, hunting and fishing",
  "Employment_Mining, quarrying and manufacturing",
  "Employment_Electricity, gas and water supply, refuse disposal",
  "Employment_Construction", "Employment_Wholesale and retail trade; transport, storage and warehousing, post and telecommunications",
  "Employment_Personal and cultural service activities",
  "Employment_Financial institutions, real estate activities, business activities",
  "Employment_Public authorities, national defence, extra-territorial organizations",
  "Employment_Research and development, education",
  "Employment_Health and social work establishments",
  # Housing
  "Rented dwellings_total", "Tenant-owned dwellings_total", "Owner-occupied dwellings_total")

# Supplementary variables
vars_supp <- c("Opinion on preschool_Bad (%)", "Opinion on preschool_Mid (%)", "Opinion on preschool_Good (%)", "Opinion on elementary school_Bad (%)", "Opinion on elementary school_Mid (%)", "Opinion on elementary school_Good (%)", "Opinion on highschool_Bad (%)", "Opinion on highschool_Mid (%)", "Opinion on highschool_Good (%)")

vars_all <- c(vars_active, vars_supp)
tab <- muni_2022[, c("municipality", vars_all)]
tab <- column_to_rownames(tab, var = "municipality")
tab <- as.matrix(tab)
tab[is.na(tab)] <- 0
supp_idx <- (length(vars_active) + 1):ncol(tab)

colnames(tab) <- c(
  # Education
  "Primary & lower secondary", "Upper secondary", "Post-secondary", "Post-graduate",
  # Employment
  "Agriculture, forestry, fishing", "Mining & manufacturing", "Energy & water",
  "Construction", "Trade, transport & telecom", "Personal & cultural services",
  "Finance & real estate", "Public authorities", "R&D & education",
  "Health & social work",
  # Housing
  "Rented dwellings", "Tenant-owned dwellings", "Owner-occupied dwellings",
  # Supplementary
  "Bad opinion on preschool", "Mid opinion on preschool", "Good opinion on preschool",
  "Bad opinion on elementary school", "Mid opinion on elementary school",
  "Good opinion on elementary school", "Bad opinion on highschool",
  "Mid opinion on highschool", "Good opinion on highschool")
afc <- CA(tab, col.sup = supp_idx, graph = FALSE)

# First biplot, with 30 municipalities and 10 variables / modalities
fviz_ca_biplot(afc, repel = TRUE, col.col = "red", select.col = list(contrib = 10),
               font.family = "CMU Serif", select.row = list(contrib = 30),
               title = "CA of Municipalities n°2 (2022): Biplot n°1") +
  theme_bw(base_family = "CMU Serif") +
  theme(text = element_text(family = "CMU Serif"),
        axis.title = element_text(family = "CMU Serif"),
        axis.text = element_text(family = "CMU Serif"),
        plot.title = element_text(family = "CMU Serif", hjust = 0.5))

# Second biplot, with every municipalities (dots) and 20 variables / modalities
fviz_ca_biplot(afc, repel = TRUE, label = "col", select.col = list(contrib = 20),
               col.col = "red", col.col.sup = "darkgreen", invisible = "none",
               font.family = "CMU Serif", title = "CA of Municipalities n°2 (2022): Biplot n°2") +
  theme_bw(base_family = "CMU Serif") +
  theme(text = element_text(family = "CMU Serif"), axis.title = element_text(family = "CMU Serif"),
        axis.text = element_text(family = "CMU Serif"),
        plot.title = element_text(family = "CMU Serif", hjust = 0.5))
```

```{=latex}
\noindent\
```

This CA shows that the Swedish space of municipalities seems structured by two main dimensions.

The first axis explains 70 % of the total inertia, and it seems to represent an rural-urban educational gradient. Indeed, on the left side, we find variables such as post-graduate and post-secondary levels, personal and cultural service activities or tenant-owned dwellings, that can be associated with highly educated and urban municipalities. University cities like Lund, Stockholm or Uppsala, where people are highly educated and work in the tertiary sector and live in apartments, can be found in this category, as shows the first biplot. However on the right side of the axis, we find variables such as agricultural employment, primary and upper secondary education levels, and owner occupied dwellings, representing cities such as Sjöbo, Gislaved or Vara, smaller, more rural and less educated municipalities.

The second axis, that explains approximately 11 % of the variance, introduces a distinction within the less educated municipalities. It opposes sectors of employment, as we can find construction, owner-occupied dwellings and agricultural employment at the top, representing wealthy periurban or rural municipality (Danderyd, Täby, Kungsbacka), by opposition with rented dwellings, mining and manufacturing employment at the bottom, characterizing industrial municipalities (Gislaved, Södertälje, Göteborg).

This CA is more complete than the previous one as it allows to distinguish different types of less-educated municipalities, and takes into account housing supply variables, as well as employment variables.

```{=latex}
\noindent\
```

## CA n°3: An even more complete municipalities CA (5+ variables)

This first CA is a good start, but we need to add more indicators. Here is the list of indicators of the dataset that we should add to the new version of the CA:

- The number of retired people.

- The variable that indicates whether people work in the same municipality in which they live, or if they commute to another municipality everyday.

- Employment by activity sector, but the new version of this variable, that contains more detailed categories of employment sectors.

- The number of outmigrations, and the number of inmigrations.

- The number of localities (= urban areas), that is the only frequency variable that can approximate the proximity to a big city (it is not the same indicator, but it still indicates to a certain extent the degree of urbanity of the municipality).

- Institution types: Förskoleklass (Preschool class), Grundskola (Primary school), Anpassad grundskola (Adapted primary school), Specialskola (Special school), Sameskola (Sami school), Gymnasieskola (Secondary school), Anpassad gymnasieskola (Adapted secondary school), Komvux (Adult education), SFI (Swedish for immigrants), University, University college. And for each type, three measurements: number of institutions, number of public institutions and number of private institutions.

- \colorbox{jaunepale}{Frequency economic indicators} (we still miss those because right now we only have non-frequencies indicators, such as mean of median incomes).

Before doing the CA, we need to integrate the institution types and their number, number of private institutions and number of public institutions, to the dataset (m_sample, the new dataset that was filtered on 2022). We will thus need to use Pablo's script, that scrapes those variables from three websites: Skolverket, UKÄ and TODO, and add them to our municipalities dataset.

```{r}
#| echo: false
#| message: false
#| warning: false
#| results: false

# =============================================================================
	#  skolverket.R  ·  Educational offer dataset by municipality
	# =============================================================================
	#
	#  Sources and coverage:
	#   1. data/skolenhetsadresser.xlsx      Skolverket school unit registry
	#      (current snapshot; all Skolverket-regulated institution types)
	#      Sheets: Förskoleklass, Grundskola, Anpassad grundskola, Specialskola,
	#              Sameskola, Gymnasieskola, Anpassad gymnasieskola, Komvux
	#   2. Hardcoded list                    higher education institutions
	#      Source: UKÄ (Universitetskanslersämbetet) register, ~40 institutions.
	#      Each institution-municipality pair is one row; multi-campus institutions
	#      appear under every municipality that hosts a campus.
	#   3. TODO: Yrkeshögskola               MYH (Myndigheten för yrkeshögskolan)
	#      open data at myh.se; no REST API identified so far.
	#   4. TODO: Folkhögskola                Folkbildningsrådet register at
	#      folkbildning.se; ~155 institutions across ~100+ municipalities.
	#
	#  Note: the Skolverket planned-educations API (api.skolverket.se) was
	#  explored but covers only the same types as the xlsx; it is used here
	#  purely as an optional check in section 03.
	#
	#  Output:
	#    data/processed/edu_offer.rds   :  municipality × indicator/count (wide)
	#    data/processed/edu_offer.csv   :  same, plain text

	library(tidyverse)
	library(readxl)

	# 00 – Helpers -----------------------------------------------------------------

XLSX_PATH <- "skolenhetsadresser.xlsx"

	# Standardise municipality code column to 4-char zero-padded string
	add_muni_code <- function(df) {
	  df |>
	    rename(
	      muni_code = `BELÄGEN I KOMMUN (KOD)`,
	      muni_name = `BELÄGEN I KOMMUN (NAMN)`
	    ) |>
	    mutate(muni_code = str_pad(as.character(muni_code), 4, "left", "0"))
	}

	# Public = Kommunal or Region (state-level public body); else private
	categorise_ownership <- function(df) {
	  mutate(
	    df,
	    ownership = if_else(
	      HUVUDMANNATYP %in% c("Kommunal", "Region", "Statlig"),
	      "public",
	      "private"
	    )
	  )
	}

	# Count units by municipality and ownership, then pivot to n_public / n_private / n_total
	count_units <- function(df, type_label) {
	  pivoted <- df |>
	    add_muni_code() |>
	    categorise_ownership() |>
	    count(muni_code, muni_name, ownership) |>
	    pivot_wider(names_from = ownership, values_from = n, values_fill = 0L)
	  # Ensure both columns exist even if one ownership type is absent
	  if (!"public" %in% names(pivoted)) {
	    pivoted$public <- 0L
	  }
	  if (!"private" %in% names(pivoted)) {
	    pivoted$private <- 0L
	  }
	  pivoted |>
	    mutate(
	      n_total = public + private,
	      type = type_label,
	      n_public = public,
	      n_private = private
	    ) |>
	    select(muni_code, muni_name, type, n_total, n_public, n_private)
	}

	# 01 – Read all xlsx sheets ----------------------------------------------------

	sheet_map <- c(
	  "forskoleklass" = "Förskoleklass",
	  "grundskola" = "Grundskola",
	  "anpassad_grundskola" = "Anpassad grundskola",
	  "specialskola" = "Specialskola",
	  "sameskola" = "Sameskola",
	  "gymnasieskola" = "Gymnasieskola",
	  "anpassad_gymnasieskola" = "Anpassad gymnasieskola",
	  "komvux" = "Komvux"
	)

	raw <- imap(sheet_map, \(sheet_name, type_label) {
	  cat("Reading sheet:", sheet_name, "\n")
	  read_excel(XLSX_PATH, sheet = sheet_name)
	})

	# 02 – Count institutions per municipality and ownership -----------------------

	unit_counts <- imap_dfr(raw, \(df, type_label) count_units(df, type_label))

	# Komvux: additionally extract SFI-offering units as a separate indicator.
	# Column "SVENSKA FÖR INVANDRARE" = "J" means the unit offers SFI.
	sfi_counts <- raw[["komvux"]] |>
	  add_muni_code() |>
	  categorise_ownership() |>
	  filter(`SVENSKA FÖR INVANDRARE` == "J") |>
	  count(muni_code, muni_name, ownership) |>
	  pivot_wider(names_from = ownership, values_from = n, values_fill = 0L) |>
	  (\(x) {
	    if (!"public" %in% names(x)) {
	      x$public <- 0L
	    }
	    x
	  })() |>
	  (\(x) {
	    if (!"private" %in% names(x)) {
	      x$private <- 0L
	    }
	    x
	  })() |>
	  mutate(
	    n_total = public + private,
	    type = "sfi",
	    n_public = public,
	    n_private = private
	  ) |>
	  select(muni_code, muni_name, type, n_total, n_public, n_private)

	unit_counts <- bind_rows(unit_counts, sfi_counts)

	# 03 – Skolverket API cross-check (optional) -----------------------------------
	# The planned-educations API returns the same institution types as the xlsx.
	# This block fetches the API data and reports any discrepancies between the two.
	# Comment out if offline or if the xlsx is known to be current.

	api_cross_check <- tryCatch(
	  {
	    cat("\nFetching Skolverket API for cross-check...\n")
	    base_url <- "https://api.skolverket.se/planned-educations/school-units"

	    fetch_page <- function(page) {
	      url <- paste0(base_url, "?page=", page, "&size=100")
	      resp <- readLines(url, warn = FALSE) |>
	        paste(collapse = "") |>
	        jsonlite::fromJSON()
	      resp$body
	    }

	    first <- fetch_page(0)
	    n_pages <- first$page$totalPages
	    cat(
	      "  API reports",
	      first$page$totalElements,
	      "units across",
	      n_pages,
	      "pages\n"
	    )

	    all_pages <- map(0:(n_pages - 1), \(p) {
	      if (p %% 10 == 0) {
	        cat("  page", p, "/", n_pages, "\n")
	      }
	      fetch_page(p)$`_embedded`$listedSchoolUnits
	    })

	    api_df <- bind_rows(all_pages) |>
	      transmute(
	        muni_code = str_pad(as.character(geographicalAreaCode), 4, "left", "0"),
	        ownership = if_else(
	          principalOrganizerType %in% c("Kommunal", "Region", "Statlig"),
	          "public",
	          "private"
	        ),
	        type = map_chr(typeOfSchooling, \(t) {
	          if (is.null(t) || nrow(t) == 0) {
	            return(NA_character_)
	          }
	          t$code[1]
	        })
	      ) |>
	      filter(!is.na(type))

	    api_summary <- api_df |>
	      count(muni_code, type, ownership, name = "n_api") |>
	      mutate(
	        type = recode(
	          type,
	          fsk = "forskoleklass",
	          gr = "grundskola",
	          gran = "anpassad_grundskola",
	          sp = "specialskola",
	          sam = "sameskola",
	          gy = "gymnasieskola",
	          gyan = "anpassad_gymnasieskola",
	          vuxgy = "komvux",
	          vuxgr = "komvux",
	          sfi = "sfi"
	        )
	      )

	    cat("  API cross-check complete\n")
	    api_summary
	  },
	  error = function(e) {
	    message("API cross-check skipped: ", conditionMessage(e))
	    NULL
	  }
	)

	# 04 – Higher education institutions (UKÄ list, hardcoded) ---------------------
	# Source: UKÄ register of accredited Swedish higher education institutions.
	# Each row = one institution × one municipality (multi-campus → multiple rows).
	# Verify against: https://www.uka.se/om-oss/kontakt/larosaetenas-webbplatser.html

	he_institutions <- tribble(
	  ~institution                                  , ~muni_code , ~type_he             ,
	  # ---- State universities ----
	  "Uppsala University"                          , "0380"     , "university"         ,
	  "Stockholm University"                        , "0180"     , "university"         ,
	  "Lund University"                             , "1281"     , "university"         ,
	  "University of Gothenburg"                    , "1480"     , "university"         ,
	  "Umeå University"                             , "2480"     , "university"         ,
	  "Linköping University"                        , "0580"     , "university"         ,
	  "Örebro University"                           , "1880"     , "university"         ,
	  "Karlstad University"                         , "1780"     , "university"         ,
	  # ---- State specialised universities ----
	  "KTH Royal Institute of Technology"           , "0180"     , "university"         ,
	  "Karolinska Institutet"                       , "0184"     , "university"         , # Solna
	  "Chalmers University of Technology"           , "1480"     , "university"         , # private, state-grant
	  "SLU – Uppsala"                               , "0380"     , "university"         ,
	  "SLU – Umeå"                                  , "2480"     , "university"         ,
	  "SLU – Alnarp (Lomma)"                        , "1262"     , "university"         ,
	  "SLU – Skara"                                 , "1495"     , "university"         ,
	  # ---- State university colleges ----
	  "Blekinge Institute of Technology"            , "1080"     , "university_college" , # Karlskrona
	  "Dalarna University – Falun"                  , "2080"     , "university_college" ,
	  "Dalarna University – Borlänge"               , "2081"     , "university_college" ,
	  "University of Gävle"                         , "2180"     , "university_college" ,
	  "Halmstad University"                         , "1380"     , "university_college" ,
	  "Kristianstad University"                     , "1290"     , "university_college" ,
	  "Linnaeus University – Växjö"                 , "0780"     , "university_college" ,
	  "Linnaeus University – Kalmar"                , "0880"     , "university_college" ,
	  "Malmö University"                            , "1280"     , "university_college" ,
	  "Mälardalen University – Västerås"            , "1980"     , "university_college" ,
	  "Mälardalen University – Eskilstuna"          , "0484"     , "university_college" ,
	  "Mid Sweden University – Sundsvall"           , "2281"     , "university_college" ,
	  "Mid Sweden University – Östersund"           , "2380"     , "university_college" ,
	  "Södertörn University"                        , "0126"     , "university_college" , # Huddinge
	  "University of Borås"                         , "1490"     , "university_college" ,
	  "University of Skövde"                        , "1496"     , "university_college" ,
	  "University West"                             , "1488"     , "university_college" , # Trollhättan
	  # ---- Private accredited institutions ----
	  "Stockholm School of Economics"               , "0180"     , "university_college" ,
	  "Jönköping University"                        , "0680"     , "university_college" ,
	  # ---- Art, music, design, sport ----
	  "Konstfack"                                   , "0180"     , "university_college" ,
	  "Royal University College of Music (KMH)"     , "0180"     , "university_college" ,
	  "Stockholm University of the Arts"            , "0180"     , "university_college" ,
	  "Royal Institute of Art"                      , "0180"     , "university_college" ,
	  "Beckmans College of Design"                  , "0180"     , "university_college" ,
	  "Swedish School of Sport and Health Sciences" , "0180"     , "university_college" ,
	  # ---- Defence / health ----
	  "Swedish Defence University"                  , "0180"     , "university_college" ,
	  "Sophiahemmet University"                     , "0180"     , "university_college" ,
	  "Ersta Sköndal Bräcke University College"     , "0180"     , "university_college" ,
	  "Röda Korsets Högskola"                       , "0180"     , "university_college" ,
	  "Newmaninstitutet"                            , "0380"     , "university_college"
	)

	he_counts <- he_institutions |>
	  count(muni_code, type_he, name = "n_total") |>
	  rename(type = type_he) |>
	  # All Swedish HE institutions are state-funded or receive >90% public funding;
	  # public/private distinction used for school units does not apply here.
	  mutate(n_public = n_total, n_private = 0L, muni_name = NA_character_)

	# 05 – Combine all sources and reshape to wide ---------------------------------

	long <- bind_rows(
	  unit_counts,
	  he_counts
	)

	# Load the municipality reference to fill in any missing names and ensure
	# all 290 m_sample municipalities appear (with 0s for absent institution types)
	munis <- readRDS("m_sample.rds") |>
	  select(muni_code = code, muni_name_ref = municipality) |>
	  mutate(muni_code = str_pad(as.character(muni_code), 4, "left", "0"))

	all_types <- unique(long$type)

	wide <- munis |>
	  cross_join(tibble(type = all_types)) |>
	  left_join(
	    long |> select(muni_code, type, n_total, n_public, n_private),
	    by = c("muni_code", "type")
	  ) |>
	  mutate(
	    n_total = replace_na(n_total, 0L),
	    n_public = replace_na(n_public, 0L),
	    n_private = replace_na(n_private, 0L)
	  ) |>
	  pivot_wider(
	    names_from = type,
	    values_from = c(n_total, n_public, n_private),
	    names_glue = "{type}_{.value}"
	  ) |>
	  rename(municipality = muni_name_ref, code = muni_code)

	# 06 – Save --------------------------------------------------------------------

	write_rds(wide, "edu_offer.rds")
	write_csv(wide, "edu_offer.csv")
```

```{r}
#| echo: false
#| message: false
#| warning: false
#| results: false

#  Merging edu_offer and m_sample into m_sample_enriched.

m_sample_enriched <- read_rds("m_sample.rds") |>
  mutate(code = str_pad(as.character(code), 4, "left", "0")) |>
  left_join(wide |> select(-municipality), by = "code")

write_rds(m_sample_enriched, "m_sample.rds")
```

Once that is done, and our dataset is ready, we can do the CA again by adding all the variables we need.

```{=latex}
\newpage
```

Here is the first biplot with the two first axis, that shows the 30 most important municipalities with their labels, as well as the 20 most contributive modalities.

```{=latex}
\noindent\
```

```{r}
#| echo: false
#| message: false
#| warning: false
#| results: false

sysfonts::font_add("CMU Serif", regular = "/home/clara/fonts/cmu/cmunrm.ttf", italic = "/home/clara/fonts/cmu/cmunoti.ttf", bold = "/home/clara/fonts/cmu/cmunrb.ttf", bolditalic = "/home/clara/fonts/cmu/cmunbi.ttf")
showtext::showtext_auto()
showtext::showtext_opts(dpi = 300)
theme_set(theme_gray(base_family = "CMU Serif"))

muni_2022 <- m_sample_enriched

# Recoding for the housing variables
muni_2022 <- muni_2022 |>
  mutate(`Rented dwellings_total` = `Rented dwellings_one- or two-dwelling buildings` + `Rented dwellings_multi-dwelling buildings` + `Rented dwellings_other buildings` + `Rented dwellings_special housing`,`Tenant-owned dwellings_total` = `Tenant-owned dwellings_one- or two-dwelling buildings` + `Tenant-owned dwellings_multi-dwelling buildings` + `Tenant-owned dwellings_other buildings` + `Tenant-owned dwellings_special housing`, `Owner-occupied dwellings_total` = `Owner-occupied dwellings_one- or two-dwelling buildings` + `Owner-occupied dwellings_multi-dwelling buildings` + `Owner-occupied dwellings_other buildings` + `Owner-occupied dwellings_special housing`)

# Active variables
vars_active <- c(
  # Education
  "Education level_primary_secondary", "Education level_upper_secondary",
  "Education level_post-secondary", "Education level_post-graduate",
  # Employment (16cat)
  "Employment_16cat_companies in agriculture, forestry and fishing",
  "Employment_16cat_mining, quarrying, manufacturing",
  "Employment_16cat_energy and environmental companies",
  "Employment_16cat_construction industry",
  "Employment_16cat_trade", "Employment_16cat_transport and storage companies",
  "Employment_16cat_hotels and restaurants",
  "Employment_16cat_information and communication companies",
  "Employment_16cat_financial institutions and insurance companies",
  "Employment_16cat_real estate companies",
  "Employment_16cat_professional, scientific and technical companies; administrative and support service companies", "Employment_16cat_public authorities and national defence",
  "Employment_16cat_educational establishments",
  "Employment_16cat_human health and social work establishments",
  "Employment_16cat_establishments for arts, entertainment and recreation; other service companies etc.", "Employment_16cat_unknown activity",
  # Housing
  "Rented dwellings_total", "Tenant-owned dwellings_total", "Owner-occupied dwellings_total",
  # Workplace
  "Workplace_Commuters coming into the municipality", "Workplace_Commuters leaving the municipality",
  "Workplace_Working and living in the municipality",
  # Migrations
  "Number of inmigrations", "Number of outmigrations",
  # Retirees
  "Number of retirees",
  # Localities
  "Number of localities",
  # Institutions - total
  "forskoleklass_n_total", "grundskola_n_total", "anpassad_grundskola_n_total",
  "specialskola_n_total", "sameskola_n_total", "gymnasieskola_n_total",
  "anpassad_gymnasieskola_n_total", "komvux_n_total", "sfi_n_total", "university_n_total",
  "university_college_n_total",
  # Institutions - public
  "forskoleklass_n_public", "grundskola_n_public", "anpassad_grundskola_n_public",
  "specialskola_n_public", "sameskola_n_public", "gymnasieskola_n_public",
  "anpassad_gymnasieskola_n_public", "komvux_n_public", "sfi_n_public",
  "university_n_public", "university_college_n_public",
  # Institutions - private
  "forskoleklass_n_private", "grundskola_n_private", "anpassad_grundskola_n_private",
  "specialskola_n_private", "sameskola_n_private", "gymnasieskola_n_private",
  "anpassad_gymnasieskola_n_private", "komvux_n_private", "sfi_n_private",
  "university_n_private", "university_college_n_private")

# Supplementary variables
vars_supp <- c(
  "Opinion on preschool_Bad (%)", "Opinion on preschool_Mid (%)", "Opinion on preschool_Good (%)",
  "Opinion on elementary school_Bad (%)", "Opinion on elementary school_Mid (%)",
  "Opinion on elementary school_Good (%)", "Opinion on highschool_Bad (%)",
  "Opinion on highschool_Mid (%)", "Opinion on highschool_Good (%)")

vars_all <- c(vars_active, vars_supp)
tab <- muni_2022[, c("municipality", vars_all)]
tab <- column_to_rownames(tab, var = "municipality")
tab <- as.matrix(tab)
tab[is.na(tab)] <- 0
supp_idx <- (length(vars_active) + 1):ncol(tab)

# Rename the modalities
colnames(tab) <- c(
  # Education
  "Primary & lower secondary", "Upper secondary", "Post-secondary", "Post-graduate",
  # Employment in 16 categories
  "Agriculture, forestry, fishing", "Mining & manufacturing", "Energy & environment",
  "Construction", "Trade", "Transport & storage", "Hotels & restaurants",
  "IT & communication", "Finance & insurance", "Real estate",
  "Professional & technical services", "Public authorities & defence",
  "Educational establishments", "Health & social work", "Arts, entertainment & recreation",
  "Unknown activity",
  # Housing
  "Rented dwellings", "Tenant-owned dwellings", "Owner-occupied dwellings",
  # Workplace
  "Commuters in", "Commuters out", "Working & living in municipality",
  # Migrations & other
  "Inmigrations", "Outmigrations", "Retirees", "Localities",
  # Institution types - total
  "Preschool class", "Primary school", "Adapted primary", "Special school", "Sami school",
  "Secondary school", "Adapted secondary", "Komvux", "SFI", "University", "University college",
  # Institution types - public
  "Public preschool class", "Public primary school", "Public adapted primary", "Public pecial school",
  "Public sami school", "Public secondary school", "Public adapted secondary", "Public omvux",
  "Public SFI", "Public university", "Public university college",
  # Institution types - private
  "Private preschool class", "Private primary school", "Private adapted primary",
  "Private special school", "Private sami school", "Private secondary school",
  "Private adapted secondary", "Private komvux", "Private SFI", "Private university",
  "Private university college",
  # Supplementary
  "Bad opinion on preschool", "Mid opinion on preschool", "Good opinion on preschool",
  "Bad opinion on Elementary school", "Mid opinion on elementary school",
  "Good opinion on elementary school", "Bad opinion on highschool", "Mid opinion on highschool",
  "Good opinion on highschool")
afc2 <- CA(tab, col.sup = supp_idx, graph = FALSE)

# First biplot, with 30 municipalities and 20 variables / modalities
fviz_ca_biplot(afc2, repel = TRUE, col.col = "red", select.col = list(contrib = 20),
               select.row = list(contrib = 30), font.family = "CMU Serif",
               title = "CA of Municipalities n°3 (2022): Biplot n°1") +
  theme_bw(base_family = "CMU Serif") +
  theme(text = element_text(family = "CMU Serif"), axis.title = element_text(family = "CMU Serif"),
        axis.text = element_text(family = "CMU Serif"),
        plot.title = element_text(family = "CMU Serif", hjust = 0.5))
```

```{=latex}
\noindent\
```

The first axis explains 48.2 % of the total inertia, which is less than the two previous CAs. This makes sense as this CA is constructed upon more variables: the space is thus more multidimensional. On the left side of this axis, we find modalities such as "Agriculture, forestry, fishing", "Owner-occupied dwellings", "Construction", "Mining and manufacturing", "Primary and lower secondary", "Upper secondary", "Retirees" or "Health and social work" ; while on the right side, we find "Finance and insurance", "IT and communication", "Post-graduate", "Tenant-owned dwellings" or "Post-secondary". This axis seems structured around educational as well as economic and social inequalities. Indeed, the left side of this axis seems less educated, and with an economy based on primary and secondary sector, while the right side seems more educated, orientated towards tertiary and knowledge-economy hubs, and with a better capacity to attract commuters. When we look at the municipalities positioned on each side of the axis, this analysis is confirmed: we find northern (e.g.: Skellefteå), rural (e.g.: Gotland) or industrial municipalities (e.g.: Örnsköldsvik), and they are located on the left side on the axis ; while on the right side, cities such as Stockholm, Uppsala, Lund or Göteborg, Swedish metropolis, are university cities with very educated inhabitants, and cities like Mölndal, Nacka, Danderyd or Lidingö, are wealthy suburbs near Stockholm or Göteborg. In particular, Solna is interesting because it is an employment hub, with a great commuters attraction capacity, which explains its position on the very end of the axis.

While the first axis is mainly structured around employment sectors, commuters attraction, rural/urban opposition as well as educational levels, the second axis seems to introduce a distinction within the most educated and wealthy municipalities. Indeed, the municipalities that are positioned at the extremities of the second axis are all positioned on the right side of the first axis, indicating that the wealthy and educated cities are the one structured around this axis. At the top of the axis, we find the modality "Commuters out", as well as suburban municipalities, whether it is Stockholm's suburbs (Danderyd, Sollentuna, Sundbyberg, Upplands Väsby, Tyresö...) or Göteborg's (Lerum, Härryda, Partille, Mölndal...). These are residential cities, close to bigger municipalities, where commuters go to work. However, on the bottom part of the second axis, we find the modalities "Working and living in municipality" as well as "Rented dwellings", which indicates more independent economies, whether it is because they are structured around a university and / or a major Swedish city (right side of the axis n°1: Lund, Stockholm, Uppsala, Göteborg) or because they are geographically isolated (left side of the axis n°1: Skellefteå, Gotland or Örnsköldsvik).

In sum, the first axis represents a metropolization gradient, and opposes rich, urban and educated municipalities with a highly mobile workforce, to less educated, more rural and industrial cities. The second axis on the other hand represents a functional dependency gradient, as it opposes wealthy suburbs where commuters work in bigger cities, to university or northern industrial – two very different types of cities, but that are both more independent than suburban cities.

```{=latex}
\newpage
```

Here is a second biplot of the two first dimensions, that shows every municipality, without their labels, and the 20 most contributive modalities. This helps visualize the global distribution of municipalities within the space of education, and not only the 30 most important municipalities. I also added the supplementary variables, displayed in green.

```{=latex}
\noindent\
```

```{r}
#| echo: false
#| message: false
#| warning: false
#| results: false

# Manueal extraction of the supplementary variables coordinates
supp_coords <- as.data.frame(afc2$col.sup$coord) |>
  rownames_to_column("variable")

# Biplot 2: every municipality without label, 20 most contributive modalities and supplementary variables manually added
fviz_ca_biplot(afc2, repel = TRUE, label = "col", select.col = list(contrib = 20),
               col.col = "red", invisible = "col.sup",
               font.family = "CMU Serif", title = "CA of Municipalities n°3 (2022): Biplot n°2") +
  theme_bw(base_family = "CMU Serif") +
  theme(text = element_text(family = "CMU Serif"), axis.title = element_text(family = "CMU Serif"),
        axis.text = element_text(family = "CMU Serif"),
        plot.title = element_text(family = "CMU Serif", hjust = 0.5)) +
  geom_point(data = supp_coords, aes(x = `Dim 1`, y = `Dim 2`),
             color = "darkgreen", shape = 17, size = 2) +
  geom_text_repel(data = supp_coords, aes(x = `Dim 1`, y = `Dim 2`, label = variable),
                  color = "darkgreen", family = "CMU Serif", size = 3)
```

```{=latex}
\noindent\
```

This second biplot shows two new elements. First, it evidences the overall distribution of the municipalities within the space of education. It shows continuity between the municipality: we do not see distinct clusters, but rather a continuous distribution along the axes. More precisely, the distribution seems to have a half-moon form: on the right side of graph, there are only municipalities at the top and at the bottom (extremities of the second axis) while on the left side of the graph, the municipalities are more concentrated at the center, and at the bottom as well. This confirms our analysis: the second axis distinguishes the rich and urban municipalities, i.e. the right side of the graph.

```{=latex}
\newpage
```

The other interesting element here is the position of the supplementary variables: they are all located at the top left hand corner of the graph. At first, I thought it indicated that the opinions on education were homogeneous within the Swedish society. But after looking at the square cos2 values of the supplementary variables' modalities, it seems to be rather a problem of quality of representation on the axis. Indeed, they range from 0.06 to 0.19 on axis n°1, and from 0.02 to 0.10 on axis n°2), which means they are poorly represented on these two axis. This means that opinions on education vary according to other dimensions that the ones captured by this CA.

```{r}
#| echo: false
#| message: false
#| warning: false

cos2_supp <- as.data.frame(afc2$col.sup$cos2)[, 1:2]
rownames(cos2_supp) <- c(
  "Bad opinion on preschool", "Mid opinion on preschool", "Good opinion on preschool",
  "Bad opinion on elementary school", "Mid opinion on elementary school",
  "Good opinion on elementary school", "Bad opinion on highschool", "Mid opinion on highschool",
  "Good opinion on highschool")
kable(cos2_supp, digits = 3, caption = "Cos2 of the supplementary variables on the first two dimensions of the CA n°3")
```

```{=latex}
\noindent\
```

Now, let's assign a color to each region, and color the municipalities depending on the county they belong to, to see if there are counties regularities. After extracting the county of each municipalities thanks to the municipality code variable, we assigned one of the three historical regions to each city:

- **Götaland** (south): Östergötland (05), Jönköping (06), Kronoberg (07), Kalmar (08), Gotland (09), Blekinge (10), Skåne (12), Halland (13), Västra Götaland (14).

- **Svealand** (centre): Stockholm (01), Uppsala (03), Södermanland (04), Värmland (17), Örebro (18), Västmanland (19), Dalarna (20).

- **Norrland** (north): Gävleborg (21), Västernorrland (22), Jämtland (23), Västerbotten (24), Norrbotten (25).

On this next graph, the color indicates the region in which the municipality is, and the size of the dot indicates its contribution to the CA (a big dot meaning that the municipality contributes a lot).

```{=latex}
\noindent\
```

```{r}
#| echo: false
#| message: false
#| warning: false
#| results: false

# Creation of the region variable for the municipalities dataset
region_df <- m_sample_enriched %>%
  mutate(dept = substr(str_pad(as.character(code), 4, "left", "0"), 1, 2),
         region = case_when(dept %in% c("21", "22", "23", "24", "25") ~ "Norrland",
                            dept %in% c("01", "03", "04", "17", "18", "19", "20") ~ "Svealand",
                            dept %in% c("05", "06", "07", "08", "09", "10", "12", "13", "14")
                            ~ "Götaland", TRUE ~ "Unknown")) %>%
  arrange(municipality)

region_colors <- c("Götaland" = "#FFD700", "Svealand" = "#FF69B4", "Norrland" = "#00E676")

# The third biplot
fviz_ca_biplot(afc2, repel = TRUE, label = "col", select.col = list(contrib = 10),
               col.row = region_df$region, palette = region_colors,
               geom.ind = "point", pointsize = 3, alpha.ind = 0.7,
               col.col = "red", invisible = "col.sup", font.family = "CMU Serif",
               title = "CA of Municipalities n°3 (2022): Biplot n°3") +
  theme_bw(base_family = "CMU Serif") +
  theme(text = element_text(family = "CMU Serif"), axis.title = element_text(family = "CMU Serif"),
        axis.text = element_text(family = "CMU Serif"),
        plot.title = element_text(family = "CMU Serif", hjust = 0.5)) +
  guides(color = guide_legend(title = "Region"), fill = "none", shape = "none")
```

This third biplots seems some regularities between regions, but it is not obvious at all.

First, yellow municipalities seems to appear on each part of the graph. This suggests a strong heterogeneity between Götaland municipalities. Then, green municipalities (Norrland) seem to be concentrated on the left side of the graph, which confirms the industrial and / or rural characteristic of the north. Finally, the pink municipalities (Svealand) seem to be more present at the top and appear on the top right side of the graph confirming the over-representation of Stockholm suburbs within wealthy municipalities, with a mobile workforce. But once again, these tendencies are not really obvious.

```{=latex}
\newpage
```

Finally, let's do a last biplot that displays the 30 most contributive modalities only, so that we can have an overview of more modalities, while still being able to read the graph. This is the reason why I did not display the municipalities here.

```{=latex}
\noindent\
```

```{r}
#| echo: false
#| message: false
#| warning: false
#| results: false

fviz_ca_biplot(afc2, repel = TRUE, invisible = c("row", "col.sup"), col.col = "red",
               select.col = list(contrib = 30), font.family = "CMU Serif",
               title = "CA of Municipalities n°3 (2022): Biplot n°4") +
  theme_bw(base_family = "CMU Serif") +
  theme(text = element_text(family = "CMU Serif"), axis.title = element_text(family = "CMU Serif"),
        axis.text = element_text(family = "CMU Serif"),
        plot.title = element_text(family = "CMU Serif", hjust = 0.5))
```

This can help interpret the axis in a more finer way.