ruralitic-qrm/ppt/Presentation June 10.qmd

---
title: "Swedish counties"
author: "Laureline Brun"
format: pdf
toc: true
    include-in-header:
      text: |
        \usepackage{xcolor}
        \usepackage{makecell}
        \usepackage[most]{tcolorbox}
        \definecolor{bleupastel}{RGB}{210,230,255}
        \definecolor{jaunepale}{RGB}{255,255,200}
        \definecolor{violetpastel}{RGB}{230,210,250}
        \newtcolorbox{fluobox}{colback=bleupastel, colframe=black, boxrule=0.5mm, arc=2mm, left=2mm, right=2mm, top=1mm, bottom=1mm}
        \newtcolorbox{zone4box}{colback=jaunepale, colframe=black, boxrule=0.5mm, arc=2mm, left=2mm, right=2mm, top=1mm, bottom=1mm}
        \newtcolorbox{violetbox}{colback=violetpastel, colframe=black, boxrule=0.5mm, arc=2mm, left=2mm, right=2mm, top=1mm, bottom=1mm}
editor: visual
---

```{r}
#| echo: false
#| results: hide
#| message: false
#| warning: false

library(readxl)
library(writexl)
library(tibble)
library(FactoMineR)
library(factoextra)
library(showtext)
library(sysfonts)
library(tidyverse)
library(ggrepel)
library(ggplot2)
library(ggnewscale)
library(knitr)
library(explor)
library(kableExtra)

counties <- read_excel(
  "C:/Users/laure/Documents/Documents synchronisés/ENS - 2A/Suède/Internship/Counties_db.xlsx",
  col_types = c("numeric", "numeric", "text", rep("numeric", 212))
)
```

\newpage

For each one of those CA, we used frequency variables, i.e. a specific type of quantitative variables: variables that count the number of people in each category. So for example when we used educational levels, it is a frequency variable that counts the number of people in each category of educational attainment of the population.

# 1. CA n°1: Municipalities and educational levels

## 1. 1. Active variables for the CA n°1

For this first CA, we will use two active variables, and no supplementary variable. This CA is a preliminary one, to start visualizing the data and build upon it. The active variables and their modalities are:

-   \colorbox{bleupastel}{Educational levels}: primary and lower secondary school, upper secondary school (equivalent to high school), post-secondary (bachelor and masters) and post-graduate (doctoral studies).

-   \colorbox{bleupastel}{Counties}: the current administrative classification includes 21 Swedish counties.

```{=latex}
\newpage
```

## 1. 2. Two biplots with the first two axes

Let's do a CA with municipalities and educational levels. This first biplot displays every municipality, without labels, for more visibility, and every educational level.

\noindent

```{r}
#| echo: false
#| message: false
#| warning: false

#First I rename my columns and create counties_2022

names(counties)[names(counties) == "Education level"] <-
  "Education level_primary_secondary"
names(counties)[26] <- "Education level_upper_secondary"
names(counties)[27] <- "Education level_post_secondary"
names(counties)[28] <- "Education level_post_graduate"

names(counties)[names(counties) == "Workplaces"] <-
  "Workplace_Commuters coming into the municipality"
names(counties)[30] <- "Workplace_Commuters leaving the municipality"
names(counties)[31] <- "Workplace_Working and living in the municipality"

names(counties)[names(counties) == "Number of localities (urban areas)"] <-
  "Number of localities"

names(counties)[names(counties) == "Employment by activity sectors"] <-
  "Employment_Agriculture, forestry, hunting and fishing"
names(counties)[32] <- "Employment_Mining, quarrying and manufacturing"
names(counties)[33] <- "Employment_Electricity, gas and water supply, refuse disposal"
names(counties)[34] <- "Employment_Construction"
names(counties)[35] <- "Employment_Wholesale and retail trade; transport, storage and warehousing, post and telecommunications"
names(counties)[36] <- "Employment_Personal and cultural service activities"
names(counties)[37] <- "Employment_Financial institutions, real estate activities, business activities"
names(counties)[38] <- "Employment_Public authorities, national defence, extra-territorial organizations"
names(counties)[39] <- "Employment_Research and development, education"
names(counties)[40] <- "Employment_Health and social work establishments"
names(counties)[41] <- "Employment_Major groups missing"

names(counties)[names(counties) == "Opinion on primary school"] <-
  "Opinion on primary school_Good (%)"
names(counties)[156] <- "Opinion on primary school_Bad (%)"

names(counties)[names(counties) == "Opinion on high school"] <-
  "Opinion on highschool_Good (%)"
names(counties)[158] <- "Opinion on highschool_Bad (%)"

names(counties)[names(counties) == "Number of rented dwellings, by type of building"] <-
  "Rented dwellings_one- or two-dwelling buildings"
names(counties)[178] <- "Rented dwellings_multi-dwelling buildings"
names(counties)[179] <- "Rented dwellings_other buildings"
names(counties)[180] <- "Rented dwellings_special housing"

names(counties)[names(counties) == "Number of tenant-owned dwellings, by type of building"] <-
  "Tenant-owned dwellings_one- or two-dwelling buildings"
names(counties)[182] <- "Tenant-owned dwellings_multi-dwelling buildings"
names(counties)[183] <- "Tenant-owned dwellings_other buildings"
names(counties)[184] <- "Tenant-owned dwellings_special housing"

names(counties)[names(counties) == "Number of owner-occupied dwellings, by type of building"] <-
  "Owner-occupied dwellings_one- or two-dwelling buildings"
names(counties)[186] <- "Owner-occupied dwellings_multi-dwelling buildings"
names(counties)[187] <- "Owner-occupied dwellings_other buildings"
names(counties)[188] <- "Owner-occupied dwellings_special housing"

names(counties)[names(counties) == "Opinion on preschool"] <-
  "Opinion on preschool_Bad"
names(counties)[197] <- "Opinion on preschool_Mid"
names(counties)[198] <- "Opinion on preschool_Good"

names(counties)[names(counties) == "Opinion on elementary school"] <-
  "Opinion on elementary school_Bad"
names(counties)[200] <- "Opinion on elementary school_Mid"
names(counties)[201] <- "Opinion on elementary school_Good"

names(counties)[names(counties) == "Opinion on high school"] <-
  "Opinion on high school_Bad"
names(counties)[203] <- "Opinion on high school_Mid"
names(counties)[204] <- "Opinion on high school_Good"

names(counties)[names(counties) == "Equivalised disposable income (number of receivers)"] <-
  "Equivalised disposable income receivers_Quartile 1"
names(counties)[213] <- "Equivalised disposable income receivers_Quartile 2"
names(counties)[214] <- "Equivalised disposable income receivers_Quartile 3"
names(counties)[215] <- "Equivalised disposable income receivers_Quartile 4"


counties_2022 <- counties[counties$Year == 2022, ]

counties_2022 <- counties_2022[
  rowSums(is.na(counties_2022)) < ncol(counties_2022),
]
```

```{r}
#| echo: false
#| message: false
#| warning: false

tab <- counties_2022[, c(
  "County",
  "Education level_primary_secondary",
  "Education level_upper_secondary",
  "Education level_post_secondary",
  "Education level_post_graduate"
)]

tab <- column_to_rownames(tab, var = "County")

colnames(tab) <- c(
  "Primary and lower secondary",
  "Upper secondary",
  "Post-secondary",
  "Post-graduate"
)

tab <- as.matrix(tab)

afc <- CA(tab, graph = FALSE)

fviz_ca_biplot(
  afc,
  repel = TRUE,
  col.row = "blue",
  col.col = "red",
  label = "all",
  geom.ind = "text",
  geom.var = "text",
  title = "CA of Counties (2022): Biplot n°1"
) +
  theme_bw() +
  theme(
    text = element_text(size = 14),
    axis.title = element_text(size = 16),
    axis.text = element_text(size = 14),
    plot.title = element_text(hjust = 0.5, size = 18)
  )
```

\newpage

## 1. 3. Interpretation of the first two axes

First, we can say that the CA provides a good summary of the relationship between Swedish counties and levels of education, as the two first dimensions explain 96 % of the total inertia. It can also be observed that education is dependent on geography : the p-value is very low, which allows us to reject the hypothesis of independence between counties and education structure.

Now, let's try to interpret each axis. Dimension 1 represents a sort of educational attainment gradient, ranging from counties with lower educational attainment to counties with more advanced education profiles. It mostly opposes upper-secondary and post-secondary education. Stockholm's county contributes a lot to the first axis, it represents almost half of its inertia. According to the CA, Stockholm is strongly associated with post-secondary education. On the contrary, counties such as Södermanland, Jönköping, Kalmar, Kronoberg or Blekinge have a different educational profile, and are associated mostly with primary and upper-secondary education.

The second axis explains only about 7 % of the total inertia, so it captures something more secondary. It is mostly driven by post-graduate education, which has a contribution of 64 %. The second dimension seems to distinguish counties with a high concentration of advanced academic populations. The county that contributes the most to the second axis is Uppsala, which makes sense as it hosts one of Sweden's biggest universities. Västerbotten also contributes strongly.

\newpage

# 2. CA n°2: Housing, employment and education

## 2. 1. The 2022 sample

Let's carry out a more complete CA with many active indicators, using Pablo's sampling script that aims to produce a single clean cross-section anchored to 2022 (adapted for counties).

```{r}
#| echo: false
#| message: false
#| warning: false

# =============================================================================
#  Counties · 2022 cross-section for CA
# =============================================================================

library(tidyverse)

# -----------------------------------------------------------------------------
# 00 - Optional: save raw dataset
# -----------------------------------------------------------------------------

write_rds(counties, "counties_raw.rds")

# Load raw dataset
counties_raw <- read_rds("counties_raw.rds")

# -----------------------------------------------------------------------------
# 01 - Fix column names from Excel import
# -----------------------------------------------------------------------------

names(counties_raw)[names(counties_raw) == "Education level"] <-
  "Education level_primary_secondary"

names(counties_raw)[26] <- "Education level_upper_secondary"
names(counties_raw)[27] <- "Education level_post_secondary"
names(counties_raw)[28] <- "Education level_post_graduate"

# -----------------------------------------------------------------------------
# 02 - Characterise variable availability
# -----------------------------------------------------------------------------

availability_full <- counties_raw |>
  pivot_longer(
    cols = -c(Year, County),
    names_to = "variable",
    values_to = "value"
  ) |>
  filter(!is.na(value)) |>
  summarise(
    .by = variable,
    first_year = min(Year),
    last_year = max(Year)
  )

# -----------------------------------------------------------------------------
# 03 - Availability in backfill window (2020–2021)
# -----------------------------------------------------------------------------

availability_window <- counties_raw |>
  filter(between(Year, 2020, 2021)) |>
  pivot_longer(
    cols = -c(Year, County),
    names_to = "variable",
    values_to = "value"
  ) |>
  filter(!is.na(value)) |>
  summarise(
    .by = variable,
    window_last_year = max(Year)
  )

# -----------------------------------------------------------------------------
# 04 - Extract 2022 slice
# -----------------------------------------------------------------------------

counti_2022 <- counties_raw |>
  filter(Year == 2022)

# Remove fully empty rows imported from Excel
counti_2022 <- counti_2022[
  rowSums(is.na(counti_2022)) < ncol(counti_2022),
]

n_counties <- nrow(counti_2022)

# -----------------------------------------------------------------------------
# 05 - Count missing values in 2022
# -----------------------------------------------------------------------------

na_2022 <- counti_2022 |>
  summarise(across(-c(Year, County), \(x) sum(is.na(x)))) |>
  pivot_longer(
    everything(),
    names_to = "variable",
    values_to = "n_na"
  ) |>
  mutate(
    pct_na = n_na / n_counties
  )

# -----------------------------------------------------------------------------
# 06 - Classify variables
# -----------------------------------------------------------------------------

variable_plan <- na_2022 |>
  left_join(availability_full, by = "variable") |>
  left_join(availability_window, by = "variable") |>
  mutate(
    action = case_when(

      # Complete in 2022
      n_na == 0 ~ "keep",

      # Partial NA but no window data
      n_na > 0 & n_na < n_counties & is.na(window_last_year) ~ "keep",

      # Window backfill possible
      n_na > 0 & !is.na(window_last_year) ~ "backfill_window",

      # Fully missing but series continues after 2022
      n_na == n_counties &
        is.na(window_last_year) &
        !is.na(last_year) &
        last_year > 2022 ~ "backfill_census",

      # Fully missing and discontinued before/at 2022
      n_na == n_counties &
        is.na(window_last_year) &
        !is.na(last_year) &
        last_year <= 2022 ~ "backfill_disc",

      # No data anywhere
      is.na(last_year) ~ "drop"
    ),

    fill_type = case_when(
      action == "backfill_window" & n_na == n_counties ~ "full_column",
      action == "backfill_window" & n_na < n_counties ~ "partial",
      action == "keep" & n_na > 0 ~ "residual_na",
      action == "backfill_census" ~ "census_closest",
      action == "backfill_disc" ~ "discontinued_last",
      TRUE ~ NA_character_
    )
  )

# -----------------------------------------------------------------------------
# 07 - Backfill A: 2020–2021 window
# -----------------------------------------------------------------------------

vars_window <- variable_plan |>
  filter(action == "backfill_window") |>
  pull(variable)

window_long <- counties_raw |>
  filter(between(Year, 2020, 2021)) |>
  select(Year, County, all_of(vars_window)) |>
  pivot_longer(
    all_of(vars_window),
    names_to = "variable",
    values_to = "value"
  ) |>
  filter(!is.na(value)) |>
  group_by(County, variable) |>
  slice_max(Year, n = 1, with_ties = FALSE) |>
  ungroup()

window_source_year <- window_long |>
  group_by(variable) |>
  summarise(
    source_year = max(Year),
    .groups = "drop"
  )

window_wide <- window_long |>
  select(-Year) |>
  pivot_wider(
    names_from = variable,
    values_from = value
  )

# -----------------------------------------------------------------------------
# 08 - Backfill B: closest year to 2022
# -----------------------------------------------------------------------------

vars_census <- variable_plan |>
  filter(action == "backfill_census") |>
  pull(variable)

census_long <- counties_raw |>
  filter(Year != 2022) |>
  select(Year, County, all_of(vars_census)) |>
  pivot_longer(
    all_of(vars_census),
    names_to = "variable",
    values_to = "value"
  ) |>
  filter(!is.na(value)) |>
  mutate(
    distance = abs(Year - 2022)
  ) |>
  group_by(County, variable) |>
  slice_min(distance, n = 1, with_ties = FALSE) |>
  ungroup()

census_source_year <- census_long |>
  group_by(variable) |>
  summarise(
    source_year = max(Year),
    .groups = "drop"
  )

census_wide <- census_long |>
  select(-Year, -distance) |>
  pivot_wider(
    names_from = variable,
    values_from = value
  )

# -----------------------------------------------------------------------------
# 09 - Backfill C: discontinued variables
# -----------------------------------------------------------------------------

vars_disc <- variable_plan |>
  filter(action == "backfill_disc") |>
  pull(variable)

disc_long <- counties_raw |>
  filter(Year < 2022) |>
  select(Year, County, all_of(vars_disc)) |>
  pivot_longer(
    all_of(vars_disc),
    names_to = "variable",
    values_to = "value"
  ) |>
  filter(!is.na(value)) |>
  group_by(County, variable) |>
  slice_max(Year, n = 1, with_ties = FALSE) |>
  ungroup()

disc_source_year <- disc_long |>
  group_by(variable) |>
  summarise(
    source_year = max(Year),
    .groups = "drop"
  )

disc_wide <- disc_long |>
  select(-Year) |>
  pivot_wider(
    names_from = variable,
    values_from = value
  )

# -----------------------------------------------------------------------------
# 10 - Apply all fills
# -----------------------------------------------------------------------------

counti_2022_filled <- counti_2022 |>
  rows_patch(window_wide, by = "County", unmatched = "ignore") |>
  rows_patch(census_wide, by = "County", unmatched = "ignore") |>
  rows_patch(disc_wide, by = "County", unmatched = "ignore")

# -----------------------------------------------------------------------------
# 11 - Remove empty variables and redundant Year column
# -----------------------------------------------------------------------------

vars_drop <- variable_plan |>
  filter(action == "drop") |>
  pull(variable)

counties_sample <- counti_2022_filled |>
  select(-all_of(vars_drop), -Year)

# -----------------------------------------------------------------------------
# 12 - Residual missing values audit
# -----------------------------------------------------------------------------

na_remaining <- counties_sample |>
  summarise(across(-County, \(x) sum(is.na(x)))) |>
  pivot_longer(
    everything(),
    names_to = "variable",
    values_to = "n_na_final"
  ) |>
  filter(n_na_final > 0)

# -----------------------------------------------------------------------------
# 13 - Build audit table
# -----------------------------------------------------------------------------

source_years <- bind_rows(
  window_source_year,
  census_source_year,
  disc_source_year
)

sampling_audit <- variable_plan |>
  left_join(source_years, by = "variable") |>
  left_join(na_remaining, by = "variable") |>
  mutate(
    n_na_final = replace_na(n_na_final, 0L),

    years_from_2022 = if_else(
      !is.na(source_year),
      abs(source_year - 2022L),
      NA_integer_
    ),

    note = case_when(

      action == "keep" & is.na(fill_type) ~
        "Complete in 2022",

      fill_type == "residual_na" ~
        paste0(
          n_na,
          " counties (",
          round(pct_na * 100, 1),
          "%) have no 2022 value and no 2020–2021 data"
        ),

      action == "drop" ~
        "Dropped — no data anywhere in the panel",

      fill_type == "full_column" ~
        paste0(
          "Entire column backfilled from ",
          source_year
        ),

      fill_type == "partial" ~
        paste0(
          round(pct_na * 100, 1),
          "% missing in 2022; filled from ≤ ",
          source_year
        ),

      fill_type == "census_closest" ~
        paste0(
          "Periodic variable backfilled from ",
          source_year
        ),

      fill_type == "discontinued_last" ~
        paste0(
          "Discontinued series backfilled from ",
          source_year
        )
    )
  ) |>
  select(
    variable,
    action,
    fill_type,
    n_na_2022 = n_na,
    pct_na_2022 = pct_na,
    n_na_final,
    first_year,
    last_year,
    source_year,
    years_from_2022,
    note
  ) |>
  arrange(
    action,
    fill_type,
    desc(years_from_2022)
  )

# -----------------------------------------------------------------------------
# 14 - Save outputs
# -----------------------------------------------------------------------------

write_rds(counties_sample, "counties_sample.rds")

write_csv(
  sampling_audit,
  "counties_sampling_audit.csv"
)
```

Here is a more complete CA, with education, employment and housing variables as active, and with opinion variables about education as supplementary.

```{r}
#| echo: false
#| message: false
#| warning: false

# =============================================================================
# Use of the counties sampling
# =============================================================================

counti_2022 <- counties_sample

# =============================================================================
# Recoding section
# =============================================================================

names(counti_2022)[names(counti_2022) == "Employment by activity sectors"] <-
  "Employment_Agriculture, forestry, hunting and fishing"
names(counti_2022)[32]  <- "Employment_Mining, quarrying and manufacturing"
names(counti_2022)[33]  <- "Employment_Electricity, gas and water supply, refuse disposal"
names(counti_2022)[34]  <- "Employment_Construction"
names(counti_2022)[35]  <- "Employment_Wholesale and retail trade; transport, storage and warehousing, post and telecommunications"
names(counti_2022)[36]  <- "Employment_Personal and cultural service activities"
names(counti_2022)[37]  <- "Employment_Financial institutions, real estate activities, business activities"
names(counti_2022)[38]  <- "Employment_Public authorities, national defence, extra-territorial organizations"
names(counti_2022)[39]  <- "Employment_Research and development, education"
names(counti_2022)[40]  <- "Employment_Health and social work establishments"
names(counti_2022)[41]  <- "Employment_Major groups missing"

names(counti_2022)[names(counti_2022) == "Opinion on primary school"] <-
  "Opinion on primary school_Good (%)"
names(counti_2022)[156] <- "Opinion on primary school_Bad (%)"
names(counti_2022)[names(counti_2022) == "Opinion on high school"] <-
  "Opinion on highschool_Good (%)"
names(counti_2022)[158] <- "Opinion on highschool_Bad (%)"

names(counti_2022)[names(counti_2022) == "Number of rented dwellings, by type of building"] <-
  "Rented dwellings_one- or two-dwelling buildings"
names(counti_2022)[178] <- "Rented dwellings_multi-dwelling buildings"
names(counti_2022)[179] <- "Rented dwellings_other buildings"
names(counti_2022)[180] <- "Rented dwellings_special housing"

names(counti_2022)[names(counti_2022) == "Number of tenant-owned dwellings, by type of building"] <-
  "Tenant-owned dwellings_one- or two-dwelling buildings"
names(counti_2022)[182] <- "Tenant-owned dwellings_multi-dwelling buildings"
names(counti_2022)[183] <- "Tenant-owned dwellings_other buildings"
names(counti_2022)[184] <- "Tenant-owned dwellings_special housing"

names(counti_2022)[names(counti_2022) == "Number of owner-occupied dwellings, by type of building"] <-
  "Owner-occupied dwellings_one- or two-dwelling buildings"
names(counti_2022)[186] <- "Owner-occupied dwellings_multi-dwelling buildings"
names(counti_2022)[187] <- "Owner-occupied dwellings_other buildings"
names(counti_2022)[188] <- "Owner-occupied dwellings_special housing"

# =============================================================================
# Housing totals
# =============================================================================

counti_2022 <- counti_2022 |>
  mutate(

    `Rented dwellings_total` =
      `Rented dwellings_one- or two-dwelling buildings` +
      `Rented dwellings_multi-dwelling buildings` +
      `Rented dwellings_other buildings` +
      `Rented dwellings_special housing`,

    `Tenant-owned dwellings_total` =
      `Tenant-owned dwellings_one- or two-dwelling buildings` +
      `Tenant-owned dwellings_multi-dwelling buildings` +
      `Tenant-owned dwellings_other buildings` +
      `Tenant-owned dwellings_special housing`,

    `Owner-occupied dwellings_total` =
      `Owner-occupied dwellings_one- or two-dwelling buildings` +
      `Owner-occupied dwellings_multi-dwelling buildings` +
      `Owner-occupied dwellings_other buildings` +
      `Owner-occupied dwellings_special housing`
  )

# =============================================================================
# Active variables
# =============================================================================

vars_active <- c(

  # Education
  "Education level_primary_secondary",
  "Education level_upper_secondary",
  "Education level_post_secondary",
  "Education level_post_graduate",

  # Employment
  "Employment_Agriculture, forestry, hunting and fishing",
  "Employment_Mining, quarrying and manufacturing",
  "Employment_Electricity, gas and water supply, refuse disposal",
  "Employment_Construction",
  "Employment_Wholesale and retail trade; transport, storage and warehousing, post and telecommunications",
  "Employment_Personal and cultural service activities",
  "Employment_Financial institutions, real estate activities, business activities",
  "Employment_Public authorities, national defence, extra-territorial organizations",
  "Employment_Research and development, education",
  "Employment_Health and social work establishments",

  # Housing
  "Rented dwellings_total",
  "Tenant-owned dwellings_total",
  "Owner-occupied dwellings_total"
)

# =============================================================================
# Supplementary variables
# =============================================================================

vars_supp <- c(
  "Opinion on primary school_Bad (%)",
  "Opinion on primary school_Good (%)",
  "Opinion on highschool_Bad (%)",
  "Opinion on highschool_Good (%)"
)

# =============================================================================
# Build CA table
# =============================================================================

vars_all <- c(vars_active, vars_supp)

tab <- counti_2022[, c("County", vars_all)]

tab <- column_to_rownames(
  tab,
  var = "County"
)

tab <- as.matrix(tab)

# Replace remaining NA values by 0
tab[is.na(tab)] <- 0

# Indices of supplementary variables
supp_idx <- (length(vars_active) + 1):ncol(tab)

# Clean variable names for plotting
colnames(tab) <- ifelse(
  grepl("_total$", colnames(tab)),
  gsub("_total$", "", colnames(tab)),
  gsub("^[^_]+_", "", colnames(tab))
)

# =============================================================================
# Correspondence Analysis
# =============================================================================

afc <- CA(
  tab,
  col.sup = supp_idx,
  graph = FALSE
)

# =============================================================================
# First biplot
# =============================================================================

fviz_ca_biplot(
  afc,
  repel = TRUE,
  col.col = "red",
  select.col = list(contrib = 10),
  select.row = list(contrib = 30),
  title = "CA of Counties (2022): Biplot 1"
) +
  theme_bw() +
  theme(
    plot.title = element_text(hjust = 0.5)
  )

# =============================================================================
# Second biplot
# =============================================================================

fviz_ca_biplot(
  afc,
  repel = TRUE,
  label = "col",
  select.col = list(contrib = 20),
  col.col = "red",
  col.col.sup = "darkgreen",
  invisible = "none",
  title = "CA of Counties (2022): Biplot 2"
) +
  theme_bw() +
  theme(
    plot.title = element_text(hjust = 0.5)
  )
```

Let's interpret the CA. First, just like for the first CA, the p-value allows us to conclude that there is a link between the geographical context (the counties) and the variables we gathered. The first axis explains 84 % of the total inertia, which means it summarizes almost the entire structure of the dataset. The data is one-dimensional.

Stockholm is basically defining the first axis, which makes sense as it is a big part of the dataset. It has a contribution of 61 % on this axis. It influences the whole CA heavily.

The first axis opposes metropolitan counties, that are highly educated and economically complex (on the right) like Stockholm with counties that are more peripheral and less economically diverse. We can see that post-secondary and post-graduate education are on the right, just like financial activities, while agriculture, manufacturing and upper-secondary are more on the left.

The second axis is less clear, which makes it hard to interpret it.

We also have to keep in mind that some counties are better represented than others : Stockholm, Kalmar and Blekinge all have high (\>0,8) cos² but it is not the case for Uppsala or Gotland for instance, even though they contribute quite a lot to the CA.

# 3. CA n°3: The final CA

## 3. 1. Active and supplementary variables for the CA n°3

This first CA is a good start, but we need to add more indicators. Here is the list of active variables of this new CA, including the variables of the previous one:

-   \colorbox{bleupastel}{Educational levels}: primary and lower secondary school, upper secondary school (equivalent to high school), post-secondary (bachelor and masters) and post-graduate (doctoral studies).

-   \colorbox{bleupastel}{Counties}: the current administrative classification includes 21 Swedish counties.

-   \colorbox{bleupastel}{Housing types}: number of people in rented dwellings, tenant-owned dwellings and owner-occupied dwellings.

-   \colorbox{bleupastel}{Retirees}: number of retired people.

-   \colorbox{bleupastel}{Workplace}: commuters coming in the municipality to work, commuters going out of the municipality to work, commuters who leave and work in the same municipality.

-   \colorbox{bleupastel}{Employment by activity sector}: agriculture, forestry and fishing ; mining, quarrying, manufacturing ; energy and environmental companies ; construction industry ; trade ; transport and storage companies ; hotels and restaurants ; information and communication companies ; financial institutions and insurance companies ; real estate companies ; professional, scientific and technical companies, administrative and support service companies ; public authorities and national defense ; educational establishments ; human health and social work establishments ; establishments for arts, entertainment and recreation, other service companies ; unknown activity

-   \colorbox{bleupastel}{Outmigrations}: number of people leaving the municipality during the year.

-   \colorbox{bleupastel}{Inmigrations}: number of people coming into the municipality during the year.

-   \colorbox{bleupastel}{Localities}: number of urban areas. That is the only frequency variable that can approximate the proximity to a big city (it is not the same indicator, but it still indicates to a certain extent the degree of urbanity of the municipality).

\newpage

As supplementary variables, we have:

-   \colorbox{jaunepale}{Institution types}: Förskoleklass (Preschool class), Grundskola (Primary school), Anpassad grundskola (Adapted primary school), Specialskola (Special school), Sameskola (Sami school), Gymnasieskola (Secondary school), Anpassad gymnasieskola (Adapted secondary school), Komvux (Adult education), SFI (Swedish for immigrants), University, University college. And for each type, three measurements: number of institutions, number of public institutions and number of private institutions.

-   \colorbox{jaunepale}{Opinion on preschool}: bad opinion, mid opinion, good opinion.

-   \colorbox{jaunepale}{Opinion on elementary school}: bad opinion, mid opinion, good opinion.

-   \colorbox{jaunepale}{Opinion on high school}: bad opinion, mid opinion, good opinion.

-   \colorbox{jaunepale}{Capital income}: number of people who receive capital income.

-   \colorbox{jaunepale}{Capital income (interests, dividends, fixed income, etc.)}: number of people who receive capital income (interests, dividends, fixed income, etc.).

-   \colorbox{jaunepale}{Capital income (selling plus-value)}: number of people who receive capital income (selling plus-value).

-   \colorbox{jaunepale}{Equivalised disposable income}: number of people with equivalised disposable income, according to the quartile they are in (quartile 1, 2, 3 or 4).

NB: The economic variables were put as supplementary because they distorted the axes too much.

\newpage

We then need to scrape the institutions types variables, thanks to Pablo's script, and add them to the 2022 sample. Here is how it works:

\noindent\
\noindent

```{r}
# ===========================================================================
#  skolverket.R  ·  Educational offer dataset by municipality
# ===========================================================================
	#
	#  Sources and coverage:
	#   1. data/skolenhetsadresser.xlsx
	#      Skolverket school unit registry
	#      (current snapshot; all Skolverket-regulated institution types)
	#      Sheets: Förskoleklass, Grundskola, Anpassad grundskola,
	#      Specialskola, Sameskola, Gymnasieskola, Anpassad gymnasieskola,
	#      Komvux
	#   2. Hardcoded list
	#      higher education institutions
	#      Source: UKÄ (Universitetskanslersämbetet) register,
	#      ~40 institutions. Each institution-municipality pair is one row;
	#      multi-campus institutions appear under every municipality that
	#      hosts a campus.
	#   3. TODO: Yrkeshögskola
	#      MYH (Myndigheten för yrkeshögskolan)
	#      open data at myh.se; no REST API identified so far.
	#   4. TODO: Folkhögskola
	#      Folkbildningsrådet register at
	#      folkbildning.se; ~155 institutions across ~100+ municipalities.
	#
	#  Note: the Skolverket planned-educations API (api.skolverket.se) was
	#  explored but covers only the same types as the xlsx; it is used here
	#  purely as an optional check in section 03.
	#
	#  Output:
	#    data/processed/edu_offer.rds   :  municipality × indicator/count
	#    (wide)
	#    data/processed/edu_offer.csv   :  same, plain text
```

The script I am trying to adapt for counties (instead of municipalities) :

```{r}
#| echo: false
#| message: false
#| warning: false
#| results: false

	# 00 – Helpers -----------------------------------------------------------------

XLSX_PATH <- "skolenhetsadresser.xlsx"

	# Standardise municipality code column to 4-char zero-padded string
	add_muni_code <- function(df) {
	  df |>
	    rename(
	      muni_code = `BELÄGEN I KOMMUN (KOD)`,
	      muni_name = `BELÄGEN I KOMMUN (NAMN)`
	    ) |>
	    mutate(muni_code = str_pad(as.character(muni_code), 4, "left", "0"))
	}

	# Public = Kommunal or Region (state-level public body); else private
	categorise_ownership <- function(df) {
	  mutate(
	    df,
	    ownership = if_else(
	      HUVUDMANNATYP %in% c("Kommunal", "Region", "Statlig"),
	      "public",
	      "private"
	    )
	  )
	}

	# Count units by municipality and ownership, then pivot to n_public / n_private / n_total
	count_units <- function(df, type_label) {
	  pivoted <- df |>
	    add_muni_code() |>
	    categorise_ownership() |>
	    count(muni_code, muni_name, ownership) |>
	    pivot_wider(names_from = ownership, values_from = n, values_fill = 0L)
	  # Ensure both columns exist even if one ownership type is absent
	  if (!"public" %in% names(pivoted)) {
	    pivoted$public <- 0L
	  }
	  if (!"private" %in% names(pivoted)) {
	    pivoted$private <- 0L
	  }
	  pivoted |>
	    mutate(
	      n_total = public + private,
	      type = type_label,
	      n_public = public,
	      n_private = private
	    ) |>
	    select(muni_code, muni_name, type, n_total, n_public, n_private)
	}

	# 01 – Read all xlsx sheets ----------------------------------------------------

	sheet_map <- c(
	  "forskoleklass" = "Förskoleklass",
	  "grundskola" = "Grundskola",
	  "anpassad_grundskola" = "Anpassad grundskola",
	  "specialskola" = "Specialskola",
	  "sameskola" = "Sameskola",
	  "gymnasieskola" = "Gymnasieskola",
	  "anpassad_gymnasieskola" = "Anpassad gymnasieskola",
	  "komvux" = "Komvux"
	)

	raw <- imap(sheet_map, \(sheet_name, type_label) {
	  cat("Reading sheet:", sheet_name, "\n")
	  read_excel(XLSX_PATH, sheet = sheet_name)
	})

	# 02 – Count institutions per municipality and ownership -----------------------

	unit_counts <- imap_dfr(raw, \(df, type_label) count_units(df, type_label))

	# Komvux: additionally extract SFI-offering units as a separate indicator.
	# Column "SVENSKA FÖR INVANDRARE" = "J" means the unit offers SFI.
	sfi_counts <- raw[["komvux"]] |>
	  add_muni_code() |>
	  categorise_ownership() |>
	  filter(`SVENSKA FÖR INVANDRARE` == "J") |>
	  count(muni_code, muni_name, ownership) |>
	  pivot_wider(names_from = ownership, values_from = n, values_fill = 0L) |>
	  (\(x) {
	    if (!"public" %in% names(x)) {
	      x$public <- 0L
	    }
	    x
	  })() |>
	  (\(x) {
	    if (!"private" %in% names(x)) {
	      x$private <- 0L
	    }
	    x
	  })() |>
	  mutate(
	    n_total = public + private,
	    type = "sfi",
	    n_public = public,
	    n_private = private
	  ) |>
	  select(muni_code, muni_name, type, n_total, n_public, n_private)

	unit_counts <- bind_rows(unit_counts, sfi_counts)

	# 03 – Skolverket API cross-check (optional) -----------------------------------
	# The planned-educations API returns the same institution types as the xlsx.
	# This block fetches the API data and reports any discrepancies between the two.
	# Comment out if offline or if the xlsx is known to be current.

	api_cross_check <- tryCatch(
	  {
	    cat("\nFetching Skolverket API for cross-check...\n")
	    base_url <- "https://api.skolverket.se/planned-educations/school-units"

	    fetch_page <- function(page) {
	      url <- paste0(base_url, "?page=", page, "&size=100")
	      resp <- readLines(url, warn = FALSE) |>
	        paste(collapse = "") |>
	        jsonlite::fromJSON()
	      resp$body
	    }

	    first <- fetch_page(0)
	    n_pages <- first$page$totalPages
	    cat(
	      "  API reports",
	      first$page$totalElements,
	      "units across",
	      n_pages,
	      "pages\n"
	    )

	    all_pages <- map(0:(n_pages - 1), \(p) {
	      if (p %% 10 == 0) {
	        cat("  page", p, "/", n_pages, "\n")
	      }
	      fetch_page(p)$`_embedded`$listedSchoolUnits
	    })

	    api_df <- bind_rows(all_pages) |>
	      transmute(
	        muni_code = str_pad(as.character(geographicalAreaCode), 4, "left", "0"),
	        ownership = if_else(
	          principalOrganizerType %in% c("Kommunal", "Region", "Statlig"),
	          "public",
	          "private"
	        ),
	        type = map_chr(typeOfSchooling, \(t) {
	          if (is.null(t) || nrow(t) == 0) {
	            return(NA_character_)
	          }
	          t$code[1]
	        })
	      ) |>
	      filter(!is.na(type))

	    api_summary <- api_df |>
	      count(muni_code, type, ownership, name = "n_api") |>
	      mutate(
	        type = recode(
	          type,
	          fsk = "forskoleklass",
	          gr = "grundskola",
	          gran = "anpassad_grundskola",
	          sp = "specialskola",
	          sam = "sameskola",
	          gy = "gymnasieskola",
	          gyan = "anpassad_gymnasieskola",
	          vuxgy = "komvux",
	          vuxgr = "komvux",
	          sfi = "sfi"
	        )
	      )

	    cat("  API cross-check complete\n")
	    api_summary
	  },
	  error = function(e) {
	    message("API cross-check skipped: ", conditionMessage(e))
	    NULL
	  }
	)

	# 04 – Higher education institutions (UKÄ list, hardcoded) ---------------------
	# Source: UKÄ register of accredited Swedish higher education institutions.
	# Each row = one institution × one municipality (multi-campus → multiple rows).
	# Verify against: https://www.uka.se/om-oss/kontakt/larosaetenas-webbplatser.html

	he_institutions <- tribble(
	  ~institution                                  , ~muni_code , ~type_he             ,
	  # ---- State universities ----
	  "Uppsala University"                          , "0380"     , "university"         ,
	  "Stockholm University"                        , "0180"     , "university"         ,
	  "Lund University"                             , "1281"     , "university"         ,
	  "University of Gothenburg"                    , "1480"     , "university"         ,
	  "Umeå University"                             , "2480"     , "university"         ,
	  "Linköping University"                        , "0580"     , "university"         ,
	  "Örebro University"                           , "1880"     , "university"         ,
	  "Karlstad University"                         , "1780"     , "university"         ,
	  # ---- State specialised universities ----
	  "KTH Royal Institute of Technology"           , "0180"     , "university"         ,
	  "Karolinska Institutet"                       , "0184"     , "university"         , # Solna
	  "Chalmers University of Technology"           , "1480"     , "university"         , # private, state-grant
	  "SLU – Uppsala"                               , "0380"     , "university"         ,
	  "SLU – Umeå"                                  , "2480"     , "university"         ,
	  "SLU – Alnarp (Lomma)"                        , "1262"     , "university"         ,
	  "SLU – Skara"                                 , "1495"     , "university"         ,
	  # ---- State university colleges ----
	  "Blekinge Institute of Technology"            , "1080"     , "university_college" , # Karlskrona
	  "Dalarna University – Falun"                  , "2080"     , "university_college" ,
	  "Dalarna University – Borlänge"               , "2081"     , "university_college" ,
	  "University of Gävle"                         , "2180"     , "university_college" ,
	  "Halmstad University"                         , "1380"     , "university_college" ,
	  "Kristianstad University"                     , "1290"     , "university_college" ,
	  "Linnaeus University – Växjö"                 , "0780"     , "university_college" ,
	  "Linnaeus University – Kalmar"                , "0880"     , "university_college" ,
	  "Malmö University"                            , "1280"     , "university_college" ,
	  "Mälardalen University – Västerås"            , "1980"     , "university_college" ,
	  "Mälardalen University – Eskilstuna"          , "0484"     , "university_college" ,
	  "Mid Sweden University – Sundsvall"           , "2281"     , "university_college" ,
	  "Mid Sweden University – Östersund"           , "2380"     , "university_college" ,
	  "Södertörn University"                        , "0126"     , "university_college" , # Huddinge
	  "University of Borås"                         , "1490"     , "university_college" ,
	  "University of Skövde"                        , "1496"     , "university_college" ,
	  "University West"                             , "1488"     , "university_college" , # Trollhättan
	  # ---- Private accredited institutions ----
	  "Stockholm School of Economics"               , "0180"     , "university_college" ,
	  "Jönköping University"                        , "0680"     , "university_college" ,
	  # ---- Art, music, design, sport ----
	  "Konstfack"                                   , "0180"     , "university_college" ,
	  "Royal University College of Music (KMH)"     , "0180"     , "university_college" ,
	  "Stockholm University of the Arts"            , "0180"     , "university_college" ,
	  "Royal Institute of Art"                      , "0180"     , "university_college" ,
	  "Beckmans College of Design"                  , "0180"     , "university_college" ,
	  "Swedish School of Sport and Health Sciences" , "0180"     , "university_college" ,
	  # ---- Defence / health ----
	  "Swedish Defence University"                  , "0180"     , "university_college" ,
	  "Sophiahemmet University"                     , "0180"     , "university_college" ,
	  "Ersta Sköndal Bräcke University College"     , "0180"     , "university_college" ,
	  "Röda Korsets Högskola"                       , "0180"     , "university_college" ,
	  "Newmaninstitutet"                            , "0380"     , "university_college"
	)

	he_counts <- he_institutions |>
	  count(muni_code, type_he, name = "n_total") |>
	  rename(type = type_he) |>
	  # All Swedish HE institutions are state-funded or receive >90% public funding;
	  # public/private distinction used for school units does not apply here.
	  mutate(n_public = n_total, n_private = 0L, muni_name = NA_character_)

	# 05 – Combine all sources and reshape to wide ---------------------------------

	long <- bind_rows(
	  unit_counts,
	  he_counts
	)

	# Load the municipality reference to fill in any missing names and ensure
	# all 290 m_sample municipalities appear (with 0s for absent institution types)
	munis <- readRDS("m_sample.rds") |>
	  select(muni_code = code, muni_name_ref = municipality) |>
	  mutate(muni_code = str_pad(as.character(muni_code), 4, "left", "0"))

	all_types <- unique(long$type)

	wide <- munis |>
	  cross_join(tibble(type = all_types)) |>
	  left_join(
	    long |> select(muni_code, type, n_total, n_public, n_private),
	    by = c("muni_code", "type")
	  ) |>
	  mutate(
	    n_total = replace_na(n_total, 0L),
	    n_public = replace_na(n_public, 0L),
	    n_private = replace_na(n_private, 0L)
	  ) |>
	  pivot_wider(
	    names_from = type,
	    values_from = c(n_total, n_public, n_private),
	    names_glue = "{type}_{.value}"
	  ) |>
	  rename(municipality = muni_name_ref, code = muni_code)

	# 06 – Save --------------------------------------------------------------------

	write_rds(wide, "edu_offer.rds")
	write_csv(wide, "edu_offer.csv")
```

```{r}
#| echo: false
#| message: false
#| warning: false
#| results: false

#  Merging edu_offer and m_sample into m_sample_enriched.

m_sample_enriched <- read_rds("m_sample.rds") |>
  mutate(code = str_pad(as.character(code), 4, "left", "0")) |>
  left_join(wide |> select(-municipality), by = "code")

write_rds(m_sample_enriched, "m_sample.rds")
```