A People Analytics Guide to Spotting Trend Reversals in Workforce Data
People Analytics
Simpon’s Paradox
R Programming
Data Visualization
Confounding Factors
Two case studies illustrating how apparent trends in workforce data can reverse dramatically across or within subgroups.
Author
Gordon Goodwin
Published
April 6, 2025
Figure 1: Illustration of Simpson’s Paradox at work. Compensation within the general employee population is negatively related to seniority, but the trend reverses when the data are sliced by org. Within each department, employees with higher seniority levels tend to command higher salaries.
Overview
Simpson’s Paradox is a phenomenon where an observed statistical relationship or trend reverses when the data are collapsed across or stratified by subgroups.
In workforce data, Simpson’s Paradox typically manifests in one of 2 main ways:
Top-Down: an association observed within an employee population reverses when data are sliced by smaller employee subgroups.
Bottom’s Up: an association first observed within employee subgroups reverses when data from the groups are combined.
In this guide we’ll use 2 people analytics case studies to illustrate how to identify and control for the paradox using R.
Case Study #1
The Chief People Officer (CPO) of a small but growing tech company has asked us to examine the relationship between employee compensation and seniority
The CPO is a staunch advocate of rewarding seniority with increased compensation.
The company’s compensation strategy has been specifically designed to reflect this.
The CPO has asked us to review whether salary and seniority are in fact positively related.
Packages & Set-Up
Code
knitr::opts_chunk$set(dev ="ragg_png")# No Scientific Notationoptions(scipen =999)# Packageslibrary(tidyverse)library(tidymodels)library(censored)library(janitor)library(ggfun)library(survminer)library(ggsurvfit)library(patchwork)library(ggfortify)library(ggridges)library(tidycmprsk)library(skimr)library(GGally)# Source Helper/Utility Functions## Fontssource("font_setup.R")## Themesource("theme_setup.R")# Visualization Parameters (Themes & Fonts)# FONTS## Adds Google Fonts & Sets Config setup_fonts()## Ties Font Family Names to List for Referencingfonts <-get_font_families()# COLORS## Ties Default Colors & Specified Palette to List Referencescolors <-get_theme_colors(palette =c("#663171","#ea7428","#0c7156","black","firebrick","#f9d14a","#ff6153","#1a318b","#ce9462","#3b3a3e","#6d2f20","skyblue1","white","wheat" ))# THEMES## CREATE BASE THEME - uses font and color list refs internallybase_theme <-create_base_theme(colors = colors)## Specify more theme elements to add to base theme (optional)theme1 <-extend_base_theme(base_theme = base_theme,extended_elements =theme(plot.title =element_text(color ="grey20") ))# Set Themetheme_set(theme1)
Load & Preview Data
First we’ll load and preview our dataset, which contains basic HR info for a sample of 150 company employees across 3 departments:
When we first plot salary against seniority for the general employee population, it appears that compensation is negatively associated with years of seniority!
Figure 2: Illustration of the initial trend: When salary is plotted against seniority years for the general employee population, compensation appears to have a negative relationship with seniority.
Code
# Plotdat1main |>ggplot(aes(x = seniority_years,y = salary_k)) +geom_point(size =1, alpha =0.7) +geom_smooth(method ="lm", se = F) +scale_x_continuous(breaks =seq(0, 10, 1)) +scale_y_continuous(labels = scales::label_dollar(suffix ="k")) +labs(x ="Seniority Years", y ="Salary",title ="Salary Appears Negatively Related to Seniority") +theme(plot.title =element_text(color ="firebrick")) -> p1mainp1main
A linear regression model confirms that after holding all other factors constant:
Each additional year of seniority is associated with a $22,000 decrease in average salary.
Code
# Regression Model: Salary ~ Seniority# Linear Regression Modelm1 <-lm(salary ~ seniority_years, data = dat1main)# Tidy Model Output instead of ugly base R summarytidy(m1)
When we slice our data by department, however, there is a definite trend reversal!
For employees in the same department, compensation is positively related to seniority
Figure 3: Illustration of the trend reversal: Within each departmental subgroup, employees with higher seniority levels tend to also have higher salaries.
Code
# Plot - WITHIN GROUPSdat1main |>ggplot(aes(x = seniority_years,y = salary_k,color = department)) +geom_point(size =1, alpha =0.7) +geom_smooth(method ="lm", se = F) +scale_x_continuous(breaks =seq(0, 10, 1)) +scale_y_continuous(labels =label_dollar(suffix ="k")) +scale_color_manual(values =c(colors$palette[1], colors$palette[2], colors$palette[3])) +labs(x ="Seniority Years", y ="Salary", color ="Department",title ="Trend Reversal When Sliced by Department") +theme(plot.title =element_text(color ="royalblue"),legend.text =element_text(family = fonts$title)) -> p1reversep1reverse
A multivariate regression model confirms this, showing that when holding all else constant:
After controlling for departmental differences, each additional year of seniority is now associated with a $13,500 increase in average salary!
Compared to the Data Science employees (our model reference group), Product team members earn $57K less on average, and HR employees earn $113.6K less!
Code
# Regression Model: Salary ~ Seniority# Linear Regression Modelm1b_reversal <-lm(salary ~ seniority_years + department, data = dat1main)# Tidy Model Output tidy(m1b_reversal)
So how is it that the initial trend completely reversed when we sliced by department?
When we visualize all of our variables simultaneously, we can see that:
Salary & Seniority both vary by department, but in opposite directions
Salary (top right & bottom right boxes):Data Science employees tend to be the highest paid, HR employees tend to be the lowest paid, and Product employees fall in the middle.
Seniority (middle col/top and middle rows): On average, HR employees are the longest-tenured, Data Science employees are the newest, and Product employees are again somewhere in the middle.
Salary and seniority are negatively correlated overall, but positively correlated within departments (far right col/middle row). This is the Paradox in a nutshell.
Figure 4: Pairs Matrix showing the univariate and bivariate distributions of department, salary, and seniority.
Salary appears to vary negatively with seniority at the company-wide level because the higher-seniority employees are concentrated in the lower-paying departments.
In other words, department acts as a confounding factor for the association between salary and seniority. Within each department, the association is positive.
Case Study #2
The CPO was so impressed by our insights that he has a new project for us! Now he wants us to analyze the relationship between employee pre-hire sourcing and job satisfaction.
Prior analyses found that referrals tend to perform better than those who applied traditionally.
Job Satisfaction is a critical KPI, and the company routinely administers surveys to monitor it.
The CPO wants us to use survey and pre-hire sourcing data to determine whether referred employees are more likely to rate their job satisfaction favorably than traditional applicants.
Load and Preview Data
Our second dataset is much larger, containing survey responses and pre-hire sourcing data for all 1000 employees who responded.
# A tibble: 5 × 4
emp_ID referral location satisfied
<chr> <chr> <chr> <chr>
1 x1 Referred Remote Favorable
2 x2 Referred Remote Unfavorable
3 x3 Referred Remote Unfavorable
4 x4 Referred Remote Favorable
5 x5 Referred Remote Unfavorable
Analysis
Initial Trend
When we plot the initial relationship between job satisfaction and pre-hire source, it appears that referred employees are less likely to rate their job satisfaction favorably!
Figure 5: Stacked-Bar (left) and Trend Line (right) plots visualizing job satisfaction favorability for referrals and traditional applicants. Both plots show that referred employees tend to have lower job satisfaction levels.
A logistic regression model confirms that, holding all else constant,:
Referred employees are 64% as likely (36% less likely) to rate their jobs favorably in comparison to traditional applicants.
Code
# Logistic Regression Model## -> Favorable Y/N ~ Referral Status # Convert Satisfaction (Favorable/Unfavorable) to 1/0 for modeldat2main$satisfied_recoded <-ifelse(dat2main$satisfied =="Favorable",1,0)# Relabel referral and location for legibilitydat2main$referral <-factor(dat2main$referral, labels =c("No","Yes"))# Logistic regression modelm3a <-glm(satisfied_recoded ~ referral, data = dat2main, family ="binomial")# Tidy Model Output rounded to 3 digits (Exponentiated Coefficients to get Odds Ratios)tidy(m3a) |>mutate(across(where(is.numeric),~round(.x, 3))) |># Exponentiate to get odds ratio mutate(`exp(estimate)`=exp(estimate), .before = estimate)
When we slice our data by work location (Office vs Remote), however, we see another trend reversal!
Within each work modality, referred employees are more likely to be satisfed than those who applied traditionally.
Figure 6: Stacked-Bar (left) and Trend Line (right) plots that facet by work location type before visualizing job satisfaction favorability by pre-hire source. Both plots show that after accounting for work modality, referred employees tend to rate their job satisfaction more favorably than traditional applicants.
Code
# Plot A) Stacked Bar ~ Locationdat2main |>count(referral, satisfied, location) |>mutate(prop =round(n/sum(n),2),prop_label = glue::glue("{scales::percent(prop)} ({n})"),.by =c(referral, location)) |>mutate(satisfied =fct_relevel(satisfied, c("Unfavorable","Favorable"))) |># filter(satisfied == "Favorable") |> ggplot(aes(x = referral,y = prop,fill = satisfied)) +geom_col(width =0.65, color ="black") +scale_y_continuous(labels = scales::label_percent()) +coord_cartesian(ylim =c(0, 1)) +geom_text(aes(label = prop_label),position =position_stack(vjust = .4),colour = colors$palette[13], size =2.5,fontface ="bold") +facet_wrap(~location) +labs(x ="Pre-Hire Source",y ="Responded Favorably",fill ="Job Satisf.") +# theme(legend.position = "right") +scale_fill_manual(values =c(colors$palette[5],colors$palette[3])) -> p3ap3a +ggtitle("Referrals Within Each Location are *More* Likely to Rate Their Job Satisfaction Favorably") +theme(plot.title =element_text(color = colors$palette[1]), )# Plot B) Trend Line ~ Locationdat2main |>count(referral, satisfied, location) |>mutate(prop = n /sum(n), .by =c(referral, location)) |>filter(satisfied =="Favorable") |># Keep only favorability dataggplot(aes(x = referral, y = prop, color = location, group = location)) +geom_line(size =1.2, lty =2) +geom_point(size =4, shape =1, stroke =1.5) +geom_text(aes(label = scales::percent(prop, accuracy =1)), vjust =-1.5,color = colors$palette[4], fontface ="bold", size =3) +scale_y_continuous(labels = scales::label_percent(), limits =c(0, 1)) +scale_color_manual(values =c(colors$palette[7],colors$palette[8])) +facet_wrap(~location, scales ="free_y") +labs(x ="Pre-Hire Source",y ="% Responded Favorably", ) +theme(legend.position ="none") -> p3bp3b +ggtitle("Referrals Within Each Location are *More* Likely to Rate Their Job Satisfaction Favorably") +theme(plot.title =element_text(color = colors$palette[1]), )# SIDE BY SIDEp3a + p3b +plot_annotation(title ="Referrals Within Each Location are *More* Likely to Rate Their Job Satisfaction Favorably",theme =theme(plot.title =element_text(color = colors$palette[1],size =18,hjust =0.5) ))
A multivariate logistic regression model confirms the reversal:
After controlling for work environment, referred employees are ~2.5x more likely to rate their satisfaction favorably!
Remote team members are over 12.5x more likely to rate their jobs favorably in comparison to In-Office employees (our model reference group).
Code
# Logistic Regression Model## -> Favorable Y/N ~ Referral Status + Location# Logistic regression modelm3 <-glm(satisfied_recoded ~ referral + location, data = dat2main, family ="binomial")# Tidy Model Output rounded to 3 digits (Exponentiated Coefficients to get Odds Ratios)tidy(m3) |>mutate(across(where(is.numeric),~round(.x, 3))) |># Exponentiate to get odds ratio mutate(`exp(estimate)`=exp(estimate), .before = estimate)
So how is it that the initial trend completely reversed when we sliced by Office/Remote status?
When we visualize sourcing, satisfaction, and location simultaneously, we can see that:
Referred employees are more likely to be office-based, and office-based employees are less likely to be satisfied.
Traditional applicants are more likely to work remotely, and remote employees are more likely to be satisfied.
Within each location type, referrals are more likely to be satisfied than traditional applicants.
Figure 7: Illustration of how sample size imbalances lead to a trend reversal after controlling for location type. Bubble sizing corresponds to sample size, and shows us that referred employees are heavily Office-based, while traditional applicants are predominantly Remote. Color gradient maps to the proportion who rated their jobs favorably, which shows us that Office-based employees are significantly less-satisfied than Remote workers. Finally, within each work modality, referrals are much more likely to respond favorably.
Code
# Crosstab (used by plots): Prop & Counts Satisfied by Referral/Applied & Office/Remote cutsdat2main |>summarize(prop_satisfied =mean(satisfied =="Favorable"),count_satisfied =sum(satisfied =="Favorable"),count_NOTsatisfied =sum(satisfied !="Favorable"),n =n(),.by =c(referral, location)) |>mutate(prop_label = glue::glue("(n={n}) Satisf = {scales::percent(prop_satisfied)}")) |>arrange(referral, location) -> tab1tab1# Plot A) Bubble Viztab1 |>mutate(referral =if_else(referral=="Yes","Referred","Applied")) |>ggplot(aes(x = referral, y = location)) +geom_point(aes(size = n, color = prop_satisfied)) +scale_color_gradient2(low ="#cf3a36", mid ="grey70", high ="#0c7156",midpoint = .5,labels = scales::label_percent()) +# scale_color_viridis_c(option = "H", direction = 1) +scale_size_area(max_size =12) +theme(legend.position ="right",plot.title =element_text(color = colors$palette[1], size =16),legend.background =element_rect(color = colors$palette[4])) +labs(x ="Source", y ="Location", size ="Emps (#)", color ="Satisfied",title ="Referral Job Satisfaction is Confounded by Their Higher In-Office Presence") +geom_text(aes(x = referral, y = location, label = prop_label),color = colors$palette[4],vjust =-2)
We can see the same sample size imbalances in a mosaic plot:
Figure 8: Mosaic plot visualization of the underlying sample size imbalances across (Office vs Remote) and (Referred vs Applied) subgroups. As in the bubble plot, the mosaic plot highlights the fact that traditional applicants are skewed heavily towards Remote work environments, while referred employees are much more likely to be Office-based.
Referred employees appear to be less satisfied overall because they are predominantly office-based, and office teams tend to be less satisfied than remote teams regardless of pre-hire source.
In other words, location type acts as a confounding factor for the association between referral status and job favorability. Location masks the fact that within each work modality, referred employees tend to view their jobs more favorably than traditional applicants.
Source Code
---title: "One Chart, Two Realities: Simpson's Paradox at Work"subtitle: "A People Analytics Guide to Spotting Trend Reversals in Workforce Data"description: "Two case studies illustrating how apparent trends in workforce data can reverse dramatically across or within subgroups."author: "Gordon Goodwin"date: "2025-04-06"categories: ["People Analytics", "Simpon's Paradox", "R Programming", "Data Visualization","Confounding Factors"]format: html: toc: true toc-depth: 5 toc-title: "Contents" toc-location: right number-sections: false code-link: true code-fold: true code-tools: true code-line-numbers: true code-summary: "Code" self-contained: true theme: light: [flatly] dark: [darkly]editor_options: chunk_output_type: inlineexecute: cache: false error: false message: false warning: false---------------------------------------------------------------------------{#fig-1}# Overview**Simpson's Paradox** is a phenomenon where an observed statistical relationship or trend reverses when the data are collapsed across or stratified by subgroups.In workforce data, Simpson's Paradox typically manifests in one of 2 main ways:- **Top-Down**: an association observed within an employee population reverses when data are sliced by smaller employee subgroups.- **Bottom's Up**: an association first observed within employee subgroups reverses when data from the groups are combined.In this guide we'll use 2 people analytics case studies to illustrate how to identify and control for the paradox using `R`.------------------------------------------------------------------------# Case Study #1The Chief People Officer (CPO) of a small but growing tech company has asked us to examine the relationship between **employee compensation** and **seniority**- The CPO is a staunch advocate of rewarding seniority with increased compensation.- The company's compensation strategy has been specifically designed to reflect this.**The CPO has asked us to review whether salary and seniority are in fact positively related.**------------------------------------------------------------------------## Packages & Set-Up```{r}#| label: load#| warning: false#| message: false #| output: falseknitr::opts_chunk$set(dev ="ragg_png")# No Scientific Notationoptions(scipen =999)# Packageslibrary(tidyverse)library(tidymodels)library(censored)library(janitor)library(ggfun)library(survminer)library(ggsurvfit)library(patchwork)library(ggfortify)library(ggridges)library(tidycmprsk)library(skimr)library(GGally)# Source Helper/Utility Functions## Fontssource("font_setup.R")## Themesource("theme_setup.R")# Visualization Parameters (Themes & Fonts)# FONTS## Adds Google Fonts & Sets Config setup_fonts()## Ties Font Family Names to List for Referencingfonts <-get_font_families()# COLORS## Ties Default Colors & Specified Palette to List Referencescolors <-get_theme_colors(palette =c("#663171","#ea7428","#0c7156","black","firebrick","#f9d14a","#ff6153","#1a318b","#ce9462","#3b3a3e","#6d2f20","skyblue1","white","wheat" ))# THEMES## CREATE BASE THEME - uses font and color list refs internallybase_theme <-create_base_theme(colors = colors)## Specify more theme elements to add to base theme (optional)theme1 <-extend_base_theme(base_theme = base_theme,extended_elements =theme(plot.title =element_text(color ="grey20") ))# Set Themetheme_set(theme1)```## Load & Preview DataFirst we'll load and preview our dataset, which contains basic HR info for a sample of 150 company employees across 3 departments:```{r}#| label: read_data1#| warn: false#| message: false# READ DATAread_csv("Simpson_Paradox_Data_CS1.csv") -> dat1main## Previewdat1main |>head()```## Analysis### Initial TrendWhen we first plot salary against seniority for the general employee population, **it appears that compensation is negatively associated with years of seniority!**{#fig-2}```{r}#| label: viz1_maintrend#| warning: false#| message: false#| eval: false#| #out-width: 100%#| #fig-format: svg# Plotdat1main |>ggplot(aes(x = seniority_years,y = salary_k)) +geom_point(size =1, alpha =0.7) +geom_smooth(method ="lm", se = F) +scale_x_continuous(breaks =seq(0, 10, 1)) +scale_y_continuous(labels = scales::label_dollar(suffix ="k")) +labs(x ="Seniority Years", y ="Salary",title ="Salary Appears Negatively Related to Seniority") +theme(plot.title =element_text(color ="firebrick")) -> p1mainp1main```A linear regression model confirms that after holding all other factors constant:- **Each additional year of seniority is associated with a $22,000 *decrease* in average salary.**```{r}#| label: mod1_maintrend#| warn: false# Regression Model: Salary ~ Seniority# Linear Regression Modelm1 <-lm(salary ~ seniority_years, data = dat1main)# Tidy Model Output instead of ugly base R summarytidy(m1)```------------------------------------------------------------------------### Trend ReversalWhen we slice our data by department, however, there is a definite trend reversal!**For employees in the same department, compensation is *positively* related to seniority**{#fig-3}```{r}#| label: viz1b_reversal#| warning: false#| eval: false#| #out-width: 100%#| #fig-format: svg# Plot - WITHIN GROUPSdat1main |>ggplot(aes(x = seniority_years,y = salary_k,color = department)) +geom_point(size =1, alpha =0.7) +geom_smooth(method ="lm", se = F) +scale_x_continuous(breaks =seq(0, 10, 1)) +scale_y_continuous(labels =label_dollar(suffix ="k")) +scale_color_manual(values =c(colors$palette[1], colors$palette[2], colors$palette[3])) +labs(x ="Seniority Years", y ="Salary", color ="Department",title ="Trend Reversal When Sliced by Department") +theme(plot.title =element_text(color ="royalblue"),legend.text =element_text(family = fonts$title)) -> p1reversep1reverse ```A **multivariate regression model** confirms this, showing that when holding all else constant:- **After controlling for departmental differences, each additional year of seniority is now associated with a \$13,500 increase in average salary!**- Compared to the Data Science employees (our model reference group), Product team members earn \$57K less on average, and HR employees earn \$113.6K less!```{r}#| label: mod1b_reversal#| warn: false#| message: false# Regression Model: Salary ~ Seniority# Linear Regression Modelm1b_reversal <-lm(salary ~ seniority_years + department, data = dat1main)# Tidy Model Output tidy(m1b_reversal)```------------------------------------------------------------------------### Explaining the Reversal***So how is it that the initial trend completely reversed when we sliced by department?***When we visualize all of our variables simultaneously, we can see that:- **Salary & Seniority both vary by department, but in opposite directions**- **Salary** (top right & bottom right boxes):Data Science employees tend to be the highest paid, HR employees tend to be the lowest paid, and Product employees fall in the middle.- **Seniority** (middle col/top and middle rows): On average, HR employees are the longest-tenured, Data Science employees are the newest, and Product employees are again somewhere in the middle.- **Salary and seniority are negatively correlated overall, but positively correlated within departments** (far right col/middle row). This is the Paradox in a nutshell.{#fig-4}```{r}#| label: cs1_ggally_pairsplot#| warn: false#| message: false#| eval: false#| #out-width: 100%#| #fig-format: svg# Plotdat1main |>select(c(Department = department, `Seniority Yrs`= seniority_years, Salary = salary)) |>ggpairs(aes(color = Department, alpha =0.4)) +scale_color_manual(values =c(colors$palette[1], colors$palette[2], colors$palette[3])) +scale_fill_manual(values =c(colors$palette[1], colors$palette[2], colors$palette[3])) ```### Recap**Salary appears to vary negatively with seniority at the company-wide level because the higher-seniority employees are concentrated in the lower-paying departments.**In other words, **department acts as a confounding factor for the association between salary and seniority**. *Within each department*, the association is positive.------------------------------------------------------------------------# Case Study #2The CPO was so impressed by our insights that he has a new project for us! Now he wants us to analyze the relationship between employee **pre-hire sourcing** and **job satisfaction**.- Prior analyses found that referrals tend to perform better than those who applied traditionally.- Job Satisfaction is a critical KPI, and the company routinely administers surveys to monitor it.**The CPO wants us to use survey and pre-hire sourcing data to determine whether referred employees are more likely to rate their job satisfaction favorably than traditional applicants.**------------------------------------------------------------------------## Load and Preview DataOur second dataset is much larger, containing survey responses and pre-hire sourcing data for all 1000 employees who responded.- Pre-Hire Source (Referred vs Applied)- Job Satisfaction (Favorable vs Unfavorable)- Location (Remote vs Office)```{r}#| label: read_data2#| warn: false#| message: false# READ DATAread_csv("Simpson_Paradox_Data_CS2.csv") -> dat2main## Previewdat2main |>head(n =5)```## Analysis### Initial TrendWhen we plot the initial relationship between job satisfaction and pre-hire source, **it appears that referred employees are *less likely* to rate their job satisfaction favorably!**{#fig-5}```{r}#| label: viz2_maintrend#| warning: false#| message: false#| eval: false#| #out-width: 100%#| #fig-format: svg# Plot A) Stacked Bardat2main |>count(referral, satisfied) |>mutate(prop =round(n/sum(n),2),prop_label = glue::glue("{scales::percent(prop)} ({n})"),.by = referral) |>mutate(satisfied =fct_relevel(satisfied, c("Unfavorable","Favorable"))) |># filter(satisfied == "Favorable") |> ggplot(aes(x = referral,y = prop,fill = satisfied)) +geom_col(width =0.4, color ="black") +scale_y_continuous(labels = scales::label_percent()) +coord_cartesian(ylim =c(0, 1)) +geom_text(aes(label = prop_label),position =position_stack(vjust = .4),colour = colors$palette[13], size =3,fontface ="bold") +scale_fill_manual(values =c(colors$palette[5], colors$palette[3])) +labs(x ="Pre-Hire Source",y ="Responded Favorably",fill ="Job Satisf.") +theme(plot.title =element_text(color = colors$palette[1])) -> p2a# Plot B) Trend Linedat2main |>count(referral, satisfied) |>mutate(prop = n /sum(n), .by = referral) |>filter(satisfied =="Favorable") |># Keep only favorability dataggplot(aes(x = referral, y = prop, group =1)) +geom_line(color = colors$palette[3], linewidth =1.2, lty =2) +geom_point(size =4, color = colors$palette[3], shape =1, stroke =1.5) +scale_y_continuous(labels = scales::label_percent(), limits =c(0, 1)) +labs(x ="Pre-Hire Source",y ="Responded Favorably" ) +geom_text(aes(label = scales::percent(prop, accuracy =1)), vjust =-1.5,color = colors$palette[4], fontface ="bold", size =3) -> p2b# Side-by-Sidep2a + p2b +plot_annotation(title ="Referrals are Less Likely to Rate Their Job Satisfaction Favorably",theme =theme(plot.title =element_text(color = colors$palette[1],size =14,hjust =0.5) ))```A **logistic regression model** confirms that, holding all else constant,:- **Referred employees are 64% as likely (36% less likely) to rate their jobs favorably in comparison to traditional applicants.**```{r}#| label: mod2_maintrend#| warn: false# Logistic Regression Model## -> Favorable Y/N ~ Referral Status # Convert Satisfaction (Favorable/Unfavorable) to 1/0 for modeldat2main$satisfied_recoded <-ifelse(dat2main$satisfied =="Favorable",1,0)# Relabel referral and location for legibilitydat2main$referral <-factor(dat2main$referral, labels =c("No","Yes"))# Logistic regression modelm3a <-glm(satisfied_recoded ~ referral, data = dat2main, family ="binomial")# Tidy Model Output rounded to 3 digits (Exponentiated Coefficients to get Odds Ratios)tidy(m3a) |>mutate(across(where(is.numeric),~round(.x, 3))) |># Exponentiate to get odds ratio mutate(`exp(estimate)`=exp(estimate), .before = estimate)```------------------------------------------------------------------------### Trend ReversalWhen we slice our data by work location (Office vs Remote), however, we see another trend reversal!**Within each work modality, referred employees are more likely to be satisfed than those who applied traditionally.**{#fig-6}```{r}#| label: cs2_reversal#| warn: false#| message: false#| eval: false#| #out-width: 100%#| #fig-format: svg# Plot A) Stacked Bar ~ Locationdat2main |>count(referral, satisfied, location) |>mutate(prop =round(n/sum(n),2),prop_label = glue::glue("{scales::percent(prop)} ({n})"),.by =c(referral, location)) |>mutate(satisfied =fct_relevel(satisfied, c("Unfavorable","Favorable"))) |># filter(satisfied == "Favorable") |> ggplot(aes(x = referral,y = prop,fill = satisfied)) +geom_col(width =0.65, color ="black") +scale_y_continuous(labels = scales::label_percent()) +coord_cartesian(ylim =c(0, 1)) +geom_text(aes(label = prop_label),position =position_stack(vjust = .4),colour = colors$palette[13], size =2.5,fontface ="bold") +facet_wrap(~location) +labs(x ="Pre-Hire Source",y ="Responded Favorably",fill ="Job Satisf.") +# theme(legend.position = "right") +scale_fill_manual(values =c(colors$palette[5],colors$palette[3])) -> p3ap3a +ggtitle("Referrals Within Each Location are *More* Likely to Rate Their Job Satisfaction Favorably") +theme(plot.title =element_text(color = colors$palette[1]), )# Plot B) Trend Line ~ Locationdat2main |>count(referral, satisfied, location) |>mutate(prop = n /sum(n), .by =c(referral, location)) |>filter(satisfied =="Favorable") |># Keep only favorability dataggplot(aes(x = referral, y = prop, color = location, group = location)) +geom_line(size =1.2, lty =2) +geom_point(size =4, shape =1, stroke =1.5) +geom_text(aes(label = scales::percent(prop, accuracy =1)), vjust =-1.5,color = colors$palette[4], fontface ="bold", size =3) +scale_y_continuous(labels = scales::label_percent(), limits =c(0, 1)) +scale_color_manual(values =c(colors$palette[7],colors$palette[8])) +facet_wrap(~location, scales ="free_y") +labs(x ="Pre-Hire Source",y ="% Responded Favorably", ) +theme(legend.position ="none") -> p3bp3b +ggtitle("Referrals Within Each Location are *More* Likely to Rate Their Job Satisfaction Favorably") +theme(plot.title =element_text(color = colors$palette[1]), )# SIDE BY SIDEp3a + p3b +plot_annotation(title ="Referrals Within Each Location are *More* Likely to Rate Their Job Satisfaction Favorably",theme =theme(plot.title =element_text(color = colors$palette[1],size =18,hjust =0.5) ))```A **multivariate logistic regression model** confirms the reversal:- **After controlling for work environment, referred employees are \~2.5x more likely to rate their satisfaction favorably!**- **Remote team members are over 12.5x more likely to rate their jobs favorably** in comparison to In-Office employees (our model reference group).```{r}#| label: mod2_reversal#| warn: false#| message: false# Logistic Regression Model## -> Favorable Y/N ~ Referral Status + Location# Logistic regression modelm3 <-glm(satisfied_recoded ~ referral + location, data = dat2main, family ="binomial")# Tidy Model Output rounded to 3 digits (Exponentiated Coefficients to get Odds Ratios)tidy(m3) |>mutate(across(where(is.numeric),~round(.x, 3))) |># Exponentiate to get odds ratio mutate(`exp(estimate)`=exp(estimate), .before = estimate)```------------------------------------------------------------------------### Explaining the Reversal***So how is it that the initial trend completely reversed when we sliced by Office/Remote status?***When we visualize sourcing, satisfaction, and location simultaneously, we can see that:- *Referred employees are more likely to be office-based, and office-based employees are less likely to be satisfied.*- *Traditional applicants are more likely to work remotely, and remote employees are more likely to be satisfied.*- **Within each location type, referrals are more likely to be satisfied than traditional applicants.**{#fig-7}```{r}#| label: cs2_explained_plotA#| warning: false#| message: false#| eval: false#| #out-width: 100%#| #fig-format: svg# Crosstab (used by plots): Prop & Counts Satisfied by Referral/Applied & Office/Remote cutsdat2main |>summarize(prop_satisfied =mean(satisfied =="Favorable"),count_satisfied =sum(satisfied =="Favorable"),count_NOTsatisfied =sum(satisfied !="Favorable"),n =n(),.by =c(referral, location)) |>mutate(prop_label = glue::glue("(n={n}) Satisf = {scales::percent(prop_satisfied)}")) |>arrange(referral, location) -> tab1tab1# Plot A) Bubble Viztab1 |>mutate(referral =if_else(referral=="Yes","Referred","Applied")) |>ggplot(aes(x = referral, y = location)) +geom_point(aes(size = n, color = prop_satisfied)) +scale_color_gradient2(low ="#cf3a36", mid ="grey70", high ="#0c7156",midpoint = .5,labels = scales::label_percent()) +# scale_color_viridis_c(option = "H", direction = 1) +scale_size_area(max_size =12) +theme(legend.position ="right",plot.title =element_text(color = colors$palette[1], size =16),legend.background =element_rect(color = colors$palette[4])) +labs(x ="Source", y ="Location", size ="Emps (#)", color ="Satisfied",title ="Referral Job Satisfaction is Confounded by Their Higher In-Office Presence") +geom_text(aes(x = referral, y = location, label = prop_label),color = colors$palette[4],vjust =-2)```We can see the same **sample size imbalances** in a **mosaic plot:**{#fig-8}```{r}#| label: cs2_explained_plotB#| warning: false#| message: false#| eval: false#| #out-width: 100%#| #fig-format: svg# Plot B) Mosaic Plot# Mosaic Plot of Sample Size Imbalance Across Locationsdat2main |>mutate(referral =factor(referral, labels =c("Applied","Referred")),location =fct(location),satisfied =fct(satisfied)) |>ggplot() +geom_mosaic(aes(x =product(location, referral), fill = location), show.legend = T) +# theme_mosaic() +scale_fill_manual(values =c(colors$palette[8],colors$palette[7])) +coord_flip() +theme(plot.title =element_text(color = colors$palette[1], size =16),legend.position ="top") +labs(y ="Work Location",x ="Pre-Hire Source",fill ="Location",title ="Work Location Imbalance Across Referral Groups") ```### Recap*Referred employees appear to be less satisfied overall because they are predominantly office-based, and office teams tend to be less satisfied than remote teams regardless of pre-hire source.*In other words, **location type acts as a confounding factor for the association between referral status and job favorability.** Location masks the fact that within each work modality, referred employees tend to view their jobs more favorably than traditional applicants.