R Markdown
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
## BUSINESS QUESTION
#What can 10 years of GPA data at UIUC reveal about grade distribution, course difficulty, and fairness across departments and instructors?
#Most students experience GPA as a number. Departments see it as a metric
#Grades are often treated as fixed outcomes—but when viewed at scale, they reflect broader academic systems, teaching practices, and student support structures.
#This project was a deep dive into 10 years of GPA data from the University of Illinois (2010–2020)—not just to visualize the past, but to identify opportunities for a better academic future.
#METHODOLOGY
#The dataset includes over 56,000 course-level records from the University of Illinois, covering the years 2010 to 2020. I cleaned the data using tidyverse, removed missing values, and performed exploratory analysis using grouping, histograms, boxplots, and contingency tables. Visualization was done using ggplot2.
data <- read.csv("/Users/jayeshghosh/Desktop/gpa.csv")
#Skills Used:
#R, Tidyverse, ggplot2, Data Cleaning, Statistical Summaries, Data Visualization, Educational Analytics
summary(data)
## Year Term YearTerm Subject
## Min. :2010 Length:69069 Length:69069 Length:69069
## 1st Qu.:2013 Class :character Class :character Class :character
## Median :2016 Mode :character Mode :character Mode :character
## Mean :2016
## 3rd Qu.:2020
## Max. :2023
## Number Course.Title Sched.Type A.
## Min. : 2.0 Length:69069 Length:69069 Min. : 0.000
## 1st Qu.:133.0 Class :character Class :character 1st Qu.: 0.000
## Median :272.0 Mode :character Mode :character Median : 2.000
## Mean :293.5 Mean : 7.269
## 3rd Qu.:424.0 3rd Qu.: 7.000
## Max. :798.0 Max. :929.000
## A A..1 B. B
## Min. : 0.00 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 6.00 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 2.000
## Median : 11.00 Median : 5.000 Median : 4.000 Median : 4.000
## Mean : 18.79 Mean : 7.618 Mean : 5.879 Mean : 7.369
## 3rd Qu.: 20.00 3rd Qu.: 9.000 3rd Qu.: 7.000 3rd Qu.: 8.000
## Max. :1034.00 Max. :351.000 Max. :210.000 Max. :250.000
## B..1 C. C C..1
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 1.000 Median : 1.000 Median : 1.000 Median : 0.000
## Mean : 3.182 Mean : 2.144 Mean : 2.705 Mean : 1.169
## 3rd Qu.: 4.000 3rd Qu.: 2.000 3rd Qu.: 3.000 3rd Qu.: 1.000
## Max. :144.000 Max. :138.000 Max. :111.000 Max. :66.000
## D. D D..1 F
## Min. : 0.0000 Min. : 0.00 Min. : 0.0000 Min. : 0.00
## 1st Qu.: 0.0000 1st Qu.: 0.00 1st Qu.: 0.0000 1st Qu.: 0.00
## Median : 0.0000 Median : 0.00 Median : 0.0000 Median : 0.00
## Mean : 0.5979 Mean : 0.82 Mean : 0.3607 Mean : 1.18
## 3rd Qu.: 0.0000 3rd Qu.: 1.00 3rd Qu.: 0.0000 3rd Qu.: 1.00
## Max. :45.0000 Max. :61.00 Max. :29.0000 Max. :75.00
## W Students Primary.Instructor
## Min. : 0.00 Min. : 21.00 Length:69069
## 1st Qu.: 0.00 1st Qu.: 26.00 Class :character
## Median : 0.00 Median : 34.00 Mode :character
## Mean : 0.23 Mean : 59.08
## 3rd Qu.: 0.00 3rd Qu.: 55.00
## Max. :16.00 Max. :1913.00
str(data)
## 'data.frame': 69069 obs. of 23 variables:
## $ Year : int 2023 2023 2023 2023 2023 2023 2023 2023 2023 2023 ...
## $ Term : chr "Spring" "Spring" "Spring" "Spring" ...
## $ YearTerm : chr "2023-sp" "2023-sp" "2023-sp" "2023-sp" ...
## $ Subject : chr "AAS" "AAS" "AAS" "AAS" ...
## $ Number : int 100 100 100 200 215 141 152 232 425 457 ...
## $ Course.Title : chr "Intro Asian American Studies" "Intro Asian American Studies" "Intro Asian American Studies" "U.S. Race and Empire" ...
## $ Sched.Type : chr "DIS" "DIS" "DIS" "LCD" ...
## $ A. : int 0 0 0 6 16 0 55 3 0 1 ...
## $ A : int 11 17 13 15 12 5 115 13 22 14 ...
## $ A..1 : int 5 2 2 5 2 4 0 5 0 4 ...
## $ B. : int 1 1 2 2 1 6 4 5 2 0 ...
## $ B : int 4 2 0 3 1 5 9 1 0 2 ...
## $ B..1 : int 0 1 2 0 1 2 0 0 0 0 ...
## $ C. : int 0 0 0 0 0 0 0 0 0 0 ...
## $ C : int 0 0 0 0 0 3 1 0 0 1 ...
## $ C..1 : int 0 0 1 0 0 1 0 0 0 0 ...
## $ D. : int 0 0 0 0 0 0 1 0 0 0 ...
## $ D : int 0 0 0 1 0 1 2 0 0 0 ...
## $ D..1 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ F : int 1 0 1 1 0 0 5 0 0 0 ...
## $ W : int 0 1 0 0 0 0 1 0 0 0 ...
## $ Students : int 22 23 21 33 33 27 192 27 24 22 ...
## $ Primary.Instructor: chr "Shin, Jeongsu" "Shin, Jeongsu" "Lee, Sabrina Y" "Sawada, Emilia" ...
data %>% count()
## n
## 1 69069
#The dataset contains over 56,000 course records from 2010 to 2020.
data <- data %>%
filter(Year >= 2010 & Year < 2021)
colSums(is.na(data))
## Year Term YearTerm Subject
## 0 0 0 0
## Number Course.Title Sched.Type A.
## 0 0 0 0
## A A..1 B. B
## 0 0 0 0
## B..1 C. C C..1
## 0 0 0 0
## D. D D..1 F
## 0 0 0 0
## W Students Primary.Instructor
## 0 0 0
summary(data == "")
## Year Term YearTerm Subject
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:56465 FALSE:56465 FALSE:56465 FALSE:56465
##
## Number Course.Title Sched.Type A.
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:56465 FALSE:56465 FALSE:46412 FALSE:56465
## TRUE :10053
## A A..1 B. B
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:56465 FALSE:56465 FALSE:56465 FALSE:56465
##
## B..1 C. C C..1
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:56465 FALSE:56465 FALSE:56465 FALSE:56465
##
## D. D D..1 F
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:56465 FALSE:56465 FALSE:56465 FALSE:56465
##
## W Students Primary.Instructor
## Mode :logical Mode :logical Mode :logical
## FALSE:56465 FALSE:56465 FALSE:56301
## TRUE :164
data <- na.omit(data)
#data %>% count()
math <- data %>% filter(data$Subject == "MATH")
#🔍 Insight 1: Which Courses Are Failing Students?
#I first looked at F grades in Math—a high-volume, foundational subject. The question: Are some courses disproportionately hard—or under-supported?
ggplot(math, aes(x = F)) +
geom_histogram(binwidth = 1, fill = "blue", color = "black") +
labs(title = " Distribution Of F Grades", x = "Number of F Grades", y = "Frequency of Courses")

#🔎 Observation:
# A small number of Math courses had 30+ F grades per term, highlighting potential course difficulty or gaps in student support.
# Let us dig deeper.
top_f_subjects <- data %>%
group_by(Subject) %>%
summarise(Total_Fs = sum(F, na.rm = TRUE)) %>%
arrange(desc(Total_Fs)) %>%
slice_head(n = 10)
ggplot(top_f_subjects, aes(x = reorder(Subject, Total_Fs), y = Total_Fs)) +
geom_col(fill = "firebrick") +
coord_flip() +
labs(title = "Top 10 Subjects by Total F Grades (2010–2020)",
x = "Subject", y = "Total F Grades")

#It turns out MATH isn’t alone. Several departments show disproportionately high F rates. These patterns likely reflect curriculum rigor—or under-supported students. Should these courses be flagged for targeted support or structural redesign?
#Question:
# Instructor Grading Variance (ECON Case Study)
#New idea: Compare grading variation within a single high-enrollment subject.
#Let’s look at ECON— a popular, large classes.
econ <- data %>% filter(Subject == "ECON")
#“Chart excludes extreme outliers above 10 F grades to highlight instructor-level variation within normal range.”
ggplot(econ, aes(y = reorder(Primary.Instructor, F, FUN = median), x = F)) +
geom_boxplot(fill = "darkgreen", outlier.shape = NA) +
geom_jitter(width = 0.2, alpha = 0.4, size = 0.8) +
coord_flip() +
labs(
title = "Distribution of F Grades by Instructor (ECON)",
subtitle = "Instructor-level failure rate variance within ECON courses",
y = "Instructor",
x = "Number of F Grades"
) + coord_cartesian(xlim = c(0, 25)) +
theme_minimal()
## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.

# Summary Statistics
#The bar graph below depicts the number of students attending courses according to the term. Through the graph, it can be observed that most number of students attended the Fall term while the least number of students attended the Winter Term.
#Term-Based Enrollment Patterns
ggplot(data, aes(x= Term)) +
geom_bar(fill = "skyblue") +
labs(title = "Student Enrollment by Academic Term", x = "Term", y = "Number of Students")

#Fall consistently sees the highest enrollment, while Winter is the least active term.
#The contingency table below depicts the number of courses of each subject offered and divides them by the term in which they were offered.
table <- table(data$Subject, data$Term)
table
##
## Fall Spring Summer Winter
## AAS 93 88 11 0
## ABE 77 56 5 0
## ACCY 1167 959 214 7
## ACE 442 423 48 6
## ACES 135 7 11 0
## ADV 235 199 32 1
## AE 196 144 17 1
## AFRO 64 56 8 1
## AFST 7 0 5 0
## AGCM 16 15 4 0
## AGED 73 54 9 0
## AHS 22 4 5 0
## AIS 36 33 4 0
## ALEC 1 0 0 0
## ANSC 285 276 40 3
## ANTH 383 137 78 3
## ARAB 16 12 2 0
## ARCH 200 170 26 1
## ART 44 43 16 4
## ARTD 114 77 13 1
## ARTE 4 3 0 0
## ARTF 18 3 2 0
## ARTH 70 62 10 0
## ARTS 10 8 2 0
## ASRM 21 12 0 0
## ASTR 94 87 18 7
## ATMS 98 88 21 7
## AVI 6 8 2 0
## BADM 1026 883 245 41
## BCOG 1 1 0 0
## BCS 5 2 0 0
## BIOC 7 23 1 0
## BIOE 124 100 9 0
## BIOP 3 1 1 0
## BTW 273 259 38 1
## BUS 258 22 23 0
## CB 0 2 0 0
## CEE 445 351 53 1
## CHBE 134 118 13 0
## CHEM 903 692 142 4
## CHIN 1 1 0 0
## CHLH 440 424 96 9
## CHP 10 1 0 0
## CI 303 204 46 0
## CINE 2 7 0 0
## CLCV 270 230 37 4
## CMN 660 622 159 5
## CPSC 135 121 15 0
## CS 926 718 136 16
## CW 14 14 0 0
## CWL 39 19 2 0
## DANC 111 115 18 6
## EALC 45 38 7 0
## ECE 861 770 145 7
## ECON 1593 1388 237 38
## EDUC 56 34 2 0
## EIL 10 15 10 0
## ENG 177 71 25 0
## ENGL 517 427 67 0
## ENSU 2 2 5 0
## ENVS 0 4 0 0
## EOL 29 20 6 3
## EPOL 8 2 1 1
## EPS 91 67 38 4
## EPSY 159 133 29 1
## ESE 31 30 8 0
## ESL 1 4 0 0
## EURO 4 4 0 0
## FAA 14 8 3 1
## FIN 707 637 112 5
## FR 67 68 9 1
## FSHN 248 189 33 1
## GCL 22 12 0 0
## GE 80 77 11 0
## GEOG 161 149 33 3
## GEOL 112 110 35 4
## GER 80 71 14 0
## GLBL 93 85 7 0
## GRKM 1 1 0 0
## GS 339 13 23 0
## GWS 94 83 18 0
## HDFS 143 123 27 1
## HIST 553 415 91 2
## HNDI 1 0 0 0
## HORT 108 88 13 0
## HRD 12 22 6 1
## HRE 14 11 4 0
## IB 276 239 46 2
## IE 138 110 10 0
## IHLT 48 36 0 0
## INFO 32 36 3 0
## IS 258 203 77 6
## ITAL 30 23 9 0
## JAPN 13 0 2 1
## JOUR 71 52 11 1
## JS 2 0 0 0
## KIN 496 437 67 2
## KOR 10 4 1 0
## LA 66 42 11 0
## LAS 727 11 66 0
## LAST 42 34 7 0
## LAT 2 3 1 0
## LAW 457 324 45 0
## LEAD 6 8 1 1
## LER 372 314 56 3
## LING 135 94 12 1
## LLS 74 64 7 1
## MACS 282 227 35 3
## MATH 1177 926 214 11
## MBA 158 165 29 10
## MCB 898 886 130 4
## MDIA 46 26 1 0
## ME 325 293 32 2
## MILS 4 0 1 0
## MS 20 49 0 0
## MSE 202 168 18 0
## MUS 284 212 48 12
## MUSE 6 8 1 0
## NPRE 121 129 16 0
## NRES 143 120 15 0
## NS 1 2 0 0
## NUTR 10 5 0 0
## PATH 0 16 0 0
## PHIL 270 218 35 1
## PHYS 401 362 84 6
## PLPA 20 20 2 0
## POL 5 3 1 0
## PORT 2 2 1 0
## PS 307 267 72 7
## PSM 6 3 0 0
## PSYC 857 684 135 19
## REES 0 3 0 0
## REHB 23 21 2 0
## REL 94 65 4 2
## RHET 181 145 72 0
## RLST 220 193 43 0
## RSOC 0 3 0 0
## RST 296 278 54 5
## RUSS 6 4 0 0
## SAME 3 0 0 0
## SBC 16 8 2 1
## SCAN 38 18 6 0
## SCR 0 2 0 0
## SE 42 31 0 0
## SHS 342 321 47 1
## SLAV 6 3 1 0
## SLCL 1 2 0 0
## SOC 329 329 66 11
## SOCW 380 260 82 5
## SPAN 142 136 62 9
## SPED 122 84 22 1
## STAT 377 333 79 9
## SWAH 2 3 0 0
## TAM 149 122 38 3
## TE 40 27 0 0
## THEA 269 236 31 3
## TRST 3 4 0 0
## TSM 91 85 8 0
## UKR 0 4 0 0
## UP 131 151 16 0
## VB 1 3 0 0
## VCM 14 22 1 0
## VM 51 43 4 0
## YDSH 10 2 2 0
#This contingency table shows how many courses were offered per subject across Fall, Spring, Summer, and Winter terms over the 10-year period.
AAS <- data %>% filter(data$Subject == "AAS")
#The box plots below showcases the number of B grades distributed by the each of the courses of the Asian American Studies subjects.
ggplot(AAS, aes(x = Number, y = B, group = Number)) +
geom_boxplot(color = "grey") +
labs(title = "B Grades vs AAS Course Number", y = "Number of B Grades", x = "Course Number")

#Mean
#What is the average number of Students in each class?
average_students <- mean(data$Students)
average_students
## [1] 57.06675
#Explanation: The average class size at UIUC is 57 students, though the high standard deviation (≈72) reflects wide variance between small seminars and large lectures.
#Median
#What is the middle course number for ACCY ?
accy_data <- subset(data, Subject == "ACCY")
median_number <- median(accy_data$Number)
median_number
## [1] 312
#Explanation: The middle course for ACCY is 312, this means that 312 is in the middle of all of the ACCY classes. Perhaps suggesting it's a midpoint between lower- and upper-level ACCY courses..
#Mode
#Which professor teaches the most classes?
find_mode_categorical <- function(x) {
tbl <- table(x)
modes <- names(tbl)[which.max(tbl)]
return(modes)
}
mode_instructor <- find_mode_categorical(data$Primary.Instructor)
mode_instructor
## [1] "Hoffman, Ruth A"
#The most frequently appearing instructor in the dataset was Soranso, Murilli S, suggesting a high teaching load across multiple terms.
#What is the Highest and Lowest course numbers?
#Max
Max_course_number <- max(data$Number)
Max_course_number
## [1] 798
#Min
Min_course_number <- min(data$Number)
Min_course_number
## [1] 2
#Explanation: As we can see the course numbers range from 2-798, this shows that the lowest course number is 2 whilst the highest course number is 798
#Standard Deviation
#What is the variance in class size at UIUC?
Students_standard_deviation <- sd(data$Students)
Students_standard_deviation
## [1] 72.06524
#Explanation: These is standard deviation of 73.64196 in the students section of this dataset. This means that the number of students in each class varys by about 73.64196 from the average number of students. The large standard deviation highlights UIUC’s mix of small seminars and large lecture halls—showing structural variety in how courses are delivered..
#KEY INSIGHTS
#Math and CHEM courses had the highest number of F grades.
#Instructor grading patterns vary widely in high-enrollment departments.
#Fall term consistently had the highest student enrollment.
#Course difficulty does not always align with course number.
#CONCLUSION
#This analysis reveals key grading and enrollment trends across 10 years at UIUC. While some courses and departments consistently show tougher grading patterns, instructor variation also plays a role. Missing demographic data (e.g. gender, ethnicity) and external factors (like socioeconomic status or mental health) limit deeper insight—but the dataset still offers meaningful views on academic structure, fairness, and performance trends.
#This analysis can inform academic advising, curriculum planning, and support program targeting.
#Limitations
#The dataset does not include student demographic information, course prerequisites, or grade distributions per student. External factors like academic background, socioeconomic status, or health conditions are not captured, which limits causal inference.
#Call to Action
#This analysis could be expanded with student-level data or tied to advising strategies. I’d be excited to bring this approach to real institutional planning or policy evaluation.