R Markdown

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
## BUSINESS QUESTION
#What can 10 years of GPA data at UIUC reveal about grade distribution, course difficulty, and fairness across departments and instructors?

#Most students experience GPA as a number. Departments see it as a metric

#Grades are often treated as fixed outcomes—but when viewed at scale, they reflect broader academic systems, teaching practices, and student support structures.

#This project was a deep dive into 10 years of GPA data from the University of Illinois (2010–2020)—not just to visualize the past, but to identify opportunities for a better academic future.


#METHODOLOGY
#The dataset includes over 56,000 course-level records from the University of Illinois, covering the years 2010 to 2020. I cleaned the data using tidyverse, removed missing values, and performed exploratory analysis using grouping, histograms, boxplots, and contingency tables. Visualization was done using ggplot2.


data <- read.csv("/Users/jayeshghosh/Desktop/gpa.csv")


#Skills Used:
#R, Tidyverse, ggplot2, Data Cleaning, Statistical Summaries, Data Visualization, Educational Analytics
summary(data)
##       Year          Term             YearTerm           Subject         
##  Min.   :2010   Length:69069       Length:69069       Length:69069      
##  1st Qu.:2013   Class :character   Class :character   Class :character  
##  Median :2016   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2016                                                           
##  3rd Qu.:2020                                                           
##  Max.   :2023                                                           
##      Number      Course.Title        Sched.Type              A.         
##  Min.   :  2.0   Length:69069       Length:69069       Min.   :  0.000  
##  1st Qu.:133.0   Class :character   Class :character   1st Qu.:  0.000  
##  Median :272.0   Mode  :character   Mode  :character   Median :  2.000  
##  Mean   :293.5                                         Mean   :  7.269  
##  3rd Qu.:424.0                                         3rd Qu.:  7.000  
##  Max.   :798.0                                         Max.   :929.000  
##        A                A..1               B.                B          
##  Min.   :   0.00   Min.   :  0.000   Min.   :  0.000   Min.   :  0.000  
##  1st Qu.:   6.00   1st Qu.:  2.000   1st Qu.:  1.000   1st Qu.:  2.000  
##  Median :  11.00   Median :  5.000   Median :  4.000   Median :  4.000  
##  Mean   :  18.79   Mean   :  7.618   Mean   :  5.879   Mean   :  7.369  
##  3rd Qu.:  20.00   3rd Qu.:  9.000   3rd Qu.:  7.000   3rd Qu.:  8.000  
##  Max.   :1034.00   Max.   :351.000   Max.   :210.000   Max.   :250.000  
##       B..1               C.                C                C..1       
##  Min.   :  0.000   Min.   :  0.000   Min.   :  0.000   Min.   : 0.000  
##  1st Qu.:  0.000   1st Qu.:  0.000   1st Qu.:  0.000   1st Qu.: 0.000  
##  Median :  1.000   Median :  1.000   Median :  1.000   Median : 0.000  
##  Mean   :  3.182   Mean   :  2.144   Mean   :  2.705   Mean   : 1.169  
##  3rd Qu.:  4.000   3rd Qu.:  2.000   3rd Qu.:  3.000   3rd Qu.: 1.000  
##  Max.   :144.000   Max.   :138.000   Max.   :111.000   Max.   :66.000  
##        D.                D              D..1               F        
##  Min.   : 0.0000   Min.   : 0.00   Min.   : 0.0000   Min.   : 0.00  
##  1st Qu.: 0.0000   1st Qu.: 0.00   1st Qu.: 0.0000   1st Qu.: 0.00  
##  Median : 0.0000   Median : 0.00   Median : 0.0000   Median : 0.00  
##  Mean   : 0.5979   Mean   : 0.82   Mean   : 0.3607   Mean   : 1.18  
##  3rd Qu.: 0.0000   3rd Qu.: 1.00   3rd Qu.: 0.0000   3rd Qu.: 1.00  
##  Max.   :45.0000   Max.   :61.00   Max.   :29.0000   Max.   :75.00  
##        W            Students       Primary.Instructor
##  Min.   : 0.00   Min.   :  21.00   Length:69069      
##  1st Qu.: 0.00   1st Qu.:  26.00   Class :character  
##  Median : 0.00   Median :  34.00   Mode  :character  
##  Mean   : 0.23   Mean   :  59.08                     
##  3rd Qu.: 0.00   3rd Qu.:  55.00                     
##  Max.   :16.00   Max.   :1913.00
str(data)
## 'data.frame':    69069 obs. of  23 variables:
##  $ Year              : int  2023 2023 2023 2023 2023 2023 2023 2023 2023 2023 ...
##  $ Term              : chr  "Spring" "Spring" "Spring" "Spring" ...
##  $ YearTerm          : chr  "2023-sp" "2023-sp" "2023-sp" "2023-sp" ...
##  $ Subject           : chr  "AAS" "AAS" "AAS" "AAS" ...
##  $ Number            : int  100 100 100 200 215 141 152 232 425 457 ...
##  $ Course.Title      : chr  "Intro Asian American Studies" "Intro Asian American Studies" "Intro Asian American Studies" "U.S. Race and Empire" ...
##  $ Sched.Type        : chr  "DIS" "DIS" "DIS" "LCD" ...
##  $ A.                : int  0 0 0 6 16 0 55 3 0 1 ...
##  $ A                 : int  11 17 13 15 12 5 115 13 22 14 ...
##  $ A..1              : int  5 2 2 5 2 4 0 5 0 4 ...
##  $ B.                : int  1 1 2 2 1 6 4 5 2 0 ...
##  $ B                 : int  4 2 0 3 1 5 9 1 0 2 ...
##  $ B..1              : int  0 1 2 0 1 2 0 0 0 0 ...
##  $ C.                : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ C                 : int  0 0 0 0 0 3 1 0 0 1 ...
##  $ C..1              : int  0 0 1 0 0 1 0 0 0 0 ...
##  $ D.                : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ D                 : int  0 0 0 1 0 1 2 0 0 0 ...
##  $ D..1              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ F                 : int  1 0 1 1 0 0 5 0 0 0 ...
##  $ W                 : int  0 1 0 0 0 0 1 0 0 0 ...
##  $ Students          : int  22 23 21 33 33 27 192 27 24 22 ...
##  $ Primary.Instructor: chr  "Shin, Jeongsu" "Shin, Jeongsu" "Lee, Sabrina Y" "Sawada, Emilia" ...
data %>% count()
##       n
## 1 69069
#The dataset contains over 56,000 course records from 2010 to 2020.

data <- data %>%
  filter(Year >= 2010 & Year < 2021)

colSums(is.na(data))
##               Year               Term           YearTerm            Subject 
##                  0                  0                  0                  0 
##             Number       Course.Title         Sched.Type                 A. 
##                  0                  0                  0                  0 
##                  A               A..1                 B.                  B 
##                  0                  0                  0                  0 
##               B..1                 C.                  C               C..1 
##                  0                  0                  0                  0 
##                 D.                  D               D..1                  F 
##                  0                  0                  0                  0 
##                  W           Students Primary.Instructor 
##                  0                  0                  0
summary(data == "")
##     Year            Term          YearTerm        Subject       
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:56465     FALSE:56465     FALSE:56465     FALSE:56465    
##                                                                 
##    Number        Course.Title    Sched.Type          A.         
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:56465     FALSE:56465     FALSE:46412     FALSE:56465    
##                                  TRUE :10053                    
##      A              A..1             B.              B          
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:56465     FALSE:56465     FALSE:56465     FALSE:56465    
##                                                                 
##     B..1             C.              C              C..1        
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:56465     FALSE:56465     FALSE:56465     FALSE:56465    
##                                                                 
##      D.              D              D..1             F          
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:56465     FALSE:56465     FALSE:56465     FALSE:56465    
##                                                                 
##      W            Students       Primary.Instructor
##  Mode :logical   Mode :logical   Mode :logical     
##  FALSE:56465     FALSE:56465     FALSE:56301       
##                                  TRUE :164
data <- na.omit(data)

#data %>% count()
math <- data %>% filter(data$Subject == "MATH")



#🔍 Insight 1: Which Courses Are Failing Students?

#I first looked at F grades in Math—a high-volume, foundational subject. The question: Are some courses disproportionately hard—or under-supported?
ggplot(math, aes(x = F)) +
  geom_histogram(binwidth = 1, fill = "blue", color = "black") + 
  labs(title = " Distribution Of F Grades", x = "Number of F Grades", y = "Frequency of Courses")

#🔎 Observation:

#    A small number of Math courses had 30+ F grades per term, highlighting potential course difficulty or gaps in student support.

# Let us dig deeper.


top_f_subjects <- data %>%
  group_by(Subject) %>%
  summarise(Total_Fs = sum(F, na.rm = TRUE)) %>%
  arrange(desc(Total_Fs)) %>%
  slice_head(n = 10)

ggplot(top_f_subjects, aes(x = reorder(Subject, Total_Fs), y = Total_Fs)) +
  geom_col(fill = "firebrick") +
  coord_flip() +
  labs(title = "Top 10 Subjects by Total F Grades (2010–2020)",
       x = "Subject", y = "Total F Grades")

#It turns out MATH isn’t alone. Several departments show disproportionately high F rates. These patterns likely reflect curriculum rigor—or under-supported students. Should these courses be flagged for targeted support or structural redesign?

#Question:
  
#  Instructor Grading Variance (ECON Case Study)

#New idea: Compare grading variation within a single high-enrollment subject.

#Let’s look at ECON— a popular, large classes.

econ <- data %>% filter(Subject == "ECON")

#“Chart excludes extreme outliers above 10 F grades to highlight instructor-level variation within normal range.”

ggplot(econ, aes(y = reorder(Primary.Instructor, F, FUN = median), x = F)) +
  geom_boxplot(fill = "darkgreen", outlier.shape = NA) +
  geom_jitter(width = 0.2, alpha = 0.4, size = 0.8) +
  coord_flip() +
  labs(
    title = "Distribution of F Grades by Instructor (ECON)",
    subtitle = "Instructor-level failure rate variance within ECON courses",
    y = "Instructor",
    x = "Number of F Grades"
  ) +  coord_cartesian(xlim = c(0, 25)) +
  theme_minimal()
## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.

# Summary Statistics


#The bar graph below depicts the number of students attending courses according to the term. Through the graph, it can be observed that most number of students attended the Fall term while the least number of students attended the Winter Term.

#Term-Based Enrollment Patterns
ggplot(data, aes(x= Term)) +
  geom_bar(fill = "skyblue") +
  labs(title = "Student Enrollment by Academic Term", x = "Term", y = "Number of Students")

#Fall consistently sees the highest enrollment, while Winter is the least active term.

#The contingency table below depicts the number of courses of each subject offered and divides them by the term in which they were offered.  
table <- table(data$Subject, data$Term)
table
##       
##        Fall Spring Summer Winter
##   AAS    93     88     11      0
##   ABE    77     56      5      0
##   ACCY 1167    959    214      7
##   ACE   442    423     48      6
##   ACES  135      7     11      0
##   ADV   235    199     32      1
##   AE    196    144     17      1
##   AFRO   64     56      8      1
##   AFST    7      0      5      0
##   AGCM   16     15      4      0
##   AGED   73     54      9      0
##   AHS    22      4      5      0
##   AIS    36     33      4      0
##   ALEC    1      0      0      0
##   ANSC  285    276     40      3
##   ANTH  383    137     78      3
##   ARAB   16     12      2      0
##   ARCH  200    170     26      1
##   ART    44     43     16      4
##   ARTD  114     77     13      1
##   ARTE    4      3      0      0
##   ARTF   18      3      2      0
##   ARTH   70     62     10      0
##   ARTS   10      8      2      0
##   ASRM   21     12      0      0
##   ASTR   94     87     18      7
##   ATMS   98     88     21      7
##   AVI     6      8      2      0
##   BADM 1026    883    245     41
##   BCOG    1      1      0      0
##   BCS     5      2      0      0
##   BIOC    7     23      1      0
##   BIOE  124    100      9      0
##   BIOP    3      1      1      0
##   BTW   273    259     38      1
##   BUS   258     22     23      0
##   CB      0      2      0      0
##   CEE   445    351     53      1
##   CHBE  134    118     13      0
##   CHEM  903    692    142      4
##   CHIN    1      1      0      0
##   CHLH  440    424     96      9
##   CHP    10      1      0      0
##   CI    303    204     46      0
##   CINE    2      7      0      0
##   CLCV  270    230     37      4
##   CMN   660    622    159      5
##   CPSC  135    121     15      0
##   CS    926    718    136     16
##   CW     14     14      0      0
##   CWL    39     19      2      0
##   DANC  111    115     18      6
##   EALC   45     38      7      0
##   ECE   861    770    145      7
##   ECON 1593   1388    237     38
##   EDUC   56     34      2      0
##   EIL    10     15     10      0
##   ENG   177     71     25      0
##   ENGL  517    427     67      0
##   ENSU    2      2      5      0
##   ENVS    0      4      0      0
##   EOL    29     20      6      3
##   EPOL    8      2      1      1
##   EPS    91     67     38      4
##   EPSY  159    133     29      1
##   ESE    31     30      8      0
##   ESL     1      4      0      0
##   EURO    4      4      0      0
##   FAA    14      8      3      1
##   FIN   707    637    112      5
##   FR     67     68      9      1
##   FSHN  248    189     33      1
##   GCL    22     12      0      0
##   GE     80     77     11      0
##   GEOG  161    149     33      3
##   GEOL  112    110     35      4
##   GER    80     71     14      0
##   GLBL   93     85      7      0
##   GRKM    1      1      0      0
##   GS    339     13     23      0
##   GWS    94     83     18      0
##   HDFS  143    123     27      1
##   HIST  553    415     91      2
##   HNDI    1      0      0      0
##   HORT  108     88     13      0
##   HRD    12     22      6      1
##   HRE    14     11      4      0
##   IB    276    239     46      2
##   IE    138    110     10      0
##   IHLT   48     36      0      0
##   INFO   32     36      3      0
##   IS    258    203     77      6
##   ITAL   30     23      9      0
##   JAPN   13      0      2      1
##   JOUR   71     52     11      1
##   JS      2      0      0      0
##   KIN   496    437     67      2
##   KOR    10      4      1      0
##   LA     66     42     11      0
##   LAS   727     11     66      0
##   LAST   42     34      7      0
##   LAT     2      3      1      0
##   LAW   457    324     45      0
##   LEAD    6      8      1      1
##   LER   372    314     56      3
##   LING  135     94     12      1
##   LLS    74     64      7      1
##   MACS  282    227     35      3
##   MATH 1177    926    214     11
##   MBA   158    165     29     10
##   MCB   898    886    130      4
##   MDIA   46     26      1      0
##   ME    325    293     32      2
##   MILS    4      0      1      0
##   MS     20     49      0      0
##   MSE   202    168     18      0
##   MUS   284    212     48     12
##   MUSE    6      8      1      0
##   NPRE  121    129     16      0
##   NRES  143    120     15      0
##   NS      1      2      0      0
##   NUTR   10      5      0      0
##   PATH    0     16      0      0
##   PHIL  270    218     35      1
##   PHYS  401    362     84      6
##   PLPA   20     20      2      0
##   POL     5      3      1      0
##   PORT    2      2      1      0
##   PS    307    267     72      7
##   PSM     6      3      0      0
##   PSYC  857    684    135     19
##   REES    0      3      0      0
##   REHB   23     21      2      0
##   REL    94     65      4      2
##   RHET  181    145     72      0
##   RLST  220    193     43      0
##   RSOC    0      3      0      0
##   RST   296    278     54      5
##   RUSS    6      4      0      0
##   SAME    3      0      0      0
##   SBC    16      8      2      1
##   SCAN   38     18      6      0
##   SCR     0      2      0      0
##   SE     42     31      0      0
##   SHS   342    321     47      1
##   SLAV    6      3      1      0
##   SLCL    1      2      0      0
##   SOC   329    329     66     11
##   SOCW  380    260     82      5
##   SPAN  142    136     62      9
##   SPED  122     84     22      1
##   STAT  377    333     79      9
##   SWAH    2      3      0      0
##   TAM   149    122     38      3
##   TE     40     27      0      0
##   THEA  269    236     31      3
##   TRST    3      4      0      0
##   TSM    91     85      8      0
##   UKR     0      4      0      0
##   UP    131    151     16      0
##   VB      1      3      0      0
##   VCM    14     22      1      0
##   VM     51     43      4      0
##   YDSH   10      2      2      0
#This contingency table shows how many courses were offered per subject across Fall, Spring, Summer, and Winter terms over the 10-year period.

AAS <- data %>% filter(data$Subject == "AAS")

#The box plots below showcases the number of B grades distributed by the each of the courses of the Asian American Studies subjects.
ggplot(AAS, aes(x = Number, y = B, group = Number)) +
  geom_boxplot(color = "grey") +
  labs(title = "B Grades vs AAS Course Number", y = "Number of B Grades", x = "Course Number")

#Mean
#What is the average number of Students in each class?
average_students <- mean(data$Students)
average_students
## [1] 57.06675
#Explanation: The average class size at UIUC is 57 students, though the high standard deviation (≈72) reflects wide variance between small seminars and large lectures.

#Median
#What is the middle course number for ACCY ?
accy_data <- subset(data, Subject == "ACCY")
median_number <- median(accy_data$Number)
median_number
## [1] 312
#Explanation: The middle course for ACCY is 312, this means that 312 is in the middle of all of the ACCY classes. Perhaps suggesting it's a midpoint between lower- and upper-level ACCY courses..

#Mode
#Which professor teaches the most classes?
find_mode_categorical <- function(x) {
  tbl <- table(x)
  modes <- names(tbl)[which.max(tbl)]
  return(modes)
}
mode_instructor <- find_mode_categorical(data$Primary.Instructor)
mode_instructor
## [1] "Hoffman, Ruth A"
#The most frequently appearing instructor in the dataset was Soranso, Murilli S, suggesting a high teaching load across multiple terms.

#What is the Highest and Lowest course numbers?
#Max
Max_course_number <- max(data$Number)
Max_course_number
## [1] 798
#Min
Min_course_number <- min(data$Number)
Min_course_number
## [1] 2
#Explanation: As we can see the course numbers range from 2-798, this shows that the lowest course number is 2 whilst the highest course number is 798

#Standard Deviation
#What is the variance in class size at UIUC?
Students_standard_deviation <- sd(data$Students)
Students_standard_deviation
## [1] 72.06524
#Explanation: These is standard deviation of 73.64196 in the students section of this dataset. This means that the number of students in each class varys by about 73.64196 from the average number of students. The large standard deviation highlights UIUC’s mix of small seminars and large lecture halls—showing structural variety in how courses are delivered.. 


#KEY INSIGHTS
#Math and CHEM courses had the highest number of F grades.
#Instructor grading patterns vary widely in high-enrollment departments.
#Fall term consistently had the highest student enrollment.
#Course difficulty does not always align with course number.


#CONCLUSION

#This analysis reveals key grading and enrollment trends across 10 years at UIUC. While some courses and departments consistently show tougher grading patterns, instructor variation also plays a role. Missing demographic data (e.g. gender, ethnicity) and external factors (like socioeconomic status or mental health) limit deeper insight—but the dataset still offers meaningful views on academic structure, fairness, and performance trends.

#This analysis can inform academic advising, curriculum planning, and support program targeting.

#Limitations
#The dataset does not include student demographic information, course prerequisites, or grade distributions per student. External factors like academic background, socioeconomic status, or health conditions are not captured, which limits causal inference.


#Call to Action
#This analysis could be expanded with student-level data or tied to advising strategies. I’d be excited to bring this approach to real institutional planning or policy evaluation.