Who has Championship Games with Larger Point Differentials and More Blowouts: NCAA or NFL?

7 minute read

To understand why I performed these analyses and made this notebook, it might help to have some context. So recently, my boss and I were discussing the 2020-2021 Super Bowl game. During our conversation, we had mentioned how surprised we were by the large point differential between the Tampa Bay Buccaneers and Kansas City Chiefs. Tampa Bay beat Kansas by 22 points, which to me seemed like a relatively large point differential for a professional football game. Now I will admit that I tend to watch more college than professional football, and thus have more experience with NCAA games. In the NCAA you can have large point differentials in games (i.e. Clemson beat Alabama by 28 points during the 2018-2019 year). This may be a result of particular conferences monopolizing better players/recruiting classes due to reputation.

All of this made me think that maybe the NCAA, on average, has larger point differentials in championship games than in professional football. Also, maybe it was possible the the NCAA had more blowout championship games than the NFL. So I wanted to test these hypotheses by gathering Super Bowl and NCAA championship data.

Load libraries

library(rvest) # For web scraping
library(tidyverse) # For Data cleaning
library(infer) #For Permutation testing

NFL Data

Web scrape Super Bowl data

NFL Data obtained from - https://www.pro-football-reference.com/super-bowl/

superbowl_df <- html_session('https://www.pro-football-reference.com/super-bowl/') %>%
  read_html() %>%
  html_node(xpath = '//*[@id="super_bowls"]') %>%
  html_table()

#Change column names
colnames(superbowl_df)[c(4, 6)] <- c('winning_team', 'losing_team')
head(superbowl_df)

	Date <chr>	SB <chr>	Winner <chr>	winning_team <int>	Loser <chr>
1	Feb 7, 2021	LV (55)	Tampa Bay Buccaneers	31	Kansas City Chiefs
2	Feb 2, 2020	LIV (54)	Kansas City Chiefs	31	San Francisco 49ers
3	Feb 3, 2019	LIII (53)	New England Patriots	13	Los Angeles Rams
4	Feb 4, 2018	LII (52)	Philadelphia Eagles	41	New England Patriots
5	Feb 5, 2017	LI (51)	New England Patriots	34	Atlanta Falcons
6	Feb 7, 2016	50	Denver Broncos	24	Carolina Panthers

Filter data and calculate point differentials for Super Bowl

Please note that I will only be looking at data from 1990 to 2019

superbowl_filtered <- superbowl_df %>%
  mutate(year = as.integer(str_extract_all(Date, "[0-9]{4}")), point_differential =  winning_team - losing_team) %>%
  filter(year >= 1990 & year <= 2019) %>%
  select(year, point_differential)

Calculate descriptive statistics

summary(superbowl_filtered$point_differential)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    4.00   10.00   12.80   14.75   45.00

NCAA Data

Web scrape NCAA football National championship data

NCAA Data obtained from - http://championshiphistory.com/ncaafootball.php

NCAA_df <- html_session('http://championshiphistory.com/ncaafootball.php') %>%
  read_html() %>%
  html_node(xpath = '//*[@id="tablesorter-demo"]') %>%
  html_table(fill = T)

NCAA_df <- na.omit(NCAA_df)
head(NCAA_df)

	Year <int>	Era <chr>	#1 CollegeTeam <chr>	RunnerUp <chr>	Championship Score <chr>
1	2019	CFP	LSU Tigers	Clemson	42-25
3	2018	CFP	Clemson	Alabama	44-16
5	2017	CFP	Alabama	Georgia	26-23
7	2016	CFP	Clemson	Alabama	35-31
9	2015	CFP	Alabama	Clemson	45-40
11	2014	CFP	Ohio State	Oregon	42-20

Parse out championship scores for NCAA

# Calculate number of rows in NCAA data frame
num_games <- nrow(NCAA_df)

#Create variable for storing winning team point results
winning_team <- NULL

#Create variable for storing losing team point results
losing_team <- NULL

# Create variable for storing year
year <- NULL
for(curr_game in 1:num_games){
  
  #Create if statement for detecting rows in championship score column without scores
  if(str_length(NCAA_df$`Championship Score`[curr_game]) != 0){
  scores <- str_split(NCAA_df$`Championship Score`[curr_game], "-")[[1]]
  winning_team[curr_game] <- scores[1]
  losing_team[curr_game] <- scores[2]
  year[curr_game] <- NCAA_df$Year[curr_game]
  }
  #If there are no scores present skip iteration
  else{
    next
  }

}

# Create data frame from variables that had data appended to them in for loop
NCAA_scores <- data.frame(year = unlist(year), winning_score = as.integer(unlist(winning_team)), losing_score = as.integer(unlist(losing_team)))

Calculate NCAA point differentials

NCAA_filtered <- NCAA_scores %>%
  filter(year >= 1990 & year <= 2019) %>%
  mutate(point_differential =  winning_score - losing_score) %>%
  select(year, point_differential)

head(NCAA_filtered)

Calculate descriptive statistics

summary(NCAA_filtered$point_differential)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    5.00   12.50   14.67   22.00   38.00

Combine Superbowl and NCAA filtered data frames for data visualization

# Add Super Bowl Label
superbowl_filtered$organization <- "NFL"

# Add NCAA Label
NCAA_filtered$organization <- "NCAA"

#Combine data
all_data <- rbind(superbowl_filtered, NCAA_filtered)
head(all_data)

Graph data

# Create boxplot
ggplot(data = all_data, aes(x = organization, y = point_differential, color = organization)) +
  geom_boxplot(lwd = 1) +
  geom_jitter() +
  ggtitle("Distribution of Super Bowl and NCAA Championship Point Differentials") +
  ylab("Point Differential") +
  xlab("Organization") +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(), 
        axis.line = element_line(colour = 'black'), 
        title = element_text(hjust = 0.5, face = 'bold'),
        axis.text.x = element_text(face = 'bold', size = 10, color = 'black'),
        axis.text.y = element_text(face = 'bold', size = 10, color = 'black'),
        axis.title = element_text(color = 'black', face = 'bold'))

png

From the above figure, we can see that NFL Super Bowl games have a slightly lower median compared to the NCAA games. It is important to realize that this does not mean that there is statistically significant difference between the two groups. In order to determine if this difference is meaningful we will need to perform some inferential statistics.

We can use permutation testing to see if there is a meaningful difference between the means of the NFL and NCAA point differentials. Briefly, permutation testing allows us to create a null distribution of a statistic of interest (In this case the difference in average point differentials) by shuffling the labels (NCAA and NFL) belonging to the point differentials. We then can generate the mean difference from the resulting shuffle and do this repeatedly.

Permutation Testing: Is there a meaningful difference in the average point differential between the NCAA and NFL championship games?

Calculate observed mean difference in point differential between the NFL and NCAA games

obs_diff <- all_data %>%
  specify(point_differential ~ organization) %>%
  calculate(stat = "diff in means", order = c("NCAA", "NFL"))

#Show observed difference  
obs_diff

stat <dbl>
1.866667

Create Null distribution

null_distrubution <- all_data %>%
  specify(point_differential ~ organization) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("NCAA", "NFL"))

Visualize the null distribution and calculate p-value

#visualize null distribution
visualise(null_distrubution, bins = 15) +
  shade_p_value(obs_stat = obs_diff, direction = "right")

png

Calculate p-value

#Calculate p-value
null_distrubution %>%
  get_p_value(obs_stat = obs_diff, direction = 'right')

p_value <dbl>
0.246

Since the p-value is greater than 0.5, we fail to reject the null hypothesis that the average point differential for Super Bowl games is the same as the average point differential for NCAA championship games. Thus, there is no meaningful difference between the two means.

Is There a Difference In the Proportion of “Blowout” Championship Games Between NCAA and NFL

Though there does not seem to be any agreement on what is a blowout game, it looks like many individuals agree with a number around 17 and 20 plus points.

I am going to make assumption that a 20 point differential is what constitutes a blowout game.

Calculate the number of blowout games

blowouts <- all_data %>%
  mutate(game_status = if_else(point_differential >= 20, true = "blowout", false = "non-blowout")) %>%
  group_by(organization) %>%
  count(game_status) 
  

Visualize data

ggplot(data = blowouts, aes(x = organization, y = n, fill = game_status)) +
  geom_col(color = 'black') +
    ggtitle("Number of Games that were blowout and non-blowout games") +
  ylab("Frequency") +
  xlab("Organization") +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(), 
        axis.line = element_line(colour = 'black'), 
        title = element_text(hjust = 0.5, face = 'bold'),
        axis.text.x = element_text(face = 'bold', size = 10, color = 'black'),
        axis.text.y = element_text(face = 'bold', size = 10, color = 'black'),
        axis.title = element_text(color = 'black', face = 'bold'))

png

Based on the above figure, there seems to be a larger proportion of NCAA championship games that ended up with a blowout game.

I thought it would be a good idea to calculate the actual percentage of games that were blowouts for both organizations.

Calculate difference in proportion of blowout game between NFL and NCAA Games

blowouts %>%
  group_by(organization) %>%
  mutate(group_total = sum(n)) %>%
  ungroup() %>%
  group_by(organization, game_status) %>%
  summarise(percentage = round((n/group_total)*100, 2)) %>%
  filter(game_status == "blowout")

organization <chr>	game_status <chr>	percentage <dbl>
NCAA	blowout	36.67
NFL	blowout	20.00

So, we can see that the proportion of games that I considered “blowouts” for the NCAA and NFL championship games was 36.67% and 20% respectively

This means that blowout games occurred 16.67% more in NCAA title games than in Super Bowl games in our sample.

Again, the question is whether or not this is statistically meaningful.

Permutation Testing: Is there a meaningful difference in the proportion of championship blowout games Between NCAA and NFL?

Calculate the observed difference in proportion of championship blowout games between NCAA and NFL

#create new df
blowouts <- all_data %>%
  mutate(game_status = if_else(point_differential >= 20, true = "blowout", false = "non-blowout"))

#Calculate obs difference
obs_diff <- blowouts %>%
  specify(game_status ~ organization, success = "blowout") %>%
  calculate(stat = "diff in props", order = c("NCAA", "NFL"))

#Show observed difference  
obs_diff

stat <dbl>
0.1666667

Create Null distribution

null_distrubution <- blowouts %>%
  specify(game_status ~ organization, success = "blowout") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in props", order = c("NCAA", "NFL"))

Visualize the null distribution and calculate p-value

set.seed(1234)

#visualize null distribution
visualise(null_distrubution, bins = 10) +
  shade_p_value(obs_stat = obs_diff, direction = "right")

png

Calculate p-value

#Calculate p-value
null_distrubution %>%
  get_p_value(obs_stat = obs_diff, direction = 'right')

p_value <dbl>
0.121

As with the point differential section, we fail to reject the null hypothesis p > 0.5.

Concluding Remarks

Well, this was a really fun experiment and definitely shows why it is important to perform these kinds of analyses. We can all be subject to the availability heuristic, which makes us evaluate certain topics based on examples that come readily to our minds. I felt really confident that college championship games would more likely have larger point differentials or even more blowout games, but this was not the case.

Hope you enjoyed this notebook and feel free to email me if you have any questions about these data or the code!

Twitter Facebook Google+ LinkedIn

Steven Pilato

Hey everyone! So, I wanted to share an analysis I performed on NCAA and NFL championship data.

Load libraries

NFL Data

Web scrape Super Bowl data

NFL Data obtained from - https://www.pro-football-reference.com/super-bowl/

Filter data and calculate point differentials for Super Bowl

Please note that I will only be looking at data from 1990 to 2019

Calculate descriptive statistics

NCAA Data

Web scrape NCAA football National championship data

NCAA Data obtained from - http://championshiphistory.com/ncaafootball.php

Parse out championship scores for NCAA

Calculate NCAA point differentials

Calculate descriptive statistics

Combine Superbowl and NCAA filtered data frames for data visualization

Graph data

Permutation Testing: Is there a meaningful difference in the average point differential between the NCAA and NFL championship games?

Calculate observed mean difference in point differential between the NFL and NCAA games

Create Null distribution

Visualize the null distribution and calculate p-value

Calculate p-value

Since the p-value is greater than 0.5, we fail to reject the null hypothesis that the average point differential for Super Bowl games is the same as the average point differential for NCAA championship games. Thus, there is no meaningful difference between the two means.

Is There a Difference In the Proportion of “Blowout” Championship Games Between NCAA and NFL

Though there does not seem to be any agreement on what is a blowout game, it looks like many individuals agree with a number around 17 and 20 plus points.

I am going to make assumption that a 20 point differential is what constitutes a blowout game.

Calculate the number of blowout games

Visualize data

Based on the above figure, there seems to be a larger proportion of NCAA championship games that ended up with a blowout game.

I thought it would be a good idea to calculate the actual percentage of games that were blowouts for both organizations.

Calculate difference in proportion of blowout game between NFL and NCAA Games

So, we can see that the proportion of games that I considered “blowouts” for the NCAA and NFL championship games was 36.67% and 20% respectively

This means that blowout games occurred 16.67% more in NCAA title games than in Super Bowl games in our sample.

Again, the question is whether or not this is statistically meaningful.

Permutation Testing: Is there a meaningful difference in the proportion of championship blowout games Between NCAA and NFL?

Calculate the observed difference in proportion of championship blowout games between NCAA and NFL

Create Null distribution

Visualize the null distribution and calculate p-value

Calculate p-value

As with the point differential section, we fail to reject the null hypothesis p > 0.5.

Concluding Remarks

Hope you enjoyed this notebook and feel free to email me if you have any questions about these data or the code!

You May Also Enjoy

Mario Party Superstars: Which die block should you buy from the item store?

Predicting Airline Customer Satisfaction Ratings

Creating a United States CO2 Emissions Streamlit Application

Predicting Fake News with a Convolutional Neural Network