Examining the Skillsets of PGA Golfers From 2010-2020 and the Relationships to Scoring Average, Top Ten Finishes, and Seasonal Earnings

By Sam Marteka – Syracuse University ’21

Abstract

This research analyzes how the performance of golfers on the PGA Tour has changed and how their statistics shift over the course of their careers. Using linear regression models, we then evaluate the effect of these statistical changes on seasonal earnings, top ten finishes, and scoring average over time. The dataset spans 2010 to 2020 and includes over 100 cumulative statistics for every eligible golfer who played in a PGA Tour event during any season in that span.

Introduction

Compared to the four major North American sports, golf is relatively under-explored when it comes to data analytics. Throughout the quarantine brought on by the worldwide coronavirus pandemic, I played golf frequently and watched it on television, as it was the first sport to resurface. I found a new passion for golf, both as a leisure activity and as a fan of the PGA Tour. For my senior thesis, I wanted to explore a variety of topics that focus on the performance of PGA golfers over time. Current superstars like Bryson DeChambeau have revolutionized the game; DeChambeau has built his game around driving the ball significantly farther than his competition. He inspired me to explore how the skillsets of successful golfers have changed over time, and as I worked through this thesis, I became especially interested in how different skills contribute to golfers' success on three levels: percentage of top ten finishes, scoring average, and seasonal earnings.

Methodology

This project required a range of analytical skills that I have learned over the last four years as a student at SU. The first step was web scraping in Python with the Beautiful Soup and Selenium packages. I wrote a Python script that loaded the PGA Tour web page for each statistic for every year from 2010 to 2020. I then used the pandas package to wrangle the per-statistic datasets into one large dataset containing every statistic and year combination. The result was over 300,000 observations, which I exported to R for additional wrangling and cleaning to prepare the data for linear modeling. Once the data was cleaned, I created additional data frames in R that filtered the data by year or by group of years. I then ran linear regression models with three dependent variables: scoring average, seasonal earnings, and top ten finish percentage. I used various year intervals to see whether player skillsets changed significantly over different periods between 2010 and 2020.
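
The wrangling and modeling steps described above can be illustrated with a minimal Python sketch. It is not the code used for this thesis (the regressions were run in R), and everything in it is an assumption made for illustration: the column names (player, year, driving_distance, sg_putting, scoring_avg), the helper functions combine_stats and fit_ols, and the handful of made-up rows, which are not real PGA Tour data. The toy scoring averages were generated from a known linear rule so that the fit can be checked.

```python
import numpy as np
import pandas as pd

# Made-up toy rows standing in for scraped per-statistic tables.
# These are NOT real PGA Tour numbers; all column names are hypothetical.
PLAYERS = ["A", "B", "C", "D"] * 2
YEARS = [2019] * 4 + [2020] * 4

driving = pd.DataFrame({
    "player": PLAYERS, "year": YEARS,
    "driving_distance": [295.0, 301.5, 288.2, 310.4, 297.1, 303.0, 290.5, 312.8],
})
putting = pd.DataFrame({
    "player": PLAYERS, "year": YEARS,
    "sg_putting": [0.3, -0.1, 0.5, 0.0, 0.2, -0.4, 0.6, 0.1],
})
# Toy response built as 100 - 0.1 * driving_distance - 1.0 * sg_putting.
scoring = pd.DataFrame({
    "player": PLAYERS, "year": YEARS,
    "scoring_avg": [70.2, 69.95, 70.68, 68.96, 70.09, 70.1, 70.35, 68.62],
})


def combine_stats(tables):
    """Merge per-statistic tables on (player, year) into one wide table."""
    combined = tables[0]
    for table in tables[1:]:
        combined = combined.merge(table, on=["player", "year"], how="inner")
    return combined


def fit_ols(df, predictors, response):
    """Ordinary least squares with an intercept, solved via numpy."""
    X = np.column_stack(
        [np.ones(len(df)), df[predictors].to_numpy(dtype=float)]
    )
    y = df[response].to_numpy(dtype=float)
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return dict(zip(["intercept"] + list(predictors), coefs.tolist()))


# One row per player-season, one column per statistic.
wide = combine_stats([driving, putting, scoring])

# Filter to a group of years, then regress scoring average on two skills.
subset = wide[wide["year"].between(2019, 2020)]
model = fit_ols(subset, ["driving_distance", "sg_putting"], "scoring_avg")
print(model)
```

Changing the bounds passed to between mirrors the idea of fitting the same model over different year intervals, and swapping the response column would give the earnings or top ten percentage versions of the model. Because the toy response is an exact linear function of the two predictors, the fitted coefficients should come out close to the rule used to build it; real data would of course leave residual error.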

Top Ten Percentage on Shots Gained Results

Progression of the Four Different Shots Gained Metrics From 2015-2020

Seasonal Earnings on Statistics Results

Scoring Average on Shots Gained Results

Average Earnings by Year

Relationship Between Scrambling and Driving Distance on Top Ten Finishes from 2010-2015

Relationship Between Shots Gained Off The Tee and Shots Gained Putting on Seasonal Earnings from 2010-2020

Relationship Between Par 3 Scoring and Par 5 Scoring on Earnings from 2010-2020