Soccer Real Plus Minus - Sport Analytics Student Research

A More Accurate Measure of Player Value in the World of Soccer.

Contributors:

Drew DiSanto, drdisant@syr.edu
James Hyman, jmhyman@syr.edu
Kevin Ivers, kivers@syr.edu
Andrew Kelly, afkelly@syr.edu
Connor Meissner, cmeissne@syr.edu
Cameron Mitchell, camitche@syr.edu
Seth Quinn, sfquinn@syr.edu
Max Rothermund, mrotherm@syr.edu
Dominic Samangy, dsamangy@syr.edu
Kushal Shah, kshah07@syr.edu
Davis Showell, drshowel@syr.edu

Acknowledgements

We would like to thank Dr. Justin Ehrlich (jaehrlic@syr.edu) and Dr. Shane Sanders (sdsander@syr.edu) for their help with this project.

Abstract

Real Plus Minus (RPM) is an advanced statistic used in major team sports to assess a player’s true value to their team. Rather than simply tracking the score differential while the player was on the court or field, as the standard plus-minus statistic does, RPM accounts for other statistics and for the other players in the game in order to isolate the player’s own contribution. RPM rewards players when they play against tougher opponents or with weaker teammates. The idea was first popularized in basketball and has expanded into other sports such as hockey. For this paper, we wanted to extend this relatively new and popular metric to soccer. While lineups do not change as much during a soccer match as in other team sports, they can change quite a bit from game to game, and thus an RPM metric can be used to set the best possible eleven for a given match. We found the results of our study to be intriguing: they match up fairly well with what a traditional soccer fan who follows the game may expect, but they also showcase some players who are underrated and could be valuable targets during the transfer window.

Data

To create our real plus-minus metric, we collected stint data for every game throughout the season. A stint is defined by the 22 total players on the field. If the set of players on the field changes, either due to a red card or a substitution, a new stint begins. Using the play-by-play commentary from the official Premier League website, we recorded how many times each event happened during each stint, while also noting the players involved in each stint and the amount of time played per stint. The data we collected covers every team in the 2017-2018 Premier League season. Each line of the data frame has a column for every player who played in the Premier League during the season, with a 1 under the names of the players who played during that specific stint for the home team, a -1 under the players who played for the away team, and a 0 for everyone who was not playing in that stint. Each line also includes the start and end time of the stint, as well as the counts of the key events that occurred during the stint. The length of the stint allows us to create per-90 statistics, which improves accuracy. Here is an example of what the stints for one game look like:

Enter web table here…. (copy and paste table from excel into this area)

The events we recorded for each stint include:

Goals and Goals Allowed
Shots on Target and Shots on Target Allowed
Free Kicks won in the Attacking Half and Free Kicks given up in the Attacking Half
Red Cards for Opponents and Red Cards
Yellow Cards for Opponents and Yellow Cards
Corners and Corners Allowed
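
To make the stint layout concrete, here is a minimal Python sketch of how one stint row could be encoded. It is an illustration only: the function name `encode_stint`, the field names, and the sample players are hypothetical, and the original data was assembled by hand in Excel and R rather than with this code.

```python
def encode_stint(all_players, home_on, away_on, start_min, end_min, events):
    """Build one stint row: +1 home player, -1 away player, 0 not on the pitch."""
    home, away = set(home_on), set(away_on)
    if home & away:
        raise ValueError("a player cannot appear for both sides")
    minutes = end_min - start_min
    if minutes <= 0:
        raise ValueError("a stint must have positive length")
    row = {p: (1 if p in home else -1 if p in away else 0) for p in all_players}
    row["start"], row["end"] = start_min, end_min
    # Scale raw event counts to a per-90-minute rate using the stint length.
    for name, count in events.items():
        row[name + "_per90"] = count * 90 / minutes
    return row


# Hypothetical players and events, for illustration only.
players = ["Home A", "Home B", "Away A", "Away B", "Bench C"]
row = encode_stint(players, ["Home A", "Home B"], ["Away A", "Away B"],
                   start_min=0, end_min=30, events={"net_goals": 1})
print(row)
```

A substitution or red card would simply end this row and start a new one with the updated sets of home and away players.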

To merge all the team data sets into a usable format, we made use of R and Visual Basic. Using the rbind function in R, we merged every team’s stint data into one file. The next step was to reformat the dataset so that every player had their own column, allowing a matrix of 1s, 0s, and -1s to be created that represents which players were on the field. The initially collected matrix lists only the players involved in each game, with a blank row after every game. The reformatted matrix removes those blank rows and gives every player their own column. This cleaning is required to put the data frame into a format on which we can run a ridge regression in R. To reformat it, we used the Visual Basic tool that is integrated within Excel. The algorithm we wrote in Visual Basic followed these basic steps:

Scroll down through the first column of the player matrix
If a cell contains text, copy that row of names and the cells below it (the numbers) until a blank cell is found
Once the next cell with text is found, compare that row’s contents with the names copied earlier
If a name is a repeat, do not paste it again; if it is new, paste it as a new column
Paste the numbers below their respective names
Repeat the process for every column
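
The steps above can be sketched in Python as follows. This is not the Visual Basic macro we wrote; it is a minimal illustration that assumes each game arrives as a row of player names, followed by its stint rows, followed by a blank row. The function name `reshape_blocks` and the sample data are hypothetical.

```python
def reshape_blocks(rows):
    """Turn per-game blocks (name row, stint rows, blank row) into one wide matrix."""
    columns = []   # every player seen so far, in first-seen order
    stints = []    # one dict per stint: player name -> 1 / -1 / 0
    header = None
    for row in rows:
        if not row or all(cell in ("", None) for cell in row):
            header = None                  # blank row: the current game has ended
        elif header is None:
            header = list(row)             # first row of a block holds player names
            for name in header:
                if name not in columns:
                    columns.append(name)   # add only names not seen before
        else:
            stints.append(dict(zip(header, row)))
    # Players who were not part of a game get 0 for that game's stints.
    matrix = [[s.get(name, 0) for name in columns] for s in stints]
    return columns, matrix


# Two tiny hypothetical games separated by a blank row.
raw = [
    ["Player A", "Player B", "Player C"],
    [1, 1, -1],
    [1, 0, -1],
    [],
    ["Player A", "Player D"],
    [1, -1],
]
columns, matrix = reshape_blocks(raw)
print(columns)
for line in matrix:
    print(line)
```

The result has one column per distinct player and one row per stint, with no blank rows between games.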

Below are two images, the player matrix we collected, and the player matrix reformatted to the required format:

Enter web table here…. (copy and paste table from excel into this area)

Reformatted Player Matrix:

Enter web table here…. (copy and paste table from excel into this area)

Because the data was collected manually, typos in player names were a real possibility. By comparing the player matrix column names (player names) with the names in the FBref data, we were able to identify any typos. When we found one, we compared the affected columns and merged them so that every player’s stint data was complete. Once the names matched up perfectly, we were confident that we had not missed any player. We also summed the values in each stint (1, 0, -1) to check the totals: if there had been no red card, they should add up to 0, and if there had been a red card, they should add up to 1 or -1. These are some examples of the measures we took to maintain accuracy. The final data frame looked like this:

Enter web table here…. (copy and paste table from excel into this area)
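
The two accuracy checks described above, the name comparison and the stint sum, could look like this in Python. This is a sketch under stated assumptions: the function names and the sample rows are hypothetical, and our actual checks were done in Excel and R.

```python
def unbalanced_stints(matrix):
    """List (row index, row sum) for stints where the two sides are not level in number."""
    return [(i, sum(row)) for i, row in enumerate(matrix) if sum(row) != 0]


def unknown_names(columns, reference_names):
    """Column names missing from the reference list, which are likely typos."""
    known = set(reference_names)
    return [name for name in columns if name not in known]


# Hypothetical rows: the second stint has the away side one player short.
matrix = [
    [1, 1, -1, -1],
    [1, 1, -1, 0],
]
flagged = unbalanced_stints(matrix)
typos = unknown_names(["K. De Bruyne", "K. De Bryune"], ["K. De Bruyne"])
print(flagged)  # [(1, 1)]
print(typos)    # ['K. De Bryune']
```

Any flagged stint would then be checked by hand: a sum of 1 or -1 is legitimate only if a red card had been shown earlier in that match.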

After we calculated the RPM, we used publicly available FBref data to obtain the following attributes for each player as they would increase the functionality of the statistic:

Position
Club
Age
Nationality
Games Played
Minutes Played

Methodology

After we collected all our data, we ran a ridge regression in R. Ridge regression is a regularized form of regression suited to collinear data like ours, because it mitigates the problem of multicollinearity. It modifies the optimization criterion of a standard linear regression: instead of minimizing only the sum of squared residuals, it minimizes the sum of squared residuals plus a penalty term based on the distance between the coefficients and their priors (zero, in the standard formulation). We ran multiple ridge regressions, one for each of the events we collected as the dependent variable. We used these statistics because we believe they are the most important in determining the value of a player, as they will ultimately determine the outcome of the match. Below are the results of the ridge regression output for goals scored:

Call:
linearRidge(formula = Net_Goals ~ ., data = AttGoalsdata)

Paste regression output here…

Each number below a player name is that player’s coefficient value. The coefficient value implies that when that player is on the pitch, the team is expected to score 𝑥 more goals. We ran this regression for every event we collected.
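
For readers who want to see the mechanics, here is a minimal pure-Python sketch of a ridge fit on a toy player matrix. It is not the `linearRidge` call we ran in R: it omits the intercept and any scaling, the penalty `lam` is picked arbitrarily, and the toy matrix and response are hypothetical. It solves the penalized normal equations (XᵀX + λI)β = Xᵀy directly.

```python
def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y by Gaussian elimination."""
    n, p = len(X), len(X[0])
    # Augmented matrix [X^T X + lam * I | X^T y].
    A = [[sum(X[k][i] * X[k][j] for k in range(n)) + (lam if i == j else 0.0)
          for j in range(p)]
         + [sum(X[k][i] * y[k] for k in range(n))]
         for i in range(p)]
    # Forward elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, p):
            factor = A[r][col] / A[col][col]
            for c in range(col, p + 1):
                A[r][c] -= factor * A[col][c]
    # Back substitution.
    beta = [0.0] * p
    for i in reversed(range(p)):
        rest = sum(A[i][j] * beta[j] for j in range(i + 1, p))
        beta[i] = (A[i][p] - rest) / A[i][i]
    return beta


# Toy data: the first two players always appear together (perfect collinearity),
# so ordinary least squares has no unique solution, but ridge does.
X = [[1, 1, -1, -1],
     [1, 1, -1, 0],
     [1, 1, 0, -1]]
y = [1.0, 2.0, 0.0]   # hypothetical net goals per stint
beta = ridge_fit(X, y, lam=1.0)
print(beta)
```

Because the first two columns are identical, the penalty splits the credit evenly between those two players, and a larger `lam` shrinks all coefficients further toward zero.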

The results of each ridge regression were then used as part of a weighted sum to create our final RPM value. Below are the formulas used as part of the weighted sum:

𝐴𝑡𝑡𝑅𝑃𝑀 = 𝛼1𝐺𝑜𝑎𝑙𝑠𝐶𝑜𝑒𝑓𝑓 + 𝛼2𝐹𝐾𝐶𝑜𝑒𝑓𝑓 + 𝛼3𝑆𝑜𝑇𝐶𝑜𝑒𝑓𝑓 +𝛼4𝑅𝐶𝐶𝑜𝑒𝑓𝑓 + 𝛼5𝑌𝐶𝐶𝑜𝑒𝑓𝑓 + 𝛼6𝐶𝑜𝑟𝑛𝑒𝑟𝐶𝑜𝑒𝑓𝑓
𝐷𝑒𝑓𝑅𝑃𝑀 = 𝛽1𝐺𝑜𝑎𝑙𝑠𝐶𝑜𝑒𝑓𝑓 + 𝛽2𝐹𝐾𝐶𝑜𝑒𝑓𝑓 + 𝛽3𝑆𝑜𝑇𝐶𝑜𝑒𝑓𝑓 + 𝛽4𝑅𝐶𝐶𝑜𝑒𝑓𝑓 + 𝛽5𝑌𝐶𝐶𝑜𝑒𝑓𝑓 +𝛽6𝐶𝑜𝑟𝑛𝑒𝑟𝐶𝑜𝑒𝑓𝑓

The 𝛼𝑥 represents the weight for each event, with 𝛽𝑥 = −𝛼𝑥. The 𝛽𝑥 values are the negatives of the 𝛼𝑥 values because we want the defending RPM to be scaled so that the higher a player’s defending RPM, the better their defending capability. For example, the GoalsCoeff in the DefRPM formula represents the number of goals a player’s team allows when he is on the field, so a more negative coefficient implies a better defender. Multiplying it by a negative weight therefore rescales it so that a greater DefRPM value implies better defending. The weights were developed using previous studies of how each event relates to the likelihood of scoring a goal. Below is a graph showing the weights (𝛼𝑥):

The overall RPM of a player is simply calculated by adding the attacking RPM and defending RPM as seen by this formula:

𝑅𝑒𝑎𝑙 𝑃𝑙𝑢𝑠 𝑀𝑖𝑛𝑢𝑠 = 𝐴𝑡𝑡𝑎𝑐𝑘𝑖𝑛𝑔𝑅𝑃𝑀 + 𝐷𝑒𝑓𝑒𝑛𝑑𝑖𝑛𝑔𝑅𝑃𝑀

Here is an example of the RPM calculation for Kevin De Bruyne:

𝐴𝑡𝑡𝑅𝑃𝑀 = 1(0.035)+ 0.021(0.083)+ 0.324(0.167)+0.088(−0.0007)+ 0.0295(−0.00018)+ 0.03(0.101)
𝐷𝑒𝑓𝑅𝑃𝑀 = (−1(−0.026))+ (−0.021(−0.032))+ (−0.324(−0.146)) + (−0.088(−0.012))+ (−0.0295(−0.004))+ (−0.03(−0.233))
𝐴𝑡𝑡𝑅𝑃𝑀 = 0.094
𝐷𝑒𝑓𝑅𝑃𝑀 = 0.082
𝑅𝑒𝑎𝑙 𝑃𝑙𝑢𝑠 𝑀𝑖𝑛𝑢𝑠 = 0.094 + 0.082
𝑅𝑃𝑀 = 0.176

The coefficients and weights above have been rounded for readability.
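
The De Bruyne calculation above can be reproduced with a short Python sketch. The weights and coefficients are the rounded values from the worked example; the variable and function names are our own for illustration.

```python
# Event order: goals, free kicks, shots on target, red cards, yellow cards, corners.
WEIGHTS = [1, 0.021, 0.324, 0.088, 0.0295, 0.03]          # rounded alpha values
att_coefs = [0.035, 0.083, 0.167, -0.0007, -0.00018, 0.101]
def_coefs = [-0.026, -0.032, -0.146, -0.012, -0.004, -0.233]


def weighted_sum(weights, coefs):
    """Sum of weight * coefficient over the six events."""
    return sum(w * c for w, c in zip(weights, coefs))


att_rpm = weighted_sum(WEIGHTS, att_coefs)
# The defending weights are the negatives of the attacking weights (beta = -alpha).
def_rpm = weighted_sum([-w for w in WEIGHTS], def_coefs)
rpm = att_rpm + def_rpm
print(round(att_rpm, 3), round(def_rpm, 3), round(rpm, 3))  # 0.094 0.082 0.176
```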

Findings and Results:

As mentioned earlier, our results were largely consistent with the eye test of an avid Premier League fan, while also revealing a few hidden gems in the Premier League. In this section we present visualizations that validate our findings, highlight the output of our research, and show how it can be applied in the future.

In the first three visualizations below, we applied a minutes played requirement to ensure that the results included only players who contributed significantly to their teams during the 2017-18 season. Although it may have cut out some youngsters or role players, it provides a more reliable picture of who the best players in each category are, since players with fewer than 1140 minutes may have a skewed RPM statistic due to a small sample size. We selected 1140 minutes as the requirement because it represents one third (about 33%) of the 3420 total minutes available to an individual player in a 38-game Premier League season. Since many young prospects were excluded from the first three visualizations, we look at all players under the age of 21 later in the paper. There we again set a minutes played requirement, but only 342 minutes, which represents 10% of the season. We chose a smaller requirement because, as with most teams in the Premier League, it is often hard to find minutes for younger players due to the competitiveness of the league. It is not ideal to throw underdeveloped and inexperienced players into the heat of battle, so many U21 players do not play as much as veterans, and we believe a requirement of 342 minutes is appropriate for them.

In terms of the output, a scatter plot is used to identify the key performers. When reading it, players who are further right and higher up (northeast in direction) are considered the most effective, as they have a high RPM over a large sample. Conversely, those who are furthest right and furthest down (southeast in direction) are the least effective, as they have a low RPM over a large sample.
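
The minutes thresholds described above follow directly from the length of the season. A minimal Python sketch, with hypothetical player records and helper names used purely for illustration:

```python
SEASON_MINUTES = 38 * 90               # 3420 minutes in a 38-game season
SENIOR_MINIMUM = SEASON_MINUTES // 3   # 1140 minutes, one third of the season
U21_MINIMUM = SEASON_MINUTES // 10     # 342 minutes, 10% of the season


def qualifying(players, minimum):
    """Keep only the players who reached the minutes requirement."""
    return [p for p in players if p["minutes"] >= minimum]


# Hypothetical records, not taken from our data set.
sample = [
    {"name": "Player A", "minutes": 2900},
    {"name": "Player B", "minutes": 1139},
    {"name": "Player C", "minutes": 342},
]
print(SENIOR_MINIMUM, U21_MINIMUM)  # 1140 342
print([p["name"] for p in qualifying(sample, SENIOR_MINIMUM)])  # ['Player A']
```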

Before we look at the visualizations, it is extremely important to understand that soccer, unlike basketball or hockey, does not require every player to be heavily involved on both sides of the field, attacking and defending. While we would expect every player in basketball or hockey to be good on offense and defense, in soccer we do not expect a striker to excel when defending. Therefore, we believe that for attacking players, attacking RPM is a far better measure of their value than the holistic RPM. The same can be said for defenders and their defending RPM, and for midfielders and their holistic RPM. That said, the attacking RPM of defensive players can still indicate which defenders are more involved in corners or in pushing up, and likewise for the defending RPM of attacking players.

The first group we looked at was the attacking players of the Premier League, those labeled as FW (forward), FWMF (forward-midfielder), or FWDF (forward-defender). For this visualization, as covered earlier, the attacking RPM statistic was used, as it is most effective at revealing the most valuable attackers. According to the plot, we can identify Sadio Mane, Raheem Sterling, and Mohamed Salah as the most effective attackers, while Jacob Murphy, Dwight Gayle, and Rajiv van La Parra are the least effective. These results are important because they lend validity to the reasoning and output behind our RPM figure, as many consider Salah, Sterling, and Mane some of the elite attackers in the Premier League. While these results are not the final word in an argument over who is better, they give analysts and fans a better way to identify other players who provide the most value.

Next, we looked at the midfield players of the Premier League. These positions are labeled MF (midfielder), MFDF (midfield-defender), and MFFW (midfield-forward). Instead of attacking RPM, total RPM (attacking + defending) was used, as midfield players are usually required to contribute both offensively and defensively over the course of a match. Using the same process as before, we can identify Kevin De Bruyne, Fernandinho, and Paul Pogba as the most effective midfielders, while Leroy Fer, Davy Propper, and Cheikhou Kouyate are amongst the least effective. Once again, these results are promising, as De Bruyne was widely regarded at the time as the best midfielder, and arguably the best player, in the Premier League.

Our next analysis was that of the defensive Premier League players. Positions were labeled as DF (defender), DFMF (defensive-midfielder), and DFFW (defensive-forward). In this case, defensive RPM was used as it most effectively values the players used in defensive positions. Declan Rice, Alberto Moreno, and Aaron Cresswell were amongst the most effective, while Craig Dawson, Ezequiel Schelotto, and Daryl Janmaat were amongst the least effective.

Next, as touched on before, we look at the top U21 performers during the 2017-18 Premier League season. Gabriel Jesus, Anthony Martial, and Leroy Sane were amongst the best, while Nikola Vlasic, Alex Iwobi, and Sam Field were amongst the worst. This output gives us a useful look at which young prospects are breaking through, especially those with a high RPM and many minutes played, as it means they are performing consistently well across a large sample. It is also important to note that many of these players have not played as much as the players in the earlier visualizations, so one must keep in mind that a small sample may inflate or deflate their RPM statistic.

Following our individual player analyses, we can now take a look at the RPM performance of each club. First, we included only players who accumulated at least 342 minutes played, as in the U21 plot. By summing the RPM of all the remaining qualifying players, we can see which clubs outperformed which. The champions, Manchester City, recorded the highest RPM, just under 2.5, which matches their success on the pitch. Interestingly, the top 7 finishers in 2017/18 were the only clubs to record a positive RPM, whereas the other 13 clubs all recorded a negative RPM. This is a testament to the validity of the RPM statistic, as the total RPM per club nearly reproduces the exact order of finish in the Premier League table. This relationship is further seen below:

Enter the chart…

The correlation between a club’s total RPM and its league finish can help further determine whether there is a relationship between the two. Since a club’s desired finish is as low as possible numerically (1st, 2nd, 3rd, etc.) and a higher RPM is better, we would expect an inverse relationship between the two. This means that as a club’s RPM increases, we can expect it to perform better and thus finish higher in the table at the end of the season. As shown above, there is indeed an inverse relationship between the two. It is most notable for Liverpool, Manchester City, and Manchester United, where the RPM line graph spikes and plateaus while the league finish line graph falls sharply. This confirmation of an inverse relationship during the 2017/18 season is extremely important, as it supports the validity of our RPM statistic in depicting which clubs and players are performing the best.
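
One way to put a number on this inverse relationship is a rank correlation between club RPM and league position. The Python sketch below is an illustration only: the five clubs and their values are placeholder numbers, NOT our results, and the choice of a Spearman rank correlation is an assumption rather than the method used for the chart.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Placeholder values for five hypothetical clubs, not the study's data.
club_rpm = [3.0, 2.0, 1.0, -1.0, -2.0]
league_finish = [1, 2, 3, 5, 4]
rho = spearman(club_rpm, league_finish)
print(round(rho, 2))  # -0.9
```

A value close to -1 would indicate that clubs with a higher total RPM tend to finish nearer the top of the table.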

Enter the chart…

In a further breakdown of how the clubs performed individually, we take a look at the attacking and defending RPM of each, instead of total RPM. Liverpool were defensively elite, although Manchester City’s league-best attack may have been the clinching factor in the title. On the other hand, all three relegated teams (Swansea, Stoke, and West Brom) were by far the worst defensively, which undoubtedly contributed to their relegation to the Championship.

Our RPM statistic could be used by professional soccer personnel in a variety of ways. One possible example is to use the metric to scout younger talent to build for the future. Sometimes, especially in a sport so reliant on teammates, a good player may go unnoticed because their basic numbers, like goals and assists, are not as large as they could be due to the struggles of their teammates. RPM accounts for the strengths and weaknesses of the team by assessing performance changes when a player moves in and out of the lineup, so a player’s RPM could be more telling of his impact on a team than basic goals and assists. This metric can aid the search for undervalued players who fly under the radar and would make for great value signings, especially for clubs that do not have a large budget. Another use of this statistic is the selection of a national team. Controversy often surrounds national team selection, especially for nations with a deep pool of talent. We believe this statistic can act as a deciding factor when choosing between the last few players to select.

Possible Improvements

An obvious limitation of our RPM stat is that all our data was manually collected, which brings in the possibility of human error. While we used different methods to ensure accuracy, as discussed earlier, each member of our team spent hours collecting match statistics and tracking all the substitutions in every game, so there is a possibility that a mistake was made. Moreover, the weights we applied to each metric could be fine-tuned further. While we did spend some time researching and formulating each individual statistic’s respective weight, a more accurate weight would increase the accuracy of the statistic. Another improvement that may be possible with better data would be incorporating more advanced metrics such as key passes and expected goals. The Premier League website, which we used to collect our data, did not include any of these metrics in its play-by-play commentary, and hence we could not include them. Furthermore, the difference in play style between goalkeepers and other positions may limit the metric’s usefulness for goalkeeper analysis. Finally, an additional improvement we plan to make is creating an automated data collection system that would update a player’s RPM in real time. Not only would this save us time, but it would also allow us to view a player’s up-to-the-minute RPM. We would also be able to apply our metric to other major leagues across the globe and improve its accuracy, as there would ideally be no human error.

Conclusion

Overall, our RPM stat produced very promising results. As stated above, our RPM rankings generally followed what an avid and knowledgeable Premier League fan would expect, with players such as Sadio Mane and Kevin De Bruyne near the top of their respective player groups. Moreover, we believe the comparison between a club’s RPM and its final position in the table strongly validates this metric. As we acknowledged, there are limitations with our metric due to the method we used in collecting our data. Despite this, we believe our RPM stat is a valuable metric that can be used by leagues and teams everywhere for various purposes, and we hope to automate and expand our analysis across other leagues in the near future.

Bibliography

2017-2018 Premier League Player Stats. (n.d.). Retrieved July 02, 2020.

Caley, M. (2014, June 17). What is a corner kick worth in soccer? Retrieved July 02, 2020.

How many goals are scored from free kicks? (n.d.). Retrieved July 02, 2020.

Moritz, S. (2020, March 17). Ridge.pdf. Retrieved July 02, 2020.

Sakellaris, D. (2018, June 19). The Correlation Analysis of Scored Goals and Red Cards. Retrieved July 02, 2020.

Suchman, A. (2014, April 19). Explaining ESPN’s Real Plus-Minus. Retrieved July 02, 2020.
