Summary

Football recruitment is about finding exceptional talents that fit your team's style.
We often try to find players similar to one we already like. But is there anyone like fan favourite Kevin De Bruyne?
Here I instead find players that stand out, the outliers.
These outliers may flag up a few exceptional talents for you to look at in your scouting process.

Introducing outlier analysis

Most of the public football recruitment analysis tends to fall into three camps:

  • similarity searches: given a good player, could you find me a player with a similar style?
  • visualizations that show a fixed set of skills for a position, e.g. radars
  • scatter plots that show extreme values in one or two traits


I got interested in the last approach via a thread by @MishraAbhiA. Yet, I am not satisfied with cherry-picking data and identifying a handful of players. Can we do better and scale this type of approach to look at many skills?


Outlier analysis is perfect for this. It is used in industry to detect faulty machinery from unusual sensor readings. Here we can instead take many skills and identify players whose mix of skills is unusual. We find these unique players via machine learning.

What is an outlier?

An outlier is a player whose skills are far away from those of the other players. For example, Robert Lewandowski scored 11 more goals than his nearest competitor, Messi, in 2020-21. That makes him a clear outlier in the big five leagues, according to FBRef.
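To make "far away from the other players" concrete for a single statistic, here is a minimal sketch using a robust z-score based on the median. It is only an illustration, not the method used later in this blog. The totals 41 and 30 are the 2020-21 league goals for Lewandowski and Messi; every other total, and the 3.5 cut-off (a common rule of thumb), are assumptions for the example.

```python
from statistics import median


def robust_z_scores(values):
    """Modified z-scores built from the median and the median absolute deviation (MAD)."""
    values = list(values)
    centre = median(values)
    mad = median(abs(v - centre) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero, so the scores are undefined")
    # 0.6745 rescales the MAD so the scores are comparable to ordinary z-scores
    return [0.6745 * (v - centre) / mad for v in values]


# 41 and 30 are the 2020-21 league totals for Lewandowski and Messi.
# Every other total below is invented for illustration.
goals = {
    "Lewandowski": 41,
    "Messi": 30,
    "Player C": 24,
    "Player D": 22,
    "Player E": 20,
    "Player F": 18,
    "Player G": 17,
    "Player H": 16,
    "Player I": 15,
    "Player J": 14,
}

scores = dict(zip(goals, robust_z_scores(goals.values())))
# 3.5 is a rule-of-thumb threshold, not a value taken from this analysis
flagged = [name for name, z in scores.items() if z > 3.5]
print(flagged)
```

A one-statistic check like this does not scale to a mix of many skills, which is why the rest of the blog uses a multivariate method.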

"Recruitment is all about outliers. Find me the best players, in every trait, in every league, across the globe." Ted Knutson, StatsBomb Evolve, 17 March 2021.

Top outliers

In this blog, I identify outfield players who are outliers. I use their skill information, such as the number of interceptions per 90 minutes.

I exclude centre-backs as they are more challenging to scout with data. But @EveryTeam_Mark wrote a great blog on scouting centre-backs with statistics. As a sanity check, let's look at the top 10 players ranked by their outlier score. The top players look pretty great to me.

Rank  Player             Team             Position
1     Neymar             Paris S-G        Left Winger
2     Marco Verratti     Paris S-G        Central Midfield
3     Lionel Messi       Barcelona        Right Winger
4     Ángel Di María     Paris S-G        Right Winger
5     Aleksey Miranchuk  Atalanta         Attacking Midfield
6     Josip Ilicic       Atalanta         Second Striker
7     Kevin De Bruyne    Manchester City  Attacking Midfield
8     David Silva        Real Sociedad    Attacking Midfield
9     Bruno Guimarães    Lyon             Central Midfield
10    Joshua Kimmich     Bayern Munich    Defensive Midfield

The data

I include players who played in the big-5 leagues in England, France, Germany, Spain and Italy in 2020-21. I then exclude players who played fewer than 675 minutes over the last three seasons.

The data comes from FBRef via StatsBomb and Transfermarkt.

I combine player data over the last three seasons, so each player has one line. Combining the data ensures that the youngest players have enough data to analyze.
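As a rough sketch of what "one line per player" can look like in pandas: sum the raw totals across seasons, apply the 675-minute cut, and only then recompute rates and percentages from the summed totals. The column names (`player`, `minutes`, `interceptions`, `passes_completed`, `passes_attempted`) and the sample rows are hypothetical, not the actual FBRef schema or data.

```python
import pandas as pd

# Hypothetical raw columns that are safe to add up across seasons
TOTAL_COLUMNS = ["minutes", "interceptions", "passes_completed", "passes_attempted"]


def combine_seasons(season_rows, min_minutes=675):
    """Collapse one row per player-season into one row per player."""
    totals = season_rows.groupby("player", as_index=False)[TOTAL_COLUMNS].sum()
    # Keep players with at least `min_minutes` across all seasons combined
    totals = totals[totals["minutes"] >= min_minutes].copy()
    # Recompute derived statistics from the summed totals rather than
    # averaging the per-season values
    totals["interceptions_per90"] = totals["interceptions"] / totals["minutes"] * 90
    totals["pass_completion_pct"] = (
        100 * totals["passes_completed"] / totals["passes_attempted"]
    )
    return totals.reset_index(drop=True)


# Made-up rows for illustration only
season_rows = pd.DataFrame(
    {
        "player": ["Player A", "Player A", "Player B"],
        "season": ["2019-20", "2020-21", "2020-21"],
        "minutes": [300, 400, 500],
        "interceptions": [3, 4, 6],
        "passes_completed": [200, 300, 350],
        "passes_attempted": [250, 350, 400],
    }
)

combined = combine_seasons(season_rows)
print(combined)
```

Recomputing from totals matters because a simple average of season percentages would give a 300-minute season the same weight as a 3,000-minute one.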

Find your favourite player

The best way to show the potential of this outlier analysis is to show some interactive plots. Higher points in the chart are outliers. Younger players are towards the left of the charts. For each player, I highlight the top 4 statistics that contribute the most to their outlier score.

I have split the charts into three positions. You can toggle points away by clicking on the legend, e.g. to see only the top 10% of players. On mobile, you double click the chart to zoom out.

Can you find any gems?

Left/Right Backs

Midfielders

Forwards

1. Chart age profiles inspired by @utdarena.

Who stands out for you?

"[Analysis] pays off in avoiding mistakes and finding better players for your budget." Ted Knutson, StatsBomb Evolve, 17 March 2021

When I was playing around with the interactive charts, I liked the look of Rayan Cherki. Let's zoom into his profile and see each of the 50 statistics that go into the outlier analysis. He is a decent attacking midfielder who presses in the final third.

If you want to profile another player, then send a message to @numberstorm on Twitter.

1. Chart inspired by @HenshawAnalysis.

2. Image credit: Кирилл Венедиктов

What is the outlier score?

If you got this far, you might want to know more details about the method for deciding outliers. All the code for this blog is open source in my GitHub repo. The recipe for detecting outliers is:

  1. Download data from FBRef of the big-5 leagues for the last three seasons (2018-19, 2019-20 & 2020-21).
  2. Combine the player data so that there is one line per player. Create totals for each statistic over the last three years. Recalculate statistics such as percentages, ratios, and averages from the raw data.
  3. Download Transfermarkt data and fuzzy match to the FBRef data.
  4. Exclude players who played zero minutes for a team during 2020-21 in the big-5 leagues.
  5. Exclude players who played fewer than 675 minutes (approx 7.5 games) over the last three seasons.
  6. Exclude players who play at centre-back or goalkeeper. You might want to run your own analysis on these positions.
  7. Drop some columns that we do not want to use for outlier detection, such as yellow cards or penalties. My view here is that the number of cards does not matter. I also do not want to identify a player for taking rare set pieces.
  8. Remove correlated features. I want to make it harder to identify players who rate high in several correlated skills. For example, Jack Grealish excels in many skills related to dribbles and carries. As I keep fewer of these skills in the data, it is harder to identify him as an outlier via dribbles only.
  9. Add a column for the player's position. You might instead want to make different models for each player position. I let the model decide whether the player position was an important feature.
  10. Floor the bottom 40-50% of values for each statistic, so that all the players near the bottom share the same value. I don't want to flag players as outliers for their poor figures.
  11. Use the isolation forest algorithm from scikit-learn to identify outliers. Use SHAP values to identify the top contributors to the outlier score.
    Note: An isolation forest scores each football player for their uniqueness. It builds many trees, each of which repeatedly splits the players on a randomly chosen statistic, such as expected goals, at a random value. The fewer splits required to isolate a player, the more likely the player is an outlier.
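Steps 8, 10 and 11 of the recipe can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the code from the repo: the 0.9 correlation threshold, the 0.45 quantile, the number of trees, the column names and the planted "standout" player are all assumptions made for the example, and the SHAP attribution step is left out.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest


def drop_correlated(features, threshold=0.9):
    """Drop the later column of any pair whose absolute correlation exceeds threshold."""
    corr = features.corr().abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop)


def floor_low_values(features, quantile=0.45):
    """Raise every value below a column's quantile up to that quantile."""
    return features.apply(lambda col: col.clip(lower=col.quantile(quantile)))


def outlier_scores(features, random_state=0):
    """Fit an isolation forest and return scores where higher means more unusual."""
    model = IsolationForest(n_estimators=200, random_state=random_state)
    model.fit(features)
    # score_samples is lower for more abnormal rows, so negate it
    scores = -model.score_samples(features)
    return pd.Series(scores, index=features.index, name="outlier_score")


# Synthetic per-90 statistics for 200 hypothetical players
rng = np.random.default_rng(0)
n_players = 200
stats = pd.DataFrame(
    {
        "xg_per90": rng.normal(0.3, 0.1, n_players),
        "interceptions_per90": rng.normal(1.0, 0.3, n_players),
        "dribbles_per90": rng.normal(1.5, 0.5, n_players),
    },
    index=[f"player_{i}" for i in range(n_players)],
)
# A near-duplicate of dribbles, which the correlation filter should remove
stats["carries_per90"] = stats["dribbles_per90"] * 10 + rng.normal(0, 0.1, n_players)
# One planted standout with very high values in every statistic
stats.loc["player_0"] = [1.0, 3.0, 4.5, 45.0]

reduced = drop_correlated(stats)
floored = floor_low_values(reduced)
scores = outlier_scores(floored)
print(scores.sort_values(ascending=False).head(5))
```

Flooring the low values before fitting means a player can only be isolated quickly by being unusually high in some statistic, which matches the intent of step 10.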

Discussion

A large majority of a player's time in football is off the ball. But most of the stats in this analysis ignore this. The stats also ignore the context in which the team plays. Here are some interesting Twitter conversations on this.


There is also an interesting blog by TiotalFootball about how tracking data might provide greater context. With some clever engineering, you could create some off-ball events from tracking data and StatsBomb's new 360 data, which could complement this analysis.


If you want to extend this analysis, here are a few ideas:

  • include central defenders
  • adjust the stats for possession or the league to account for stylistic differences
  • identify players with unique passing styles. You could bin or cluster passes to create totals and use these to identify outliers
  • add off-the-ball features or speed data
  • create separate models for each position
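For the passing-style idea, one possible way to bin passes into per-player totals is sketched below. The column names (`start_x`, `start_y`, `end_x`, `end_y`), the length cut points of 15 and 30 pitch units, the four direction sectors and the sample passes are all hypothetical. Which side counts as left or right depends on the data provider's y-axis, so the sideways sectors are simply labelled `ypos` and `yneg`.

```python
import numpy as np
import pandas as pd

LENGTH_LABELS = ["short", "medium", "long"]
# fwd = towards the opposition goal (increasing x), back = towards own goal
DIRECTION_LABELS = ["fwd", "ypos", "back", "yneg"]


def pass_bin_totals(passes, length_cuts=(15.0, 30.0)):
    """Count each player's passes in length x direction bins."""
    dx = (passes["end_x"] - passes["start_x"]).to_numpy(dtype=float)
    dy = (passes["end_y"] - passes["start_y"]).to_numpy(dtype=float)
    # 0 = shorter than the first cut, 1 = between the cuts, 2 = longer
    length_bin = np.digitize(np.hypot(dx, dy), length_cuts)
    # Four 90-degree sectors centred on forward, +y, backward and -y
    angle = np.degrees(np.arctan2(dy, dx))
    sector = np.floor(((angle + 45.0) % 360.0) / 90.0).astype(int) % 4
    labels = [
        f"{LENGTH_LABELS[l]}_{DIRECTION_LABELS[s]}"
        for l, s in zip(length_bin, sector)
    ]
    binned = pd.DataFrame({"player": passes["player"].to_numpy(), "bin": labels})
    return binned.groupby(["player", "bin"]).size().unstack(fill_value=0)


# Made-up passes for illustration only
passes = pd.DataFrame(
    {
        "player": ["A", "A", "A", "B", "B"],
        "start_x": [0, 0, 20, 50, 50],
        "start_y": [40, 40, 40, 40, 40],
        "end_x": [10, 40, 30, 40, 50],
        "end_y": [40, 40, 40, 40, 60],
    }
)

counts = pass_bin_totals(passes)
print(counts)
```

The resulting counts could be turned into shares of each player's passes and added as extra columns before running the outlier detection.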

Get in touch

I would love to hear your thoughts or feedback. Please get in touch on Twitter @numberstorm or via email at rowlinsonandy@gmail.com.