Soccer Data Analysis

Introduction

This soccer database comes from Kaggle. It contains data for soccer matches, players, and teams from several European countries from 2008 to 2016. The database covers 24,637 matches and over 10,000 players. Detailed match events like goal types, possession, corner, cross, fouls, cards are also recorded in the database. Moreover, this dataset integrated players and teams attribute ratings sourced from EA Sports’ FIFA video game series. We're most interested in figuring out if a player's potential for assisting goal is high, his potential for a goal is also high, and which club team improved the most over the period, and each related ranking associated with Tottenham Hotspur.

Data Wrangling

The dataset is a SQL database file containing seven tables of "Country," "League," "Match," "Player," "Play-er_Attribute," "Team," "Team_Attributes." We downloaded the SQL file from the Kaggle repository and extracted the CSV files from the DB browser application to get the data source. For the research, we used four tables formatted as CSV: Match, Team, Player, and Player_Attributes. First, we sorted the Match table with required columns, extracted data, and the names of players who recorded goals or assists and then merged it with each of the other CSV files after renaming the ID columns. We add new columns for the winner of the match and goal point by net goals and finally get 7173 lines of data with no null values to clear the questions.



Data Cleaning

Based on the first data loaded in the previous cell, I will merge it with the other two data, which contains the player's attribute, birthday, height, weight, player's overall rating, potential, and team names. According to the initial interest, I will then extract age and names for players who recorded a goal or assist from the merged data. Lastly, I will calculate the goal point, which is home_team_goal - away_team_goal.



Research Question 1 Soccer is one of the most popular sports in the world, even in Asia. Because soccer was known to Asia much later than the other continents, Asian teams hardly won the game with Europe or South America. When they had a landslide loss, I frequently came across an article or comments from audiences making excuses for the failure in the game with the player's physical benefit, age, overall rating, and potentials. To clarify this question, I would like to see a correlation between a player's overall rating/potential/height/weight/age and recording goals.

We made two groups. Each group has 30 players per the number of records from the top and bottom, respectively, because comparing those with no goal record with the top players is difficult to determine a valid comparison value. We get all the index numbers of each group and then calculate the average of the player's overall ratings and find the cumulated average for each group. We figured out that the top 30 players who recorded goals had 12 more overall ratings and 9 points in potentials. However, they were three years older and 3lbs heavier than the other group. Considering the range of player's overall ratings and potential are 34 and 32 each, the gap is significant between those two groups. Therefore, it can be evidence that the overall rating and potentials with accumulated experience in the league positively affect making a goal. To make sure if the height and weight affect overall ratings and potential ratings, we implemented a regression analysis by plotting a scatter chart putting height and weight at X-axis and Potential ratings and overall ratings at Y-axis for the entire players recorded more than a goal. We also selected colors per the held location to check those three variables in the same plot. From the implementation, we could not find a significant correlation between them. In these findings, on top of everything, I can genuinely rely on the EA's rating algorism and do not prefer those to excuse the failure due to physical benefit only, and I can mention that top players' age mean experience.

Attribute descriptions for top 30th and bottom 30th goal or assist players are as follows.



Research Question 2 We know that the best play for the season is not always the best assister. I doubt that the attributes in assisting goal are possibly only ability to pass the ball well on the correct times but nothing to do with making goal. Therefore, I would like to know if a player's potential for assisting a goal is high, his potential for making a goal is also high.

First, we plotted a regression graph by putting the potential ratings of players recorded assists at X-axis and their overall ratings at Y-axis so as to check it surely related before implementing the intended test, and the graph shows closely co-related pattern. Then, we plotted a regression graph by putting assist-potential at X-axis and goal-potential at Y-axis. We also selected colors per the held location to check those three variables in the same plot. The scatter plot shows a regressive pattern heading straight upward, but it was not tight as much as the former analysis was; therefore, to make sure, we also conducted a Logistic Regression model to check the coefficient and P-value. Although the results show a low R-squared value, considering the coefficient;0.52 and P-values; less than 0.0001, we could find evidence that there is a remarkable correlation between those two attributes. From the analysis, I would recommend improving the passing skill first and then developing tactical moves to make goals for the soccer dreamers.



Research Question 3 As a Tottenham Hotspur lover, I never skip watching the game. Although the recent two seasons were not so great as they did, there is no doubt that they have been a prestigious team for many years without a tremendous downturn per the stadium's location. So, I want to find if they won more when they played at their home stadium and the gap is remarkable.

To answer this and the next questions, we melted the initial DataFrame by relocating the entire players' IDs and numbers. And then, selected the team winner as Tottenham Hotspur and categorized each location per season. We chose a count plot as setting X-axis with the location of the game and added different colors on the bars per season. According to the shown chart, except for the 2013/2014 season, the team had more winning records when they played in the home stadium than in the opponents' place. The team gradually increased the number of wins played in away stadiums after touching the lowest winning record during the 2012/2013 season. The result is interesting enough to know that how the home benefit affects the game and why they have two matches per team when in the Europa championship league in each location.



Research Question 4 As a Tottenham Hotspur lover, it is also important to know even about single details. Since I have limited knowledge about who recorded the most goal, assist, and starting lineup of the game in Tottenham Hotspur over the recent decade, I want to get the information from this question.

I grouped two cases for this research; one is when the team won, and the other is else case. We plotted a bar chart for each of the top 10 players. The top player who recorded in the starting lineup was 'Aaron Lennon' and 'Kyle Walker respectively. And the top player who recorded goals was 'Gareth Bale' and 'Harry Kane' respectively. Lastly, the top player who recorded assists was 'Aaron Lennon' in both cases. I did not know that Bale was so great when he was here. Indeed, he was in his prime during the period with the team.



Research Question 5 Because European soccer aged more than a hundred years, I believe a new team has advanced to the primary league from the secondary league or vice versa during history. With this curiosity, I want to check which club team improved the most over the period and each related ranking?

To find the answer, we sorted each team per the number of winning for each season and combined the entire season data to get a simplified DataFrame. From the new DataFrame, we sorted out the top and bottom ten teams in recording wins and compared each season. To select the most improved team, we applied conditions that the team's overall performance should not have a sudden downturn over the period. From the bar chart plotted with bivariable, the most improved team during the period was FC Bayern Munich. The team only recorded five wins in the first season. They maintained the standings without plummetting and finally recorded five times as many wins in the last season. There should be a remarkable difference between the first and last season. I realized there is no champion team last for long in the league. And every team can become the champion under the proper training because there is not a high difference in the level of each team. After all, the team's attributions are more likely standardized.



Additional finding

In the entire season, Manchester United won most; the top player recorded goals and assists are Cristiano Ronaldo and Lionel Messi, respectively.



Conclusions

We were most interested in figuring out if a player's potential for assisting goals is high, his potential for recording goals is also high, which club team improved the most over the period, and each related ranking associated with Tottenham Hotspur. Before we write a summary, we would like to highlight that we analyzed data having 7173 lines for nine seasons of 11 countries' leagues for these questions. As we all know, the number of data matters in finding more precise answers. And also, given that a private company updated the attribute of players, it may not contain objective ratings according to the players' actual performance, as an instance, some players could suffer an injury and end the season earlier. We conducted the multidirectional statistical approach and learned that there is a remarkable correlation between a player's assist potential and goal potential. Except for the 2013/2014 season, Tottenham Hotspur had a more winning record when they played in the home stadium than in the opponents' place. The top player who recorded in the starting lineup in Tottenham Hotspur was 'Kyle Walker', the top player who recorded goals was 'Harry Kane, and the top player who recorded assists was 'Aaron Lennon.' Besides, we also figured out FC Bayern Munich improved the most over the period. In the entire season, Manchester United won most; the top player recorded goals and assists are Cristiano Ronaldo and Lionel Messi, respectively. Additionally, we found out that the top 30 players who recorded goals had 12 more points in overall ratings and 9 points in potentials; however, they were three years older and 3lbs heavier than the bottom 30 players who recorded goals. In these findings, we can reduce our prejudice about the superficial aspects such as physical benefits and age in winning a game, passing skill presumably the fundamental factor to become a successful forwarder, and a home stadium benefit exists. Finally, each team has an open chance to become a champion.





more on:


https://github.com/cpasean/Soccer-dataset-by-Kaggle/blob/main/_project_Investigate_a_Soccer%20Dataset.pdf


  • Twitter
  • LinkedIn
  • Pinterest
  • YouTube
17880656449892228.jpg