This weekend brought a couple of firsts in Cardiff’s win against Norwich.
After a wretched time at Manchester United, Wilfried Zaha recorded his first Premiership assist, whilst, more interestingly, Craig Bellamy became the first player in history to score for seven different Premier League clubs.
To celebrate, I thought it was worth taking a quick data dip with the new dplyr package for R, a souped-up version of plyr for data.frames.
A main advantage of dplyr is that it is way faster than plyr, but it also offers the option to chain operations, utilizing %.%. This encourages the good discipline of planning logically ahead of coding, something I am not naturally inclined to, and should make the code more readable.
I have loaded into R a largish (270,000-row) data.frame, playerGames, of players’ appearances in the English Premier League.
My target is a graph showing, for each of the players who have scored for the most different clubs, how many games it took them to score their first goal for each of these teams.
The process uses several of the dplyr functions. Firstly, I want to tidy up the data, reduce it to the variables of interest and then add some required columns. I then want to find out who these itinerant players are and ascertain when they got off the mark with each club. Finally, I will knock out a ggplot.
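To make the chaining idea concrete, here is a minimal sketch on a made-up data frame (the `toy` data and its values are invented for illustration and are not taken from playerGames). It is written with `%>%`, the operator that superseded `%.%` in later dplyr releases; under the dplyr version used in this post, `%.%` would take its place.

```r
library(dplyr)

# toy data, invented purely for illustration
toy <- data.frame(
  playerID = c("A", "A", "B", "B", "B"),
  goals    = c(0, 2, 1, 0, 1),
  stringsAsFactors = FALSE
)

# nested calls have to be read from the inside out
nested <- summarise(group_by(filter(toy, goals > 0), playerID), total = sum(goals))

# the chained version reads top to bottom, one step per line
chained <- toy %>%
  filter(goals > 0) %>%
  group_by(playerID) %>%
  summarise(total = sum(goals))
```

Both forms give the same result; the chained one simply lists the steps in the order they are planned.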
    # load packages - make sure plyr is not running as this may cause issues
    library(dplyr)
    library(ggplot2)
    library(scales)

    # convert the data.frame to a tbl_df: this is a wrapper around a data
    # frame that won't accidentally print a lot of data to screen
    playerGames_df <- tbl_df(playerGames)

    # start the munging
    allGames <- playerGames_df %.%
      # omit own goals and rows for players who did not appear in the game
      filter(playerID != "OWNGOAL" & (START + subOn) > 0) %.%
      # rename columns to standard format
      # set to required columns
      select(playerID, teamID, goals, gameDate) %.%
      # sort on game date
      arrange(gameDate) %.%
      # group each player by team
      group_by(playerID, teamID) %.%
      # so that we can set a game order and cumulate goals for each player/team
      mutate(
        game = 1:NROW(goals),
        cumGoals = cumsum(goals)
      )

    # example row
    tail(allGames, 1)
    #        playerID teamID goals gameDate   game cumGoals
    # 222249 OSCAR    CHL    0     2014-02-03 56   10

    # now we need to find these players
    topPlayers <- ([...] > 0) %.%
      # and sum the number of clubs by player
      group_by(playerID) %.%
      summarise(teams = n()) %.%
      # now just show Bellamy and the others who were also on six teams
      filter(teams == max(teams) | teams == max(teams) - 1))$playerID

    topPlayers
    # "BARMBYN" "COLEA1" "BENTD" "BELLAMC" "KEANER2"
    # "CROUCHP" "ANELKAN" "FERDINL"

    # now for these players calculate the debut goal data
    firstGoal <- [...] > 0) %.%
      # and then select first row for each player/club
      group_by(playerID, teamID) %.%
      summarise(first = min(game))

    head(firstGoal, 1)
    #   playerID teamID first
    # 1 BENTD    ASV    1
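The opening of the topPlayers and firstGoal statements above is incomplete as published, so what follows is only a sketch of how those two steps might work, run on a tiny invented data set. The `filter(cumGoals > 0)` step, the `distinct()` call and the `%in%` test are assumptions about the missing pieces rather than the original code, the `games` data is made up, and `%>%` stands in for `%.%` so that it runs on current dplyr.

```r
library(dplyr)

# toy appearance data, invented for illustration (not real results)
games <- data.frame(
  playerID = c("P1", "P1", "P1", "P1", "P2", "P2", "P3"),
  teamID   = c("AAA", "AAA", "BBB", "BBB", "AAA", "AAA", "CCC"),
  goals    = c(0, 1, 1, 0, 0, 0, 2),
  gameDate = as.Date(c("2013-08-17", "2013-08-24", "2014-01-11", "2014-01-18",
                       "2013-08-17", "2013-08-24", "2013-08-17")),
  stringsAsFactors = FALSE
)

# game order and running goal tally per player/club, as in the post
allGames <- games %>%
  arrange(gameDate) %>%
  group_by(playerID, teamID) %>%
  mutate(game = row_number(), cumGoals = cumsum(goals)) %>%
  ungroup()

# ASSUMPTION: keep one row per player/club the player has scored for
scoredFor <- allGames %>%
  filter(cumGoals > 0) %>%
  distinct(playerID, teamID)

# count clubs per player, then keep those on the maximum or one fewer
topPlayers <- (scoredFor %>%
  group_by(playerID) %>%
  summarise(teams = n()) %>%
  filter(teams == max(teams) | teams == max(teams) - 1))$playerID

# ASSUMPTION: restrict to those players and to games from the first goal on,
# then take the earliest such game for each player/club
firstGoal <- allGames %>%
  filter(playerID %in% topPlayers, cumGoals > 0) %>%
  group_by(playerID, teamID) %>%
  summarise(first = min(game)) %>%
  ungroup()
```

Because cumGoals only becomes positive in the game where a player first scores for a club, the minimum game number among those rows is the appearance on which they got off the mark.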
At this point, my computer, WordPress and the coding wrapper decided to screw up. The rest of the code just replaces playerID with real names and uses ggplot to create a chart.
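Since that final step is not shown, here is a rough sketch of what swapping IDs for names and plotting could look like. Everything in it is hypothetical: the `playerNames` lookup table, the toy `firstGoal` values, the `%>%` operator in place of `%.%`, and the choice of a text-label chart are illustrative assumptions, not the code behind the original graph.

```r
library(dplyr)
library(ggplot2)

# toy stand-in for the firstGoal table (IDs and values invented)
firstGoal <- data.frame(
  playerID = c("P1", "P1", "P2", "P2"),
  teamID   = c("AAA", "BBB", "AAA", "CCC"),
  first    = c(1, 4, 13, 7),
  stringsAsFactors = FALSE
)

# hypothetical lookup from playerID to a display name
playerNames <- data.frame(
  playerID = c("P1", "P2"),
  name     = c("Player One", "Player Two"),
  stringsAsFactors = FALSE
)

# replace IDs with names by joining on playerID
plotData <- firstGoal %>%
  inner_join(playerNames, by = "playerID")

# one row per player, with each club's code placed at the game of the first goal
p <- ggplot(plotData, aes(x = first, y = name, label = teamID)) +
  geom_text() +
  labs(x = "Appearance on which first goal was scored", y = NULL)
```

Printing `p` draws the chart; using the club code as the plotted label keeps player, club and game number readable in a single panel.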
A few football points to note
- Bellamy took 13 appearances to score his first Premiership goal for Cardiff, although he had scored plenty for them in the division below. This is the longest wait, due in part to many sub appearances, playing for a weak team and old age
- Darren Bent scored on his debut on four occasions. Anelka never managed it before game 4
- Out of roughly 4,000 players who have appeared in the Premiership, both of those with the surname Bent figure. One of the two A Coles and one of the two R Keanes also appear in the list of nine
- Liverpool and Tottenham figure the most, with five stops each. Crouch, Keane and Bellamy have each appeared for both clubs
- All five Spurs players scored in their first four appearances. By contrast, none of the Liverpool five got off the mark before game 7 (Bellamy), with all the others in the 10-12 range