Monthly Archives: November 2011

Blankety Blank

The erstwhile big 4 all blanked their opponents last Saturday and a poster on the Guardian wondered when was the previous occasion of such an occurrence. It’s a pretty simple procedure in SQL using a subquery, but in the spirit of learning R, I thought I would tackle the problem in that language, with the benefit of a couple of graphs thrown in. Code with comments below

So on days when all four teams played, this is the ninth time it has happened and the first since 9th May 2010, when they bullied their way to sixteen goals without response. After a few occasions early on in the history of the EPL, there was a gap of more than 100 games and almost nine years when at least one of their opponents scored. The graph shows the increasing dominance of these teams in the past few years as evidenced by the increased days when 3 of the 4 post shut outs

On the other hand

On 22nd November 2008, there was a pretty extraordinary happening as none of these teams scored for the first and only time of the 232 days on which they have all played. Even then, three of them managed 0-0 ties; with Arsenal the sole loser

Chart type: Scatterplot
Inspiration: Comment in Guardian
Data: Own data
Tools: MSSQL database, R
Packages: RODBC, plyr, ggplot2
Ignorance fix: qplot colour stipulation, error in
Develop: Create function for vector of teams, differing goal levels
Make Interactive with web input of parameters (help required)

# load necessary libraries - typically these are all in my startup file
# Make a connection to MSSQL database and obtain data
# The subsetting of teams is done here but could equally well be performed in R
channel <- odbcConnect("myConnection")
results <- sqlQuery(channel,paste(
SELECT      myDate, TEAM,GA,GF
FROM         myTable
WHERE     myTable.TEAM IN (N'mnu',N'chl',N'ars',N'liv')
# summarise the data. I use .s as a suffix
# I am still coming to terms with the plyr package but Stackoverflow came to rescue 
results.s <- ddply(results,.(myDate),summarise,games = length(TEAM),
blanks = length(which(GA == 0)),blanked = length(which(GF == 0)))
# check that it looks correct
    myDate 	games blanks blanked
1 1992-08-15     3      0       0
2 1992-08-16     1      0       1
# restrict to dates on which all the teams were playing
results.s <- subset(results.s,games==4)
# plot graph 
qplot(myDate,blanks, data=results.s,xlab="Game Order", ylab="Number of Shutouts"
,main="Shutouts recorded on same day by Arsenal, Chelsea, Liverpool and Man. U")
# !? Error in$year, to$year, by) : 'from' must be finite
# Try using a numbered sequence instead. 
results.s$num <- seq(from=1, to=length(glsFor.p[,1]), by=1)
# This works !? why does specifying any color produce a red dot
qplot(num,blanks, data=results.s,xlab="Game Order", ylab="Number of Shutouts"
,main="Shutouts recorded on same day by Arsenal, Chelsea, Liverpool and Man. U"
# try ggplot version and reduce point size
ggplot(results.s, aes(x=num,y=blanks))+geom_point(size=1.5,colour="steelblue")+
opts(title="Shutouts recorded on same day by Arsenal, Chelsea, Liverpool and Man. U") +
labs(x="Game Order",y="Number of Shutouts")
# Plot the times teams do not score
ggplot(results.s, aes(x=num,y=blanked))+geom_point(size=1.5,colour="red")+
opts(title="Shutouts incurred on same day by Arsenal, Chelsea, Liverpool and Man. U") +
labs(x="Game Order",y="Number of Shutouts")
# print as pdfs and amend as required in Illustrator

Created by Pretty R at

Margins of Victory

This is a repost from a few days ago that I am using as my introduction to the R-bloggers site. Having experimented with R lately, I have decided to add the relevant code to future blogs mainly in the hope of suggestions for improvemen; this code being a case in pont. You can view the script and a few notes at the end of the blog. I may go back and add code to some of the previous charts I have produced and will twitter such occurrences at @pssguy

Alex Ferguson was relieved to get a 1-0 result at Everton. The team’s first such scoreline of the year but 95th all time in the EPL, second only to Chelsea
I thought it would be interesting to look at the distribution of wins and losses by team over EPL history and as it ‘appens (RIP Jimmy), there is an R package which makes constructing a graph showing each of these fairly straight forward.

Shown here are the total, home, away and categorization of Man U results, to date

As the most successful team in EPL history wins dwarf losses. Indeed, United have won significantly more games with a three goal margin, 140, than they have lost at all, 105

The drubbing in the recent derby stands out, as does the fact that United have lost only two home games by precisely 2 goals in nearly twenty years

United have won 42 away games by a margin of three goals or more, but Arsenal pip them by 1

Chart type: Back to Back Histogram
Inspiration: Patrick Burn
Data: Own data
Tools: MSSQL database, R
Packages: stringr, RODBC, Hmisc
Fix: xaxis should show goal margin as integer perpendicular to axis
Develop: Create function for team, home/away. Add opponents, by season options
Make Interactive with web input of parameters (help required)

#  Inspired by Back to Back histogram
library(RODBC) # for database connection
library(Hmisc) # for the plot
library(stringR) # for text manipulation
# link to database using the RODBC package and create a df with relevant data
channel <- odbcConnect("myConnection")
scores <- sqlQuery(channel,paste(
select TEAM as teamID, Ground as ground,RES as result,GF as glsFor,GA as glsAg
from myTable
# take a look at the data - the first result shows a home game for Blackburn 
#teamID ground result glsFor glsAg
 #  BLB      H      D      0     0 # a no-score draw - pretty exciting!
# add a column for margin of victory/loss
scores$margin <- abs(scores$glsFor - scores$glsAg)    
# Do a little text manipulation to enhance axes labels using the stringR package
scores$result <- str_replace(scores$result,"W","Wins")
scores$result <- str_replace(scores$result,"L","Losses")
# now let's look at a plot of Manchester United's away record by setting a variable
# we should exclude the tied results as well
myData <- subset(scores,teamID=="MNU"&ground=="A"&result!="D")
# myData is available at  read.csv("")
# Now plot the data using the histbackback function in the Hmisc package
# The function creates 3 lists for wins,losses at each goal margin and plots the data
out <- histbackback(split(myData$margin, myData$result), probability=FALSE,   axes=TRUE,  
brks=c(0:max(myData$margin)), ylab="Goal Difference" ,col.main="black",
                    main = "Man. Utd. EPL  Away Results (exc draws)")
# Add some colour to help differentiate
barplot(-out$left, col="red" , horiz=TRUE, space=0, add=TRUE, axes=FALSE)
barplot(out$right, col="blue", horiz=TRUE, space=0, add=TRUE, axes=FALSE)

Created by Pretty R at