Tag Archives: R

Processing Public Data with R

I use R aplenty in analysis and thought it might be worthwhile to show the typical process a relative newcomer goes through in extracting and analyzing public datasets.

In this instance I happen to be looking at Canadian air pollution statistics.

The data I am interested in is available on the Ontario Ministry of Environment’s website. I have downloaded the hourly ozone readings from two weather stations (Grand Bend and Toronto West) for two years (2000 and 2011), which are available in several formats, including my preference, csv. According to the Ministry’s 2010 annual report, the two selected stations represent the extremes in readings for that year.

I firstly set the directory in which the code and the associated datafiles will reside and import the data. I would normally load any R packages I will utilize at the head of the script (if not already in my start-up file) but will hold off here until they are put to use.
I had to do a small amount of row deletion in the csv files so that only the readings data was included.
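Incidentally, the manual deletion could be avoided with read.csv’s skip argument, which drops a given number of leading lines. A sketch – the 7 here is hypothetical, as the actual offset varies by file:

# hypothetical alternative: skip leading metadata rows instead of editing the file
gb2000 <- read.csv("grandBendO32000.csv", skip = 7)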

# set working directory
setwd("C:/Users/pssguy/Documents/R/societal/pollution")
 
# create  and combine dataframes for each dataset 
# The read.csv function has ideal default arguments of header = TRUE, sep = ","
gb2000 <- read.csv("grandBendO32000.csv")
gb2011 <- read.csv("grandBendO32011.csv")
tw2000 <- read.csv("torontoWestO32000.csv")
tw2011 <- read.csv("torontoWestO32011.csv")
 
# combine the above
pollutant <- rbind(gb2000,gb2011)
pollutant <- rbind(pollutant,tw2000)
pollutant <- rbind(pollutant,tw2011)
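The same result can be had in a single pass by reading and binding in one step:

# alternative: read all four files and combine in one go
files <- c("grandBendO32000.csv", "grandBendO32011.csv",
           "torontoWestO32000.csv", "torontoWestO32011.csv")
pollutant <- do.call(rbind, lapply(files, read.csv))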

It is almost de rigueur for me to look at the structure of the resultant dataframe.

str(pollutant)
 
# 'data.frame':   1490 obs. of  27 variables:
# $ Station.ID: chr  "15020" "15020" "15020" "15020" ...
# $ Pollutant : chr  "Ozone" "Ozone" "Ozone" "Ozone" ...
# $ Date      : chr  "01/01/2000" "02/01/2000" "03/01/2000" "04/01/2000" ...
# $ H01       : int  9999 6 25 16 38 25 0 15 8 12 ...
# $ H02       : int  6 6 24 14 37 17 0 13 7 9 ...
# etc.

There are several aspects of the dataframe that need to be addressed. Firstly, 24 of the 27 variables are hourly readings. To do a proper analysis, I need to reshape the data so that I have just one ‘hour’ column which contains all the ozone readings. For this, I use the reshape2 package. As with all of the packages covered here, it is readily downloadable from the CRAN repository from within the R environment.

# load the pre-installed package
library(reshape2)
 
# reshape data, retaining the id variables: station ID, pollutant and date
pollutant.melt <- melt(pollutant, id = c("Station.ID", "Pollutant","Date"))
 
# check out the structure of the new dataframe
str(pollutant.melt)
 
# 'data.frame':   35088 obs. of  5 variables:
# $ Station.ID: int  15020 15020 15020 15020 15020 15020 15020 15020 15020 15020 ...
# $ Pollutant : chr  "Ozone" "Ozone" "Ozone" "Ozone" ...
# $ Date      : chr  "01/01/2000" "02/01/2000" "03/01/2000" "04/01/2000" ...
# $ variable  : Factor w/ 24 levels "H01","H02","H03",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ value     : int  9999 6 25 16 38 25 0 15 8 12

The new ‘variable’ column now shows each hour a reading was taken.

I would like to rename the station ID to the geographical location. There is probably just one ID for each station, but to be on the safe side that can be checked via the unique function. As it happens, this shows three IDs in the dataframe. On checking the raw data, it can be seen that the Toronto West station moved geographical location between 2000 and 2011. In this case, the impact on the data readings is probably minimal, but this should certainly be borne in mind in any rigorous analysis.

# apply the unique function to the Station.ID column referable via the $ sign
unique(pollutant$Station.ID)
# [1] 15020 35003 35125
 
# create a new column with the station name
# The [] notation subsets the rows with a specific ID
# and the <- applies the name to a new 'station' column
 
pollutant.melt$station[pollutant.melt$Station.ID=="35003"] <- "West Toronto"
pollutant.melt$station[pollutant.melt$Station.ID=="35125"] <- "West Toronto"
pollutant.melt$station[pollutant.melt$Station.ID=="15020"] <- "Grand Bend"

I now need to address the date field. It needs to be changed from a character to a date vector and I also want to extract various components of it for analysis. The lubridate package comes in handy here.

As you can see below, some errors were noted. The cause (at least for the first one) is that in one of the csv files, the date which should have been “03/10/2000” had actually come across as “2000-10-02-999”. In a proper analysis, this would simply be corrected or investigated further with the original data provider. However, for the purpose of this exercise, the fact that the error is found, and that it relates to just 72 out of more than 35,000 observations, means that the only action taken is to exclude the offending rows from the dataframe.

 
library(lubridate)
pollutant.melt$Date <- dmy(pollutant.melt$Date)
# Using date format %d/%m/%Y. 72 failed to parse
 
# check the first instance by testing for NA output via the is.na function
# again use [] notation for subsetting; the trailing ',' selects all columns
head(pollutant.melt[is.na(pollutant.melt$Date),],1)
 #   Station.ID Pollutant Date variable value
# 277      15020     Ozone <NA>      H01  -999
 
# ascertain where row 277 occurs
head(pollutant.melt,278)
#276      15020     Ozone 2000-10-02      H01    38
#277      15020     Ozone       <NA>      H01  -999
#278      15020     Ozone 2000-10-04      H01    10
 
# look at raw data for Grand Bend 3rd Oct 2000
 
# exclude errors from dataframe
pollutant.melt <- pollutant.melt[!is.na(pollutant.melt$Date),]
 
# apply lubridate functions to get constituent date/time fields
# as separate columns
 
pollutant.melt$month <- month(pollutant.melt$Date)
pollutant.melt$year <- year(pollutant.melt$Date)
pollutant.melt$wday <- wday(pollutant.melt$Date)

There are just a couple more aspects to tidy up. The variable column is a series of hours of the day; removing the leading ‘H’ and creating a new ‘hour’ column makes sense. You may also have noted that the very first reading is 9999. A must-do initial perusal of the raw data warns that any reading of 9999 is invalid data and -999 is missing data. Again, if this data was critical then some follow-up would be needed, but for this situation they can either be excluded from the data as we did earlier or set to NA and specifically excluded from any analysis.

# create a new column
pollutant.melt$hour<-gsub("H","",pollutant.melt$variable)
 
# set invalid and missing data to NA
pollutant.melt$value[pollutant.melt$value==-999] <- NA
pollutant.melt$value[pollutant.melt$value==9999] <- NA
 
# if desired, any unwanted columns  can be dumped. 
pollutant.melt$Station.ID <- NULL
pollutant.melt$variable <- NULL

I now have an acceptable set of data to examine. This is not the subject of this blog, but a quick look at how the average monthly ozone levels vary by station and year highlights how a couple of other packages – plyr and ggplot2 – are extremely useful contributors to the R arsenal. As with the other packages mentioned, there are pdfs providing details of all their functions and hosts of tutorials etc. available online.

 
# load the packages put to use here
library(plyr)
library(ggplot2)
 
# create a new dataframe which gives average ozone levels
# na.rm excludes the NA readings we introduced
ozone_m_y_s <- ddply(pollutant.melt, c("month","year","station"),
                     summarize, av=mean(value, na.rm=TRUE))
head(ozone_m_y_s,3)
#month year      station       av
#1     1 2000   Grand Bend 25.10633
#2     1 2000 West Toronto 13.73108
#3     1 2011   Grand Bend 32.62988
 
# write a graph of the data to file
png("ozone1.png")
p <- ggplot(ozone_m_y_s, aes(x=month, y=av, colour=station))
p <- p + geom_line()
p <- p + facet_wrap(~ year)
p <- p + labs(x="Month", y="Ozone level (parts per billion)")
p <- p + ggtitle("Avg Monthly Ozone levels (ppb) at two Ontario locations")
print(p) # explicit print() ensures the plot is written inside the device
dev.off()

The graph highlights the fact that ozone levels are highest in the summer months and have hardly changed at either location over the two years under consideration.

Check out the blog for further analysis of air quality data.

Simulated War

I am quite interested in both wars with sabres and sabermetric WARs, but the War I am most involved in is the card game. Unfortunately, it is one of my six-year-old’s favourites and he is quite happy to while away the hours (literally) playing it with anyone pressganged into joining him. I must admit that – conscientious objector that I am – whenever a war is underway and my son’s attention is elsewhere, I try to slip a low card to the top of my pile, which has the double attraction of both making it more likely that he wins and speeding up the conclusion of hostilities.

However, I thought it might be interesting to see how long games would take without such intervention by attempting to code the process and run a simulation. I often use R for analysis towards blog posts and in this instance thought it might be worthwhile showing my code; no doubt there are several improvements that could be suggested. The wiki article does post simulation results, but the game I play has a couple of variations, making comparisons more interesting.

The first thing to do is to create a deck (I have used numbers 11 through 14 for JQKA).

# create a regular deck.
# All suits are equivalent so there will be four of each number
deck <- rep(2:14,4)

Then a random sample is provided to each player (p1, p2). The first 26 cards are pretty simple. However, the setdiff function in base R does not handle duplicates, so I utilized the ‘sets’ package. Even that was a little tricky, but as usual the answer to a Stack Overflow query set me on the right track.

# load the package providing multiset operations
library(sets)
 
# make results reproducible
# set.seed(1066) 
p1 <- sample(deck, 26, replace=FALSE)
 
diffs <- gset_difference(as.gset(deck), as.gset(p1))
# create vector
p2 <- rep(unlist(diffs), times=gset_memberships(diffs))
# this produces the right cards, but in order, so randomize
p2 <- sample(p2, 26, replace=FALSE)
p1
# [1]  3  3  6 14 10 13  3 10 12  5 11  8  2  5  8  4 10  5  8  4
# [21]  8  7  7  7  9 14
p2
# [1] 12  6  9  5  9  9 13  4  2 14  4 13 13  6  6 11  7 11 12  2
# [21] 10  3  2 11 12 14
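Incidentally, the sets detour can be sidestepped altogether by sampling index positions rather than card values, since positions are unique even though the values are duplicated – a simpler base R alternative:

# simpler alternative: deal by position, so duplicate values are no problem
idx <- sample(52, 26) # the 26 positions dealt to p1
p1 <- deck[idx]
p2 <- deck[-idx]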

In the analysis, I may be interested in the starting conditions of each player’s hand, so I need to compute the overall value, the number of aces and – as 2s trump aces in my variation – deuces. Only one player’s data is required, but I will need to add the result after each game. For now I put in an “N” value.

p1Cards <- length(p1)
strength <- sum(p1) # 196; the combined total is always 416 so p2 has the stronger hand
aces <- sum(p1==14) # 2
deuces <- sum(p1==2) # 1
result <- "N"
 
# i is the game counter, set in the simulation loop shown at the end
game <- data.frame(id=i, strength=strength, aces=aces, deuces=deuces, result=result, stringsAsFactors=FALSE)
 
draw <- c(p1[1], p2[1])
 
booty <- c()

Now the game can start, as the top card in each player’s deck is drawn. There are three outcomes: either player may have the higher card and win the battle, or it is a tie – and a war ensues. Here p1 draws a 3 and p2 a Queen (12), so p2 takes the drawn cards and adds them to the bottom of his deck. Here is the relevant code.

if (p2[1] > p1[1]) {
  p2 <- c(p2[-1], draw)
  p1 <- p1[-1]
}

Of course, the code needs to take account of the occasion when p1’s card is higher. There is also the wrinkle mentioned above where an Ace is trumped by a 2. The full logic for a battle is shown below.

if (p1[1] > p2[1]) {
  if (p1[1] == 14 & p2[1] == 2) { # ace (14) vs a 2
    p2 <- c(p2[-1], draw)
    p1 <- p1[-1]
  } else {
    p1 <- c(p1[-1], draw)
    p2 <- p2[-1]
  }
} else if (p2[1] > p1[1]) {
  if (p2[1] == 14 & p1[1] == 2) {
    p1 <- c(p1[-1], draw)
    p2 <- p2[-1]
  } else {
    p2 <- c(p2[-1], draw)
    p1 <- p1[-1]
  }
}

The ‘fun’ starts when the cards match – which in this simulation does not occur until their final cards, which are both aces. In this scenario, a variation from the wiki version, the matched cards and 3 more cards from each player form a ‘booty’. The next card in each pack is then compared and the winner takes all 10 cards. If the cards match again, the war scenario is repeated until one player proves victorious. There is also the possibility that one of the players runs out of cards and forfeits the game.

# keep running until the displayed cards do not match
while (p1[1] == p2[1]) {
 
  # need at least 5 cards to continue the game
  if (length(p1) < 5 | length(p2) < 5) {
    break
  }
  # displayed card plus the next three from each player
  booty <- c(booty, p1[1:4], p2[1:4])
 
  # remove these cards from p1, p2 so that the new p1[1] is the next shown
  p1 <- p1[-(1:4)]
  p2 <- p2[-(1:4)]
}
 
draw <- c(p1[1], p2[1])
 
if (p1[1] > p2[1]) {
  p1 <- c(p1[-1], booty, draw)
  p2 <- p2[-1]
} else {
  p2 <- c(p2[-1], booty, draw)
  p1 <- p1[-1]
}

This scenario is repeated until one player is out of cards and the game is over. According to the wiki article, there is the possibility of an infinite loop being established, so action is required to avoid that circumstance. Here is the relevant code.

# keep a running record of p1's deck size
p1Cards <- c(p1Cards, length(p1))
 
# test for game over
if (length(p1) == 52 | length(p1) == 0) {
  break
}
# avoid an infinite loop
if (length(p1Cards) > 5000) {
  break
}
# reset for the next iteration
booty <- c()
draw <- c(p1[1], p2[1])

After each game, I wish to record both the summary details and the trend in deck size.

# First calculate the result and add it to the summary dataframe
if (max(p1Cards) < 52) {
  p1Cards <- -(p1Cards - 52) # express as the winner's (p2's) deck size
  game$result <- "L"
} else {
  game$result <- "W"
}
 
games <- rbind(games, game)
deckSize <- data.frame(i, p1Cards)
deckSizes <- rbind(deckSizes, deckSize)

The final stage is to simulate this repeatedly – I settled on 1000 games – and save the data for subsequent analysis and a future blog post. The full code is given below.

# Code for replicating the War card game
 
# Andrew Clark April 01 2012
 
library(sets)
 
# make the simulation replicable
set.seed(1068)
 
games <- data.frame(id=numeric(), strength=numeric(), aces=numeric(),
                    deuces=numeric(), result=character())
deckSizes <- data.frame(id=numeric(), details=numeric())
 
for (i in 1:1000) {
  # create a regular deck
  # All suits are equivalent so there will be four of each number
  deck <- rep(2:14, 4)
 
  # deal 26 random cards to each player
  p1 <- sample(deck, 26, replace=FALSE)
 
  diffs <- gset_difference(as.gset(deck), as.gset(p1))
  # create vector
  p2 <- rep(unlist(diffs), times=gset_memberships(diffs))
  # this produces the right cards, but in order, so randomize
  p2 <- sample(p2, 26, replace=FALSE)
 
  # record the starting conditions of p1's hand
  p1Cards <- length(p1)
  strength <- sum(p1) # the combined total is always 416
  aces <- sum(p1 == 14)
  deuces <- sum(p1 == 2)
  result <- "N"
 
  game <- data.frame(id=i, strength=strength, aces=aces,
                     deuces=deuces, result=result, stringsAsFactors=FALSE)
 
  draw <- c(p1[1], p2[1])
  booty <- c()
 
  repeat { # for each pair of drawn cards
    if (p1[1] > p2[1]) {
      if (p1[1] == 14 & p2[1] == 2) { # ace (14) vs a 2
        p2 <- c(p2[-1], draw)
        p1 <- p1[-1]
      } else {
        p1 <- c(p1[-1], draw)
        p2 <- p2[-1]
      }
    } else if (p2[1] > p1[1]) {
      if (p2[1] == 14 & p1[1] == 2) {
        p1 <- c(p1[-1], draw)
        p2 <- p2[-1]
      } else {
        p2 <- c(p2[-1], draw)
        p1 <- p1[-1]
      }
    } else {
      # a war: keep going until the displayed cards do not match
      while (p1[1] == p2[1]) {
 
        # need at least 5 cards to continue the game
        if (length(p1) < 5 | length(p2) < 5) {
          break
        }
        # displayed card plus the next three from each player
        booty <- c(booty, p1[1:4], p2[1:4])
 
        # remove these cards from p1, p2 so that the new p1[1] is the next shown
        p1 <- p1[-(1:4)]
        p2 <- p2[-(1:4)]
      }
 
      draw <- c(p1[1], p2[1])
 
      if (p1[1] > p2[1]) {
        p1 <- c(p1[-1], booty, draw)
        p2 <- p2[-1]
      } else {
        p2 <- c(p2[-1], booty, draw)
        p1 <- p1[-1]
      }
    }
    # battle over
 
    # keep a running record of p1's deck size
    p1Cards <- c(p1Cards, length(p1))
 
    # test for game over
    if (length(p1) == 52 | length(p1) == 0) {
      break
    }
    # avoid an infinite loop
    if (length(p1Cards) > 5000) {
      break
    }
    # reset for the next iteration
    booty <- c()
    draw <- c(p1[1], p2[1])
  }
  # war over
 
  if (max(p1Cards) < 52) {
    p1Cards <- -(p1Cards - 52) # express as the winner's (p2's) deck size
    game$result <- "L"
  } else {
    game$result <- "W"
  }
 
  games <- rbind(games, game)
  deckSize <- data.frame(i, p1Cards)
  deckSizes <- rbind(deckSizes, deckSize)
}
 
# save for later analysis
write.table(games, "games1000random.csv")
write.table(deckSizes, "deckSizes1000random.csv")
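The saved summary can then be read back in for a quick sanity check – for example, the win/loss split and whether a stronger starting hand actually helps:

# quick look at the simulation output
games <- read.table("games1000random.csv")
table(games$result) # p1's win/loss split
tapply(games$strength, games$result, mean) # average starting strength by result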

Blankety Blank

The erstwhile big four all blanked their opponents last Saturday, and a poster on the Guardian wondered when such a clean sweep had last occurred. It’s a pretty simple procedure in SQL using a subquery but, in the spirit of learning R, I thought I would tackle the problem in that language, with the benefit of a couple of graphs thrown in. Code with comments below.

So, on days when all four teams played, this is the ninth time it has happened and the first since 9th May 2010, when they bullied their way to sixteen goals without response. After a few occasions early in the history of the EPL, there was a gap of more than 100 games and almost nine years in which at least one of their opponents scored. The graph shows the increasing dominance of these teams in the past few years, as evidenced by the growing number of days on which three of the four post shutouts.

On the other hand

On 22nd November 2008, there was a pretty extraordinary happening, as none of these teams scored – the first and only time in the 232 days on which they have all played. Even then, three of them managed 0-0 ties, with Arsenal the sole loser.

Chart type: Scatterplot
Inspiration: Comment in Guardian
Data: Own data
Tools: MSSQL database, R
Packages: RODBC, plyr, ggplot2
Ignorance fix: qplot colour stipulation, error in seq.int
Develop: Create function for vector of teams, differing goal levels
Make Interactive with web input of parameters (help required)

# load necessary libraries - typically these are all in my startup file
library(RODBC)
library(plyr)
library(ggplot2)
 
# Make a connection to MSSQL database and obtain data
# The subsetting of teams is done here but could equally well be performed in R
channel <- odbcConnect("myConnection")
 
results <- sqlQuery(channel,paste(
"
SELECT      myDate, TEAM,GA,GF
FROM         myTable
WHERE     myTable.TEAM IN (N'mnu',N'chl',N'ars',N'liv')
"
));
 
odbcClose(channel)
 
# summarise the data. I use .s as a suffix
# I am still coming to terms with the plyr package but Stack Overflow came to the rescue
results.s <- ddply(results,.(myDate),summarise,games = length(TEAM),
blanks = length(which(GA == 0)),blanked = length(which(GF == 0)))
 
# check that it looks correct
head(results.s,2)
 
#       myDate games blanks blanked
# 1 1992-08-15     3      0       0
# 2 1992-08-16     1      0       1
 
# restrict to dates on which all the teams were playing
results.s <- subset(results.s,games==4)
 
# plot graph 
qplot(myDate,blanks, data=results.s,xlab="Game Order", ylab="Number of Shutouts"
,main="Shutouts recorded on same day by Arsenal, Chelsea, Liverpool and Man. U")
# !? Error in seq.int(r1$year, to$year, by) : 'from' must be finite
 
# Try using a numbered sequence instead
results.s$num <- seq(from=1, to=nrow(results.s), by=1)
 
# This works. qplot maps a bare colour="steelblue" as data (a one-level
# factor drawn in the default red), so wrap it in I() for a literal colour
qplot(num, blanks, data=results.s, xlab="Game Order", ylab="Number of Shutouts"
,main="Shutouts recorded on same day by Arsenal, Chelsea, Liverpool and Man. U"
,colour=I("steelblue"))
 
# try ggplot version and reduce point size
ggplot(results.s, aes(x=num,y=blanks))+geom_point(size=1.5,colour="steelblue")+
ggtitle("Shutouts recorded on same day by Arsenal, Chelsea, Liverpool and Man. U") +
labs(x="Game Order",y="Number of Shutouts")
 
# Plot the times teams do not score
ggplot(results.s, aes(x=num,y=blanked))+geom_point(size=1.5,colour="red")+
ggtitle("Shutouts incurred on same day by Arsenal, Chelsea, Liverpool and Man. U") +
labs(x="Game Order",y="Number of Shutouts")
 
# print as pdfs and amend as required in Illustrator
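As a first pass at the ‘Develop’ item above, here is a sketch of a function generalizing the calculation to any vector of teams and any goal level – the name and arguments are my own invention:

# sketch: days on which every team in 'teams' conceded no more than 'goals'
blankDays <- function(results, teams, goals = 0) {
  sub <- subset(results, TEAM %in% teams)
  days <- ddply(sub, .(myDate), summarise, games = length(TEAM),
                blanks = length(which(GA <= goals)))
  subset(days, games == length(teams) & blanks == length(teams))
}
# e.g. blankDays(results, c("mnu", "chl", "ars", "liv"))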


Margins of Victory

This is a repost from a few days ago that I am using as my introduction to the R-bloggers site. Having experimented with R lately, I have decided to add the relevant code to future blogs, mainly in the hope of suggestions for improvement; this code being a case in point. You can view the script and a few notes at the end of the blog. I may go back and add code to some of the previous charts I have produced and will tweet such occurrences at @pssguy.

Alex Ferguson was relieved to get a 1-0 result at Everton: the team’s first such scoreline of the year but the 95th all time in the EPL, second only to Chelsea.
I thought it would be interesting to look at the distribution of wins and losses by team over EPL history and, as it ’appens (RIP Jimmy), there is an R package which makes constructing a graph showing each of these fairly straightforward.

Shown here is the categorization of Man U’s results – total, home and away – to date.

As the most successful team in EPL history, their wins dwarf their losses. Indeed, United have won significantly more games with a three-goal margin, 140, than they have lost at all, 105.

The drubbing in the recent derby stands out, as does the fact that United have lost only two home games by precisely two goals in nearly twenty years.

United have won 42 away games by a margin of three goals or more, but Arsenal pip them by one.

Chart type: Back to Back Histogram
Inspiration: Patrick Burn
Data: Own data
Tools: MSSQL database, R
Packages: stringr, RODBC, Hmisc
Fix: xaxis should show goal margin as integer perpendicular to axis
Develop: Create function for team, home/away. Add opponents, by season options
Make Interactive with web input of parameters (help required)

#  Inspired by Back to Back histogram http://addictedtor.free.fr/graphiques/RGraphGallery.php?graph=136
 
library(RODBC) # for database connection
library(Hmisc) # for the plot
library(stringr) # for text manipulation
 
# link to database using the RODBC package and create a df with relevant data
 
channel <- odbcConnect("myConnection")
scores <- sqlQuery(channel,paste(
"
select TEAM as teamID, Ground as ground,RES as result,GF as glsFor,GA as glsAg
from myTable
"
));
odbcClose(channel)
 
# take a look at the data - the first result shows a home game for Blackburn 
head(scores,1)
# teamID ground result glsFor glsAg
#    BLB      H      D      0     0   # a no-score draw - pretty exciting!
 
# add a column for margin of victory/loss
scores$margin <- abs(scores$glsFor - scores$glsAg)    
 
# Do a little text manipulation to enhance axes labels using the stringr package
scores$result <- str_replace(scores$result,"W","Wins")
scores$result <- str_replace(scores$result,"L","Losses")
 
# now let's look at Manchester United's away record by subsetting into a new dataframe
# we should exclude the tied results as well
myData <- subset(scores,teamID=="MNU"&ground=="A"&result!="D")
 
# myData is available at  read.csv("http://www.premiersoccerstats.com/mnuadata.csv")
 
# Now plot the data using the histbackback function in the Hmisc package
# The function creates 3 lists for wins,losses at each goal margin and plots the data
png("myPng.png")
out <- histbackback(split(myData$margin, myData$result), probability=FALSE,   axes=TRUE,  
brks=c(0:max(myData$margin)), ylab="Goal Difference" ,col.main="black",
                    main = "Man. Utd. EPL  Away Results (exc draws)")
 
# Add some colour to help differentiate
barplot(-out$left, col="red" , horiz=TRUE, space=0, add=TRUE, axes=FALSE)
barplot(out$right, col="blue", horiz=TRUE, space=0, add=TRUE, axes=FALSE)
dev.off()
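The ‘Develop’ note above suggests a function parameterized by team and venue. A minimal sketch along those lines, reusing the scores dataframe – the function name and arguments are hypothetical:

# sketch: wrap the back-to-back plot for any team and venue ("H" or "A")
marginPlot <- function(scores, team, venue) {
  myData <- subset(scores, teamID == team & ground == venue & result != "D")
  out <- histbackback(split(myData$margin, myData$result), probability = FALSE,
                      brks = c(0:max(myData$margin)), ylab = "Goal Difference",
                      main = paste(team, "EPL results (exc draws)"))
  barplot(-out$left, col = "red", horiz = TRUE, space = 0, add = TRUE, axes = FALSE)
  barplot(out$right, col = "blue", horiz = TRUE, space = 0, add = TRUE, axes = FALSE)
}
# e.g. marginPlot(scores, "MNU", "A")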


C change?

I posted a few bits of trivia on Twitter @pssguy as the derby game was concluding, but thought that the outcome was worth a closer review.
So I did a bit of R coding to ascertain each club’s position at this point in the season; a sketch of the sort of calculation involved follows.
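Since positions on a given date are not stored directly, they have to be built up from results. A minimal sketch of that calculation, assuming a results dataframe along the lines of the earlier queries – the myDate, TEAM and RES ("W"/"D"/"L") column names, and ranking on points alone (a real table also uses goal difference), are my assumptions:

library(plyr)

# hypothetical helper: league table as at a cutoff date
# (restrict 'results' to a single season's games before calling)
positionsAt <- function(results, cutoff) {
  played <- subset(results, myDate <= cutoff)
  tab <- ddply(played, .(TEAM), summarise,
               pts = sum(ifelse(RES == "W", 3, ifelse(RES == "D", 1, 0))))
  tab$position <- rank(-tab$pts, ties.method = "min")
  tab[order(tab$position), ]
}

# e.g. the Manchester clubs on 24th October 2011
# subset(positionsAt(results, as.Date("2011-10-24")), TEAM %in% c("MNU", "MNC"))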
Here are the positions the two Manchester clubs have held in late October for every Premiership year:

In the twenty years of the EPL (in some of which City were not present), this is the first time City have led United at this stage of the season.
United have always figured in the top 7 – a rarity for City until recently – but have actually only led the table once on October 24th in the past 11 years; this hasn’t stopped them winning five of those titles, though.

As for the seven ever-present teams:

It is clear how the major clubs have taken a tighter grip of the league since the early days. City are the first outside this clique to head the table at this stage of the season since Leeds in 2001. The Yorkshire club were relegated two years later, a fate not likely for the Sky Blues