Monthly Archives: October 2012

Charting Wikipedia interest in GOP candidates with googleVis

I recently posted an article on how to collate Wikipedia page views

As there is a time component to this, it seemed appropriate to use the googleVis package to visualize changes in page hits in a Google motion chart

For this exercise, I ran the wikiFun function covered in the last post to collate page visits for the main Republican candidates for the Presidency. There are a couple of points worth mentioning so I have incorporated the relevant code

# load googleVis for gvisMotionChart
library(googleVis)
# dataframe, GOPdata, has daily Wikipedia page views for leading candidates from beginning of 2011
# create and plot chart locally
myChart <- gvisMotionChart(GOPdata, idvar="name", timevar="date")
plot(myChart)
# set preferred initial state (e.g. line rather than bubble chart etc.) then from
# Settings (the wrench) > Advanced > Advanced copy the state string to a variable
# enclose the copied string in single quotes exactly as below
# (only the start of the copied string survives in this listing; paste the full string from your own chart)
initState <- '
{"orderedByY":false,"sizeOption":"_UNISIZE","iconKeySettings":[{"key":{"dim0":"Mitt Romney"}}]}
'
# reproduce chart with
myChart <- gvisMotionChart(GOPdata, idvar="name", timevar="date", options=list(state=initState))
# copy the chart html to a file
cat(myChart$html$chart, file="GOP2012.html")

The html can then be massaged as required and uploaded. I would normally post this on this blog but I had difficulties using the suggested custom fields plugin and, in any case, it would probably be too large. It can be viewed here
N.B. Flash needs to be installed to view Google Charts
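For anyone reproducing the chart, the motion chart wants one row per candidate per day. Here is a minimal sketch of that layout. The counts are invented, "Candidate B" is a placeholder, and the `views` column name is an assumption, as the real GOPdata columns other than `name` and `date` are not shown in this post.

```r
# Toy stand-in for GOPdata: one row per candidate per day (counts are made up)
dates <- seq(as.Date("2011-01-01"), by = "day", length.out = 3)
toyData <- data.frame(
  name  = rep(c("Mitt Romney", "Candidate B"), each = length(dates)),
  date  = rep(dates, times = 2),
  views = c(120, 150, 90, 80, 60, 200),
  stringsAsFactors = FALSE
)
# Each name/date pair should appear only once
stopifnot(!any(duplicated(toyData[, c("name", "date")])))
# With googleVis loaded, the chart call would then be:
# myChart <- gvisMotionChart(toyData, idvar = "name", timevar = "date")
```

Keeping `date` as a Date column rather than a character string is what lets the chart treat it as a timeline.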

The graph reflects the changes in fortune of the participants quite well with eight different candidates having at least six days at the top. Now the election is nigh, however, Romney is outpacing his closest rival 10:1

Notes on a Scandal – When Jimmy beat Katy

No, the title doesn’t refer to how Katy Perry suffered at another of Jimmy Savile’s sexual predilections, although these are two of the participants. I’ll get to the details later

Just over a year ago, I reflected on the relative wiki searches of leading female singing celebrities, including Ms Perry. In the light of the recent Jimmy Savile scandal, I thought to revisit the area.

For the first post, I relied on code from a now-defunct web site and had not examined the raw data. It now appears to me as though wiki are not providing the information in the same way. The good news is that they offer a web page with daily searches for each month in JSON format, which actually simplifies matters

For this exercise, I have produced a function which collects and tabulates data for a set of people, produces graphs of their individual daily count data from the beginning of 2008 onwards and creates a group graph within a specified date range. The code is shown at the bottom of the page

Here is some of the output for some of the people mentioned during the scandal coverage

Savile, naturally, leads the way with ex-glam rock star, Gary Glitter, following. This probably reflects his generally greater fame and the severity of the allegations against him compared with DJ, Dave Lee Travis, and dead actor, Wilfrid Brambell

Now for the summary table. The difference between median and mean reflects the situation of steady daily searches punctuated by leaps when publicity occurs
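To see why the two measures drift apart, here is a small sketch with made-up numbers (not the real Wikipedia counts): a month of steady searches with a single day of heavy publicity.

```r
# Hypothetical daily counts: 29 quiet days and one publicity spike
dailyCounts <- c(rep(100, 29), 10000)
mean(dailyCounts)    # 430, pulled upwards by the single spike
median(dailyCounts)  # 100, stays at the typical quiet-day level
```

One big day is enough to more than quadruple the mean while leaving the median untouched.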

Interestingly, the scandal has not produced the maximum search count for any of the four.

  • Dave Lee Travis peaked when Burmese pro-democracy leader Aung San Suu Kyi said his World Service programme had given her a lifeline
  • Over the timespan of the scandal, Savile’s travails in terms of searches are significant but his death sparked the individually highest rate
  • A TV show, detailing a feud between Brambell and his co-star of “Steptoe and Son”, Harry H Corbett, led to the former’s highest search on Wikipedia

Glitter’s graph shows several peaks before this month representing, chronologically: his release from Thai jail and attempt to avoid returning to the UK; the mockumentary, “The Execution of Gary Glitter”, shown on Channel 4; and incorrect rumours that he was planning a new tour

So how did Jimmy beat Katy? With a max search almost double her highest of 101,922

# Packages required
library(RJSONIO) # acquiring and parsing data
library(ggplot2) # graphs
library(plyr) # creation of summary data
# create dataframes for all and summary data
allData <- data.frame(count=numeric(),date=character(),name=character())
summaryData <- data.frame(name=character(),mean=numeric(),median=numeric(),max=numeric(),maxdate=character()) #maxdate=date() causes error
# create variables for url
month <- c("01","02","03","04","05","06","07","08","09","10","11","12")
year <- c(2008:2012)
# function with default dates for comparison graph
wikiFun <- function(person, startDate="2012-09-01",endDate="2012-11-01") {
  for(k in 1:length(person)) {
    # create dataframe for individual records
    df <- data.frame(count=numeric())
    for (i in 1:length(year)) {
      for (j in 1:length(month)) {
        # the address of the JSON page is blank in this listing; supply it as the first argument
        url <- paste0("",year[i],month[j],"/",person[k])
        rawJson <- readLines(url, warn=FALSE)
        rd  <- fromJSON(paste(rawJson, collapse=""))
        rd.views <- rd$daily_views
        # one row per day, with the date kept as the row name
        df <- rbind(df, data.frame(count=unlist(rd.views)))
      }
    }
    # create a df with all peoples search counts by day
    df$date <- as.Date(rownames(df))
    df$name <- person[k]
    colnames(df) <- c("count","date","name")
    df <- arrange(df,date)
    allData <- rbind(allData,df)
    # set title, display and save individual's graph
    theTitle <- paste0("Daily Wikipedia searches for ",person[k])
    q <- ggplot(subset(df,count>0),aes(x=date,y=count))+geom_point()+xlab("")+ylab("")+ggtitle(theTitle)
    print(q) # individual plot prints to screen
    fname <- paste0("ws_",gsub(" ","",person[k]),".png")
    ggsave(fname, q)
  }
  # display group graph using log scale for counts (add a ggsave() call with a file name of your choice to save it)
  p <- ggplot(subset(allData,count>0&date>=as.Date(startDate, "%Y-%m-%d")&date<=as.Date(endDate, "%Y-%m-%d")),aes(x=date,y=count, colour=name))+geom_line()+xlab("")+ylab("")+ggtitle("Comparison of Daily Wikipedia searches")  + coord_trans(y="log2") #+scale_y_continuous(formatter=comma) caused error
  print(p)
  # calculate summaries and display
  summaryData <- ddply(subset(allData,count>0),.(name), summarize, mean=mean(count), median=median(count), max=max(count), max_date=date[which.max(count)] )
  print(summaryData)
}
names <- c("Gary Glitter","Jimmy Savile","Dave Lee Travis","Wilfrid Brambell")
wikiFun(names)

Player timelines with ggplot

Timelines can be quite a handy way of getting an overview of a player’s career in terms of when they played, with which team and who their contemporaries were.
As is often the case, I turned to Stack Overflow to set me on my way for an R solution. In this instance, I did not take the accepted answer but rather the ggplot variation.
I used the RODBC package to extract records of all EPL appearances from my database into a dataframe, ‘allGames’

1     Steve    Jones  JONESS1          F    WHU        2054 West Ham U 1993-11-01     0  0

The data is pretty self-explanatory. Position shows that Steve Jones is a forward and that for the game in question he neither started nor was used as a substitute. As I am basically trying to show when players were in the team squad, I will still include these data in the analysis. To obtain a player’s career length at a particular club, I need to find the earliest and latest dates: probably overkill, but I am used to using the plyr package

library(plyr) # provides ddply
allGames.summary <- ddply(allGames,.(PLAYERID,TEAMID),function(x) c(start=min(x$DATE),end=max(x$DATE)))
# Here is Steve Jones's line at West Ham
#      PLAYERID TEAMID      start        end
# 2574  JONESS1    WHU 1993-08-14 1997-02-01
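For readers without plyr, roughly the same first-and-last-date summary can be sketched in base R. This is only an illustration on toy data: the Steve Jones dates are the ones shown in this post, while PLAYERX and its date are hypothetical, and the real allGames dataframe holds POSIXct dates rather than the Date class used here.

```r
# Toy appearance records (PLAYERX is a made-up player)
toyGames <- data.frame(
  PLAYERID = c("JONESS1", "JONESS1", "JONESS1", "PLAYERX"),
  TEAMID   = c("WHU", "WHU", "WHU", "WHU"),
  DATE     = as.Date(c("1993-11-01", "1993-08-14", "1997-02-01", "1994-01-15")),
  stringsAsFactors = FALSE
)
# Split by player and club, then take the first and last date in each group
groups <- split(toyGames, list(toyGames$PLAYERID, toyGames$TEAMID), drop = TRUE)
toySummary <- do.call(rbind, lapply(groups, function(x) {
  data.frame(PLAYERID = x$PLAYERID[1], TEAMID = x$TEAMID[1],
             start = min(x$DATE), end = max(x$DATE))
}))
toySummary
```

Building a small data frame per group keeps the date class intact, which avoids having to coerce the columns back afterwards.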

OK. Now we can get to some graphing. Let’s go way back to the beginning of the Premier League and look at the squad of the champions that season, Manchester United, id ‘MNU’

q <- ggplot(subset(allGames.summary,TEAMID=="MNU"&start==as.POSIXct(min(allGames.summary$start)))) +
  geom_segment(aes(x=start, xend=end, y=PLAYERID, yend=PLAYERID), size=3)

Note the use of the min function again to get the first date and the geom_segment function of ggplot – perfect for producing the required lines. Two gotchas to watch out for. The dates are of POSIXct datatype and unless they are coerced to that an error arises. Also, if the ‘+’ is placed on the second line the layer does not get added and no plot appears
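The second gotcha comes down to how R reads lines: once an expression is complete, it ends at the line break, so a ‘+’ at the start of the next line begins a new expression instead of continuing the old one. A small base R sketch, using plain numbers in place of ggplot layers, shows the difference.

```r
# A trailing '+' tells the parser the expression continues on the next line
trailing <- parse(text = "1 +\n  2")
# A leading '+' starts a new expression, because "1" was already complete
leading <- parse(text = "1\n  + 2")
length(trailing)  # 1: a single expression, 1 + 2
length(leading)   # 2: "1" on its own, then a separate "+ 2"
```

The same applies to a plot: with the ‘+’ at the end of the first line, the geom_segment layer is part of the ggplot expression; with it at the start of the second, the layer is never attached.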

So what have we got?

As can be seen, the data looks reasonable. All the lines start at one point and show different end points. To those in the know, Giggs’s line correctly extends to the current day; he is the only player appearing 20 years ago still to pull on a shirt.
However, it is not that aesthetically pleasing. Improvements that could be made include

  • Change axes labels and add a title
  • Make player’s name more apparent
  • Show other EPL teams appeared for, if any
  • Give some indication of relative appearances
  • Utilize the full width of the graph
  • and finally, wrap it in a function

Some of these amendments need more analysis, others are just adding to the ggplot code

library(stringr) # str_sub comes from stringr, not plyr
# we need players name from the original dataframe.
allGames$player <- paste(allGames$LASTNAME,str_sub(allGames$FIRSTNAME, end=1),sep=" ")
# the allGames.summary needs to be reworked
allGames.summary <- ddply(allGames,.(PLAYERID,PLAYER_TEAM,TEAMID,player),function(x) c(start=min(x$DATE),end=max(x$DATE),apps=length(x$player)))
# create a function which takes the team id and game date as parameters
tlPlot <- function(theTeam,theDate) {
  # to cover all clubs a player appeared for we need to obtain a list of their ids
  squad <- subset(allGames.summary,TEAMID==theTeam&start==as.POSIXct(theDate))$PLAYERID
  # order the data by the number of appearances whilst with the team (and reversed for graph)
  playerOrder <- arrange(subset(allGames.summary,TEAMID==theTeam&PLAYERID %in% squad),desc(apps))$player
  playerOrder <- rev(playerOrder)
  # create the title (full team name and date would be shown with more space)
  theTitle <- paste("Careers for players appearing for",theTeam,"on",theDate,sep=" ")
  # Now create the graph object
  # subset to selected players but for all their teams, indicated by colour
  q <- ggplot(subset(allGames.summary,PLAYERID %in% squad), aes(colour=TEAMID)) +
    # show player surname and initial
    geom_segment(aes(x=start, xend=end, y=player, yend=player), size=3) +
    # order players in terms of apps for team
    scale_y_discrete(limits=playerOrder) +
    # get rid of axis labels and add the title
    xlab("") + ylab("") + ggtitle(theTitle) +
    # extend lines to full width
    scale_x_datetime(expand = c(0, 0))
  # display the finished plot
  print(q)
}
# make selection. In a production version test for valid teams and
# dates would be performed
# (call tlPlot with a team id such as "MNU" and a game date)


Not perfect – but certainly more informative and now replicable. The analysis can easily be extended. For instance, one could select the players with top ten appearances for a club or show all those who were on squads whilst a particular player was there. The position factor could be identified by colour whilst using an alpha scale for apps.
But that’s all for now
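As a postscript, here is a minimal sketch of the first extension mentioned, picking out the ten players with the most appearances for a club. The data is entirely hypothetical (twelve placeholder players with invented apps values); with the real allGames.summary the same ordering step would feed scale_y_discrete in the function.

```r
# Hypothetical per-player summary for one club (apps values are invented)
toySquad <- data.frame(
  player = paste("Player", LETTERS[1:12]),
  TEAMID = "MNU",
  apps   = c(5, 40, 12, 33, 8, 27, 19, 3, 50, 22, 14, 9),
  stringsAsFactors = FALSE
)
# Keep the ten players with the most appearances, highest first
topTen <- head(toySquad[order(-toySquad$apps), ], 10)
# Reverse so that the most-used player ends up at the top of the y axis
playerOrder <- rev(topTen$player)
```

The reversal mirrors what the function does: discrete axis limits are drawn from the bottom up, so the last name in playerOrder appears at the top.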

Goalies Galore

One of the oddities of the first seven rounds of the season is that three of the top four teams from last year have already used two starting goalkeepers, with only Joe Hart at Man City the exception. In all, eight teams have double-dipped.

Hart was one of only eight players (six goalies) who played every minute last season, and it could well be argued that superiority at that position was the difference-maker as City triumphed over their rivals, Man Utd, for the title on goal difference

Here are the appearances by goalkeepers in all the championship winners of the EPL

  • Hart is one of only three players to have played every game in net for the champions
  • Arsenal managed to win the title in 2001/2 in spite of first-choice, David Seaman, starting just 17 games due to his injury. Neither his presumptive heir, Richard Wright, who conceded eight goals in the three home games he appeared in, nor Stuart Taylor, basically a bench player ever since, was that impressive, but both played sufficient games to earn a championship medal
  • Manchester United used seven different goalkeepers in their successive title wins of 1999-2001. Two years later, a further two names featured.
  • Here is the Wikipedia extract on Nick Culkin
    He holds the record for the shortest debut in Premier League history, replacing Raimond van der Gouw in stoppage time against Arsenal at Highbury on 22 August 1999, the referee blew up immediately after Culkin took the resulting free kick.

Suarez Surprises

In the 3-2 loss to Udinese, one highlight for Liverpool was Suarez stroking in a 20+ yarder from a direct free kick. Nothing new there as EPL index pointed out the fact that Luis Suarez had scored four of his five league goals from outside the area.
In spite of being Liverpool’s leading scorer last year and an extremely creative player, he has been better known up until now as a misser of chances and a penalty-seeking diver. Indeed of the 15 league goals he had scored prior to this season, only one was from outside the penalty box.

How does this stack up to all-time performances?

  • Only Matt Le Tissier has scored as many as eight in a full EPL season (1994/5)
  • Of players who have scored more than 10 goals in a season Lampard’s 64% rate, 2006/7, is the highest proportion in an individual season
  • Alan Shearer has the most EPL career goals from outside the area, 35, but they only represent 13% of his total – slightly less than the 15% long range goals account for overall
  • Of players with more than 20 career goals, Beckham’s 53% is the highest proportion. 18 of his 33 successful long range efforts were from direct free kicks

Finally, back to Suarez and his hat-trick v Norwich. In more than 7,900 Premiership encounters, he was the first to score three goals in a game from outside the area – although in total they barely mustered 50 yards