
Snider SSSS

On July 24th, 2012, Travis Snider simultaneously lifted both a ball from Travis Blackley out of the park and his batting average for the year to more than .400.
Did this mean that the much-heralded prospect had finally arrived at 24 after several years of teasing?
Well… not exactly, as he was only batting in his fourth game following a call-up from AAA.
However, this did not stop hopeful fans from predicting that hitting safely in each of those games, with home runs in the latter two, heralded a new dawn.

This suggested that it might be fun to look at whether Snider had ever previously performed at this level over the Stupidly Small Sample Size of four consecutive games. For this crude exercise I have extracted data from the Baseball-Reference site for all games in which Snider has had at least two plate appearances and calculated the average OPS (a measure of hitting ability) over each four-game run. Here is a graph showing his generally disappointing lack of progress.

The latest value of 1.15 does rank pretty highly (24th out of the 218 non-independent readings) but is well shy of the 1.64 he achieved at the tail end of the 2010 campaign when he posted 9 hits, including 3 home runs, in 18 at-bats.

Obviously, this is a trivial example, but the effort is easily generalized to any player for any number of games.
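For anyone who fancies trying this themselves, here is a minimal sketch of the rolling calculation. It is not the exact code behind the graph above: the game log data frame and its OPS column are hypothetical, and the four-game figure is taken as a simple mean of the per-game OPS values.

# a rough sketch only; 'gamelog' is a hypothetical data frame with one row
# per game (at least two plate appearances) and a numeric 'OPS' column
library(zoo)

rollingOPS <- function(gamelog, n = 4) {
  # right-aligned rolling mean: each value covers the current game
  # and the (n - 1) games preceding it
  rollmean(gamelog$OPS, k = n, fill = NA, align = "right")
}

gamelog$roll4 <- rollingOPS(gamelog, 4)
plot(gamelog$roll4, type = "l",
     xlab = "Game number", ylab = "Four-game average OPS")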

Has the bubble burst? Following the aforementioned home run, he went nine unproductive at-bats before a surprisingly successful bunt and a double. Perhaps more concerning is that, at the time of writing, he has recorded two strikeouts in each of five consecutive games for the first time in the majors.

Processing Public Data with R

I use R aplenty in analysis and thought it might be worthwhile for some to see the typical process a relative newcomer goes through in extracting and analyzing public datasets.

In this instance I happen to be looking at Canadian air pollution statistics.

The data I am interested in is available on the Ontario Ministry of the Environment’s website. I have downloaded the hourly ozone readings from two weather stations (Grand Bend and Toronto West) for two years (2000 and 2011), which are available in several formats, including my preference, CSV. According to the 2010 annual report from the Ministry, the two selected represent the extremes in readings for that year.

I firstly set the directory in which the code and the associated data files will reside and import the data. I would normally load any R packages I will utilize at the head of the script (if not already in my start-up file) but will hold off here until they are put to use. I had to do a small amount of row deletion in the CSV files so that only the readings data was included.
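As an aside, an alternative to editing the files by hand, assuming the station metadata occupies a fixed number of lines at the top of each file, is the skip argument of read.csv; the number of lines skipped below is purely illustrative.

# skip a fixed number of metadata lines rather than deleting them manually
# (the number of skipped lines is illustrative, not taken from the actual files)
gb2000 <- read.csv("grandBendO32000.csv", skip = 10, header = TRUE)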

# set working directory
# create and combine dataframes for each dataset
# the read.csv function has ideal default arguments of header = TRUE, sep = ","
gb2000 <- read.csv("grandBendO32000.csv")
gb2011 <- read.csv("grandBendO32011.csv")
tw2000 <- read.csv("torontoWestO32000.csv")
tw2011 <- read.csv("torontoWestO32011.csv")
# combine the above
pollutant <- rbind(gb2000, gb2011)
pollutant <- rbind(pollutant, tw2000)
pollutant <- rbind(pollutant, tw2011)

It is almost de rigueur for me to look at the structure of the resultant dataframe.

# look at the structure of the combined dataframe
str(pollutant)
# 'data.frame':   1490 obs. of  27 variables:
#  $ Station.ID: chr  "15020" "15020" "15020" "15020" ...
#  $ Pollutant : chr  "Ozone" "Ozone" "Ozone" "Ozone" ...
#  $ Date      : chr  "01/01/2000" "02/01/2000" "03/01/2000" "04/01/2000" ...
#  $ H01       : int  9999 6 25 16 38 25 0 15 8 12 ...
#  $ H02       : int  6 6 24 14 37 17 0 13 7 9 ...
# etc.

There are several aspects of the dataframe that need to be addressed. Firstly, 24 of the 27 variables are hourly readings. To do a proper analysis I need to reshape the data so that I have just one ‘hour’ column which contains all the ozone readings. For this, I use the reshape2 package. As with all of the packages covered here, it is readily downloadable from the CRAN repository from within the R environment.

# load the pre-installed package
library(reshape2)
# reshape the data, retaining the independent variables of station ID, pollutant and date
pollutant.melt <- melt(pollutant, id = c("Station.ID", "Pollutant", "Date"))
# check out the structure of the new dataframe
str(pollutant.melt)
# 'data.frame':   35088 obs. of  5 variables:
#  $ Station.ID: int  15020 15020 15020 15020 15020 15020 15020 15020 15020 15020 ...
#  $ Pollutant : chr  "Ozone" "Ozone" "Ozone" "Ozone" ...
#  $ Date      : chr  "01/01/2000" "02/01/2000" "03/01/2000" "04/01/2000" ...
#  $ variable  : Factor w/ 24 levels "H01","H02","H03",..: 1 1 1 1 1 1 1 1 1 1 ...
#  $ value     : int  9999 6 25 16 38 25 0 15 8 12 ...

The new ‘variable’ column now shows the hour at which each reading was taken.

I would like to rename the station ID to the geographical location. There is probably just one ID for each station, but to be on the safe side that can be checked via the unique function. As it happens, this shows three IDs in the dataframe. On checking the raw data, it can be seen that the Toronto West station moved location between 2000 and 2011. In this case, the impact on the data readings is probably minimal, but this should certainly be borne in mind in any rigorous analysis.

# apply the unique function to the Station.ID column, referable via the $ sign
unique(pollutant.melt$Station.ID)
# [1] 15020 35003 35125
# create a new column with the station name
# the [] notation subsets the rows with a specific ID
# and the <- applies the name to a new 'station' column
pollutant.melt$station[pollutant.melt$Station.ID == "35003"] <- "West Toronto"
pollutant.melt$station[pollutant.melt$Station.ID == "35125"] <- "West Toronto"
pollutant.melt$station[pollutant.melt$Station.ID == "15020"] <- "Grand Bend"

I now need to address the date field. It needs to be changed from a character to a date vector and I also want to extract various components of it for analysis. The lubridate package comes in handy here.

As you can see below, some errors were noted. The cause (at least for the first one) is that in one of the CSV files, the date which should have been “03/10/2000” had actually come across as “2000-10-02-999”. In a proper analysis, this would simply be corrected or investigated further with the original data provider. However, for the purpose of this exercise, the fact that the error is found and that it relates to just 72 out of more than 35,000 observations means that the only action taken is that the offending rows are excluded from the dataframe.

# load the pre-installed package
library(lubridate)
pollutant.melt$Date <- dmy(pollutant.melt$Date)
# Using date format %d/%m/%Y. 72 failed to parse
# check the first instance by testing for NA output via the is.na function
# again use [] notation for subsetting; the ',' outputs all columns
head(pollutant.melt[is.na(pollutant.melt$Date), ], 1)
#     Station.ID Pollutant Date variable value
# 277      15020     Ozone <NA>      H01  -999
# ascertain where row 277 occurs
pollutant.melt[276:278, ]
# 276      15020     Ozone 2000-10-02      H01    38
# 277      15020     Ozone       <NA>      H01  -999
# 278      15020     Ozone 2000-10-04      H01    10
# look at raw data for Grand Bend 3rd Oct 2000
# exclude the unparsed rows from the dataframe
pollutant.melt <- pollutant.melt[!is.na(pollutant.melt$Date), ]
# apply lubridate functions to get constituent date/time fields
# as separate columns
pollutant.melt$month <- month(pollutant.melt$Date)
pollutant.melt$year <- year(pollutant.melt$Date)
pollutant.melt$wday <- wday(pollutant.melt$Date)

There are just a couple more aspects to tidy up. The variable column is a series of hours of the day, so removing the leading ‘H’ and creating a new ‘hour’ column makes sense. You may also have noted that the very first reading is 9999. A must-do initial perusal of the raw data warns that any reading of 9999 is invalid data and -999 is missing data. Again, if this data were critical then some follow-up would be needed, but for this situation they can either be excluded from the data as we did earlier or set to NA and specifically excluded from any analysis.

# create a new 'hour' column by stripping the leading 'H' from the variable column
pollutant.melt$hour <- as.numeric(substr(as.character(pollutant.melt$variable), 2, 3))
# set invalid (9999) and missing (-999) readings to NA
pollutant.melt$value[pollutant.melt$value == -999] <- NA
pollutant.melt$value[pollutant.melt$value == 9999] <- NA
# if desired, any unwanted columns can be dumped
pollutant.melt$Station.ID <- NULL
pollutant.melt$variable <- NULL

I now have an acceptable set of data to examine. This is not the subject of this blog, but a quick look at how the average monthly ozone levels have changed by station and by year highlights how a couple of other packages, plyr and ggplot2, are extremely useful contributors to the R arsenal. As with the other packages mentioned, there are PDFs providing details of all their functions, and hosts of tutorials etc. are available online.

# load the pre-installed packages
library(plyr)
library(ggplot2)
# create a new dataframe which gives average ozone levels
# by month, year and station; na.rm excludes the NA readings we introduced
ozone_m_y_s <- ddply(pollutant.melt, c("month", "year", "station"), summarise,
                     av = mean(value, na.rm = TRUE))
head(ozone_m_y_s, 3)
#   month year      station       av
# 1     1 2000   Grand Bend 25.10633
# 2     1 2000 West Toronto 13.73108
# 3     1 2011   Grand Bend 32.62988
# build a graph of the data
p <- ggplot(ozone_m_y_s, aes(x = month, y = av, colour = station))
p <- p + geom_line()
p <- p + facet_wrap(~ year)
p <- p + labs(x = "Month", y = "Ozone level (parts per billion)")
p <- p + opts(title = "Avg Monthly Ozone levels (ppb) at two Ontario locations")
# write the graph to file (the file name is illustrative)
ggsave("ozoneMonthly.png", p)

The graph highlights the fact that ozone levels are highest in the summer months and have hardly changed at either location over the two years under consideration.
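To put a rough number on the “hardly changed” observation, the same ddply approach can be used to compare annual averages directly; this is just a sketch and its output is not reproduced here.

# average ozone level by station and year, as a quick check on the graph
ddply(pollutant.melt, c("station", "year"), summarise,
      av = mean(value, na.rm = TRUE))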

Check out the blog for further analyses of air quality data.