Analysis of Covariance – Extending Simple Linear Regression

April 28th, 2010

The simple linear regression model considers the relationship between two variables, and in many cases more information is available that can be used to extend the model. For example, there might be a categorical variable (a factor) that divides the data set into subsets, so that a separate linear regression can be fitted to each subset. We will consider how to handle this extension using one of the data sets available within the R software package. Read the rest of this entry »
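
As a rough sketch of the idea, the code below fits a common line, parallel lines, and separate lines for each group. The teaser does not say which data set the full post uses, so the built-in `mtcars` data and the choice of `mpg`, `wt` and `am` here are assumptions made purely for illustration.

```r
# Assumption: mtcars stands in for the unnamed data set in the post.
# am is coded 0 = automatic, 1 = manual; convert it to a labelled factor.
dat <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))

# One regression line for all cars
fit_common <- lm(mpg ~ wt, data = dat)

# Parallel lines: separate intercepts, shared slope
fit_parallel <- lm(mpg ~ wt + am, data = dat)

# Separate lines: intercept and slope both vary by group
fit_separate <- lm(mpg ~ wt * am, data = dat)

# Sequential F tests comparing the three nested models
print(anova(fit_common, fit_parallel, fit_separate))
```

Comparing the nested models in this order is one common way to decide whether separate slopes are worth the extra parameters.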

Summarising data using box and whisker plots

April 25th, 2010

A box and whisker plot is a type of graphical display that summarises a set of data using its five-number summary. The summary statistics used to create a box and whisker plot are the median of the data, the lower and upper quartiles (25% and 75%) and the minimum and maximum values. Read the rest of this entry »
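
A minimal sketch of the five-number summary in base R follows. The data vector is made up for illustration and does not come from the post.

```r
# Illustrative data (an assumption, not taken from the post)
x <- c(2, 4, 4, 5, 7, 9, 10, 12, 15)

# Tukey's five-number summary: minimum, lower hinge, median,
# upper hinge, maximum
five <- fivenum(x)
names(five) <- c("min", "lower", "median", "upper", "max")
print(five)

# Draw the corresponding box and whisker plot
boxplot(x, horizontal = TRUE, main = "Box and whisker plot")
```

Note that `boxplot()` uses hinges, which can differ slightly from the quartiles returned by `quantile()`, and by default its whiskers stop at the most extreme points within 1.5 times the box length, with anything beyond drawn as individual points. Setting `range = 0` extends the whiskers to the minimum and maximum.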

Simple Linear Regression

April 23rd, 2010

One of the most frequently used techniques in statistics is linear regression, where we investigate the potential relationship between a variable of interest (often called the response variable, but there are many other names in use) and a set of one or more variables (known as the independent variables or some other term). Unsurprisingly there are flexible facilities in R for fitting a range of linear models, from the simple case of a single variable to more complex relationships. Read the rest of this entry »
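
For the single-variable case, a fit might look like the sketch below. The built-in `cars` data set is an assumption chosen for illustration; the teaser does not name the data used in the full post.

```r
# Assumption: cars (stopping distance vs speed) is used only as an example
fit <- lm(dist ~ speed, data = cars)

# Estimated intercept and slope
print(coef(fit))

# Standard errors, t tests and R-squared
print(summary(fit))

# Scatter plot of the data with the fitted line overlaid
plot(dist ~ speed, data = cars)
abline(fit)
```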

Book Review – ggplot2: Elegant Graphics for Data Analysis by Hadley Wickham (Springer 2009)

April 20th, 2010

This book is written by the author of the ggplot2 package for R. The package has a design inspired by the grammar of graphics and can remove some of the effort required to put together impressive graphs. The book is just under 200 pages and covers a decent range of material to introduce new and experienced R users to the ggplot2 package. Read the rest of this entry »

R and Tolerance Intervals

April 19th, 2010

Confidence intervals and prediction intervals are used by statisticians on a regular basis. Another useful interval is the tolerance interval, which gives a range of values expected to contain at least a specified proportion of the population with a stated level of confidence. The R package tolerance can be used to create a variety of tolerance intervals of interest. Read the rest of this entry »

Summarising data using scatter plots

April 18th, 2010

A scatter plot is a graph used to investigate the relationship between two variables in a data set. The x and y axes are used for the values of the two variables and a symbol on the graph represents the combination for each pair of values in the data set. This type of graph is used in many common situations and can convey a lot of useful information. Read the rest of this entry »

Working with themes in Lattice Graphics

April 12th, 2010

The Trellis graphics approach provides facilities for creating effective graphs with a consistent look and feel. One of the good things about the system is the use of themes to define the colour, size and other features of the components that make up a graph. The lattice package in R is an implementation of the approach, and in this post we will consider how to change the default settings. Read the rest of this entry »

Summarising data using histograms

April 11th, 2010

The histogram is a standard type of graphic used to summarise univariate data. The range of values in the data set is divided into regions, and a bar (usually vertical) is plotted in each of these regions with height proportional to the frequency of observations in that region. In some cases the proportion of data points in each region is shown instead of counts. Read the rest of this entry »
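
The counts and proportions behind a histogram can be inspected directly, as in the sketch below. The built-in `faithful` data set and the suggested number of breaks are assumptions chosen for illustration.

```r
# Assumption: faithful$waiting is used only as example data
waiting <- faithful$waiting

# Compute the bins without drawing, to inspect the numbers behind the bars
h <- hist(waiting, breaks = 10, plot = FALSE)

# Frequency of observations in each region
print(h$counts)

# Proportion of data points in each region
proportions <- h$counts / sum(h$counts)
print(round(proportions, 3))

# Draw the histogram of counts
plot(h, main = "Waiting time between eruptions", xlab = "Minutes")
```

Calling `hist()` with `freq = FALSE` plots densities, scaled so that the bar areas sum to one. These match the proportions only when each bin has width one.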

Summarising data using dot plots

March 26th, 2010

A dot plot is a type of display that compares counts, frequencies, totals or other summary measures for a series of categories. The dot plot can be arranged with the categories on either the vertical or the horizontal axis of the display. This allows comparison between the different categories, as well as comparison within categories where multiple symbols are used to denote, say, different years. Read the rest of this entry »

Measuring the length of time to run a function

March 16th, 2010

When writing R code it is useful to be able to assess the amount of time that a particular function takes to run. We might be interested in measuring the increase in time required by our function as the size of the data increases. Read the rest of this entry »
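
One simple way to do this in base R is `system.time()`, sketched below. The helper `time_sort()` and the choice of sorting random numbers are hypothetical, made up for illustration rather than taken from the post.

```r
# Hypothetical helper: time how long it takes to sort n random numbers
time_sort <- function(n) {
  x <- runif(n)
  # system.time() returns user, system and elapsed times in seconds
  system.time(sort(x))[["elapsed"]]
}

# Measure how the run time grows with the size of the data
sizes <- c(1e4, 1e5, 1e6)
timings <- sapply(sizes, time_sort)

print(data.frame(n = sizes, elapsed_seconds = timings))
```

Timings vary from run to run and between machines, so repeating each measurement several times and averaging gives a steadier picture.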