Using Contrasts in R

Contrasts are powerful tools in R; combined with factors, they provide a framework for handling categorical variables.  If is.factor(x) evaluates to TRUE, the variable x is already a factor, that is, a categorical variable.  This affects the behavior of functions such as summary() as well as how the variable is treated when used as a predictor in model fitting (e.g., in regression or ANOVA models).  If it evaluates to FALSE, a simple call to factor() converts it to a factor.
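
As a minimal sketch of that check and conversion (the vector sex below is a made-up example, not data from this article):

```r
# Hypothetical example vector; any character or numeric vector behaves the same way
sex <- c("female", "male", "male", "female")
is.factor(sex)        # FALSE: still a plain character vector

sex.f <- factor(sex)  # convert; levels are the sorted unique values by default
is.factor(sex.f)      # TRUE
levels(sex.f)         # "female" "male"

summary(sex)          # length/class/mode summary for a character vector
summary(sex.f)        # counts per level for a factor
```

The first level ("female" here) is the one that treatment contrasts use as the reference group.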

First, consider a factor with two levels, such as sex.  A dichotomous variable can be entered into a model either as a factor or by dummy coding it, for example as 0 for female and 1 for male.  This is fairly straightforward and not overly cumbersome in this instance.  However, for a three-level categorical variable, the process of dummy coding becomes more tiresome.  It would be straightforward to set up a for() loop, but it would also be computationally inefficient.  Fortunately, R provides a simple way to perform a variety of coding schemes.  One way of doing this is via contrasts().  This function allows you to see the contrasts for a factor and to set them.  Consider the following data frame with an outcome variable, y, and a factor, x, with six levels.

samp.dat <- data.frame(y=1:12, x=factor(rep(1:6, each=2)))
str(samp.dat)
'data.frame': 12 obs. of 2 variables:
 $ y: int 1 2 3 4 5 6 7 8 9 10 ...
 $ x: Factor w/ 6 levels "1","2","3","4",..: 1 1 2 2 3 3 4 4 5 5 ...
contrasts(samp.dat$x) #the default contrasts
  2 3 4 5 6
1 0 0 0 0 0
2 1 0 0 0 0
3 0 1 0 0 0
4 0 0 1 0 0
5 0 0 0 1 0
6 0 0 0 0 1

The default contrasts are treatment contrasts (n-1 dummy variables with the first level as the reference group) for regular factors and orthogonal polynomial contrasts for ordered factors.  Notice how the reference group (level 1) does not get its own regression coefficient; its mean is the intercept.

summary(lm(y~x, data=samp.dat))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.5000     0.5000   3.000 0.024008 *
x2            2.0000     0.7071   2.828 0.030020 *
x3            4.0000     0.7071   5.657 0.001311 **
x4            6.0000     0.7071   8.485 0.000147 ***
x5            8.0000     0.7071  11.314 2.85e-05 ***
x6           10.0000     0.7071  14.142 7.81e-06 ***
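
The meaning of these coefficients can be checked against the group means.  The sketch below rebuilds the data frame so it runs on its own and assumes the session still uses R's default contrast options; grp.means and fit are names introduced here for illustration.

```r
samp.dat <- data.frame(y = 1:12, x = factor(rep(1:6, each = 2)))

# Session-wide defaults: one entry for unordered and one for ordered factors
getOption("contrasts")

# Under the default options, an ordered factor gets polynomial contrasts instead
contrasts(ordered(c("lo", "mid", "hi"), levels = c("lo", "mid", "hi")))

# Mean of y within each level of x, as a plain numeric vector
grp.means <- as.vector(tapply(samp.dat$y, samp.dat$x, mean))
fit <- lm(y ~ x, data = samp.dat)

# The intercept should equal the mean of the reference level (level 1)
all.equal(unname(coef(fit)[1]), grp.means[1])

# The other coefficients should equal each level's mean minus the reference mean
all.equal(unname(coef(fit)[-1]), grp.means[-1] - grp.means[1])
```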

You can also specify contrasts beyond the defaults (after all, this is R).  The options include treatment, polynomial, sum-to-zero, and Helmert (compare each level with the mean of the prior levels).  Here are examples.

contr.treatment(n=6)
  2 3 4 5 6
1 0 0 0 0 0
2 1 0 0 0 0
3 0 1 0 0 0
4 0 0 1 0 0
5 0 0 0 1 0
6 0 0 0 0 1
contr.treatment(n=6, contrasts=FALSE) #this is the identity matrix
  1 2 3 4 5 6
1 1 0 0 0 0 0
2 0 1 0 0 0 0
3 0 0 1 0 0 0
4 0 0 0 1 0 0
5 0 0 0 0 1 0
6 0 0 0 0 0 1
contr.treatment(n=levels(samp.dat$x), base=length(levels(samp.dat$x)))
#reference group is the last level
  1 2 3 4 5
1 1 0 0 0 0
2 0 1 0 0 0
3 0 0 1 0 0
4 0 0 0 1 0
5 0 0 0 0 1
6 0 0 0 0 0
contr.sum(6)
  [,1] [,2] [,3] [,4] [,5]
1    1    0    0    0    0
2    0    1    0    0    0
3    0    0    1    0    0
4    0    0    0    1    0
5    0    0    0    0    1
6   -1   -1   -1   -1   -1
contr.poly(6)
             .L         .Q         .C         ^4          ^5
[1,] -0.5976143  0.5455447 -0.3726780  0.1889822 -0.06299408
[2,] -0.3585686 -0.1091089  0.5217492 -0.5669467  0.31497039
[3,] -0.1195229 -0.4364358  0.2981424  0.3779645 -0.62994079
[4,]  0.1195229 -0.4364358 -0.2981424  0.3779645  0.62994079
[5,]  0.3585686 -0.1091089 -0.5217492 -0.5669467 -0.31497039
[6,]  0.5976143  0.5455447  0.3726780  0.1889822  0.06299408
contr.helmert(6)
  [,1] [,2] [,3] [,4] [,5]
1   -1   -1   -1   -1   -1
2    1   -1   -1   -1   -1
3    0    2   -1   -1   -1
4    0    0    3   -1   -1
5    0    0    0    4   -1
6    0    0    0    0    5
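
A few structural properties distinguish these coding schemes.  The following is a small sketch that checks them numerically with base R only; H and P are just local names for the matrices.

```r
# Each column of the sum-to-zero contrasts sums to zero
colSums(contr.sum(6))

# Helmert columns also sum to zero and are mutually orthogonal,
# so their cross-product matrix is diagonal
H <- contr.helmert(6)
colSums(H)
crossprod(H)

# Polynomial contrasts are orthonormal: the cross-product is the identity
P <- contr.poly(6)
round(crossprod(P), 10)

# Treatment contrast columns do not sum to zero (each sums to 1)
colSums(contr.treatment(6))
```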

contrasts(samp.dat$x) <- contr.helmert(6) #switch the factor to helmert contrasts
summary(lm(y~x, data=samp.dat)) #using helmert contrasts
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.50000    0.20412  31.843 6.37e-08 ***
x1           1.00000    0.35355   2.828 0.030020 *
x2           1.00000    0.20412   4.899 0.002714 **
x3           1.00000    0.14434   6.928 0.000448 ***
x4           1.00000    0.11180   8.944 0.000109 ***
x5           1.00000    0.09129  10.954 3.44e-05 ***
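
The Helmert estimates can be reproduced by hand from the group means.  The sketch below requests Helmert coding through the contrasts argument of lm(), which applies to that one fit and leaves the factor itself unchanged; fit.h, m, and by.hand are illustrative names.

```r
samp.dat <- data.frame(y = 1:12, x = factor(rep(1:6, each = 2)))

# Helmert coding for this fit only; the factor's own contrasts are not modified
fit.h <- lm(y ~ x, data = samp.dat, contrasts = list(x = "contr.helmert"))

# Group means for levels 1..6 as a plain numeric vector
m <- as.vector(tapply(samp.dat$y, samp.dat$x, mean))

# Intercept: the unweighted mean of the group means
all.equal(unname(coef(fit.h)[1]), mean(m))

# Coefficient k: (mean of level k+1 minus mean of levels 1..k) divided by (k + 1)
by.hand <- sapply(1:5, function(k) (m[k + 1] - mean(m[1:k])) / (k + 1))
all.equal(unname(coef(fit.h)[-1]), by.hand)
```

Because the group means here rise by exactly 2 per level, every Helmert coefficient works out to 1.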

You can clearly see the different effects of coding: the fitted model is the same, but the coefficients answer different questions.  The general form for these functions is to specify how many levels the factor has (hence the n=6 in many calls, or the use of levels(x)), sometimes followed by additional arguments.  For example, contrasts=FALSE returns an n x n identity matrix, and for contr.treatment() the base argument specifies which level is the reference group.  What if none of the functions produced acceptable contrasts, or we had specific theories we wanted to test?  We can create our own matrix indicating the contrasts and assign it.

##Looking at another dataset
samp.dat2 <- data.frame(y=1:12, x=factor(rep(1:3, each=4)))
contrasts(samp.dat2$x)
  2 3
1 0 0
2 1 0
3 0 1

summary(lm(y~x, data=samp.dat2))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.5000     0.6455   3.873  0.00377 **
x2            4.0000     0.9129   4.382  0.00177 **
x3            8.0000     0.9129   8.764 1.06e-05 ***

summary(lm(y~I(ifelse(x==2, 1, 0))+I(ifelse(x==3, 1, 0)), data=samp.dat2))
Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)               2.5000     0.6455   3.873 0.00377 **
I(ifelse(x == 2, 1, 0))   4.0000     0.9129   4.382 0.00177 **
I(ifelse(x == 3, 1, 0))   8.0000     0.9129   8.764 1.06e-05 ***
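
The two fits agree because lm() expands the factor into the same indicator columns.  One way to see this, sketched here under the assumption of default treatment contrasts, is to inspect the design matrix with model.matrix(); X, d2, and d3 are illustrative names.

```r
samp.dat2 <- data.frame(y = 1:12, x = factor(rep(1:3, each = 4)))

# Design matrix that lm() builds from the factor under treatment contrasts
X <- model.matrix(y ~ x, data = samp.dat2)
unique(X)  # one distinct row per level: intercept plus the x2 and x3 dummies

# The hand-coded indicators reproduce the same two columns
d2 <- ifelse(samp.dat2$x == 2, 1, 0)
d3 <- ifelse(samp.dat2$x == 3, 1, 0)
all(X[, "x2"] == d2, X[, "x3"] == d3)
```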

##Manually set contrasts
contrasts(samp.dat2$x) <- matrix(c(-1,0,1,-1,-1,2),ncol=2, byrow=FALSE, dimnames=list(1:3,2:3))
contrasts(samp.dat2$x)
   2  3
1 -1 -1
2  0 -1
3  1  2

##Also notice the output from str() once we set contrasts
str(samp.dat2)
'data.frame': 12 obs. of 2 variables:
 $ y: int 1 2 3 4 5 6 7 8 9 10 ...
 $ x: Factor w/ 3 levels "1","2","3": 1 1 1 1 2 2 2 2 3 3 ...
  ..- attr(*, "contrasts")= num [1:3, 1:2] -1 0 1 -1 -1 2
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr "1" "2" "3"
  .. .. ..$ : chr "2" "3"

summary(lm(y~x, data=samp.dat2))
Coefficients:
             Estimate Std. Error  t value Pr(>|t|)
(Intercept) 6.500e+00  3.727e-01   17.441 3.03e-08 ***
x2          4.000e+00  9.129e-01    4.382  0.00177 **
x3          1.813e-16  5.270e-01 3.44e-16  1.00000


summary(lm(y ~ I(ifelse(x==1, -1, ifelse(x==2, 0, 1))) +
  I(ifelse(x==1, -1, ifelse(x==2, -1, 2))), data=samp.dat2))
Coefficients:
                                              Estimate Std. Error t value Pr(>|t|)
(Intercept)                                  6.500e+00  3.727e-01  17.441 3.03e-08 ***
I(ifelse(x == 1, -1, ifelse(x == 2, 0, 1)))  4.000e+00  9.129e-01   4.382  0.00177 **
I(ifelse(x == 1, -1, ifelse(x == 2, -1, 2))) 1.813e-16  5.270e-01 3.44e-16 1.00000

Clearly, one could use ifelse() statements or other means of coding as desired, but using contrasts() is a clearer and more intuitive method.  It is easier to look at a contrast matrix and determine the coding than to wade through a complex series of coding statements.  It is also easier to specify.
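
As a closing sketch, the custom coding can be inspected in the design matrix and then removed again.  The data frame is rebuilt so the block runs on its own, and custom is simply a local name for the matrix used earlier.

```r
samp.dat2 <- data.frame(y = 1:12, x = factor(rep(1:3, each = 4)))
custom <- matrix(c(-1, 0, 1, -1, -1, 2), ncol = 2,
                 dimnames = list(1:3, 2:3))
contrasts(samp.dat2$x) <- custom

# The design matrix now carries the custom codes, one distinct row per level
unique(model.matrix(y ~ x, data = samp.dat2))

# Same estimates as the ifelse() version; x3 is zero up to rounding error
round(coef(lm(y ~ x, data = samp.dat2)), 10)

# Assigning NULL removes the custom contrasts, restoring the session defaults
contrasts(samp.dat2$x) <- NULL
attr(samp.dat2$x, "contrasts")  # NULL again
```

Setting the contrasts on the factor keeps the coding attached to the data, so every later model that uses the factor picks it up until it is reset.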