Using Contrasts in R
Contrasts are powerful tools in R, and combined with factors they provide a
framework for handling categorical variables. If
is.factor(x) evaluates to TRUE, x is
already a factor, that is, a categorical variable. This affects the behavior of
functions such as summary() as well as how the variable is
treated when used as a predictor in model fitting (e.g., in regression
or ANOVA models). If it evaluates to FALSE, a
simple call to factor() converts it to a
factor.
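As a minimal sketch of that check and conversion, consider the following. The vector grp and its values are made up for illustration and do not come from the data used later.

```r
# Hypothetical example vector; `grp` is an illustrative name only.
grp <- c("a", "b", "c", "a", "b", "c")
is.factor(grp)        # a plain character vector is not a factor

grp <- factor(grp)    # convert it
is.factor(grp)
levels(grp)           # levels are sorted alphabetically by default
summary(grp)          # for a factor, summary() tabulates counts per level
```

The order of the levels matters later on, because the default contrasts use the first level as the reference group.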
First, consider a factor with two levels, such as sex. A dichotomous
variable can be entered into a model as a factor or dummy coded by hand, for example as 0 for female and 1 for male. This is fairly straightforward and not overly cumbersome in this instance.
However, for a categorical variable with three or more levels, dummy coding by hand becomes more tiresome. It would be straightforward to set up a
for() loop, but it would also be computationally inefficient. Fortunately, R provides a simple way to perform a variety of codings. One way of doing this is via
contrasts(). This function lets you both view the contrasts for a factor and set them.
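For the two-level case, the equivalence between a factor and a hand-made 0/1 dummy can be sketched as follows. The data frame toy and its columns score, sex, and male are hypothetical and exist only for this illustration.

```r
# Hypothetical toy data; names and values are illustrative assumptions.
toy <- data.frame(
  score = c(10, 12, 11, 15, 17, 16),
  sex   = factor(c("female", "female", "female", "male", "male", "male"))
)

toy$male <- ifelse(toy$sex == "male", 1, 0)   # manual 0/1 dummy

contrasts(toy$sex)                   # the coding R applies to the factor
fit.factor <- lm(score ~ sex,  data = toy)
fit.dummy  <- lm(score ~ male, data = toy)

coef(fit.factor)                     # intercept = female mean, slope = male - female
coef(fit.dummy)                      # same numbers, different coefficient name
```

Because "female" sorts before "male", it becomes the reference level, which is why the factor version matches the 0 = female, 1 = male coding.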
Consider the following data frame, which contains an outcome variable, y,
and a factor, x, with six levels.
samp.dat <- data.frame(y=1:12, x=factor(rep(1:6, each=2)))
str(samp.dat)
'data.frame': 12 obs. of 2 variables:
$ y: int 1 2 3 4 5 6 7 8 9 10 ...
$ x: Factor w/ 6 levels "1","2","3","4",..: 1 1 2 2 3 3 4 4 5 5 ...
contrasts(samp.dat$x) #the default contrasts
2 3 4 5 6
1 0 0 0 0 0
2 1 0 0 0 0
3 0 1 0 0 0
4 0 0 1 0 0
5 0 0 0 1 0
6 0 0 0 0 1
The default contrasts are treatment contrasts (n-1 dummy variables with the
first level as the reference group) for unordered factors and orthogonal
polynomial contrasts for ordered factors. Notice that the reference group
(level 1) does not get its own regression coefficient; it is absorbed into the intercept.
summary(lm(y~x, data=samp.dat))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.5000     0.5000   3.000 0.024008 *
x2            2.0000     0.7071   2.828 0.030020 *
x3            4.0000     0.7071   5.657 0.001311 **
x4            6.0000     0.7071   8.485 0.000147 ***
x5            8.0000     0.7071  11.314 2.85e-05 ***
x6           10.0000     0.7071  14.142 7.81e-06 ***
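The interpretation of these estimates can be checked against the group means. The sketch below assumes the session still uses R's stock contrasts option; the helper name means is introduced here for illustration.

```r
samp.dat <- data.frame(y = 1:12, x = factor(rep(1:6, each = 2)))

# Mean of y within each level of x
means <- sapply(split(samp.dat$y, samp.dat$x), mean)
means

fit <- lm(y ~ x, data = samp.dat)
coef(fit)

# Under treatment contrasts the intercept is the mean of the reference
# level, and each remaining coefficient is a difference from that mean.
means[1]
means[-1] - means[1]

# Where the defaults come from, and what an ordered factor receives
getOption("contrasts")
contrasts(factor(1:3, ordered = TRUE))
```

If the contrasts option has been changed in your session, the coefficients will not line up with these differences.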
You can also specify contrasts other than the defaults (after all, this is
R). The built-in options include treatment, polynomial, sum-to-zero, and Helmert
(compare each level with the mean of the prior levels) contrasts. Here are examples.
contr.treatment(n=6)
2 3 4 5 6
1 0 0 0 0 0
2 1 0 0 0 0
3 0 1 0 0 0
4 0 0 1 0 0
5 0 0 0 1 0
6 0 0 0 0 1
contr.treatment(n=6, contrasts=FALSE) #this is the identity matrix
1 2 3 4 5 6
1 1 0 0 0 0 0
2 0 1 0 0 0 0
3 0 0 1 0 0 0
4 0 0 0 1 0 0
5 0 0 0 0 1 0
6 0 0 0 0 0 1
contr.treatment(n=levels(samp.dat$x), base=length(levels(samp.dat$x)))
#reference group is the last level
1 2 3 4 5
1 1 0 0 0 0
2 0 1 0 0 0
3 0 0 1 0 0
4 0 0 0 1 0
5 0 0 0 0 1
6 0 0 0 0 0
contr.sum(6)
  [,1] [,2] [,3] [,4] [,5]
1    1    0    0    0    0
2    0    1    0    0    0
3    0    0    1    0    0
4    0    0    0    1    0
5    0    0    0    0    1
6   -1   -1   -1   -1   -1
contr.poly(6)
             .L         .Q         .C         ^4          ^5
[1,] -0.5976143  0.5455447 -0.3726780  0.1889822 -0.06299408
[2,] -0.3585686 -0.1091089  0.5217492 -0.5669467  0.31497039
[3,] -0.1195229 -0.4364358  0.2981424  0.3779645 -0.62994079
[4,]  0.1195229 -0.4364358 -0.2981424  0.3779645  0.62994079
[5,]  0.3585686 -0.1091089 -0.5217492 -0.5669467 -0.31497039
[6,]  0.5976143  0.5455447  0.3726780  0.1889822  0.06299408
contr.helmert(6)
  [,1] [,2] [,3] [,4] [,5]
1   -1   -1   -1   -1   -1
2    1   -1   -1   -1   -1
3    0    2   -1   -1   -1
4    0    0    3   -1   -1
5    0    0    0    4   -1
6    0    0    0    0    5
contrasts(samp.dat$x) <- contr.helmert(6) #assign helmert contrasts to the factor
summary(lm(y~x, data=samp.dat)) #using helmert contrasts
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.50000    0.20412  31.843 6.37e-08 ***
x1           1.00000    0.35355   2.828 0.030020 *
x2           1.00000    0.20412   4.899 0.002714 **
x3           1.00000    0.14434   6.928 0.000448 ***
x4           1.00000    0.11180   8.944 0.000109 ***
x5           1.00000    0.09129  10.954 3.44e-05 ***
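Every Helmert coefficient here equals 1. To see why, note that with this coding the k-th coefficient works out to the difference between the mean of level k+1 and the average of the means of levels 1 through k, divided by k+1. The sketch below reproduces the estimates by hand; the names means and by.hand are illustrative.

```r
samp.dat <- data.frame(y = 1:12, x = factor(rep(1:6, each = 2)))
contrasts(samp.dat$x) <- contr.helmert(6)
fit <- lm(y ~ x, data = samp.dat)

means <- sapply(split(samp.dat$y, samp.dat$x), mean)

# (mean of level k+1 minus average of the earlier level means) / (k + 1)
by.hand <- sapply(1:5, function(k) {
  (means[k + 1] - mean(means[1:k])) / (k + 1)
})

cbind(lm = coef(fit)[-1], by.hand = by.hand)

# The intercept is the average of the group means (6.5 here)
c(coef(fit)[1], mean(means))
```

The group means rise by 2 per level in this data set, which is why each scaled comparison comes out to exactly 1.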
You can clearly see the different effects of coding. The general form of
these functions is to specify how many levels the factor has (hence
n=6
in many of the calls, or the use of levels(x)), sometimes followed by
additional arguments. For example, contrasts=FALSE in contr.treatment()
returns an n x n identity matrix, and
base specifies which level is the reference group, or base. What if
none of the functions produces acceptable contrasts, or we have specific hypotheses
we want to test? We can create our own matrix of contrasts
and assign it.
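Before building a matrix by hand, note that a built-in coding can also be requested for a single model without modifying the factor. The sketch below uses the contrasts argument of lm() with sum-to-zero coding; it is an alternative to assigning contrasts on the factor, not the approach used in the rest of this article.

```r
samp.dat <- data.frame(y = 1:12, x = factor(rep(1:6, each = 2)))

# Request sum-to-zero contrasts for x in this fit only
fit.sum <- lm(y ~ x, data = samp.dat, contrasts = list(x = "contr.sum"))
coef(fit.sum)

# With contr.sum the intercept is the average of the group means and each
# coefficient is a level's deviation from it (the last level is implied).
means <- sapply(split(samp.dat$y, samp.dat$x), mean)
means[1:5] - mean(means)

# The factor itself carries no stored contrasts after the fit
is.null(attr(samp.dat$x, "contrasts"))
```

This keeps the data frame untouched, which can be convenient when several models need different codings of the same factor.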
##Looking at another dataset
samp.dat2 <- data.frame(y=1:12, x=factor(rep(1:3, each=4)))
contrasts(samp.dat2$x)
2 3
1 0 0
2 1 0
3 0 1
summary(lm(y~x, data=samp.dat2))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.5000     0.6455   3.873  0.00377 **
x2            4.0000     0.9129   4.382  0.00177 **
x3            8.0000     0.9129   8.764 1.06e-05 ***
summary(lm(y~I(ifelse(x==2, 1, 0))+I(ifelse(x==3, 1, 0)), data=samp.dat2))
Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)               2.5000     0.6455   3.873  0.00377 **
I(ifelse(x == 2, 1, 0))   4.0000     0.9129   4.382  0.00177 **
I(ifelse(x == 3, 1, 0))   8.0000     0.9129   8.764 1.06e-05 ***
##Manually set contrasts
contrasts(samp.dat2$x) <- matrix(c(-1,0,1,-1,-1,2),ncol=2, byrow=FALSE,
dimnames=list(1:3,2:3))
contrasts(samp.dat2$x)
2 3
1 -1 -1
2 0 -1
3 1 2
##Also notice the output from str() once we set contrasts
str(samp.dat2)
'data.frame': 12 obs. of 2 variables:
$ y: int 1 2 3 4 5 6 7 8 9 10 ...
$ x: Factor w/ 3 levels "1","2","3": 1 1 1 1 2 2 2 2 3 3 ...
..- attr(*, "contrasts")= num [1:3, 1:2] -1 0 1 -1 -1 2
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr "1" "2" "3"
.. .. ..$ : chr "2" "3"
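The extra lines in the str() output show that the matrix is stored as an attribute of the factor. A minimal sketch of inspecting and removing that attribute follows; clearing it with attr() is one way to return to the defaults and assumes the session's contrasts option is unchanged.

```r
samp.dat2 <- data.frame(y = 1:12, x = factor(rep(1:3, each = 4)))

is.null(attr(samp.dat2$x, "contrasts"))   # TRUE: nothing stored, defaults apply

contrasts(samp.dat2$x) <- matrix(c(-1, 0, 1, -1, -1, 2), ncol = 2,
                                 dimnames = list(1:3, 2:3))
attr(samp.dat2$x, "contrasts")            # the matrix that was assigned

# Work on a copy so the custom coding stays in place on samp.dat2
x.default <- samp.dat2$x
attr(x.default, "contrasts") <- NULL      # drop the stored matrix
contrasts(x.default)                      # treatment contrasts again
```

Working on a copy leaves samp.dat2$x with its custom contrasts for any model fitted afterwards.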
summary(lm(y~x, data=samp.dat2))
Coefficients:
             Estimate Std. Error   t value Pr(>|t|)
(Intercept) 6.500e+00  3.727e-01    17.441 3.03e-08 ***
x2          4.000e+00  9.129e-01     4.382  0.00177 **
x3          1.813e-16  5.270e-01  3.44e-16  1.00000
summary(lm(y ~ I(ifelse(x==1, -1, ifelse(x==2, 0, 1))) +
I(ifelse(x==1, -1, ifelse(x==2, -1, 2))), data=samp.dat2))
Coefficients:
                                              Estimate Std. Error   t value Pr(>|t|)
(Intercept)                                  6.500e+00  3.727e-01    17.441 3.03e-08 ***
I(ifelse(x == 1, -1, ifelse(x == 2, 0, 1)))  4.000e+00  9.129e-01     4.382  0.00177 **
I(ifelse(x == 1, -1, ifelse(x == 2, -1, 2))) 1.813e-16  5.270e-01  3.44e-16  1.00000
Clearly, one could use ifelse() statements or other
means of coding as desired, but contrasts()
is a more intuitive and clearer method. It is easier to look at a contrast
matrix and determine the coding than to wade through a complex series of coding
statements. It is also easier to specify.
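One caution when reading a hand-built matrix: its columns are the codes entered into the model, and they are not always the weights applied to the group means. A way to see what each coefficient estimates is to invert the full coding matrix (a column of ones plus the contrast columns). The sketch below does this for the matrix used above; the names cm, means, and w are illustrative.

```r
samp.dat2 <- data.frame(y = 1:12, x = factor(rep(1:3, each = 4)))
cm <- matrix(c(-1, 0, 1, -1, -1, 2), ncol = 2, dimnames = list(1:3, 2:3))
contrasts(samp.dat2$x) <- cm
fit <- lm(y ~ x, data = samp.dat2)

colSums(cm)      # both columns sum to zero
crossprod(cm)    # off-diagonal entries are 3, so the columns are not orthogonal

means <- sapply(split(samp.dat2$y, samp.dat2$x), mean)

# Each row of the inverse gives the weights on the group means
# that produce the corresponding coefficient.
w <- solve(cbind(Intercept = 1, cm))
w
drop(w %*% means)   # reproduces coef(fit)
coef(fit)
```

The second row of w is (-1, 1, 0), so the coefficient labelled x2 estimates the level 2 mean minus the level 1 mean, even though its coding column is (-1, 0, 1). With orthogonal columns the weights would be proportional to the codes, which is one reason orthogonal contrasts are easier to interpret.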