Writing Model Formulæ
The formula for a model defines how all the variables fit together.
General Structure
The '~' operator goes between the two sides of the
formula. Generally, the left side is the outcome and the right side is the
predictor(s). Consider the following:
$y_i = \beta_0 + \beta_1 x_{i1} + \varepsilon_i$
or
$\hat{y}_i = \beta_0 + \beta_1 x_{i1}$
R includes the intercept and the error term automatically, so the formula is simply:
y ~ x1
For a two predictor model:
$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i$
y ~ x1 + x2
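As a quick check that the intercept really is implicit, the formulas above can be inspected with terms(), and a small fit shows the intercept among the coefficients. This is only a sketch: the data frame d, its simulated values, and the coefficients used to generate y are made up for illustration.

```r
# Inspect how R reads the two formulas; no data are needed for this step.
t1 <- terms(y ~ x1)
t2 <- terms(y ~ x1 + x2)

attr(t1, "intercept")    # 1 means an intercept is included
attr(t1, "term.labels")  # "x1"
attr(t2, "term.labels")  # "x1" "x2"

# Simulated data (an assumption for illustration only).
set.seed(1)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(50)

fit <- lm(y ~ x1 + x2, data = d)
names(coef(fit))         # "(Intercept)" "x1" "x2"
```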
The '+' operator works as expected to add terms to the model. This alone
would let us specify virtually any model, but the process could be laborious, so
R has some additional shortcuts. Consider extending the two predictor
model with an interaction.
$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 (x_{i1} x_{i2}) + \varepsilon_i$
There are three ways to write this in R. The first calculates the
interaction term manually, using the I() function to tell R to evaluate
everything inside I() arithmetically and treat the result as a single term.
The second uses the ':' operator, which has a special meaning in an R
formula: it indicates an interaction term. The third uses the '*' operator,
which also has a special meaning in a formula: it indicates main effects
plus their interaction.
y ~ x1 + x2 + I(x1*x2)
y ~ x1 + x2 + x1:x2
y ~ x1*x2
Notice how the meaning of '*' changes depending on
whether it is in the main formula or wrapped in I().
In the formula it expands to the main effects of x1 and x2 plus their interaction; inside I() it is ordinary multiplication.
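A minimal sketch of the three interaction specifications, fitted to simulated data, suggests that they describe the same model when x1 and x2 are numeric. The data frame d and the values used to generate y are assumptions made up for illustration.

```r
# Simulated numeric predictors with a true interaction (illustrative only).
set.seed(42)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 1 + 0.5 * d$x1 - 0.3 * d$x2 + 0.8 * d$x1 * d$x2 + rnorm(100)

fit_I     <- lm(y ~ x1 + x2 + I(x1 * x2), data = d)
fit_colon <- lm(y ~ x1 + x2 + x1:x2, data = d)
fit_star  <- lm(y ~ x1 * x2, data = d)

# '*' expands to both main effects plus the interaction.
attr(terms(fit_star), "term.labels")   # "x1" "x2" "x1:x2"

# The coefficient names differ ("I(x1 * x2)" versus "x1:x2"),
# but the estimates agree.
all.equal(unname(coef(fit_I)), unname(coef(fit_colon)))
all.equal(unname(coef(fit_colon)), unname(coef(fit_star)))
```

The equivalence with I() holds here because both predictors are numeric; with factors, ':' builds the appropriate interaction columns, which plain multiplication inside I() cannot do.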
But you are certainly not limited to two-way interactions.
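$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 (x_{i1} x_{i2}) + \beta_5 (x_{i1} x_{i3}) + \beta_6 (x_{i2} x_{i3}) + \beta_7 (x_{i1} x_{i2} x_{i3}) + \varepsilon_i$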
Again, this could be specified manually using I() (not
shown for obvious reasons), but the built-in formula operators
':' and '*' are easier.
The second version below expands to the first and is clearly more parsimonious (ergo
preferred).
y ~ x1 + x2 + x3 + x1:x2 + x1:x3 + x2:x3 + x1:x2:x3
y ~ x1*x2*x3
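The expansion can be checked without any data by comparing the term labels that terms() derives from each formula. The object names long and short are just illustrative.

```r
# The written-out and the compact three-way specifications.
long  <- y ~ x1 + x2 + x3 + x1:x2 + x1:x3 + x2:x3 + x1:x2:x3
short <- y ~ x1 * x2 * x3

# Terms R derives from the compact form, ordered by interaction degree.
attr(terms(short), "term.labels")

# Both formulas yield the same set of terms.
identical(attr(terms(long), "term.labels"),
          attr(terms(short), "term.labels"))
```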
Much as terms are added using '+' they can be dropped using '-'. Two
examples are shown. The first drops the intercept and the second drops the
three-way interaction term.
$y_i = \beta_1 x_{i1} + \varepsilon_i$
y ~ x1 -1
y ~ -1 + x1
$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 (x_{i1} x_{i2}) + \beta_5 (x_{i1} x_{i3}) + \beta_6 (x_{i2} x_{i3}) + \varepsilon_i$
y ~ x1*x2*x3 -x1:x2:x3
Again, the use is fairly straightforward. It is often easier to specify
many interactions using '*' and simply drop one term
than to write out each one.
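The effect of '-' can be confirmed the same way, by looking at the intercept flag and term labels that terms() reports. The object names below are illustrative.

```r
# Dropping the intercept: both spellings set the intercept flag to 0.
no_int_a <- terms(y ~ x1 - 1)
no_int_b <- terms(y ~ -1 + x1)
attr(no_int_a, "intercept")
attr(no_int_b, "intercept")

# Dropping only the three-way interaction from the full expansion.
pruned <- terms(y ~ x1 * x2 * x3 - x1:x2:x3)
attr(pruned, "term.labels")   # main effects and two-way interactions remain
```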
Next we consider models with polynomial terms. Although second and third
degree polynomials are the most common, the framework in R is such that the
degree is of little consequence. In this example, we will specify a cubic
polynomial in two ways.
$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i1}^2 + \beta_3 x_{i1}^3 + \varepsilon_i$
y ~ x1 + I(x1^2) + I(x1^3)
y ~ poly(x1, degree=3, raw=TRUE)
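As a rough check that the two cubic specifications above agree, the sketch below fits both to simulated data, along with the default orthogonal version of poly() for comparison. The data frame d and the coefficients used to generate y are assumptions for illustration.

```r
# Simulated cubic relationship (illustrative values only).
set.seed(7)
d <- data.frame(x1 = runif(200, -2, 2))
d$y <- 1 + 0.5 * d$x1 - 0.7 * d$x1^2 + 0.2 * d$x1^3 + rnorm(200, sd = 0.5)

fit_I    <- lm(y ~ x1 + I(x1^2) + I(x1^3), data = d)
fit_raw  <- lm(y ~ poly(x1, degree = 3, raw = TRUE), data = d)
fit_orth <- lm(y ~ poly(x1, degree = 3), data = d)  # default: orthogonal

# Raw polynomials reproduce the hand-written coefficients.
all.equal(unname(coef(fit_I)), unname(coef(fit_raw)))

# Orthogonal polynomials give different coefficients but the same fit.
all.equal(unname(fitted(fit_I)), unname(fitted(fit_orth)))
```

The orthogonal basis spans the same cubic space, so predictions match, but its coefficients are not the β values in the equation above.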
The use of poly() is simple enough. The
argument raw=TRUE is required to give the raw
polynomials rather than orthogonal ones. For any nth degree polynomial,
just use degree=n.
Sometimes it is useful in
models to constrain a variable's coefficient to one. This can be achieved
using offset().
$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + x_{i3} + \varepsilon_i$
y ~ x1 + x2 + offset(x3)
In the output, the variable and its coefficient will not be shown, but the offset
remains part of the model formula.
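One way to see what offset() does is to compare it with moving x3 to the left side by hand. The sketch below assumes simulated data (the data frame d and the generating coefficients are invented for illustration) and fits both versions.

```r
# Simulated data in which x3 truly has a coefficient of 1 (illustrative).
set.seed(3)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
d$y <- 2 + 1.5 * d$x1 - d$x2 + d$x3 + rnorm(100)

fit_offset  <- lm(y ~ x1 + x2 + offset(x3), data = d)
fit_shifted <- lm(I(y - x3) ~ x1 + x2, data = d)  # subtract x3 from y manually

# x3 has no estimated coefficient in the offset model...
names(coef(fit_offset))

# ...yet it is still part of the formula.
formula(fit_offset)

# Both approaches estimate the same remaining coefficients.
all.equal(coef(fit_offset), coef(fit_shifted))
```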
An additional helpful feature for models is update(). This is particularly
helpful in model testing where you are making relatively minor changes between
formulas. Below are some examples.
##Specify first model
> model.a <- lm(y ~ x1 + x2 + x3, data=DATA)
> formula(model.a)
y ~ x1 + x2 + x3
## Specify second model as the first plus an interaction term
> model.b <- update(model.a, . ~ . + x1:x3)
> formula(model.b)
y ~ x1 + x2 + x3 + x1:x3
##You decide that x2 is not a useful predictor and would like to drop it from model.b
> model.c <- update(model.b, . ~ . -x2)
> formula(model.c)
y ~ x1 + x3 + x1:x3
##Now suppose that you would like to see if the predictors in model.a fit some other outcome
> model.d <- update(model.a, yalt ~ .)
> formula(model.d)
yalt ~ x1 + x2 + x3
##Suppose you first fit a very complex model
> model.e <- lm(y ~ x1*x2*x3 + poly(x1, degree=5, raw=TRUE), data=DATA)
> formula(model.e)
y ~ x1 * x2 * x3 + poly(x1, degree = 5, raw = TRUE)
##But realized that it was excessive, and wished to drop all interaction terms
> model.f <- update(model.e, . ~ . -x1*x2*x3 + x1 + x2 + x3)
> formula(model.f)
y ~ poly(x1, degree = 5, raw = TRUE) + x1 + x2 + x3
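##Note that model.f differs from model.g only in the order of the changes. R does not
##keep duplicate terms, so in model.g adding x1, x2, and x3 (already present) does nothing,
##and removing x1*x2*x3 afterwards drops them along with the interactions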
> model.g <- update(model.e, . ~ . + x1 + x2 + x3 -x1*x2*x3)
> formula(model.g)
y ~ poly(x1, degree = 5, raw = TRUE)
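The ordering effect in model.f and model.g can be reproduced on bare formulas, since update() also accepts a formula as its first argument. In this sketch z is a hypothetical extra predictor standing in for the poly() term, and the object names are illustrative.

```r
# A formula with a full three-way expansion plus one extra term.
# z is a hypothetical predictor used only for this illustration.
base <- y ~ x1 * x2 * x3 + z

# Remove the whole expansion first, then add the main effects back.
drop_then_add <- update(base, . ~ . - x1 * x2 * x3 + x1 + x2 + x3)

# Add the main effects first (a no-op, they already exist), then remove.
add_then_drop <- update(base, . ~ . + x1 + x2 + x3 - x1 * x2 * x3)

attr(terms(drop_then_add), "term.labels")  # z and the three main effects
attr(terms(add_then_drop), "term.labels")  # only z survives
```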
When updating the model formula, '.' stands for the corresponding side of
the old formula: on the left it is the old outcome, and on the right it is the
old set of predictors. This allows the flexibility to reuse both sides of the old
formula, only one, or even neither (although there are few instances where it
is more efficient to use update() when you are not
using the original formula at all). I believe this is a fairly decent
introduction to writing formulæ in R. Below is a brief summary.
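The different placements of '.' can be tried on R's built-in mtcars dataset. This is only a sketch; the choice of mpg, wt, hp, qsec, and disp as variables is arbitrary and not taken from the examples above.

```r
fit_a <- lm(mpg ~ wt + hp, data = mtcars)

# '.' on both sides: keep outcome and predictors, add an interaction.
fit_b <- update(fit_a, . ~ . + wt:hp)

# '.' on the right only: keep the predictors, change the outcome.
fit_c <- update(fit_a, qsec ~ .)

# '.' on the left only: keep the outcome, replace the predictors.
fit_d <- update(fit_a, . ~ disp)

# No formula at all: refit the same model on a subset of the rows.
fit_e <- update(fit_a, data = subset(mtcars, cyl == 4))

formula(fit_b)
formula(fit_c)
formula(fit_d)
nobs(fit_e)   # number of four-cylinder cars used in the refit
```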
Summary:
'+' : add terms to the model
'-' : remove terms from a model
':' : specify an interaction between two or more variables
'*' : specify main effects and interactions
'.' : when updating, it represents the old formula; if used directly in a model, it specifies all other columns of the dataset
I() : My mnemonic is 'identity'; objects inside are interpreted as is
poly() : calculates polynomials, use raw=TRUE to avoid the default orthogonal polynomials
offset() : fix a variable's coefficient to 1
update() : update an old model; can be used to change the formula or to update the
model if the dataset was changed
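The second use of '.' in the summary, as shorthand for all other columns of the dataset, is not demonstrated above. A minimal sketch using three columns of R's built-in mtcars dataset (an arbitrary choice for illustration) might look like this:

```r
# Keep only the outcome and two predictors so '.' has a small meaning.
d <- mtcars[, c("mpg", "wt", "hp")]

# '.' stands for every column of d other than the outcome.
fit_all <- lm(mpg ~ ., data = d)
formula(fit_all)                        # the '.' is expanded in the stored terms

# '.' combines with '-' to mean "everything except".
fit_but <- lm(mpg ~ . - hp, data = d)
attr(terms(fit_but), "term.labels")
```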