Writing Model Formulæ

The formula for a model defines how all the variables fit together.

General Structure

The '~' operator goes between the two sides of the formula.  Generally, the left side is the outcome and the right side is the predictor(s).  Consider the following:

$y_i = \beta_0 + \beta_1 x_{i1} + \varepsilon_i$
or
$\hat{y}_i = \beta_0 + \beta_1 x_{i1}$

R automatically includes the intercept and the error, so it would just be:

y ~ x1
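
As a quick check that the intercept really is added for you, the sketch below fits this formula to simulated data.  The data frame dat and the coefficients used to generate it are made up purely for illustration.

```r
# Simulated data; 'dat' and the generating coefficients are illustrative assumptions
set.seed(1)
dat <- data.frame(x1 = rnorm(50))
dat$y <- 2 + 3 * dat$x1 + rnorm(50)

# Only x1 is named in the formula, yet an intercept is estimated too
fit <- lm(y ~ x1, data = dat)
names(coef(fit))
```

The coefficient names include "(Intercept)" even though the formula never mentions it.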

For a two predictor model:

$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i$

y ~ x1 + x2

The '+' operator works as expected to add terms to the model.  This alone would let us specify virtually any model, but the process could be laborious, so R has some additional shortcuts.  Consider extending the two predictor model with an interaction.

$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 (x_{i1} x_{i2}) + \varepsilon_i$

There are actually three options here.  The first one manually calculates the interaction term, using the I() function to tell R to calculate everything inside I() first and treat it as a single term.  The next formula uses the ':' operator, which has a special meaning in an R formula: it indicates an interaction term.  Finally, we use the '*' operator, which also has a special meaning in a formula: it indicates main effects and an interaction.

y ~ x1 + x2 + I(x1*x2)
y ~ x1 + x2 + x1:x2
y ~ x1*x2
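
To see that the three spellings describe the same model, here is a minimal sketch on simulated data.  It assumes both predictors are numeric (arithmetic inside I() is not meaningful for factors), and the data frame dat is invented for the example.

```r
# Simulated numeric predictors; 'dat' is an illustrative assumption
set.seed(2)
dat <- data.frame(x1 = rnorm(40), x2 = rnorm(40))
dat$y <- 1 + dat$x1 - dat$x2 + 0.5 * dat$x1 * dat$x2 + rnorm(40)

fit.I     <- lm(y ~ x1 + x2 + I(x1 * x2), data = dat)
fit.colon <- lm(y ~ x1 + x2 + x1:x2, data = dat)
fit.star  <- lm(y ~ x1 * x2, data = dat)

# The I() version labels its term "I(x1 * x2)" rather than "x1:x2",
# so compare the estimates with the names stripped
same.I    <- all.equal(unname(coef(fit.I)), unname(coef(fit.colon)))
same.star <- all.equal(coef(fit.colon), coef(fit.star))
c(same.I, same.star)
```

Only the label of the interaction coefficient differs between the I() version and the other two.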

Notice how the effect of '*' changes depending on whether it is in the main formula or wrapped in I().  Inside I() it is ordinary multiplication; in the formula itself it expands to the main effects of x1 and x2 plus their interaction.  But you are certainly not limited to two-way interactions.

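$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 (x_{i1} x_{i2}) + \beta_5 (x_{i1} x_{i3}) + \beta_6 (x_{i2} x_{i3}) + \beta_7 (x_{i1} x_{i2} x_{i3}) + \varepsilon_i$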

Again this could be specified manually using I() (not shown for obvious reasons), but the built-in formula operators ':' and '*' are easier.  The second version expands to the first and is clearly more parsimonious (ergo preferred).

y ~ x1 + x2 + x3 + x1:x2 + x1:x3 + x2:x3 + x1:x2:x3
y ~ x1*x2*x3
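
If you want to confirm what a shortcut expands to, one option is to ask terms() for the term labels.  This is only a small sketch; no data are needed because the formula is parsed symbolically.

```r
# Term labels produced by the '*' shortcut
expanded <- attr(terms(y ~ x1 * x2 * x3), "term.labels")
expanded

# Term labels from the fully written-out version
written <- attr(terms(y ~ x1 + x2 + x3 + x1:x2 + x1:x3 + x2:x3 + x1:x2:x3),
                "term.labels")

identical(expanded, written)
```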

Much as terms are added using '+' they can be dropped using '-'.  Two examples are shown.  The first drops the intercept and the second drops the three-way interaction term.

$y_i = \beta_1 x_{i1} + \varepsilon_i$

y ~ x1 -1
y ~ -1 + x1

$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 (x_{i1} x_{i2}) + \beta_5 (x_{i1} x_{i3}) + \beta_6 (x_{i2} x_{i3}) + \varepsilon_i$

y ~ x1*x2*x3 -x1:x2:x3
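
The effect of '-' can be inspected the same way, without fitting anything.  The object names below (no.int, no.3way) are just illustrative.

```r
# terms() parses a formula symbolically, so no data are required
no.int <- terms(y ~ x1 - 1)
attr(no.int, "intercept")      # 0 means the intercept was dropped

no.3way <- terms(y ~ x1*x2*x3 - x1:x2:x3)
attr(no.3way, "term.labels")   # every term except x1:x2:x3
```

Writing y ~ 0 + x1 is another common way to drop the intercept.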

Again the use is fairly straightforward.  It is often easier to specify many interactions using '*' and simply drop one term than to write out each one.

Next we consider models with polynomial terms.  Although second and third degree polynomials are the most common, the framework in R is such that the degree is of little consequence.  In this example, we will specify a cubic polynomial in two ways.

$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i1}^2 + \beta_3 x_{i1}^3 + \varepsilon_i$

y ~ x1 + I(x1^2) + I(x1^3)
y ~ poly(x1, degree=3, raw=TRUE)
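
The sketch below, on simulated data, illustrates what raw=TRUE changes.  The data frame dat and its generating values are assumptions made for the example.

```r
# Simulated data with a cubic trend; 'dat' is an illustrative assumption
set.seed(3)
dat <- data.frame(x1 = runif(60, -2, 2))
dat$y <- 1 + dat$x1 - 0.5 * dat$x1^2 + 0.25 * dat$x1^3 + rnorm(60, sd = 0.3)

fit.I    <- lm(y ~ x1 + I(x1^2) + I(x1^3), data = dat)
fit.raw  <- lm(y ~ poly(x1, degree = 3, raw = TRUE), data = dat)
fit.orth <- lm(y ~ poly(x1, degree = 3), data = dat)   # default: orthogonal

# Raw polynomials reproduce the I() coefficients (names differ, values match)
same.coef <- all.equal(unname(coef(fit.I)), unname(coef(fit.raw)))

# Orthogonal polynomials give different coefficients but the same fitted curve
same.fit <- all.equal(fitted(fit.raw), fitted(fit.orth))
c(same.coef, same.fit)
```

So the choice between raw and orthogonal polynomials affects how the coefficients are read, not the predictions.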

The use of poly() is simple enough.  The argument raw=TRUE is required to give the raw polynomials rather than orthogonal ones.  For any nth degree polynomial, just use degree=n.

Sometimes it is useful in models to constrain a variable's coefficient to one.  This can be achieved using offset().

$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + x_{i3} + \varepsilon_i$

y ~ x1 + x2 + offset(x3)
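
One way to convince yourself of what offset() does in lm() is to compare it with subtracting the variable from the outcome by hand.  This is a sketch on simulated data; dat is invented for the example.

```r
# Simulated data in which x3 truly has a coefficient of 1; 'dat' is illustrative
set.seed(4)
dat <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + dat$x3 + rnorm(50)

fit.off <- lm(y ~ x1 + x2 + offset(x3), data = dat)
names(coef(fit.off))   # no coefficient is reported for x3

# Same model, with x3 moved to the left-hand side instead
fit.lhs <- lm(I(y - x3) ~ x1 + x2, data = dat)
same <- all.equal(coef(fit.off), coef(fit.lhs))
same
```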

In the output, the variable and its coefficient will not be shown, but the offset term still appears if you look at the formula.

An additional helpful feature for models is update().  This is particularly helpful in model testing, where you are making relatively minor changes between formulas.  Below are some examples.

##Specify first model
> model.a <- lm(y ~ x1 + x2 + x3, data=DATA)
> formula(model.a)
y ~ x1 + x2 + x3

##Specify second model as the first plus an interaction term
> model.b <- update(model.a, . ~ . + x1:x3)
> formula(model.b)
y ~ x1 + x2 + x3 + x1:x3

##You decide that x2 is not a useful predictor and would like to drop it from model.b
> model.c <- update(model.b, . ~ . -x2)
> formula(model.c)
y ~ x1 + x3 + x1:x3

##Now suppose that you would like to see if the predictors in model.a fit some other outcome
> model.d <- update(model.a, yalt ~ .)
> formula(model.d)
yalt ~ x1 + x2 + x3

##Suppose you first fit a very complex model
> model.e <- lm(y ~ x1*x2*x3 + poly(x1, degree=5, raw=TRUE), data=DATA)
> formula(model.e)
y ~ x1 * x2 * x3 + poly(x1, degree = 5, raw = TRUE)

##But realized that it was excessive, and wished to drop all interaction terms
> model.f <- update(model.e, . ~ . -x1*x2*x3 + x1 + x2 + x3)
> formula(model.f)
y ~ poly(x1, degree = 5, raw = TRUE) + x1 + x2 + x3

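##Note that terms are processed left to right and R ignores duplicate terms, so the order matters: below, x1, x2, and x3 are already in the model, adding them again does nothing, and -x1*x2*x3 then removes them, so model.g differs from model.f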
> model.g <- update(model.e, . ~ . + x1 + x2 + x3 -x1*x2*x3)
> formula(model.g)
y ~ poly(x1, degree = 5, raw = TRUE)

When updating the model formula, '.' stands for the corresponding side of the old formula: the old outcome on the left of '~' and the old predictors on the right.  This allows flexibility to include both sides of the old formula or only one, or even none (although there are limited instances where it is more efficient to use update() when you are not using the original formula at all).  I believe this is a fairly decent introduction to writing formulæ in R.  Below is a brief summary.
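
update() also works on a bare formula, which makes it easy to experiment with '.' without fitting anything.  A small sketch follows; the names f, f.int, f.alt, and f.log are only illustrative.

```r
f <- y ~ x1 + x2 + x3

f.int <- update(f, . ~ . + x1:x3)   # keep both sides, add a term
f.alt <- update(f, yalt ~ .)        # new outcome, old predictors
f.log <- update(f, log(.) ~ .)      # '.' can sit inside a function call

f.int
f.alt
f.log
```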

Summary:

'+'      : add terms to the model
'-'      : remove terms from a model
':'      : specify an interaction between two or more variables
'*'      : specify main effects and interactions
'.'      : when updating, it represents the corresponding side of the old formula; if used directly in a model, it specifies all other columns of the dataset
I()      : My mnemonic is 'identity'; objects inside are interpreted as is (arithmetic, not formula operators)
poly()   : calculates polynomials, use raw=TRUE to avoid the default orthogonal polynomials
offset() : fix a variable's coefficient to 1
update() : update an old model; can be used to change the formula or to update the model if the dataset was changed
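
The second meaning of '.' is not shown in the examples above, so here is a minimal sketch.  The data frame dat is made up, and terms() needs it in order to know which columns '.' should expand to.

```r
# Illustrative data frame with an outcome and two other columns
set.seed(5)
dat <- data.frame(y = rnorm(20), x1 = rnorm(20), x2 = rnorm(20))

# '.' expands to every column of 'dat' not already used in the formula
labels.dot <- attr(terms(y ~ ., data = dat), "term.labels")
labels.dot

# So these two calls fit the same model
fit.dot  <- lm(y ~ ., data = dat)
fit.full <- lm(y ~ x1 + x2, data = dat)
all.equal(coef(fit.dot), coef(fit.full))
```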