Applying an interaction term to all the IVs

by Robert Kubrick   Last Updated July 12, 2019 10:19 AM

I have a linear model with 6 IVs and would like to analyze the effect of an interaction term applied to all the IVs.

To illustrate, let's say we're predicting the win/loss ratio of NBA basketball teams based on a number of player statistics, and we want to add the number of spectators coming to the games as an interaction term on all the predictors. The idea is that higher fan participation in the stadiums will amplify the players' skills. Conversely, if stadiums register low participation (look at the Nets), it will hurt the players' ability to perform at their best or even average levels (side note: we do not want to use the number of spectators as a predictor per se).

In MLR terms the model would be: $$ \hat{Y} = c + b_1X_1 + b_2X_2 + \dots + b_nX_n + a_1I_1X_1 + a_2I_2X_2 + \dots + a_nI_nX_n $$

Where $X_i$ are the player statistics and $I_i$ is a measure of crowd participation.
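As a rough illustration of the model above, here is a minimal sketch of how its design matrix could be built. It assumes a single crowd measure shared by every interaction term, and the data (`X`, `crowd`, `y`) and the helper name `interaction_design` are hypothetical placeholders, not real NBA figures.

```python
import numpy as np

def interaction_design(X, crowd):
    """Return the columns [1, X_1..X_p, crowd*X_1..crowd*X_p].

    Mirrors the equation in the question: no stand-alone crowd column.
    """
    X = np.asarray(X, dtype=float)
    crowd = np.asarray(crowd, dtype=float).reshape(-1, 1)
    intercept = np.ones((X.shape[0], 1))
    return np.hstack([intercept, X, crowd * X])

# Hypothetical data: 30 teams, 6 player statistics, one crowd measure.
rng = np.random.default_rng(0)
n_teams, n_stats = 30, 6
X = rng.normal(size=(n_teams, n_stats))      # placeholder player statistics
crowd = rng.uniform(0.5, 1.0, size=n_teams)  # placeholder attendance measure
y = rng.normal(size=n_teams)                 # placeholder response

D = interaction_design(X, crowd)
coef, *_ = np.linalg.lstsq(D, y, rcond=None)  # ordinary least squares

# Split the fitted vector into c, the b_i and the a_i of the equation.
c, b, a = coef[0], coef[1:1 + n_stats], coef[1 + n_stats:]
print(D.shape)  # 1 intercept + 6 main effects + 6 interactions
```

With 6 IVs the matrix has 13 columns against 30 rows, which makes the doubling of terms concrete.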

If the set of player-skill IVs is large, the interaction terms will double the number of model terms, raising the chance of overfitting and probably decreasing the predictive ability of the model.

Are there methods other than multiple regression to adjust the linear coefficients given one or more "background" variables? Or is there a way to reduce the number of terms?



Answers 3


Your equation has "a1I1X1" without any simpler term "b1I1". It is rare that one would want to include an interaction term without also including a main-effect term for each variable in the tested interaction. This page tells why, if you look for those arguments; they are mixed in among the descriptions of the rare cases in which they might be bypassed.

rolando2
May 19, 2012 17:20 PM

I agree with @rolando2 that the background variable should be included in the model. I'm actually not convinced that the interaction terms need to be. You want to be wary of most attempts to "reduce the number of terms" (see here), but it is acceptable to conduct an a-priori nested model test. I would test the model with the first-order terms (including fan participation) against the full model with all the first-order terms plus all the two-way interactions you are interested in. You would then stick with the model that's better.
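The nested model test described above could be sketched as follows. This is only an illustration under stated assumptions: the data are simulated, the names `rss` and `nested_f` are hypothetical helpers, and both design matrices are assumed to have full column rank.

```python
import numpy as np

def rss(D, y):
    """Residual sum of squares of the least-squares fit of y on D."""
    coef, *_ = np.linalg.lstsq(D, y, rcond=None)
    resid = y - D @ coef
    return float(resid @ resid)

def nested_f(D_reduced, D_full, y):
    """F statistic for a reduced model nested inside a full model."""
    n = len(y)
    p_reduced, p_full = D_reduced.shape[1], D_full.shape[1]
    df_num, df_den = p_full - p_reduced, n - p_full
    rss_reduced, rss_full = rss(D_reduced, y), rss(D_full, y)
    f_stat = ((rss_reduced - rss_full) / df_num) / (rss_full / df_den)
    return f_stat, df_num, df_den

# Simulated placeholder data: 200 observations, 6 IVs, one crowd measure.
rng = np.random.default_rng(1)
n, p = 200, 6
X = rng.normal(size=(n, p))
crowd = rng.uniform(0.5, 1.0, size=n)
y = 1.0 + X @ np.full(p, 0.3) + 0.5 * crowd + rng.normal(size=n)

ones = np.ones((n, 1))
# Reduced model: all first-order terms, including fan participation.
D_reduced = np.hstack([ones, X, crowd[:, None]])
# Full model: the same terms plus every crowd-by-IV interaction.
D_full = np.hstack([D_reduced, crowd[:, None] * X])

f_stat, df_num, df_den = nested_f(D_reduced, D_full, y)
print(f_stat, df_num, df_den)
```

The statistic is then compared with an F distribution on `df_num` and `df_den` degrees of freedom, for example with `scipy.stats.f.sf(f_stat, df_num, df_den)` if SciPy is available.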

It is also true, as you state, that having many predictors (relative to the size of the total data set) increases out-of-sample error. However, I think you are using the wrong type of model; you actually have more data to fit your parameters than you think. Instead of using the win/loss ratio as your predicted variable, you should use each game (a Bernoulli trial: either a win or a loss) as your response variable. That is, you should be doing logistic regression. Moreover, you need to use a multilevel model, because the games are nested within teams. You will want to explore generalized estimating equations (GEE) to fit such a model. My main point here, however, is that you have many more observations to estimate your parameters than you assume, and thus you are at less risk of overfitting than you fear.
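To make the game-level idea concrete, here is a small sketch that expands season records into one Bernoulli row per game and fits a plain logistic regression by Newton-Raphson. The records and the single skill feature are made up, the helper names are hypothetical, and the fit deliberately ignores the nesting of games within teams, so it is not a substitute for the GEE or multilevel fit recommended above.

```python
import numpy as np

def expand_to_games(wins, losses, team_features):
    """One row per game: y = 1 for a win, 0 for a loss."""
    rows, y, team_id = [], [], []
    for t, (w, l) in enumerate(zip(wins, losses)):
        rows.extend([team_features[t]] * (w + l))
        y.extend([1] * w + [0] * l)
        team_id.extend([t] * (w + l))
    return (np.array(rows, dtype=float),
            np.array(y, dtype=float),
            np.array(team_id))

def fit_logistic(X, y, n_iter=25):
    """Newton-Raphson fit of a logistic regression with an intercept."""
    D = np.hstack([np.ones((X.shape[0], 1)), X])
    beta = np.zeros(D.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(D @ beta)))  # predicted win probability
        grad = D.T @ (y - p)                   # score vector
        hess = D.T @ (D * (p * (1.0 - p))[:, None])  # information matrix
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# Made-up season records for four teams (82 games each) and one
# standardized skill statistic per team.
wins = [50, 41, 30, 20]
losses = [32, 41, 52, 62]
team_features = [[1.0], [0.3], [-0.4], [-1.1]]

X_games, y_games, team_id = expand_to_games(wins, losses, team_features)
beta = fit_logistic(X_games, y_games)
print(len(y_games), beta)
```

Four season ratios become 328 game-level observations. In a real data set the rows would also carry game-level covariates such as that game's attendance, and `team_id` is the grouping variable a GEE or mixed-model routine (statsmodels, for instance, provides a GEE implementation) would use to account for within-team correlation.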

gung
May 19, 2012 18:36 PM

I just want to mention that in medicine certain types of interactions are important. These are variables that interact with treatment in such a way that you would, for example, change the prescription of a drug treatment based on the covariate's value. It is not always the case that these variables will be among the strongest for predicting the response. Hence standard subset selection methods may not pick these variables. But we would want to include them in the model to let the physician make personalized treatment decisions. Lacey Gunter developed a method to identify these variables in linear regression models in her PhD dissertation at the University of Michigan. This work was published in Statistical Methodology in 2010. In 2011 I coauthored a paper with her in the Pakistan Journal of Statistics and Operations Research which simplified the method in a way that makes it applicable to logistic regression and Cox proportional hazards regression. A summary paper on these methods was published in the Journal of Biopharmaceutical Statistics in November 2011.

This idea might extend to other fields of application. If an important variable interacts with another variable, and interpretation of the important variable is part of the decision making, then the variables that interact with it should be identified, and these methods can be used to do that. If this applies to your problem, then I think my answer is exceptionally helpful. If not, it is at least enlightening information.

Michael Chernick
May 19, 2012 22:08 PM
