by dshin
Last Updated August 10, 2018 10:19 AM

I'm in a setting where I am trying to model a continuous output variable given ~100 variables and ~100k datapoints. The signal-to-noise ratio is extremely low, and collinearity is very high. Among the variables are many "needle-in-a-haystack" binary-valued features. A "needle-in-a-haystack" binary-valued feature $f$ is one where $\Pr[f = 1]$ is small (~0.01), but where it is important for our model to be unbiased when $f = 1$.

When I use OLS, the resulting model is properly unbiased when $f = 1$. However, it has undesirable characteristics stemming from the noise and collinearity.

When I try elastic-net regularization, the noise/collinearity problems go away. However, the act of regularizing appears to sacrifice unbiasedness for the needle-in-a-haystack features: even when $f$ is selected by the model, the model produces unacceptably large residuals when $f = 1$.
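This shrinkage effect is easy to reproduce on toy data. Below is a minimal sketch with scikit-learn; the event rate, effect size, and penalty strength are all assumed for illustration, not taken from the actual problem:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, LinearRegression

rng = np.random.default_rng(0)

# Simulated toy data (assumed, not the real dataset): a rare binary
# feature firing with probability ~0.01 and a true effect of +5.
n = 20_000
f = (rng.random(n) < 0.01).astype(float)
X = np.column_stack([rng.normal(size=(n, 5)), f])
y = X[:, 0] + 5.0 * f + rng.normal(scale=2.0, size=n)

ols = LinearRegression().fit(X, y)
# alpha is illustrative; any nontrivial penalty shows the same pattern.
enet = ElasticNet(alpha=0.02, l1_ratio=0.5).fit(X, y)

print("true needle coefficient:", 5.0)
print("OLS estimate:           ", ols.coef_[-1])
print("elastic-net estimate:   ", enet.coef_[-1])

# The penalized fit systematically under-predicts whenever f == 1:
resid = y - enet.predict(X)
print("mean elastic-net residual when f == 1:", resid[f == 1].mean())
```

Intuitively, because the feature is active in only ~1% of rows, its contribution to the squared-error term is tiny, so the penalty shrinks its coefficient far more (proportionally) than it shrinks dense features.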

I'm wondering how I can get the best of both worlds. Currently I train an elastic-net regularized model first, and then train a second OLS model to predict its residuals from the needle-in-a-haystack features. This seems to work decently, but I'm wondering if there is a more standard way.
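For reference, the two-stage setup described above can be sketched as follows. The data is simulated and the fixed penalty is an assumption (in practice something like `ElasticNetCV` would choose it):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, LinearRegression

rng = np.random.default_rng(1)

# Simulated stand-in for the real data (all numbers assumed):
# collinear dense features plus one rare binary "needle" feature.
n = 20_000
dense = rng.normal(size=(n, 5))
dense[:, 1] = dense[:, 0] + 0.01 * rng.normal(size=n)  # heavy collinearity
needle = (rng.random(n) < 0.01).astype(float)
X = np.column_stack([dense, needle])
y = 0.5 * dense[:, 0] + 5.0 * needle + rng.normal(scale=2.0, size=n)

# Stage 1: elastic net tames the noise/collinearity but shrinks the
# needle coefficient, leaving biased predictions when the feature fires.
enet = ElasticNet(alpha=0.02, l1_ratio=0.5).fit(X, y)
stage1_resid = y - enet.predict(X)

# Stage 2: unpenalized OLS predicting the stage-1 residuals from the
# needle features only, which removes that bias.
ols = LinearRegression().fit(needle.reshape(-1, 1), stage1_resid)

def predict(X_new, needle_new):
    return enet.predict(X_new) + ols.predict(needle_new.reshape(-1, 1))

mask = needle == 1
final_resid = y - predict(X, needle)
print("mean residual when f = 1, elastic net only:", stage1_resid[mask].mean())
print("mean residual when f = 1, two-stage model: ", final_resid[mask].mean())
```

Since stage 2 is plain OLS on a binary indicator (plus intercept), its fitted residuals average exactly zero within the $f = 1$ group, which is what restores unbiasedness there.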
