MissForest for SurveyData

by PythonBeginner   Last Updated June 12, 2019 07:19 AM

Hello fellow data scientists,

I am currently reading the paper by Stekhoven & Bühlmann about MissForest. I was wondering how to deal with variables that are restricted by domain knowledge. For example, women cannot have had prostate cancer in the past, so missing values are expected for this item. Should I just exclude such variables (where missing values are wanted / intended) from the MissForest imputation?

If so, how can I combine these variables with the imputed datasets afterwards?

I hope this is specific enough. Thanks in advance.

Answers 1

Usually it is better to first apply logical rules to fill the blanks that domain knowledge already determines, and only then, if values are still missing, run algorithmic imputation.
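As a rough sketch of that order for the survey case in the question: the records, the column names, and the choice of 0 for "no" are made up for illustration, and `impute_median` is only a simple stand-in for the algorithmic step, not MissForest itself.

```python
from statistics import median

# Hypothetical survey records; None marks a missing value.
rows = [
    {"sex": "F", "age": 34,   "prostate_cancer": None},
    {"sex": "M", "age": None, "prostate_cancer": 0},
    {"sex": "M", "age": 61,   "prostate_cancer": 1},
    {"sex": "F", "age": 47,   "prostate_cancer": None},
]


def apply_logical_rules(records):
    """Fill blanks that domain knowledge already determines."""
    filled = []
    for rec in records:
        rec = dict(rec)  # copy so the input records stay untouched
        # Assumed rule: a blank for a woman means "no" (encoded as 0).
        if rec["sex"] == "F" and rec["prostate_cancer"] is None:
            rec["prostate_cancer"] = 0
        filled.append(rec)
    return filled


def impute_median(records, column):
    """Stand-in imputer (NOT MissForest): replace None with the column median."""
    observed = [r[column] for r in records if r[column] is not None]
    fill = median(observed)
    return [{**r, column: fill if r[column] is None else r[column]}
            for r in records]


step1 = apply_logical_rules(rows)    # rules first
step2 = impute_median(step1, "age")  # then impute what is still missing
```

Because the rule-filled column stays in the same records, there is nothing to merge back afterwards. If a variable were instead set aside before imputation, it would have to be rejoined by row position or by a shared identifier.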

Take, for example, a data set about house characteristics. One column is "swimming pool", with either a 1 (yes) or a missing value (no). Because 1 is the only value ever observed, algorithmic imputation would set every missing entry to "1", destroying all information about whether a house has a pool.
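The pool example can be reproduced in a few lines. The values are made up, and the most-frequent-value fill is only a stand-in for an algorithmic imputer, which would face the same problem because 1 is the only value it ever sees.

```python
from statistics import mode

# Hypothetical "swimming pool" column: 1 = yes, None = blank (meaning no pool).
pool = [1, None, None, 1, None]

# Naive imputation (stand-in: most frequent observed value).
observed = [v for v in pool if v is not None]
naive = [mode(observed) if v is None else v for v in pool]

# Logical rule applied first: a blank means "no pool", so encode it as 0.
ruled = [0 if v is None else v for v in pool]

print(naive)  # [1, 1, 1, 1, 1]
print(ruled)  # [1, 0, 0, 1, 0]
```

After the naive fill every house appears to have a pool, whereas the rule-based fill keeps the yes/no distinction.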

Michael M
June 12, 2019 07:09 AM
