Hello fellow data scientist,
I am currently reading the paper by Stekhoven & Bühlmann about MissForest. I was wondering how to deal with variables that are restricted by domain knowledge. For example, women cannot have had prostate cancer in the past, so missing values are intended for this item. Should I just exclude such variables (where missing values are wanted / intended) from the MissForest imputation?
If so, how can I combine these variables with the imputed datasets afterwards?
I hope this is specific enough. Thanks in advance.
Usually it is better to first apply logical rules to fill some of the blanks, and then, if needed, run algorithmic imputation on what remains.
Take, e.g., a data set about house characteristics. One column is "swimming pool", with either a 1 (yes) or a missing value (no). Since 1 is the only observed value in that column, algorithmic imputation would set all missing entries to "1", destroying all information about having a pool or not.
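To make the order of operations concrete, here is a minimal sketch in Python with pandas, using the house example. The data, the column names (area_m2, swimming_pool) and the helper functions are all made up for illustration. The median fill is only a stand-in for an algorithmic imputer such as MissForest; it is not MissForest itself.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
houses = pd.DataFrame({
    "area_m2": [120.0, np.nan, 95.0, 210.0],
    "swimming_pool": [1.0, np.nan, np.nan, 1.0],  # 1 = yes, missing = no
})


def apply_logical_rules(df):
    """Step 1: fill blanks whose meaning is known from domain knowledge."""
    out = df.copy()
    # Rule: a missing pool entry means "no pool", so encode it as 0.
    out["swimming_pool"] = out["swimming_pool"].fillna(0).astype(int)
    return out


def impute_remaining(df):
    """Step 2: algorithmic imputation of whatever is still missing.

    Assumption: a simple median fill stands in for MissForest here.
    """
    out = df.copy()
    for col in out.columns[out.isna().any()]:
        out[col] = out[col].fillna(out[col].median())
    return out


filled = impute_remaining(apply_logical_rules(houses))
print(filled)
```

Because the pool column is complete after step 1, the imputer in step 2 never touches it and can even use it as a predictor. For a case like the prostate cancer item, the same idea would apply with a rule that codes the blank as an explicit "not applicable" value instead of 0.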