Discrete values (enum) in dataset

by ozw1z5rd   Last Updated August 14, 2018 14:19 PM

I have a data set where some columns have discrete values, such as $x_2=('cat','dog','penguin')$, $x_3=('high','low')$, etc. How do I handle these values before running a regression?

Do I have to convert them into integers like $x_2=(0,1,2)$, $x_3=(0,1)$?

Or do I have to add more columns $x_{cat}, x_{dog}, x_{penguin}, x_{high}, x_{low}$ and assign each of them a value of 0 or 1?



Answers 1


Converted to an answer from my comments.

Most modern software does this for you. It does something similar, for some meaning of similar, to what you outline in your last sentence. I would recommend letting your software do the heavy lifting here, and if it does not offer this facility, choose different software. There are some hints in the Q&A "Dummy variables for categories in logistic regression and odd ratio", which is about logistic regression but applies to linear and Poisson regression as well.

Note that your option $x_2$ may work as long as you tell the software that these are categories, not numerical values. Internally the software will do something like your last suggestion ($x_{cat}$ and so on).
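As a rough illustration of what "telling the software these are categories" can look like, here is a minimal sketch in Python with pandas. The choice of pandas is an assumption (the question does not name any software), and the data frame, its column names `x2` and `x3`, and its four rows are made up for the example. It marks the columns as categorical and then expands them into 0/1 indicator columns, which is roughly the step a regression routine performs internally.

```python
import pandas as pd

# Hypothetical toy data using the category levels from the question.
df = pd.DataFrame({
    "x2": ["cat", "dog", "penguin", "dog"],
    "x3": ["high", "low", "low", "high"],
})

# Declare the columns as categorical rather than numeric or free text.
df["x2"] = df["x2"].astype("category")
df["x3"] = df["x3"].astype("category")

# Expand each categorical column into 0/1 indicator columns.
# drop_first=True omits one reference level per variable ("cat" and "high"
# here), so the indicators are not perfectly collinear with an intercept.
dummies = pd.get_dummies(df, columns=["x2", "x3"], drop_first=True, dtype=int)
print(dummies)
```

With one level dropped per variable, the remaining columns (`x2_dog`, `x2_penguin`, `x3_low`) are read relative to the omitted reference levels. This differs slightly from the full set of columns proposed in the question, which is the "for some meaning of similar" caveat above.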

mdewey
August 14, 2018 13:28 PM
