Pandas Get Dummies - pd.get_dummies()
Create dummy variables from categorical columns using pandas
When you're doing machine learning, you'll work with algorithms that cannot process categorical variables. In this case, you need to turn your column of labels (Ex: ['cat', 'dog', 'bird', 'cat']) into separate columns of 0s and 1s. This is called getting dummy columns in pandas.
Pandas pd.get_dummies() will turn your categorical column (column of labels) into indicator columns (columns of 0s and 1s).
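To make this concrete, here is a minimal sketch using the label list from above. The variable names are just for illustration. One assumption worth flagging: recent pandas versions return True/False columns by default, so this sketch passes dtype=int to get literal 0s and 1s.

```python
import pandas as pd

# The example labels from the text above
animals = pd.Series(["cat", "dog", "bird", "cat"])

# One indicator column per distinct label.
# dtype=int asks for 0/1 integers (newer pandas defaults to booleans).
dummies = pd.get_dummies(animals, dtype=int)

print(dummies)
#    bird  cat  dog
# 0     0    1    0
# 1     0    0    1
# 2     1    0    0
# 3     0    1    0
```

Notice the new columns come back in sorted order (bird, cat, dog), not in the order the labels first appeared.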
This function is heavily used when preparing data for machine learning algorithms. For instance, random forest doesn't do great with columns that have labels. It's best to turn these into dummy indicator columns.
For example, say we are creating dummy columns from a 'Name' column. A new column is created for every distinct value we had in our original 'Name' column.
Pseudo code: For each distinct value in your original categorical column, create a new column with an indicator (0 or 1).
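The pseudo code can be written out by hand to see what is going on. This is only a sketch of the idea, not how pandas implements it internally. The sample names in the 'Name' column are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data for a 'Name' column
df = pd.DataFrame({"Name": ["Bob", "Sally", "Bob", "Amy"]})

# Manual version: one new 0/1 column per distinct value
manual = pd.DataFrame(index=df.index)
for value in sorted(df["Name"].unique()):
    # 1 where the row matches this value, 0 otherwise
    manual[f"Name_{value}"] = (df["Name"] == value).astype(int)

# Built-in version for comparison (dtype=int gives 0/1 instead of booleans)
built_in = pd.get_dummies(df, columns=["Name"], dtype=int)

print(manual)
print(built_in)
```

Both frames should end up with the columns Name_Amy, Name_Bob, and Name_Sally holding the same indicators.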
Pandas Get Dummies
Be careful: if your categorical column has too many distinct values in it, you'll quickly explode the number of new dummy columns. Before you run pd.get_dummies(), make sure to run pd.Series.nunique() to see how many new columns you'll create.
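A quick way to do that check is sketched below. The sample data and the MAX_NEW_COLUMNS threshold are hypothetical; pick whatever limit makes sense for your data.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "Name": ["Bob", "Sally", "Bob", "Amy"],
    "City": ["SF", "NYC", "SF", "SF"],
})

MAX_NEW_COLUMNS = 50  # hypothetical limit, not a pandas setting

# nunique() counts distinct non-null values, which is the number
# of dummy columns get_dummies would create for this column
n_distinct = df["Name"].nunique()

if n_distinct <= MAX_NEW_COLUMNS:
    dummies = pd.get_dummies(df["Name"], dtype=int)
else:
    dummies = None
    print(f"'Name' has {n_distinct} distinct values, too many to dummy")

# Distinct counts for every column at once
per_column = df.nunique()
print(per_column)
```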
Get Dummies Parameters
- data: The data that you want to create dummy indicator columns from. This will be your DataFrame or Series
- prefix (Default: None): You'd use this parameter if you wanted to add a prefix (string at the beginning) to your new column names. This can be helpful for identifying which columns are dummies afterward.
- prefix_sep (Default: '_'): When you want to get fancy, you could specify what you want between your prefix and column names. The default is '_'. I doubt you'll ever change this. If you do, please tweet about it and @ DataIndependent.
- dummy_na (Default: False): Used if you want to create a dummy column for your NA values.
- columns (Default: None): Defines which columns from your DataFrame you'd like to get dummies for. By default it's every column of object, string, or category type (no ints, floats, etc.)
- sparse (Default: False): Sometimes your dummy columns are sparse. This means there are a TON of 0s because you have a high number of distinct values in your original column. sparse=True stores the dummy columns in a sparse format, which helps cut down memory use in this case.
- drop_first (Default: False): Advanced option. Only use this if you know what you're doing. Dropping the dummy column for your first category is possible because if every other dummy column is 0, then your first value would have been 1. What you remove in redundancy, you gain in confusion.
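Here is a short sketch that exercises several of these parameters together. The DataFrame (including the 'Age' column and the sample names) is made up for illustration, and dtype=int is passed so the output shows 0s and 1s on any recent pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing name
df = pd.DataFrame({
    "Name": ["Bob", "Sally", np.nan, "Bob"],
    "Age": [31, 25, 40, 31],
})

# columns + prefix + prefix_sep: only 'Name' is encoded,
# and new columns are named "is-Bob", "is-Sally"
with_prefix = pd.get_dummies(
    df, columns=["Name"], prefix="is", prefix_sep="-", dtype=int
)

# dummy_na=True: adds an extra indicator column for missing values
with_na = pd.get_dummies(df, columns=["Name"], dummy_na=True, dtype=int)

# drop_first=True: drops the first category ("Bob" here)
dropped = pd.get_dummies(df, columns=["Name"], drop_first=True, dtype=int)

# sparse=True: dummy columns are stored with a sparse dtype
as_sparse = pd.get_dummies(df, columns=["Name"], sparse=True, dtype=int)

print(list(with_prefix.columns))
print(list(with_na.columns))
print(list(dropped.columns))
print(as_sparse.dtypes)
```

Without dummy_na=True, the row with the missing name simply gets 0 in every dummy column.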
Here's a Jupyter notebook showing how to get dummies in Pandas