Pandas Get Dummies – pd.get_dummies()

Create dummy variables from categorical columns using pandas

When you’re doing machine learning, you’ll work with algorithms that cannot process categorical variables. In this case, you need to turn your column of labels (Ex: [‘cat’, ‘dog’, ‘bird’, ‘cat’]) into separate columns of 0s and 1s. This is called creating dummy columns, or “getting dummies,” in pandas.

Pandas pd.get_dummies() will turn your categorical column (column of labels) into indicator columns (columns of 0s and 1s).

1. pd.get_dummies(your_data)

This function is heavily used when preparing data for machine learning algorithms. For instance, random forest doesn’t do great with columns that have labels. It’s best to turn these into dummy indicator columns.

In the above scenario, we are creating dummy columns from our “Name” column. A new column is created for every distinct value we had in our original “Name” column.
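Here is a minimal sketch of that idea. The sample DataFrame (the “Name” and “Age” values) is made up for illustration. Recent pandas versions return True/False indicators by default, so `dtype=int` is passed here to get 0s and 1s.

```python
import pandas as pd

# Hypothetical sample data, invented for this illustration
df = pd.DataFrame({
    "Name": ["cat", "dog", "bird", "cat"],
    "Age": [3, 5, 1, 7],
})

# One new column per distinct value in "Name".
# dtype=int requests 0/1 indicators instead of booleans.
dummies = pd.get_dummies(df["Name"], dtype=int)
print(dummies)
```

Each row has a 1 in the column that matches its original label and 0s everywhere else.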

Pseudo code: For each distinct value in your original categorical column, create a new column with an indicator (0 or 1).
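The pseudo code can be written out by hand to see what is going on. This is only a sketch of the idea, not how pandas implements it internally. The sample Series is made up for illustration.

```python
import pandas as pd

# Hypothetical sample column, invented for this illustration
names = pd.Series(["cat", "dog", "bird", "cat"], name="Name")

# For each distinct value, build a 0/1 column marking the rows that match it
manual = pd.DataFrame(index=names.index)
for value in sorted(names.unique()):
    manual[value] = (names == value).astype(int)

# The built-in function, asked for integer indicators
builtin = pd.get_dummies(names, dtype=int)

# Compare column names and cell values of the two results
same = (
    list(manual.columns) == list(builtin.columns)
    and manual.values.tolist() == builtin.values.tolist()
)
print(same)
```

In practice you’d just call pd.get_dummies(), but the loop shows there is no magic behind it.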

Be careful: if your categorical column has too many distinct values, you’ll quickly explode the number of new dummy columns. Before you run pd.get_dummies(), run pd.Series.nunique() on the column to see how many new columns you’ll create.
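A quick check along those lines might look like this. The sample data and the `MAX_NEW_COLUMNS` threshold are hypothetical. Pick a limit that makes sense for your own dataset.

```python
import pandas as pd

# Hypothetical sample data, invented for this illustration
df = pd.DataFrame({
    "Name": ["cat", "dog", "bird", "cat"],
    "City": ["Austin", "Boston", "Austin", "Austin"],
})

# Hypothetical limit on how many dummy columns one source column may produce
MAX_NEW_COLUMNS = 50

# Number of distinct values = number of dummy columns get_dummies would create
n_distinct = df["Name"].nunique()
print(n_distinct)

# Same check for every column at once
counts = df.nunique()
too_wide = counts[counts > MAX_NEW_COLUMNS].index.tolist()
print(too_wide)
```

Any column that lands in `too_wide` is a candidate for grouping rare values together before you create dummies.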

Get Dummies Parameters

  • data: The data that you want to create dummy indicator columns from. This will be your DataFrame or Series.
  • prefix (Default: None): You’d use this parameter if you wanted to add a prefix (a string at the beginning) to your new column names. This can be helpful for identifying which columns are dummies afterward.
  • prefix_sep (Default: “_”): When you want to get fancy, you could specify what you want between your prefix and column names. The default is “_”. I doubt you’ll ever change this. If you do, please tweet about it and @ DataIndepedent.
  • dummy_na (Default: False): Used if you want to create a dummy column for your NA values.
  • columns (Default: None): Defines which columns from your DataFrame you’d like to get dummies for. By default it’s every column of object or category type (no ints, floats, etc.)
  • sparse (Default: False): Sometimes your dummy columns are sparse. This means there are a TON of 0s because you have a high number of distinct values in your original column. sparse=True stores the dummy columns as sparse arrays, which saves memory in this case.
  • drop_first (Default: False): Advanced option – only use this if you know what you’re doing. Dropping the dummy column for your first category is possible because if every other dummy column is 0, then this means your first value would have been 1. What you remove in redundancy, you gain in confusion.
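The parameters above are easier to see in action. Here is a short sketch that tries several of them. The sample DataFrame and the “pet” prefix are made up for illustration, and `dtype=int` is passed so the indicators come back as 0s and 1s.

```python
import pandas as pd

# Hypothetical sample data with one missing label, invented for this illustration
df = pd.DataFrame({
    "Name": ["cat", "dog", None, "cat"],
    "Size": ["small", "large", "small", "small"],
    "Age": [3, 5, 1, 7],
})

# columns + prefix + prefix_sep: only encode "Name" and label the new columns
named = pd.get_dummies(df, columns=["Name"], prefix="pet", prefix_sep="-", dtype=int)
print(list(named.columns))

# dummy_na: add an extra indicator column for missing values
with_na = pd.get_dummies(df["Name"], dummy_na=True, dtype=int)
print(with_na)

# drop_first: leave out the dummy column for the first category
dropped = pd.get_dummies(df["Size"], drop_first=True, dtype=int)
print(dropped)

# sparse: back the dummy columns with sparse arrays
sparse = pd.get_dummies(df["Size"], sparse=True, dtype=int)
print(sparse.dtypes)
```

With drop_first=True on “Size”, a 0 in the one remaining column means the row belonged to the dropped category.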

Here’s a Jupyter notebook showing how to get dummies in Pandas

Link to code

Official Documentation
