
Pandas Duplicated – pd.Series.duplicated()

Identify duplicate values in pandas

You may want to know whether your DataFrame or Series contains duplicate values. That’s where pandas duplicated, or pd.Series.duplicated(), comes in.

You use pandas duplicated when you want to flag repeated values, either to remove them or to set them aside for further analysis.

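To do this, simply call YourSeries.duplicated(). It returns a boolean Series that marks repeated values as True. By default, the first occurrence of each value stays False and only the later repeats are flagged.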
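Here’s a minimal sketch of that basic call. The Series contents are made up for illustration and the variable names are just placeholders.

```python
import pandas as pd

# Hypothetical sample data: "a" appears three times, "b" and "c" once each
s = pd.Series(["a", "b", "a", "c", "a"])

# With no arguments, duplicated() uses keep="first":
# the first "a" stays False, the later "a" values become True
flags = s.duplicated()

print(flags.tolist())  # [False, False, True, False, True]
```

The result is a boolean Series with the same index as the original, so you can use it directly as a mask.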

But you may want to treat your duplicates differently. Do you want to leave the first occurrence unmarked? Or the last? Pandas lets you pick. However, it’s a bit counterintuitive, so let’s look at the options.

  • Method 1 – keep='first' (the default): For when you want to mark all duplicates as True…EXCEPT for the first one.
  • Method 2 – keep='last': For when you want to mark all duplicates as True…EXCEPT for the last one.
  • Method 3 – keep=False: For when you want to mark all duplicates as True.
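The three options above can be compared side by side. This is a small sketch on made-up data; the column names in the comparison table are hypothetical labels chosen for readability.

```python
import pandas as pd

# Hypothetical sample data: "x" appears at positions 0, 2 and 4
s = pd.Series(["x", "y", "x", "z", "x"])

first = s.duplicated(keep="first")  # every "x" except the first is True
last = s.duplicated(keep="last")    # every "x" except the last is True
every = s.duplicated(keep=False)    # every "x" is True

# Put the results next to the values to compare them row by row
comparison = pd.DataFrame(
    {"value": s, "keep_first": first, "keep_last": last, "keep_False": every}
)
print(comparison)
```

Note that 'first' and 'last' are lowercase strings, while False is the Python boolean, not a string.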

Pandas duplicated

Check out how the different options below match up against each other. We have a pandas Series listing different cities in the US. San Francisco and Dallas appear multiple times and are therefore duplicates. New York and Miami only appear once and are not duplicates.

Notice how New York and Miami never return True with .duplicated(). This is because each of them appears only once, so there aren’t any duplicates!

However, San Francisco and Dallas do return True. Well, sometimes, depending on the “keep” you choose.
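Here’s a rough sketch of the city example in code. The exact order of the cities is an assumption, since only the names and which ones repeat are described above. It also shows one way to use the flags to remove or isolate repeats.

```python
import pandas as pd

# Assumed ordering; San Francisco and Dallas repeat, New York and Miami do not
cities = pd.Series(
    ["San Francisco", "New York", "Dallas", "San Francisco", "Miami", "Dallas"]
)

keep_first = cities.duplicated(keep="first")
keep_last = cities.duplicated(keep="last")
keep_none = cities.duplicated(keep=False)

print(keep_first.tolist())  # [False, False, False, True, False, True]
print(keep_last.tolist())   # [True, False, True, False, False, False]
print(keep_none.tolist())   # [True, False, True, True, False, True]

# Remove repeats, keeping the first occurrence of each city
deduped = cities[~keep_first]

# Keep only the cities that never repeat
unique_only = cities[~keep_none]

print(deduped.tolist())      # ['San Francisco', 'New York', 'Dallas', 'Miami']
print(unique_only.tolist())  # ['New York', 'Miami']
```

New York and Miami are False in all three results, while San Francisco and Dallas change depending on the keep you pass. Inverting the mask with ~ is what turns "flag the duplicates" into "drop the duplicates".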

Here’s another example in a Jupyter notebook

Link to code

Official Documentation
