Pandas Duplicated – pd.Series.duplicated()
Identify duplicate values in pandas
You may want to know if you have duplicate values in your DataFrame or Series. That's where Pandas Duplicated or pd.Series.duplicated() comes in.
You use pandas duplicated when you want to find repeated values, either to remove them or to flag them for further analysis.
To do this, simply call YourSeries.duplicated(). It returns a boolean Series that is True wherever a value repeats one that appeared earlier.
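As a minimal sketch of the basic call, the snippet below flags repeats and then uses the resulting boolean mask to either inspect or drop them. The Series values and the variable names (s, mask, repeats, deduped) are made up for illustration and do not come from the original post.

```python
import pandas as pd

# Hypothetical example data (not from the original post)
s = pd.Series([10, 20, 10, 30, 20])

# With the default keep='first', the first occurrence of each value
# is False and every later repeat is True
mask = s.duplicated()
print(mask.tolist())  # [False, False, True, False, True]

# Flag: pull out the repeated entries for further analysis
repeats = s[mask]

# Remove: keep only the entries that were not flagged
deduped = s[~mask]
print(deduped.tolist())  # [10, 20, 30]
```

Filtering with ~mask gives the same values as s.drop_duplicates(), but keeping the mask around is handy when you want to look at the repeats before deciding what to do with them.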
But you may want to treat your duplicates differently. Do you want to know about the first duplicate? Or the last? Pandas lets you pick with the keep parameter. However, it's a bit counterintuitive, so let's look at the options.
- Method 1 – keep='first' (the default): For when you want to mark all duplicates as True... EXCEPT for the first one.
- Method 2 – keep='last': For when you want to mark all duplicates as True... EXCEPT for the last one.
- Method 3 – keep=False: For when you want to mark all duplicates as True.
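The three options above can be sketched on a tiny Series. The letters used here are hypothetical sample data, chosen only so the difference between the settings is easy to see. Note that the string values are lowercase: 'first' and 'last'.

```python
import pandas as pd

# Hypothetical sample data: 'a' appears three times, 'b' once
letters = pd.Series(['a', 'b', 'a', 'a'])

# Method 1: every 'a' is flagged except the first one
first = letters.duplicated(keep='first').tolist()
print(first)  # [False, False, True, True]

# Method 2: every 'a' is flagged except the last one
last = letters.duplicated(keep='last').tolist()
print(last)   # [True, False, True, False]

# Method 3: every 'a' is flagged, with no exceptions
every = letters.duplicated(keep=False).tolist()
print(every)  # [True, False, True, True]
```

The name keep makes more sense if you think about removal: the occurrence you "keep" is the one that stays False, so it survives when you filter out the True rows.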
Pandas duplicated
Check out how the different options below match up against each other. We have a pandas Series listing out different cities in the US. San Francisco and Dallas appear multiple times and therefore are duplicates. New York and Miami only appear once and are not duplicates.
Notice how New York and Miami never return True with .duplicated(). This is because each of them appears only once, so there aren't any duplicates!
However, San Francisco and Dallas do return True. Well, sometimes, depending on the 'keep' you choose.
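To line the options up side by side, here is a rough sketch that rebuilds a city Series like the one described above. The exact order and number of repeats are assumptions, since the original data is not shown here, and the column names are made up for the comparison.

```python
import pandas as pd

# Assumed city list: the original ordering and counts are not known
cities = pd.Series(['San Francisco', 'San Francisco', 'Dallas',
                    'New York', 'Dallas', 'Miami'])

# One column per keep option, so each row shows how a city is treated
comparison = pd.DataFrame({
    'city': cities,
    'keep_first': cities.duplicated(keep='first'),
    'keep_last': cities.duplicated(keep='last'),
    'keep_false': cities.duplicated(keep=False),
})
print(comparison)

# Cities that are never flagged under any option
flag_cols = ['keep_first', 'keep_last', 'keep_false']
never_flagged = comparison.loc[~comparison[flag_cols].any(axis=1), 'city']
print(never_flagged.tolist())  # ['New York', 'Miami']
```

In this sketch, each San Francisco and Dallas row is True in some columns and False in others, while New York and Miami stay False across the board.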
Here's another example in a Jupyter notebook