Pandas Drop Duplicates – pd.df.drop_duplicates()
Drop duplicates in pandas DataFrame
Do you ever have repeated rows in your data when you don't want them? Pandas drop_duplicates will remove these for you.
Pandas DataFrame.drop_duplicates() will remove any duplicate rows (or rows that are duplicates on a subset of columns) from your DataFrame. It is super helpful when you want to make sure your data has a unique key or unique rows.
You can think of this function as a combination of DataFrame.duplicated(), which flags the repeated rows, and DataFrame.drop(), which removes them.
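To make that relationship concrete, here is a minimal sketch with made-up sample data (the column names and values are assumptions for illustration). It suggests that filtering on the mask from duplicated() gives the same rows as drop_duplicates() with default settings.

```python
import pandas as pd

# Hypothetical sample data; the first and third rows are identical.
df = pd.DataFrame({
    "name": ["Foreign Cinema", "Liho Liho", "Foreign Cinema"],
    "city": ["San Francisco", "San Francisco", "San Francisco"],
})

# duplicated() marks every repeat of an earlier row as True.
mask = df.duplicated()

# Keeping only the unmarked rows...
via_mask = df[~mask]

# ...matches what drop_duplicates() returns with its defaults.
via_method = df.drop_duplicates()

print(via_mask.equals(via_method))
```

In practice drop_duplicates() is the shorter spelling; the mask version is handy when you want to inspect the repeated rows before removing them.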
I use this function most when I have a column that represents a unique id of an object. I'll run .drop_duplicates() specifying my unique column as the subset.
Pseudo code: Look at all the rows (or a subset of columns within your rows) and see if there are duplicates. If so, drop 'em.
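As a rough sketch of the unique-id use case, the example below uses a hypothetical customers table where customer_id is supposed to be unique. The data and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical data: customer_id should be unique, but id 2 appears twice.
customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "email": [
        "a@example.com",
        "b@example.com",
        "b.new@example.com",
        "c@example.com",
    ],
})

# The full rows all differ, so the default whole-row check removes nothing.
whole_row = customers.drop_duplicates()

# Checking only customer_id treats the two id-2 rows as duplicates
# and keeps the first one it sees.
by_id = customers.drop_duplicates(subset="customer_id")

print(len(whole_row), len(by_id))
```

Note that the whole-row check leaves all four rows alone, which is why specifying the id column as the subset matters here.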
Pandas Drop Duplicates
.drop_duplicates() is pretty straightforward. The two decisions you'll have to make are 1) What subset of your data do you want pandas to evaluate for duplicates? and 2) Do you want to keep the first, the last, or none of your duplicates?
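Here is a small sketch of how the second decision plays out, using invented sensor readings (the names and values are assumptions). Each call checks only the sensor column and differs only in which duplicate survives.

```python
import pandas as pd

# Hypothetical data: sensor "a" reports twice.
readings = pd.DataFrame({
    "sensor": ["a", "b", "a"],
    "value": [10, 20, 30],
})

# Keep the first reading per sensor.
first = readings.drop_duplicates(subset=["sensor"], keep="first")

# Keep the last reading per sensor.
last = readings.drop_duplicates(subset=["sensor"], keep="last")

# Keep no reading for any sensor that appears more than once.
none = readings.drop_duplicates(subset=["sensor"], keep=False)

print(first["value"].tolist(), last["value"].tolist(), none["value"].tolist())
```

Which option fits depends on your data: keeping the last row is a common choice when later rows are newer, assuming the DataFrame is already sorted that way.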
Duplicate Parameters
- subset: By default, pandas will look at your entire row to see if it is a duplicate of any other entire row. However, you can tell pandas to compare only certain columns instead of the whole row. The subset parameter specifies which column (or list of columns) you would like pandas to evaluate.
- keep (Default: 'first'): If you have duplicate rows, you can also tell pandas which one(s) to drop. keep='first' will keep the first occurrence and drop the rest. keep='last' will keep the last occurrence and drop the rest. keep=False will drop all of them.
- inplace (Default: False): If True, the operation is done in place (it writes over your current DataFrame) and None is returned. If False, a new DataFrame is returned to you and the original is left unchanged.
- ignore_index (Default: False): If True, the index of the returned DataFrame will be relabeled 0, 1, 2, …, n-1. If False, the prior index is kept, minus the labels of the rows that were dropped.
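The last two parameters are easiest to see side by side. This is a minimal sketch on a tiny made-up DataFrame; it assumes a pandas version that supports ignore_index (1.0 or later).

```python
import pandas as pd

# Hypothetical data: the first two rows are duplicates.
df = pd.DataFrame({"letter": ["x", "x", "y"]})

# Default: surviving rows keep their original index labels.
kept_index = df.drop_duplicates()

# ignore_index=True relabels the result starting from 0.
fresh_index = df.drop_duplicates(ignore_index=True)

# inplace=True modifies df itself and returns None.
result = df.drop_duplicates(inplace=True)

print(kept_index.index.tolist(), fresh_index.index.tolist(), result)
```

Because inplace=True returns None, assigning its result back to your DataFrame variable would wipe it out, so use one style or the other.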
Here's a Jupyter notebook showing how to drop duplicates in Pandas