Pandas Value Counts - pd.Series.value_counts()
Pandas value counts to analyze column value distribution
Often when you're doing exploratory data analysis (EDA), you'll need to get a better feel for a column. One of the best ways to do this is to understand the distribution of values within your column. This is where Pandas Value Counts comes in.
The Pandas Series.value_counts() function returns a Series containing the counts of the unique values in your Series. By default, the resulting Series is sorted in descending order, so the first element is the most frequent value.
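Here's a quick sketch of that behavior. The fruit Series below is made-up example data, and the exact output formatting varies by pandas version:

```python
import pandas as pd

# Made-up example Series
fruit = pd.Series(["apple", "banana", "apple", "cherry", "apple", "banana"])

# Count each unique value, most frequent first
print(fruit.value_counts())
# apple     3
# banana    2
# cherry    1
```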
I usually do this when I want to get a bit more intimate with my data. My workflow goes:
- Run pandas.Series.nunique() first - This will count how many unique values I have. If it's 100K+, it'll slow down my computer once I call value_counts.
- Run pandas.Series.value_counts() - This will tell me which values appear most frequently (see the sketch after this list).
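Here's a minimal sketch of that two-step workflow. The DataFrame and its "city" column are hypothetical, and the 100K cutoff is just a rule of thumb:

```python
import pandas as pd

# Hypothetical DataFrame with a categorical column
df = pd.DataFrame({"city": ["NYC", "LA", "NYC", "SF", "LA", "NYC"]})

# Step 1: how many unique values are we dealing with?
n_unique = df["city"].nunique()
print(n_unique)  # 3

# Step 2: if the count is manageable, look at the full distribution
if n_unique < 100_000:
    print(df["city"].value_counts())
```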
Pseudo code: Take a DataFrame column (or Series) and find the distinct values. Then count how many times each distinct value occurs.
Hint: You can also do this across unique rows in a DataFrame by calling pandas.DataFrame.value_counts()
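For example (made-up data; pandas.DataFrame.value_counts() requires pandas 1.1 or later), the DataFrame version counts unique rows rather than unique values in a single column:

```python
import pandas as pd

# Made-up DataFrame; DataFrame.value_counts() counts unique rows
df = pd.DataFrame({
    "color": ["red", "red", "blue", "red"],
    "size":  ["S", "S", "M", "L"],
})

# Each (color, size) combination is counted, most common row first
print(df.value_counts())
# color  size
# red    S       2
#        L       1
# blue   M       1
```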
Pandas Value Counts
By default, you don't need to pass any parameters when counting the values. Let's take a look at the different parameters you can pass to pd.Series.value_counts():
- normalize (Default: False): If True, you'll get back the relative frequencies of the unique values. This means that instead of counts, the Series returned will show the proportion each unique value makes up of the whole Series.
- sort (Default: True): This will return your values sorted by frequency. The exact order is determined by the next parameter (ascending).
- ascending (Default: False): If True, your values will be returned in ascending order of frequency (least common on top). By default, the most frequent values appear first.
- bins: Sometimes you're working with a continuous variable (think a range of numbers vs discrete labels). In this case you'll have too many unique values to pull signal from your data. If you set bins (ex: [0, .25, .5, .75, 1]), each value is assigned to a bin based on where it falls, and value_counts will count the bin frequency instead of the distinct value frequency. Check out the video or code below for more.
- dropna (Default: True): This controls whether the NaNs in your Series are counted. By default (True) they are dropped; set it to False to include a count of NaN values.
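Here's a rough sketch of those parameters in action on a made-up Series (exact output formatting varies by pandas version):

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "a", "a", "c", np.nan])

# normalize=True: relative frequencies instead of raw counts
print(s.value_counts(normalize=True))   # a: 0.6, b: 0.2, c: 0.2 (NaN dropped)

# ascending=True: least frequent values first
print(s.value_counts(ascending=True))

# dropna=False: NaN gets its own row in the result
print(s.value_counts(dropna=False))

# bins: bucket a continuous variable before counting
nums = pd.Series([0.1, 0.2, 0.35, 0.6, 0.9])
print(nums.value_counts(bins=[0, 0.25, 0.5, 0.75, 1]))
```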
Here's a Jupyter notebook showing how to use value_counts in Pandas