The 6 AI Engineering Patterns, come build with Greg live:Β Starts Jan 6th, 2025
Leverage
Pandas functions

Pandas Merge – Join Data – pd.DataFrame.merge()

Pandas merge two DataFrames. Join datasets on common columns. Control join types and selected columns

Pandas Merge is another Top 10 Pandas function you must know. Simply, if you have two datasets that are related together, how do you bring them together?

Pandas Merge will join two DataFrames together resulting in a single, final dataset. You have full control how your two datasets are combined.

In this post, we’ll review the mechanics of Pandas Merge and go over different scenarios to use it on. For a tutorial on the different types of joins, check out our future post on Data Joins.

There are 3 core ideas you’ll need to grasp

  1. What data do you want to join? – This will be determined by your β€˜right’ and β€˜left’ datasets.
  2. How do you want to join them? – You have different ways to join two datasets – left, right, full, inner. It’s up to you depending on your final output requirements.
  3. What do you want to join them with? – You also need to tell Pandas how your two datasets are related. What is the common column between the two?
1. pd.DataFrame.merge(other_dataset,
                      how='type_of_join',
                      on='columns_to_join_on')

Pseudo code: For a given DataFrame, join another related DataFrame that results in a new, combined dataset.

Pandas Merge

Pandas Merge Tip

When I merge two DataFrames, there are often columns I don’t want to merge in either dataset. For example, say I have two DataFrames with 100 columns distinct columns each, but I only care about 3 columns from each one.

Instead of joining two entire DataFrames together, I’ll only join a subset of columns together. See the code below.

# This join brings together the entire DataFame
df.merge(df2)

# This join only brings together a subset of columns
# 'col1' is my key in this case
df[['col1', 'col2']].merge(df2[['col1', 'col3']], on='col1')

Merge Parameters

  • right: The DataFrame you’re calling .merge() is considered your β€˜left’ dataset. You need to specify your other dataset in the right parameter. This can be another DataFrame or named Series.
  • how (β€˜left’, β€˜right’, β€˜outer’, β€˜inner’, default= β€˜inner’): How will determine β€˜how’ to join your two datasets together. Do you want to keep all of your samples from your left df? Or your right? Maybe just where they have common rows. For a tutorial on different types of joins, check out this resource.
  • on: Sometimes called merge/join β€˜key’. The common column between your two datasets that you’d like to join on. You’ll use on when the two columns have the same name. If they aren’t named the same, then try left_ or right_ on
  • left_on/right_on: If your columns to join on do not have the same name, no problem, simply pass their names into left_on (for your left dataset) and right_on (for your right dataset).
  • left_index/right_index: Alternatively, instead of specifying a column, if you column to join on sits within a DataFrame’s index, you can set left_index/right_index=True.
  • suffixes (Default=(β€˜_x’, β€˜_y’): If you wanted to add a suffix (to help tell which columns came from which DataFrame) to the end of your newly-merged columns you can add them here.
  • indicator (Default = False): Helpful function that will attribute each row to a specific DataFrame or both. Simply, was this row only on the left side, right side, or was it shared by both?
  • validate: Do a sanity check on both of your datasets before you merge. One of the β€˜gotchas’ with merges is when there are unknown/unintended duplicates in either of your datasets. The validate parameter will help you check for these.
    • β€œone_to_one” or β€œ1:1”: checks if the merge column is unique in both left and right datasets.
    • β€œone_to_many” or β€œ1:m”: checks if the merge column is unique in left dataset
    • β€œmany_to_one” or β€œm:1”: checks if the merge column is unique in right dataset
    • β€œmany_to_many” or β€œm:m”: does not check duplicates in either column

Let’s take a look at a code sample

Link to code

Official Documentation

On this page