Pandas Merge β Join Data β pd.DataFrame.merge()
Pandas merge two DataFrames. Join datasets on common columns. Control join types and selected columns
Pandas Merge is another Top 10 Pandas function you must know. Simply, if you have two datasets that are related together, how do you bring them together?
Pandas Merge will join two DataFrames together resulting in a single, final dataset. You have full control how your two datasets are combined.
In this post, weβll review the mechanics of Pandas Merge and go over different scenarios to use it on. For a tutorial on the different types of joins, check out our future post on Data Joins.
There are 3 core ideas youβll need to grasp
- What data do you want to join? β This will be determined by your βrightβ and βleftβ datasets.
- How do you want to join them? β You have different ways to join two datasets β left, right, full, inner. Itβs up to you depending on your final output requirements.
- What do you want to join them with? β You also need to tell Pandas how your two datasets are related. What is the common column between the two?
Pseudo code: For a given DataFrame, join another related DataFrame that results in a new, combined dataset.
Pandas Merge
Pandas Merge Tip
When I merge two DataFrames, there are often columns I donβt want to merge in either dataset. For example, say I have two DataFrames with 100 columns distinct columns each, but I only care about 3 columns from each one.
Instead of joining two entire DataFrames together, Iβll only join a subset of columns together. See the code below.
Merge Parameters
- right: The DataFrame youβre calling
.merge()
is considered your βleftβ dataset. You need to specify your other dataset in theright
parameter. This can be another DataFrame or named Series. - how (βleftβ, βrightβ, βouterβ, βinnerβ, default= βinnerβ): How will determine βhowβ to join your two datasets together. Do you want to keep all of your samples from your left df? Or your right? Maybe just where they have common rows. For a tutorial on different types of joins, check out this resource.
- on: Sometimes called merge/join βkeyβ. The common column between your two datasets that youβd like to join on. Youβll use
on
when the two columns have the same name. If they arenβt named the same, then try left_ or right_ on - left_on/right_on: If your columns to join on do not have the same name, no problem, simply pass their names into left_on (for your left dataset) and right_on (for your right dataset).
- left_index/right_index: Alternatively, instead of specifying a column, if you column to join on sits within a DataFrameβs index, you can set left_index/right_index=True.
- suffixes (Default=(β_xβ, β_yβ): If you wanted to add a suffix (to help tell which columns came from which DataFrame) to the end of your newly-merged columns you can add them here.
- indicator (Default = False): Helpful function that will attribute each row to a specific DataFrame or both. Simply, was this row only on the left side, right side, or was it shared by both?
- validate: Do a sanity check on both of your datasets before you merge. One of the βgotchasβ with merges is when there are unknown/unintended duplicates in either of your datasets. The
validate
parameter will help you check for these.- βone_to_oneβ or β1:1β: checks if the merge column is unique in both left and right datasets.
- βone_to_manyβ or β1:mβ: checks if the merge column is unique in left dataset
- βmany_to_oneβ or βm:1β: checks if the merge column is unique in right dataset
- βmany_to_manyβ or βm:mβ: does not check duplicates in either column
Letβs take a look at a code sample