Description
Context
The df.merge()
API merges two DataFrame
s in Pandas.
Problem
Although using the default parameter can produce the same result, explicitly specify on
and how
produce better readability. The parameter on
states which columns to join on, and the parameter how
describes the join method (e.g., outer, inner). Also, the validate
parameter will check whether the merge is of a specified type. If the developer assumes the merge keys are unique in both left and right datasets, but that is not the case, and he does not specify this parameter, the result might silently go wrong. The merge operation is usually computationally and memory expensive. It is preferable to do the merging process in one stroke for performance consideration.
Solution
Developer should explicitly specify the parameters for merge operation.
Type
Generic
Existing Stage
Data Cleaning
Effect
Readability & Error-prone
Example
import pandas as pd
df1 = pd.DataFrame({'key': ['foo', 'bar', 'baz', 'foo'],
'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'key': ['foo', 'bar', 'baz', 'foo'],
'value': [5, 6, 7, 8]})
- df3 = df1.merge(df2)
+ df3 = df1.merge(
+ df2,
+ how='inner',
+ on='key',
+ validate='m:m'
+ )