포스트

Merge_And_Join

Merge_And_Join

Combining Datasets: Merge and Join

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
import pandas as pd
import numpy as np


class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""

    def __init__(self, *args):
        self.args = args

    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)

    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)

관계 대수 Relational Algebra

pd.merge()에는 관계 데이터를 조작하는 규칙의 정형 집합이자 대부분의 데이터베이스에서 사용할 수 있는 연산의 개념적 기반을 형성하는 관계 대수의 하위 집합에 해당하는 행위가 구현
관계 대수 방식의 강점으로 데이터세트에 대한 복잡한 연산의 기본 구성요소가 되는 몇 가지 기초 연산은 제안함

조인 작업의 분류 Categories of Joins

pd.merge() 함수는 일대일, 다대일, 다대다 조인 같은 여러 가지 조인 유형을 구현
이 세가지 유형의 조인은 pd.merge() 인터페이스에서 동일한 호출을 통해 사용

일대일 조인 One-to-one joins

1
2
3
4
5
6
7
8
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
# 공통된 열을 가지고 있는 것을 파악해 자동으로 키로 사용해 조인 --> 인덱스를 버림
df3 = pd.merge(df1, df2)
df3
display('df1', 'df2', 'df3')

df1

employeegroup
0BobAccounting
1JakeEngineering
2LisaEngineering
3SueHR

df2

employeehire_date
0Lisa2004
1Bob2008
2Jake2012
3Sue2014

df3

employeegrouphire_date
0BobAccounting2008
1JakeEngineering2012
2LisaEngineering2004
3SueHR2014

다대일 조인 Many-to-one joins

1
2
3
4
# 두 개의 키 열 중 하나가 중복된 항목을 포함하는 경우 --> 중복을 보존
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                    'supervisor': ['Carly', 'Guido', 'Steve']})
display('df3', 'df4', 'pd.merge(df3, df4)')

df3

employeegrouphire_date
0BobAccounting2008
1JakeEngineering2012
2LisaEngineering2004
3SueHR2014

df4

groupsupervisor
0AccountingCarly
1EngineeringGuido
2HRSteve

pd.merge(df3, df4)

employeegrouphire_datesupervisor
0BobAccounting2008Carly
1JakeEngineering2012Guido
2LisaEngineering2004Guido
3SueHR2014Steve

다대다 조인 Many-to-many joins

1
2
3
4
5
6
# 두 배열의 키 열에 모두 중복 항목이 존재
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',
                              'Engineering', 'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux',
                               'spreadsheets', 'organization']})
display('df1', 'df5', "pd.merge(df1, df5)")

df1

employeegroup
0BobAccounting
1JakeEngineering
2LisaEngineering
3SueHR

df5

groupskills
0Accountingmath
1Accountingspreadsheets
2Engineeringcoding
3Engineeringlinux
4HRspreadsheets
5HRorganization

pd.merge(df1, df5)

employeegroupskills
0BobAccountingmath
1BobAccountingspreadsheets
2JakeEngineeringcoding
3JakeEngineeringlinux
4LisaEngineeringcoding
5LisaEngineeringlinux
6SueHRspreadsheets
7SueHRorganization

병합 키 지정 Specification of the Merge Key

The on keyword

1
2
# 키 열의 이름을 명시적으로 지정
display('df1', 'df2', "pd.merge(df1, df2, on='employee')")

df1

employeegroup
0BobAccounting
1JakeEngineering
2LisaEngineering
3SueHR

df2

employeehire_date
0Lisa2004
1Bob2008
2Jake2012
3Sue2014

pd.merge(df1, df2, on='employee')

employeegrouphire_date
0BobAccounting2008
1JakeEngineering2012
2LisaEngineering2004
3SueHR2014

The left_on and right_on keywords

1
2
3
4
# 다른 열 이름을 가진 두 데이터세트 병합 --> 불필요한 중복 열 발생
df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'salary': [70000, 80000, 120000, 90000]})
display('df1', 'df3', 'pd.merge(df1, df3, left_on="employee", right_on="name")')

df1

employeegroup
0BobAccounting
1JakeEngineering
2LisaEngineering
3SueHR

df3

namesalary
0Bob70000
1Jake80000
2Lisa120000
3Sue90000

pd.merge(df1, df3, left_on="employee", right_on="name")

employeegroupnamesalary
0BobAccountingBob70000
1JakeEngineeringJake80000
2LisaEngineeringLisa120000
3SueHRSue90000
1
2
# drop() 메서드를 통한 중복된 열 삭제
pd.merge(df1, df3, left_on="employee", right_on="name").drop('name', axis=1)
employeegroupsalary
0BobAccounting70000
1JakeEngineering80000
2LisaEngineering120000
3SueHR90000

The left_index and right_index keywords

1
2
3
4
# 병합 키로 인덱스를 사용
df1a = df1.set_index('employee')
df2a = df2.set_index('employee')
display('df1a', 'df2a', "pd.merge(df1a, df2a, left_index=True, right_index=True)")

df1a

group
employee
BobAccounting
JakeEngineering
LisaEngineering
SueHR

df2a

hire_date
employee
Lisa2004
Bob2008
Jake2012
Sue2014

pd.merge(df1a, df2a, left_index=True, right_index=True)

grouphire_date
employee
BobAccounting2008
JakeEngineering2012
LisaEngineering2004
SueHR2014
1
2
# DataFrame은 기본적으로 join() 메서드를 통해 인덱스 기반 조인
display('df1a', 'df2a', 'df1a.join(df2a)')

df1a

group
employee
BobAccounting
JakeEngineering
LisaEngineering
SueHR

df2a

hire_date
employee
Lisa2004
Bob2008
Jake2012
Sue2014

df1a.join(df2a)

grouphire_date
employee
BobAccounting2008
JakeEngineering2012
LisaEngineering2004
SueHR2014
1
2
# 인덱스와 열을 섞고자 할 때 사용
display('df1a', 'df3', "pd.merge(df1a, df3, left_index=True, right_on='name')")

df1a

group
employee
BobAccounting
JakeEngineering
LisaEngineering
SueHR

df3

namesalary
0Bob70000
1Jake80000
2Lisa120000
3Sue90000

pd.merge(df1a, df3, left_index=True, right_on='name')

groupnamesalary
0AccountingBob70000
1EngineeringJake80000
2EngineeringLisa120000
3HRSue90000

조인을 위한 집합 연산 지정 Specifying Set Arithmetic for Joins

1
2
3
4
5
6
7
8
df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
                    'food': ['fish', 'beans', 'bread']},
                   columns=['name', 'food'])
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
                    'drink': ['wine', 'beer']},
                   columns=['name', 'drink'])
# 기본적으로 교집합 --> how='inner' 를 통해 명시적 지정
display('df6', 'df7', "pd.merge(df6, df7, how='inner')")

df6

namefood
0Peterfish
1Paulbeans
2Marybread

df7

namedrink
0Marywine
1Josephbeer

pd.merge(df6, df7, how='inner')

namefooddrink
0Marybreadwine
1
2
# 합집합 --> 누락된 값은 NA로 채움
display('df6', 'df7', "pd.merge(df6, df7, how='outer')")

df6

namefood
0Peterfish
1Paulbeans
2Marybread

df7

namedrink
0Marywine
1Josephbeer

pd.merge(df6, df7, how='outer')

namefooddrink
0JosephNaNbeer
1Marybreadwine
2PaulbeansNaN
3PeterfishNaN
1
2
# 오른쪽 항목과 왼쪽 항목을 기준으로 조인
display("pd.merge(df6, df7, how='right')", "pd.merge(df6, df7, how='left')")

pd.merge(df6, df7, how='right')

namefooddrink
0Marybreadwine
1JosephNaNbeer

pd.merge(df6, df7, how='left')

namefooddrink
0PeterfishNaN
1PaulbeansNaN
2Marybreadwine

열 이름이 겹치는 경우(Overlapping): The suffixes Keyword

1
2
3
4
5
6
df8 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [1, 2, 3, 4]})
df9 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [3, 1, 4, 2]})
# 충돌 열이 있을 경우 자동으로 접미사 _x나 _y를 덧붙임
display('df8', 'df9', 'pd.merge(df8, df9, on="name")')

df8

namerank
0Bob1
1Jake2
2Lisa3
3Sue4

df9

namerank
0Bob3
1Jake1
2Lisa4
3Sue2

pd.merge(df8, df9, on="name")

namerank_xrank_y
0Bob13
1Jake21
2Lisa34
3Sue42
1
2
# 접미사를 suffixes를 통해 별도로 지정
display('df8', 'df9', 'pd.merge(df8, df9, on="name", suffixes=["_L", "_R"])')

df8

namerank
0Bob1
1Jake2
2Lisa3
3Sue4

df9

namerank
0Bob3
1Jake1
2Lisa4
3Sue2

pd.merge(df8, df9, on="name", suffixes=["_L", "_R"])

namerank_Lrank_R
0Bob13
1Jake21
2Lisa34
3Sue42

Example: US States Data

1
2
3
4
5
pop = pd.read_csv('./state-population.csv')
areas = pd.read_csv('./state-areas.csv')
abbrevs = pd.read_csv('./state-abbrevs.csv')

display('pop.head()', 'areas.head()', 'abbrevs.head()')

pop.head()

state/regionagesyearpopulation
0ALunder1820121117489.0
1ALtotal20124817528.0
2ALunder1820101130966.0
3ALtotal20104785570.0
4ALunder1820111125763.0

areas.head()

statearea (sq. mi)
0Alabama52423
1Alaska656425
2Arizona114006
3Arkansas53182
4California163707

abbrevs.head()

stateabbreviation
0AlabamaAL
1AlaskaAK
2ArizonaAZ
3ArkansasAR
4CaliforniaCA
1
2
3
4
5
# 데이터 제거를 막기 위해 outer 사용
merged = pd.merge(pop, abbrevs, how='outer',
                  left_on='state/region', right_on='abbreviation')
merged = merged.drop('abbreviation', axis=1)  # drop duplicate info
merged.head()
state/regionagesyearpopulationstate
0AKtotal1990553290.0Alaska
1AKunder181990177502.0Alaska
2AKtotal1992588736.0Alaska
3AKunder181991182180.0Alaska
4AKunder181992184878.0Alaska
1
2
# 널 값을 가진 행을 찾아 불일치 항목 확인
merged.isnull().any()
1
2
3
4
5
6
state/region    False
ages            False
year            False
population       True
state            True
dtype: bool
1
2
# 널 값 확인
merged[merged['population'].isnull()].head()
state/regionagesyearpopulationstate
1872PRunder181990NaNNaN
1873PRtotal1990NaNNaN
1874PRtotal1991NaNNaN
1875PRunder181991NaNNaN
1876PRtotal1993NaNNaN
1
2
# 일치되는 값이 없는 지역 확인
merged.loc[merged['state'].isnull(), 'state/region'].unique()
1
array(['PR', 'USA'], dtype=object)
1
2
3
4
# 주 이름의 약어 키에는 없는 항목(state)을 채워 넣기
merged.loc[merged['state/region'] == 'PR', 'state'] = 'Puerto Rico'
merged.loc[merged['state/region'] == 'USA', 'state'] = 'United States'
merged.isnull().any()
1
2
3
4
5
6
state/region    False
ages            False
year            False
population       True
state           False
dtype: bool
1
2
final = pd.merge(merged, areas, on='state', how='left')
final.head()
state/regionagesyearpopulationstatearea (sq. mi)
0AKtotal1990553290.0Alaska656425.0
1AKunder181990177502.0Alaska656425.0
2AKtotal1992588736.0Alaska656425.0
3AKunder181991182180.0Alaska656425.0
4AKunder181992184878.0Alaska656425.0
1
2
# area 널 값 발생
final.isnull().any()
1
2
3
4
5
6
7
state/region     False
ages             False
year             False
population        True
state            False
area (sq. mi)     True
dtype: bool
1
2
# area 널 값을 가진 지역 확인
final['state'][final['area (sq. mi)'].isnull()].unique()
1
array(['United States'], dtype=object)
1
2
3
# 널 값 삭제
final.dropna(inplace=True)
final.head()
state/regionagesyearpopulationstatearea (sq. mi)
0AKtotal1990553290.0Alaska656425.0
1AKunder181990177502.0Alaska656425.0
2AKtotal1992588736.0Alaska656425.0
3AKunder181991182180.0Alaska656425.0
4AKunder181992184878.0Alaska656425.0
1
2
3
4
5
6
import numexpr as ne

data2010 = final.query("year == 2010 & ages == 'total'")
data2011 = final.query("year == 2011 & ages == 'total'")
data2012 = final.query("year == 2012 & ages == 'total'")
display("data2010.head()", "data2011.head()", "data2012.head()")

data2010.head()

state/regionagesyearpopulationstatearea (sq. mi)
43AKtotal2010713868.0Alaska656425.0
51ALtotal20104785570.0Alabama52423.0
141ARtotal20102922280.0Arkansas53182.0
149AZtotal20106408790.0Arizona114006.0
197CAtotal201037333601.0California163707.0

data2011.head()

state/regionagesyearpopulationstatearea (sq. mi)
45AKtotal2011723375.0Alaska656425.0
53ALtotal20114801627.0Alabama52423.0
139ARtotal20112938506.0Arkansas53182.0
147AZtotal20116468796.0Arizona114006.0
195CAtotal201137668681.0California163707.0

data2012.head()

state/regionagesyearpopulationstatearea (sq. mi)
47AKtotal2012730307.0Alaska656425.0
49ALtotal20124817528.0Alabama52423.0
143ARtotal20122949828.0Arkansas53182.0
145AZtotal20126551149.0Arizona114006.0
193CAtotal201237999878.0California163707.0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
# 각 연도별 인구 밀도 계산 및 재배열
data2010.set_index('state', inplace=True)
density10 = data2010['population'] / data2010['area (sq. mi)']
density10.sort_values(ascending=False, inplace=True)
# density2010.sort_values(ascending=False, inplace=True)

data2011.set_index('state', inplace=True)
density11 = data2011['population'] / data2011['area (sq. mi)']
density11.sort_values(ascending=False, inplace=True)

data2012.set_index('state', inplace=True)
density12 = data2012['population'] / data2012['area (sq. mi)']
density12.sort_values(ascending=False, inplace=True)
# density2012.sort_values(ascending=False, inplace=True)

# 인구 밀도가 가장 높은 지역과 낮은 지역
print(density11.head(), "\n\n", density11.tail())
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
state
District of Columbia    9112.117647
Puerto Rico             1048.813656
New Jersey              1013.143660
Rhode Island             679.838188
Connecticut              647.357143
dtype: float64 

 state
South Dakota    10.681552
North Dakota     9.686397
Montana          6.784272
Wyoming          5.799843
Alaska           1.101992
dtype: float64
1
2
3
4
5
6
7
8
9
10
11
# 인구 밀도가 매 해 증가하는 지역
increasing_states = []
for state in final['state'].unique():
    density2010 = density10[state]
    density2011 = density11[state]
    density2012 = density12[state]

    if (density2010 < density2011).all() and (density2011 < density2012).all():
        increasing_states.append(state)

print(increasing_states)
1
['Alaska', 'Alabama', 'Arkansas', 'Arizona', 'California', 'Colorado', 'Connecticut', 'District of Columbia', 'Delaware', 'Florida', 'Georgia', 'Hawaii', 'Iowa', 'Idaho', 'Illinois', 'Indiana', 'Kansas', 'Kentucky', 'Louisiana', 'Massachusetts', 'Maryland', 'Maine', 'Minnesota', 'Missouri', 'Mississippi', 'Montana', 'North Carolina', 'North Dakota', 'Nebraska', 'New Hampshire', 'New Jersey', 'New Mexico', 'Nevada', 'New York', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Virginia', 'Washington', 'Wisconsin', 'West Virginia', 'Wyoming']
이 기사는 저작권자의 CC BY-NC 4.0 라이센스를 따릅니다.