Working_With_Strings

Published 2024/05/01

By Jinhyeok Ko


Vectorized String Operations

Pandas String Operations

  
import numpy as np

x = np.array([2, 3, 5, 7, 11, 13])
# Vectorized operation: applied to every element at once
x * 2

array([ 4,  6, 10, 14, 22, 26])

  
data = ['peter', 'Paul', 'MARY', 'gUIDO']
# NumPy offers no such simple access for arrays of strings, so fall back to a loop
[s.capitalize() for s in data]

['Peter', 'Paul', 'Mary', 'Guido']

  
import pandas as pd

data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
names = pd.Series(data)
names

0    peter
1     Paul
2     None
3     MARY
4    gUIDO
dtype: object
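The reason for wrapping the data in a Series is that the list-comprehension approach breaks as soon as a value is missing, while the `.str` accessor skips it. A minimal sketch of that contrast follows; the variable names `loop_error` and `capitalized` are chosen here for illustration only, and how the missing entry is displayed (`None` or `NaN`) can differ between pandas versions.

```python
import pandas as pd

data = ['peter', 'Paul', None, 'MARY', 'gUIDO']

# The plain loop fails on the missing value: None has no capitalize().
loop_error = None
try:
    [s.capitalize() for s in data]
except AttributeError as exc:
    loop_error = exc

# The .str accessor applies capitalize() element-wise and passes the
# missing entry through as a missing value instead of raising.
names = pd.Series(data)
capitalized = names.str.capitalize()

print(type(loop_error).__name__)
print(capitalized)
```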

Tables of Pandas String Methods

  
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

Methods similar to Python string methods

Pandas str methods


`len()`	`lower()`	`translate()`	`islower()`
`ljust()`	`upper()`	`startswith()`	`isupper()`
`rjust()`	`find()`	`endswith()`	`isnumeric()`
`center()`	`rfind()`	`isalnum()`	`isdecimal()`
`zfill()`	`index()`	`isalpha()`	`split()`
`strip()`	`rindex()`	`isdigit()`	`rsplit()`
`rstrip()`	`capitalize()`	`isspace()`	`partition()`
`lstrip()`	`swapcase()`	`istitle()`	`rpartition()`
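Because each of these methods returns a new Series, they can be chained through repeated `.str` calls. The sketch below assumes a small, made-up Series of ID strings (`raw_ids` is hypothetical and not part of the original example data).

```python
import pandas as pd

# Hypothetical messy ID strings, invented for this illustration.
raw_ids = pd.Series(['  7 ', '42', ' 123'])

# Each .str call returns a new Series, so the methods chain naturally:
# strip surrounding whitespace, then left-pad with zeros to width 5.
clean_ids = raw_ids.str.strip().str.zfill(5)

# Predicate methods such as isdigit() return a Boolean Series.
all_digits = clean_ids.str.isdigit()

print(clean_ids.tolist())   # ['00007', '00042', '00123']
print(all_digits.tolist())  # [True, True, True]
```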

  
# Some methods return a Series of strings
monte.str.lower()

0    graham chapman
1       john cleese
2     terry gilliam
3         eric idle
4       terry jones
5     michael palin
dtype: object

  
# Some methods return numbers
monte.str.len()

0    14
1    11
2    13
3     9
4    11
5    13
dtype: int64

  
# Some methods return Boolean values
monte.str.startswith('T')

0    False
1    False
2     True
3    False
4     True
5    False
dtype: bool

  
# Some methods return a list or other compound value for each element
monte.str.split()

0    [Graham, Chapman]
1       [John, Cleese]
2     [Terry, Gilliam]
3         [Eric, Idle]
4       [Terry, Jones]
5     [Michael, Palin]
dtype: object

Methods using regular expressions

Mapping between Pandas methods and functions in Python's re module

Method	Description
`match()`	Call `re.match()` on each element, returning a boolean
`extract()`	Call `re.match()` on each element, returning matched groups as strings
`findall()`	Call `re.findall()` on each element
`replace()`	Replace occurrences of pattern with some other string
`contains()`	Call `re.search()` on each element, returning a boolean
`count()`	Count occurrences of pattern
`split()`	Equivalent to `str.split()`, but accepts regexps
`rsplit()`	Equivalent to `str.rsplit()`, but accepts regexps
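To make a few rows of this table concrete, here is a short sketch that applies `contains()`, `count()`, `match()`, and `replace()` to the same `monte` Series. The patterns and result names are picked for illustration, and `regex=True` is passed to `replace()` explicitly because its default has changed between pandas versions.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# contains(): True where the pattern occurs anywhere in the element
has_double_l = monte.str.contains(r'll')

# count(): number of non-overlapping matches per element
e_count = monte.str.count(r'e')

# match(): the pattern must match at the start of the element
starts_with_t = monte.str.match(r'T')

# replace(): substitute every run of whitespace with an underscore
underscored = monte.str.replace(r'\s+', '_', regex=True)

print(has_double_l.tolist())
print(e_count.tolist())
print(starts_with_t.tolist())
print(underscored.tolist())
```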

  
# Extract the first name from each element by asking for a contiguous group of letters at its start
monte.str.extract('([A-Za-z]+)', expand=False)

0     Graham
1       John
2      Terry
3       Eric
4      Terry
5    Michael
dtype: object

  
# Use the start-of-string (^) and end-of-string ($) anchors to find all names that start and end with a consonant
monte.str.findall(r'^[^AEIOU].*[^aeiou]$')

0    [Graham Chapman]
1                  []
2     [Terry Gilliam]
3                  []
4       [Terry Jones]
5     [Michael Palin]
dtype: object

Miscellaneous methods

Other Pandas string methods

Method	Description
`get()`	Index each element
`slice()`	Slice each element
`slice_replace()`	Replace slice in each element with passed value
`cat()`	Concatenate strings
`repeat()`	Repeat values
`normalize()`	Return Unicode form of string
`pad()`	Add whitespace to left, right, or both sides of strings
`wrap()`	Split long strings into lines with length less than a given width
`join()`	Join strings in each element of the Series with passed separator
`get_dummies()`	Extract dummy variables as a DataFrame
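A handful of these methods can be tried on the `monte` Series as a quick check of what each one returns. This is only a sketch: the widths, fill character, and separators below are arbitrary choices, not values from the original post.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# get(): index into each element (here, the first character)
first_char = monte.str.get(0)

# pad(): pad each string to a minimum width; fillchar defaults to a space
padded = monte.str.pad(16, side='left', fillchar='.')

# repeat(): repeat each value (here, the first two characters, twice)
repeated = monte.str[:2].str.repeat(2)

# join(): join the list held in each element with a separator
joined = monte.str.split().str.join('_')

# cat(): with no other Series given, concatenate all elements into one string
all_names = monte.str.cat(sep=', ')

print(first_char.tolist())
print(padded.tolist())
print(repeated.tolist())
print(joined.tolist())
print(all_names)
```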

Vectorized item access and slicing

  
# Equivalent to monte.str.slice(0, 3)
monte.str[0:3]

0    Gra
1    Joh
2    Ter
3    Eri
4    Ter
5    Mic
dtype: object

  
# split() --> returns a list per element
# get(-1) --> picks the last item of each list, i.e. the last name
monte.str.split().str.get(-1)

0    Chapman
1     Cleese
2    Gilliam
3       Idle
4      Jones
5      Palin
dtype: object
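If both halves of the name are needed at once, one possible alternative to chaining `split()` and `get()` is to let `split()` expand its result into columns. The sketch below assumes every name has exactly one space; the column labels `first` and `last` are made up for this example.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# expand=True returns a DataFrame with one column per piece;
# n=1 limits the split to the first run of whitespace.
parts = monte.str.split(n=1, expand=True)
parts.columns = ['first', 'last']

print(parts)
```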

Indicator variables

  
full_monte = pd.DataFrame({'name': monte,
                           'info': ['B|C|D', 'B|D', 'A|C',
                                    'B|D', 'B|C', 'B|C|D']})
full_monte

	name	info
0	Graham Chapman	B|C|D
1	John Cleese	B|D
2	Terry Gilliam	A|C
3	Eric Idle	B|D
4	Terry Jones	B|C
5	Michael Palin	B|C|D

  
# Split the indicator variables out into a DataFrame
full_monte['info'].str.get_dummies('|')

	A	B	C	D
0	0	1	1	1
1	0	1	0	1
2	1	0	1	0
3	0	1	0	1
4	0	1	1	0
5	0	1	1	1
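Once the codes are split into 0/1 columns, they can be used like any other numeric data. The following sketch shows two possible uses, counting how often each code appears and filtering names by a combination of codes; `counts` and `b_and_d` are illustrative names only.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])
full_monte = pd.DataFrame({'name': monte,
                           'info': ['B|C|D', 'B|D', 'A|C',
                                    'B|D', 'B|C', 'B|C|D']})

dummies = full_monte['info'].str.get_dummies('|')

# Column sums give the number of rows carrying each code.
counts = dummies.sum()

# Boolean masks built from the indicator columns select rows of the original frame.
b_and_d = full_monte.loc[(dummies['B'] == 1) & (dummies['D'] == 1), 'name']

print(counts.to_dict())
print(b_and_d.tolist())
```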
