How to change a dataframe column from String type to Double type in PySpark?


lastmoon 2023. 7. 16. 17:44


I have a dataframe with a column of String type. I wanted to change the column type to Double type in PySpark.

This is what I did:

toDoublefunc = UserDefinedFunction(lambda x: x,DoubleType())
changedTypedf = joindf.withColumn("label",toDoublefunc(joindf['show']))

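I get an error when running logistic regression on the result, so I would like to know whether this conversion is the cause of the trouble.

There is no need for a UDF here. Column already provides a cast method that takes a DataType instance: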

from pyspark.sql.types import DoubleType

changedTypedf = joindf.withColumn("label", joindf["show"].cast(DoubleType()))

or a short string:

changedTypedf = joindf.withColumn("label", joindf["show"].cast("double"))

Here the canonical string names (other variations may be supported as well) correspond to the simpleString value of each type. For atomic types:

from pyspark.sql import types 

for t in ['BinaryType', 'BooleanType', 'ByteType', 'DateType', 
          'DecimalType', 'DoubleType', 'FloatType', 'IntegerType', 
           'LongType', 'ShortType', 'StringType', 'TimestampType']:
    print(f"{t}: {getattr(types, t)().simpleString()}")

BinaryType: binary
BooleanType: boolean
ByteType: tinyint
DateType: date
DecimalType: decimal(10,0)
DoubleType: double
FloatType: float
IntegerType: int
LongType: bigint
ShortType: smallint
StringType: string
TimestampType: timestamp

And, for example, for complex types:

types.ArrayType(types.IntegerType()).simpleString()

'array<int>'

types.MapType(types.StringType(), types.IntegerType()).simpleString()

'map<string,int>'

To preserve the column name and avoid adding an extra column, use the same name as the input column:

from pyspark.sql.types import DoubleType
changedTypedf = joindf.withColumn("show", joindf["show"].cast(DoubleType()))

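The given answers are enough to solve the problem, but I want to share another way, which may have been introduced in a newer version of Spark (I am not sure about that), so the given answers do not cover it.

You can refer to a column in a Spark statement with the col("column_name") function: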

from pyspark.sql.functions import col
changedTypedf = joindf.withColumn("show", col("show").cast("double"))

PySpark version:

df = <source data>
df.printSchema()

from pyspark.sql.types import IntegerType

# Change column type
df_new = df.withColumn("myColumn", df["myColumn"].cast(IntegerType()))
df_new.printSchema()
df_new.select("myColumn").show()

The solution was simple:

toDoublefunc = UserDefinedFunction(lambda x: float(x),DoubleType())
changedTypedf = joindf.withColumn("label",toDoublefunc(joindf['show']))
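
One caveat with wrapping float(x) directly: float raises TypeError for None (which is how a null value reaches a Python UDF) and ValueError for a non-numeric string, so a single bad row can fail the job. Below is a minimal sketch of a more defensive converter that could be wrapped in the same way. The name to_double_or_none is hypothetical and not part of PySpark, and the built-in cast shown in the other answers is still the simpler option.

```python
def to_double_or_none(value):
    """Convert a string to float, returning None when it cannot be parsed.

    Hypothetical helper for illustration; it is plain Python and could be
    passed to a UDF declared with DoubleType() as its return type.
    """
    if value is None:
        # Null values arrive in a Python UDF as None
        return None
    try:
        return float(value)
    except (TypeError, ValueError):
        # Non-numeric input such as "abc" becomes a null instead of an error
        return None


print(to_double_or_none("1.5"))
print(to_double_or_none("abc"))
print(to_double_or_none(None))
```

Under these assumptions, the lambda in the snippet above could be replaced by to_double_or_none without changing the rest of the call.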

Use:

df1.select(col('show').cast("Float").alias('label')).show()

or

df1.selectExpr("cast(show AS FLOAT) as label").show()

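One issue with the other answers, depending on your PySpark version, is the use of withColumn. Performance issues have been observed in at least v2.4.4 (see this thread). The Spark docs say this about withColumn:

This method introduces a projection internally. Therefore, calling it multiple times, for instance via loops in order to add multiple columns, can generate big plans which can cause performance issues and even a StackOverflowException. To avoid this, use select with the multiple columns at once.

In general, one way to follow the recommended usage of select instead is: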

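from pyspark.sql.types import DoubleType
from pyspark.sql import functions as F

cols_to_fix = ['show']
# Columns that keep their current type
other_cols = [c for c in joindf.columns if c not in cols_to_fix]
joindf = joindf.select(
    *other_cols,
    # Cast every listed column in the same projection, keeping its name
    *[F.col(c).cast(DoubleType()).alias(c) for c in cols_to_fix]
)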
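
Note that the select-based snippet moves the cast columns to the end of the schema. The following is a minimal sketch, assuming you would rather keep the original column order and are comfortable with SQL expression strings like the selectExpr example earlier. The helper name build_cast_exprs is hypothetical and not part of PySpark; it only builds strings, so it runs without a Spark session.

```python
def build_cast_exprs(columns, casts):
    """Build one selectExpr string per column, in the original order.

    columns: list of column names, e.g. joindf.columns
    casts:   dict mapping a column name to a target type name, e.g. "double"
    """
    exprs = []
    for name in columns:
        # Backticks guard names containing spaces or dots; a literal
        # backtick inside a name is escaped by doubling it.
        quoted = "`" + name.replace("`", "``") + "`"
        if name in casts:
            exprs.append(f"CAST({quoted} AS {casts[name]}) AS {quoted}")
        else:
            exprs.append(quoted)
    return exprs


# Illustrative column names only; a real call would pass joindf.columns
exprs = build_cast_exprs(["id", "show", "label"], {"show": "double"})
print(exprs)
```

With a real DataFrame, the resulting list would be passed in a single call, for example joindf.selectExpr(*build_cast_exprs(joindf.columns, {"show": "double"})), so all casts still happen in one projection.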

Reference URL: https://stackoverflow.com/questions/32284620/how-to-change-a-dataframe-column-from-string-type-to-double-type-in-pyspark
