[AWS] Running an EMR (Spark) Job

Hazel_song 2022. 6. 9. 12:10

EMR on EKS : spark job submit

S3 -> EMR(spark) -> S3

1. Check the cluster information

aws emr-containers list-virtual-clusters
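To pull the virtual cluster ID out of that output from a script, a minimal Python sketch could look like the following. It assumes the command was run with `--output json` and that the response carries a top-level `virtualClusters` list whose entries have `id`, `name`, and `state` fields. The helper name `running_cluster_ids` and the sample data are made up for illustration.

```python
import json


def running_cluster_ids(cli_output):
    """Return the IDs of virtual clusters whose state is RUNNING."""
    data = json.loads(cli_output)
    return [
        vc["id"]
        for vc in data.get("virtualClusters", [])
        if vc.get("state") == "RUNNING"
    ]


# Made-up sample shaped like `aws emr-containers list-virtual-clusters --output json`
sample = json.dumps({
    "virtualClusters": [
        {"id": "abc123", "name": "demo-cluster", "state": "RUNNING"},
        {"id": "old456", "name": "old-cluster", "state": "TERMINATED"},
    ]
})

print(running_cluster_ids(sample))
```

The ID returned here is the value that goes into EMR_EKS_CLUSTER_ID in the next step.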

2. Set the basic information

export EMR_EKS_CLUSTER_ID=<virtual-cluster-id>
export EMR_EKS_EXECUTION_ARN=<arn:aws:iam::xxxxx:role/EMR_EKS_Job_Execution_Role>
export S3_BUCKET=<S3Bucket>

-> For the role, enter the role that was created earlier when setting up EMR on EKS (EMRContainers-JobExecutionRole).

-> The S3 bucket does not appear to be strictly required. It is the S3 path that holds the PySpark file for the job.
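A quick sanity check of these three values before submitting can catch typos early. The sketch below is only an illustration: the helper name `check_settings` is hypothetical, the ARN pattern is a loose approximation of an IAM role ARN, and the `s3://` check assumes S3_BUCKET holds the full path to the PySpark file, as described above.

```python
import re

# Loose pattern for an IAM role ARN (an approximation, not an official grammar)
ROLE_ARN = re.compile(r"^arn:aws[a-z-]*:iam::\d{12}:role/.+$")

REQUIRED_KEYS = ("EMR_EKS_CLUSTER_ID", "EMR_EKS_EXECUTION_ARN", "S3_BUCKET")


def check_settings(env):
    """Return a list of problems found in the settings (empty if none)."""
    problems = []
    for key in REQUIRED_KEYS:
        if not env.get(key):
            problems.append(f"{key} is not set")

    arn = env.get("EMR_EKS_EXECUTION_ARN", "")
    if arn and not ROLE_ARN.match(arn):
        problems.append("EMR_EKS_EXECUTION_ARN does not look like an IAM role ARN")

    s3_path = env.get("S3_BUCKET", "")
    if s3_path and not s3_path.startswith("s3://"):
        problems.append("S3_BUCKET should be an s3:// path")
    return problems


# Hypothetical values, for illustration only
example_env = {
    "EMR_EKS_CLUSTER_ID": "abc123",
    "EMR_EKS_EXECUTION_ARN": "arn:aws:iam::123456789012:role/EMR_EKS_Job_Execution_Role",
    "S3_BUCKET": "s3://example-bucket/job.py",
}
print(check_settings(example_env))
```

To check the real shell environment, `check_settings(dict(os.environ))` could be used after `import os`.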

3. Write the PySpark code and upload it to the S3 bucket above

Example code

from pyspark.sql import SparkSession


if __name__ == "__main__":
    # Create (or reuse) the Spark session for this job
    spark = SparkSession\
        .builder\
        .appName("PythonPi")\
        .getOrCreate()

    # Read the source CSV from S3
    df = spark.read.csv("<S3 path of the data to process>")

    # Write the result back to S3 as CSV with a header row
    df.write.option("header", "true").csv("<S3 path where the processed data will be saved>")

    spark.stop()

4. Submit the job

aws emr-containers start-job-run \
--virtual-cluster-id ${EMR_EKS_CLUSTER_ID} \
--name spark-pi \
--execution-role-arn ${EMR_EKS_EXECUTION_ARN} \
--release-label emr-6.2.0-latest \
--job-driver '{
    "sparkSubmitJobDriver": {
        "entryPoint": "'"${S3_BUCKET}"'",
        "sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1"
        }
    }'

-> The JSON is wrapped in single quotes, so the quotes are closed around ${S3_BUCKET} to let the shell expand it. The value is the S3 path of the PySpark file uploaded in step 3.
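Shell quoting inside the --job-driver JSON is easy to get wrong, so one option is to build the JSON in Python and hand the arguments to the CLI as a list. The following is a minimal sketch under that assumption. The function name `build_start_job_run` and the placeholder values are hypothetical, and the Spark parameters are simply the ones used in the command above.

```python
import json
import shlex


def build_start_job_run(cluster_id, role_arn, entry_point,
                        name="spark-pi", release_label="emr-6.2.0-latest"):
    """Build the argv list for `aws emr-containers start-job-run`."""
    spark_params = " ".join([
        "--conf spark.executor.instances=2",
        "--conf spark.executor.memory=2G",
        "--conf spark.executor.cores=2",
        "--conf spark.driver.cores=1",
    ])
    job_driver = {
        "sparkSubmitJobDriver": {
            "entryPoint": entry_point,
            "sparkSubmitParameters": spark_params,
        }
    }
    return [
        "aws", "emr-containers", "start-job-run",
        "--virtual-cluster-id", cluster_id,
        "--name", name,
        "--execution-role-arn", role_arn,
        "--release-label", release_label,
        # json.dumps takes care of quoting the entry point as a JSON string
        "--job-driver", json.dumps(job_driver),
    ]


# Placeholder values; replace them with the settings from step 2
argv = build_start_job_run(
    "<virtual-cluster-id>",
    "<execution-role-arn>",
    "s3://<bucket>/<script>.py",
)
print(shlex.join(argv))
```

Passing `argv` to `subprocess.run(argv, check=True)` would submit the job without going through a shell, which avoids the nested quoting entirely.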

5. Check the submitted job

a. Failed job (FAILED state): check the error log

aws emr-containers describe-job-run --virtual-cluster-id <cluster-id> --id <job-run-id>

The error message in the command output can be used to identify and fix the cause of the failure.
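When the output is long, the fields that matter for a failed run can be picked out with a few lines of Python. This sketch assumes the command was run with `--output json` and that the response has a top-level `jobRun` object with `state`, `failureReason`, and `stateDetails` fields. The helper name `failure_summary` and the sample values are made up for illustration.

```python
import json


def failure_summary(cli_output):
    """Extract the fields that explain why a job run failed."""
    job = json.loads(cli_output).get("jobRun", {})
    return {
        "state": job.get("state"),
        "failureReason": job.get("failureReason"),
        "stateDetails": job.get("stateDetails"),
    }


# Made-up sample shaped like `aws emr-containers describe-job-run --output json`
sample = json.dumps({
    "jobRun": {
        "id": "0000000example",
        "name": "spark-pi",
        "state": "FAILED",
        "failureReason": "USER_ERROR",
        "stateDetails": "made-up detail text for illustration",
    }
})

for key, value in failure_summary(sample).items():
    print(f"{key}: {value}")
```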

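b. Successful job: RUNNING => COMPLETED state

A job shown as RUNNING in the AWS EMR on EKS console is also created and run as pods in EKS, where it can be checked.

Once execution finishes and the job reaches the COMPLETED state, its pods are taken down as well.

You can then confirm that files were created in the S3 bucket and path you specified.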
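The RUNNING => COMPLETED transition described in this step can also be watched from a script instead of the console. The sketch below polls until the job reaches a terminal state. It assumes COMPLETED, FAILED, and CANCELLED are the terminal states, and it takes the state lookup as a callable (`get_state`, a hypothetical name) so that the polling logic stays independent of how the state is fetched, for example by parsing describe-job-run output.

```python
import time

# Assumed terminal states of a job run
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED"}


def wait_for_job(get_state, interval_seconds=30.0, max_polls=120, sleep=time.sleep):
    """Poll get_state() until it returns a terminal state, then return that state."""
    state = None
    for _ in range(max_polls):
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        sleep(interval_seconds)
    raise TimeoutError(f"job still in state {state!r} after {max_polls} polls")


# Demo with a canned sequence of states instead of real AWS calls
states = iter(["SUBMITTED", "RUNNING", "RUNNING", "COMPLETED"])
final_state = wait_for_job(lambda: next(states), sleep=lambda _seconds: None)
print(final_state)
```

In real use, `get_state` would wrap a describe-job-run call and the default `sleep` would be left in place.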

6. Main commands

a. List job runs

aws emr-containers list-job-runs --virtual-cluster-id <cluster-id>

b. View job details

aws emr-containers describe-job-run --virtual-cluster-id <cluster-id> --id <job-run-id>

c. Cancel a job run

aws emr-containers cancel-job-run --virtual-cluster-id <cluster-id> --id <job-run-id>

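7. Viewing logs

- Viewing logs in the web UI

-> View logs is unavailable while the job is in the SUBMITTED state. Once the state has been determined, clicking the button appears to open the log web UI even for a private cluster.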

References

https://catalog.us-east-1.prod.workshops.aws/workshops/1f91e1d4-5587-40ff-8d5d-54fc86e0ddc1/en-US

https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/getting-started.html

https://programmer.ink/think/create-and-run-an-emr-on-eks-cluster.html

https://docs.aws.amazon.com/ko_kr/emr/latest/EMR-on-EKS-DevelopmentGuide/emr-eks-jobs-CLI.html#emr-eks-jobs-submit

https://docs.aws.amazon.com/ko_kr/emr/latest/ReleaseGuide/emr-spark-s3select.html

https://www.youtube.com/watch?v=2UMz72NRZss&list=PLUe6KRx8LhLpJ8CyNHewFYukWm7sQyQrM&index=1