Spark on AWS EKS
This document analyzes how Spark Applications operate in an AWS EKS Cluster. There are two ways to run Spark Applications in an AWS EKS Cluster: using the spark-submit CLI or the Spark Operator provided by Spark, and using the StartJobRun API provided by EMR on EKS.
1. spark-submit CLI & Spark Operator
In AWS EKS, Spark Applications can be run using the spark-submit CLI or the Spark Operator, just like in a general Kubernetes Cluster. In this case, the Architecture and the way it operates are the same as in a general Kubernetes Cluster, as described in the following Link.
However, in AWS EKS it is recommended to use the EMR on EKS Spark Container Image as the Container Image for the Driver and Executor Pods. The EMR on EKS Spark Container Image contains a build of Spark optimized for the EKS environment, which shows better performance than Open Source Spark, and it includes the AWS-related Libraries and Spark Connectors listed below.
- EMRFS S3-optimized committer
- Spark Connector for AWS Redshift : Used when accessing AWS Redshift from Spark Applications
- Spark Library for AWS SageMaker : Data stored in Spark Application’s DataFrame can be directly used for Training through AWS SageMaker
The EMR on EKS Spark Container Image is publicly available in the Public AWS ECR Gallery. When a Spark Application needs additional Libraries or Spark Connectors, a Custom Container Image must be built, and in this case it is also recommended to use the EMR on EKS Spark Container Image as the Base Image.
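For example, a Custom Container Image can be built on top of the EMR on EKS Spark Container Image in the usual Docker way. The following is a minimal sketch; the Base Image path, the added Library, and the target Image name are placeholders and should be replaced with the actual Image path found in the Public AWS ECR Gallery and the Libraries the Spark Application actually needs.

$ cat Dockerfile
# Placeholder Base Image path; look up the actual EMR on EKS Spark Image in the Public AWS ECR Gallery
FROM public.ecr.aws/emr-on-eks/spark/emr-6.8.0:latest
USER root
# Example of adding a Library required by the Spark Application
RUN pip3 install boto3
# Switch back to the non-root user used by the EMR on EKS Spark Container Image
USER hadoop:hadoop

$ docker build -t [account-id].dkr.ecr.ap-northeast-2.amazonaws.com/spark-custom:emr-6.8.0 .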
2. StartJobRun API
The StartJobRun API is an API for submitting Spark Jobs in an EMR on EKS environment. To use the StartJobRun API, a Virtual Cluster, a virtual Resource managed by AWS EMR, must be created first. Creating a Virtual Cluster requires one Namespace existing in the EKS Cluster. Since multiple Namespaces can be created in one EKS Cluster and a Virtual Cluster can be mapped to each Namespace, multiple Virtual Clusters can be operated in one EKS Cluster.
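A Virtual Cluster can be created with the aws CLI as sketched below. This is a minimal sketch; the Virtual Cluster name, EKS Cluster name, and Namespace are placeholders, and the Namespace must already be set up so that EMR on EKS is allowed to use it.

$ aws emr-containers create-virtual-cluster \
--name [virtual-cluster-name] \
--container-provider '{
    "id": "[eks-cluster-name]",
    "type": "EKS",
    "info": {
        "eksInfo": {
            "namespace": "[namespace]"
        }
    }
}'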
![[Figure 1] Spark on AWS EKS Architecture with StartJobRun API](/blog-software/docs/theory-analysis/spark-on-aws-eks/images/spark-aws-eks-architecture-startjobrun-api.png)
[Figure 1] Spark on AWS EKS Architecture with StartJobRun API
[Figure 1] shows the Architecture when a Spark Job is submitted through the StartJobRun API to an EKS Cluster with one Virtual Cluster. When the StartJobRun API is called, a job-runner Pod is created in the Namespace mapped to the Virtual Cluster, and the spark-submit CLI runs inside the job-runner Pod. That is, the StartJobRun API method also uses the spark-submit CLI internally to submit Spark Jobs.
$ aws emr-containers start-job-run \
--virtual-cluster-id [virtual-cluster-id] \
--name=pi \
--region ap-northeast-2 \
--execution-role-arn arn:aws:iam::[account-id]:role/ts-eks-emr-eks-emr-cli \
--release-label emr-6.8.0-latest \
--job-driver '{
"sparkSubmitJobDriver":{
"entryPoint": "local:///usr/lib/spark/examples/src/main/python/pi.py",
"sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=1 --conf spark.driver.cores=1"
}
}'

[Shell 1] shows an example of submitting a Spark Job through the StartJobRun API using the aws CLI. You can see that the Virtual Cluster, the name of the submitted Spark Job, the AWS Region, the Role used to execute the Spark Job, and Spark-related settings are specified. You can also see that the --conf Parameters normally passed to the spark-submit CLI are set in the sparkSubmitParameters item.
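After [Shell 1] is submitted, the Pods in [Figure 1] can be checked directly with kubectl. The following is a hypothetical sketch; the Namespace and Pod names are illustrative.

# job-runner Pod : created by AWS EMR, runs the spark-submit CLI
# Driver Pod     : created by the spark-submit CLI inside the job-runner Pod
# Executor Pods  : requested by the Driver Pod
$ kubectl get pods -n [namespace]

# The spark-submit CLI execution can be seen in the job-runner Pod's Logs
$ kubectl logs [job-runner-pod-name] -n [namespace]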
The spark-submit CLI inside the job-runner Pod obtains the configuration it needs to create the Spark Job from ConfigMap-based Files mounted in the job-runner Pod. The ConfigMaps are created by AWS EMR, according to the StartJobRun API settings, before the job-runner Pod is created. The configuration includes settings related to the Driver Pod and the Executor Pod. When the [Shell 1] command is executed, the three ConfigMaps [File 1], [File 2], and [File 3] are created.
[File 1] is the spark-defaults.conf ConfigMap used to pass Spark Job settings to the spark-submit CLI, [File 2] is the ConfigMap for the Pod Templates passed to the spark-submit CLI, and [File 3] is the ConfigMap for the fluentd configured in the Driver Pod. When a Spark Job is submitted through the StartJobRun API, a fluentd Sidecar Container is always created in the Driver Pod. The reason is that the spark-submit CLI creates the fluentd Container as the Driver's Sidecar Container through the spark-submit CLI's Pod Template functionality, using the [File 1], [File 2], and [File 3] ConfigMaps.
Looking at the fluentd settings in the [File 3] ConfigMap, you can see that the Event Logs generated by the Driver Pod are stored in the prod.ap-northeast-2.appinfo.src Bucket. The appinfo.src Bucket is managed by AWS EMR and is integrated with the Spark History Server managed by EMR, allowing users to check the History of Spark Jobs submitted through the StartJobRun API. Of course, Event Logs can also be stored at a path chosen by the user by specifying the --conf spark.eventLog.dir=s3a://[s3-bucket] setting.
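The ConfigMaps created for the job-runner Pod can also be inspected with kubectl. A hypothetical sketch, with illustrative names:

$ kubectl get configmaps -n [namespace]
$ kubectl get configmap [configmap-name] -n [namespace] -o yaml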
$ aws emr-containers start-job-run \
--virtual-cluster-id jk518skp01ys9ka8b0npx9nt0 \
--name=pi-logs \
--region ap-northeast-2 \
--execution-role-arn arn:aws:iam::[account-id]:role/ts-eks-emr-eks-emr-cli \
--release-label emr-6.8.0-latest \
--job-driver '{
"sparkSubmitJobDriver":{
"entryPoint": "local:///usr/lib/spark/examples/src/main/python/pi.py",
"sparkSubmitParameters": "--conf spark.driver.cores=1 --conf spark.driver.memory=512M --conf spark.executor.instances=1 --conf spark.executor.memory=512M --conf spark.executor.cores=1"
}
}' \
--configuration-overrides '{
"monitoringConfiguration": {
"persistentAppUI": "ENABLED",
"cloudWatchMonitoringConfiguration": {
"logGroupName": "spark-startjobrun",
"logStreamNamePrefix": "pi-logs"
},
"s3MonitoringConfiguration": {
"logUri": "s3://ssup2-spark/startjobrun/"
}
}
}'

The StartJobRun API also provides functionality to easily send the stdout/stderr of the job-runner, Driver, and Executor Pods to CloudWatch or S3. [Shell 2] shows an example of submitting a Spark Job through the StartJobRun API using the aws CLI with Logging settings. Compared to [Shell 1], you can see that a monitoringConfiguration setting is added, with the CloudWatch and S3 settings under it.
When the [Shell 2] command is executed, the three ConfigMaps [File 4], [File 5], and [File 6] are created. You can see that fluentd is configured to run not only in the job-runner Pod but also in the Driver Pod and the Executor Pods, and that the fluentd running in the Driver Pod and the Executor Pods is configured to send stdout/stderr to CloudWatch or S3.
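With the [Shell 2] settings, the delivered Logs can be checked with the aws CLI as sketched below; the Log Group and Bucket follow the values used in [Shell 2].

# Check the Log Streams created under the spark-startjobrun Log Group in CloudWatch
$ aws logs describe-log-streams --log-group-name spark-startjobrun --region ap-northeast-2

# Check the Log Files stored under the S3 logUri
$ aws s3 ls s3://ssup2-spark/startjobrun/ --recursive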
$ aws emr-containers start-job-run \
--virtual-cluster-id jk518skp01ys9ka8b0npx9nt0 \
--name=pi \
--region ap-northeast-2 \
--execution-role-arn arn:aws:iam::[account-id]:role/ts-eks-emr-eks-emr-cli \
--release-label emr-6.8.0-latest \
--job-driver '{
"sparkSubmitJobDriver":{
"entryPoint": "local:///usr/lib/spark/examples/src/main/python/pi.py",
"sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=1 --conf spark.driver.cores=1"
}
}' \
--configuration-overrides '{
"applicationConfiguration": [
{
"classification": "spark-defaults",
"properties": {
"job-start-timeout":"1800",
"spark.ui.prometheus.enabled":"true",
"spark.executor.processTreeMetrics.enabled":"true",
"spark.kubernetes.driver.annotation.prometheus.io/scrape":"true",
"spark.kubernetes.driver.annotation.prometheus.io/path":"/metrics/executors/prometheus/",
"spark.kubernetes.driver.annotation.prometheus.io/port":"4040"
}
}
]
}'

Through the StartJobRun API, the various settings that can be configured with the spark-submit CLI can be set identically. [Shell 3] shows an example of configuring Monitoring with Prometheus.
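Based on the [Shell 3] settings, the Prometheus Metrics exposed by the Driver Pod can be verified as sketched below. The Driver Pod name and Namespace are illustrative; the Port and Path follow the prometheus.io Annotations set in [Shell 3].

$ kubectl port-forward [driver-pod-name] 4040:4040 -n [namespace]
$ curl http://localhost:4040/metrics/executors/prometheus/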
2.1. with ACK EMR Container Controller
![[Figure 2] Spark on AWS EKS Architecture with ACK EMR Container Controller](/blog-software/docs/theory-analysis/spark-on-aws-eks/images/spark-aws-eks-architecture-ack-emr-container-controller.png)
[Figure 2] Spark on AWS EKS Architecture with ACK EMR Container Controller
AWS provides the ACK EMR Container Controller so that Spark Jobs based on the StartJobRun API can be submitted using Kubernetes Objects. [Figure 2] shows the process of submitting Spark Jobs through the StartJobRun API based on the ACK EMR Container Controller.
When the ACK EMR Container Controller is installed in an AWS EKS Cluster, two Custom Resources, Virtual Cluster and Job Run, become available. Virtual Cluster is a Custom Resource used to configure an EMR on EKS Virtual Cluster for a specific Namespace of the AWS EKS Cluster where the ACK EMR Container Controller is installed, and Job Run is a Custom Resource used to submit Spark Jobs through the StartJobRun API.
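The ACK EMR Container Controller is typically installed with Helm. The following is a minimal sketch assuming the standard ACK Helm Chart location and values; the Chart version and the IAM permissions required by the Controller are omitted.

$ helm install ack-emrcontainers-controller \
oci://public.ecr.aws/aws-controllers-k8s/emrcontainers-chart \
--version [chart-version] \
--namespace ack-system --create-namespace \
--set aws.region=ap-northeast-2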
[File 7] shows an example of a simple Job Run, and [File 8] shows an example of a Job Run with Logging settings applied. Looking at the configuration values, you can see that the options set through the aws CLI can be set identically in the Job Run.
3. References
- EMR on EKS Container Image : https://gallery.ecr.aws/emr-on-eks
- StartJobRun Parameter : https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/emr-eks-jobs-CLI.html#emr-eks-jobs-parameters