EMR Serverless Jupyter Notebook Best Practices

Jupyter Notebook is an open-source interactive development tool that lets you write and execute code in a web interface and view the results in real time; see the official Jupyter documentation for more details.
Jupyter supports multiple kernels. This article describes how to build an efficient development environment in which a self-managed Jupyter installation interacts with EMR Serverless Spark.

Deploy Jupyter

Install Jupyter Notebook

pip3 install 'jupyter_server>=2.14.0' 'notebook>=7.2.0'

Start Jupyter Notebook

export KERNEL_USERNAME=notebook
export KERNEL_LAUNCH_TIMEOUT=300
export KERNEL_EXTRA_SPARK_OPTS="--conf serverless.spark.access.key={your_access_key} --conf serverless.spark.secret.key={your_secret_key} --conf las.execute.queuename={your_emr_queue}  --conf spark.dynamicAllocation.maxExecutors=50"

jupyter notebook --ip 0.0.0.0 --port {jupyter_port} --allow-root --gateway-url=http://{jupyter_gateway_endpoint} --GatewayClient.auth_token={jupyter_gateway_token}

The parameters are described as follows:

  • KERNEL_USERNAME: set to "notebook"; do not change.
  • KERNEL_LAUNCH_TIMEOUT: kernel launch timeout.
  • KERNEL_EXTRA_SPARK_OPTS: extra configuration, supporting both EMR Serverless parameters and open-source Spark parameters (see Common Parameters below), in the format --conf k1=v1 --conf k2=v2. Supported keys include:
      • serverless.spark.access.key: required; your account Access Key.
      • serverless.spark.secret.key: required; your account Secret Key.
      • las.execute.queuename: EMR Serverless queue name.
      • spark.dynamicAllocation.minExecutors: minimum number of executors; default 1.
      • spark.dynamicAllocation.initialExecutors: initial number of executors; default 1.
      • spark.dynamicAllocation.maxExecutors: maximum number of executors; default 30.
  • jupyter_port: port on which the Jupyter server listens.
  • jupyter_gateway_endpoint: Jupyter gateway service address:
      • North China public endpoint: https://notebook-eg.emr.cn-beijing.volces.com
      • North China private endpoint: http://ep-xxxx.privatelink.volces.com; see (Optional) Configure a Private Endpoint for details.
  • jupyter_gateway_token: Jupyter gateway token, provided by EMR Serverless.

After the notebook starts, you should see log output similar to the following:

[I 2025-05-22 20:11:54.327 ServerApp] jupyter_lsp | extension was successfully linked.
[I 2025-05-22 20:11:54.330 ServerApp] jupyter_server_terminals | extension was successfully linked.
[I 2025-05-22 20:11:54.333 ServerApp] jupyterlab | extension was successfully linked.
[I 2025-05-22 20:11:54.337 ServerApp] notebook | extension was successfully linked.
[I 2025-05-22 20:11:54.507 ServerApp] notebook_shim | extension was successfully linked.
[I 2025-05-22 20:11:54.519 ServerApp] notebook_shim | extension was successfully loaded.
[I 2025-05-22 20:11:54.521 ServerApp] jupyter_lsp | extension was successfully loaded.
[I 2025-05-22 20:11:54.522 ServerApp] jupyter_server_terminals | extension was successfully loaded.
[I 2025-05-22 20:11:54.523 LabApp] JupyterLab extension loaded from /usr/local/lib/python3.10/site-packages/jupyterlab
[I 2025-05-22 20:11:54.523 LabApp] JupyterLab application directory is /usr/local/share/jupyter/lab
[I 2025-05-22 20:11:54.524 LabApp] Extension Manager is 'pypi'.
[I 2025-05-22 20:11:54.547 ServerApp] jupyterlab | extension was successfully loaded.
[I 2025-05-22 20:11:54.549 ServerApp] notebook | extension was successfully loaded.
[I 2025-05-22 20:11:54.550 ServerApp] Serving notebooks from local directory: /root
[I 2025-05-22 20:11:54.550 ServerApp] Jupyter Server 2.16.0 is running at:
[I 2025-05-22 20:11:54.550 ServerApp] http://localhost:8123/tree?token=e978588f8f0a685013be07d1c110af0f246e2b3b3f101784
[I 2025-05-22 20:11:54.550 ServerApp]     http://127.0.0.1:8123/tree?token=e978588f8f0a685013be07d1c110af0f246e2b3b3f101784
[I 2025-05-22 20:11:54.550 ServerApp] Kernels will be managed by the Gateway server running at:
[I 2025-05-22 20:11:54.550 ServerApp] http://115.190.41.195:80
[I 2025-05-22 20:11:54.550 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 2025-05-22 20:11:54.553 ServerApp] 
    
    To access the server, open this file in a browser:
        file:///root/.local/share/jupyter/runtime/jpserver-403812-open.html
    Or copy and paste one of these URLs:
        http://localhost:8123/tree?token=e978588f8f0a685013be07d1c110af0f246e2b3b3f101784
        http://127.0.0.1:8123/tree?token=e978588f8f0a685013be07d1c110af0f246e2b3b3f101784

(Optional) Configure a Private Endpoint

  • Create an endpoint: in the console, go to VPC > Private Link > Endpoints > Create Endpoint, select the service name to add, verify it, configure the VPC, and do not enable the DNS name.
  • Available service: select notebook-eg-clb.
  • VPC & subnet: using the same VPC and subnet as the ECS node running the Jupyter Server is recommended.
  • Endpoint domain name: this is the jupyter_gateway_endpoint to configure when starting the Jupyter Server.
  • Security group inbound rules: must allow access from the ECS node where the Jupyter Server is deployed.
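
To verify connectivity from the ECS node through the private endpoint, a quick check such as the following can help. This is a sketch that assumes the gateway exposes the standard Jupyter /api/kernelspecs route and token-based Authorization header; the ep-xxxx domain and the token are placeholders:

# Replace the domain with your endpoint domain and the token with your gateway token.
curl -s -H "Authorization: token {jupyter_gateway_token}" http://ep-xxxx.privatelink.volces.com/api/kernelspecs

A JSON list of kernel specs indicates the endpoint is reachable.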

Common Parameters

Category | Parameter | Description | Default
--- | --- | --- | ---
Driver | spark.driver.memory | Driver on-heap memory size | 12g
Driver | spark.driver.memoryOverhead | Driver off-heap memory size | 4g
Driver | spark.driver.cores | Number of CPU cores for the driver | 4
Driver | spark.driver.maxResultSize | Maximum total size of results returned to the driver | 3g
Driver | spark.driver.extraJavaOptions | JVM options for the driver | —
Executor | spark.executor.cores | CPU cores per executor | 4
Executor | spark.executor.memory | Memory per executor | 12g
Executor | spark.executor.memoryOverhead | Off-heap memory per executor | 4g
Executor | spark.executor.instances | Number of executor instances (dynamic allocation is enabled by default, so this normally does not need to be set) | 1
Dynamic allocation | spark.dynamicAllocation.enabled | Whether dynamic allocation is enabled | true
Dynamic allocation | spark.dynamicAllocation.minExecutors | Minimum number of executors | 1
Dynamic allocation | spark.dynamicAllocation.maxExecutors | Maximum number of executors | 30
Parallelism | spark.sql.files.maxPartitionBytes | Maximum file bytes handled by a single map task | 268435456
Parallelism | spark.sql.shuffle.partitions | Number of partitions created by shuffle operations | 200
AQE | spark.sql.adaptive.enabled | Adaptive query execution switch; covers automatic parallelism tuning, data-skew mitigation, and other optimizations | true
AQE | spark.sql.adaptive.maxNumPostShufflePartitions | Upper bound on dynamically chosen post-shuffle partitions; for shuffle-heavy jobs, raising it (e.g. to 1024) reduces the data processed per task | 500
AQE | spark.sql.adaptive.minNumPostShufflePartitions | Lower bound on post-shuffle partitions chosen by AQE | 5
Other open-source Spark parameters | — | See the official Spark documentation | —
EMR Serverless | serverless.spark.access.key | Your account Access Key | —
EMR Serverless | serverless.spark.secret.key | Your account Secret Key | —
EMR Serverless | serverless.spark.tos.bucket | A TOS bucket used to upload your local jar files; Spark loads the jars from this bucket at job startup. Can be omitted if your files are already in a TOS bucket | —
EMR Serverless | spark.jars | Comma-separated jar file paths loaded at Spark runtime; local and TOS paths are supported | —
EMR Serverless | spark.archives | Comma-separated archive files placed into the Spark runtime working directory; local and TOS paths are supported | —
EMR Serverless | spark.files | Comma-separated additional files placed into the Spark runtime working directory; local and TOS paths are supported | —
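
For illustration, the parameters above are passed through KERNEL_EXTRA_SPARK_OPTS in the same --conf format shown earlier. The sketch below uses placeholder values throughout: the memory size, partition count, and tos:// jar path are assumptions for demonstration, not recommendations:

# All values below are placeholders -- substitute your own keys, queue, and paths.
export KERNEL_EXTRA_SPARK_OPTS="--conf serverless.spark.access.key={your_access_key} --conf serverless.spark.secret.key={your_secret_key} --conf las.execute.queuename={your_emr_queue} --conf spark.executor.memory=12g --conf spark.sql.shuffle.partitions=400 --conf spark.jars=tos://{your_bucket}/jars/demo-udf.jar"

Because these are environment variables, restart the Jupyter server after changing them so that newly launched kernels pick up the new values.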

Submit EMR Serverless Spark Jobs from Jupyter

Start an EMR Serverless Spark Kernel

  1. On the home page, click View > Open JupyterLab to access Jupyter.
  2. Create a new EMR Serverless Spark Kernel.
  3. Confirm the startup status via the status indicator in the upper-right corner.

Note

Once the status changes to Idle, the kernel has started successfully and you can run interactive queries.

Run Interactive Queries with PySpark

Note

When using PySpark, the following variables are already initialized by default; do not modify them: spark, sc, sql.

# These handles are initialized for you by the kernel; shown here for reference only.
from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
sc = spark.sparkContext
sql = spark.sql
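
For example, a quick sanity check using the pre-initialized sql handle (a minimal sketch):

sql("SELECT 1 AS ok").show()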
  1. Query database and table information (see the sketch after this list).
  2. Read and write tables (see the sketch after this list).
  3. Query the EMR Serverless Spark job corresponding to the current kernel session:
spark.conf.get("spark.app.id")

The returned value, for example bq-500785400500785400, is the EMR Serverless job ID. In the EMR console, go to Job Management > Serverless Jobs and search by instance ID to find the corresponding job, where you can view job logs, the Spark UI, and more.
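
For steps 1 and 2 above, a minimal PySpark sketch follows; the database and table names (demo_db, demo_table) are hypothetical and should be replaced with your own:

# 1. Query database and table information.
sql("SHOW DATABASES").show()
sql("SHOW TABLES IN demo_db").show()  # demo_db is a hypothetical database name

# 2. Read and write tables.
df = spark.table("demo_db.demo_table")  # demo_table is a hypothetical table
df.limit(10).show()
df.write.mode("overwrite").saveAsTable("demo_db.demo_table_copy")  # write results back as a new table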