EMR Serverless Jupyter Notebook Best Practices - E-MapReduce - Volcengine

Last updated: 2025.06.23 14:18:15. First published: 2025.06.23 14:18:15.

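Jupyter Notebook is an open-source interactive development tool that lets you write and run code in a web interface and see the results immediately. For more information, see the official Jupyter documentation.
Jupyter can integrate with multiple kernels. This document describes how to set up a self-hosted Jupyter deployment as an efficient development environment for working with EMR Serverless Spark.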

Deploy Jupyter

Install Jupyter Notebook

pip3 install 'jupyter_server>=2.14.0' 'notebook>=7.2.0'

Start Jupyter Notebook

export KERNEL_USERNAME=notebook
export KERNEL_LAUNCH_TIMEOUT=300
export KERNEL_EXTRA_SPARK_OPTS="--conf serverless.spark.access.key={your_access_key} --conf serverless.spark.secret.key={your_secret_key} --conf las.execute.queuename={your_emr_queue}  --conf spark.dynamicAllocation.maxExecutors=50"

jupyter notebook --ip 0.0.0.0 --port {jupyter_port} --allow-root --gateway-url=http://{jupyter_gateway_endpoint} --GatewayClient.auth_token={jupyter_gateway_token}

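The parameters are described below:

Parameter	Required	Description
KERNEL_USERNAME	Yes	Set to "notebook"; do not change it.
KERNEL_LAUNCH_TIMEOUT	Yes	Kernel launch timeout, in seconds.
KERNEL_EXTRA_SPARK_OPTS	Yes	Sets additional parameters. Both EMR Serverless parameters and open-source Spark parameters are supported (see Common parameters). The format is `--conf k1=v1 --conf k2=v2`. Available parameters include: `serverless.spark.access.key`: required, the Access Key of your account; `serverless.spark.secret.key`: required, the Secret Key of your account; `las.execute.queuename`: the EMR Serverless queue name; `spark.dynamicAllocation.minExecutors`: minimum number of executors, default 1; `spark.dynamicAllocation.initialExecutors`: initial number of executors, default 1; `spark.dynamicAllocation.maxExecutors`: maximum number of executors, default 30.
jupyter_port	Yes	Port on which Jupyter listens.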
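jupyter_gateway_endpoint	Yes	Address of the Jupyter gateway service. For reference: North China public endpoint: https://notebook-eg.emr.cn-beijing.volces.com; North China private endpoint: http://ep-xxxx.privatelink.volces.com. For details on the private endpoint, see (Optional) Configure a private network endpoint. These reference addresses already include a scheme, while the command above prefixes `http://` to the placeholder, so make sure the scheme appears only once in `--gateway-url`.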
jupyter_gateway_token	Yes	Jupyter gateway token, provided by EMR Serverless.
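
The `--conf k1=v1 --conf k2=v2` format is easy to get wrong when assembled by hand. As a minimal sketch, the helper below renders a Python dict into that string and checks that the two required keys are present. `build_spark_opts` is a hypothetical name, not part of Jupyter or EMR Serverless, and the values shown are the same placeholders used in the startup command.

```python
# Hypothetical helper: renders a dict as the "--conf k1=v1 --conf k2=v2" string
# expected by KERNEL_EXTRA_SPARK_OPTS. Not part of Jupyter or EMR Serverless.
REQUIRED_KEYS = ("serverless.spark.access.key", "serverless.spark.secret.key")


def build_spark_opts(conf):
    """Return the option string, or raise ValueError if the input is unusable."""
    missing = [key for key in REQUIRED_KEYS if key not in conf]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    parts = []
    for key, value in conf.items():
        pair = f"{key}={value}"
        # Assumption: values are passed unquoted, so whitespace would split the option.
        if any(ch.isspace() for ch in pair):
            raise ValueError(f"whitespace is not allowed in {key!r}")
        parts.append(f"--conf {pair}")
    return " ".join(parts)


opts = build_spark_opts({
    "serverless.spark.access.key": "{your_access_key}",
    "serverless.spark.secret.key": "{your_secret_key}",
    "las.execute.queuename": "{your_emr_queue}",
    "spark.dynamicAllocation.maxExecutors": 50,
})
print(f'export KERNEL_EXTRA_SPARK_OPTS="{opts}"')
```

The printed line can be pasted into the shell before starting Jupyter. Keeping the parameters in a dict makes it harder to drop a `--conf` prefix or a required key by accident.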

After the notebook starts, log output similar to the following is shown:

[I 2025-05-22 20:11:54.327 ServerApp] jupyter_lsp | extension was successfully linked.
[I 2025-05-22 20:11:54.330 ServerApp] jupyter_server_terminals | extension was successfully linked.
[I 2025-05-22 20:11:54.333 ServerApp] jupyterlab | extension was successfully linked.
[I 2025-05-22 20:11:54.337 ServerApp] notebook | extension was successfully linked.
[I 2025-05-22 20:11:54.507 ServerApp] notebook_shim | extension was successfully linked.
[I 2025-05-22 20:11:54.519 ServerApp] notebook_shim | extension was successfully loaded.
[I 2025-05-22 20:11:54.521 ServerApp] jupyter_lsp | extension was successfully loaded.
[I 2025-05-22 20:11:54.522 ServerApp] jupyter_server_terminals | extension was successfully loaded.
[I 2025-05-22 20:11:54.523 LabApp] JupyterLab extension loaded from /usr/local/lib/python3.10/site-packages/jupyterlab
[I 2025-05-22 20:11:54.523 LabApp] JupyterLab application directory is /usr/local/share/jupyter/lab
[I 2025-05-22 20:11:54.524 LabApp] Extension Manager is 'pypi'.
[I 2025-05-22 20:11:54.547 ServerApp] jupyterlab | extension was successfully loaded.
[I 2025-05-22 20:11:54.549 ServerApp] notebook | extension was successfully loaded.
[I 2025-05-22 20:11:54.550 ServerApp] Serving notebooks from local directory: /root
[I 2025-05-22 20:11:54.550 ServerApp] Jupyter Server 2.16.0 is running at:
[I 2025-05-22 20:11:54.550 ServerApp] http://localhost:8123/tree?token=e978588f8f0a685013be07d1c110af0f246e2b3b3f101784
[I 2025-05-22 20:11:54.550 ServerApp]     http://127.0.0.1:8123/tree?token=e978588f8f0a685013be07d1c110af0f246e2b3b3f101784
[I 2025-05-22 20:11:54.550 ServerApp] Kernels will be managed by the Gateway server running at:
[I 2025-05-22 20:11:54.550 ServerApp] http://115.190.41.195:80
[I 2025-05-22 20:11:54.550 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 2025-05-22 20:11:54.553 ServerApp] 
    
    To access the server, open this file in a browser:
        file:///root/.local/share/jupyter/runtime/jpserver-403812-open.html
    Or copy and paste one of these URLs:
        http://localhost:8123/tree?token=e978588f8f0a685013be07d1c110af0f246e2b3b3f101784
        http://127.0.0.1:8123/tree?token=e978588f8f0a685013be07d1c110af0f246e2b3b3f101784

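(Optional) Configure a private network endpoint

Create an endpoint

In the console, go to VPC > PrivateLink > Endpoints > Create Endpoint. Select the service by name and add it, verify it, configure the VPC, and leave the DNS name disabled.

Available service to select: notebook-eg-clb
VPC and subnet: preferably the same VPC and subnet as the ECS node that hosts Jupyter Server

View the endpoint domain name:

This domain name is the value to use for jupyter_gateway_endpoint when starting Jupyter Server.

Security group inbound rules: access must be allowed from the ECS node where Jupyter Server is deployed.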
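
Before starting Jupyter Server, it can help to confirm that the endpoint is reachable from the ECS node, since a blocked security group rule otherwise only shows up as a kernel launch timeout. The sketch below is one way to do this with a plain TCP connection. The function names are hypothetical, and the check only assumes that the gateway listens on the default port for its scheme unless a port is given.

```python
import socket
from urllib.parse import urlsplit


def endpoint_host_port(endpoint):
    """Split an endpoint such as 'http://ep-xxxx.privatelink.volces.com' into (host, port)."""
    # Accept values with or without a scheme; a bare host is treated as http.
    parts = urlsplit(endpoint if "://" in endpoint else f"http://{endpoint}")
    default_port = 443 if parts.scheme == "https" else 80
    return parts.hostname, parts.port or default_port


def is_reachable(endpoint, timeout=5.0):
    """Return True if a TCP connection to the endpoint can be opened within `timeout` seconds."""
    host, port = endpoint_host_port(endpoint)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run `is_reachable("<your endpoint>")` on the ECS node. A `False` result usually points at the security group rule or at a VPC or subnet mismatch. A `True` result only shows that the port is open; it does not validate the gateway token.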

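Common parameters

Parameter type	Parameter key	Description	Default
Driver	`spark.driver.memory`	Driver on-heap memory size	12g
	`spark.driver.memoryOverhead`	Driver off-heap memory size	4g
	`spark.driver.cores`	Number of CPU cores for the Driver	4
	`spark.driver.maxResultSize`	Maximum size of results returned to the Driver	3g
	`spark.driver.extraJavaOptions`	JVM options for the Driver
Executor	`spark.executor.cores`	Number of CPU cores per Executor	4
	`spark.executor.memory`	Memory size per Executor	12g
	`spark.executor.memoryOverhead`	Off-heap memory size per Executor	4g
	`spark.executor.instances`	Number of Executor instances (dynamic allocation is enabled by default, so this parameter does not need to be set)	1
Dynamic allocation	`spark.dynamicAllocation.enabled`	Whether to enable dynamic allocation	true
	`spark.dynamicAllocation.minExecutors`	Minimum number of Executors	1
	`spark.dynamicAllocation.maxExecutors`	Maximum number of Executors	30
File read parallelism	`spark.sql.files.maxPartitionBytes`	Maximum number of bytes of file data read by a single map task	268435456 (256 MiB)
Shuffle parallelism	`spark.sql.shuffle.partitions`	Number of partitions generated by shuffle operations	200
AQE	`spark.sql.adaptive.enabled`	Adaptive Query Execution switch; covers optimizations such as automatic parallelism adjustment and data skew handling	true
	`spark.sql.adaptive.maxNumPostShufflePartitions`	Upper bound on the dynamically adjusted parallelism. For shuffle-heavy jobs, raising it (for example to 1024) reduces the amount of data per task	500
	`spark.sql.adaptive.minNumPostShufflePartitions`	Lower bound used when AQE dynamically adjusts the number of shuffle partitions	5
Other open-source Spark parameters	See the official Spark documentation
EMR Serverless parameters	`serverless.spark.access.key`	The Access Key of your account
	`serverless.spark.secret.key`	The Secret Key of your account
	`serverless.spark.tos.bucket`	A TOS bucket to which your local jar files are uploaded; the Spark job loads the jar files from this bucket at startup. If your files are already in a TOS bucket, this parameter can be omitted
	`spark.jars`	Comma-separated jar file paths that Spark loads at runtime; local files and TOS files are supported
	`spark.archives`	Comma-separated archive files that are loaded into the working directory of the Spark runtime; local files and TOS files are supported
	`spark.files`	Comma-separated additional files that are loaded into the working directory of the Spark runtime; local files and TOS files are supported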
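
To see what the defaults above add up to, the following sketch estimates the peak cores and memory a session could request once dynamic allocation reaches `spark.dynamicAllocation.maxExecutors`. It is a rough estimate under stated assumptions: each process is counted as its memory plus its memoryOverhead, only sizes with a `g` suffix are handled, and how the queue actually accounts for resources is not covered here. `gib` and `peak_footprint` are hypothetical helper names.

```python
def gib(size):
    """Convert a size string with a 'g' suffix, such as '12g', to whole GiB."""
    text = size.strip().lower()
    if not text.endswith("g") or not text[:-1].isdigit():
        raise ValueError(f"only sizes like '12g' are handled in this sketch: {size!r}")
    return int(text[:-1])


def peak_footprint(max_executors, executor_cores=4, executor_memory="12g",
                   executor_overhead="4g", driver_cores=4,
                   driver_memory="12g", driver_overhead="4g"):
    """Estimate total cores and memory (GiB) for the driver plus max_executors executors."""
    per_executor_gib = gib(executor_memory) + gib(executor_overhead)
    driver_gib = gib(driver_memory) + gib(driver_overhead)
    return {
        "cores": driver_cores + max_executors * executor_cores,
        "memory_gib": driver_gib + max_executors * per_executor_gib,
    }


print(peak_footprint(30))  # default spark.dynamicAllocation.maxExecutors
print(peak_footprint(50))  # value used in the startup example
```

With the defaults, 30 executors come to 124 cores and 496 GiB, and 50 executors come to 204 cores and 816 GiB, which is worth comparing against the capacity of the queue before raising `maxExecutors`.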

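Submit EMR Serverless Spark jobs from Jupyter

Start the EMR Serverless Spark kernel

On the home page, click [View] -> [Open JupyterLab] to open Jupyter.

Create a new EMR Serverless Spark kernel.

Confirm the startup status: the status button in the upper-right corner shows whether the kernel has started.

Note

When the status changes to Idle, the kernel has started successfully and interactive queries can be run.

Run interactive queries with PySpark

Note

When using PySpark, the following variables are already initialized by default and should not be modified: spark, sc, sql. They correspond to the following initialization: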

from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
sc = spark.sparkContext
sql = spark.sql

Query database and table information.
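
As a minimal sketch of such queries, the functions below use standard Spark SQL statements through the `spark` session that the kernel has already initialized. The function names and the database and table names passed to them are hypothetical placeholders.

```python
# `spark` is the SparkSession pre-initialized by the kernel; it is passed in
# explicitly so the helpers do not redefine or modify it.
def list_tables(spark, database):
    """Print all databases, then print and return the table names in `database`."""
    spark.sql("SHOW DATABASES").show(truncate=False)
    tables = spark.sql(f"SHOW TABLES IN {database}")
    tables.show(truncate=False)
    return [row["tableName"] for row in tables.collect()]


def describe_table(spark, qualified_name):
    """Print the column names and types of a table such as 'my_db.my_table'."""
    spark.sql(f"DESCRIBE TABLE {qualified_name}").show(truncate=False)
```

In a notebook cell, call for example `list_tables(spark, "default")` and then `describe_table(spark, "default.<table>")` with one of the returned names.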

Read and write tables.
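
One possible round trip, sketched with the standard PySpark DataFrame API: build a small DataFrame, save it as a table, and read it back. The function name, the sample rows, and the table name are illustrative assumptions. Note that `mode("overwrite")` replaces an existing table of the same name, so use a scratch table.

```python
# `spark` is the SparkSession pre-initialized by the kernel.
def write_and_read_back(spark, table_name):
    """Overwrite `table_name` with two sample rows and return it as a DataFrame."""
    rows = [(1, "alice"), (2, "bob")]  # illustrative sample data
    df = spark.createDataFrame(rows, ["id", "name"])
    # Overwrite semantics: any existing table with this name is replaced.
    df.write.mode("overwrite").saveAsTable(table_name)
    return spark.table(table_name)
```

In a notebook cell, `write_and_read_back(spark, "<your_db>.<scratch_table>").show()` writes the rows and displays what was read back.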

Find the EMR Serverless Spark job that corresponds to the current kernel session.

spark.conf.get("spark.app.id")

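In the returned value bq-500785400, 500785400 is the EMR Serverless job ID. In the EMR console, go to Job Management > Serverless Jobs and search by [Instance ID] to find the job, where you can view the job logs, the Spark UI, and more.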
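
The job ID can also be extracted in the notebook itself. The sketch below assumes the application ID always has the form of a prefix, a hyphen, and a numeric ID, as in the single example above; `job_id_from_app_id` is a hypothetical helper name.

```python
def job_id_from_app_id(app_id):
    """Return the numeric job ID from an application ID such as 'bq-500785400'."""
    # Assumption: the format is '<prefix>-<digits>', inferred from one example.
    prefix, sep, job_id = app_id.partition("-")
    if not prefix or not sep or not job_id.isdigit():
        raise ValueError(f"unexpected application ID format: {app_id!r}")
    return job_id


print(job_id_from_app_id("bq-500785400"))  # prints 500785400
```

In a notebook cell, `job_id_from_app_id(spark.conf.get("spark.app.id"))` returns the value to search for in the console.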