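Jupyter Notebook is an open-source interactive development tool that lets you write and run code in a web interface and see the results in real time. For more details, see the official Jupyter documentation.
Jupyter can integrate with multiple kernels. This article describes how to set up a self-hosted Jupyter deployment as an efficient development environment for working with EMR Serverless Spark.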
Install Jupyter:

    pip3 install 'jupyter_server>=2.14.0' 'notebook>=7.2.0'

Set the environment variables and start the Notebook server:

    export KERNEL_USERNAME=notebook
    export KERNEL_LAUNCH_TIMEOUT=300
    export KERNEL_EXTRA_SPARK_OPTS="--conf serverless.spark.access.key={your_access_key} --conf serverless.spark.secret.key={your_secret_key} --conf las.execute.queuename={your_emr_queue} --conf spark.dynamicAllocation.maxExecutors=50"
    jupyter notebook --ip 0.0.0.0 --port {jupyter_port} --allow-root --gateway-url=http://{jupyter_gateway_endpoint} --GatewayClient.auth_token={jupyter_gateway_token}

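The parameters are described below:

Parameter | Required | Description |
---|---|---|
KERNEL_USERNAME | Yes | Set to "notebook"; do not change it |
KERNEL_LAUNCH_TIMEOUT | Yes | Kernel launch timeout |
KERNEL_EXTRA_SPARK_OPTS | Yes | Sets additional parameters. Both EMR Serverless parameters and open-source Spark parameters are supported (see Common parameters). Use the --conf key=value format shown in the command above |
jupyter_port | Yes | Port on which the Notebook server listens |
jupyter_gateway_endpoint | Yes | Address of the Jupyter gateway service |
jupyter_gateway_token | Yes | Jupyter gateway token, provided by EMR Serverless |
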
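The launch settings above can also be assembled programmatically, which helps avoid quoting mistakes in KERNEL_EXTRA_SPARK_OPTS. The following is a minimal sketch rather than an official tool: the helper names build_spark_opts and build_launch are hypothetical, and every value in braces is a placeholder to replace with your own.

```python
import shlex


def build_spark_opts(conf):
    """Render a dict of settings as '--conf key=value' pairs, in insertion order."""
    return " ".join(f"--conf {key}={value}" for key, value in conf.items())


def build_launch(port, gateway_endpoint, gateway_token, conf):
    """Return (env, argv) for starting Jupyter against the gateway.

    Hypothetical helper: it only mirrors the variables and flags shown above.
    """
    env = {
        "KERNEL_USERNAME": "notebook",    # fixed value, per the parameter table
        "KERNEL_LAUNCH_TIMEOUT": "300",   # same value as in the command above
        "KERNEL_EXTRA_SPARK_OPTS": build_spark_opts(conf),
    }
    argv = [
        "jupyter", "notebook",
        "--ip", "0.0.0.0",
        "--port", str(port),
        "--allow-root",
        f"--gateway-url=http://{gateway_endpoint}",
        f"--GatewayClient.auth_token={gateway_token}",
    ]
    return env, argv


env, argv = build_launch(
    port=8123,                                      # example port
    gateway_endpoint="{jupyter_gateway_endpoint}",  # placeholder
    gateway_token="{jupyter_gateway_token}",        # placeholder
    conf={
        "serverless.spark.access.key": "{your_access_key}",
        "serverless.spark.secret.key": "{your_secret_key}",
        "las.execute.queuename": "{your_emr_queue}",
        "spark.dynamicAllocation.maxExecutors": 50,
    },
)
print(env["KERNEL_EXTRA_SPARK_OPTS"])
print(shlex.join(argv))
```

To actually start the server from Python, the pair could be passed to subprocess.run(argv, env={**os.environ, **env}); keeping the settings in a dict makes it easier to review which parameters a session will receive.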
After the Notebook server starts, log output similar to the following appears:

    [I 2025-05-22 20:11:54.327 ServerApp] jupyter_lsp | extension was successfully linked.
    [I 2025-05-22 20:11:54.330 ServerApp] jupyter_server_terminals | extension was successfully linked.
    [I 2025-05-22 20:11:54.333 ServerApp] jupyterlab | extension was successfully linked.
    [I 2025-05-22 20:11:54.337 ServerApp] notebook | extension was successfully linked.
    [I 2025-05-22 20:11:54.507 ServerApp] notebook_shim | extension was successfully linked.
    [I 2025-05-22 20:11:54.519 ServerApp] notebook_shim | extension was successfully loaded.
    [I 2025-05-22 20:11:54.521 ServerApp] jupyter_lsp | extension was successfully loaded.
    [I 2025-05-22 20:11:54.522 ServerApp] jupyter_server_terminals | extension was successfully loaded.
    [I 2025-05-22 20:11:54.523 LabApp] JupyterLab extension loaded from /usr/local/lib/python3.10/site-packages/jupyterlab
    [I 2025-05-22 20:11:54.523 LabApp] JupyterLab application directory is /usr/local/share/jupyter/lab
    [I 2025-05-22 20:11:54.524 LabApp] Extension Manager is 'pypi'.
    [I 2025-05-22 20:11:54.547 ServerApp] jupyterlab | extension was successfully loaded.
    [I 2025-05-22 20:11:54.549 ServerApp] notebook | extension was successfully loaded.
    [I 2025-05-22 20:11:54.550 ServerApp] Serving notebooks from local directory: /root
    [I 2025-05-22 20:11:54.550 ServerApp] Jupyter Server 2.16.0 is running at:
    [I 2025-05-22 20:11:54.550 ServerApp] http://localhost:8123/tree?token=e978588f8f0a685013be07d1c110af0f246e2b3b3f101784
    [I 2025-05-22 20:11:54.550 ServerApp] http://127.0.0.1:8123/tree?token=e978588f8f0a685013be07d1c110af0f246e2b3b3f101784
    [I 2025-05-22 20:11:54.550 ServerApp] Kernels will be managed by the Gateway server running at:
    [I 2025-05-22 20:11:54.550 ServerApp] http://115.190.41.195:80
    [I 2025-05-22 20:11:54.550 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
    [C 2025-05-22 20:11:54.553 ServerApp]
        To access the server, open this file in a browser:
            file:///root/.local/share/jupyter/runtime/jpserver-403812-open.html
        Or copy and paste one of these URLs:
            http://localhost:8123/tree?token=e978588f8f0a685013be07d1c110af0f246e2b3b3f101784
            http://127.0.0.1:8123/tree?token=e978588f8f0a685013be07d1c110af0f246e2b3b3f101784

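When the server is started from a script, the access URL and the gateway address can be read from the startup log instead of being copied by hand. The following is a minimal sketch that assumes the log format shown above (it may differ across Jupyter Server versions); parse_startup_log is a hypothetical helper, and the sample lines are taken from the log output above.

```python
import re
from urllib.parse import parse_qs, urlparse

SAMPLE_LOG = """\
[I 2025-05-22 20:11:54.550 ServerApp] Jupyter Server 2.16.0 is running at:
[I 2025-05-22 20:11:54.550 ServerApp] http://localhost:8123/tree?token=e978588f8f0a685013be07d1c110af0f246e2b3b3f101784
[I 2025-05-22 20:11:54.550 ServerApp] Kernels will be managed by the Gateway server running at:
[I 2025-05-22 20:11:54.550 ServerApp] http://115.190.41.195:80
"""


def parse_startup_log(text):
    """Extract the first tokenized access URL, its token, and the gateway URL."""
    lines = text.splitlines()

    # The access URL is the first logged URL that carries a token query parameter.
    urls = re.findall(r"\] (https?://\S+)", text)
    access_url = next((u for u in urls if "token=" in u), None)
    token = None
    if access_url is not None:
        token = parse_qs(urlparse(access_url).query).get("token", [None])[0]

    # The gateway URL is logged on the line after the "Kernels will be managed" notice.
    gateway_url = None
    for index, line in enumerate(lines[:-1]):
        if "Kernels will be managed by the Gateway server" in line:
            match = re.search(r"https?://\S+", lines[index + 1])
            gateway_url = match.group(0) if match else None
            break

    return {"access_url": access_url, "token": token, "gateway_url": gateway_url}


info = parse_startup_log(SAMPLE_LOG)
print(info["access_url"])
print(info["gateway_url"])
```

Checking that gateway_url matches the jupyter_gateway_endpoint you configured is a quick way to confirm that kernels will be launched remotely rather than on the local machine.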
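In the console, go to VPC > PrivateLink > Endpoints > Create Endpoint, select and add the service name, verify it, configure the VPC, and do not enable the DNS name.
notebook-eg-clb is the jupyter_gateway_endpoint configured when starting the Jupyter Server.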
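
Parameter type | Parameter key | Description | Default |
---|---|---|---|
Driver | | Driver on-heap memory size | 12g |
| | | Driver off-heap memory size | 4g |
| | | Number of CPU cores for the Driver program | 4 |
| | | Maximum size of the results returned to the Driver | 3g |
| | | Driver JVM options | |
Executor | | Number of CPU cores per Executor | 4 |
| | | Memory size per Executor | 12g |
| | | Off-heap memory size per Executor | 4g |
| | | Number of Executor instances (dynamic allocation is enabled by default, so this parameter does not need to be set) | 1 |
Dynamic allocation | | Whether dynamic allocation is enabled | true |
| | | Minimum number of Executors | 1 |
| | | Maximum number of Executors | 30 |
File read parallelism | | File size handled by a single map task, in bytes | 268435456 |
| | | Number of partitions generated by shuffle operations | 200 |
AQE | | Adaptive execution switch; covers optimizations such as automatic parallelism tuning and data skew handling | true |
| | | Maximum dynamic parallelism. For shuffle-heavy jobs, raising it (for example, to 1024) reduces the amount of data each task handles | 500 |
| | | Lower bound on the number of shuffle partitions when AQE adjusts them dynamically | 5 |
Other open-source Spark parameters | | | |
EMR Serverless parameters | | Access Key of your account | |
| | | Secret Key of your account | |
| | | A TOS bucket to which your local jar files are uploaded; the Spark job loads the jar files from this bucket at startup. Not required if your files are already in a TOS bucket | |
| | | Comma-separated jar file paths that Spark loads at runtime; local files and TOS files are supported | |
| | | Comma-separated archive files that are loaded into the working directory of the Spark runtime; local files and TOS files are supported | |
| | | Comma-separated other files that are loaded into the working directory of the Spark runtime; local files and TOS files are supported | |
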
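The defaults in the table translate into a rough memory and CPU footprint for a session, which is useful when choosing the maximum number of Executors. The following is a back-of-the-envelope sketch, assuming that each process needs its on-heap plus off-heap memory and that dynamic allocation can scale anywhere between the minimum and maximum; the SessionSizing class is hypothetical, and the accounting actually used by EMR Serverless may differ.

```python
from dataclasses import dataclass


@dataclass
class SessionSizing:
    """Default values copied from the parameter table (memory in GB)."""
    driver_heap_gb: int = 12
    driver_off_heap_gb: int = 4
    driver_cores: int = 4
    executor_heap_gb: int = 12
    executor_off_heap_gb: int = 4
    executor_cores: int = 4
    min_executors: int = 1
    max_executors: int = 30

    def footprint(self, executors):
        """Return (total memory in GB, total CPU cores) for a given executor count."""
        memory_gb = (
            self.driver_heap_gb + self.driver_off_heap_gb
            + executors * (self.executor_heap_gb + self.executor_off_heap_gb)
        )
        cores = self.driver_cores + executors * self.executor_cores
        return memory_gb, cores


sizing = SessionSizing()
print(sizing.footprint(sizing.min_executors))  # (32, 8)
print(sizing.footprint(sizing.max_executors))  # (496, 124)
```

With spark.dynamicAllocation.maxExecutors=50, as in the launch command earlier, the same arithmetic gives 816 GB and 204 cores at peak.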
Select the EMR Serverless Spark Kernel.
Note: When the kernel status changes to Idle, the kernel has started successfully and you can run interactive queries.
Note: When you use PySpark, the following system variables are already initialized by default; do not modify them: spark, sc, sql.

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    sc = spark.sparkContext
    sql = spark.sql

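Because spark is already initialized, notebook cells can use it directly without creating a session. The following is a minimal sketch with a hypothetical helper, preview_query, that takes the session as an argument so that it never rebinds spark, sc, or sql; it relies only on the standard sql, limit, and collect calls.

```python
def preview_query(session, query, limit=10):
    """Run a SQL query through the given SparkSession and return up to `limit` rows."""
    return session.sql(query).limit(limit).collect()


# In a notebook cell on this kernel, pass the preinitialized session, for example:
#   rows = preview_query(spark, "SHOW DATABASES")
#   for row in rows:
#       print(row)
```

Capping the result with limit before collect keeps a large result set from being pulled into the Driver, which is bounded by the maximum result size listed in the parameter table.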
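To locate the corresponding EMR Serverless Spark job, run spark.conf.get("spark.app.id"). In the returned value, for example bq-500785400, the part 500785400 is the EMR Serverless job ID. Go to EMR console > Job Management > Serverless Jobs and search by instance ID to find the job, where you can view the job logs, the Spark UI, and other details.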
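Extracting the job ID can be done in the notebook itself. The following is a minimal sketch, assuming the application ID always has the prefix-digits shape of the example above (bq-500785400); job_id_from_app_id is a hypothetical helper.

```python
import re


def job_id_from_app_id(app_id):
    """Return the numeric job ID from an application ID such as 'bq-500785400'."""
    match = re.fullmatch(r"[A-Za-z]+-(\d+)", app_id)
    if match is None:
        raise ValueError(f"unexpected spark.app.id format: {app_id!r}")
    return match.group(1)


print(job_id_from_app_id("bq-500785400"))  # 500785400

# In a notebook cell on this kernel:
#   job_id = job_id_from_app_id(spark.conf.get("spark.app.id"))
```

Raising an error on an unexpected format, rather than returning a partial match, avoids searching the console with a wrong ID.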