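Deploy the Ray service in EMR on VKE.
To run RayJob jobs with kubectl, install the kubectl tool (see Install and Set Up kubectl) and configure the connection information for your Volcano Engine VKE cluster (see Connect to a VKE Cluster).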
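The prerequisites above can be checked before submitting anything. The following is a minimal sketch, not part of the product tooling: `kubectl_ready` is a hypothetical helper name, and it assumes that a successful `kubectl cluster-info` is a good enough sign that the connection to the VKE cluster is configured.

```python
import shutil
import subprocess


def kubectl_ready(which=shutil.which, run=subprocess.run):
    """Return True if kubectl is on PATH and can reach the configured cluster.

    `which` and `run` are injectable so the check can be exercised without
    a real cluster (hypothetical helper, for illustration only).
    """
    if which("kubectl") is None:
        return False  # kubectl is not installed or not on PATH
    # `kubectl cluster-info` exits non-zero when no usable kubeconfig is
    # found or the API server cannot be reached.
    result = run(["kubectl", "cluster-info"], capture_output=True, text=True)
    return result.returncode == 0
```

Calling `kubectl_ready()` with no arguments runs the real check against whatever cluster the current kubeconfig points to.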
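After the Ray service has been created in Volcano Engine EMR on VKE, you can submit a RayJob with the following steps:
Write the RayJob configuration file: First, create a configuration file in YAML format that defines the RayJob's resource requirements, the Ray cluster configuration, the job entry point, and related settings. You can start from the RayJob YAML template provided below.
Apply the RayJob configuration: Use the kubectl command-line tool to apply the RayJob configuration file to the Volcano Engine VKE cluster. Open a terminal or command-line interface and run the following command:
kubectl apply -f <RayJob configuration file name>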
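Check the RayJob status:
kubectl get rayjob
# Or get more detailed status information:
kubectl describe rayjob <RayJob name>
View logs: use kubectl to view the logs of a Pod:
kubectl logs <name of the Pod created by the RayJob>
If Ingress is enabled in the RayJob configuration, find the corresponding cluster in the EMR console and open the Ray Dashboard UI through the quick connection to view job execution information.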
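If the RayJob configuration does not set shutdownAfterJobFinishes: true, you need to delete the RayJob and its related Kubernetes resources manually. Use the following command to delete the RayJob and all resources associated with it:
kubectl delete -f <RayJob configuration file name>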
# Or use the following command
kubectl delete rayjob <rayjob-name>
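The status-checking step can also be scripted. The sketch below wraps kubectl with `subprocess` and polls a RayJob until it finishes. It assumes that the RayJob resource exposes its state in `.status.jobStatus` with the terminal values `SUCCEEDED`, `FAILED`, and `STOPPED`, as KubeRay does; the helper names `kubectl` and `wait_for_rayjob` are hypothetical, so verify the field against your cluster with `kubectl describe rayjob` first.

```python
import subprocess
import time


def kubectl(*args, run=subprocess.run):
    """Run a kubectl command and return its stdout with whitespace stripped."""
    result = run(["kubectl", *args], capture_output=True, text=True, check=True)
    return result.stdout.strip()


def wait_for_rayjob(name, poll_seconds=10, timeout_seconds=3600,
                    run=subprocess.run, sleep=time.sleep):
    """Poll the RayJob until it reaches a terminal state and return that state.

    Assumption: the state is published in .status.jobStatus.
    """
    terminal = {"SUCCEEDED", "FAILED", "STOPPED"}
    waited = 0
    while waited <= timeout_seconds:
        status = kubectl("get", "rayjob", name,
                         "-o", "jsonpath={.status.jobStatus}", run=run)
        if status in terminal:
            return status
        # The status is empty until the job has been picked up, so keep polling.
        sleep(poll_seconds)
        waited += poll_seconds
    raise TimeoutError(f"RayJob {name} did not finish within {timeout_seconds}s")
```

A typical sequence would be `kubectl("apply", "-f", path)`, then `wait_for_rayjob("rayjob-sample")`, then `kubectl("delete", "rayjob", "rayjob-sample")` if the cluster is not released automatically.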
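In addition, if your RayJob needs to interact with external storage or services, make sure the corresponding storage classes, PVCs, Secrets, and similar resources are already configured in the Volcano Engine VKE cluster.
The YAML template used by the RayJob is shown below. You can add, remove, and modify fields to suit your scenario.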
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: <PVC name>
spec:
  storageClassName: ebs-essd # Uses a cloud disk; Volcano Engine also supports NAS, CloudFS, and TOS
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 40Gi # Adjust to your actual needs
---
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample
  annotations:
    # This annotation is required; without it, accessing the dashboard returns 404.
    nginx.ingress.kubernetes.io/rewrite-target: /$1
spec:
  entrypoint: python ***.py
  # shutdownAfterJobFinishes specifies whether the RayCluster should be deleted after the RayJob finishes. Default is false.
  shutdownAfterJobFinishes: true
  # ttlSecondsAfterFinished specifies the number of seconds after which the RayCluster will be deleted after the RayJob finishes.
  ttlSecondsAfterFinished: 300
  # rayClusterSpec specifies the RayCluster instance to be created by the RayJob controller.
  rayClusterSpec:
    # Enable ingress
    enableIngress: true
    rayVersion: '2.9.3' # should match the Ray version in the image of the containers
    # If enableInTreeAutoscaling is true, the autoscaler sidecar will be added to the Ray head pod.
    enableInTreeAutoscaling: true
    autoscalerOptions:
      # upscalingMode is "Default" or "Aggressive" or "Conservative"
      # Conservative: Upscaling is rate-limited; the number of pending worker pods is at most the size of the Ray cluster.
      # Default: Upscaling is not rate-limited.
      # Aggressive: An alias for Default; upscaling is not rate-limited.
      upscalingMode: Conservative
      # idleTimeoutSeconds is the number of seconds to wait before scaling down a worker pod which is not using Ray resources.
      idleTimeoutSeconds: 60
      # imagePullPolicy optionally overrides the autoscaler container's default image pull policy (IfNotPresent).
      imagePullPolicy: IfNotPresent
      # Optionally specify the autoscaler container's securityContext.
      securityContext: {}
      env: []
      envFrom: []
    headGroupSpec:
      # Enable ingress
      enableIngress: true
      rayStartParams:
        dashboard-host: '0.0.0.0'
      template:
        spec:
          initContainers:
            - name: git-clone
              image: alpine/git
              command:
                - /bin/sh
                - -c
                - |
                  git clone <code repository URL> /app
              volumeMounts:
                - name: esb
                  mountPath: /app
          containers:
            - name: ray-head
              image: <image address>
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
              resources:
                limits:
                  cpu: "1"
                requests:
                  cpu: "1"
              volumeMounts:
                - name: esb
                  mountPath: /app
          volumes:
            - name: esb
              # The code can live on a cloud disk, or this volume can simply be configured as an emptyDir.
              persistentVolumeClaim:
                claimName: <PVC name>
    workerGroupSpecs:
      # the pod replicas in this worker group
      - replicas: 1
        minReplicas: 1
        maxReplicas: 5
        # logical group name, here called autoscaling-group; it can also be functional
        groupName: autoscaling-group
        # The `rayStartParams` are used to configure the `ray start` command.
        # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
        # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
        rayStartParams: {}
        # pod template
        template:
          spec:
            imagePullSecrets:
              - name: <image pull secret>
            containers:
              - name: ray-worker
                image: <image repository address>
                # preStop is a container lifecycle hook, so it must be nested under `lifecycle`.
                lifecycle:
                  preStop:
                    exec:
                      command: [ "/bin/sh","-c","ray stop" ]
                resources:
                  limits:
                    cpu: "1"
                    nvidia.com/gpu: 1
                  requests:
                    cpu: "1"
                    nvidia.com/gpu: 1
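When many jobs share the same template, the manifest can be generated instead of edited by hand. The sketch below builds a reduced version of the template as a Python dict and prints it as JSON, which `kubectl apply -f` accepts as well as YAML. It deliberately leaves out the PVC, the init container, Ingress, and the GPU requests. `build_rayjob`, the entrypoint `python main.py`, and the image `example.com/ray:2.9.3` are hypothetical placeholders, not values from this product.

```python
import json


def build_rayjob(name, entrypoint, image, min_workers=1, max_workers=5,
                 shutdown_after_finish=True, ttl_seconds=300):
    """Return a reduced RayJob manifest as a plain dict (illustrative only)."""

    def pod_template(container_name):
        # Head and worker Pods use the same image and CPU sizing in this sketch.
        return {"spec": {"containers": [{
            "name": container_name,
            "image": image,
            "resources": {"limits": {"cpu": "1"}, "requests": {"cpu": "1"}},
        }]}}

    return {
        "apiVersion": "ray.io/v1",
        "kind": "RayJob",
        "metadata": {"name": name},
        "spec": {
            "entrypoint": entrypoint,
            "shutdownAfterJobFinishes": shutdown_after_finish,
            "ttlSecondsAfterFinished": ttl_seconds,
            "rayClusterSpec": {
                "rayVersion": "2.9.3",
                "enableInTreeAutoscaling": True,
                "headGroupSpec": {
                    "rayStartParams": {"dashboard-host": "0.0.0.0"},
                    "template": pod_template("ray-head"),
                },
                "workerGroupSpecs": [{
                    "groupName": "autoscaling-group",
                    "replicas": min_workers,
                    "minReplicas": min_workers,
                    "maxReplicas": max_workers,
                    "rayStartParams": {},
                    "template": pod_template("ray-worker"),
                }],
            },
        },
    }


# Hypothetical values; replace them with your own entrypoint and image.
manifest = build_rayjob("rayjob-sample", "python main.py", "example.com/ray:2.9.3")
print(json.dumps(manifest, indent=2))
```

Redirecting the output to a file such as `rayjob.json` gives a manifest that can be applied the same way as the YAML file.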
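Key parameters:
Parameter | Description |
---|---|
enableIngress | Whether to enable Ingress. |
enableInTreeAutoscaling | Whether to enable autoscaling. |
shutdownAfterJobFinishes | Whether to release the RayCluster after the job finishes. |
ttlSecondsAfterFinished | How long to wait after the job finishes before releasing the RayCluster, in seconds. |
idleTimeoutSeconds | How long an idle worker waits before it is scaled down, in seconds. |
storageClassName | The StorageClass of the PVC. Currently supported values include ebs-essd and ebs-ssd. |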
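Several of these parameters only make sense in combination, so a small pre-submit check can catch inconsistent settings. The sketch below inspects a manifest that has already been loaded into a Python dict. `check_rayjob` is a hypothetical helper, and the rules it encodes (TTL only matters when the cluster is released automatically, `autoscalerOptions` only matters when autoscaling is enabled, and `minReplicas <= replicas <= maxReplicas`) are assumptions drawn from the parameter descriptions, not checks performed by the product.

```python
def check_rayjob(manifest):
    """Return a list of warnings about inconsistent RayJob settings."""
    warnings = []
    spec = manifest.get("spec", {})
    cluster = spec.get("rayClusterSpec", {})

    if not spec.get("shutdownAfterJobFinishes", False):
        warnings.append("shutdownAfterJobFinishes is not true: "
                        "the RayJob must be deleted manually after it finishes")
        if "ttlSecondsAfterFinished" in spec:
            warnings.append("ttlSecondsAfterFinished is set, but the RayCluster "
                            "is not released automatically")

    if "autoscalerOptions" in cluster and not cluster.get("enableInTreeAutoscaling", False):
        warnings.append("autoscalerOptions is set, but enableInTreeAutoscaling is not true")

    for group in cluster.get("workerGroupSpecs", []):
        keys = ("minReplicas", "replicas", "maxReplicas")
        # Only compare the three values when all of them are present.
        if all(key in group for key in keys):
            low, current, high = (group[key] for key in keys)
            if not low <= current <= high:
                warnings.append(f"{group.get('groupName', '?')}: expected "
                                "minReplicas <= replicas <= maxReplicas")
    return warnings
```

An empty list means none of these assumed rules was violated; it does not guarantee that the cluster will accept the manifest.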
For the Ray image address and image pull secret, see Image Reference; you can also use a custom image.