You need to enable JavaScript to run this app.
导航
基础使用
最近更新时间:2023.11.27 14:22:06首次发布时间:2022.09.22 16:55:45

快速开始 中成功送出了第一个 Primus 训练任务,现在您可以试着使用 Primus 进行分布式的 TensorFlow 训练任务吧!在这里会示范三种不同的 TensorFlow 分布式策略依序为 Single Node,MultiWorkerMirrored 以及 ParameterServer。

1 准备工作

由于 TensorFlow 训练需要训练资料以及 Python 环境,在这里您需要进行更多的准备工作!

# Change to yarn user
$ su --shell=/bin/bash - yarn

# Create the workspace
$ mkdir ~/primus-playground
$ cd ~/primus-playground
$ cp -r /usr/lib/emr/current/tensorflow_on_yarn/examples .

# Build the Python virtual environment
$ cd examples/shared/venv 
$ ./build.sh

# Prepare the workspace on HDFS and the datasets
$ cd ~/primus-playground/
$ hdfs dfs -mkdir mnist
$ hdfs dfs -mkdir mnist/models
$ hdfs dfs -put examples/shared/mnist/data mnist

注意

  1. 在教学里,会透过 pip instal 安装需要的 Python package 制作 Python 虚拟环境,因此需要将集群的 master node 绑定公网 IP。但是如果因为各种因素需要在本机制作一个 Python 虚拟环境,可以参考:高阶使用

  2. 同时 EMR DataScience 集群上已经安装了 tensorflow 以及 tensorflow-io 两个 Python package,因此如果日后的训练不需要其他的 Python package,在使用上可以跳过制作 Python 虚拟环境的步骤。

  3. 不同 EMR 版本中节点的域名命名方式可能不同,所以本章节示例代码中“emr-master-1”可参考 EMR 的域名规则做相应调整。


2 开始训练!

在一切准备工作就绪之后,您就可以开始分布式的 TensorFlow 训练了!

2.1 Single Node

首先您可以先来观察一下 Primus 训练配置,从配置中可以发现在设定上相较于 Hello Primus,多指定了训练资源,其中包含了 Primus virtual environent 跟训练脚本,同时有了更复杂的训练指令!

{
  "name": "primus_tensorflow_single",
  "files": [ 
    "examples/shared/venv/venv.tar.gz", // Python virtual environent
    "examples/tensorflow-single"        // 训练脚本资料夹路径
  ],
  "role": [
    {
      "roleName": "main",
      "num": 1, // 单点训练
      "vcores": 1,
      "memoryMb": 512,
      "jvmMemoryMb": 512,
      "command": "./tensorflow-single/main.sh venv.tar.gz", // 训练指令
      "successPercent": 100,
      "failover": {
        "commonFailoverPolicy": {
          "commonFailover": {
            "maxFailureTimes": 10,
            "maxFailurePolicy": "FAIL_ATTEMPT"
          }
        }
      }
    }
  ]
}

使用 primus-submit 提交训练!

# Submit Primus application
$ cd ~/primus-playground
$ primus-submit --primus_conf examples/tensorflow-single/primus_config.json
...
22/03/03 18:36:47 INFO impl.YarnClientImpl: Submitted application <YARN-APPLICATION-ID>
22/03/03 18:36:47 INFO client.YarnSubmitCmdRunner: Tracking URL: http://emr-master-1:8088/proxy/<YARN-APPLICATION-ID>/
22/03/03 18:36:57 INFO client.YarnSubmitCmdRunner: Training successfully started. Scheduling took 10010 ms.
22/03/03 18:38:18 INFO client.YarnSubmitCmdRunner: State: FINISHED  Progress: 100.0%
22/03/03 18:38:18 INFO client.YarnSubmitCmdRunner: Application <YARN-APPLICATION-ID> finished with state FINISHED at 2022-03-03 18:38
22/03/03 18:38:18 INFO client.YarnSubmitCmdRunner: Final Application Status: SUCCEEDED
...

# Observe YARN logs 
$ yarn logs --applicationId <YARN-APPLICATION-ID> | grep -E "Epoch|FIN"
...
+ echo FIN
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
FIN
...

最后因为这个范例有将模型输出到 HDFS 上,所以您可以透过 Python 脚本测试模型的表现!

$ cd ~/primus-playground/examples/tensorflow-single
 
$ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$JAVA_HOME/jre/lib/amd64/server/
$ export HADOOP_HDFS_HOME=/usr/lib/emr/current/hadoop
$ export CLASSPATH=$(hadoop classpath --glob)

$ python3.9 evaluate.py \
  --mnist hdfs://emr-master-1:8020/user/yarn/mnist/data \
  --model hdfs://emr-master-1:8020/user/yarn/mnist/models/model-single
...
Model accuracy: [0.29252758622169495, 0.9218999743461609]
...

2.2 Multi Worker Mirrored

在提交 Primus 训练任务之前,您可以观察一下 Primus 训练配置。可以快速发现 Multi Worker Mirrored 训练的 Primus 训练配置和 Single Node 训练的 Primus 训练配置非常相似,主要的差异为Multi Worker Mirrored 的训练需要多个节点来完成 ,因此在 Primus 训练配置中,将 “num” 的节点数设定成 2。

{
  "name": "primus_tensorflow_multiworkermirrored",
  "maxAppAttempts": 1,
  "files": [
    "examples/shared/venv/venv.tar.gz",
    "examples/tensorflow-multiworkermirrored"
  ],
  "role": [
    {
      "roleName": "worker",
      "num": 2, // 双点训练
      "vcores": 1,
      "memoryMb": 512,
      "jvmMemoryMb": 512,
      "command": "./tensorflow-multiworkermirrored/main.sh venv.tar.gz",
      "successPercent": 100,
      "failover": {
        "hybridDeploymentFailoverPolicy": {
          "commonFailover": {
            "maxFailureTimes": 10,
            "maxFailurePolicy": "FAIL_ATTEMPT"
          }
        }
      }
    }
  ]
}

使用 primus-submit 提交训练!

# Submit Primus application
$ cd ~/primus-playground
$ primus-submit --primus_conf examples/tensorflow-multiworkermirrored/primus_config.json
...
22/03/03 18:42:53 INFO impl.YarnClientImpl: Submitted application <YARN-APPLICATION-ID>
22/03/03 18:42:53 INFO client.YarnSubmitCmdRunner: Tracking URL: http://emr-master-1:8088/proxy/<YARN-APPLICATION-ID>/
22/03/03 18:43:03 INFO client.YarnSubmitCmdRunner: Training successfully started. Scheduling took 10014 ms.
22/03/03 18:44:34 INFO client.YarnSubmitCmdRunner: State: FINISHED  Progress: 100.0%
22/03/03 18:44:34 INFO client.YarnSubmitCmdRunner: Application <YARN-APPLICATION-ID> finished with state FINISHED at 2022-03-03 18:44
22/03/03 18:44:34 INFO client.YarnSubmitCmdRunner: Final Application Status: SUCCEEDED
...

# Observe YARN logs
$ yarn logs --applicationId <YARN-APPLICATION-ID> | grep -E "Epoch|FIN"
...
+ echo FIN
Epoch 1/3
Epoch 2/3
Epoch 3/3
FIN
+ echo FIN
Epoch 1/3
Epoch 2/3
Epoch 3/3
FIN
...

同样的因为这个范例最后有将模型输出到 HDFS 上,所以您也可以透过 Python 脚本测试模型的表现!

$ cd ~/primus-playground/examples/tensorflow-multiworkermirrored
 
$ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$JAVA_HOME/jre/lib/amd64/server/
$ export HADOOP_HDFS_HOME=/usr/lib/emr/current/hadoop
$ export CLASSPATH=$(hadoop classpath --glob)

$ python3.9 evaluate.py \
  --mnist hdfs://emr-master-1:8020/user/yarn/mnist/data \
  --model hdfs://emr-master-1:8020/user/yarn/mnist/models/model-multiworkermirrored
...
Model accuracy: [2.0982000827789307, 0.5758000016212463]
...

2.3 Parameter Server

一样的从观察 Primus 训练配置开始,相较于 Singe Node 的 Primus 训练配置,Parameter Server 所需要的 Primus 训练配置巨大了许多。
但是其中最主要的差别只有两个部分,分别是更多的角色以及 PS 和 Worker 这两个角色的退出条件为 0%,因为在 TensorFlow Parameter Server 的分布式策略中,这两种角色属于常驻型角色因此训练进程不会自行退出。

{
  "name": "primus_tensorflow_parameterserver",
  "maxAppAttempts": 1,
  "files": [
    "examples/basics/shared/venv/venv.tar.gz",
    "examples/basics/tensorflow-parameterserver/main.sh", 
    "examples/basics/tensorflow-parameterserver/chief.py",
    "examples/basics/tensorflow-parameterserver/ps.py",
    "examples/basics/tensorflow-parameterserver/worker.py"
  ],
  "role": [ // 更多的角色!
    {
      "roleName": "chief",
      "num": 1,
      "vcores": 1,
      "memoryMb": 512,
      "jvmMemoryMb": 512,
      "command": "./main.sh venv.tar.gz chief.py",
      "successPercent": 100,
      "failover": {
        "commonFailoverPolicy": {
          "maxFailureTimes": 1,
          "maxFailurePolicy": "FAIL_ATTEMPT"
        }
      }
    },
    {
      "roleName": "ps",
      "num": 1,
      "vcores": 1,
      "memoryMb": 512,
      "jvmMemoryMb": 512,
      "command": "./main.sh venv.tar.gz ps.py",
      "successPercent": 0, // TensorFlow strategy 里的 Parameter Server 在是常驻的,因此我们不需要等待这个 progress 完成
      "failover": {
        "commonFailoverPolicy": {
          "maxFailureTimes": 1,
          "maxFailurePolicy": "FAIL_ATTEMPT"
        }
      }
    },
    {
      "roleName": "worker",
      "num": 1,
      "vcores": 1,
      "memoryMb": 512,
      "jvmMemoryMb": 512,
      "command": "./main.sh venv.tar.gz worker.py",
      "successPercent": 0, // 同 Parameter Server
      "failover": {
        "commonFailoverPolicy": {
          "maxFailureTimes": 1,
          "maxFailurePolicy": "FAIL_ATTEMPT"
        }
      }
    }
  ]
}

使用 primus-submit 提交训练!

# Submit Primus application
$ cd ~/primus-playground
$ primus-submit --primus_conf examples/tensorflow-parameterserver/primus_config.json
...
22/03/03 18:58:39 INFO impl.YarnClientImpl: Submitted application <YARN-APPLICATION-ID>
22/03/03 18:58:39 INFO client.YarnSubmitCmdRunner: Tracking URL: http://emr-master-1:8088/proxy/<YARN-APPLICATION-ID>/
22/03/03 18:58:49 INFO client.YarnSubmitCmdRunner: Training successfully started. Scheduling took 10010 ms.
22/03/03 19:00:23 INFO client.YarnSubmitCmdRunner: State: FINISHED  Progress: 100.0%
22/03/03 19:00:23 INFO client.YarnSubmitCmdRunner: Application <YARN-APPLICATION-ID> finished with state FINISHED at 2022-03-03 19:00
22/03/03 19:00:23 INFO client.YarnSubmitCmdRunner: Final Application Status: SUCCEEDED
...

# Observe YARN logs
$ yarn logs --applicationId <YARN-APPLICATION-ID> | grep -E "Epoch|FIN"
...
+ echo FIN
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
FIN
...