Help: process hangs during multi-node training with PyTorch DistributedDataParallel
Hey there, let's figure out why your multi-node distributed training hangs while single-node works perfectly. The usual culprits are network communication problems between your two nodes or misconfigured distributed settings. Here's a step-by-step breakdown to diagnose and fix it:
1. Verify Network Connectivity Between Nodes
Since you're using the NCCL backend, your nodes need unobstructed access to the master node's IP and port.
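- First, test basic connectivity: on the secondary node, run `ping a.b.c.d` to confirm it can reach the master node. If ping fails, check your network setup (router settings, VPN configuration if applicable).
- Next, check whether the master port is reachable: on the secondary node, run `telnet a.b.c.d 8890` or `nc -zv a.b.c.d 8890` while the master process is running (nothing listens on the port otherwise). If the connection times out, your firewall (ufw, iptables, or cloud security groups) is likely blocking the port; a "connection refused" more often means nothing is listening yet. You'll need to open port 8890 on both nodes, or temporarily disable the firewall for testing purposes. NCCL also opens its own connections on other ports once the initial rendezvous succeeds, so opening 8890 alone may not be enough behind a strict firewall.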
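If `telnet` and `nc` aren't installed on the node, the same port check can be sketched in a few lines of Python. This is only an illustration: `can_connect` is a helper name I made up, and it tests nothing more than whether a TCP handshake to the given host and port succeeds.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

On the secondary node you would call `can_connect("a.b.c.d", 8890)` with your real master IP while the master process is already waiting in `init_process_group`.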
2. Enable NCCL Debug Logs
Add NCCL debug environment variables to get detailed logs about where the process is stuck. Modify your launch commands to include:
    export NCCL_DEBUG=INFO
    export NCCL_SOCKET_IFNAME=eth0  # Replace with your actual network interface (check via `ip addr`)
    python mnist-distributed.py -n 2 -g 2 -nr <0 or 1>
`NCCL_DEBUG=INFO` prints detailed communication logs, which show whether the nodes are failing to connect or getting stuck during data synchronization. `NCCL_SOCKET_IFNAME` makes NCCL use the network interface you name (critical if your nodes have multiple network adapters).
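If you'd rather set these variables from inside the script than in the shell, something like the sketch below should work, as long as it runs before `mp.spawn` so the child processes inherit the environment. The helper names (`list_interfaces`, `nccl_debug_env`, `apply_nccl_debug_env`) are hypothetical, and `eth0` is just the placeholder from the commands above.

```python
from __future__ import annotations

import os
import socket


def list_interfaces() -> list[str]:
    """Names of the network interfaces reported by the operating system."""
    return [name for _, name in socket.if_nameindex()]


def nccl_debug_env(ifname: str) -> dict[str, str]:
    """The two variables from the step above, as a plain dict."""
    return {"NCCL_DEBUG": "INFO", "NCCL_SOCKET_IFNAME": ifname}


def apply_nccl_debug_env(ifname: str) -> None:
    """Export the variables; call this before mp.spawn / init_process_group."""
    available = list_interfaces()
    if ifname not in available:
        # Fail early instead of letting NCCL search for a missing interface.
        raise ValueError(f"interface {ifname!r} not found; available: {available}")
    os.environ.update(nccl_debug_env(ifname))
```

Printing `list_interfaces()` on each node is also a quick way to spot that the interface is called something other than `eth0` on one of them.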
3. Ensure Consistent Versions Across Nodes
Mismatched versions of PyTorch, CUDA, or NCCL can cause silent communication failures. Run these commands on both nodes and confirm the outputs are identical:
    python -c "import torch; print(torch.__version__)"
    python -c "import torch; print(torch.version.cuda)"
    python -c "import torch; print(torch.cuda.nccl.version())"
If versions differ, reinstall the same PyTorch/CUDA/NCCL stack on both nodes.
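To make the comparison less error-prone, you could print everything as one JSON line per node and diff the two outputs. The sketch below is my own helper, not something from your script; it reads the NCCL version through `torch.cuda.nccl.version()`, which is where PyTorch exposes it as far as I know, and it falls back to `None` when a value can't be read.

```python
import json
import platform


def collect_versions() -> dict:
    """Gather the version strings that should match across nodes."""
    info = {"python": platform.python_version()}
    try:
        import torch  # lazy import so the report still works if torch is missing
    except ImportError:
        info["torch"] = None
        return info
    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda  # None on CPU-only builds
    try:
        info["nccl"] = str(torch.cuda.nccl.version())
    except Exception:
        # NCCL may be unavailable, e.g. on a CPU-only build.
        info["nccl"] = None
    return info


# One sorted JSON line per node makes the two outputs easy to diff.
print(json.dumps(collect_versions(), sort_keys=True))
```

Run it on both nodes and compare the two lines; any difference is a candidate for the hang.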
4. Double-Check Launch Parameters
Make sure you're using exactly the same parameters on both nodes except for -nr:
- Master node: `python mnist-distributed.py -n 2 -g 2 -nr 0`
- Secondary node: `python mnist-distributed.py -n 2 -g 2 -nr 1`
A common mistake is mismatching `--nodes` or `--gpus` values between nodes, which breaks the distributed group setup.
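To see why a mismatch hangs rather than errors, it helps to write out the arithmetic. Assuming the script computes `world_size = nodes * gpus` and `rank = nr * gpus + gpu` (as in the minimal test script in this answer), the sketch below shows what each node believes; `launch_plan` and `consistent` are illustrative names of mine.

```python
def launch_plan(nr: int, nodes: int, gpus: int) -> tuple:
    """World size and global ranks that one node derives from its arguments."""
    world_size = nodes * gpus
    ranks = [nr * gpus + gpu for gpu in range(gpus)]
    return world_size, ranks


def consistent(plans: list) -> bool:
    """True if all nodes agree on world size and the ranks cover 0..world_size-1 once."""
    world_sizes = {ws for ws, _ in plans}
    if len(world_sizes) != 1:
        return False
    all_ranks = sorted(r for _, ranks in plans for r in ranks)
    return all_ranks == list(range(world_sizes.pop()))


# Correct launch: both nodes use -n 2 -g 2.
print(consistent([launch_plan(0, 2, 2), launch_plan(1, 2, 2)]))
# Mismatch: node 1 launched with -g 1, so it expects a world of 2 and claims rank 1.
print(consistent([launch_plan(0, 2, 2), launch_plan(1, 2, 1)]))
```

In the mismatched case node 0 waits for four ranks while node 1 expects two and reuses rank 1, so the group never forms completely and the processes sit waiting.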
5. Confirm Dataset Accessibility
While less likely to cause a complete hang, ensure both nodes have access to the MNIST dataset. Either:
- Let each node download the dataset independently (your current code has `download=True`, so this should happen automatically)
- Or use a shared filesystem (like NFS) so both nodes can access the same `./data` directory.
Quick Test to Isolate the Issue
If you're still stuck, try running a minimal distributed test script on both nodes to rule out your model/dataset code. For example:
    import argparse
    import os

    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp


    def test(gpu, args):
        # Global rank = node index * GPUs per node + local GPU index.
        rank = args.nr * args.gpus + gpu
        # Bind this process to its own GPU before any NCCL communication.
        torch.cuda.set_device(gpu)
        dist.init_process_group(backend='nccl', init_method='env://',
                                world_size=args.world_size, rank=rank)
        print(f"Rank {rank} connected successfully")
        dist.barrier()  # blocks until every rank has reached this point
        print(f"Rank {rank} passed barrier")
        dist.destroy_process_group()


    def main():
        parser = argparse.ArgumentParser()
        parser.add_argument('-n', '--nodes', default=1, type=int)
        parser.add_argument('-g', '--gpus', default=1, type=int)
        parser.add_argument('-nr', '--nr', default=0, type=int)
        args = parser.parse_args()
        args.world_size = args.gpus * args.nodes
        os.environ['MASTER_ADDR'] = 'a.b.c.d'  # master node IP
        os.environ['MASTER_PORT'] = '8890'
        # Start one process per local GPU; each receives its GPU index first.
        mp.spawn(test, nprocs=args.gpus, args=(args,))


    if __name__ == "__main__":
        main()
If this minimal script also hangs, the issue is almost certainly network or environment-related, not your MNIST model code.
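One more thing that can help while debugging: pass a short `timeout` to `init_process_group` so that a stuck rendezvous raises an error instead of waiting for a long time. The sketch below is an assumption-laden illustration (`init_kwargs` and `init_with_timeout` are my names, and 60 seconds is arbitrary). As far as I know the timeout bounds the initial rendezvous; whether later NCCL collectives honor it depends on your PyTorch version and settings.

```python
import datetime


def init_kwargs(rank: int, world_size: int, timeout_s: int = 60) -> dict:
    """Keyword arguments for init_process_group with an explicit timeout."""
    return {
        "backend": "nccl",
        "init_method": "env://",
        "world_size": world_size,
        "rank": rank,
        "timeout": datetime.timedelta(seconds=timeout_s),
    }


def init_with_timeout(rank: int, world_size: int, timeout_s: int = 60) -> None:
    """Join the process group, failing after timeout_s if peers never show up."""
    import torch.distributed as dist  # imported here so the helper above needs no torch

    dist.init_process_group(**init_kwargs(rank, world_size, timeout_s))
```

In the test script you would replace the direct `dist.init_process_group(...)` call with `init_with_timeout(rank, args.world_size)`; the rank that raises first tells you which side never got a connection.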
The question comes from Stack Exchange; asked by motor_junkie.




