Help: processes hang during multi-node training with PyTorch DistributedDataParallel

Ahua AIGC Lab

2026-5-9

Troubleshooting Stuck Multi-Node PyTorch DDP Training for MNIST

Hey there, let's figure out why your multi-node distributed training is hanging while single-node works perfectly. The root cause is almost always related to network communication issues between your two nodes or misconfigured distributed settings. Here's a step-by-step breakdown to diagnose and fix it:

1. Verify Network Connectivity Between Nodes

Since you're using the NCCL backend, your nodes need unobstructed access to the master node's IP and port.

First, test basic connectivity: On the secondary node, run ping a.b.c.d to confirm it can reach the master node. If ping fails, check your network setup (router settings, VPN configuration if applicable).
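Next, check whether the master port is reachable: while the master process is running (nothing listens on the port otherwise), run telnet a.b.c.d 8890 or nc -zv a.b.c.d 8890 on the secondary node. If this fails, a firewall (ufw, iptables, or cloud security groups) is probably blocking the port. Open port 8890 on the master node, or temporarily disable the firewall on both nodes for testing purposes. Keep in mind that NCCL also opens additional, dynamically chosen ports for its own traffic, so allowing only port 8890 may not be enough.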
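The same port check can be scripted, which is handy when telnet or nc is not installed. This is a minimal sketch using only Python's standard socket module; the helper name can_connect is made up for this example, and it only returns True while something (normally the master process) is listening on the port.

```python
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False

# Example (placeholders from this post): can_connect("a.b.c.d", 8890)
```

Run it on the secondary node with the master's address and port. A False result points at the network or firewall rather than at the training script.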

2. Enable NCCL Debug Logs

Add NCCL debug environment variables to get detailed logs about where the process is stuck. Modify your launch commands to include:

export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=eth0  # Replace with your actual network interface (check via `ip addr`)
python mnist-distributed.py -n 2 -g 2 -nr <0 or 1>

Setting NCCL_DEBUG=INFO prints detailed communication logs, which show whether the nodes fail to connect or get stuck during data synchronization. NCCL_SOCKET_IFNAME makes NCCL use the network interface you name (critical if your nodes have multiple network adapters).
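If you are unsure which interface name to put in NCCL_SOCKET_IFNAME, a short script can list the likely candidates. This is only a sketch: the helper names and the list of skipped prefixes are assumptions for illustration, so adjust them to your machines.

```python
import socket

# Prefixes of interfaces that usually cannot carry inter-node traffic
# (loopback, Docker bridges, virtual ethernet pairs). This list is an assumption.
SKIP_PREFIXES = ("lo", "docker", "veth", "br-")

def candidate_interfaces(names):
    """Drop interface names that start with one of the skipped prefixes."""
    return [n for n in names if not n.startswith(SKIP_PREFIXES)]

def local_interface_names():
    """Return the interface names known to the operating system."""
    return [name for _, name in socket.if_nameindex()]

# Example: print(candidate_interfaces(local_interface_names()))
```

Whatever name it prints still needs to be checked against ip addr to confirm that the interface holds the address the other node can reach.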

3. Ensure Consistent Versions Across Nodes

Mismatched versions of PyTorch, CUDA, or NCCL can cause silent communication failures. Run these commands on both nodes and confirm the outputs are identical:

python -c "import torch; print(torch.__version__)"
python -c "import torch; print(torch.version.cuda)"
python -c "import torch; print(torch.cuda.nccl.version())"

If versions differ, reinstall the same PyTorch/CUDA/NCCL stack on both nodes.
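To avoid comparing the outputs by eye, you could collect the versions into a dictionary on each node and diff the two. A rough sketch, with hypothetical helper names collect_versions and compare_versions; collect_versions needs PyTorch built with CUDA/NCCL support, while compare_versions is plain Python.

```python
def collect_versions():
    """Collect the versions that should match on every node (requires PyTorch)."""
    import torch  # imported lazily so compare_versions works without PyTorch
    return {
        "torch": torch.__version__,
        "cuda": torch.version.cuda,
        # Depending on the PyTorch release this is a tuple or an int, so stringify it.
        "nccl": str(torch.cuda.nccl.version()),
    }

def compare_versions(a, b):
    """Return {key: (a_value, b_value)} for every key whose values differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}
```

Print collect_versions() on each node, paste the two dictionaries into compare_versions, and an empty result means the stacks match.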

4. Double-Check Launch Parameters

Make sure you're using exactly the same parameters on both nodes except for -nr:

Master node: python mnist-distributed.py -n 2 -g 2 -nr 0
Secondary node: python mnist-distributed.py -n 2 -g 2 -nr 1
A common mistake is mismatching --nodes or --gpus values between nodes, which breaks the distributed group setup.
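To see what "consistent" means here, the sketch below maps every (node rank, local GPU) pair to a global rank using the same formula as the script (nr * gpus + gpu) and checks a set of launch parameters. The function names and the tuple layout are invented for this illustration.

```python
def rank_layout(nodes: int, gpus: int):
    """Map each (node rank, local GPU index) pair to its global rank."""
    return {(nr, gpu): nr * gpus + gpu for nr in range(nodes) for gpu in range(gpus)}

def validate_launches(launches):
    """Check a list of (nodes, gpus, nr) tuples, one per launched node."""
    problems = []
    if len({(n, g) for n, g, _ in launches}) > 1:
        problems.append("nodes/gpus values differ between launches")
    nodes = launches[0][0]
    if sorted(nr for _, _, nr in launches) != list(range(nodes)):
        problems.append("node ranks must be exactly 0..nodes-1, each used once")
    return problems
```

For the two commands above, validate_launches([(2, 2, 0), (2, 2, 1)]) returns an empty list, and rank_layout(2, 2) assigns global ranks 0 to 3. If any rank is missing or duplicated, init_process_group waits for a world that never completes, which looks exactly like a hang.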

5. Confirm Dataset Accessibility

While less likely to cause a complete hang, ensure both nodes have access to the MNIST dataset. Either:

Let each node download the dataset independently (your current code has download=True, so this should happen automatically)
Or use a shared filesystem (like NFS) so both nodes can access the same ./data directory.
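If each node downloads its own copy, one way to confirm that both ended up with the same files is to compare a fingerprint of the data directory. A small sketch using only the standard library; dataset_fingerprint is a hypothetical helper, and ./data is the directory mentioned above.

```python
import hashlib
from pathlib import Path

def dataset_fingerprint(root: str) -> str:
    """Hash the relative paths and contents of all files under root."""
    digest = hashlib.sha256()
    base = Path(root)
    # Sort so the result does not depend on directory listing order.
    for path in sorted(p for p in base.rglob("*") if p.is_file()):
        digest.update(path.relative_to(base).as_posix().encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()

# Example: print(dataset_fingerprint("./data"))
```

Run it on both nodes after the download finishes. Different fingerprints suggest an incomplete or corrupted download on one side.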

Quick Test to Isolate the Issue

If you're still stuck, try running a minimal distributed test script on both nodes to rule out your model/dataset code. For example:

import torch
import torch.distributed as dist
import os
import torch.multiprocessing as mp

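def test(gpu, args):
    # Global rank = node rank * GPUs per node + local GPU index
    rank = args.nr * args.gpus + gpu
    # Bind this process to its own GPU; otherwise every process on the node defaults to GPU 0
    torch.cuda.set_device(gpu)
    dist.init_process_group(backend='nccl', init_method='env://', world_size=args.world_size, rank=rank)
    print(f"Rank {rank} connected successfully")
    # Blocks until every rank in the group has reached this point
    dist.barrier()
    print(f"Rank {rank} passed barrier")
    dist.destroy_process_group()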

def main():
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('-n', '--nodes', default=1, type=int)
    parser.add_argument('-g', '--gpus', default=1, type=int)
    parser.add_argument('-nr', '--nr', default=0, type=int)
    args = parser.parse_args()
    args.world_size = args.gpus * args.nodes
    os.environ['MASTER_ADDR'] = 'a.b.c.d'
    os.environ['MASTER_PORT'] = '8890'
    mp.spawn(test, nprocs=args.gpus, args=(args,))

if __name__ == "__main__":
    main()

If this minimal script also hangs, the issue is almost certainly network or environment-related, not your MNIST model code.

The question comes from Stack Exchange; asked by motor_junkie.