Troubleshooting¶
Learn how to troubleshoot common issues related to CUDA, NCCL, and distributed training.
Multi-GPU¶
If your program is stuck at
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
it indicates that PyTorch can’t set up the communication between GPUs, and that your system is not configured correctly. Run the diagnose command from the Fabric CLI to investigate:
fabric diagnose
This tool runs basic multi-GPU tests using only PyTorch. Any issue raised here confirms that the problem is with your system and not with Lightning. Common solutions:
Wrong driver version: The NVIDIA driver for your GPU is too old or too new. You can check the version of the driver by running
nvidia-smi --id=0 --query-gpu=driver_version --format=csv,noheader
Solution: Install a recent driver. Search online for instructions on how to update the driver on your platform. A quick way to cross-check the driver against your PyTorch build is sketched after this list.
Peer-to-peer connection is broken: The GPUs can’t communicate with each other. Solution: Try setting the environment variable
NCCL_P2P_DISABLE=1. If rerunning your script fixes the problem, it means that peer-to-peer transport is not working properly (your training will run, but it will be slow). This is likely due to driver compatibility issues (see above) or because your GPU does not support peer-to-peer (e.g., certain RTX cards). A command-line sketch for this check also follows the list.
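As an additional check for the driver issue above, you can compare the installed driver against the CUDA version your PyTorch build expects. This is a minimal sketch and assumes python points at the environment you train with:
# Driver version reported for every GPU
nvidia-smi --query-gpu=index,driver_version --format=csv,noheader
# CUDA version PyTorch was built with (the installed driver must support at least this version)
python -c "import torch; print(torch.version.cuda)"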
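For the peer-to-peer issue, you can disable the transport for a single run without editing your script, and inspect which links your system reports between GPUs. The trailing arguments to fabric run are placeholders for your own launch command:
# Run once with peer-to-peer transport disabled
NCCL_P2P_DISABLE=1 fabric run ...
# Show the GPU interconnect topology (which pairs of GPUs can talk to each other directly)
nvidia-smi topo -m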
Multi-node¶
Before troubleshooting multi-node connectivity issues, first ensure that multi-GPU within a single machine is working correctly by following the steps above. If single-node execution works, but multi-node hangs at
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
it indicates that there is a connection issue between the nodes. Common solutions:
Wrong network interface: Some servers have multiple network interfaces. There is usually only one that can send and receive traffic from the network of the other nodes, but sometimes it is not set as the default. In this case, you need to set it manually:
export GLOO_SOCKET_IFNAME=eno1
export NCCL_SOCKET_IFNAME=eno1
fabric run ...
You can find the interface name by parsing the output of the ifconfig command. The name of this interface may differ on each node (an example of listing interfaces follows this list).
NCCL can’t communicate between the nodes: Follow the steps in the NCCL troubleshooting guide. In particular, take note of the network section that describes restricting the port range and firewall rules. A debug-logging sketch also follows this list.
echo "net.ipv4.ip_local_port_range = 50000 51000" >> /etc/sysctl.conf sysctl --system ufw allow 50000:51000/tcp