:orphan:

##################
Bare Bones Cluster
##################

**Audience**: Users who want to train on multiple machines that aren't part of a managed cluster.

This guide shows how to run a training job on a general-purpose cluster.
It assumes that you can log in to each machine and run commands there.

Don't want to maintain your own infrastructure? Try the :doc:`Lightning cloud <./cloud>` instead.


----


************
Requirements
************

To set up a multi-node computing cluster, you need the following:

1. Multiple computers with Lightning installed
2. Network connectivity between the machines, with firewall rules that allow traffic on a specified port

|

We highly recommend setting up a shared filesystem to avoid cumbersome copying of files between machines.


----


***************************
Prepare the training script
***************************

.. code-block:: python
    :caption: train.py

    from lightning.fabric import Fabric

    fabric = Fabric()

    # The rest of the training script
    ...

We intentionally don't specify ``strategy``, ``devices``, and ``num_nodes`` here because these settings are supplied through the CLI in the later steps.
You can still hard-code other options if you like.


----


*********************************
Launch the script on your cluster
*********************************

**Step 1**: Upload the training script and all needed files to the cluster.
Each node needs access to the same files.
If the nodes aren't attached to a shared network drive, you'll need to upload the files to each node separately.

**Step 2**: Pick one of the nodes as your main node and write down its IP address.

Example: 10.10.10.16

**Step 3**: Launch the script on each node using the Lightning CLI.
In this example, we want to launch training across two nodes, each with 8 GPUs.

Log in to the **first node** and run this command:

.. code-block:: bash
    :emphasize-lines: 2,3

    fabric run \
        --node-rank=0 \
        --main-address=10.10.10.16 \
        --accelerator=cuda \
        --devices=8 \
        --num-nodes=2 \
        train.py

Log in to the **second node** and run this command:

.. code-block:: bash
    :emphasize-lines: 2,3

    fabric run \
        --node-rank=1 \
        --main-address=10.10.10.16 \
        --accelerator=cuda \
        --devices=8 \
        --num-nodes=2 \
        train.py

Note: The only difference between the two commands is the ``--node-rank`` setting, which identifies each node.

After executing these commands, you should immediately see an output like this:

.. code-block::

    Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/16
    Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/16
    ...


----


***************
Troubleshooting
***************

Please refer to the :doc:`troubleshooting guide <../troubleshooting>` if you are experiencing issues related to multi-node training hanging or crashing.
If you are sick of troubleshooting cluster problems, give :doc:`Lightning Studios <./cloud>` a try!

For other questions, please don't hesitate to join the `Discord `_.
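
----


For reference, here is a minimal sketch of what the rest of ``train.py`` could look like when launched with the ``fabric run`` commands above. The linear model, random batches, and optimizer settings are hypothetical placeholders; what matters is that ``Fabric()`` is created without ``strategy``, ``devices``, or ``num_nodes`` so that the CLI can supply them:

.. code-block:: python
    :caption: train.py (sketch with placeholder model and data)

    import torch
    from lightning.fabric import Fabric

    # strategy, devices, and num_nodes are supplied by `fabric run`;
    # no fabric.launch() call is needed since the CLI already spawned the processes
    fabric = Fabric()

    # Hypothetical model and optimizer, for illustration only
    model = torch.nn.Linear(32, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    model, optimizer = fabric.setup(model, optimizer)

    for step in range(100):
        # Hypothetical random batch; replace with a real DataLoader
        # (wrap it with fabric.setup_dataloaders for distributed sampling)
        inputs = torch.randn(64, 32, device=fabric.device)
        targets = torch.randint(0, 2, (64,), device=fabric.device)

        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        fabric.backward(loss)  # use this instead of loss.backward()
        optimizer.step()

        if step % 10 == 0:
            fabric.print(f"step {step}: loss {loss.item():.4f}")  # prints on rank 0 only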