ml.launchers.torchrun

Defines a launcher which uses torchrun to launch a job.

This is a lightweight wrapper around PyTorch’s torch.distributed.launch script. It launches a job on a single node with multiple processes, each with multiple devices.
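Each process started this way learns its place in the job from environment variables that the PyTorch launcher exports (RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT). The sketch below shows one way a worker script might read them. WorkerEnv and read_worker_env are hypothetical names used only for illustration; they are not part of ml.launchers.torchrun.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class WorkerEnv:
    """Hypothetical container for the values the launcher exports to a worker."""

    rank: int          # global index of this process
    local_rank: int    # index of this process on its node
    world_size: int    # total number of processes in the job
    master_addr: str   # address of the rendezvous host
    master_port: int   # port of the rendezvous host


def read_worker_env(environ: Optional[Mapping[str, str]] = None) -> WorkerEnv:
    """Parses the launcher's environment variables into a typed record."""
    env = os.environ if environ is None else environ
    return WorkerEnv(
        rank=int(env["RANK"]),
        local_rank=int(env["LOCAL_RANK"]),
        world_size=int(env["WORLD_SIZE"]),
        master_addr=env["MASTER_ADDR"],
        master_port=int(env["MASTER_PORT"]),
    )
```

On a single node, rank and local_rank coincide, and world_size equals the number of processes per node.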

class ml.launchers.torchrun.TorchRunLauncherConfig(name: str = '???', nproc_per_node: int = '???', master_addr: str = '127.0.0.1', master_port: int = '???', backend: str = 'nccl', start_method: str = 'spawn', torchrun_path: str = '???')[source]

Bases: BaseLauncherConfig

nproc_per_node: int = '???'

master_addr: str = '127.0.0.1'

master_port: int = '???'

backend: str = 'nccl'

start_method: str = 'spawn'

torchrun_path: str = '???'

classmethod resolve(config: TorchRunLauncherConfig) → None[source]

Runs post-construction config resolution.

Parameters: config – The config to resolve
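The documentation does not say what resolve fills in. Several fields default to '???', which looks like a "missing value" marker, so one plausible reading is that resolution replaces those markers with concrete values. The sketch below illustrates that idea only. ResolveSketchConfig, find_free_port, resolve_sketch, and the device_count argument are all hypothetical, and the chosen defaults (a free local port, the torchrun executable found on PATH) are assumptions rather than the library's actual behavior.

```python
import shutil
import socket
from dataclasses import dataclass
from typing import Any

MISSING = "???"  # marker used by the documented defaults


@dataclass
class ResolveSketchConfig:
    """Hypothetical stand-in carrying a subset of the documented fields."""

    nproc_per_node: Any = MISSING
    master_addr: str = "127.0.0.1"
    master_port: Any = MISSING
    torchrun_path: Any = MISSING


def find_free_port(host: str = "127.0.0.1") -> int:
    """Asks the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


def resolve_sketch(config: ResolveSketchConfig, device_count: int = 1) -> None:
    """Fills in missing fields in place; explicitly set values are kept."""
    if config.nproc_per_node == MISSING:
        # Assumption: one process per available device.
        config.nproc_per_node = device_count
    if config.master_port == MISSING:
        config.master_port = find_free_port(config.master_addr)
    if config.torchrun_path == MISSING:
        # Fall back to the bare name if the executable is not on PATH.
        config.torchrun_path = shutil.which("torchrun") or "torchrun"
```

A port found this way can be taken by another program before the job binds to it, so a production launcher may need to retry.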

class ml.launchers.torchrun.TorchRunLauncher(config: BaseConfigT)[source]

Bases: BaseLauncher[TorchRunLauncherConfig]

launch() → None[source]: Launches the job by calling the TorchRun CLI in a subprocess.
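As a rough illustration of calling the TorchRun CLI in a subprocess, the sketch below assembles a single-node torchrun command from config-like fields and runs it. LaunchSketchConfig, build_command, and launch_sketch are hypothetical names, and the script path passed in is a placeholder. The flags used (--nnodes, --nproc_per_node, --master_addr, --master_port) are standard torchrun options, but which flags the real launcher passes, and how it forwards backend and start_method to the workers, is not documented here.

```python
import subprocess
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class LaunchSketchConfig:
    """Hypothetical stand-in carrying a subset of the documented fields."""

    nproc_per_node: int
    master_addr: str = "127.0.0.1"
    master_port: int = 29500
    torchrun_path: str = "torchrun"


def build_command(
    config: LaunchSketchConfig,
    script: str,
    script_args: Sequence[str] = (),
) -> List[str]:
    """Builds the argument list for a single-node torchrun invocation."""
    return [
        config.torchrun_path,
        "--nnodes=1",
        f"--nproc_per_node={config.nproc_per_node}",
        f"--master_addr={config.master_addr}",
        f"--master_port={config.master_port}",
        script,
        *script_args,
    ]


def launch_sketch(
    config: LaunchSketchConfig,
    script: str,
    script_args: Sequence[str] = (),
) -> None:
    """Runs torchrun in a subprocess and raises if it exits with an error."""
    subprocess.run(build_command(config, script, script_args), check=True)
```

Passing the command as a list, rather than as a shell string, avoids quoting problems when script arguments contain spaces.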