ml.launchers.torchrun

Defines a launcher which uses torchrun to launch a job.

This is a lightweight wrapper around PyTorch’s torchrun CLI (the torch.distributed.run entry point, which supersedes the older torch.distributed.launch script). It launches a job on a single node with multiple processes, each with multiple devices.

class ml.launchers.torchrun.TorchRunLauncherConfig(name: str = '???', nproc_per_node: int = '???', master_addr: str = '127.0.0.1', master_port: int = '???', backend: str = 'nccl', start_method: str = 'spawn', torchrun_path: str = '???')

Bases: BaseLauncherConfig

nproc_per_node: int = '???'
master_addr: str = '127.0.0.1'
master_port: int = '???'
backend: str = 'nccl'
start_method: str = 'spawn'
torchrun_path: str = '???'
classmethod resolve(config: TorchRunLauncherConfig) → None

Runs post-construction config resolution.

Parameters:

config – The config to resolve
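
The page does not say which fields resolve fills in. As a rough illustration only, the sketch below shows one plausible way to replace the '???' placeholders after construction. SketchTorchRunConfig, find_free_port, resolve_sketch, and the default of one process are all hypothetical names and choices made for this example, not the library's implementation.

```python
import shutil
import socket
from dataclasses import dataclass
from typing import Union

MISSING = "???"  # placeholder marking a value that still has to be filled in


@dataclass
class SketchTorchRunConfig:
    """Hypothetical stand-in that mirrors the documented field names."""

    nproc_per_node: Union[int, str] = MISSING
    master_addr: str = "127.0.0.1"
    master_port: Union[int, str] = MISSING
    backend: str = "nccl"
    start_method: str = "spawn"
    torchrun_path: str = MISSING


def find_free_port(host: str = "127.0.0.1") -> int:
    """Asks the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


def resolve_sketch(config: SketchTorchRunConfig, default_nproc: int = 1) -> SketchTorchRunConfig:
    """Fills every field still set to the placeholder (assumed behavior)."""
    if config.nproc_per_node == MISSING:
        config.nproc_per_node = default_nproc
    if config.master_port == MISSING:
        config.master_port = find_free_port(config.master_addr)
    if config.torchrun_path == MISSING:
        # Fall back to the bare command name if torchrun is not on PATH.
        config.torchrun_path = shutil.which("torchrun") or "torchrun"
    return config
```

Values that the caller sets explicitly are left untouched, so resolution in this sketch only affects fields that were never configured.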

class ml.launchers.torchrun.TorchRunLauncher(config: BaseConfigT)

Bases: BaseLauncher[TorchRunLauncherConfig]

launch() → None

Launches the job by calling the TorchRun CLI in a subprocess.
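
To make the subprocess call concrete, here is a minimal sketch of how resolved config values could be turned into a torchrun command line and executed. build_torchrun_command and launch_sketch are hypothetical helpers written for this example, and train.py in the usage note is a placeholder script name. The backend and start_method fields are left out because this sketch does not assume how the library passes them to the worker processes.

```python
import subprocess
from typing import List, Sequence


def build_torchrun_command(
    torchrun_path: str,
    nproc_per_node: int,
    master_addr: str,
    master_port: int,
    script: str,
    script_args: Sequence[str] = (),
) -> List[str]:
    """Builds the argument list for a single-node torchrun invocation."""
    return [
        torchrun_path,
        "--nnodes=1",  # single node, as described for this launcher
        f"--nproc_per_node={nproc_per_node}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        script,
        *script_args,
    ]


def launch_sketch(command: Sequence[str]) -> None:
    """Runs the command in a subprocess and raises if it exits non-zero."""
    subprocess.run(list(command), check=True)
```

A call such as build_torchrun_command("torchrun", 2, "127.0.0.1", 29500, "train.py", ["--epochs", "1"]) produces the list that launch_sketch would then execute. Passing the arguments as a list rather than a shell string avoids quoting problems in script arguments.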