ml.utils.distributed
Defines distributed training parameters.
These parameters apply to any distributed training job. For model-parallel training, refer to ml.models.parallel.env.
RANK
: The rank of the current process.

WORLD_SIZE
: The total number of processes.

MASTER_ADDR
: The address of the master process.

MASTER_PORT
: The port of the master process.

INIT_METHOD
: The method used to initialize the process group.
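
Below is a minimal sketch of how these environment variables are typically consumed, assuming PyTorch's torch.distributed backend. The init_from_env helper, its backend choice, and the "env://" default are illustrative assumptions, not part of ml.utils.distributed.

```python
# Illustrative sketch (not part of ml.utils.distributed): initialize the
# default process group from the environment variables documented above.
import os

import torch
import torch.distributed as dist


def init_from_env() -> None:
    """Initializes the default process group from RANK, WORLD_SIZE, etc."""
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    # "env://" (assumed default here) tells PyTorch to read MASTER_ADDR and
    # MASTER_PORT from the environment itself.
    init_method = os.environ.get("INIT_METHOD", "env://")

    dist.init_process_group(
        backend="nccl" if torch.cuda.is_available() else "gloo",
        init_method=init_method,
        rank=rank,
        world_size=world_size,
    )
```

Launchers such as torchrun normally set RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT for each process, so in practice these variables rarely need to be exported by hand.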