ml.utils.parallel
Utility functions for configuring distributed parallel training.
Distributed training is broken up into three types of parallelism:
Model Parallelism
Model parallelism partitions a single layer across multiple GPUs. During the forward pass, within a layer, different GPUs perform different parts of the computation, then communicate the results to each other.
Data Parallelism
Data parallelism splits the data across multiple GPUs. During the forward pass, each GPU performs the same computation on different data, then communicates the results to each other.
Pipeline Parallelism
Pipeline parallelism splits the model across multiple GPUs. During the forward pass, the output of one layer is computed on one GPU, then passed to the next layer on another GPU.
Parallelism Example
Consider doing distributed training on a model with 8 total GPUs. The model is split length-wise (across layers) into two parts, and each part is split width-wise (within layers) into two more parts. This gives a model parallelism of 2 and a pipeline parallelism of 2, so each copy of the model occupies 4 GPUs and the data parallelism is 2.
The model parallel groups are [[0, 1], [2, 3], [4, 5], [6, 7]]. This means that when GPUs 0 and 1 are finished computing their part of some layer, they will communicate the results to each other. The same is true for the other pairs of GPUs.
The pipeline parallel groups are [[0, 2], [1, 3], [4, 6], [5, 7]]. This means that when GPU 0 has finished computing its part of some layer and syncing with GPU 1, it will communicate the output to GPU 2.
The data parallel groups are [[0, 4], [1, 5], [2, 6], [3, 7]]. This means that each minibatch will be split in half, with one half sent to GPUs [0, 1, 2, 3] and the other half sent to GPUs [4, 5, 6, 7].
So in summary, the resulting groups are:
Model parallel groups: [[0, 1], [2, 3], [4, 5], [6, 7]]
Data parallel groups: [[0, 4], [1, 5], [2, 6], [3, 7]]
Pipeline parallel groups: [[0, 2], [1, 3], [4, 6], [5, 7]]
- ml.utils.parallel.init_parallelism(model_parallelism: int = 1, pipeline_parallelism: int = 1, *, mp_backend: str | Backend | None = None, pp_backend: str | Backend | None = None, dp_backend: str | Backend | None = None) -> None
Initializes parallelism groups and parameters.
- Parameters:
model_parallelism – Number of model parallel GPUs. Each layer of computation will simultaneously run on this many GPUs.
pipeline_parallelism – Number of pipeline parallel layers. The total number of GPUs processing a single input will be the product of model_parallelism and pipeline_parallelism.
mp_backend – Backend to use for model parallelism.
pp_backend – Backend to use for pipeline parallelism.
dp_backend – Backend to use for data parallelism.
- Raises:
ParallelismError – If the requested parallelism settings are invalid.
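A minimal sketch of the kind of validation that could raise this error. `ParallelismError` here is a stand-in class, and the divisibility check is an assumption based on the constraint that each model replica occupies model_parallelism * pipeline_parallelism GPUs:

```python
class ParallelismError(Exception):
    """Stand-in for the module's error type, for illustration only."""


def validate_parallelism(world_size: int, model_parallelism: int, pipeline_parallelism: int) -> int:
    """Return the implied data parallelism, or raise ParallelismError.

    Hypothetical helper: sketches checks init_parallelism might perform.
    """
    if model_parallelism < 1 or pipeline_parallelism < 1:
        raise ParallelismError("Parallelism degrees must be positive")
    per_replica = model_parallelism * pipeline_parallelism
    if world_size % per_replica != 0:
        raise ParallelismError(
            f"World size {world_size} is not divisible by "
            f"model_parallelism * pipeline_parallelism = {per_replica}"
        )
    return world_size // per_replica
```

For the 8-GPU example above, validate_parallelism(8, 2, 2) returns a data parallelism of 2, while a configuration like (8, 3, 1) fails because 8 is not divisible by 3.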