ml.launchers.ec2
Defines a launcher for running jobs on AWS EC2.
Steps
1. Stages the environment to a new working directory.
2. Writes a user data script which runs the job, then shuts down the instance.
3. Starts an EC2 instance with the user data script, copying over the staged environment.
4. Writes information about the running job to the working directory.
This makes jobs repeatable: scheduling the same job again reruns it with the same staged environment. A running job can be debugged by SSHing into the instance and viewing the log file.
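As a rough sketch of steps 2 through 4 (not the launcher's actual implementation), the launch itself can be done with boto3's run_instances call. The AMI ID, instance type, tag values, and file names below are placeholders.

```python
import json
from pathlib import Path

import boto3


def launch_job(user_data: str, working_dir: Path) -> str:
    """Starts an EC2 instance that runs `user_data` and records the job info."""
    ec2 = boto3.client("ec2")
    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # placeholder AMI, e.g. a Deep Learning AMI
        InstanceType="g4dn.xlarge",       # one T4 GPU; see the instance table below
        MinCount=1,
        MaxCount=1,
        UserData=user_data,               # boto3 base64-encodes this automatically
        # The user data script ends by shutting the instance down; "terminate"
        # makes that shutdown tear the instance down rather than just stop it.
        InstanceInitiatedShutdownBehavior="terminate",
        TagSpecifications=[
            {"ResourceType": "instance", "Tags": [{"Key": "Name", "Value": "example-job"}]},
        ],
    )
    instance_id = response["Instances"][0]["InstanceId"]
    # Step 4: write information about the running job to the working directory.
    (working_dir / "instance.json").write_text(json.dumps({"instance_id": instance_id}))
    return instance_id
```

The `user_data` argument is a script along the lines of the one sketched after the Requirements section below.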
Requirements
Since this launcher runs on EC2, it should be possible to install all dependencies using `pip install .` from the root directory of the repository.
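As a concrete but hypothetical illustration of this requirement, a user data script might copy the staged environment onto the instance, install the package from the repository root, run the job, and shut the instance down. The S3 bucket, paths, and entry point below are made up; this string is what would be passed as `user_data` in the sketch above.

```python
# Hypothetical user data script; bucket, paths, and entry point are placeholders.
USER_DATA = """#!/bin/bash
set -euo pipefail

# Copy the staged environment onto the instance.
aws s3 sync s3://my-bucket/staged-env /home/ubuntu/job
cd /home/ubuntu/job

# Install all dependencies from the repository root.
pip install .

# Run the job, capturing output to a log file for debugging over SSH.
python -m my_project.train > /home/ubuntu/job/job.log 2>&1

# Shut down so the instance stops billing once the job finishes.
shutdown -h now
"""
```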
EC2 Instance Types
Below are some EC2 instance types which can be used for training models. Prices are the on-demand, per-GPU rates as of August 16th, 2023; spot and reserved instances can be significantly cheaper.
This list is non-exhaustive, since instances also differ in network bandwidth, local storage, and other considerations. A short sketch of using these figures programmatically follows the list.
V100
- 32 GB GPU memory, SM70
- p3.2xlarge: 1 GPU, 8 vCPUs, 61 GB RAM, $3.06/hour/GPU
- p3.8xlarge: 4 GPUs, 32 vCPUs, 244 GB RAM, $3.06/hour/GPU
- p3.16xlarge: 8 GPUs, 64 vCPUs, 488 GB RAM, $3.06/hour/GPU
- p3dn.24xlarge: 8 GPUs, 96 vCPUs, 768 GB RAM, $3.90/hour/GPU
A100
- 40 GB (P4d) or 80 GB (P4de) GPU memory, SM80
- p4d.24xlarge: 8 GPUs, 96 vCPUs, 1152 GB RAM, $4.10/hour/GPU
- p4de.24xlarge: 8 GPUs, 96 vCPUs, 1152 GB RAM, $5.12/hour/GPU
M60
- 8 GB GPU memory, SM50
- g3s.xlarge: 1 GPU, 4 vCPUs, 30 GB RAM, $0.75/hour/GPU
- g3.4xlarge: 1 GPU, 16 vCPUs, 122 GB RAM, $1.14/hour/GPU
- g3.8xlarge: 2 GPUs, 32 vCPUs, 244 GB RAM, $1.14/hour/GPU
- g3.16xlarge: 4 GPUs, 64 vCPUs, 488 GB RAM, $1.14/hour/GPU
T4
- 16 GB GPU memory, SM75
- g4dn.xlarge: 1 GPU, 4 vCPUs, 16 GB RAM, $0.526/hour/GPU
- g4dn.2xlarge: 1 GPU, 8 vCPUs, 32 GB RAM, $0.752/hour/GPU
- g4dn.4xlarge: 1 GPU, 16 vCPUs, 64 GB RAM, $1.204/hour/GPU
- g4dn.8xlarge: 1 GPU, 32 vCPUs, 128 GB RAM, $2.176/hour/GPU
- g4dn.12xlarge: 4 GPUs, 48 vCPUs, 192 GB RAM, $0.978/hour/GPU
- g4dn.16xlarge: 1 GPU, 64 vCPUs, 256 GB RAM, $4.352/hour/GPU
- g4dn.metal: 8 GPUs, 96 vCPUs, 384 GB RAM, $0.978/hour/GPU
T4G
- 16 GB GPU memory, SM75
- g5g.xlarge: 1 GPU, 4 vCPUs, 8 GB RAM, $0.42/hour/GPU
- g5g.2xlarge: 1 GPU, 8 vCPUs, 16 GB RAM, $0.556/hour/GPU
- g5g.4xlarge: 1 GPU, 16 vCPUs, 32 GB RAM, $0.828/hour/GPU
- g5g.8xlarge: 1 GPU, 32 vCPUs, 64 GB RAM, $1.372/hour/GPU
- g5g.16xlarge: 2 GPUs, 64 vCPUs, 128 GB RAM, $1.372/hour/GPU
- g5g.metal: 2 GPUs, 64 vCPUs, 128 GB RAM, $1.372/hour/GPU
A10G
- 24 GB GPU memory, SM86
- g5.xlarge: 1 GPU, 4 vCPUs, 16 GB RAM, $1.006/hour/GPU
- g5.2xlarge: 1 GPU, 8 vCPUs, 32 GB RAM, $1.212/hour/GPU
- g5.4xlarge: 1 GPU, 16 vCPUs, 64 GB RAM, $1.624/hour/GPU
- g5.8xlarge: 1 GPU, 32 vCPUs, 128 GB RAM, $2.448/hour/GPU
- g5.12xlarge: 4 GPUs, 48 vCPUs, 192 GB RAM, $1.418/hour/GPU
- g5.16xlarge: 1 GPU, 64 vCPUs, 256 GB RAM, $4.096/hour/GPU
- g5.24xlarge: 4 GPUs, 96 vCPUs, 384 GB RAM, $2.036/hour/GPU
- g5.48xlarge: 8 GPUs, 192 vCPUs, 768 GB RAM, $2.036/hour/GPU
Trainium
- 32 GB accelerator memory (Amazon-specific training chip)
- trn1.2xlarge: 1 accelerator, 8 vCPUs, 32 GB RAM, $1.34/hour/accelerator
- trn1.32xlarge: 16 accelerators, 128 vCPUs, 512 GB RAM, $1.34/hour/accelerator
- trn1n.32xlarge: same as trn1.32xlarge but with double the network bandwidth
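The figures above can also be used programmatically, for example to pick the cheapest listed instance with at least a given number of GPUs. The snippet below is a small illustration using a subset of the table; the prices are the August 16th, 2023 on-demand rates quoted above and will drift.

```python
# Subset of the table above: name -> (gpus, vcpus, ram_gb, usd_per_gpu_hour).
INSTANCES = {
    "g4dn.xlarge": (1, 4, 16, 0.526),
    "g4dn.12xlarge": (4, 48, 192, 0.978),
    "g5.xlarge": (1, 4, 16, 1.006),
    "g5.12xlarge": (4, 48, 192, 1.418),
    "p3.8xlarge": (4, 32, 244, 3.06),
    "p4d.24xlarge": (8, 96, 1152, 4.10),
}


def cheapest_with_gpus(min_gpus: int) -> tuple[str, float]:
    """Returns the instance name and its total (not per-GPU) hourly cost."""
    candidates = {
        name: gpus * price
        for name, (gpus, _, _, price) in INSTANCES.items()
        if gpus >= min_gpus
    }
    name = min(candidates, key=candidates.get)
    return name, candidates[name]


# e.g. cheapest_with_gpus(4) picks g4dn.12xlarge (4 x $0.978, about $3.91/hour).
```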