ml.trainers.mixins.heartbeat
A simple mixin for monitoring whether the main training job is still alive.
If this mixin detects that the training job has died, it kills the current process.
- class ml.trainers.mixins.heartbeat.HeartbeatConfig(name: str = '???', exp_name: str = '${ml.exp_name:null}', exp_dir: str = '???', log_dir_name: str = 'logs', use_double_weight_precision: bool = False, checkpoint: ml.trainers.base.CheckpointConfig = <factory>, heartbeat_ping_interval: float = 1800.0)[source]
Bases: MonitorProcessConfig
- heartbeat_ping_interval: float = 1800.0
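The interval is given in seconds, so the default of 1800.0 corresponds to one expected ping every 30 minutes. A minimal sketch of overriding it when constructing the config directly (in practice these configs are typically populated from experiment configuration files, but direct construction follows the dataclass signature above); the import path comes from the module header, while the `name` and `exp_dir` values are placeholders:

```python
from ml.trainers.mixins.heartbeat import HeartbeatConfig

# Expect a heartbeat ping every 10 minutes instead of the default 30.
config = HeartbeatConfig(
    name="my_experiment",          # placeholder experiment name
    exp_dir="runs/my_experiment",  # placeholder experiment directory
    heartbeat_ping_interval=600.0,
)
```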
- ml.trainers.mixins.heartbeat.worker(heartbeat_interval: float, heartbeat_event: Event, start_event: Event, pid: int, on_heartbeat: Callable[[int, Event], None]) None [source]
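The worker is intended to run in a separate process and watch for pings from the trainer. Below is a rough, self-contained sketch of how such a loop can behave, not the module's actual implementation: the `heartbeat_worker` name, the use of `threading.Event` for typing, and the SIGTERM-based kill are all illustrative assumptions.

```python
import os
import signal
from threading import Event
from typing import Callable


def heartbeat_worker(
    heartbeat_interval: float,
    heartbeat_event: Event,
    start_event: Event,
    pid: int,
    on_heartbeat: Callable[[int, Event], None],
) -> None:
    # Tell the parent process that the watchdog is up and running.
    start_event.set()
    while True:
        # The trainer is expected to set ``heartbeat_event`` at least once
        # per interval; if it does, clear the flag and keep waiting.
        if heartbeat_event.wait(timeout=heartbeat_interval):
            heartbeat_event.clear()
            continue
        # No ping arrived within the interval: notify the callback, then
        # stop the (presumably hung or dead) training process.
        on_heartbeat(pid, heartbeat_event)
        os.kill(pid, signal.SIGTERM)
        return
```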
- class ml.trainers.mixins.heartbeat.HeartbeatMonitor(heartbeat_interval: float, manager: SyncManager, on_heartbeat: Callable[[int, Event], None] | None)[source]
Bases: object
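The constructor takes a `SyncManager`, the standard-library facility for creating `Event` proxies that can be shared between the trainer process and the monitor process. A minimal standard-library illustration of that mechanism, independent of this class; the `watchdog` function and the five-second interval are invented for the example:

```python
import multiprocessing as mp
from multiprocessing.managers import SyncManager


def watchdog(event, interval: float) -> None:
    # Runs in a child process; exits once an interval passes without a ping.
    while event.wait(timeout=interval):
        event.clear()


if __name__ == "__main__":
    manager: SyncManager = mp.Manager()
    heartbeat = manager.Event()  # proxy Event, shareable across processes

    proc = mp.Process(target=watchdog, args=(heartbeat, 5.0), daemon=True)
    proc.start()

    heartbeat.set()  # a training loop would call this periodically
    proc.join()      # returns roughly five seconds after the last ping
```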
- class ml.trainers.mixins.heartbeat.HeartbeatMonitorMixin(config: HeartbeatConfigT)[source]
Bases: MonitorProcessMixin[HeartbeatConfigT, ModelT, TaskT]
Defines a trainer mixin for running a heartbeat process.
- on_training_start(state: State, task: TaskT, model: ModelT, optim: Optimizer | dict[str, torch.optim.optimizer.Optimizer], lr_sched: SchedulerAdapter | dict[str, ml.lr_schedulers.base.SchedulerAdapter]) None [source]
- on_training_end(state: State, task: TaskT, model: ModelT, optim: Optimizer | dict[str, torch.optim.optimizer.Optimizer], lr_sched: SchedulerAdapter | dict[str, ml.lr_schedulers.base.SchedulerAdapter]) None [source]
- on_step_start(state: State, task: TaskT, model: ModelT, optim: Optimizer | dict[str, torch.optim.optimizer.Optimizer], lr_sched: SchedulerAdapter | dict[str, ml.lr_schedulers.base.SchedulerAdapter]) None [source]
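Taken together, the hooks suggest a lifecycle in which the monitor is started before training, pinged on every step, and shut down when training ends. A hypothetical, self-contained sketch of that pattern; `_ToyMonitor`, `ToyHeartbeatMixin`, and everything inside them are invented for illustration and do not reflect the real mixin's internals:

```python
import threading


class _ToyMonitor:
    """Tiny in-process watchdog used only for this illustration."""

    def __init__(self, interval: float) -> None:
        self._interval = interval
        self._ping = threading.Event()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        while not self._stop.is_set():
            if not self._ping.wait(timeout=self._interval):
                print("No heartbeat received; the training loop looks stuck.")
                return
            self._ping.clear()

    def start(self) -> None:
        self._thread.start()

    def beat(self) -> None:
        self._ping.set()

    def stop(self) -> None:
        self._stop.set()
        self._ping.set()  # wake the watchdog so it can exit promptly
        self._thread.join()


class ToyHeartbeatMixin:
    """Hypothetical mixin showing how hooks like the ones above could drive a monitor."""

    heartbeat_ping_interval: float = 1800.0

    def on_training_start(self, state, task, model, optim, lr_sched) -> None:
        # Launch the watchdog before the first step runs.
        self._monitor = _ToyMonitor(self.heartbeat_ping_interval)
        self._monitor.start()

    def on_step_start(self, state, task, model, optim, lr_sched) -> None:
        # Ping once per step so the watchdog knows the loop is alive.
        self._monitor.beat()

    def on_training_end(self, state, task, model, optim, lr_sched) -> None:
        # Shut the watchdog down cleanly once training finishes.
        self._monitor.stop()
```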