ml.tasks.datasets.samplers

Custom samplers for datasets.

class ml.tasks.datasets.samplers.ChunkSampler(dataset: Sized, batch_size: int, shuffle: bool = False)

Bases: Sampler[int]

Sampler which yields chunks of adjacent IDs.

This sampler is useful for cases such as seq2seq models with variable-length output sequences and padding. Placing similar-length sequences next to each other makes the average collated tensor smaller, with less padding. In such cases, sorting the underlying dataset by sequence length (caption length, for a captioning dataset) and using this sampler yields the desired behavior.
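The padding argument can be checked with a little arithmetic. The sketch below is illustrative only: the helper name `padded_size` and the sequence lengths are made up, and it assumes each batch is padded to the length of its longest member.

```python
from typing import List, Sequence


def padded_size(lengths: Sequence[int], batch_size: int) -> int:
    """Total elements after padding each consecutive batch to its longest member."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        # Every sequence in the batch is padded up to the longest one.
        total += max(batch) * len(batch)
    return total


lengths: List[int] = [2, 9, 3, 10, 1, 8]  # made-up sequence lengths

print(padded_size(lengths, batch_size=2))          # unsorted order -> 54
print(padded_size(sorted(lengths), batch_size=2))  # sorted by length -> 40
```

The unpadded data holds 33 elements in both cases, so sorting cuts the padding from 21 elements to 7 in this toy example.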

Parameters:
  • dataset – The dataset to sample from

  • batch_size – The size of each chunk

  • shuffle – If True, yield chunks in random order; otherwise yield them from first to last
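The behavior these parameters describe can be sketched in plain Python. This is not the library's implementation: `chunked_indices` is a hypothetical stand-in for ChunkSampler, the `seed` argument is added only to make the example reproducible, and allowing a shorter final chunk is an assumption.

```python
import random
from typing import Iterator, List, Optional, Sized


def chunked_indices(
    dataset: Sized,
    batch_size: int,
    shuffle: bool = False,
    seed: Optional[int] = None,
) -> Iterator[int]:
    """Yield dataset IDs chunk by chunk (hypothetical helper, not ChunkSampler)."""
    n = len(dataset)
    # Start offset of every chunk: 0, batch_size, 2 * batch_size, ...
    starts: List[int] = list(range(0, n, batch_size))
    if shuffle:
        # Only the chunk order is randomized; IDs inside a chunk stay adjacent.
        random.Random(seed).shuffle(starts)
    for start in starts:
        # Assumption: the final chunk may be shorter than batch_size.
        yield from range(start, min(start + batch_size, n))


# A toy dataset already sorted by length, as the description suggests.
data = sorted(["a", "bbbb", "cc", "ddddd", "eee", "f", "gg"], key=len)

print(list(chunked_indices(data, batch_size=3)))  # [0, 1, 2, 3, 4, 5, 6]
print(list(chunked_indices(data, batch_size=3, shuffle=True, seed=0)))
```

Since the class derives from Sampler[int], it presumably yields individual IDs in the same way, so a data loader using the same batch size would receive each chunk as one batch.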