ml.utils.tokens

Defines utility functions for dealing with tokens and token datasets.

This file provides helper methods for reading and writing compressed datasets of tokens. Each token is packed into ceil(log2(num_tokens)) bits, and each line is padded to a multiple of 8 bits, which keeps the file size as small as possible while remaining efficient to read.
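
For example, with num_tokens = 6 the writer needs ceil(log2(6)) = 3 bits per token, so a line of 5 tokens occupies 15 bits and is padded to 2 bytes. A minimal sketch of that arithmetic (plain Python, not part of the library):

import math

num_tokens = 6                                      # vocabulary size
bits_per_token = math.ceil(math.log2(num_tokens))   # 3 bits per token
line_length = 5                                     # tokens in one line
raw_bits = bits_per_token * line_length             # 15 bits of payload
padded_bytes = math.ceil(raw_bits / 8)              # rounded up to 2 bytes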

Here’s an example of how to use the API:

from ml.utils.tokens import TokenReader, TokenWriter

num_tokens = 6
file_path = "/path/to/dataset.bin"

# Write the tokens to the dataset.
with TokenWriter(file_path, num_tokens) as writer:
    for _ in range(10):
        writer.write([1, 2, 3, 4, 5])

# Read the tokens from the dataset.
reader = TokenReader(file_path)
num_samples = len(reader)
for i in range(num_samples):
    print(reader[i])

You can also read some subset of the tokens in a line using slicing syntax. This syntax will only read the required tokens from the file, rather than reading the entire line and then slicing it. Here is an example:

reader = TokenReader(file_path)
print(reader[0])  # Prints the first line.
print(reader[0, 1:3])  # Prints the first line, but only the second and third tokens.
class ml.utils.tokens.TokenWriter(path: str | Path, num_tokens: int, overwrite_if_exists: bool = False, *, num_tokens_fmt: Literal['Q', 'I', 'H', 'B'] = 'I', lengths_fmt: Literal['Q', 'I', 'H', 'B'] = 'I', offset_fmt: Literal['Q', 'I', 'H', 'B'] = 'Q')[source]

Bases: ContextManager

Helper class for writing a dataset of tokens to a file.

This class can be used in conjunction with TokenReader to write and read datasets of tokens. The default numerical formats are chosen to work well with typical ranges of token datasets. At the upper end, this supports 2^32 tokens, 2^32 tokens per line, and 2^64 tokens per file.

Parameters:
  • path – The path to the file to write to.

  • num_tokens – The number of tokens in the dataset.

  • overwrite_if_exists – Whether to overwrite the file if it already exists.

  • num_tokens_fmt – The format string for the number of tokens.

  • lengths_fmt – The format string for the lengths of each line.

  • offset_fmt – The format string for the offsets of each line.

write(tokens: Iterable[int]) → None[source]
writemany(tokens: Iterable[Iterable[int]]) → None[source]
flush() → None[source]
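
The num_tokens_fmt, lengths_fmt, and offset_fmt arguments appear to be standard struct format characters ('B', 'H', 'I', 'Q' for 1-, 2-, 4-, and 8-byte unsigned integers), so narrower formats shrink the header and per-line bookkeeping. A sketch of writing a small-vocabulary dataset in one call with writemany; the path and the choice of 'H' formats are illustrative assumptions:

from ml.utils.tokens import TokenWriter

lines = [[0, 1, 2], [3, 2, 1, 0], [1, 1]]

# With only 4 possible token values, 16-bit ('H') header fields are more than enough.
with TokenWriter(
    "/path/to/small_dataset.bin",
    num_tokens=4,
    overwrite_if_exists=True,
    num_tokens_fmt="H",
    lengths_fmt="H",
) as writer:
    writer.writemany(lines)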
class ml.utils.tokens.TokenReader(path: str | Path)[source]

Bases: object

Helper class for reading a dataset of tokens from a file.

This class can be used in conjunction with TokenWriter to write and read datasets of tokens.

Parameters:
  • path – The path to the file to read from.

  • shard – Read a specific shard from the dataset.

property bits_per_token: int
byte_length(index: int) → int[source]
length(index: int) → int[source]
property byte_lengths: list[int]
property lengths: list[int]
property offsets: list[int]
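
A sketch of inspecting a dataset written as in the example above; the comments describe the presumed meaning of each accessor:

from ml.utils.tokens import TokenReader

reader = TokenReader("/path/to/dataset.bin")

print(reader.bits_per_token)  # Bits used per token, e.g. 3 when num_tokens is 6.
print(reader.lengths)         # Number of tokens in each line.
print(reader.offsets)         # Byte offset where each line starts.
print(reader.length(0))       # Token count of the first line.
print(reader.byte_length(0))  # Encoded size of the first line in bytes.
print(reader[0, 1:3])         # Second and third tokens of the first line.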
class ml.utils.tokens.token_file[source]

Bases: object

classmethod to_bytes(tokens: Iterable[int], num_tokens: int) → bytes[source]
classmethod from_bytes(tokens_enc: bytes, seq_len: int, num_tokens: int) → list[int][source]
classmethod open(path: str | Path, mode: Literal['w'], num_tokens: int, overwrite_if_exists: bool = False) → TokenWriter[source]
classmethod open(path: str | Path, mode: Literal['r'] = 'r') → TokenReader

Opens a token file for reading or writing.

Parameters:
  • path – The path to the token file.

  • mode – The mode to open the file in. Can be either "r" for reading or "w" for writing.

  • num_tokens – The number of tokens in the dataset. Required when opening in write mode.

  • overwrite_if_exists – Whether to overwrite the file if it already exists. Only used when opening in write mode.

Returns:

A TokenReader or TokenWriter depending on mode.
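
A sketch that ties the pieces together through token_file; it assumes to_bytes and from_bytes round-trip a single line, with seq_len being the number of tokens that were encoded:

from ml.utils.tokens import token_file

num_tokens = 6
file_path = "/path/to/dataset.bin"

# Write and then read back through the classmethod interface.
with token_file.open(file_path, "w", num_tokens=num_tokens, overwrite_if_exists=True) as writer:
    writer.write([1, 2, 3, 4, 5])

reader = token_file.open(file_path)
print(reader[0])

# Encode and decode a single line in memory.
encoded = token_file.to_bytes([1, 2, 3, 4, 5], num_tokens)
decoded = token_file.from_bytes(encoded, seq_len=5, num_tokens=num_tokens)
print(decoded)  # Expected to recover the original tokens.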