ml.utils.tokens
Defines utility functions for dealing with tokens and token datasets.
This file provides helper methods for reading and writing compressed datasets of tokens. Tokens are packed into ceil(log2(num_tokens)) bits per token, with padding at the end of each line so that every line occupies a whole number of bytes. This keeps the file size as small as possible while remaining efficient to read.
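As a quick sanity check on that arithmetic, here is a standalone sketch (plain Python, not the library's internal code) of how many bytes one packed line takes:

import math

num_tokens = 6  # vocabulary size
bits_per_token = math.ceil(math.log2(num_tokens))  # 3 bits per token

line = [1, 2, 3, 4, 5]
packed_bits = bits_per_token * len(line)   # 15 bits
padded_bytes = math.ceil(packed_bits / 8)  # padded up to 2 whole bytes
print(bits_per_token, packed_bits, padded_bytes)  # 3 15 2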
Here’s an example of how to use the API:
from ml.utils.tokens import TokenReader, TokenWriter

num_tokens = 6
file_path = "/path/to/dataset.bin"

# Write the tokens to the dataset.
with TokenWriter(file_path, num_tokens) as writer:
    for _ in range(10):
        writer.write([1, 2, 3, 4, 5])

# Read the tokens from the dataset.
reader = TokenReader(file_path)
num_samples = len(reader)
for i in range(num_samples):
    print(reader[i])
You can also read a subset of the tokens in a line using slicing syntax. A slice reads only the required tokens from the file, rather than reading the entire line and then slicing it. Here is an example:
reader = TokenReader(file_path)
print(reader[0]) # Prints the first line.
print(reader[0, 1:3]) # Prints the first line, but only the second and third tokens.
- class ml.utils.tokens.TokenWriter(path: str | Path, num_tokens: int, overwrite_if_exists: bool = False, *, num_tokens_fmt: Literal['Q', 'I', 'H', 'B'] = 'I', lengths_fmt: Literal['Q', 'I', 'H', 'B'] = 'I', offset_fmt: Literal['Q', 'I', 'H', 'B'] = 'Q')[source]
Bases: ContextManager
Helper class for writing a dataset of tokens to a file.
This class can be used in conjunction with TokenReader to write and read datasets of tokens. The default numerical formats are chosen to work well with typical ranges of token datasets. At the upper end, the defaults support 2^32 unique tokens, 2^32 tokens per line, and 2^64 tokens per file.
- Parameters:
path – The path to the file to write to.
num_tokens – The number of unique tokens in the dataset (the vocabulary size), which determines the number of bits used per token.
overwrite_if_exists – Whether to overwrite the file if it already exists.
num_tokens_fmt – The format character used to store the number of tokens.
lengths_fmt – The format character used to store the length of each line.
offset_fmt – The format character used to store the offset of each line.
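The format characters appear to follow Python's struct module, where "B", "H", "I", and "Q" are 8-, 16-, 32-, and 64-bit unsigned integers. As a hedged sketch of tuning them, a dataset with a small vocabulary and short lines could shrink the header fields (the path and values here are placeholders):

from ml.utils.tokens import TokenWriter

with TokenWriter(
    "/path/to/small_vocab.bin",
    num_tokens=256,
    num_tokens_fmt="H",  # vocabulary size stored as a 16-bit unsigned integer
    lengths_fmt="H",     # caps each line at 2^16 tokens
    offset_fmt="Q",      # keep 64-bit offsets so the file itself can stay large
) as writer:
    writer.write([0, 1, 2, 255])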
- class ml.utils.tokens.TokenReader(path: str | Path)[source]
Bases: object
Helper class for reading a dataset of tokens from a file.
This class can be used in conjunction with TokenWriter to write and read datasets of tokens.
- Parameters:
path – The path to the file to read from.
shard – Read a specific shard from the dataset.
- property bits_per_token: int
The number of bits used to encode each token, i.e. ceil(log2(num_tokens)).
- property byte_lengths: list[int]
The length of each line in bytes.
- property lengths: list[int]
The number of tokens in each line.
- property offsets: list[int]
The offset of each line within the file.
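A short sketch of inspecting a dataset through these properties, assuming the file written in the first example above:

from ml.utils.tokens import TokenReader

reader = TokenReader("/path/to/dataset.bin")
print(reader.bits_per_token)    # 3, since ceil(log2(6)) == 3
print(reader.lengths[:3])       # tokens per line, e.g. [5, 5, 5]
print(reader.byte_lengths[:3])  # packed size of each line in bytes
print(reader.offsets[:3])       # where each line starts in the file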
- class ml.utils.tokens.token_file[source]
Bases: object
- classmethod open(path: str | Path, mode: Literal['w'], num_tokens: int, overwrite_if_exists: bool = False) → TokenWriter [source]
- classmethod open(path: str | Path, mode: Literal['r'] = 'r') → TokenReader
Opens a token file for reading or writing.
- Parameters:
path – The path to the token file.
mode – The mode to open the file in. Can be either "r" for reading or "w" for writing.
num_tokens – The number of tokens in the dataset. Required when opening in write mode.
overwrite_if_exists – Whether to overwrite the file if it already exists. Only used when opening in write mode.
- Returns:
A TokenReader or TokenWriter, depending on the mode.
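A brief sketch of the round trip through token_file.open (the path is a placeholder, and the writer is used as a context manager since TokenWriter is one):

from ml.utils.tokens import token_file

# Write mode requires the vocabulary size up front.
with token_file.open("/path/to/dataset.bin", "w", num_tokens=6) as writer:
    writer.write([1, 2, 3, 4, 5])

# Read mode is the default.
reader = token_file.open("/path/to/dataset.bin")
print(len(reader), reader[0])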