tokenizers

Helpers and wrappers to safely use HuggingFace tokenizers.

class Special(unpredictable=(), *predictable, pad, unk, eos)

Bases: object

Wrapper around special tokens for consistent access across projects.

Parameters:

unpredictable (Iterable of AddedToken) – All special tokens, in the form of AddedToken from the HuggingFace tokenizers package, that a language model might get as input, but will never be required to predict (e.g., mask or unknown). Defaults to an empty tuple. If given, these tokens will come after pad and unk in the vocabulary (i.e., their indices will be 2 and up), but before eos.
*predictable (AddedToken) – Additional special tokens, in the form of AddedToken from the HuggingFace tokenizers package, that the model might be required to predict. If given, these tokens will come after eos.
pad (AddedToken) – The padding token, in the form of an AddedToken from the HuggingFace tokenizers package. It will always have token index 0.
unk (AddedToken) – The unknown token, in the form of an AddedToken from the HuggingFace tokenizers package. It will always have token index 1.
eos (AddedToken) – The end-of-sequence token, in the form of an AddedToken from the HuggingFace tokenizers package. This token comes after the unpredictable tokens, but before the predictable tokens.
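
The index layout described by these parameters can be sketched in plain Python. The function special_order below is a hypothetical stand-in, not part of this package, and plain strings are assumed in place of AddedToken instances.

```python
def special_order(unpredictable=(), *predictable, pad, unk, eos):
    """Return special-token strings in the documented index order."""
    # pad -> 0, unk -> 1, then unpredictable tokens, then eos, then predictable.
    return [pad, unk, *unpredictable, eos, *predictable]


tokens = special_order(
    ["<mask>"],          # unpredictable: input only, never predicted
    "<sep>", "<cls>",    # predictable: placed after eos
    pad="<pad>",
    unk="<unk>",
    eos="<eos>",
)
encoder = {token: index for index, token in enumerate(tokens)}
print(encoder)
# {'<pad>': 0, '<unk>': 1, '<mask>': 2, '<eos>': 3, '<sep>': 4, '<cls>': 5}
```

Under this layout, the index of eos depends on how many unpredictable tokens are given, which is why it is exposed as a property rather than being a fixed number.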

Important

Token IDs and string representations are needed in seemingly unrelated parts of the overall workflow, which makes it deceptively easy to introduce an inconsistency somewhere. Instances of this class are meant to serve as the single source of truth for your entire project.

property contents: Only the string representations of all special tokens.

property decoder: Dictionary with indices as keys and special tokens as values.

property encoder: Special-token strings as keys and their indices as values.

property eos_id: ID of the end-of-sequence token.

property ids: Only the IDs of all special tokens.

property items: Tuples of (ID, added token) for all special tokens

property pad_id: ID of the padding token. Always 0.

property tokens: Special tokens in the correct order.

property unigram_vocab: Initial vocabulary of special tokens for the Unigram tokenizer.

property unk_id: ID of the unknown token. Always 1.

class Algo(special, tokenizer, trainer=None, normalizer=None, pre_tokenizer=None, decoder=None, post_processor=None)

Bases: object

Wrap a Tokenizer instance for safe, convenient, and consistent usage.

In particular, the train and train_from_iterator methods are overridden to always use the trainer provided at instantiation of this wrapper (and to return the trained instance). The from_* methods are all overridden to return an instance of this wrapper instead of a bare tokenizer. All other method calls are simply forwarded to the wrapped Tokenizer, so that instances serve as a drop-in replacement.

Parameters:

special (Special) – An instance of Special, specifying all the special tokens that the tokenizer, the model, and the trainer should be aware of.
tokenizer (Model or Tokenizer) – Instance of a Model or a Tokenizer from the HuggingFace tokenizers package. If it is a Model, a fresh Tokenizer instance will be created from it.
trainer (Trainer, optional) – An instance of a Trainer from the HuggingFace tokenizers package. If not given, an appropriate trainer will be instantiated with its default parameters. Defaults to None.
normalizer (Normalizer, optional) – An instance of a Normalizer from the HuggingFace tokenizers package. Defaults to None.
pre_tokenizer (PreTokenizer, optional) – An instance of a PreTokenizer from the HuggingFace tokenizers package. Defaults to None.
decoder (Decoder, optional) – An instance of a Decoder from the HuggingFace tokenizers package. Defaults to None.
post_processor (Processor, optional) – An instance of a Processor from the HuggingFace tokenizers package. Defaults to None.

Important

While the consistency of the special tokens used by the tokenizer and the trainer is checked, no such guarantee can be made regarding their potential use in the normalizer, the decoder, or the post_processor.
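
The wrapping behavior described for this class (a trainer fixed at instantiation, train returning the instance, everything else forwarded) can be sketched with Python's __getattr__ hook. Both classes below are hypothetical stand-ins written for illustration; they are not the implementation of Algo, and DummyTokenizer only mimics the train(files, trainer=None) call shape of a HuggingFace Tokenizer.

```python
class DummyTokenizer:
    """Hypothetical stand-in for a HuggingFace Tokenizer."""

    def __init__(self):
        self.trained_with = None

    def train(self, files, trainer=None):
        # Record which trainer was used instead of actually training.
        self.trained_with = trainer

    def get_vocab_size(self):
        return 42


class Wrapper:
    """Hypothetical wrapper: pins the trainer, forwards everything else."""

    def __init__(self, tokenizer, trainer):
        self.tokenizer = tokenizer
        self.trainer = trainer

    def train(self, files):
        # Always use the trainer given at instantiation and return self.
        self.tokenizer.train(files, trainer=self.trainer)
        return self

    def __getattr__(self, name):
        # Only reached when normal attribute lookup on the wrapper fails,
        # so unknown attributes are looked up on the wrapped tokenizer.
        return getattr(self.tokenizer, name)


algo = Wrapper(DummyTokenizer(), trainer="my-trainer").train(["corpus.txt"])
print(algo.tokenizer.trained_with)  # my-trainer
print(algo.get_vocab_size())        # 42
```

Returning the instance from train allows construction and training to be chained in a single expression, as in the last lines of the sketch.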

property eos_id: The ID of the end-of-sequence token.

from_buffer(buffer)

Instantiate a new algo from the given buffer.

Parameters: buffer (bytes) – A buffer containing a previously serialized Tokenizer.
Returns: The new algo.
Return type: Algo

from_file(path)

Instantiate a new algo from the tokenizer file at the given path.

Parameters: path (str) – Full path to the tokenizer file to load.
Returns: The new algo.
Return type: Algo

from_pretrained(identifier, revision='main', token=None)

Instantiate a new algo from a tokenizer pulled from the HuggingFace Hub.

Parameters:

identifier (str) – The identifier of a Model on the HuggingFace Hub that contains a “tokenizer.json” file.
revision (str, optional) – A branch or commit id. Defaults to “main”.
token (str, optional) – An optional auth token used to access private repositories on the HuggingFace Hub.

Returns:

The new algo.

Return type:

Algo

Warning

Because this package does not depend on the HuggingFace transformers package, this method might not work as anticipated, or might not work at all.

from_str(json)

Instantiate a new algo from the given JSON string.

Parameters: json (str) – A valid JSON string representing a previously serialized Tokenizer.
Returns: The new algo.
Return type: Algo

terminate(encodings)

Add the ID of the end-of-sequence token to a list of integers.

If the last integer in the list already is the ID of the end-of-sequence token, the original list is returned.

Parameters: encodings (list of int) – An encoded piece of text.
Returns: The input encodings, potentially extended by the ID of the end-of-sequence token.
Return type: list of int
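
The documented behavior of terminate can be sketched as a standalone function. This is a minimal illustration, not the method's actual code: the explicit eos_id argument, the value of EOS_ID, the choice to return a new list rather than extend the input in place, and the handling of an empty list are all assumptions.

```python
def terminate(encodings, eos_id):
    """Append eos_id unless the sequence already ends with it."""
    if encodings and encodings[-1] == eos_id:
        # Already terminated: hand back the original list unchanged.
        return encodings
    # Assumption: build a new list instead of mutating the input.
    return [*encodings, eos_id]


EOS_ID = 3  # hypothetical ID of the end-of-sequence token

print(terminate([7, 8, 9], EOS_ID))  # [7, 8, 9, 3]
print(terminate([7, 8, 3], EOS_ID))  # [7, 8, 3]
```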

train(files)

Train the wrapped tokenizer on the given files.

Parameters: files (list of str) – A list of paths to the files that should be used for training.
Returns: Itself with the wrapped Tokenizer now trained.
Return type: Algo

train_from_iterator(documents)

Train the wrapped tokenizer on the provided iterator over documents.

Parameters: documents (iterable over str or over list of str) – The documents to train on.
Returns: Itself with the wrapped Tokenizer now trained.
Return type: Algo

property unk_id: The ID of the unknown token. Always 1.

property vocab: The target vocabulary size of the wrapped trainer.