tokenizers
Helpers and wrappers to safely use HuggingFace tokenizers.
- class Special(unpredictable=(), *predictable, pad, unk, eos)[source]
Bases: object
Wrapper around special tokens for consistent access across projects.
- Parameters:
unpredictable (Iterable of AddedToken) – All special tokens, in the form of AddedToken from the HuggingFace tokenizers package, that a language model might get as input, but will never be required to predict (e.g., mask or unknown). Defaults to an empty tuple. If given, these tokens will come after pad and unk in the vocabulary (i.e., their indices will be 2 and up), but before eos.
*predictable (AddedToken) – Additional special tokens, in the form of AddedToken from the HuggingFace tokenizers package, that the model might be required to predict. If given, these tokens will come after eos.
pad (AddedToken) – The padding token, in the form of an AddedToken from the HuggingFace tokenizers package. It will always have token index 0.
unk (AddedToken) – The unknown token, in the form of an AddedToken from the HuggingFace tokenizers package. It will always have token index 1.
eos (AddedToken) – The end-of-sequence token, in the form of an AddedToken from the HuggingFace tokenizers package. This token comes after the unpredictable tokens, but before the predictable tokens.
Important
Various token IDs and string representations are needed at seemingly disjoint parts of the overall workflow. It is, therefore, deceptively easy to make a mistake somewhere, somehow. Instances of this class are to serve as the single ground truth for your entire project.
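The index layout described by the parameters above can be sketched in plain Python. This is only an illustration of the documented ordering, not the implementation of the class. The helper name `special_order`, the use of plain strings in place of `AddedToken` objects, and the token strings themselves are assumptions made for the example.

```python
def special_order(pad, unk, eos, unpredictable=(), predictable=()):
    """Return special tokens in the documented order (hypothetical helper)."""
    # pad is always index 0 and unk always index 1; unpredictable tokens
    # follow from index 2, then eos, then the predictable tokens.
    return [pad, unk, *unpredictable, eos, *predictable]


# Illustrative token strings only; a real project defines its own.
tokens = special_order(
    pad='[PAD]',
    unk='[UNK]',
    eos='[EOS]',
    unpredictable=('[MASK]',),
    predictable=('[SEP]',),
)
for index, token in enumerate(tokens):
    print(index, token)
```

With one unpredictable token, eos lands at index 3. Without any, it would sit directly after unk at index 2.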
- property contents
Only the string representations of all special tokens.
- property decoder
Dictionary with indices as keys and special tokens as values.
- property encoder
Special-token strings as keys and their indices as values.
- property eos_id
ID of the end-of-sequence token.
- property ids
Only the IDs of all special tokens.
- property items
Tuples of (ID, added token) for all special tokens.
- property pad_id
ID of the padding token. Always 0.
- property tokens
Special tokens in the correct order.
- property unigram_vocab
Initial vocabulary of special tokens for the Unigram tokenizer.
- property unk_id
ID of the unknown token. Always 1.
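How the properties listed above relate to one another can be sketched from a single ordered list of tokens. The snippet below is a rough illustration, not the class itself: it uses plain strings where the real class holds `AddedToken` objects, the token strings are made up, and the concrete container types (lists and dicts) are assumptions.

```python
# Hypothetical ordered special tokens as plain strings (illustrative values).
tokens = ['[PAD]', '[UNK]', '[MASK]', '[EOS]']

ids = list(range(len(tokens)))            # cf. ids: only the indices
items = list(enumerate(tokens))           # cf. items: (ID, token) pairs
decoder = dict(items)                     # cf. decoder: ID -> token
encoder = {tok: i for i, tok in items}    # cf. encoder: string -> ID

# The dedicated ID properties are then simple lookups.
pad_id = encoder['[PAD]']
unk_id = encoder['[UNK]']
eos_id = encoder['[EOS]']
```

Deriving every view from one ordered sequence is what keeps the encoder and the decoder exact inverses of each other.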
- class Algo(special, tokenizer, trainer=None, normalizer=None, pre_tokenizer=None, decoder=None, post_processor=None)[source]
Bases: object
Wrap a Tokenizer instance for safe, convenient, and consistent usage.
In particular, the train and train_from_iterator methods are overridden to always use the trainer provided at instantiation of this wrapper (and to return the trained instance). The from_* methods are all overridden to return an instance of this wrapper instead of a bare tokenizer. All other method calls are simply forwarded to the wrapped Tokenizer, so that instances serve as a drop-in replacement.
- Parameters:
special (Special) – An instance of Special, specifying all the special tokens that the tokenizer, the model, and the trainer should be aware of.
tokenizer (Model or Tokenizer) – Instance of a Model or a Tokenizer from the HuggingFace tokenizers package. If it is a Model, a fresh Tokenizer instance will be created from it.
trainer (Trainer, optional) – An instance of a Trainer from the HuggingFace tokenizers package. If not given, an appropriate trainer will be instantiated with its default parameters. Defaults to None.
normalizer (Normalizer, optional) – An instance of a Normalizer from the HuggingFace tokenizers package. Defaults to None.
pre_tokenizer (PreTokenizer, optional) – An instance of a PreTokenizer from the HuggingFace tokenizers package. Defaults to None.
decoder (Decoder, optional) – An instance of a Decoder from the HuggingFace tokenizers package. Defaults to None.
post_processor (Processor, optional) – An instance of a Processor from the HuggingFace tokenizers package. Defaults to None.
Important
While the consistency of the special tokens used by the tokenizer and the trainer is checked, no such guarantees can be made regarding their potential use in the normalizer, the decoder, or the post_processor.
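The wrapping behaviour described for this class (a fixed trainer, training methods that return the wrapper, and forwarding of everything else) can be sketched with a stand-in object. Neither class below is part of this package or of HuggingFace tokenizers: `FakeTokenizer` and `Wrapper` are hypothetical names, and forwarding via `__getattr__` is an assumption about one way to get drop-in behaviour, not a statement about how Algo is implemented.

```python
class FakeTokenizer:
    """Stand-in for a wrapped tokenizer (not the HuggingFace class)."""

    def __init__(self):
        self.trained_with = None

    def train(self, files, trainer=None):
        # Just record which trainer was passed in.
        self.trained_with = trainer

    def get_vocab_size(self):
        return 42


class Wrapper:
    """Hypothetical sketch of the wrapping pattern described above."""

    def __init__(self, tokenizer, trainer):
        self.tokenizer = tokenizer
        self.trainer = trainer

    def train(self, files):
        # Always use the trainer fixed at instantiation, then return self.
        self.tokenizer.train(files, self.trainer)
        return self

    def __getattr__(self, name):
        # Only called when normal lookup fails: forward to the wrapped object.
        return getattr(self.tokenizer, name)


algo = Wrapper(FakeTokenizer(), trainer='my-trainer')
trained = algo.train(['corpus.txt'])
```

Because `train` returns the wrapper, calls can be chained, and any method the wrapper does not define itself (here `get_vocab_size`) is answered by the wrapped object.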
- property eos_id
The ID of the end-of-sequence token.
- from_buffer(buffer)[source]
Instantiate a new algo from the given buffer.
- Parameters:
buffer (bytes) – A buffer containing a previously serialized Tokenizer.
- Returns:
The new algo.
- Return type:
Algo
- from_file(path)[source]
Instantiate a new Tokenizer from the file at the given path.
- Parameters:
path (str) – Full path to the tokenizer file to load.
- Returns:
The new algo.
- Return type:
Algo
- from_pretrained(identifier, revision='main', token=None)[source]
Instantiate a new Tokenizer pulled from the HuggingFace Hub.
- Parameters:
identifier (str) – The identifier of a Model on the HuggingFace Hub that contains a “tokenizer.json” file.
revision (str, optional) – A branch or commit id. Defaults to “main”.
token (str, optional) – An optional auth token used to access private repositories on the HuggingFace Hub.
- Returns:
The new algo.
- Return type:
Algo
Warning
Because this package does not depend on the HuggingFace transformers package, this method might not work as anticipated or might not work at all.
- from_str(json)[source]
Instantiate a new Tokenizer from the given JSON string.
- Parameters:
json (str) – A valid JSON string representing a previously serialized Tokenizer.
- Returns:
The new algo.
- Return type:
Algo
- terminate(encodings)[source]
Add the id of the end-of-sequence token to a list of integers.
If the last integer in the list already is the id of the end-of-sequence token, the original list is returned.
- Parameters:
encodings (list of int) – An encoded piece of text.
- Returns:
The input encodings, potentially extended by the id of the end-of-sequence token.
- Return type:
list of int
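The documented behaviour of terminate can be sketched as a free function. This is an illustration under stated assumptions, not the method's source: the real method takes the end-of-sequence ID from the wrapper, whereas here it is passed explicitly, and returning a new list (rather than extending the input in place) as well as appending to an empty list are guesses.

```python
def terminate(encodings, eos_id):
    """Append eos_id unless the sequence already ends with it (sketch)."""
    if encodings and encodings[-1] == eos_id:
        # Already terminated: hand back the original list unchanged.
        return encodings
    # Assumption: build a new list rather than mutate the input.
    return [*encodings, eos_id]


EOS_ID = 2  # hypothetical; the real value comes from the eos_id property
print(terminate([5, 7, 9], EOS_ID))
print(terminate([5, 7, 2], EOS_ID))
```

The check on the last element makes the operation idempotent, so applying it twice never yields two end-of-sequence IDs in a row.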
- train(files)[source]
Train the wrapped tokenizer on the given files.
- Parameters:
files (list of str) – A list of paths to the files that should be used for training.
- Returns:
Itself with the wrapped Tokenizer now trained.
- Return type:
Algo
- train_from_iterator(documents)[source]
Train the wrapped tokenizer on the provided iterator over documents.
- Parameters:
documents (iterable over str or over list of str) – The documents to train on.
- Returns:
Itself with the wrapped Tokenizer now trained.
- Return type:
Algo
- property unk_id
The ID of the unknown token. Always 1.
- property vocab
The target vocabulary size of the wrapped trainer.