tokenize
The next step after preparing your data is to train a tokenizer. You can chose from three of those provided by the HuggingFace tokenizers package:
The default is unigram, pre-configured with sane settings. You can, therefore, simply invoke:
slangmod tokenize
Should you want to use a different one, just pass it to the command line
slangmod tokenize --tokens.algo <ALGO>
or put it into a new [tokens] section in your config TOML file.
work_dir = "/absolute/path/to/your/working/directory"
log_level = 10
progress = true
[files]
raw = "/absolute/path/to/data/files"
suffix = "pqt"
column = "document"
min_doc_len = 32
cleaners = ["quotes", "encoding"]
encoding = "cp1252"
[tokens]
algo = "bpe"
slangmod will train a tokenizer and save it into your work_dir.
The name of the file under which it is saved can, in principle, be set with
the --files.tokenizer <FILE NAME> flag on the command line and you could
can also add your preference to the [files] section of your config TOML,
but I strongly advise against setting it explicitly.
Important
In order for your entire workflow to stay consistent, the default name for
the tokenizer file contains a hash of the entire [tokens] section,
including the algo, all options, and all settings
for eos.
Note
Should cou choose to set it anyway, be advised that it does not matter
whether you specify a file extension or not. slangmod will always
save it with a “json” extension because it is a JSON file.
options
All further configuration options are likewise set with --tokens.<KEY> <VALUE>
on the command line and/or go into the [tokens] section of your config TOML
file.
- tokens.vocab = 16384
Maximum vocabulary size for all tokenizers. For a monolingual model in a simple script and clean corpus with few special symbols, this value might work. But it is certainly at the lower end._
- tokens.dropout = 0.0
- tokens.min_freq = 0
Minimum frequency a pair should have in order to be merged. Affects both the BPE trainer and the WordPiece trainer.
- tokens.max_len = 16
Sets the max_input_chars_per_word in the WordPiece tokenizer, the max_token_length in the BPE trainer, and the max_piece_length in the Unigram trainer.
- tokens.shrink_factor = 0.75
shrink_factor of the Unigram trainer.
- tokens.n_iter = 2
n_sub_iterations of the Unigram trainer.
eos
As discussed earlier you need to let your model know when
a sequence ends. The only way to do that is to tokenize a specific pattern as
a special [EOS] token. To indicate which pattern that should be, you need
to set two things:
- tokens.eos_regex = “\n{2,}”
A regular expression that matches the pattern you want to set as EOS. Owing to the data I am working with, I decided to got with the end of a paragraph, that is, two or more consecutive newline characters.
- tokens.eos_string = “\n\n”
This is an example string that must match the regular expression you just specified.
Again, both can be set either on the command line or in your config file.
slangmod tokenize --tokens.eos_string "\n\n" --tokens.eos_regex "\n{2,}"
work_dir = "/absolute/path/to/your/working/directory"
log_level = 10
progress = true
[files]
raw = "/absolute/path/to/data/files"
suffix = "pqt"
column = "document"
min_doc_len = 32
cleaners = ["quotes", "encoding"]
encoding = "cp1252"
tokenizer = "my_tokenizer.json"
[tokens]
algo = "bpe"
vocab = 30000
eos_string = "\n\n"
eos_regex = "\n{2,}"
Tip
Use the excellent regex 101 with some
sample text from your data to make sure both tokens.eos_regex and
tokens.eos_string are correct.