encode
Now that you have trained a tokenizer, you can use it to encode your text data, that is, to “translate” each document into a sequence of integers. This step takes no configuration options.
slangmod encode
After this step, you should have a subdirectory “encodings” in your
work_dir, suffixed with either
the same hash as the tokenizer file (if you didn’t set its name explicitly) or
the actual name of the tokenizer file (if you did).
This highlights the purpose of that hash. Had you changed any
options since training the tokenizer, slangmod would have complained that it
cannot find a tokenizer file. That way, you can track which encoded documents
were produced by which tokenizer, and you can keep several versions side by side.
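To make the idea behind the hash concrete, here is a minimal Python sketch of how a digest of the options could tie a tokenizer file to its encodings folder. This is an illustration only, not slangmod's actual implementation: the function names, the file and folder naming scheme, the choice of SHA-256, and the option keys are all assumptions made up for this example.

```python
import hashlib
import json
from pathlib import Path


def options_hash(options: dict) -> str:
    """Return a short, stable digest of an options dictionary (hypothetical scheme)."""
    # Sorting the keys makes the digest independent of the order of the options.
    canonical = json.dumps(options, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:8]


def tokenizer_path(work_dir: Path, options: dict) -> Path:
    """Hypothetical location of the tokenizer file for these options."""
    return work_dir / f"tokenizer-{options_hash(options)}.json"


def encodings_dir(work_dir: Path, options: dict) -> Path:
    """Hypothetical location of the encodings folder for these options."""
    return work_dir / f"encodings-{options_hash(options)}"


def find_tokenizer(work_dir: Path, options: dict) -> Path:
    """Return the tokenizer file matching the options or fail loudly."""
    path = tokenizer_path(work_dir, options)
    if not path.exists():
        # Changed options give a different hash, so no matching file exists.
        raise FileNotFoundError(f"Cannot find tokenizer file {path}")
    return path
```

Under these assumptions, changing a single option yields a different digest, so the lookup fails until a tokenizer has been trained with the new options, and the encodings produced with each tokenizer end up in separate folders.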
The reason encode is a separate step is that it can take a while
(depending on how much data you have) and you don’t want to wait around
every time you start a new model-training run. Courtesy of
swak, you can also run both steps,
tokenize and encode, in one go.
slangmod tokenize encode