encode

Now that you have trained a tokenizer, we will use it to encode your text data, that is, to “translate” each document into a sequence of integers. There are no configuration options for doing this.

slangmod encode
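To make “translating a document into a sequence of integers” concrete, here is a toy sketch. The vocabulary and the whitespace splitting are made up for illustration and are not how slangmod's trained tokenizer works internally; only the general idea of mapping tokens to integer ids carries over.

```python
# Toy illustration only: a hypothetical, hand-written vocabulary,
# not the tokenizer that slangmod trains.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}


def encode(document: str) -> list[int]:
    """Map each whitespace-separated token to its integer id (0 if unknown)."""
    return [vocab.get(token, vocab["<unk>"]) for token in document.lower().split()]


encoded = encode("The cat sat down")
print(encoded)  # [1, 2, 3, 0]
```

The word “down” is not in the toy vocabulary, so it falls back to the id of the unknown token.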

After this step, you should have a subdirectory “encodings” in your work_dir, suffixed with

  • the same hash as the tokenizer file (if you didn’t explicitly set its name) or

  • the actual name of the tokenizer file if you did.

This highlights the purpose of that hash. Had you changed any tokenizer options in the meantime, slangmod would have complained that it cannot find a tokenizer file. That way, you can track which encoded documents have run through which tokenizer, and you can keep several versions side by side.
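The idea of tying artifacts to a hash of the options can be sketched as follows. This is a minimal illustration under stated assumptions, not slangmod's actual implementation: the hashing scheme, the file and folder names (“tokenizer-<hash>.json”, “encodings-<hash>”), and the option name vocab_size are all hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def options_hash(options: dict) -> str:
    """Short, deterministic digest of the options (hypothetical scheme)."""
    canonical = json.dumps(options, sort_keys=True)  # key order must not matter
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:8]


def find_tokenizer(work_dir: Path, options: dict) -> Path:
    """Return the tokenizer file matching these options or fail loudly."""
    path = work_dir / f"tokenizer-{options_hash(options)}.json"  # assumed naming
    if not path.exists():
        raise FileNotFoundError(f"No tokenizer trained with these options: {path}")
    return path


with tempfile.TemporaryDirectory() as tmp:
    work_dir = Path(tmp)
    trained_with = {"vocab_size": 1000}  # hypothetical option name
    # Pretend a tokenizer was trained and saved with these options.
    (work_dir / f"tokenizer-{options_hash(trained_with)}.json").write_text("{}")

    # Same options: the tokenizer is found, and encodings share its hash.
    found = find_tokenizer(work_dir, trained_with)
    encodings_dir = work_dir / f"encodings-{options_hash(trained_with)}"

    # Changed options: a different hash, so no matching tokenizer file.
    try:
        find_tokenizer(work_dir, {"vocab_size": 2000})
        changed_options_found = True
    except FileNotFoundError:
        changed_options_found = False
```

Because the hash is derived from the options alone, encodings produced with one tokenizer cannot silently be mixed up with those of another.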

The reason encode is a separate step is that it can take a while (depending on how much data you have) and you don’t want to wait around every time you start a new model-training run. Courtesy of swak, you can also run both steps, tokenize and encode, in one go.

slangmod tokenize encode