data
The first step in training any language model, large or small, is to get yourself
some data, the cleaner the better. Because slangmod cannot know which
text you want to train your model on, which language(s) that text will be in,
etc., it can do precious little to help clean that text. Before we get to
what it can do, we will thus specify the format
slangmod expects the text data to be in and where it expects it to be.
format
Taking the HuggingFace dataset collection
as an example, slangmod expects text data in the form of
parquet files. When read
with, for example, pandas with the help of
PyArrow, this results in
a table (a DataFrame). Among the columns in that table, slangmod
expects one to contain the text data, one document per row. More often than
not, the name of that column is “text” but, as we will see
later, this can be configured.
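To make the expected shape concrete, here is a minimal sketch of how one might inspect such a file before handing it to slangmod. The helper names load_documents and extract_documents are made up for this illustration and are not part of slangmod; the column name “text” is the default mentioned above.

```python
import pandas as pd


def extract_documents(frame, column="text"):
    """Return the documents in ``column``, one per row, skipping missing values."""
    if column not in frame.columns:
        raise KeyError(f"no column named {column!r}; found {list(frame.columns)}")
    return frame[column].dropna().astype(str).tolist()


def load_documents(path, column="text"):
    """Read one parquet file (via PyArrow) and return its documents."""
    frame = pd.read_parquet(path, engine="pyarrow")
    return extract_documents(frame, column)


# A tiny in-memory table with the same layout as a parquet file on disk:
# one text column plus any number of other columns.
example = pd.DataFrame(
    {"text": ["First document.", "Second document."], "source": ["a", "b"]}
)
documents = extract_documents(example)
```

If the text lives in a differently named column, passing that name as column shows quickly whether the file matches what you intend to configure.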
names
Typically, data will be spread out over several such files. Most will be used to train the model, while some will be used to monitor the training progress and, if early stopping is active, to terminate training. In addition, a final evaluation of the model performance will be done on another held-out validation data set.
Consequently, slangmod expects parquet files that contain (one of) the
words “train”, “test”, or “validation” in their file names and it
will use these files accordingly. While configurable, the default file
extension of parquet files is “.parquet”.
Many data sets on HuggingFace are already split into files with that naming scheme but, if you want to use one that is not, you have to split the data yourself and name the files accordingly.
Important
slangmod relies on the presence of all three kinds of files (train, test,
and validation) to function properly.
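Before starting a run, it can help to check a folder against this naming scheme. The following is a rough sketch under the assumption that a plain, case-sensitive substring match on the file name is enough; how slangmod itself matches names may differ in detail. The functions group_by_split and missing_splits are hypothetical helpers, not part of slangmod.

```python
import tempfile
from pathlib import Path

SPLITS = ("train", "test", "validation")


def group_by_split(folder, suffix=".parquet"):
    """Group the parquet files in ``folder`` by the split word in their name."""
    groups = {split: [] for split in SPLITS}
    for path in sorted(Path(folder).glob(f"*{suffix}")):
        for split in SPLITS:
            if split in path.name:
                groups[split].append(path.name)
                break  # assumption: each file belongs to exactly one split
    return groups


def missing_splits(groups):
    """Return the splits for which no file was found."""
    return [split for split in SPLITS if not groups[split]]


# Demonstration on a throwaway folder with empty placeholder files.
with tempfile.TemporaryDirectory() as folder:
    names = ("train-00000.parquet", "train-00001.parquet",
             "test-00000.parquet", "notes.txt")
    for name in names:
        (Path(folder) / name).touch()
    groups = group_by_split(folder)
    missing = missing_splits(groups)
```

In this demonstration, missing reports that no validation file is present, which is exactly the situation the box above warns about.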
location
If you plan to use your data as is, then all files, test, train, and validation,
should directly go into a folder named “corpus” inside slangmod’s
working directory that you configured as work_dir earlier.
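As an illustration of the resulting layout, the sketch below copies ready-to-use parquet files into a “corpus” folder below a working directory. It uses only the Python standard library; copy_to_corpus is a hypothetical helper written for this example and the folder names in the demonstration are placeholders.

```python
import shutil
import tempfile
from pathlib import Path


def copy_to_corpus(source_dir, work_dir, suffix=".parquet"):
    """Copy all parquet files from ``source_dir`` into ``work_dir``/corpus."""
    corpus = Path(work_dir) / "corpus"
    corpus.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(Path(source_dir).glob(f"*{suffix}")):
        shutil.copy2(path, corpus / path.name)
        copied.append(path.name)
    return copied


# Demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as root:
    source = Path(root) / "downloads"
    source.mkdir()
    for name in ("train.parquet", "test.parquet", "validation.parquet"):
        (source / name).write_bytes(b"")
    work_dir = Path(root) / "work"
    copied = copy_to_corpus(source, work_dir)
    corpus_files = sorted(p.name for p in (work_dir / "corpus").iterdir())
```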
If, however, you plan to leverage slangmod to do some data cleaning for
you, then your parquet files can stay in any directory that is not the
“corpus” folder inside work_dir.
Note
Even if you don’t want to do any actual data cleaning with slangmod,
you can still use the clean command to simply copy files from
some source directory into the corpus folder.
eos
At inference time, you want your model to eventually stop producing next tokens, ideally when it has said what it wanted to say. One way to realize this is to stop producing more text when a special “end-of-sequence” (EOS) token is predicted. However, the model can only do so if there are EOS tokens in the training data. Too few, too far apart, and your model will never shut up. Too many and your model’s answers might be more concise than you’d like. Therefore, one important decision to make is what exactly should be considered a “sequence” by your model.
The upper bound for the length of a sequence is the length of a document,
i.e., the contents of rows in the “text” column of your data files.
slangmod will put an EOS token at the end of each. So, if your documents
are rather short (say, a few sentences), you don’t have to do anything.
If, however, you use much longer documents, like e-books, then you will have to
either deliberately put markers into your documents that designate an EOS,
or identify already existing patterns in your document that slangmod
can interpret as EOS.
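To get a feeling for what a candidate pattern would do to a long document, you can try it out on a sample first. The sketch below is a standalone illustration using Python's re module, not slangmod's own mechanism; the default pattern (one or more blank lines) and the function name split_into_sequences are assumptions chosen for the example.

```python
import re


def split_into_sequences(document, pattern=r"\n\s*\n"):
    """Split ``document`` wherever ``pattern`` matches and drop empty pieces."""
    parts = re.split(pattern, document)
    return [part.strip() for part in parts if part.strip()]


# A stand-in for a long document whose sections are separated by blank lines.
book = "Chapter one text.\n\nChapter two text.\n\n\nChapter three text."
sequences = split_into_sequences(book)

# Each resulting piece would end in one EOS token, so the number of pieces
# and their lengths indicate how often the model sees an EOS during training.
lengths = [len(sequence) for sequence in sequences]
```

If the pieces come out far too long or far too short, that is a sign to pick a different pattern or to insert explicit markers instead.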