data

The first step in training any language model, large or small, is to get yourself some data, the cleaner the better. Because slangmod cannot know which text you want to train your model on, which language(s) that text will be in, etc., it can do precious little to help clean that text. Before we get to what it can do, we will thus specify the format slangmod expects the text data to be in and where it expects to find it.

format

Taking the HuggingFace dataset collection as an example, slangmod expects text data in the form of parquet files. When read with, for example, pandas with the help of PyArrow, such a file results in a table (a DataFrame). Among the columns in that table, slangmod expects one to contain the text data, one document per row. More often than not, the name of that column is “text” but, as we will see later, this can be configured.

names

Typically, data will be spread out over several such files. Most will be used to train the model, while some will be used to monitor the training progress and, if early stopping is active, to terminate training. In addition, a final evaluation of the model performance will be done on another held-out validation data set.

Consequently, slangmod expects parquet files that contain (one of) the words “train”, “test”, or “validation” in their file names and it will use these files accordingly. The file extension it looks for is configurable and defaults to “.parquet”.

Many data sets on HuggingFace are already split into files with that naming scheme but, if you want to use one that is not, you have to split the data yourself and name the files accordingly.
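If you do have to split the data yourself, the idea can be sketched as follows. This is only an illustration using the Python standard library: the function name split_documents, the fractions, and the “mydata-” file-name prefix are all made up for this example and are not part of slangmod. Writing each part to disk as a parquet file with a “text” column (for instance with pandas' DataFrame.to_parquet) is left out here.

```python
import random


def split_documents(docs, test_frac=0.05, validation_frac=0.05, seed=0):
    """Shuffle documents and split them into train, test, and validation."""
    if not 0 <= test_frac + validation_frac < 1:
        raise ValueError("test and validation fractions must sum to less than 1")
    shuffled = list(docs)
    # A fixed seed makes the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_validation = int(len(shuffled) * validation_frac)
    return {
        "test": shuffled[:n_test],
        "validation": shuffled[n_test:n_test + n_validation],
        "train": shuffled[n_test + n_validation:],
    }


docs = [f"document {i}" for i in range(100)]
splits = split_documents(docs, test_frac=0.1, validation_frac=0.1)
# Hypothetical file names that follow the naming scheme described above.
file_names = {name: f"mydata-{name}.parquet" for name in splits}
```

Shuffling before splitting avoids ending up with, say, only the most recent documents in the test and validation files.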

Important

slangmod relies on the presence of all three kinds of files (train, test, and validation) to function properly.

location

If you plan to use your data as is, then all files (train, test, and validation) should go directly into a folder named “corpus” inside slangmod’s working directory, which you configured as work_dir earlier.
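Before starting a training run, it can be worth checking that the corpus folder really holds at least one file of each kind. The following is a minimal sketch of such a check. It assumes plain substring matching on the file name, which may differ from what slangmod does internally, and the helper names find_splits and missing_splits are made up for this example.

```python
from pathlib import Path

SPLITS = ("train", "test", "validation")


def find_splits(folder, suffix=".parquet"):
    """Group the files in `folder` by the split name found in their file names."""
    found = {split: [] for split in SPLITS}
    for path in sorted(Path(folder).glob(f"*{suffix}")):
        for split in SPLITS:
            if split in path.name:
                found[split].append(path.name)
    return found


def missing_splits(folder, suffix=".parquet"):
    """Return the split names for which `folder` holds no matching file."""
    found = find_splits(folder, suffix)
    return [split for split in SPLITS if not found[split]]
```

Calling missing_splits on the “corpus” folder inside your work_dir should return an empty list if files of all three kinds are in place.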

If, however, you plan to leverage slangmod to do some data cleaning for you, then your parquet files can stay in any directory that is not the “corpus” folder inside work_dir.

Note

Even if you don’t want to do any actual data cleaning with slangmod, you can still use the clean command to simply copy files from some source directory into the corpus folder.

eos

At inference time, you want your model to eventually stop producing next tokens, ideally when it has said what it wanted to say. One way to realize this is to stop producing more text when a special “end-of-sequence” (EOS) token is predicted. However, the model can only learn to do so if there are EOS tokens in the training data. Too few, too far apart, and your model will never shut up. Too many and your model’s answers might be more concise than you’d like. Therefore, one important decision to make is what exactly should be considered a “sequence” by your model.

The upper bound for the length of a sequence is the length of a document, i.e., the content of one row in the “text” column of your data files. slangmod will put an EOS token at the end of each. So, if your documents are rather short (say, a few sentences), you don’t have to do anything. If, however, you use much longer documents, like e-books, then you will have to either deliberately put markers into your documents that designate an EOS, or identify already existing patterns in your documents that slangmod can interpret as EOS.
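To get a feeling for whether a pattern that already exists in your documents would make a sensible sequence boundary, you can try it out on a sample first. The sketch below uses Python's re module and treats chapter headings as the boundary. Both the helper split_on_marker and the chapter pattern are hypothetical examples, and nothing here reflects how slangmod itself is configured to recognize an EOS.

```python
import re


def split_on_marker(document, pattern):
    """Split one long document into sequences wherever `pattern` matches."""
    # MULTILINE lets ^ and $ match at line breaks inside the document.
    # Avoid capturing groups in `pattern`: re.split would return them too.
    pieces = re.split(pattern, document, flags=re.MULTILINE)
    return [piece.strip() for piece in pieces if piece.strip()]


book = "CHAPTER 1\nIt was a dark night.\n\nCHAPTER 2\nThe sun rose."
sequences = split_on_marker(book, r"^CHAPTER \d+\n")
```

Looking at the number and length of the resulting pieces tells you whether the pattern yields sequences of roughly the length you want your model to produce.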