data
====

The first step to train any language model, large or small, is to get
yourself some data, the cleaner the better. Because :mod:`slangmod` cannot
know which text you want to train your model on, which language(s) that text
will be in, *etc.*, it can do precious little to help clean that text. Before
we get to what it :doc:`can do `, we will thus specify the format
:mod:`slangmod` expects the text data to be in and where it expects it to be.

format
------

Taking the HuggingFace `dataset collection `_ as an example,
:mod:`slangmod` expects text data in the form of `parquet `_ files.
When read with, for example, `pandas `_ with the help of `PyArrow `_,
this results in a table (a `DataFrame `_). Among the columns in that table,
:mod:`slangmod` expects one to contain the text data, one document per row.
More often than not, the name of that column is "**text**" but, as we will
see :doc:`later `, this can be configured.

names
-----

Typically, data will be spread out over several such files. Most will be used
to train the model, while some will be used to monitor the training progress
and, if early stopping is active, to terminate training. In addition, a final
evaluation of the model performance will be done on another held-out
validation data set. Consequently, :mod:`slangmod` expects parquet files that
contain one of the words "**train**", "**test**", or "**validation**" in
their file names and it will use these files accordingly. While configurable,
the default file extension of parquet files is "**.parquet**". Many data sets
on `HuggingFace `_ are already split into files with that naming scheme but,
if you want to use one that is not, you have to split the data yourself and
name the files accordingly.

.. important::

   :mod:`slangmod` relies on the presence of all three kinds of files,
   **train**, **test**, and **validation**, to function properly.

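
Before pointing :mod:`slangmod` at a folder, it can be handy to check that
files for all three splits are actually there. The following is a minimal
sketch of such a check and not part of :mod:`slangmod` itself. The helper
names ``group_by_split`` and ``missing_splits`` are made up for illustration,
and matching the three words as plain substrings of the file name is merely
our assumption about what "contain" means here.

.. code-block:: python

   from pathlib import Path

   # The three words that are expected to appear in the file names.
   SPLITS = ("train", "test", "validation")


   def group_by_split(folder, suffix=".parquet"):
       """Map each split name to the sorted file names that contain it."""
       groups = {split: [] for split in SPLITS}
       for path in sorted(Path(folder).iterdir()):
           # Skip sub-folders and files with a different extension.
           if not path.is_file() or not path.name.endswith(suffix):
               continue
           for split in SPLITS:
               # Assumption: a plain substring match on the file name.
               if split in path.name:
                   groups[split].append(path.name)
       return groups


   def missing_splits(folder, suffix=".parquet"):
       """Return the split names for which no file was found."""
       groups = group_by_split(folder, suffix)
       return [split for split in SPLITS if not groups[split]]

If ``missing_splits`` returns an empty list for your data folder, at least
one file was found for each of the three splits.
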
location
---------

If you plan to use your data as is, then all files (train, test, and
validation) should go directly into a folder named "**corpus**" inside
:mod:`slangmod`'s *working directory* that you configured as ``work_dir``
:doc:`earlier `. If, however, you plan to leverage :mod:`slangmod` to do some
data cleaning for you, then your parquet files can stay in any directory that
is **not** the "corpus" folder inside ``work_dir``.

.. note::

   Even if you don't want to do any actual data cleaning with
   :mod:`slangmod`, you can still use the :doc:`/cli/clean` command to simply
   copy files from some source directory into the **corpus** folder.

eos
---

At inference time, you want your model to eventually stop producing next
tokens, ideally when it has said what it wanted to say. One way to realize
this is to stop producing more text when a special "end-of-sequence" (EOS)
token is predicted. However, the model can only do so if there are EOS tokens
in the training data. Too few, too far apart, and your model will never shut
up. Too many and your model's answers might be more concise than you'd like.

Therefore, one important decision to make is what exactly should be
considered a "**sequence**" by your model. The upper bound for the length of
a sequence is the length of a document, *i.e.*, the contents of one row in
the "text" column of your data files. :mod:`slangmod` will put an EOS token
at the end of each document. So, if your documents are rather short (say, a
few sentences), you don't have to do anything. If, however, you use much
longer documents, like e-books, then you will have to either deliberately put
markers into your documents that designate an EOS, or identify already
existing patterns in your documents that :mod:`slangmod` can interpret as
EOS.
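
To make the first option more tangible, here is a minimal sketch of how one
might put markers into a long document before writing it to a parquet file.
Everything in it is an assumption made for illustration: the marker string
``<|eos|>`` is hypothetical and not a :mod:`slangmod` default, the helper
``mark_chapter_ends`` is not part of :mod:`slangmod`, and we simply guess
that chapters start with a line like "Chapter 12". A stray line of running
text that happens to start with the word "chapter" would also match, so
adapt the pattern to your own documents.

.. code-block:: python

   import re

   # Hypothetical marker string, chosen for this example only.
   EOS_MARKER = "<|eos|>"

   # Assumption: a chapter starts with a line like "Chapter 12" or "CHAPTER IV".
   CHAPTER_HEADING = re.compile(
       r"^[ \t]*chapter[ \t]+\w+.*$",
       re.IGNORECASE | re.MULTILINE,
   )


   def mark_chapter_ends(text, marker=EOS_MARKER):
       """Put ``marker`` on its own line before every chapter heading but the first."""
       matches = list(CHAPTER_HEADING.finditer(text))
       pieces = []
       start = 0
       # Cut the text in front of each heading, starting with the second one.
       for match in matches[1:]:
           pieces.append(text[start:match.start()].rstrip())
           start = match.start()
       # Whatever remains after the last cut is the final piece.
       pieces.append(text[start:].rstrip())
       return f"\n{marker}\n".join(pieces)

Text without any chapter heading comes back unchanged, apart from trailing
whitespace being stripped.
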