clean
=====

In order to invoke the ``clean`` command of :mod:`slangmod`, you need to
specify the location of your raw :doc:`data files `. You might have seen this
command-line option when you tried out the CLI for the first time during basic
:doc:`/usage/configuration`:

.. code-block:: json

   "files": {
       "raw": "/directory/where/you/invoked/slangmod",
       "suffix": "parquet",
       "column": "text",
       "min_doc_len": 1,
       "cleaners": [],
       "encoding": "cp1252"
   }

The ``raw`` field, defaulting to whichever directory you invoke :mod:`slangmod`
in, should point to the folder where your data files are located. As you can
see, however, this field is *nested* inside the ``files`` struct. So how do you
set this from the command line? Easy. To access nested config fields, just
append their names to the top-level field with a dot like so:

.. code-block:: bash

   slangmod clean --files.raw relative/or/absolute/path/to/data/files

Because the location of this directory is probably not going to change that
frequently, it might be a good idea to put it into your config file, again
preferring absolute paths over relative ones.

.. code-block:: toml
   :caption: slangmod.toml

   work_dir = "/absolute/path/to/your/working/directory"
   log_level = 10
   progress = true

   [files]
   raw = "/absolute/path/to/data/files"

Invoking :mod:`slangmod` as described above will do three things:

* It will copy all files with the extension ".parquet" that contain either
  "train", "test", or "validation" in their names from the ``raw`` folder
  into the "corpus" subdirectory inside your ``work_dir``. It will not descend
  into any subfolders of ``raw``.
* In doing so, it will filter out documents that are shorter than
  ``min_doc_len`` characters. Its value defaults to 1 to drop empty documents.
* It will rename your data files with a hash of their contents to avoid
  duplicates.

.. warning::
   Every time you invoke ``slangmod clean``, the "corpus" folder inside your
   ``work_dir`` will be completely emptied and re-filled from scratch. To *add*
   more data files instead, you must *resume* cleaning like so:

   .. code-block:: bash

      slangmod resume clean

options
-------

The config excerpt at the top of this page also shows where you can specify the
``suffix`` you use for your parquet files (defaults to "parquet") and the
``column`` in your data table that contains the actual text (defaults to
"text"). To set these explicitly on the command line, you would run:

.. code-block:: bash

   slangmod clean --files.suffix pqt --files.column document --files.min_doc_len 32

Because, again, these options are not going to change very often, you might as
well put them into your config file.

.. code-block:: toml
   :caption: slangmod.toml

   work_dir = "/absolute/path/to/your/working/directory"
   log_level = 10
   progress = true

   [files]
   raw = "/absolute/path/to/data/files"
   suffix = "pqt"
   column = "document"
   min_doc_len = 32

.. note::
   It does not matter whether you specify the ``suffix`` with or without a
   leading dot. :mod:`slangmod` will act reasonably.

cleaners
--------

For the data that I have been playing with, English E-books from
`Project Gutenberg `_ (provided as `gutenberg-en-v1-clean `_ by
`BEEspoke Data `_) and English Wikipedia articles (a subset of `Wiki-40B `_
provided by `google `_ as `wiki40b `_), I have implemented some actual data
cleaning steps. If you plan on using the same or similar data, then maybe they
are useful to you as well.

1. Both Gutenberg E-books and Wikipedia articles contain "weird" quotes to
   indicate minutes and seconds (*e.g.*, when giving a location with latitude
   and longitude). In addition, Gutenberg E-books sometimes use typographical
   single and double quotes. I chose to simply replace all of these with
   normal 'single' and "double" quotes, respectively. If you want to do that
   too, invoke the ``quotes`` *cleaner* on the command line like so:

   .. code-block:: bash

      slangmod clean --files.cleaners '["quotes"]'

   If you want to put that into your config file, extend it like so:

   .. code-block:: toml
      :caption: slangmod.toml

      work_dir = "/absolute/path/to/your/working/directory"
      log_level = 10
      progress = true

      [files]
      raw = "/absolute/path/to/data/files"
      suffix = "pqt"
      column = "document"
      min_doc_len = 32
      cleaners = ["quotes"]

2. I decided to use the end of a paragraph, that is, two or more consecutive
   newline characters (``"\n\n"``), as my :ref:`usage/data:eos` pattern.
   Gutenberg E-books are already formatted that way. To also format the
   **wiki40b** articles (and only those!) that way, you can invoke the
   ``wiki40b`` *cleaner* like so:

   .. code-block:: bash

      slangmod clean --files.cleaners '["quotes", "wiki40b"]'

   If you want to put that into your config file too, extend it like so:

   .. code-block:: toml
      :caption: slangmod.toml

      work_dir = "/absolute/path/to/your/working/directory"
      log_level = 10
      progress = true

      [files]
      raw = "/absolute/path/to/data/files"
      suffix = "pqt"
      column = "document"
      min_doc_len = 32
      cleaners = ["quotes", "wiki40b"]

3. If, like me, you want to start by training a monolingual model, then
   characters from scripts other than the main script of your primary
   language unnecessarily blow up your vocabulary size. To avoid this, there
   is a *cleaner* that replaces all characters that cannot be encoded with a
   specified ``encoding`` (defaults to "cp1252") with a whitespace. If you
   want that, you can invoke this cleaner on the command line like so:

   .. code-block:: bash

      slangmod clean --files.encoding cp1252 --files.cleaners '["quotes", "wiki40b", "encoding"]'

   If you want to put that into your config file as well, extend it like so:

   .. code-block:: toml
      :caption: slangmod.toml

      work_dir = "/absolute/path/to/your/working/directory"
      log_level = 10
      progress = true

      [files]
      raw = "/absolute/path/to/data/files"
      suffix = "pqt"
      column = "document"
      min_doc_len = 32
      cleaners = ["quotes", "wiki40b", "encoding"]
      encoding = "cp1252"

.. note::
   You can pick any combination and order of these *cleaners*.

.. warning::
   The *cleaners* you specify on the command line are **not** *added* to those
   in your config file (or *vice versa*). Rather, the command line overwrites
   the entire list in your config file.

.. important::
   Always double-check the data that ends up in your "**corpus**" folder and
   make sure that it adheres to the expected format.
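
To make the effect of the ``quotes`` and ``encoding`` *cleaners* more tangible,
here is a minimal Python sketch of what these two steps roughly amount to. It is
**not** the actual :mod:`slangmod` implementation: the function names
``normalize_quotes`` and ``drop_unencodable`` are made up for illustration, and
the exact set of characters that the real ``quotes`` cleaner replaces is an
assumption.

.. code-block:: python

   # Illustrative sketch only. Function names and the character table are
   # assumptions, not slangmod's real code.
   QUOTES = str.maketrans({
       "\u2018": "'",  # left single quotation mark
       "\u2019": "'",  # right single quotation mark
       "\u2032": "'",  # prime, often used for (arc) minutes
       "\u201c": '"',  # left double quotation mark
       "\u201d": '"',  # right double quotation mark
       "\u2033": '"',  # double prime, often used for (arc) seconds
   })


   def normalize_quotes(text: str) -> str:
       """Replace typographic quotes and primes with plain ASCII quotes."""
       return text.translate(QUOTES)


   def drop_unencodable(text: str, encoding: str = "cp1252") -> str:
       """Replace every character that `encoding` cannot represent with a space."""
       chars = []
       for char in text:
           try:
               char.encode(encoding)
           except UnicodeEncodeError:
               chars.append(" ")
           else:
               chars.append(char)
       return "".join(chars)


   # Steps are applied one after the other, in the order they are listed.
   sample = "\u201cCaf\u00e9\u201d at 48\u00b0 51\u2032 N \u4e2d"
   print(drop_unencodable(normalize_quotes(sample)))

Running a sketch like this on a handful of your own documents can be a quick
way to judge whether a given ``encoding`` would discard characters you actually
want to keep.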