clean

In order to invoke the clean command of slangmod, you need to specify the location of your raw data files. You might have seen this command-line option when you tried out the CLI for the first time during basic configuration:

"files": {
    "raw": "/directory/where/you/invoked/slangmod",
    "suffix": "parquet",
    "column": "text",
    "min_doc_len": 1,
    "cleaners": [],
    "encoding": "cp1252"
}

The raw field, defaulting to whichever directory you invoke slangmod in, should point towards the folder where your data files are located. As you can see, however, this field is nested inside of the files struct. So how do you set this from the command line? Easy. To access nested config fields just append their names to the top-level field with a dot like so:

slangmod clean --files.raw relative/or/absolute/path/to/data/files

Because the location of this directory is probably not going to change that frequently, it might be a good idea to put it into your config file, again preferring absolute paths over relative ones.

slangmod.toml
work_dir = "/absolute/path/to/your/working/directory"
log_level = 10
progress = true

[files]
raw = "/absolute/path/to/data/files"

Invoking slangmod as described above will do three things:

  • It will copy all files with the extension “.parquet” that contain either “train”, or “test”, or “validation” in their names from the raw folder into the “corpus” subdirectory inside your work_dir. It will not descend into any subfolders of raw.

  • In doing so, it will filter out documents that are shorter than min_doc_len characters. Its value defaults to 1 to drop empty documents.

  • It will rename your data files with a hash of what is inside them to avoid duplicates.

Warning

Every time you invoke slangmod clean the “corpus” folder inside your work_dir will be completely emptied and re-filled from scratch. To add more data files instead, you must resume cleaning like so:

slangmod resume clean

options

What you can also see is that this is where you can specify the suffix you use for your parquet files (defaults to “parquet”) and the column in your data table that contains the actual text (defaults to “text”). To set these explicitly on the command line, you would go:

slangmod clean --files.suffix pqt --files.column document --files.min_doc_len 32

Because again, these options are not going to change very often, you might as well put them into your config file.

slangmod.toml
work_dir = "/absolute/path/to/your/working/directory"
log_level = 10
progress = true

[files]
raw = "/absolute/path/to/data/files"
suffix = "pqt"
column = "document"
min_doc_len = 32

Note

It does not matter whether you specify the suffix with or without a leading dot. slangmod will act reasonably.

cleaners

For the data that I have been playing with, english E-books from Project Gutenberg (provided as gutenberg-en-v1-clean by BEEspoke Data) and english Wikipedia articles (a subset of Wiki-40B provided by google as wiki40b), I have implemented some actual data cleaning steps. If you plan on using the same or similar data, then maybe they are useful to you as well.

  1. Both, Gutenberg E-books and Wikipedia articles contain “weird” quotes to indicate minutes and seconds (e.g., when giving a location with latitude and longitude). In addition, Gutenberg E-books sometimes use typographical single- and double quotes. I chose to simply replace all of these with normal ‘single’ and “double” quotes, respectively. I you want to do that too, invoke the quotes cleaner on the command line like so:

    slangmod clean --files.cleaners '["quotes"]'
    

    If you want to put that into your config file, extend it like so:

    slangmod.toml
    work_dir = "/absolute/path/to/your/working/directory"
    log_level = 10
    progress = true
    
    [files]
    raw = "/absolute/path/to/data/files"
    suffix = "pqt"
    column = "document"
    min_doc_len = 32
    cleaners = ["quotes"]
    
  2. I decided that I will use the end of a paragraph, that is, two or more consecutive newline characters ("\n\n") as my eos pattern. Gutenberg E-books are already formatted that way. To also format the wiki40b articles (and only those!) that way, you can invoke the wiki40b cleaner like so:

    slangmod clean --files.cleaners '["quotes", "wiki40b"]'
    

    If you want to put that into your config file too, extend it like so:

    slangmod.toml
    work_dir = "/absolute/path/to/your/working/directory"
    log_level = 10
    progress = true
    
    [files]
    raw = "/absolute/path/to/data/files"
    suffix = "pqt"
    column = "document"
    min_doc_len = 32
    cleaners = ["quotes", "wiki40b"]
    
  3. If, like me, you want to start with training a mono-lingual model, then having characters from a script in your corpus that is not the main script of your primary language unnecessarily blows up your vocabulary size. To avoid this, there is a cleaner that replaces all characters that cannot be encoded with a specified encoding (defaults to “cp1252”) with a whitespace. If you want that, you can invoke this cleaner on the command line like so:

    slangmod clean --files.encoding cp1252 --files.cleaners '["quotes", "wiki40b", "encoding"]'
    

    If you want to put that into your config file as well, extend it like so:

    slangmod.toml
    work_dir = "/absolute/path/to/your/working/directory"
    log_level = 10
    progress = true
    
    [files]
    raw = "/absolute/path/to/data/files"
    suffix = "pqt"
    column = "document"
    min_doc_len = 32
    cleaners = ["quotes", "wiki40b", "encoding"]
    encoding = "cp1252"
    

Note

Obviously you can pick any combination and order of these cleaners.

Warning

The cleaners you specify on the command line are not added to those in your config file (or vice versa). Rather, the command line overwrites the entire list in your config file.

Important

Always double check the data that ends up in your “corpus” folder and make sure that it adheres to the expected format.