clean
In order to invoke the clean command of slangmod, you need to specify
the location of your raw data files. You might have seen
this command-line option when you tried out the CLI for the first time during
basic configuration:
"files": {
"raw": "/directory/where/you/invoked/slangmod",
"suffix": "parquet",
"column": "text",
"min_doc_len": 1,
"cleaners": [],
"encoding": "cp1252"
}
The raw field, defaulting to whichever directory you invoke slangmod
in, should point towards the folder where your data files are located. As you
can see, however, this field is nested inside of the files struct. So
how do you set this from the command line? Easy. To access nested config
fields just append their names to the top-level field with a dot like so:
slangmod clean --files.raw relative/or/absolute/path/to/data/files
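The dotted-name convention can be pictured as a walk through nested dictionaries. The following is a minimal sketch of that idea and not slangmod's actual implementation; the helper name `set_nested` and the plain-dict config are assumptions made purely for illustration.

```python
def set_nested(config: dict, dotted_key: str, value) -> dict:
    """Set a value in a nested dict, addressed by a dot-separated key."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        # Create intermediate tables on demand, like [files] in a TOML file.
        node = node.setdefault(name, {})
    node[leaf] = value
    return config


# Hypothetical config holding only the fields relevant here.
config = {"files": {"raw": ".", "suffix": "parquet"}}
set_nested(config, "files.raw", "/absolute/path/to/data/files")
```

Only the addressed field changes; sibling fields such as `suffix` keep their values.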
Because the location of this directory is probably not going to change that frequently, it might be a good idea to put it into your config file, again preferring absolute paths over relative ones.
work_dir = "/absolute/path/to/your/working/directory"
log_level = 10
progress = true
[files]
raw = "/absolute/path/to/data/files"
Invoking slangmod as described above will do three things:
1. It will copy all files with the extension “.parquet” that contain “train”, “test”, or “validation” in their names from the raw folder into the “corpus” subdirectory inside your work_dir. It will not descend into any subfolders of raw.
2. In doing so, it will filter out documents that are shorter than min_doc_len characters. Its value defaults to 1 to drop empty documents.
3. It will rename your data files with a hash of what is inside them to avoid duplicates.
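The three steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not slangmod's own code: the function names, the choice of SHA-256, and the exact naming scheme are hypothetical, and reading the parquet files themselves is left out.

```python
import hashlib
from pathlib import Path

SPLITS = ("train", "test", "validation")


def select_raw_files(raw: Path, suffix: str = "parquet") -> list[Path]:
    """Return files directly inside `raw` with the given suffix and a split name."""
    wanted = "." + suffix.lstrip(".")
    return sorted(
        path
        for path in raw.iterdir()  # iterdir() does not descend into subfolders
        if path.is_file()
        and path.suffix == wanted
        and any(split in path.name for split in SPLITS)
    )


def keep_document(text: str, min_doc_len: int = 1) -> bool:
    """Keep only documents with at least `min_doc_len` characters."""
    return len(text) >= min_doc_len


def hashed_name(content: bytes, suffix: str = "parquet") -> str:
    """Name a file after a digest of its content (SHA-256 is an assumption)."""
    digest = hashlib.sha256(content).hexdigest()
    return f"{digest}.{suffix.lstrip('.')}"
```

Because the name is derived from the content, two files with identical content map to the same name, which is how hashing helps avoid duplicates.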
Warning
Every time you invoke slangmod clean the “corpus” folder inside your
work_dir will be completely emptied and re-filled from scratch.
To add more data files instead, you must resume cleaning like so:
slangmod resume clean
options
What you can also see is that this is where you can specify the suffix
you use for your parquet files (defaults to “parquet”) and the column in
your data table that contains the actual text (defaults to “text”). To set
these explicitly on the command line, you would go:
slangmod clean --files.suffix pqt --files.column document --files.min_doc_len 32
Because again, these options are not going to change very often, you might as well put them into your config file.
work_dir = "/absolute/path/to/your/working/directory"
log_level = 10
progress = true
[files]
raw = "/absolute/path/to/data/files"
suffix = "pqt"
column = "document"
min_doc_len = 32
Note
It does not matter whether you specify the suffix with or without a
leading dot. slangmod will act reasonably.
cleaners
For the data that I have been playing with, English E-books from Project Gutenberg (provided as gutenberg-en-v1-clean by BEEspoke Data) and English Wikipedia articles (a subset of Wiki-40B provided by Google as wiki40b), I have implemented some actual data cleaning steps. If you plan on using the same or similar data, then maybe they are useful to you as well.
Both Gutenberg E-books and Wikipedia articles contain “weird” quotes to indicate minutes and seconds (e.g., when giving a location with latitude and longitude). In addition, Gutenberg E-books sometimes use typographical single and double quotes. I chose to simply replace all of these with normal 'single' and "double" quotes, respectively. If you want to do that too, invoke the quotes cleaner on the command line like so:
slangmod clean --files.cleaners '["quotes"]'
If you want to put that into your config file, extend it like so:
slangmod.toml
work_dir = "/absolute/path/to/your/working/directory"
log_level = 10
progress = true
[files]
raw = "/absolute/path/to/data/files"
suffix = "pqt"
column = "document"
min_doc_len = 32
cleaners = ["quotes"]
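To make the effect of the quotes cleaner concrete, here is a minimal sketch of such a replacement in Python. The exact set of characters that slangmod replaces is not stated here, so the mapping below (typographic quotation marks plus prime and double-prime marks) and the function name `clean_quotes` are assumptions.

```python
# Assumed mapping from typographic quotes and prime marks to ASCII quotes.
QUOTES = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u2032": "'",  # prime, used for minutes
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2033": '"',  # double prime, used for seconds
})


def clean_quotes(text: str) -> str:
    """Replace typographic quotes and primes with plain ASCII quotes."""
    return text.translate(QUOTES)
```

Using `str.translate` keeps the replacement to a single pass over the text, which matters for large corpora.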
I decided to use the end of a paragraph, that is, two or more consecutive newline characters ("\n\n"), as my eos pattern. Gutenberg E-books are already formatted that way. To also format the wiki40b articles (and only those!) that way, you can invoke the wiki40b cleaner like so:
slangmod clean --files.cleaners '["quotes", "wiki40b"]'
If you want to put that into your config file too, extend it like so:
slangmod.toml
work_dir = "/absolute/path/to/your/working/directory"
log_level = 10
progress = true
[files]
raw = "/absolute/path/to/data/files"
suffix = "pqt"
column = "document"
min_doc_len = 32
cleaners = ["quotes", "wiki40b"]
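For readers unfamiliar with the wiki40b text format, the sketch below shows one way to turn its structural markers into paragraphs separated by blank lines. It assumes the marker strings `_START_ARTICLE_`, `_START_SECTION_`, `_START_PARAGRAPH_`, and `_NEWLINE_` used by the published dataset, and it keeps titles and section headings as paragraphs of their own. Whether slangmod's wiki40b cleaner does the same is not stated here, so treat both the function name and that choice as assumptions.

```python
import re

# Structural markers assumed to delimit titles, headings, and paragraphs.
MARKERS = re.compile(r"\s*_START_(?:ARTICLE|SECTION|PARAGRAPH)_\s*")


def wiki40b_to_paragraphs(text: str) -> str:
    """Rewrite a wiki40b document so that paragraphs end with a blank line."""
    # In-paragraph breaks become paragraph ends.
    text = text.replace("_NEWLINE_", "\n\n")
    # Splitting on the markers yields titles, headings, and paragraph bodies.
    parts = (part.strip() for part in MARKERS.split(text))
    return "\n\n".join(part for part in parts if part)
```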
If, like me, you want to start with training a mono-lingual model, then characters in your corpus from scripts other than the main script of your primary language unnecessarily blow up your vocabulary size. To avoid this, there is a cleaner that replaces all characters that cannot be encoded with a specified encoding (defaults to “cp1252”) with a whitespace. If you want that, you can invoke this cleaner on the command line like so:
slangmod clean --files.encoding cp1252 --files.cleaners '["quotes", "wiki40b", "encoding"]'
If you want to put that into your config file as well, extend it like so:
slangmod.toml
work_dir = "/absolute/path/to/your/working/directory"
log_level = 10
progress = true
[files]
raw = "/absolute/path/to/data/files"
suffix = "pqt"
column = "document"
min_doc_len = 32
cleaners = ["quotes", "wiki40b", "encoding"]
encoding = "cp1252"
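The idea behind the encoding cleaner can be illustrated with a short Python sketch. It is not slangmod's implementation; the function name `replace_unencodable` is hypothetical, and a single space is assumed as the replacement character.

```python
def replace_unencodable(text: str, encoding: str = "cp1252") -> str:
    """Replace every character that `encoding` cannot represent with a space."""
    cleaned = []
    for char in text:
        try:
            char.encode(encoding)
        except UnicodeEncodeError:
            cleaned.append(" ")  # Not representable in the target encoding.
        else:
            cleaned.append(char)
    return "".join(cleaned)
```

With the default "cp1252", accented Latin characters such as "é" survive, whereas characters from, say, CJK or Cyrillic scripts are blanked out.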
Note
Obviously you can pick any combination and order of these cleaners.
Warning
The cleaners you specify on the command line are not added to those in your config file (or vice versa). Rather, the command line overwrites the entire list in your config file.
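The override behavior amounts to "replace, don't merge". A tiny sketch of that rule, with a hypothetical helper name and `None` standing in for "not given on the command line":

```python
from typing import Optional


def resolve_cleaners(
    from_config: list[str], from_cli: Optional[list[str]] = None
) -> list[str]:
    """Return the CLI list if one was given, else the config-file list."""
    # The two lists are never concatenated; the command line wins outright.
    return list(from_config) if from_cli is None else list(from_cli)
```

So with cleaners = ["quotes", "wiki40b"] in the config file, passing only '["encoding"]' on the command line runs the encoding cleaner alone.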
Important
Always double check the data that ends up in your “corpus” folder and make sure that it adheres to the expected format.