io
Tools for IO-related tasks like saving to and loading from disk.
User input and model output are handled by the clients module.
- class CorpusDiscovery(folder='', *file_types, suffix='parquet', not_found=NotFound.RAISE)[source]
Bases: ArgRepr
Discover files in a given directory and filter by name and suffix.
- Parameters:
folder (str, optional) – Parent directory to search for files. Subdirectories can be specified when calling instances. Defaults to the working directory of the current python interpreter.
*file_types (str, optional) – File names must contain at least one of these strings.
suffix (str, optional) – Extension glob pattern that files must match (without leading dot). Defaults to “parquet”.
not_found (str, optional) – What to do if either the directory does not exist or no matching files are found in the given directory. One of “ignore”, “warn”, or “raise”. Use the NotFound enum to avoid typos. Defaults to “raise”. If set otherwise, an empty tuple of file names might be returned.
- __call__(subfolder='')[source]
Choose a subdirectory and filter the names of files found therein.
- Parameters:
subfolder (str, optional) – Subdirectory relative to the parent given at instantiation. Defaults to an empty string, resulting in that parent directory being searched.
- Returns:
Fully resolved names of files that match the given criteria from within the specified directory.
- Return type:
list
- Raises:
FileNotFoundError – Only if not_found is set to “raise”, and then only if either the directory was not found or no files matching the specified criteria were found in that directory.
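The filtering and error-handling rules above can be illustrated with a minimal standard-library sketch. The function `discover_files` and its internals are hypothetical and are not the library's implementation; in particular, keeping every file when no file_types are given is an assumption.

```python
import warnings
from pathlib import Path


def discover_files(folder, *file_types, suffix='parquet', not_found='raise'):
    """Hypothetical stand-in mimicking the documented filtering behaviour."""
    directory = Path(folder).resolve()
    # Guard against a missing directory so both failure modes end up as "no matches"
    candidates = sorted(directory.glob(f'*.{suffix}')) if directory.is_dir() else []
    # Assumption: with no file_types given, every file with a matching suffix is kept
    matches = [
        str(path) for path in candidates
        if path.is_file()
        and (not file_types or any(part in path.name for part in file_types))
    ]
    if not matches:
        message = f'No matching files found in {directory}'
        if not_found == 'raise':
            raise FileNotFoundError(message)
        if not_found == 'warn':
            warnings.warn(message)
    return matches
```

With not_found set to "ignore" or "warn", the sketch returns an empty list instead of raising, which mirrors the documented possibility of an empty result.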
- class CorpusFilter(part)[source]
Bases: ArgRepr
Determine whether a given string is part of a fully resolved file name.
- Parameters:
part (str) – Part of the file name to filter for.
- class CorpusLoader(reader)[source]
Bases: ArgRepr
Read files with multiple documents and provide an iterator over all.
- Parameters:
reader (callable) – Must return some sort of iterable over documents (=strings), when given a file name.
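The idea of one iterator over the documents of many files can be sketched as follows. The helper `iterate_documents` and the toy reader are hypothetical; how instances of CorpusLoader are actually called is not documented here, so this only illustrates the contract that the reader must satisfy.

```python
from itertools import chain


def iterate_documents(reader, files):
    """Hypothetical helper: lazily chain the documents of several files."""
    # reader(file) must return an iterable of strings for a single file
    return chain.from_iterable(reader(file) for file in files)


# Toy reader standing in for, e.g., reading one text column from a parquet file
fake_files = {'a.parquet': ['doc 1', 'doc 2'], 'b.parquet': ['doc 3']}
documents = list(iterate_documents(fake_files.__getitem__, ['a.parquet', 'b.parquet']))
```

Because the chaining is lazy, each file is only read when iteration reaches it.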
- class DirectoryCleaner(folder, return_path=False)[source]
Bases: ArgRepr
Provide a fresh, empty target directory to write to.
- Parameters:
folder (str) – Parent directory to clean or create. Subdirectories can be specified when calling instances.
return_path (bool, optional) – Whether to return the fully resolved path to the emptied or created directory when instances are called. Defaults to False.
- __call__(subfolder='')[source]
Delete the contents of an existing directory or create a new one.
- Parameters:
subfolder (str, optional) – Subdirectory relative to the parent given at instantiation. Defaults to an empty string, resulting in that parent directory being emptied or created.
- Returns:
An empty tuple or the fully resolved path to the emptied or created directory, depending on whether return_path is set to False or True.
- Return type:
tuple or str
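A minimal sketch of the "empty or create" behaviour, using only the standard library. The function `clean_directory` is hypothetical and not the library's code; it merely mirrors the documented return values.

```python
import shutil
from pathlib import Path


def clean_directory(folder, subfolder='', return_path=False):
    """Hypothetical stand-in: empty an existing directory or create a new one."""
    target = (Path(folder) / subfolder).resolve()
    if target.is_dir():
        # Remove everything inside the directory but keep the directory itself
        for entry in target.iterdir():
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)
            else:
                entry.unlink()
    else:
        # Create the directory together with any missing parents
        target.mkdir(parents=True)
    return str(target) if return_path else ()
```

Returning an empty tuple rather than None in the default case follows the return type documented above.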
- class FileTypeExtractor(file_type, *file_types)[source]
Bases: ArgRepr
Determine if a file name contains one and only one of the given strings.
- Parameters:
file_type (str) – String to test the file name for.
*file_types (str) – Additional strings to test the file name for.
- __call__(path)[source]
Determine if and, if so, which string a file name contains once.
- Parameters:
path (str) – A (fully resolved) file name, potentially including forward slashes and (sub-)directories.
- Returns:
The one file type of the given file_types that is contained in the stem of the file name one or more times.
- Return type:
str
- Raises:
ValueError – If none of the cached file_types are contained in the stem of the given file name or if it contains more than one.
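The matching rule can be sketched as below. The function `extract_file_type` is a hypothetical stand-in written against the description above, not the library's implementation: exactly one of the given strings may occur in the stem, although that one string may occur several times.

```python
from pathlib import PurePosixPath


def extract_file_type(path, file_type, *file_types):
    """Hypothetical stand-in for the documented matching rule."""
    # Only the stem counts: directories and the extension are ignored
    stem = PurePosixPath(path).stem
    found = [candidate for candidate in (file_type, *file_types) if candidate in stem]
    if len(found) != 1:
        raise ValueError(f'Expected exactly one file type in "{stem}", found {found}')
    return found[0]
```

For example, a stem containing both "train" and "test" is rejected as ambiguous, while a directory named "test" in the path does not affect the result.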
- class TokenizerLoader(algo, path='')[source]
Bases: ArgRepr, Generic
Load a previously saved Tokenizer or Algo from file.
- Parameters:
algo (Tokenizer or Algo) – A fresh, trained, or tainted instance of a tokenizer or an Algo.
path (str, optional) – Full or partial path to the model to load. If not fully specified here, it can be completed on calling the instance. Defaults to the current working directory of the python interpreter.
- __call__(path='')[source]
Load a previously saved Tokenizer or Algo from file.
- Parameters:
path (str, optional) – Path (including file name) to the file to load. If it starts with a backslash, it will be interpreted as absolute; otherwise, it will be interpreted as relative to the path specified at instantiation. Defaults to an empty string, which results in an unchanged path.
- Returns:
A new instance of the same type as the algo provided at instantiation with its internal parameters set to what was read from file.
- Return type:
Tokenizer or Algo
- class TokenizerSaver(path='', create=False)[source]
Bases: ArgRepr
Convenience wrapper around a Tokenizer’s or Algo’s save method.
- Parameters:
path (str, optional) – Path (including file name) to save the tokenizer to. May include any number of string placeholders (i.e., pairs of curly brackets) that will be interpolated when instances are called. Defaults to the current working directory of the python interpreter.
create (bool, optional) – What to do if the directory where the tokenizer should be saved does not exist. Defaults to False.
- __call__(algo, *parts)[source]
Save a Tokenizer or Algo to file.
- Parameters:
algo (Tokenizer or Algo) – The tokenizer to save.
*parts (str, optional) – Fragments that will be interpolated into the path string given at instantiation. There must be at least as many fragments as there are placeholders in the path.
- Returns:
An empty tuple.
- Return type:
tuple
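How placeholders in the path are filled from *parts can be illustrated with Python's built-in string formatting. That the library uses `str.format` internally is an assumption, and the helper `interpolate` and the example path are hypothetical.

```python
def interpolate(template, *parts):
    """Hypothetical helper: fill curly-bracket placeholders from left to right."""
    # str.format ignores surplus positional arguments and raises an
    # IndexError if there are fewer arguments than placeholders
    return template.format(*parts)


path = interpolate('/models/{}/tokenizer-{}.json', 'bpe', 'v2')
```

Under this assumption, supplying more fragments than placeholders is harmless, whereas supplying fewer fails, which matches the "at least as many" requirement stated above.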
- extract_file_name(path)[source]
Extract the bare file name from a fully resolved path to a file.
- Parameters:
path (str) – Fully resolved path to a file.
- Returns:
The bare file name without the leading slashes and (sub-)directories.
- Return type:
str
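A comparable result can be obtained with the standard library, as in the sketch below. The function `bare_file_name` is hypothetical; whether the library's extract_file_name keeps the file extension is not stated above, so keeping it here is an assumption.

```python
from pathlib import PurePosixPath


def bare_file_name(path):
    """Hypothetical stand-in: drop leading slashes and (sub-)directories."""
    # Assumption: the extension is kept as part of the file name
    return PurePosixPath(path).name
```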