io

Tools for IO-related tasks like saving to and loading from disk.

User input and model output are handled by the clients module.

Note

For more tools, see also the pandas and text sections of the swak documentation.

class CorpusDiscovery(folder='', *file_types, suffix='parquet', not_found=NotFound.RAISE)

Bases: ArgRepr

Discover files in a given directory and filter by name and suffix.

Parameters:
  • folder (str, optional) – Parent directory to search for files. Subdirectories can be specified when calling instances. Defaults to the working directory of the current Python interpreter.

  • *file_types (str, optional) – File names must contain at least one of these strings.

  • suffix (str, optional) – Extension glob pattern that files must match (without leading dot). Defaults to “parquet”.

  • not_found (str, optional) – What to do if either the directory does not exist or no matching files are found in the given directory. One of “ignore”, “warn”, or “raise”. Use the NotFound enum to avoid typos. Defaults to “raise”. If set otherwise, an empty tuple of file names might be returned.

__call__(subfolder='')

Choose a subdirectory and filter the names of files found therein.

Parameters:

subfolder (str, optional) – Subdirectory relative to the parent given at instantiation. Defaults to an empty string, resulting in that parent directory being searched.

Returns:

Fully resolved names of files that match the given criteria from within the specified directory.

Return type:

list

Raises:

FileNotFoundError – Only if not_found is set to “raise”, and then only if either the directory was not found or no files matching the specified criteria were found in that directory.
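The discovery rules described above can be sketched with the standard library alone. This is a minimal stand-in, not the class itself: the function name `discover` is hypothetical, and treating an empty `file_types` as "match every file with the suffix" is an assumption the text does not confirm.

```python
import tempfile
import warnings
from pathlib import Path


def discover(folder="", *file_types, suffix="parquet", not_found="raise"):
    """Hypothetical stand-in mimicking the documented discovery behavior."""
    directory = Path(folder).resolve()
    # Glob on the extension; a missing directory simply yields no candidates.
    candidates = sorted(directory.glob(f"*.{suffix}")) if directory.is_dir() else []
    # Keep names containing at least one requested string (assumption: keep all if none given).
    matches = [
        str(path) for path in candidates
        if not file_types or any(part in path.name for part in file_types)
    ]
    if not matches:
        message = f"No matching files found in {directory}"
        if not_found == "raise":
            raise FileNotFoundError(message)
        if not_found == "warn":
            warnings.warn(message)
    return matches


with tempfile.TemporaryDirectory() as tmp:
    for name in ("train_0.parquet", "test_0.parquet", "notes.txt"):
        (Path(tmp) / name).touch()
    found = [Path(file).name for file in discover(tmp, "train")]
    everything = [Path(file).name for file in discover(tmp)]
    nothing = discover(tmp, "valid", not_found="ignore")
```

Here `found` holds only the training file, `everything` holds both parquet files, and `nothing` is empty because `not_found` was relaxed to "ignore".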

class CorpusFilter(part)

Bases: ArgRepr

Determine whether a given string is part of a fully resolved file name.

Parameters:

part (str) – Part of the file name to filter for.

__call__(file)

Determine whether the cached string is part of the file name.

Parameters:

file (str) – Name of the file to test. Can include parent folder(s).

Returns:

Whether the cached part occurs in the file name at least once.

Return type:

bool
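As an illustration of why such a callable is handy, the following sketch uses a hypothetical stand-in class `PartFilter` (not the library's class) together with the built-in `filter`. It assumes the test is a plain substring check against the whole string passed in, parent folders included.

```python
class PartFilter:
    """Hypothetical stand-in: cache a string and test file names for it."""

    def __init__(self, part):
        self.part = part

    def __call__(self, file):
        # Assumption: a plain substring test on the full string, folders included.
        return self.part in file


is_train = PartFilter("train")
files = ["/data/train_0.parquet", "/data/test_0.parquet"]
kept = list(filter(is_train, files))
```

Because the instance is an ordinary callable returning a bool, it can be passed wherever a predicate is expected.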

class CorpusLoader(reader)

Bases: ArgRepr

Read files with multiple documents and provide an iterator over all.

Parameters:

reader (callable) – Must return an iterable over documents (i.e., strings) when given a file name.

__call__(files)

Read files with multiple documents and provide an iterator over all.

Parameters:

files (iterable over str) – Names of files to chain documents from.

Returns:

An itertools.chain iterator over all documents from all files.

Return type:

Iterator
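The chaining described above might look roughly like the sketch below. Both `read_lines` (a reader that treats each non-empty line as one document) and `load_corpus` are hypothetical names introduced for illustration; only the use of `itertools.chain` is taken from the text.

```python
import tempfile
from itertools import chain
from pathlib import Path


def read_lines(file):
    """Hypothetical reader: one document per non-empty line."""
    with open(file, encoding="utf-8") as stream:
        return [line.rstrip("\n") for line in stream if line.strip()]


def load_corpus(reader, files):
    """Stand-in for the documented behavior: chain documents from all files."""
    return chain.from_iterable(reader(file) for file in files)


with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "a.txt"
    second = Path(tmp) / "b.txt"
    first.write_text("doc one\ndoc two\n", encoding="utf-8")
    second.write_text("doc three\n", encoding="utf-8")
    # Consume the lazy iterator while the temporary files still exist.
    documents = list(load_corpus(read_lines, [str(first), str(second)]))
```

Since the result is a lazy iterator, files are only read as the consumer advances, so the iterator should be exhausted before the underlying files disappear.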

class DirectoryCleaner(folder, return_path=False)

Bases: ArgRepr

Provide a fresh, empty target directory to write to.

Parameters:
  • folder (str) – Parent directory to clean or create. Subdirectories can be specified when calling instances.

  • return_path (bool, optional) – Whether to return the fully resolved path to the emptied or created directory when instances are called. Defaults to False.

__call__(subfolder='')

Delete the contents of an existing directory or create a new one.

Parameters:

subfolder (str, optional) – Subdirectory relative to the parent given at instantiation. Defaults to an empty string, resulting in that parent directory being emptied or created.

Returns:

An empty tuple or the fully resolved path to the emptied or created directory, depending on whether return_path is set to False or True.

Return type:

tuple or str
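The empty-or-create behavior can be approximated with `pathlib` and `shutil`. The function `clean_directory` below is a hypothetical stand-in, and creating missing intermediate directories is an assumption not stated in the text.

```python
import shutil
import tempfile
from pathlib import Path


def clean_directory(folder, subfolder="", return_path=False):
    """Hypothetical stand-in: empty an existing directory or create a new one."""
    target = (Path(folder) / subfolder).resolve()
    if target.is_dir():
        # Remove every entry but keep the directory itself.
        for entry in target.iterdir():
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)
            else:
                entry.unlink()
    else:
        # Assumption: missing parent directories are created as well.
        target.mkdir(parents=True)
    return str(target) if return_path else ()


with tempfile.TemporaryDirectory() as tmp:
    stale = Path(tmp) / "run" / "old.txt"
    stale.parent.mkdir()
    stale.write_text("stale", encoding="utf-8")
    result = clean_directory(tmp, "run", return_path=True)
    emptied = not any(Path(result).iterdir())
    created = clean_directory(tmp, "fresh/nested")
    exists = (Path(tmp) / "fresh" / "nested").is_dir()
```

Returning an empty tuple by default presumably lets the call sit in a processing chain without passing anything on, while `return_path=True` hands the target directory to the next step.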

class FileTypeExtractor(file_type, *file_types)

Bases: ArgRepr

Determine if a file name contains one and only one of the given strings.

Parameters:
  • file_type (str) – String to test the file name for.

  • *file_types (str) – Additional strings to test the file name for.

__call__(path)

Determine whether a file name contains exactly one of the given strings and, if so, which one.

Parameters:

path (str) – A (fully resolved) file name, potentially including forward slashes and (sub-)directories.

Returns:

The one file type of the given file_types that is contained in the stem of the file name one or more times.

Return type:

str

Raises:

ValueError – If none of the cached file_types are contained in the stem of the given file name or if it contains more than one.
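A minimal sketch of the "one and only one" rule, using a hypothetical function `extract_file_type` rather than the class itself. It assumes the stem is what `pathlib` calls the stem, that is, the final path component without its last extension.

```python
from pathlib import Path


def extract_file_type(path, file_type, *file_types):
    """Hypothetical stand-in: return the single file type found in the stem."""
    stem = Path(path).stem
    # A type counts once no matter how often it occurs in the stem.
    found = [kind for kind in (file_type, *file_types) if kind in stem]
    if len(found) != 1:
        raise ValueError(
            f"Expected exactly one file type in '{stem}', found {len(found)}"
        )
    return found[0]


kind = extract_file_type("/data/train_2024.parquet", "train", "test")
```

A name such as "train_test.parquet" (two types) or "valid.parquet" (no type) would raise a ValueError in this sketch.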

class TokenizerLoader(algo, path='')

Bases: ArgRepr, Generic

Load a previously saved Tokenizer or Algo from file.

Parameters:
  • algo (Tokenizer or Algo) – A fresh, trained, or tainted instance of a tokenizer or an Algo.

  • path (str, optional) – Full or partial path to the model to load. If not fully specified here, it can be completed on calling the instance. Defaults to the current working directory of the Python interpreter.

__call__(path='')

Load a previously saved Tokenizer or Algo from file.

Parameters:

path (str, optional) – Path (including file name) to the file to load. If it starts with a forward slash, it is interpreted as absolute; otherwise, as relative to the path specified at instantiation. Defaults to an empty string, which results in an unchanged path.

Returns:

A new instance of the same type as the algo provided at instantiation with its internal parameters set to what was read from file.

Return type:

Tokenizer or Algo
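How the path given at instantiation and the path given on calling might combine can be sketched as follows. This assumes POSIX-style paths, in which a leading forward slash marks an absolute path; the helper `resolve_model_path` is hypothetical and says nothing about how the class actually reads the file.

```python
import posixpath


def resolve_model_path(base, path=""):
    """Hypothetical helper: combine instantiation-time and call-time paths."""
    path = path.strip()
    if not path:
        # An empty call-time path leaves the instantiation-time path unchanged.
        return base
    # posixpath.join discards `base` when `path` is absolute.
    return posixpath.join(base, path)


relative = resolve_model_path("/models", "bpe/tokenizer.json")
absolute = resolve_model_path("/models", "/backup/tokenizer.json")
unchanged = resolve_model_path("/models/tokenizer.json")
```

In this sketch, `relative` is "/models/bpe/tokenizer.json", `absolute` is "/backup/tokenizer.json", and `unchanged` is "/models/tokenizer.json".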

class TokenizerSaver(path='', create=False)

Bases: ArgRepr

Convenience wrapper around a Tokenizer’s or Algo’s save method.

Parameters:
  • path (str, optional) – Path (including file name) to save the tokenizer to. May include any number of string placeholders (i.e., pairs of curly brackets) that will be interpolated when instances are called. Defaults to the current working directory of the Python interpreter.

  • create (bool, optional) – Whether to create the directory where the tokenizer should be saved if it does not exist. Defaults to False.

__call__(algo, *parts)

Save a Tokenizer or Algo to file.

Parameters:
  • algo (Tokenizer or Algo) – The tokenizer or algo to save.

  • *parts (str, optional) – Fragments that will be interpolated into the path string given at instantiation. There must be at least as many fragments as there are placeholders in the path.

Returns:

An empty tuple.

Return type:

tuple
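The placeholder interpolation can be illustrated with Python's `str.format`, which fills pairs of curly brackets from positional arguments. The helper `prepare_target` is hypothetical, and interpreting `create=True` as "create missing parent directories" is an assumption; the actual saving is left to the tokenizer's own save method and is not shown.

```python
import tempfile
from pathlib import Path


def prepare_target(template, *parts, create=False):
    """Hypothetical helper: interpolate parts and optionally create the folder."""
    # str.format raises IndexError if there are fewer parts than placeholders.
    target = Path(template.format(*parts))
    if create:
        # Assumption: create=True means missing parent directories are made.
        target.parent.mkdir(parents=True, exist_ok=True)
    return target


with tempfile.TemporaryDirectory() as tmp:
    template = str(Path(tmp) / "{}" / "tokenizer_{}.json")
    target = prepare_target(template, "bpe", "v1", create=True)
    parent_exists = target.parent.is_dir()
    name = target.name
```

With two placeholders and the parts "bpe" and "v1", the file name becomes "tokenizer_v1.json" inside a freshly created "bpe" folder. Supplying only one part would make `str.format` raise an IndexError.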

extract_file_name(path)

Extract the bare file name from a fully resolved path to a file.

Parameters:

path (str) – Fully resolved path to a file.

Returns:

The bare file name without the leading slashes and (sub-)directories.

Return type:

str
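For comparison, the standard library offers the same kind of extraction through `pathlib`. The sketch below uses a hypothetical function `bare_file_name` and assumes forward-slash separators; whether the documented function also strips the extension is not stated here, so the sketch keeps it.

```python
from pathlib import PurePosixPath


def bare_file_name(path):
    """Hypothetical stand-in: drop leading slashes and (sub-)directories."""
    # Assumption: the extension is kept as part of the file name.
    return PurePosixPath(path).name


name = bare_file_name("/data/corpus/train_0.parquet")
```

In this sketch, `name` is "train_0.parquet".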