etl

Tools to clean, preprocess, and re-format your text corpus.

class CorpusCleaner(process, min_len=1, show_progress=True)

Bases: ArgRepr

Clean a single pandas series where each entry represents one document.

Parameters:
  • process (callable) – A callable object that accepts a single (raw) string as input and returns the cleaned string.

  • min_len (int, optional) – The minimum number of characters that a document should have. Shorter documents are filtered out. Defaults to 1.

  • show_progress (bool, optional) – Whether to show a progress bar that provides visual feedback in the console during the cleaning process. Defaults to True.

__call__(corpus)

Apply the cached processor to each document in a corpus.

Parameters:

corpus (Series) – Pandas series with each entry representing a single document to clean, that is, a single string.

Returns:

  • DataFrame – A pandas dataframe with the cleaned series as its sole column.

  • str – The SHA256 hash of the cleaned series.
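
The flow described above (clean every document, drop the short ones, return a one-column dataframe plus a hash) can be sketched roughly as follows. This is only an illustration, not the library's implementation: the function name clean_corpus, the column name "text", filtering after cleaning, and the way the SHA256 digest is derived from pandas' row hashes are all assumptions.

```python
import hashlib

import pandas as pd


def clean_corpus(corpus, process, min_len=1):
    """Hypothetical stand-in for CorpusCleaner.__call__ (not the library code)."""
    # Apply the cleaning callable to every document.
    cleaned = corpus.map(process)
    # Assumption: the length filter is applied to the cleaned documents.
    cleaned = cleaned[cleaned.str.len() >= min_len]
    # Assumption: digest built from pandas' per-row hashes; the library may differ.
    row_hashes = pd.util.hash_pandas_object(cleaned, index=False)
    digest = hashlib.sha256(row_hashes.values.tobytes()).hexdigest()
    return cleaned.to_frame(name="text"), digest


raw = pd.Series(["  Hello ", "", " x "])
frame, digest = clean_corpus(raw, str.strip, min_len=2)
print(frame)
print(digest)
```

Returning the hash alongside the dataframe makes it easy to check whether two cleaning runs produced the same result.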

class EncodingEnforcer(encoding, repl=' ')

Bases: ArgRepr

Force a text into an encoding, replacing unrepresentable characters.

Parameters:
  • encoding (str) – The target encoding. Choose one from Python's list of built-in codecs.

  • repl (str, optional) – The string to replace unrepresentable characters with. Defaults to a single space (" ").

__call__(text)

Replace unrepresentable characters in the given text.

Parameters:

text (str) – The text to force into the specified encoding by replacing unrepresentable characters.

Returns:

The text in the target encoding with unrepresentable characters replaced.

Return type:

str
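
One way to get this behavior with the standard library alone is to test each character against the target codec and substitute repl when encoding fails. The sketch below is an assumption about how such a replacement could work, not the class's actual code, and the name force_encoding is hypothetical.

```python
def force_encoding(text, encoding, repl=" "):
    """Hypothetical sketch: replace characters the codec cannot represent."""
    chars = []
    for char in text:
        try:
            # Probe whether this single character fits the target encoding.
            char.encode(encoding)
        except UnicodeEncodeError:
            chars.append(repl)
        else:
            chars.append(char)
    return "".join(chars)


print(force_encoding("naïve café", "ascii"))            # non-ASCII letters become spaces
print(force_encoding("naïve café", "ascii", repl="?"))  # custom replacement string
print(force_encoding("café", "latin-1"))                # representable, left unchanged
```

Checking one character at a time is slow for very large texts but keeps the replacement string fully configurable, which the built-in errors="replace" handler does not.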

class MemoryTrimmer(cdll='libc.so.6')

Bases: ArgRepr, Generic[T, Unpack[Ts]]

Free up memory no longer used by NumPy arrays, PyTorch tensors, etc.

These data structures cannot be reached by the Python garbage collector and can keep memory blocked even when your code no longer references them. In this case, memory can often be released by explicitly calling libc's malloc_trim.

Parameters:

cdll (str, optional) – The name of the standard C dynamic-link library. Defaults to libc.so.6.

__call__(*args)

Free up memory blocked by arrays the garbage collector can’t reach.

Parameters:

*args – To integrate anywhere in your code flow, instances can be called with any number of arguments, including none at all.

Returns:

An empty tuple when called with no arguments, the argument itself when called with exactly one argument, and a tuple of all arguments when called with several.

Return type:

object or tuple

property libc

The loaded C library.
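
A minimal sketch of the idea, assuming a Linux system with glibc: load the C library via ctypes, call malloc_trim(0), and pass any arguments through unchanged. The function name trim_memory, the pad value of 0, and the silent fallback when the library or symbol is missing are assumptions made for illustration, not the class's documented behavior.

```python
import ctypes


def trim_memory(*args, cdll="libc.so.6"):
    """Hypothetical sketch of a pass-through call to malloc_trim."""
    try:
        libc = ctypes.CDLL(cdll)
        # malloc_trim takes a size_t "pad" argument and returns an int.
        libc.malloc_trim.argtypes = [ctypes.c_size_t]
        libc.malloc_trim.restype = ctypes.c_int
        libc.malloc_trim(0)
    except (OSError, AttributeError):
        # Assumption: do nothing where glibc or malloc_trim is unavailable.
        pass
    # Return the arguments so the call can sit anywhere in a pipeline.
    return args[0] if len(args) == 1 else args


print(trim_memory())
print(trim_memory("batch"))
print(trim_memory("batch", 42))
```

Passing the arguments through is what allows the call to be dropped between two processing steps without changing what flows from one to the next.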

class RegexReplacer(pattern, repl, flags=0, count=0)

Bases: ArgRepr

Partial of Python’s own re.sub function.

Parameters:
  • pattern (str) – Regex pattern to match.

  • repl (str or callable) – The string to replace matches with, or a callable that accepts a Match object and returns the replacement string (see the documentation of Python's re module).

  • flags (int, optional) – Flags that modify the regex behavior (see the documentation of Python's re module). Defaults to 0, meaning no flags.

  • count (int, optional) – The maximum number of occurrences of pattern to replace. Defaults to 0, which replaces all occurrences.

__call__(text)

Replace matches of the cached regular expression in the text.

Parameters:

text (str) – The text with potential occurrences of the cached pattern.

Returns:

The text with occurrences of pattern replaced by repl.

Return type:

str
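
Since the class is described as a partial of re.sub, the same effect can be approximated with functools.partial from the standard library. The helper name regex_replacer below is hypothetical, and the sketch only mirrors the documented parameters rather than reproducing the class itself.

```python
import re
from functools import partial


def regex_replacer(pattern, repl, flags=0, count=0):
    """Hypothetical sketch: bind everything but the text to re.sub."""
    # count and flags are passed by keyword; the text stays the open argument.
    return partial(re.sub, pattern, repl, count=count, flags=flags)


collapse_spaces = regex_replacer(r"\s+", " ")
print(collapse_spaces("a  b\t\nc"))

# A callable repl receives each Match object and returns the replacement.
double = regex_replacer(r"\d+", lambda match: str(int(match.group()) * 2))
print(double("3 apples and 10 pears"))

# count=1 replaces only the first occurrence.
first_only = regex_replacer("a", "b", count=1)
print(first_only("aaa"))
```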

class Shuffle(active=True)

Bases: ArgRepr, Generic

Wrapper around random.shuffle.

Parameters:

active (bool, optional) – Flag to switch off shuffling for debugging purposes. Defaults to True.

__call__(sequence)

Shuffle a mutable sequence in place.

Parameters:

sequence (MutableSequence) – The mutable sequence to shuffle in place.

Returns:

The input sequence shuffled in place.

Return type:

MutableSequence
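
The documented behavior amounts to an in-place random.shuffle with an off switch that also returns its input. A rough equivalent, with the function name shuffle chosen only for illustration:

```python
import random


def shuffle(sequence, active=True):
    """Hypothetical sketch: shuffle in place unless switched off."""
    if active:
        random.shuffle(sequence)
    # The same object is returned, so the call can be chained.
    return sequence


items = [1, 2, 3, 4, 5]
result = shuffle(items)
print(result is items)  # the input object itself is returned

fixed = shuffle(["a", "b", "c"], active=False)
print(fixed)  # order untouched when shuffling is inactive
```

Because the shuffle happens in place, the caller's original list is reordered as well; pass a copy if the original order must be kept.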

class ToFrame(name, **kwargs)

Bases: ArgRepr

Turn any iterable into a single-column pandas dataframe.

Parameters:
  • name (Hashable) – The name of the single column in the dataframe.

  • **kwargs – Additional keyword arguments are forwarded to the pandas Series constructor.

__call__(iterable)

Convert an iterable into a single-column pandas dataframe.

Parameters:

iterable (iterable) – The object to convert into a single-column pandas dataframe.

Returns:

A pandas dataframe with the contents of the iterable in its only column.

Return type:

DataFrame
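
Given that keyword arguments are forwarded to the pandas Series constructor, the conversion can be pictured as building a named Series and turning it into a dataframe. The helper name to_frame and the intermediate list() call are assumptions for this sketch, not the class's confirmed implementation.

```python
import pandas as pd


def to_frame(iterable, name, **kwargs):
    """Hypothetical sketch: iterable -> named Series -> one-column dataframe."""
    # Materialize first so that generators and other one-shot iterables work.
    series = pd.Series(list(iterable), name=name, **kwargs)
    return series.to_frame()


frame = to_frame(range(3), "n")
print(frame)

# Extra keyword arguments, such as dtype, reach the Series constructor.
floats = to_frame((x * x for x in range(3)), "squares", dtype=float)
print(floats)
```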