etl
Tools to clean, preprocess, and re-format your text corpus.
- class CorpusCleaner(process, min_len=1, show_progress=True)
Bases: ArgRepr
Clean a single pandas series where each entry represents one document.
- Parameters:
process (callable) – A callable object that accepts a single (raw) string as input and returns the cleaned string.
min_len (int, optional) – The minimum number of characters that a document should have. Shorter documents are filtered out. Defaults to 1.
show_progress (bool, optional) – Whether to show a progress bar that provides visual feedback in the console during the cleaning process. Defaults to True.
- __call__(corpus)
Apply the cached processor to each document in a corpus.
- Parameters:
corpus (Series) – Pandas series with each entry representing a single document to clean, that is, a single string.
- Returns:
DataFrame – A pandas dataframe with the cleaned series as a sole column.
str – The SHA256 hash of the cleaned series.
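The behavior described above can be sketched without pandas. This is only an illustration, not the library's implementation: the helper name `clean_corpus` is hypothetical, filtering is assumed to happen after cleaning, and the way the hash is computed (UTF-8 bytes of the kept documents joined by newlines) is an assumption made for this example.

```python
import hashlib


def clean_corpus(corpus, process, min_len=1):
    """Hypothetical helper: clean each document, drop short ones, hash the rest."""
    cleaned = [process(doc) for doc in corpus]
    # Assumption: the length filter is applied to the cleaned documents.
    kept = [doc for doc in cleaned if len(doc) >= min_len]
    # Assumption: hash the UTF-8 bytes of all kept documents joined by newlines.
    digest = hashlib.sha256("\n".join(kept).encode("utf-8")).hexdigest()
    return kept, digest


docs = ["  Hello World  ", "   ", "Second document."]
cleaned, digest = clean_corpus(docs, str.strip, min_len=1)
print(cleaned)  # the whitespace-only document is filtered out
print(digest)   # 64 hexadecimal characters
```

Returning a hash alongside the cleaned data makes it possible to detect whether two runs produced the same corpus.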
- class EncodingEnforcer(encoding, repl=' ')
Bases: ArgRepr
Force a text into an encoding, replacing unrepresentable characters.
- Parameters:
encoding (str) – The target encoding. Choose one from the list of built-in codecs.
repl (str, optional) – The string to replace unrepresentable characters with. Defaults to a single space (" ").
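One way to get this effect with the standard library alone is to try encoding each character and substitute the replacement string on failure. The function name `enforce_encoding` is hypothetical, and this sketch makes no claim about how the class does it internally.

```python
def enforce_encoding(text, encoding, repl=" "):
    """Hypothetical helper: keep characters representable in `encoding`."""
    out = []
    for char in text:
        try:
            char.encode(encoding)
            out.append(char)
        except UnicodeEncodeError:
            # The character has no representation in the target encoding.
            out.append(repl)
    return "".join(out)


# "na\u00efve caf\u00e9 \u2615" is "naive cafe" with accents, then a hot-beverage symbol.
text = "na\u00efve caf\u00e9 \u2615"
ascii_only = enforce_encoding(text, "ascii", repl="?")
latin_only = enforce_encoding(text, "latin-1")
print(ascii_only)
print(latin_only)
```

With `ascii` as the target, all three non-ASCII characters are replaced; with `latin-1`, only the beverage symbol is.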
- class MemoryTrimmer(cdll='libc.so.6')
Bases: ArgRepr, Generic[T, Unpack[Ts]]
Free up memory no longer used by NumPy arrays, PyTorch tensors, etc.
These data structures cannot be reached by the Python garbage collector and can keep memory blocked even after your code no longer references them. In this case, memory can often be released by explicitly calling the C library’s malloc_trim.
- Parameters:
cdll (str, optional) – The name of the standard C dynamic-link library. Defaults to libc.so.6.
- __call__(*args)
Free up memory blocked by arrays the garbage collector can’t reach.
- Parameters:
*args – To integrate anywhere in your code flow, instances can be called with any number of arguments, including none at all.
- Returns:
An empty tuple if called with no arguments. When called with a single argument, that argument. When called with multiple arguments, a tuple of all of these arguments.
- Return type:
object or tuple
- property libc
The loaded C library.
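The pass-through call convention and the underlying `malloc_trim` call can be sketched with `ctypes`. The function name `trim_memory` is hypothetical. The sketch assumes a glibc-based Linux system; on other platforms the library or the symbol is missing, so the trim step is skipped silently here, which may differ from what the class does.

```python
import ctypes


def trim_memory(*args, cdll="libc.so.6"):
    """Hypothetical pass-through that asks the C library to release free memory."""
    try:
        # malloc_trim(0) returns as much free heap memory to the OS as possible.
        ctypes.CDLL(cdll).malloc_trim(0)
    except (OSError, AttributeError):
        # Library not found or symbol unavailable (e.g., not glibc): do nothing.
        pass
    # Mirror the documented return convention.
    if len(args) == 1:
        return args[0]
    return args


nothing = trim_memory()
single = trim_memory("result")
several = trim_memory(1, 2, 3)
```

Because the arguments are returned unchanged, such a callable can be dropped between any two steps of a processing chain.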
- class RegexReplacer(pattern, repl, flags=0, count=0)
Bases: ArgRepr
Partial of Python’s own re.sub function.
- Parameters:
pattern (str) – Regex pattern to match.
repl (str or callable) – String to replace matches with or, if a callable, a function that accepts a Match object (see documentation) and returns a string.
flags (int, optional) – A flag modifying the regex behavior (see documentation). Defaults to 0, indicating no flag.
count (int, optional) – The maximum number of occurrences of pattern to replace. Defaults to 0, which replaces all occurrences.
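Since the class is described as a partial of `re.sub`, the same idea can be shown with `functools.partial`. The factory name `regex_replacer` is hypothetical and only stands in for the class in this sketch.

```python
import re
from functools import partial


def regex_replacer(pattern, repl, flags=0, count=0):
    """Hypothetical factory: fix all arguments of re.sub except the string."""
    return partial(re.sub, pattern, repl, count=count, flags=flags)


# Collapse any run of whitespace into a single space.
squeeze = regex_replacer(r"\s+", " ")
squeezed = squeeze("a  b \n c")

# A callable `repl` receives the Match object; count=1 limits it to the first hit.
capitalize_first = regex_replacer(r"\b[a-z]", lambda m: m.group(0).upper(), count=1)
capitalized = capitalize_first("hello world")

print(squeezed)
print(capitalized)
```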
- class Shuffle(active=True)
Bases: ArgRepr, Generic
Wrapper around random.shuffle.
- Parameters:
active (bool, optional) – Flag to switch off shuffling for debugging purposes. Defaults to True.
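The effect of the `active` flag can be illustrated with a small function around `random.shuffle`. The name `shuffle` and the choice to shuffle a copy rather than the input itself are assumptions of this sketch, not documented behavior of the class.

```python
import random


def shuffle(items, active=True):
    """Hypothetical helper: return a shuffled copy, or an unshuffled one if inactive."""
    items = list(items)  # Assumption: work on a copy, leaving the input untouched.
    if active:
        random.shuffle(items)
    return items


unchanged = shuffle([1, 2, 3], active=False)
mixed = shuffle(range(10))
print(unchanged)
print(sorted(mixed))
```

Switching `active` off keeps the order deterministic, which makes runs reproducible while debugging.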
- class ToFrame(name, **kwargs)
Bases: ArgRepr
Turn any iterable into a single-column pandas dataframe.
- Parameters:
name (Hashable) – The name of the single column in the dataframe.
**kwargs – Additional keyword arguments are forwarded to the pandas Series constructor.