pl

Polars utilities and partials of dataframe method calls.

Parameters that are known at program start are used to initialize the classes so that, at runtime, dataframes can flow through a preconfigured processing pipe of callable objects.
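The pattern described above can be sketched without polars at all. The names `MethodPartial` and `pipe` below are hypothetical and only illustrate the idea of caching arguments at construction and applying them on call; strings stand in for dataframes.

```python
class MethodPartial:
    """Hypothetical illustration: cache arguments now, call a method later."""

    def __init__(self, method, *args, **kwargs):
        self.method = method
        self.args = args
        self.kwargs = kwargs

    def __call__(self, obj):
        # Look the method up on whatever object flows through the pipe.
        return getattr(obj, self.method)(*self.args, **self.kwargs)

    def __repr__(self):
        parts = [repr(self.method)]
        parts += [repr(arg) for arg in self.args]
        parts += [f"{key}={value!r}" for key, value in self.kwargs.items()]
        return f"{type(self).__name__}({', '.join(parts)})"


def pipe(obj, *steps):
    """Pass obj through each preconfigured step in order."""
    for step in steps:
        obj = step(obj)
    return obj


# Configured once, at program start ...
steps = (MethodPartial("strip"), MethodPartial("split", ","))
# ... and applied later to whatever data arrives at runtime.
result = pipe("  a,b,c  ", *steps)
```

Because every step is a plain callable with a readable representation, a configured pipeline can be logged or inspected before any data passes through it.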

class Cast(dtypes, *, strict=True)[source]

Bases: ArgRepr

Partial of the polars dataframe cast method.

Parameters:
  • dtypes (Dtypes) – Mapping of column names (or selector) to dtypes, or a single dtype to which all columns will be cast.

  • strict (bool, optional) – Raise if cast is invalid on rows after predicates are pushed down. If False, invalid casts will produce null values. Defaults to True.

__call__(df)[source]

Cast column(s) of a polars dataframe to the cached type(s).

Parameters:

df (DataFrame) – The dataframe to type-cast column(s) of.

Returns:

The dataframe with its column(s) cast to the new type(s).

Return type:

DataFrame

class Drop(*columns, strict=True)[source]

Bases: ArgRepr

Partial of the polars dataframe drop method.

Parameters:
  • *columns (ColumnNameOrSelector) – Names of the columns that should be removed from the dataframe. Accepts column selector input.

  • strict (bool, optional) – Validate that all column names exist in the current schema, and throw an exception if any do not. Defaults to True.

__call__(df)[source]

Drop columns from a polars dataframe.

Parameters:

df (DataFrame) – The dataframe to drop columns from.

Returns:

The dataframe without the dropped columns.

Return type:

DataFrame

class DropNulls(subset=None)[source]

Bases: ArgRepr

Partial of the polars dataframe drop_nulls method.

Parameters:

subset (Subset) – Column name(s) for which null values are considered. If set to None (default), use all columns.

__call__(df)[source]

Drop rows from a polars dataframe where considered columns are null.

Parameters:

df (DataFrame) – The dataframe to drop rows from.

Returns:

The dataframe with rows containing null values dropped.

Return type:

DataFrame

class Filter(*predicates, **constraints)[source]

Bases: ArgRepr

Partial of the polars dataframe filter method.

Parameters:
  • *predicates – Expression(s) that evaluate to a boolean Series.

  • **constraints – Filter the column(s) named by the keyword argument(s) by the supplied value(s). Constraints are implicitly combined with other filters by a logical and.

__call__(df)[source]

Filter dataframe by predicates and value constraints.

Parameters:

df (DataFrame) – The dataframe to filter.

Returns:

The filtered dataframe.

Return type:

DataFrame

class FromPandas(schema_overrides=None, rechunk=True, nan_to_null=True, include_index=False)[source]

Bases: ArgRepr

Partial of the polars top-level function from_pandas.

Parameters:
  • schema_overrides (dict, optional) – Support override of inferred types for one or more columns. Defaults to None.

  • rechunk (bool, optional) – Make sure that all data is in contiguous memory. Defaults to True.

  • nan_to_null (bool, optional) – Pyarrow will convert NaN values to None. Defaults to True.

  • include_index (bool, optional) – Load any non-default pandas indexes as columns. Defaults to False.

__call__(pandas)[source]

Convert pandas structures into polars series or dataframes.

Parameters:

pandas – Dataframe, series, or index to convert to polars.

Returns:

Series if pandas series or index, dataframe otherwise.

Return type:

Series or DataFrame

class GroupBy(*by, maintain_order=False, **named_by)[source]

Bases: ArgRepr

Partial of the polars dataframe group_by method.

Parameters:
  • *by (IntoExpr) – Column(s) to group by. Accepts expression input. Strings are parsed as column names.

  • maintain_order (bool, optional) – Ensure that the order of the groups is consistent with the input data. This is slower than a default group by. Setting this to True blocks the possibility to run on the streaming engine. Defaults to False.

  • **named_by (IntoExpr) – Additional columns to group by, specified as keyword arguments. The columns will be renamed to the keyword used.

__call__(df)[source]

Group a polars dataframe.

Parameters:

df (DataFrame) – The dataframe to group.

Returns:

The polars group-by object, ready to be aggregated.

Return type:

GroupBy

class GroupByAgg(*aggs, **named_aggs)[source]

Bases: ArgRepr

Partial of a polars (dynamic) group-by object’s agg method.

Parameters:
  • *aggs (IntoExpr) – Aggregations to compute for each group of the group by operation, specified as positional arguments. Accepts expression input. Strings are parsed as column names.

  • **named_aggs (IntoExpr) – Additional aggregations, specified as keyword arguments. The resulting columns will be renamed to the keyword used.

__call__(grouped)[source]

Aggregate a polars (dynamic) group-by object.

Parameters:

grouped (GroupBy or DynamicGroupBy) – The polars (dynamic) group-by object to aggregate.

Returns:

The aggregated (dynamic) group-by object.

Return type:

DataFrame

class GroupByDynamic(index_column, every, period=None, offset=None, include_boundaries=False, closed='left', label='left', group_by=None, start_by='window')[source]

Bases: ArgRepr

Partial of the polars dataframe group_by_dynamic method.

Parameters:
  • index_column (IntoExpr) – Column used to group based on the time window. Often of type Date or Datetime. This column must be sorted in ascending order (or, if group_by is specified, then it must be sorted in ascending order within each group). In case of a dynamic group by on indices, dtype needs to be Int32 or Int64. Note that Int32 gets temporarily cast to Int64, so if performance matters use an Int64 column.

  • every (str or timedelta) – Interval of the window. To window over an integer index column, append the letter “i” to an integer number (e.g., “2i”).

  • period (str or timedelta, optional) – Length of the window. Equals ‘every’ if set to None (the default).

  • offset (str or timedelta) – Offset of the window. Does not take effect if start_by is “datapoint”. Defaults to zero.

  • include_boundaries (bool, optional) – Add the lower and upper bound of the window to the “_lower_boundary” and “_upper_boundary” columns. This will impact performance because it is harder to parallelize. Defaults to False.

  • closed ("left", "right", "both", "none") – Define which sides of the temporal interval are closed (inclusive).

  • label ("left", "right", "datapoint") – Which label to use for the window: lower boundary, upper boundary, or first value of the index column in the given window. If you don’t need the label to be at one of the boundaries, choose “datapoint” for maximum performance.

  • group_by (IntoExpr, optional) – Also group by this column/these columns. Defaults to None.

  • start_by ("window", "datapoint", "monday", "tuesday", ...) – The strategy to determine the start of the first window by, where “window” takes the earliest timestamp, truncates it with every, and then adds offset. Weekly windows start on Monday. “datapoint” starts from the first encountered data point, whereas any day of the week starts the window at the weekday before the first data point. The resulting window is then shifted back until the earliest datapoint is in or in front of it.

__call__(df)[source]

Group a polars dataframe into dynamic windows along the index column.

Parameters:

df (DataFrame) – The dataframe to group into dynamic windows.

Returns:

The dynamic group-by object, ready to be aggregated.

Return type:

DynamicGroupBy

class Join(on=None, how='inner', left_on=None, right_on=None, suffix='_right', validate='m:m', nulls_equal=False, coalesce=None, maintain_order=None)[source]

Bases: ArgRepr

Partial of the polars dataframe join method.

Parameters:
  • on (str) – Name(s) of the join columns in both DataFrames. If set, left_on and right_on should be None. Should not be specified if how is “cross”. Defaults to None.

  • how ("inner", "left", "right", "full", "semi", "anti", "cross") – Join strategy.

  • left_on (str, optional) – Name(s) of the left join column(s). Defaults to None.

  • right_on (str, optional) – Name(s) of the right join column(s). Defaults to None.

  • suffix (str, optional) – Suffix to append to columns with a duplicate name. Defaults to “_right”.

  • validate ("m:m", "m:1", "1:m", "1:1") – Checks if join is of specified type: many-to-many, many-to-one, one-to-many, or one-to-one.

  • nulls_equal (bool, optional) – Join on null values. By default, null values will never produce matches. Defaults to False.

  • coalesce (bool, optional) – Coalescing behavior (merging of join columns). Defaults to None, which leaves the behavior join-specific.

  • maintain_order ("none", "left", "right", "left_right", "right_left") – Which dataframe row order to preserve, if any. Do not rely on any observed ordering without explicitly setting this parameter, as your code may break in a future release. Not specifying any ordering can improve performance. Supported for inner, left, right, and full joins.

__call__(left, right)[source]

Join two polars dataframes.

Parameters:
  • left (DataFrame) – Left dataframe in the join.

  • right (DataFrame) – Right dataframe in the join.

Returns:

The joined dataframes.

Return type:

DataFrame

class Pivot(on, on_columns=None, **kwargs)[source]

Bases: ArgRepr

Partial of the polars dataframe pivot method.

Parameters:
  • on (ColumnNameOrSelector) – The column(s) whose values will be used as the new columns of the output dataframe.

  • on_columns (Sequence or None) – What value combinations will be considered for the output table.

__call__(df)[source]

Pivot a polars dataframe with the cached (keyword) arguments.

Parameters:

df (DataFrame) – The dataframe to pivot.

Returns:

The pivoted dataframe.

Return type:

DataFrame

class Rename(mapping, *, strict=True)[source]

Bases: ArgRepr

Partial of the polars dataframe rename method.

Parameters:
  • mapping (Mapping or Callable) – Key value pairs that map from old name to new name, or a function that takes the old name as input and returns the new name.

  • strict (bool, optional) – Validate that all column names exist in the current schema, and throw an exception if any do not. Defaults to True.

__call__(df)[source]

Rename a polars dataframe’s columns.

Parameters:

df (DataFrame) – The dataframe to rename columns of.

Returns:

The dataframe with renamed columns.

Return type:

DataFrame

class Select(*exprs, **named_exprs)[source]

Bases: ArgRepr

Partial of the polars dataframe select method.

Parameters:
  • *exprs (IntoExpr) – Column(s) to select, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals.

  • **named_exprs (IntoExpr) – Additional columns to select, specified as keyword arguments. The columns will be renamed to the keyword used.

__call__(df)[source]

Select columns from a polars dataframe.

Parameters:

df (DataFrame) – The dataframe to select columns from.

Returns:

The selected columns.

Return type:

DataFrame

class Sort(by, *more_by, descending=False, nulls_last=False, multithreaded=True, maintain_order=False)[source]

Bases: ArgRepr

Partial of the polars dataframe sort method.

Parameters:
  • by (IntoExpr) – Column(s) to sort by. Accepts expression input, including selectors. Strings are parsed as column names.

  • *more_by (IntoExpr) – Additional columns to sort by, specified as positional arguments.

  • descending (bool, optional) – Sort in descending order. When sorting by multiple columns, can be specified per column by passing a sequence of booleans. Defaults to False.

  • nulls_last (bool, optional) – Place null values last. Can be a single boolean applying to all columns or a sequence of booleans for per-column control. Defaults to False.

  • multithreaded (bool, optional) – Sort using multiple threads. Defaults to True.

  • maintain_order (bool, optional) – Whether the order should be maintained if elements are equal. Defaults to False.

__call__(df)[source]

Sort polars dataframe by column values.

Parameters:

df (DataFrame) – The dataframe to sort.

Returns:

The sorted dataframe.

Return type:

DataFrame

class ToPandas(use_pyarrow_extension_array=False, **kwargs)[source]

Bases: ArgRepr

Partial of the polars dataframe to_pandas method.

Parameters:
  • use_pyarrow_extension_array (bool, optional) – Use pyarrow-backed extension arrays instead of numpy arrays for the columns of the pandas dataframe. This allows zero copy operations and preservation of null values. Subsequent operations on the resulting pandas dataframe may trigger conversion to numpy if those operations are not supported by pyarrow compute. Defaults to False.

  • **kwargs – Additional keyword arguments to be passed to pyarrow.Table.to_pandas().

__call__(df)[source]

Convert a polars dataframe into a pandas one.

Parameters:

df (DataFrame) – The polars dataframe to convert.

Returns:

The converted pandas dataframe.

Return type:

PandasFrame

class VStack(in_place=False)[source]

Bases: ArgRepr

Partial of the polars dataframe vstack method.

Parameters:

in_place (bool, optional) – Whether to modify in place. Defaults to False.

__call__(upper, lower)[source]

Stack two polars dataframes on top of each other.

Parameters:
  • upper (DataFrame) – The upper dataframe to be appended to.

  • lower (DataFrame) – The lower dataframe being appended to upper.

Returns:

The upper and lower dataframes stacked on top of each other.

Return type:

DataFrame

class WithColumns(*exprs, **named_exprs)[source]

Bases: ArgRepr

Partial of the polars dataframe with_columns method.

Parameters:
  • *exprs (IntoExpr) – Column(s) to add, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals.

  • **named_exprs (IntoExpr) – Additional columns to add, specified as keyword arguments. The columns will be renamed to the keyword used.

__call__(df)[source]

Add columns to a polars dataframe or replace existing ones.

Parameters:

df (DataFrame) – The dataframe to add columns to or replace columns of.

Returns:

The dataframe with columns added or replaced.

Return type:

DataFrame

io

class Parquet2LazyFrame(path='', storage=LazyStorage.FILE, storage_kws=None, **kwargs)[source]

Bases: LazyReader

Lazily scan a parquet file on any supported file system.

Parameters:
  • path (str, optional) – Base directory or full path to the parquet file. Since part of it can also be provided later, when the callable instance is called, it is optional here. Defaults to an empty string.

  • storage (str, optional) – The type of file system to read from (“file”, “s3”, “gs”, etc.). Defaults to “file”. Use the LazyStorage enum to avoid typos.

  • storage_kws (dict, optional) – Passed on as storage_options to polars.scan_parquet().

  • **kwargs – Passed on as additional keyword arguments to polars’ top-level scan_parquet() function. See the scan documentation for available options.

Raises:
  • TypeError – If path is not a string or storage_kws is not a dictionary.

  • ValueError – If storage is not among the currently supported file-system schemes.

See also

LazyStorage

__call__(path='')[source]

Lazily scan a parquet file on the specified file system.

Parameters:

path (str, optional) – Path (including file name) to the parquet file to scan. If it starts with a forward slash, it is interpreted as absolute; otherwise, it is joined to the path given at instantiation. Defaults to an empty string, which leaves the instantiation path unchanged.

Returns:

A Polars LazyFrame backed by the specified parquet file. No data is read until the frame is collected or sinked.

Return type:

LazyFrame

Raises:

ValueError – If the final path is directly under root (e.g., “/file.parquet”) because, on a local file system, this is not where you want to keep data and, on object storage, the first directory refers to the name of an (existing!) bucket.
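The path rules described for __call__ can be sketched in plain Python. The helper name `resolve` and its use of pathlib are assumptions made for illustration; the library's own implementation may differ.

```python
from pathlib import PurePosixPath


def resolve(base: str, path: str = "") -> str:
    """Hypothetical helper mimicking the documented path rules."""
    if path.startswith("/"):
        candidate = path                           # absolute: ignore the base
    elif path:
        candidate = f"{base.rstrip('/')}/{path}"   # relative: join to the base
    else:
        candidate = base                           # empty: keep the base as is
    full = PurePosixPath("/" + candidate.strip("/"))
    # ("/", "file.parquet") has only two parts: the file sits directly under root.
    if len(full.parts) < 3:
        raise ValueError(f"Path {full} must not point directly under root!")
    return str(full)


joined = resolve("/data/raw", "2024/sales.parquet")
absolute = resolve("/data/raw", "/other/sales.parquet")
unchanged = resolve("/data/raw/sales.parquet")

try:
    resolve("", "/sales.parquet")
except ValueError as error:
    message = str(error)
```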

class LazyFrame2Parquet(path, storage=LazyStorage.FILE, storage_kws=None, **kwargs)[source]

Bases: LazyWriter

Sink a polars lazy frame to a parquet file on any supported file system.

Parameters:
  • path (str) – The absolute path to the parquet file to write. May include any number of string placeholders (i.e., pairs of curly brackets) that will be interpolated when the instance is called.

  • storage (str, optional) – The type of file system to write to (“file”, “s3”, “gs”, etc.). Defaults to “file”. Use the LazyStorage enum to avoid typos.

  • storage_kws (dict, optional) – Passed as storage_options to polars.LazyFrame.sink_parquet().

  • **kwargs – Passed on as additional keyword arguments to polars.LazyFrame.sink_parquet(). See the sink documentation for available options.

Raises:
  • TypeError – If path is not a string or storage_kws is not a dictionary.

  • ValueError – If storage is not among the currently supported file-system schemes.

See also

LazyStorage

Note

sink_parquet requires a streaming-compatible query plan. Ensure your lazy query is compatible before calling. Polars will raise if it is not.

__call__(ldf, *parts)[source]

Sink a polars lazy frame to a parquet file.

Parameters:
  • ldf (LazyFrame) – The polars lazy frame to sink.

  • *parts (str) – Fragments that will be interpolated into the path given at instantiation. There must be at least as many parts as there are placeholders in the path.

Returns:

An empty tuple.

Return type:

tuple

Raises:
  • IndexError – If the path given at instantiation has more string placeholders than there are parts.

  • ValueError – If the final path is directly under root (e.g., “/file.parquet”) because, on a local file system, this is not where you want to save to and, on object storage, the first directory refers to the name of an (existing!) bucket.
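The placeholder behavior described above matches that of Python's built-in str.format, so it can be illustrated with it. That the library interpolates this way internally is an assumption; the template path is invented.

```python
# Hypothetical path template with two placeholders.
template = "/data/{}/{}.parquet"

# Exactly as many parts as placeholders.
full = template.format("2024", "sales")

# Surplus parts are silently ignored by str.format.
surplus = template.format("2024", "sales", "ignored")

# Too few parts raise an IndexError.
try:
    template.format("2024")
    too_few_raised = False
except IndexError:
    too_few_raised = True
```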

class LazyStorage(*values)[source]

Bases: StrEnum

FILE = file
S3 = s3
GCS = gs
AZURE = az
HF = hf

Base classes

class LazyReader(path='', storage=LazyStorage.FILE, storage_kws=None, *args, **kwargs)[source]

Bases: ArgRepr

Base class for scanning polars lazy frames from any filesystem.

Parameters:
  • path (str, optional) – Directory under which the parquet file is located or its full path. Since it (or part of it) can also be provided later, when the callable instance is called, it is optional here. Defaults to an empty string.

  • storage (str, optional) – The type of file system to scan from (“file”, “s3”, etc.). Defaults to “file”. Use the LazyStorage enum to avoid typos.

  • storage_kws (dict, optional) – Passed on as storage_options to polars’ scan methods.

  • *args – Additional arguments are reflected in the representation of instances but do not affect functionality in any way.

  • **kwargs – Additional keyword arguments are reflected in the representation of instances but do not affect functionality in any way.

Raises:
  • TypeError – If path is not a string or storage_kws is not a dictionary.

  • ValueError – If storage is not among the currently supported file-system schemes.

See also

LazyStorage

_non_root(path='')[source]

Assemble and validate the URI, raising if it points to root.

property prefix

The URI prefix for the selected storage backend.

class LazyWriter(path, storage=LazyStorage.FILE, storage_kws=None, *args, **kwargs)[source]

Bases: ArgRepr

Base class for sinks of polars lazy frames on any filesystem.

Parameters:
  • path (str) – The absolute path to the file to sink. May contain any number of string placeholders (i.e., pairs of curly brackets) that will be interpolated when instances are called.

  • storage (str, optional) – The type of file system to write to (“file”, “s3”, etc.). Defaults to “file”. Use the LazyStorage enum to avoid typos.

  • storage_kws (dict, optional) – Passed on as keyword arguments both to the fsspec filesystem constructor and as storage_options to polars’ sink methods.

  • *args – Additional arguments are reflected in the representation of instances but do not affect functionality in any way.

  • **kwargs – Additional keyword arguments are reflected in the representation of instances but do not affect functionality in any way.

Raises:
  • TypeError – If path is not a string or storage_kws is not a dictionary.

  • ValueError – If storage is not among the currently supported file-system schemes.

See also

LazyStorage

_uri_from(*parts)[source]

Check skip/overwrite and create parent directories.

property prefix

The URI prefix for the selected storage backend.