mix

Combine your embedded features into vectors of size model-dimension.

For interpretability, feature importance is available in all mixer flavors.

class CrossAttentionMixer(mod_dim, n_heads=1, bias=True, dropout=0.0, skip=True, keep_dim=False, device='cpu', dtype=torch.float32)[source]

Bases: Mixer

Combine stacked feature vectors via cross-attention with learned query.

Similar to, but lighter than, self-attention, and arguably cleaner as a mixer, since the mixing intent (the learned query) is decoupled from the input content.

Parameters:
  • mod_dim (int) – Size of the feature space. The input tensor is expected to be of that size in its last dimension and the output will again have this size in its last dimension.

  • n_heads (int, optional) – Number of attention heads. Must evenly divide mod_dim. Defaults to 1.

  • bias (bool, optional) – Whether to add learnable bias vectors in the attention projections. Defaults to True.

  • dropout (float, optional) – The amount of dropout to apply to the attention weights as well as to the mixed-features output. Defaults to 0.

  • skip (bool, optional) – Whether to add a residual connection around the feature mixing. Defaults to True.

  • keep_dim (bool, optional) – Whether to keep the next-to-last dimension of the output tensor as 1 or squeeze it. Defaults to False.

  • device (str or torch.device, optional) – Torch device to first create the mixer on. Defaults to “cpu”.

  • dtype (torch.dtype, optional) – Torch dtype to first create the mixer in. Defaults to torch.float.
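The idea of pooling with a learned query can be illustrated in plain PyTorch. The following is only a minimal sketch, not this class's implementation: the name LearnedQueryPooling is hypothetical, torch.nn.MultiheadAttention is assumed as the attention building block, and dropout, skip, keep_dim, and masking are left out. It also assumes a 3-dimensional input of shape (batch, n_features, mod_dim).

```python
import torch
import torch.nn as nn


class LearnedQueryPooling(nn.Module):
    """Hypothetical sketch: pool stacked feature vectors with one learned query."""

    def __init__(self, mod_dim: int, n_heads: int = 1) -> None:
        super().__init__()
        # A single learned query vector, shared by all instances.
        self.query = nn.Parameter(torch.randn(1, 1, mod_dim))
        self.attend = nn.MultiheadAttention(mod_dim, n_heads, batch_first=True)

    def forward(self, inp: torch.Tensor):
        # inp: (batch, n_features, mod_dim)
        query = self.query.expand(inp.size(0), -1, -1)
        # Keys and values come from the input; the query does not.
        mixed, weights = self.attend(query, inp, inp)
        # mixed: (batch, 1, mod_dim), weights: (batch, 1, n_features)
        return mixed.squeeze(1), weights.squeeze(1)


torch.manual_seed(0)
pool = LearnedQueryPooling(mod_dim=8, n_heads=2)
features = torch.randn(4, 3, 8)  # 4 instances, 3 features, mod_dim of 8
mixed, weights = pool(features)
print(mixed.shape, weights.shape)  # shapes (4, 8) and (4, 3)
```

Because there is only one query, the attention weights form a single row per instance, which is what makes them directly readable as per-feature importance.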

property device

The device all weights, biases, activations, etc. reside on.

property dtype

The dtype of all weights, biases, activations, and parameters.

forward(inp, mask=None)[source]

Forward pass for combining multiple stacked feature vectors.

Parameters:
  • inp (Tensor) – Feature vectors stacked into a tensor of at least 2 dimensions. The next-to-last dimension (of size n_features) enumerates the features to combine. The last dimension (of size mod_dim) is expected to contain the feature vectors.

  • mask (Tensor or None, optional) – Padding mask with its last dimension of size n_features. For a binary mask, a True value indicates that the corresponding feature will be ignored. For a float mask, the value will be directly added to the corresponding attention-key value. Defaults to None.

Returns:

Depending on keep_dim, the output tensor has the same number of dimensions as inp or one fewer. The next-to-last dimension is either 1 or dropped. The last dimension (of size mod_dim) contains the cross-attention-pooled combination of all feature vectors.

Return type:

Tensor
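The two mask flavors described above can be compared on raw attention scores. This is a hedged sketch that assumes the mask follows the same convention as the key_padding_mask of torch.nn.MultiheadAttention, which the wording suggests but the source does not confirm. The variable names are illustrative only.

```python
import torch

# Raw attention scores of one instance over 3 stacked features.
scores = torch.tensor([[2.0, 1.0, 0.5]])

# Binary mask: True marks a feature that should be ignored.
binary = torch.tensor([[False, False, True]])
ignored = scores.masked_fill(binary, float("-inf")).softmax(dim=-1)

# Float mask: its values are added to the scores before the softmax.
additive = torch.tensor([[0.0, 0.0, float("-inf")]])
shifted = (scores + additive).softmax(dim=-1)

# Both variants give the third feature a weight of exactly zero.
print(ignored)
print(shifted)
```

A float mask with finite negative values would merely down-weight a feature instead of removing it entirely.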

importance(inp, mask=None)[source]

Per-feature attention weights derived from the learned query.

Parameters:
  • inp (Tensor) – Feature vectors stacked into a tensor of at least 2 dimensions. The next-to-last dimension (of size n_features) enumerates the features to combine. The last dimension (of size mod_dim) is expected to contain the feature vectors.

  • mask (Tensor or None, optional) – Padding mask with its last dimension of size n_features. For a binary mask, a True value indicates that the corresponding feature will be ignored. For a float mask, the value will be directly added to the corresponding attention-key value. Defaults to None.

Returns:

Attention weights derived from the learned query. The output tensor has one fewer dimension than inp, with the last dimension dropped.

Return type:

Tensor

property mod_dim

The embedding size.

new()[source]

A fresh, new, re-initialized instance with identical parameters.

Returns:

A fresh, new instance of itself.

Return type:

CrossAttentionMixer

reset_parameters()[source]

Re-initialize all internal parameters.

class GlobalWeightsMixer(mod_dim, n_features, dropout=0.0, skip=True, keep_dim=False, device='cpu', dtype=torch.float32)[source]

Bases: Mixer

Combine stacked feature vectors through a learnable linear combination.

A single, global set of linear-combination coefficients is learned and shared across all instances. The coefficients are softmax-normalized to sum to 1 and can thus be read as a global feature importance.

Parameters:
  • mod_dim (int) – Ignored but mandatory to maintain API compatibility.

  • n_features (int) – The number of features to combine. Must be equal to the size of the next-to-last dimension of the input tensor.

  • dropout (float, optional) – The amount of dropout to apply to the mixed-features output. Defaults to 0.

  • skip (bool, optional) – Whether to add a residual connection around the feature mixing. Defaults to True.

  • keep_dim (bool, optional) – Whether to keep the next-to-last dimension of the output tensor as 1 or squeeze it. Defaults to False.

  • device (str or torch.device, optional) – Torch device to first create the mixer on. Defaults to “cpu”.

  • dtype (torch.dtype, optional) – Torch dtype to first create the mixer in. Defaults to torch.float.
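A softmax-normalized global linear combination can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the class's actual code: the name GlobalWeightedSum is hypothetical, the zero initialization (which yields uniform weights) is an assumption, and dropout, skip, and keep_dim are omitted.

```python
import torch
import torch.nn as nn


class GlobalWeightedSum(nn.Module):
    """Hypothetical sketch: one learned weight per feature, shared by all instances."""

    def __init__(self, n_features: int) -> None:
        super().__init__()
        # Assumed initialization: zeros, so the softmax starts out uniform.
        self.coeffs = nn.Parameter(torch.zeros(n_features))

    def importance(self, inp: torch.Tensor) -> torch.Tensor:
        # Broadcast the global weights to shape (..., n_features).
        return self.coeffs.softmax(dim=-1).expand(inp.shape[:-1])

    def forward(self, inp: torch.Tensor) -> torch.Tensor:
        # inp: (..., n_features, mod_dim)
        weights = self.coeffs.softmax(dim=-1).unsqueeze(-1)  # (n_features, 1)
        return (inp * weights).sum(dim=-2)  # (..., mod_dim)


torch.manual_seed(0)
mix = GlobalWeightedSum(n_features=3)
features = torch.randn(4, 3, 8)
mixed = mix(features)
weights = mix.importance(features)
print(mixed.shape, weights.shape)  # shapes (4, 8) and (4, 3)
```

With uniform weights, the sketch reduces to a plain mean over the features. Training then shifts weight toward the features that help most.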

property device

The device all weights, biases, activations, etc. reside on.

property dtype

The dtype of all weights, biases, activations, and parameters.

forward(inp, mask=None)[source]

Forward pass for combining multiple stacked feature vectors.

Parameters:
  • inp (Tensor) – Feature vectors stacked into a tensor of at least 2 dimensions. The size of the next-to-last dimension is expected to match the n_features provided at instantiation. The last dimension (of size mod_dim) is expected to contain the feature vectors.

  • mask (Tensor or None) – Ignored for the GlobalWeightsMixer.

Returns:

Depending on keep_dim, the output tensor has the same number of dimensions as inp or one fewer. The next-to-last dimension is either 1 or dropped. The last dimension (of size mod_dim) contains the globally weighted linear combination of all feature vectors.

Return type:

Tensor

importance(inp, mask=None)[source]

Learned global feature weights in the normed sum over all features.

Parameters:
  • inp (Tensor) – Feature vectors stacked into a tensor of at least 2 dimensions. The size of the next-to-last dimension is expected to match the n_features provided at instantiation. The last dimension (of size mod_dim) is expected to contain the feature vectors.

  • mask (Tensor or None) – Ignored for the GlobalWeightsMixer.

Returns:

The softmax-normalized coefficients broadcast to the shape of inp with the last dimension dropped, i.e. (..., n_features).

Return type:

Tensor

property mod_dim

The embedding size.

property n_features

The number of features in the mix.

new()[source]

A fresh, new, re-initialized instance with identical parameters.

Returns:

A fresh, new instance of itself.

Return type:

GlobalWeightsMixer

reset_parameters()[source]

Re-initialize the coefficients for the linear combination.

class InstanceWeightsMixer(mod_dim, n_features, bias=True, dropout=0.0, skip=True, keep_dim=False, device='cpu', dtype=torch.float32)[source]

Bases: Mixer

Combine stacked feature vectors by a per-instance linear combination.

The per-instance coefficients sum to 1 for each data point and can thus be read as a per-instance feature importance. They are obtained by concatenating all features into a single, wide vector, linearly projecting it down to a vector with as many elements as there are features to combine, and then applying a softmax.

Parameters:
  • mod_dim (int) – Size of the feature space. The input tensor is expected to be of that size in its last dimension and the output will again have this size in its last dimension.

  • n_features (int) – The number of features to combine. Must be equal to the size of the next-to-last dimension of the input tensor.

  • bias (bool, optional) – Whether to add a learnable bias vector in the projection. Defaults to True.

  • dropout (float, optional) – The amount of dropout to apply to the mixed-features output. Defaults to 0.

  • skip (bool, optional) – Whether to add a residual connection around the feature mixing. Defaults to True.

  • keep_dim (bool, optional) – Whether to keep the next-to-last dimension of the output tensor as 1 or squeeze it. Defaults to False.

  • device (str or torch.device, optional) – Torch device to first create the mixer on. Defaults to “cpu”.

  • dtype (torch.dtype, optional) – Torch dtype to first create the mixer in. Defaults to torch.float.
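The concatenate, project, softmax recipe described above can be sketched as follows. This is a minimal illustration only: the name InstanceWeightedSum is hypothetical, a single torch.nn.Linear layer is assumed for the projection, and dropout, skip, and keep_dim are omitted.

```python
import torch
import torch.nn as nn


class InstanceWeightedSum(nn.Module):
    """Hypothetical sketch: per-instance softmax weights from a linear projection."""

    def __init__(self, mod_dim: int, n_features: int, bias: bool = True) -> None:
        super().__init__()
        # Project the wide, concatenated vector down to one score per feature.
        self.project = nn.Linear(n_features * mod_dim, n_features, bias=bias)

    def importance(self, inp: torch.Tensor) -> torch.Tensor:
        # Concatenate all features: (..., n_features * mod_dim)
        wide = inp.flatten(start_dim=-2)
        return self.project(wide).softmax(dim=-1)  # (..., n_features)

    def forward(self, inp: torch.Tensor) -> torch.Tensor:
        weights = self.importance(inp).unsqueeze(-1)  # (..., n_features, 1)
        return (inp * weights).sum(dim=-2)  # (..., mod_dim)


torch.manual_seed(0)
mix = InstanceWeightedSum(mod_dim=8, n_features=3)
features = torch.randn(4, 3, 8)
mixed = mix(features)
weights = mix.importance(features)
print(mixed.shape, weights.shape)  # shapes (4, 8) and (4, 3)
```

Unlike a global set of coefficients, the weights here depend on the content of each instance, so two data points can weight the same feature differently.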

property device

The device all weights, biases, activations, etc. reside on.

property dtype

The dtype of all weights, biases, activations, and parameters.

forward(inp, mask=None)[source]

Forward pass for combining multiple stacked feature vectors.

Parameters:
  • inp (Tensor) – Feature vectors stacked into a tensor of at least 2 dimensions. The size of the next-to-last dimension is expected to match the n_features provided at instantiation. The last dimension (of size mod_dim) is expected to contain the feature vectors.

  • mask (Tensor or None) – Ignored for the InstanceWeightsMixer.

Returns:

Depending on keep_dim, the output tensor has the same number of dimensions as inp or one fewer. The next-to-last dimension is either 1 or dropped. The last dimension (of size mod_dim) contains the per-instance (normed) linear combination of all feature vectors.

Return type:

Tensor

importance(inp, mask=None)[source]

Per-instance weights in the normed linear combination of features.

Parameters:
  • inp (Tensor) – Feature vectors stacked into a tensor of at least 2 dimensions. The size of the next-to-last dimension is expected to match the n_features provided at instantiation. The last dimension (of size mod_dim) is expected to contain the feature vectors.

  • mask (Tensor or None) – Ignored for the InstanceWeightsMixer.

Returns:

The output tensor has one fewer dimension than inp, with the last dimension dropped.

Return type:

Tensor

property mod_dim

The embedding size.

property n_features

The number of features in the mix.

new()[source]

A fresh, new, re-initialized instance with identical parameters.

Returns:

A fresh, new instance of itself.

Return type:

InstanceWeightsMixer

reset_parameters()[source]

Re-initialize all internal parameters.

class SelfAttentionMixer(mod_dim, n_heads=1, bias=True, dropout=0.0, skip=True, keep_dim=False, device='cpu', dtype=torch.float32)[source]

Bases: Mixer

Combine stacked feature vectors via multi-head self-attention pooling.

Each feature vector attends to all others, and the resulting attended representations are averaged across the feature dimension to yield a single output vector per instance. The attention weights averaged over all query positions serve as per-instance feature importance scores.

Parameters:
  • mod_dim (int) – Size of the feature space. The input tensor is expected to be of that size in its last dimension and the output will again have this size in its last dimension.

  • n_heads (int, optional) – Number of attention heads. Must evenly divide mod_dim. Defaults to 1.

  • bias (bool, optional) – Whether to add learnable bias vectors in the attention projections. Defaults to True.

  • dropout (float, optional) – The amount of dropout to apply to the attention weights as well as to the mixed-features output. Defaults to 0.

  • skip (bool, optional) – Whether to add a residual connection around the feature mixing. Defaults to True.

  • keep_dim (bool, optional) – Whether to keep the next-to-last dimension of the output tensor as 1 or squeeze it. Defaults to False.

  • device (str or torch.device, optional) – Torch device to first create the mixer on. Defaults to “cpu”.

  • dtype (torch.dtype, optional) – Torch dtype to first create the mixer in. Defaults to torch.float.
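Self-attention pooling as described above, with outputs averaged across features and attention weights averaged over query positions, can be sketched like this. It is a simplified illustration, not the class's implementation: the name SelfAttentionPooling is hypothetical, torch.nn.MultiheadAttention is assumed as the building block, and dropout, skip, keep_dim, and masking are omitted. A 3-dimensional input of shape (batch, n_features, mod_dim) is assumed.

```python
import torch
import torch.nn as nn


class SelfAttentionPooling(nn.Module):
    """Hypothetical sketch: self-attention followed by a mean over features."""

    def __init__(self, mod_dim: int, n_heads: int = 1) -> None:
        super().__init__()
        self.attend = nn.MultiheadAttention(mod_dim, n_heads, batch_first=True)

    def forward(self, inp: torch.Tensor):
        # Queries, keys, and values are all the stacked feature vectors.
        attended, weights = self.attend(inp, inp, inp)
        # attended: (batch, n_features, mod_dim)
        # weights:  (batch, n_features as queries, n_features as keys)
        mixed = attended.mean(dim=1)      # average attended vectors
        importance = weights.mean(dim=1)  # average over query positions
        return mixed, importance


torch.manual_seed(0)
pool = SelfAttentionPooling(mod_dim=8, n_heads=2)
features = torch.randn(4, 3, 8)
mixed, importance = pool(features)
print(mixed.shape, importance.shape)  # shapes (4, 8) and (4, 3)
```

Since every row of the attention matrix sums to 1, the average over query positions also sums to 1 per instance, which keeps the importance scores comparable to those of the other mixers.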

property device

The device all weights, biases, activations, etc. reside on.

property dtype

The dtype of all weights, biases, activations, and parameters.

forward(inp, mask=None)[source]

Forward pass for combining multiple stacked feature vectors.

Parameters:
  • inp (Tensor) – Feature vectors stacked into a tensor of at least 2 dimensions. The next-to-last dimension (of size n_features) enumerates the features to combine. The last dimension (of size mod_dim) is expected to contain the feature vectors.

  • mask (Tensor or None, optional) – Padding mask with its last dimension of size n_features. For a binary mask, a True value indicates that the corresponding feature will be ignored. For a float mask, the value will be directly added to the corresponding attention-key value. Defaults to None.

Returns:

Depending on keep_dim, the output tensor has the same number of dimensions as inp or one fewer. The next-to-last dimension is either 1 or dropped. The last dimension (of size mod_dim) contains the attention-pooled combination of all feature vectors.

Return type:

Tensor

importance(inp, mask=None)[source]

Per-instance weights in the attention-based combination of features.

Parameters:
  • inp (Tensor) – Feature vectors stacked into a tensor of at least 2 dimensions. The next-to-last dimension (of size n_features) enumerates the features to combine. The last dimension (of size mod_dim) is expected to contain the feature vectors.

  • mask (Tensor or None, optional) – Padding mask with its last dimension of size n_features. For a binary mask, a True value indicates that the corresponding feature will be ignored. For a float mask, the value will be directly added to the corresponding attention-key value. Defaults to None.

Returns:

Attention weights averaged over query positions. The output tensor has one fewer dimension than inp, with the last dimension dropped.

Return type:

Tensor

property mod_dim

The embedding size.

new()[source]

A fresh, new, re-initialized instance with identical parameters.

Returns:

A fresh, new instance of itself.

Return type:

SelfAttentionMixer

reset_parameters()[source]

Re-initialize all internal parameters.

Base class

class Mixer(*_, **__)[source]

Bases: Block

abstractmethod importance(inp, mask=None)[source]

Return per-instance feature importance.
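How an abstract importance method constrains subclasses can be illustrated with a small stand-in. Everything here is hypothetical: MixerSketch and MeanMixer are illustrative names, and torch.nn.Module replaces the Block base class, whose definition is not shown in this section.

```python
from abc import ABC, abstractmethod

import torch
import torch.nn as nn


class MixerSketch(nn.Module, ABC):
    """Hypothetical stand-in for an abstract mixer base class."""

    @abstractmethod
    def importance(self, inp: torch.Tensor, mask=None) -> torch.Tensor:
        """Return per-instance feature importance."""


class MeanMixer(MixerSketch):
    """Illustrative subclass: plain averaging with uniform importance."""

    def forward(self, inp: torch.Tensor, mask=None) -> torch.Tensor:
        return inp.mean(dim=-2)

    def importance(self, inp: torch.Tensor, mask=None) -> torch.Tensor:
        # Every feature contributes equally: shape (..., n_features).
        n_features = inp.size(-2)
        return torch.full(
            inp.shape[:-1], 1.0 / n_features, dtype=inp.dtype, device=inp.device
        )


features = torch.randn(4, 3, 8)
mixer = MeanMixer()
mixed = mixer(features)
weights = mixer.importance(features)
print(mixed.shape, weights.shape)  # shapes (4, 8) and (4, 3)
```

In this sketch, instantiating MixerSketch directly raises a TypeError, so any concrete mixer has to provide its own importance method.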