mix
Combine your embedded features into vectors of size model-dimension.
For interpretability, feature importance is available in all mixer flavors.
- class CrossAttentionMixer(mod_dim, n_heads=1, bias=True, dropout=0.0, skip=True, keep_dim=False, device='cpu', dtype=torch.float32)[source]
Bases: Mixer
Combine stacked feature vectors via cross-attention with a learned query.
Similar to self-attention but lighter, and arguably cleaner as a mixer: the learned query that encodes the mixing intent is decoupled from the input content.
- Parameters:
mod_dim (int) – Size of the feature space. The input tensor is expected to be of that size in its last dimension and the output will again have this size in its last dimension.
n_heads (int, optional) – Number of attention heads. Must evenly divide mod_dim. Defaults to 1.
bias (bool, optional) – Whether to add learnable bias vectors in the attention projections. Defaults to True.
dropout (float, optional) – The amount of dropout to apply to the attention weights as well as to the mixed-features output. Defaults to 0.
skip (bool, optional) – Whether to add a residual connection around the feature mixing. Defaults to True.
keep_dim (bool, optional) – Whether to keep the next-to-last dimension of the output tensor as 1 or squeeze it. Defaults to False.
device (str or torch.device, optional) – Torch device to first create the mixer on. Defaults to “cpu”.
dtype (torch.dtype, optional) – Torch dtype to first create the mixer in. Defaults to torch.float.
- property device
The device all weights, biases, activations, etc. reside on.
- property dtype
The dtype of all weights, biases, activations, and parameters.
- forward(inp, mask=None)[source]
Forward pass for combining multiple stacked feature vectors.
- Parameters:
inp (Tensor) – Feature vectors stacked into a tensor of at least 2 dimensions. The size of the next-to-last dimension is the number of features, n_features, to combine. The last dimension (of size mod_dim) is expected to contain the feature vectors.
mask (Tensor or None, optional) – Padding mask with its last dimension of size n_features. For a binary mask, True values indicate that the corresponding feature will be ignored. For a float mask, the value will be directly added to the corresponding attention-key value. Defaults to None.
- Returns:
Depending on keep_dim, the output tensor has the same number of dimensions as inp or one fewer. The next-to-last dimension is either 1 or dropped. The last dimension (of size mod_dim) contains the cross-attention-pooled combination of all feature vectors.
- Return type:
Tensor
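The pooling described above can be sketched with plain PyTorch. This is only an illustration under stated assumptions, not the mixer's actual implementation: the names attention and query are hypothetical, torch.nn.MultiheadAttention stands in for the internal attention layer, and dropout and the residual connection are left out.

```python
import torch
from torch import nn

torch.manual_seed(0)
batch, n_features, mod_dim = 4, 3, 8

# Hypothetical stand-ins: an attention layer and a single learned query vector.
attention = nn.MultiheadAttention(mod_dim, num_heads=2, batch_first=True)
query = nn.Parameter(torch.randn(1, 1, mod_dim))

# Stacked feature vectors: (batch, n_features, mod_dim).
inp = torch.randn(batch, n_features, mod_dim)

# The same query attends over the features (keys and values) of every instance.
pooled, weights = attention(query.expand(batch, -1, -1), inp, inp)

kept = pooled                  # analogous to keep_dim=True
squeezed = pooled.squeeze(-2)  # analogous to keep_dim=False

print(kept.shape)      # torch.Size([4, 1, 8])
print(squeezed.shape)  # torch.Size([4, 8])
print(weights.shape)   # torch.Size([4, 1, 3])
```

Because there is only one query position, the cost grows linearly with the number of features rather than quadratically as in self-attention.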
- importance(inp, mask=None)[source]
Per-feature attention weights derived from the learned query.
- Parameters:
inp (Tensor) – Feature vectors stacked into a tensor of at least 2 dimensions. The size of the next-to-last dimension is the number of features, n_features, to combine. The last dimension (of size mod_dim) is expected to contain the feature vectors.
mask (Tensor or None, optional) – Padding mask with its last dimension of size n_features. For a binary mask, True values indicate that the corresponding feature will be ignored. For a float mask, the value will be directly added to the corresponding attention-key value. Defaults to None.
- Returns:
Attention weights derived from the learned query. The output tensor has one fewer dimension than inp, with the last dimension dropped.
- Return type:
Tensor
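The effect of the two mask flavors on the returned weights can be illustrated with bare softmax arithmetic. The sketch assumes that the mask acts on the attention scores before the softmax, which is how torch.nn.MultiheadAttention treats a key padding mask; the score values are made up.

```python
import torch

# Made-up attention scores of the learned query against 4 features.
scores = torch.tensor([[2.0, 1.0, 0.5, -1.0]])

# Binary mask: True marks a feature to ignore, so it receives zero weight.
bool_mask = torch.tensor([[False, False, True, True]])
bool_weights = torch.softmax(
    scores.masked_fill(bool_mask, float("-inf")), dim=-1
)

# Float mask: the values are added to the scores. -inf removes a feature
# entirely, a finite negative value only down-weights it.
float_mask = torch.tensor([[0.0, 0.0, float("-inf"), -2.0]])
float_weights = torch.softmax(scores + float_mask, dim=-1)

print(bool_weights)
print(float_weights)
```

In both cases the remaining weights still sum to 1 over the feature dimension.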
- property mod_dim
The embedding size.
- class GlobalWeightsMixer(mod_dim, n_features, dropout=0.0, skip=True, keep_dim=False, device='cpu', dtype=torch.float32)[source]
Bases: Mixer
Combine stacked feature vectors through a learnable linear combination.
A single, global set of linear-combination coefficients is learned and shared across all instances. A softmax makes the coefficients sum to 1, so they can be read as a global feature importance.
- Parameters:
mod_dim (int) – Ignored but mandatory to maintain API compatibility.
n_features (int) – The number of features to combine. Must be equal to the size of the next-to-last dimension of the input tensor.
dropout (float, optional) – The amount of dropout to apply to the mixed-features output. Defaults to 0.
skip (bool, optional) – Whether to add a residual connection around the feature mixing. Defaults to True.
keep_dim (bool, optional) – Whether to keep the next-to-last dimension of the output tensor as 1 or squeeze it. Defaults to False.
device (str or torch.device, optional) – Torch device to first create the mixer on. Defaults to “cpu”.
dtype (torch.dtype, optional) – Torch dtype to first create the mixer in. Defaults to torch.float.
- property device
The device all weights, biases, activations, etc. reside on.
- property dtype
The dtype of all weights, biases, activations, and parameters.
- forward(inp, mask=None)[source]
Forward pass for combining multiple stacked feature vectors.
- Parameters:
inp (Tensor) – Feature vectors stacked into a tensor of at least 2 dimensions. The size of the next-to-last dimension is expected to match the n_features provided at instantiation. The last dimension (of size mod_dim) is expected to contain the feature vectors.
mask (Tensor or None) – Ignored for the GlobalWeightsMixer.
- Returns:
Depending on keep_dim, the output tensor has the same number of dimensions as inp or one fewer. The next-to-last dimension is either 1 or dropped. The last dimension (of size mod_dim) contains the globally weighted linear combination of all feature vectors.
- Return type:
Tensor
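A minimal sketch of such a globally weighted combination, assuming one learnable score per feature that is initialized to zero. The parameter name coeffs and the initialization are hypothetical; dropout and the residual connection are left out.

```python
import torch
from torch import nn

torch.manual_seed(0)
n_features, mod_dim = 3, 4

# Hypothetical parameter: one learnable score per feature.
coeffs = nn.Parameter(torch.zeros(n_features))

# Stacked feature vectors: (batch, n_features, mod_dim).
inp = torch.randn(5, n_features, mod_dim)

# Softmax turns the scores into weights that sum to 1.
weights = torch.softmax(coeffs, dim=-1)  # (n_features,)

# Weighted sum over the next-to-last (feature) dimension.
mixed = (weights.unsqueeze(-1) * inp).sum(dim=-2, keepdim=True)

print(mixed.shape)              # torch.Size([5, 1, 4]), as with keep_dim=True
print(mixed.squeeze(-2).shape)  # torch.Size([5, 4]), as with keep_dim=False
```

With all scores equal, the weights are uniform and the result is the plain mean over features; training then shifts weight toward the more useful features.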
- importance(inp, mask=None)[source]
Learned global feature weights in the normed sum over all features.
- Parameters:
inp (Tensor) – Feature vectors stacked into a tensor of at least 2 dimensions. The size of the next-to-last dimension is expected to match the n_features provided at instantiation. The last dimension (of size mod_dim) is expected to contain the feature vectors.
mask (Tensor or None) – Ignored for the GlobalWeightsMixer.
- Returns:
The softmax-normalized coefficients broadcast to the shape of inp with the last dimension dropped, i.e. (..., n_features).
- Return type:
Tensor
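The broadcasting of the global weights to the leading dimensions of inp could look as follows. The score values are made up and the variable names are hypothetical.

```python
import torch

n_features, mod_dim = 3, 4

# Made-up learned scores, one per feature.
coeffs = torch.tensor([0.0, 1.0, 2.0])
inp = torch.randn(2, 5, n_features, mod_dim)

weights = torch.softmax(coeffs, dim=-1)  # (n_features,)

# Every instance gets the same weights: expand to inp's shape minus its
# last dimension, without copying any data.
importance = weights.expand(inp.shape[:-1])

print(importance.shape)  # torch.Size([2, 5, 3])
```

Since the weights are global, the content of inp plays no role here; only its shape is used.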
- property mod_dim
The embedding size.
- property n_features
The number of features in the mix.
- class InstanceWeightsMixer(mod_dim, n_features, bias=True, dropout=0.0, skip=True, keep_dim=False, device='cpu', dtype=torch.float32)[source]
Bases: Mixer
Combine stacked feature vectors by a per-instance linear combination.
The per-instance coefficients sum to 1 for each data point and can thus be read as a per-instance feature importance. They are obtained by concatenating all features into a single, wide vector, linearly projecting it down to a vector with as many elements as there are features to combine, and then applying a softmax.
- Parameters:
mod_dim (int) – Size of the feature space. The input tensor is expected to be of that size in its last dimension and the output will again have this size in its last dimension.
n_features (int) – The number of features to combine. Must be equal to the size of the next-to-last dimension of the input tensor.
bias (bool, optional) – Whether to add a learnable bias vector in the projection. Defaults to True.
dropout (float, optional) – The amount of dropout to apply to the mixed-features output. Defaults to 0.
skip (bool, optional) – Whether to add a residual connection around the feature mixing. Defaults to True.
keep_dim (bool, optional) – Whether to keep the next-to-last dimension of the output tensor as 1 or squeeze it. Defaults to False.
device (str or torch.device, optional) – Torch device to first create the mixer on. Defaults to “cpu”.
dtype (torch.dtype, optional) – Torch dtype to first create the mixer in. Defaults to torch.float.
- property device
The device all weights, biases, activations, etc. reside on.
- property dtype
The dtype of all weights, biases, activations, and parameters.
- forward(inp, mask=None)[source]
Forward pass for combining multiple stacked feature vectors.
- Parameters:
inp (Tensor) – Feature vectors stacked into a tensor of at least 2 dimensions. The size of the next-to-last dimension is expected to match the n_features provided at instantiation. The last dimension (of size mod_dim) is expected to contain the feature vectors.
mask (Tensor or None) – Ignored for the InstanceWeightsMixer.
- Returns:
Depending on keep_dim, the output tensor has the same number of dimensions as inp or one fewer. The next-to-last dimension is either 1 or dropped. The last dimension (of size mod_dim) contains the per-instance (normed) linear combination of all feature vectors.
- Return type:
Tensor
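The concatenate, project, and softmax steps from the class description can be sketched as below. The layer name project is hypothetical, and dropout and the residual connection are left out, so this is an illustration of the idea rather than the mixer's code.

```python
import torch
from torch import nn

torch.manual_seed(0)
batch, n_features, mod_dim = 4, 3, 8

# Hypothetical layer: maps the concatenated features to one score per feature.
project = nn.Linear(n_features * mod_dim, n_features, bias=True)

# Stacked feature vectors: (batch, n_features, mod_dim).
inp = torch.randn(batch, n_features, mod_dim)

# Concatenate all features of an instance into one wide vector.
wide = inp.flatten(start_dim=-2)                # (batch, n_features * mod_dim)

# One weight per feature and instance, summing to 1 for each instance.
weights = torch.softmax(project(wide), dim=-1)  # (batch, n_features)

# Per-instance weighted sum over the feature dimension.
mixed = (weights.unsqueeze(-1) * inp).sum(dim=-2)

print(weights.shape)  # torch.Size([4, 3])
print(mixed.shape)    # torch.Size([4, 8])
```

Unlike a single global set of coefficients, the weights here depend on the content of each instance, so two data points can emphasize different features.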
- importance(inp, mask=None)[source]
Per-instance weights in the normed linear combination of features.
- Parameters:
inp (Tensor) – Feature vectors stacked into a tensor of at least 2 dimensions. The size of the next-to-last dimension is expected to match the n_features provided at instantiation. The last dimension (of size mod_dim) is expected to contain the feature vectors.
mask (Tensor or None) – Ignored for the InstanceWeightsMixer.
- Returns:
The output tensor has one fewer dimension than inp, with the last dimension dropped.
- Return type:
Tensor
- property mod_dim
The embedding size.
- property n_features
The number of features in the mix.
- class SelfAttentionMixer(mod_dim, n_heads=1, bias=True, dropout=0.0, skip=True, keep_dim=False, device='cpu', dtype=torch.float32)[source]
Bases: Mixer
Combine stacked feature vectors via multi-head self-attention pooling.
Each feature vector attends to all others, and the resulting attended representations are averaged across the feature dimension to yield a single output vector per instance. The attention weights averaged over all query positions serve as per-instance feature importance scores.
- Parameters:
mod_dim (int) – Size of the feature space. The input tensor is expected to be of that size in its last dimension and the output will again have this size in its last dimension.
n_heads (int, optional) – Number of attention heads. Must evenly divide mod_dim. Defaults to 1.
bias (bool, optional) – Whether to add learnable bias vectors in the attention projections. Defaults to True.
dropout (float, optional) – The amount of dropout to apply to the attention weights as well as to the mixed-features output. Defaults to 0.
skip (bool, optional) – Whether to add a residual connection around the feature mixing. Defaults to True.
keep_dim (bool, optional) – Whether to keep the next-to-last dimension of the output tensor as 1 or squeeze it. Defaults to False.
device (str or torch.device, optional) – Torch device to first create the mixer on. Defaults to “cpu”.
dtype (torch.dtype, optional) – Torch dtype to first create the mixer in. Defaults to torch.float.
- property device
The device all weights, biases, activations, etc. reside on.
- property dtype
The dtype of all weights, biases, activations, and parameters.
- forward(inp, mask=None)[source]
Forward pass for combining multiple stacked feature vectors.
- Parameters:
inp (Tensor) – Feature vectors stacked into a tensor of at least 2 dimensions. The size of the next-to-last dimension is the number of features, n_features, to combine. The last dimension (of size mod_dim) is expected to contain the feature vectors.
mask (Tensor or None, optional) – Padding mask with its last dimension of size n_features. For a binary mask, True values indicate that the corresponding feature will be ignored. For a float mask, the value will be directly added to the corresponding attention-key value. Defaults to None.
- Returns:
Depending on keep_dim, the output tensor has the same number of dimensions as inp or one fewer. The next-to-last dimension is either 1 or dropped. The last dimension (of size mod_dim) contains the attention-pooled combination of all feature vectors.
- Return type:
Tensor
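Self-attention followed by averaging over the feature dimension can be sketched like this. torch.nn.MultiheadAttention is used as a stand-in for the internal attention layer, which is an assumption; dropout, masking, and the residual connection are left out.

```python
import torch
from torch import nn

torch.manual_seed(0)
batch, n_features, mod_dim = 4, 3, 8

# Hypothetical stand-in for the mixer's attention layer.
attention = nn.MultiheadAttention(mod_dim, num_heads=2, batch_first=True)

# Stacked feature vectors: (batch, n_features, mod_dim).
inp = torch.randn(batch, n_features, mod_dim)

# Self-attention: the features act as queries, keys, and values at once.
attended, weights = attention(inp, inp, inp)

# Average the attended representations over the feature dimension.
pooled = attended.mean(dim=-2)

print(attended.shape)  # torch.Size([4, 3, 8])
print(weights.shape)   # torch.Size([4, 3, 3])
print(pooled.shape)    # torch.Size([4, 8])
```

The attention matrix has one row per query position, so the cost grows quadratically with the number of features.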
- importance(inp, mask=None)[source]
Per-instance weights in the attention-based combination of features.
- Parameters:
inp (Tensor) – Feature vectors stacked into a tensor of at least 2 dimensions. The size of the next-to-last dimension is the number of features, n_features, to combine. The last dimension (of size mod_dim) is expected to contain the feature vectors.
mask (Tensor or None, optional) – Padding mask with its last dimension of size n_features. For a binary mask, True values indicate that the corresponding feature will be ignored. For a float mask, the value will be directly added to the corresponding attention-key value. Defaults to None.
- Returns:
Attention weights averaged over query positions. The output tensor has one fewer dimension than inp, with the last dimension dropped.
- Return type:
Tensor
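Averaging over query positions reduces the square attention matrix to one score per feature. A small worked example with a made-up attention matrix for a single instance, assuming rows are query positions and columns are the attended features (keys):

```python
import torch

# Made-up attention matrix, shape (1, n_features, n_features).
# Each row sums to 1 over the attended features.
attn = torch.tensor([[[0.6, 0.3, 0.1],
                      [0.2, 0.5, 0.3],
                      [0.1, 0.1, 0.8]]])

# Mean over the query dimension gives how much attention each feature
# receives on average: approximately [[0.3, 0.3, 0.4]].
importance = attn.mean(dim=-2)

print(importance.shape)  # torch.Size([1, 3])
```

Because every row sums to 1, the averaged scores also sum to 1 for each instance.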
- property mod_dim
The embedding size.