weighted

Combine feature embeddings through weighted sums.

Depending on whether these weights are themselves learnable and on whether or not they depend on the input features, some type of feature importance can be provided.

class ActivatedSumMixer(mod_dim, n_features, activate=<function identity>, **kwargs)[source]

Bases: Resettable

Combine stacked feature vectors by a per-instance linear combination.

The per-instance coefficients sum to 1 for each data point and can thus be seen as some sort of per-instance feature importance. They are obtained by concatenating all features into a single, wide vector, linearly projecting down to a vector with the same number of elements as there are features to combine, optionally activating, and then applying a softmax.

Parameters:
  • mod_dim (int) – Size of the feature space. The input tensor is expected to be of that size in its last dimension and the output will again have this size in its last dimension.

  • n_features (int) – The number of features to combine. Must be equal to the size of the next-to-last dimension of the input tensor.

  • activate (Module or function, optional) – The activation function to be applied after (linearly) mixing the concatenated features. Must be a callable that accepts a tensor as sole argument, like a module from torch.nn or a function from torch.nn.functional, depending on whether it needs to be further parameterized or not. Defaults to identity, resulting in no non-linear activation whatsoever.

  • **kwargs – Keyword arguments are passed on to the linear layer.
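
The following is a minimal sketch, in plain PyTorch, of the weighting scheme described above. It mirrors the docstring, not necessarily the class's actual implementation, and all sizes are made up for illustration:

import torch

# Concatenate the stacked features, project down to one coefficient per
# feature, (optionally) activate, and softmax into per-instance weights.
batch, n_features, mod_dim = 32, 4, 64
inp = torch.randn(batch, n_features, mod_dim)

project = torch.nn.Linear(n_features * mod_dim, n_features)
coeffs = torch.softmax(project(inp.flatten(-2)), dim=-1)  # (batch, n_features), rows sum to 1
mixed = (coeffs.unsqueeze(-1) * inp).sum(dim=-2)          # (batch, mod_dim)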

forward(inp)[source]

Forward pass for combining multiple stacked feature vectors.

Parameters:

inp (Tensor) – Feature vectors stacked into a tensor of at least 2 dimensions. The size of the next-to-last dimension is expected to match the n_features provided at instantiation. The last dimension (of size mod_dim) is expected to contain the feature vectors.

Returns:

The output tensor has one fewer dimension than the input. The next-to-last dimension is dropped and the last dimension now contains the per-instance (normed) linear combination of all feature vectors.

Return type:

Tensor
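
A hypothetical usage sketch of the expected shapes (the import path and the concrete sizes are assumptions, and the instance is assumed to be callable like any torch.nn.Module):

import torch

mixer = ActivatedSumMixer(mod_dim=64, n_features=4)
inp = torch.randn(32, 4, 64)  # (batch, n_features, mod_dim)
out = mixer(inp)              # (32, 64): the next-to-last dimension is dropped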

importance(inp)[source]

Per-instance weights in the normed linear combination of features.

Parameters:

inp (Tensor) – Feature vectors stacked into a tensor of at least 2 dimensions. The size of the next-to-last dimension is expected to match the n_features provided at instantiation. The last dimension (of size mod_dim) is expected to contain the feature vectors.

Returns:

The output tensor has one fewer dimension than the input, with the last dimension being dropped.

Return type:

Tensor
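
A hypothetical sketch of the shapes returned by importance (the import path and the concrete sizes are assumptions):

import torch

mixer = ActivatedSumMixer(mod_dim=64, n_features=4)
inp = torch.randn(32, 4, 64)     # (batch, n_features, mod_dim)
weights = mixer.importance(inp)  # (32, 4): one weight per feature and data point
sums = weights.sum(dim=-1)       # each entry is (numerically close to) 1.0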

new(mod_dim=None, n_features=None, activate=None, **kwargs)[source]

Return a fresh instance with the same or updated parameters.

Parameters:
  • mod_dim (int, optional) – Size of the feature space. The input tensor is expected to be of that size in its last dimension and the output will again have this size in its last dimension. Overwrites the mod_dim of the current instance if given. Defaults to None.

  • n_features (int, optional) – The number of features to combine. Must be equal to the size of the next-to-last dimension of the input tensor. Overwrites n_features of the current instance if given. Defaults to None.

  • activate (Module or function, optional) – The activation function to be applied after (linearly) mixing the concatenated features. Must be a callable that accepts a tensor as sole argument, like a module from torch.nn or a function from torch.nn.functional, depending on whether it needs to be further parameterized or not. Overwrites the activate of the current instance if given. Defaults to None.

  • **kwargs – Additional keyword arguments are merged into the keyword arguments of the current instance and are then passed through to the linear layer together.

Returns:

A fresh, new instance of itself.

Return type:

ActivatedSumMixer

reset_parameters()[source]

Re-initialize all internal parameters.

class ConstantSumMixer(n_features)[source]

Bases: Resettable

Combine stacked feature vectors by simply adding them.

The sum is then “normed” by dividing by the number of features.

Parameters:

n_features (int) – The number of features to combine. Must be equal to the size of the next-to-last dimension of the input tensor.
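
In effect, this amounts to averaging over the feature axis. A minimal equivalent in plain PyTorch (sizes are made up; this mirrors the description, not necessarily the class's code):

import torch

inp = torch.randn(32, 4, 64)  # (batch, n_features, mod_dim)
mixed = inp.sum(dim=-2) / 4   # same as inp.mean(dim=-2) -> (32, 64)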

forward(inp)[source]

Add stacked feature vectors with constant and equal weights.

Parameters:

inp (Tensor) – Feature vectors stacked into a tensor of at least 2 dimensions. The size of the next-to-last dimension is expected to match the n_features provided at instantiation. The last dimension is expected to contain the feature vectors themselves.

Returns:

The output tensor has one fewer dimension than the input. The next-to-last dimension is dropped and the last dimension now contains the (normed) sum of all feature vectors.

Return type:

Tensor

importance(inp)[source]

Constant feature weights in the normed sum over all features.

Parameters:

inp (Tensor) – Feature vectors stacked into a tensor of at least 2 dimensions. The size of the next-to-last dimension is expected to match the n_features provided at instantiation. The last dimension is expected to contain the feature vectors themselves.

Returns:

The output tensor has one fewer dimension than the input, with the last dimension being dropped.

Return type:

Tensor

new(n_features=None)[source]

Return a fresh instance with the same or updated parameters.

Parameters:

n_features (int, optional) – The number of features to combine. Must be equal to the size of the next-to-last dimension of the input tensor. Overwrites n_features of the current instance if given. Defaults to None.

Returns:

A fresh, new instance of itself.

Return type:

ConstantSumMixer

reset_parameters()[source]

Does nothing because there are no internal parameters to reset.

class GatedResidualSumMixer(mod_dim, n_features, activate=ELU(alpha=1.0), gate=Sigmoid(), drop=Dropout(p=0.0, inplace=False), **kwargs)[source]

Bases: Resettable

Combine stacked feature vectors by a per-instance linear combination.

The per-instance coefficients sum to 1 for each data point and can thus be seen as some sort of per-instance feature importance. They are obtained by concatenating all features into a single, wide vector and then passing it through a Gated Residual Network (GRN) [1], such that the size of the output equals the number of features to combine.

Parameters:
  • mod_dim (int) – Size of the feature space. The input tensor is expected to be of that size in its last dimension and the output will again have this size in its last dimension.

  • n_features (int) – The number of features to combine. Must be equal to the size of the next-to-last dimension of the input tensor.

  • activate (Module or function, optional) – The activation function to be applied after the (linear) transformation, but prior to gating. Must be a callable that accepts a tensor as sole argument, like a module from torch.nn or a function from torch.nn.functional, depending on whether it needs to be further parameterized or not. Defaults to an ELU activation.

  • gate (Module or function, optional) – The activation function to be applied to half of the (non-linearly) transformed input before multiplying with the other half. Must be a callable that accepts a tensor as sole argument, like a module from torch.nn or a function from torch.nn.functional, depending on whether it needs to be further parameterized or not. Defaults to a sigmoid.

  • drop (Module, optional) – Typically an instance of Dropout or AlphaDropout. Defaults to Dropout(p=0.0), resulting in no dropout being applied.

  • **kwargs – Additional keyword arguments to pass through to the linear layers.
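
A hypothetical usage sketch (the import path and the concrete sizes are assumptions, and the instance is assumed to be callable like any torch.nn.Module):

import torch

mixer = GatedResidualSumMixer(
    mod_dim=64,
    n_features=4,
    drop=torch.nn.Dropout(p=0.1),  # dropout applied to the activations inside the GRN
)
inp = torch.randn(32, 4, 64)     # (batch, n_features, mod_dim)
out = mixer(inp)                 # (32, 64)
weights = mixer.importance(inp)  # (32, 4), rows sum to 1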

Note

This implementation is inspired by how features are combined in Temporal Fusion Transformers [1], but it is not quite the same. Firstly, the inputs are not the raw feature embeddings but the final feature embeddings to be linearly combined. Secondly, the intermediate linear layer (Eq. 3) is eliminated and dropout is applied directly to the activations after the first layer. Finally, the layer norm (Eq. 2) is omitted because normalizing right before passing through a softmax seems unnecessary.

References

[1] Lim, B., Arık, S. Ö., Loeff, N., & Pfister, T. (2021). Temporal Fusion Transformers for interpretable multi-horizon time series forecasting. International Journal of Forecasting, 37(4), 1748–1764.

forward(inp)[source]

Forward pass for combining multiple stacked feature vectors.

Parameters:

inp (Tensor) – Feature vectors stacked into a tensor of at least 2 dimensions. The size of the next-to-last dimension is expected to match the n_features provided at instantiation. The last dimension (of size mod_dim) is expected to contain the feature vectors.

Returns:

The output tensor has one fewer dimension than the input. The next-to-last dimension is dropped and the last dimension now contains the per-instance (normed) linear combination of all feature vectors.

Return type:

Tensor

importance(inp)[source]

Per-instance weights in the normed linear combination of features.

Parameters:

inp (Tensor) – Feature vectors stacked into a tensor of at least 2 dimensions. The size of the next-to-last dimension is expected to match the n_features provided at instantiation. The last dimension (of size mod_dim) is expected to contain the feature vectors.

Returns:

The output tensor has one fewer dimension than the input, with the last dimension being dropped.

Return type:

Tensor

new(mod_dim=None, n_features=None, activate=None, gate=None, drop=None, **kwargs)[source]

Return a fresh instance with the same or updated parameters.

Parameters:
  • mod_dim (int, optional) – Size of the feature space. The input tensor is expected to be of that size in its last dimension and the output will again have this size in its last dimension. Overwrites the mod_dim of the current instance if given. Defaults to None.

  • n_features (int, optional) – The number of features to combine. Must be equal to the size of the next-to-last dimension of the input tensor. Overwrites n_features of the current instance if given. Defaults to None.

  • activate (Module or function, optional) – The activation function to be applied after the (linear) transformation, but prior to gating. Must be a callable that accepts a tensor as sole argument, like a module from torch.nn or a function from torch.nn.functional, depending on whether it needs to be further parameterized or not. Overwrites the activate of the current instance if given. Defaults to None.

  • gate (Module or function, optional) – The activation function to be applied to half of the (linearly) transformed input before multiplying with the other half. Must be a callable that accepts a tensor as sole argument, like a module from torch.nn or a function from torch.nn.functional. Overwrites the gate of the current instance if given. Defaults to None.

  • drop (Module, optional) – Typically an instance of Dropout or AlphaDropout. Overwrites the drop of the current instance if given. Defaults to None.

  • **kwargs – Additional keyword arguments are merged into the keyword arguments of the current instance and are then passed through to the linear layers together.

Returns:

A fresh, new instance of itself.

Return type:

GatedResidualSumMixer

reset_parameters()[source]

Re-initialize all internal parameters.

class GatedSumMixer(mod_dim, n_features, gate=Sigmoid(), **kwargs)[source]

Bases: Resettable

Combine stacked feature vectors by a per-instance linear combination.

The per-instance coefficients sum to 1 for each data point and can thus be seen as some sort of per-instance feature importance. They are obtained by concatenating all features into a single, wide vector, and linearly projecting it down to a vector with twice as many elements as there are features. One half is then passed through an (optional) activation function to gate the other half, thus reducing the final output back down to the number of features to combine.

Parameters:
  • mod_dim (int) – Size of the feature space. The input tensor is expected to be of that size in its last dimension and the output will again have this size in its last dimension.

  • n_features (int) – The number of features to combine. Must be equal to the size of the next-to-last dimension of the input tensor.

  • gate (Module or function, optional) – The activation function to be applied to half of the (linearly) transformed inputs before multiplying with the other half. Must be a callable that accepts a tensor as sole argument, like a module from torch.nn or a function from torch.nn.functional. Defaults to a sigmoid.

  • **kwargs – Keyword arguments are passed on to the linear layer.
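
A minimal sketch of the described gating in plain PyTorch. It follows the description above, not necessarily the class's actual code; which half gates which, and the final normalization, are assumptions, and the sizes are made up:

import torch

batch, n_features, mod_dim = 32, 4, 64
inp = torch.randn(batch, n_features, mod_dim)

# Project the concatenated features to 2 * n_features and let one half,
# passed through a sigmoid, gate the other half.
project = torch.nn.Linear(n_features * mod_dim, 2 * n_features)
raw, gate = project(inp.flatten(-2)).chunk(2, dim=-1)      # two (batch, n_features) halves
coeffs = torch.softmax(raw * torch.sigmoid(gate), dim=-1)  # rows sum to 1
mixed = (coeffs.unsqueeze(-1) * inp).sum(dim=-2)           # (batch, mod_dim)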

forward(inp)[source]

Forward pass for combining multiple stacked feature vectors.

Parameters:

inp (Tensor) – Feature vectors stacked into a tensor of at least 2 dimensions. The size of the next-to-last dimension is expected to match the n_features provided at instantiation. The last dimension (of size mod_dim) is expected to contain the feature vectors.

Returns:

The output tensor has one fewer dimension than the input. The next-to-last dimension is dropped and the last dimension now contains the per-instance (normed) linear combination of all feature vectors.

Return type:

Tensor

importance(inp)[source]

Per-instance weights in the normed linear combination of features.

Parameters:

inp (Tensor) – Feature vectors stacked into a tensor of at least 2 dimensions. The size of the next-to-last dimension is expected to match the n_features provided at instantiation. The last dimension (of size mod_dim) is expected to contain the feature vectors.

Returns:

The output tensor has one fewer dimension than the input, with the last dimension being dropped.

Return type:

Tensor

new(mod_dim=None, n_features=None, gate=None, **kwargs)[source]

Return a fresh instance with the same or updated parameters.

Parameters:
  • mod_dim (int, optional) – Size of the feature space. The input tensor is expected to be of that size in its last dimension and the output will again have this size in its last dimension. Overwrites the mod_dim of the current instance if given. Defaults to None.

  • n_features (int, optional) – The number of features to combine. Must be equal to the size of the next-to-last dimension of the input tensor. Overwrites n_features of the current instance if given. Defaults to None.

  • gate (Module or function, optional) – The activation function to be applied to half of the (linearly) transformed input before multiplying with the other half. Must be a callable that accepts a tensor as sole argument, like a module from torch.nn or a function from torch.nn.functional. Overwrites the gate of the current instance if given. Defaults to None.

  • **kwargs – Additional keyword arguments are merged into the keyword arguments of the current instance and are then passed through to the linear layer together.

Returns:

A fresh, new instance of itself.

Return type:

GatedSumMixer

reset_parameters()[source]

Re-initialize all internal parameters.

class VariableSumMixer(n_features)[source]

Bases: Resettable

Combine stacked feature vectors through a learnable linear combination.

Specifically, a single, global set of linear-combination coefficients is learned. These coefficients sum to 1 and can thus be seen as some sort of feature importance.

Parameters:

n_features (int) – The number of features to combine. Must be equal to the size of the next-to-last dimension of the input tensor.
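
A minimal sketch of a global, learnable linear combination in plain PyTorch. It follows the description above; how the class actually normalizes its coefficients to sum to 1 is an assumption (a softmax here), and the sizes are made up:

import torch

n_features, mod_dim = 4, 64
logits = torch.nn.Parameter(torch.zeros(n_features))  # one global, learnable weight per feature

inp = torch.randn(32, n_features, mod_dim)        # (batch, n_features, mod_dim)
coeffs = torch.softmax(logits, dim=-1)            # (n_features,), sums to 1
mixed = (coeffs.unsqueeze(-1) * inp).sum(dim=-2)  # (32, mod_dim)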

forward(inp)[source]

Linearly combine stacked feature vectors with global coefficients.

Parameters:

inp (Tensor) – Feature vectors stacked into a tensor of at least 2 dimensions. The size of the next-to-last dimension is expected to match the n_features provided at instantiation. The last dimension is expected to contain the feature vectors themselves.

Returns:

The output tensor has one fewer dimension than the input. The next-to-last dimension is dropped and the last dimension now contains the (normed) linear combination of all feature vectors.

Return type:

Tensor

importance(inp)[source]

Learned, global feature weights in the normed sum over all features.

Parameters:

inp (Tensor) – Feature vectors stacked into a tensor of at least 2 dimensions. The size of the next-to-last dimension is expected to match the n_features provided at instantiation. The last dimension is expected to contain the feature vectors themselves.

Returns:

The output tensor has one fewer dimension than the input, with the last dimension being dropped.

Return type:

Tensor

new(n_features=None)[source]

Return a fresh instance with the same or updated parameters.

Parameters:

n_features (int, optional) – The number of features to combine. Must be equal to the size of the next-to-last dimension of the input tensor. Overwrites n_features of the current instance if given. Defaults to None.

Returns:

A fresh, new instance of itself.

Return type:

VariableSumMixer

reset_parameters()[source]

Re-initialize the coefficients for the linear combination.