models

Building blocks for a variety of causal, transformer-based models.

class Sinusoidal(mod_dim, context, device='cpu', dtype=torch.float32, **_)[source]

Bases: Block

Sinusoidal positional encodings for transformer-based sequence models.

Parameters:

mod_dim (int) – The model dimension. Inputs are expected to be of that size in their last dimension.
context (int) – The maximum sequence length that can be processed. Inputs are expected to not exceed this size in their next-to-last dimension.
device (str or device, optional) – Torch device to first create the sinusoidal positional encodings on. Defaults to “cpu”.
dtype (dtype, optional) – Torch dtype to first create the sinusoidal positional encodings in. Defaults to torch.float.

property device: Device that the sinusoidal positional encodings reside on.

property dtype: Dtype of the sinusoidal positional encodings.

forward(src)[source]

Add sinusoidal positional encodings to a sequence of embeddings.

Parameters: src (Tensor) – Input sequence(s). Must be of dimensions (…, S, mod_dim), where the sequence length S must not exceed context.
Returns: The input sequence(s) with sinusoidal positional encodings added.
Return type: Tensor

new()[source]: Return a fresh, new instance with exactly the same parameters.

reset_parameters()[source]: Does nothing because there are no internal parameters to reset.

class Learnable(mod_dim, context, device='cpu', dtype=torch.float32, **_)[source]

Bases: Block

Learnable positional encodings for transformer-based sequence models.

Parameters:

mod_dim (int) – The model dimension. Inputs are expected to be of that size in their last dimension.
context (int) – The maximum sequence length that can be processed. Inputs are expected to not exceed this size in their next-to-last dimension.
device (str or device, optional) – Torch device to create the learnable positional encodings on. Defaults to “cpu”.
dtype (dtype, optional) – Torch dtype of the learnable positional encodings. Defaults to torch.float.

Note

Make sure that the context reflects the maximum length of the sequences that your model sees at training time. In contrast to other types of positional encodings, which can reasonably be expected to generalize well beyond that during inference, positions that have never been encountered during training cannot be encoded at all with Learnable. Consequently, the user chat history can only be attended to up until that length.

See also

Sinusoidal, Rotary

property device: Device that the learnable positional encodings reside on.

property dtype: Dtype of the learnable positional encodings.

forward(src)[source]

Add learnable positional encodings to a sequence of embeddings.

Parameters: src (Tensor) – Input sequence(s). Must be of dimensions (…, S, mod_dim), where the sequence length S must not exceed context.
Returns: The input sequence(s) with positional encodings added.
Return type: Tensor

new()[source]: Return a fresh, new instance with exactly the same parameters.

reset_parameters()[source]: Re-initialize the learnable positional encodings.

class Rotary(mod_dim, context, n_heads, device='cpu', dtype=torch.float32, **_)[source]

Bases: Block

Rotary positional encodings for multi-head attention in sequence models.

Parameters:

mod_dim (int) – The model dimension. Each vector in the original sequence is expected to be of that dimension.
context (int) – The maximum sequence length that can be processed. Inputs are expected to not exceed this size in their next-to-last dimension.
n_heads (int) – The number of attention heads. Must integer divide mod_dim and the result must still be an even number.
device (str or device, optional) – Torch device to first create the rotary positional encodings on. Defaults to “cpu”.
dtype (dtype, optional) – Torch dtype to first create the rotary positional encodings in. Defaults to torch.float.

Raises:

ValueError – If n_heads does not integer divide mod_dim or if the result is not an even number.

property device: Device that the rotary positional encodings reside on.

property dtype: Dtype of the rotary positional encodings.

forward(src)[source]

Apply rotary positional encodings across all heads of the input.

Parameters: src (Tensor) – Input sequence(s) for all heads. Must be of dimensions (…, n_heads, S, head_dim), where the sequence length S must not exceed context and head_dim is the mod_dim divided by n_heads.
Returns: The input sequence(s) with rotary positional encodings applied to all heads.
Return type: Tensor

property head_dim: The dimension of each attention head.

new()[source]: Return a fresh, new instance with exactly the same parameters.

reset_parameters()[source]: Does nothing because there are no internal parameters to reset.

class SelfAttention(mod_dim, n_heads, bias=False, dropout=0.1, pos_enc=Identity(), device='cpu', dtype=torch.float32)[source]

Bases: Block

Multi-headed self attention with optional (rotary) positional encodings.

Parameters:

mod_dim (int) – The model dimension. Inputs are expected to be of that size in their last dimension.
n_heads (int) – The number of attention heads. Must integer divide mod_dim and the result must still be an even number.
bias (bool, optional) – Whether to add learnable bias vectors in the projections from input to query, key and value and in the final out projection. Defaults to False.
dropout (float, optional) – Apply dropout to the attention weights with this probability during training. Defaults to 0.1.
pos_enc (Block, optional) –
PyTorch Module that
- has a reset_parameters() method,
- has a new() method to make fresh copies of itself,
- has a context attribute specifying the maximum sequence length,
- processes tensors with dimensions (…, n_heads, S, head_dim),
where S is the sequence length, and head_dim is the mod_dim divided by n_heads. If given, it will be called on queries and keys. Typically, this would be an instance of Rotary positional encodings. Defaults to an instance of Identity, which does nothing.
device (str or device, optional) – Torch device to compute self attention on. Defaults to “cpu”.
dtype (dtype, optional) – Torch dtype to compute self attention in. Defaults to torch.float.

Raises:

ValueError – If n_heads does not integer divide mod_dim.

See also

Rotary

property context: Maximum context length of the positional encodings, if present.

property device: Device to compute self attention on.

property dtype: Dtype to compute self attention in.

forward(src, mask=None, is_causal=True)[source]

Forward pass through multi-headed self attention.

Parameters:

src (Tensor) – Input sequence(s) of dimensions (…, S, mod_dim), with sequence length S.
mask (Tensor, optional) – Attention mask with a shape broadcastable to the shape of the attention weights (…, S, S). Two types of masks are supported: A boolean mask where a value of True indicates that the element should take part in attention or a float mask of the same dtype as src that is added to the product of queries and keys, before taking the softmax. In the latter case, a value of 0.0 (resulting in unchanged attention weights) indicates that an element should take part in the attention and a value of “-inf” (resulting in a zero attention weight) that it should not. Defaults to None.
is_causal (bool, optional) – If set to True, inputs are masked with an S x S lower triangular matrix and mask is ignored. Defaults to True.

Returns:

The output has the same shape as the input.

Return type:

Tensor

Important

In adhering to the convention of the scaled_dot_product_attention, the meaning of True and False (attend to and not attend to, respectively) in boolean attention masks is exactly the opposite of what it means in the MultiheadAttention. Therefore, to stay compatible, use float masks!
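To stay on the safe side, a boolean mask in the convention used here (True means attend) can be converted to a float mask before it is passed on. The following plain-Python sketch shows the conversion together with the effect of a causal float mask on the softmax. The helper names are hypothetical and nested lists stand in for tensors.

```python
import math

NEG_INF = float("-inf")


def bool_to_float_mask(mask):
    """True (attend) becomes 0.0 and False (do not attend) becomes -inf."""
    return [[0.0 if keep else NEG_INF for keep in row] for row in mask]


def causal_mask(seq_len):
    """Float mask letting position i attend to positions j <= i only."""
    return [[0.0 if j <= i else NEG_INF for j in range(seq_len)] for i in range(seq_len)]


def masked_softmax(scores, mask):
    """Add the float mask to raw scores, then normalize each row."""
    weights = []
    for score_row, mask_row in zip(scores, mask):
        shifted = [s + m for s, m in zip(score_row, mask_row)]
        top = max(shifted)
        exps = [math.exp(s - top) for s in shifted]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


scores = [[0.0, 0.0, 0.0] for _ in range(3)]
weights = masked_softmax(scores, causal_mask(3))
```

Entries masked with -inf end up with an attention weight of exactly zero, while entries masked with 0.0 share the remaining probability mass.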

property has_pos_enc: Whether a pos_enc module was provided at instantiation or not.

property head_dim: The dimension of each attention head.

new()[source]: Return a fresh, new instance with exactly the same parameters.

reset_parameters()[source]: Reset the internal parameters of the projections and pos_enc.

property scale: The scaling factor for the per-head attention weights.

class EncoderLayer(attention, feed_forward, pos_enc=Identity(), bias=True, dropout=0.1, norm_cls='layer', norm_first=True, eps=1e-05, device='cpu', dtype=torch.float32, **_)[source]

Bases: Block

Encoder layer (i.e., self-attention only) to use in a transformer.

Parameters:

attention (SelfAttention) – A suitably parameterized instance of SelfAttention.
feed_forward (Block) –
PyTorch Module that
- has a reset_parameters() method,
- has a new() method to make fresh copies of itself,
- processes tensors with dimensions (…, S, D),
where S is the sequence length and D is the model dimension specified in the attention.
pos_enc (Block, optional) –
PyTorch Module that
- has a reset_parameters() method,
- has a new() method to make fresh copies of itself,
- has a context attribute specifying the maximum sequence length,
- processes tensors with dimensions (…, S, D),
where S is the sequence length and D is the model dimension specified in the attention. If given, it will be called on the input tensor first thing. Typically, this would be an instance of Sinusoidal or Learnable positional encodings. Defaults to an instance of Identity, which does nothing.
bias (bool, optional) – Whether to use a bias in the LayerNorm components. Defaults to True.
dropout (float, optional) – Fraction of dropout to apply after self-attention and feed-forward. Defaults to 0.1.
norm_cls (str, optional) – Which type of norm to use between (sub-)layers. Defaults to “layer”, but can also be “rms”.
norm_first (bool, optional) – Whether to normalize inputs to attention and feed-forward or the sum of respective inputs and outputs. Defaults to True.
eps (float, optional) – Add this value to the denominator in the LayerNorm components. Defaults to 1e-5.
device (str or device, optional) – Torch device to first create the encoder layer on. Defaults to “cpu”.
dtype (dtype, optional) – Torch dtype to first create the layer in. Defaults to torch.float.
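The two options for norm_cls differ in whether activations are centered before they are rescaled. The sketch below shows both on a single vector in plain Python, without the learnable weight and bias that the actual modules may carry. That eps enters the RMS variant in the same way is an assumption.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Subtract the mean, then divide by sqrt(variance + eps)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def rms_norm(vec, eps=1e-5):
    """Divide by the root mean square without centering first."""
    mean_sq = sum(x * x for x in vec) / len(vec)
    return [x / math.sqrt(mean_sq + eps) for x in vec]


centered = layer_norm([1.0, 2.0, 3.0, 4.0])
rescaled = rms_norm([3.0, 4.0])
```

In both cases, eps only guards against division by zero for (nearly) constant or all-zero inputs.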

property bias_kwarg: Extra keyword ‘bias’ for LayerNorm components, if requested.

property context: Maximum context length given by the positional encodings.

property device: The device all weights, biases, activations, etc. reside on.

property dtype: The dtype of all weights, biases, activations, and parameters.

forward(src, mask=None, is_causal=True)[source]

Forward pass of one encoder layer (i.e., with self-attention only).

Parameters:

src (Tensor) – Input sequence(s) of dimensions (…, S, D), with sequence length S and model dimension D.
mask (Tensor, optional) – Attention mask with a shape broadcastable to the shape of the attention weights (…, S, S). Two types of masks are supported: A boolean mask where a value of True indicates that the element should take part in attention or a float mask of the same dtype as src that is added to the product of queries and keys, before taking the softmax. In the latter case, a value of 0.0 (resulting in unchanged attention weights) indicates that an element should take part in the attention and a value of “-inf” (resulting in a zero attention weight) that it should not. Defaults to None.
is_causal (bool, optional) – If set to True, inputs are masked with an S x S lower triangular matrix and mask is ignored. Defaults to True.

Returns:

The output has the same shape as the input.

Return type:

Tensor

Important

In adhering to the convention of the scaled_dot_product_attention, the meaning of True and False (attend to and not attend to, respectively) in boolean attention masks is exactly the opposite of what it means in the Transformer. Therefore, to stay compatible, use float masks!

property has_pos_enc: Whether positional encodings are applied.

property mod_dim: The model dimension.

new()[source]: Return a fresh, new instance with exactly the same parameters.

reset_parameters()[source]: Reset all internal parameters of the layer.

class Encoder(vocab, layer, n_layers=2, pad_id=0, pos_enc=Identity(), dropout=0.1, scale_grad_by_freq=True, device='cpu', dtype=torch.float32)[source]

Bases: Resettable

Flexible transformer encoder for natural language modeling.

Parameters:

vocab (int) – The vocabulary size of the tokenizer, i.e., the highest possible token id plus one.
layer (EncoderLayer) – A suitably parameterized instance of EncoderLayer.
n_layers (int, optional) – How often the layer is repeated in the transformer stack. Defaults to 2, but must be at least 1.
pad_id (int, optional) – The id of the padding token. Defaults to 0.
pos_enc (Resettable, optional) –
PyTorch Module that
- has a reset_parameters() method,
- has a context attribute specifying the maximum sequence length,
- processes tensors with dimensions (…, S, D),
where S is the sequence length and D is the model dimension specified in the layer. If given, it will be called on the input tensor first thing. Typically, this would be an instance of Sinusoidal or Learnable positional encodings. Defaults to an instance of Identity, which does nothing.
dropout (float, optional) – Apply dropout to the sum of token embedding and positional encodings with this probability during training. Defaults to 0.1.
scale_grad_by_freq (bool, optional) – Whether to scale the gradients on the token embeddings by the inverse frequency of their occurrence. Defaults to True.
device (str or device, optional) – Torch device to first create the transformer on. Defaults to “cpu”.
dtype (dtype, optional) – Torch dtype to first create the transformer encoder stack in. Defaults to torch.float.

Raises:

TypeError – If neither the encoder itself nor the layer applies any positional encodings.

property context: Maximum context length permitted by the positional encodings.

property device: The device of all weights, biases, activations, etc. reside on.

property dtype: The dtype of all weights, biases, activations, and parameters.

forward(src, attn_mask=None, src_mask=None, is_causal=True)[source]

Forward pass through the transformer encoder with optional masking.

Parameters:

src (Tensor) – Input sequence(s) of token indices. Must be of dtype int64 (=long). Expected dimensions are (…, S), with S the sequence length.
attn_mask (Tensor, optional) – Floating-point attention mask with a shape broadcastable to the shape of the attention weights (…, S, S) to be added to the product of queries and keys, before taking the softmax. A value of 0.0 (resulting in unchanged attention weights) indicates that an element should be attended to and a value of “-inf” (resulting in a zero attention weight) that it should not be attended to. Defaults to None.
src_mask (Tensor, optional) – Floating-point attention mask with a shape broadcastable to the shape of src (…, S). A value of 0.0 indicates that an element should be attended to and a value of “-inf” that it should not be attended to. Defaults to None.
is_causal (bool, optional) – If set to True, inputs are masked with a causal S x S triangular matrix (as produced by generate_square_subsequent_mask) and both attn_mask and src_mask are ignored. Defaults to True.

Returns:

Unnormalized logits for the next token at each position, with dimensions (…, vocab, S), where S is again the sequence length.

Return type:

Tensor
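Note that the vocabulary dimension comes before the sequence dimension in the output. This is also the layout that PyTorch's cross_entropy expects for multi-dimensional inputs, so logits and integer targets of shape (…, S) can be compared directly. The plain-Python sketch below computes such a loss for a single sequence; the function name is hypothetical and nested lists stand in for tensors.

```python
import math


def cross_entropy(logits, targets):
    """Mean next-token loss for logits laid out as (vocab, S)."""
    vocab, seq_len = len(logits), len(logits[0])
    total = 0.0
    for s in range(seq_len):
        # Collect the logits over the vocabulary for position s.
        column = [logits[v][s] for v in range(vocab)]
        top = max(column)
        log_norm = top + math.log(sum(math.exp(x - top) for x in column))
        total += log_norm - column[targets[s]]
    return total / seq_len


# Uniform logits over a vocabulary of 4 tokens for a sequence of length 3.
logits = [[0.0, 0.0, 0.0] for _ in range(4)]
loss = cross_entropy(logits, targets=[1, 3, 0])
```

For uniform logits, the loss equals the logarithm of the vocabulary size, which is a useful sanity check for a freshly initialized model.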

Important

Boolean attention masks are not accepted!
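A floating-point src_mask that hides padding can be derived from the token ids themselves. The sketch below does so in plain Python and also shows one plausible way of combining it with an attn_mask by broadcasting over rows. How the encoder combines the two masks internally is an assumption, and the helper names are hypothetical. Keep in mind that both masks are ignored unless is_causal is set to False.

```python
NEG_INF = float("-inf")


def padding_mask(src, pad_id=0):
    """Float mask of shape (S,): 0.0 for real tokens, -inf for padding."""
    return [NEG_INF if token == pad_id else 0.0 for token in src]


def combine(attn_mask, src_mask):
    """Broadcast the (S,) mask over the rows of the (S, S) mask and add them."""
    return [[a + s for a, s in zip(row, src_mask)] for row in attn_mask]


tokens = [5, 7, 0, 0]
src_mask = padding_mask(tokens)
attn_mask = [[0.0] * 4 for _ in range(4)]
combined = combine(attn_mask, src_mask)
```

In the combined mask, the columns belonging to padding tokens are -inf in every row, so no position attends to them.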

property mod_dim: The model dimension.

reset_parameters()[source]: Reset all learnable parameters in all components of the model.

property scale: Square root of model dimension for scaling the input embeddings.