models
The blocks to build a variety of causal, transformer-based models.
- class Sinusoidal(mod_dim, context, device='cpu', dtype=torch.float32, **_)[source]
Bases: Block
Sinusoidal positional encodings for transformer-based sequence models.
- Parameters:
mod_dim (int) – The model dimension. Inputs are expected to be of that size in their last dimension.
context (int) – The maximum sequence length that can be processed. Inputs are expected to not exceed this size in their next-to-last dimension.
device (str or device, optional) – Torch device to first create the sinusoidal positional encodings on. Defaults to “cpu”.
dtype (dtype, optional) – Torch dtype to first create the sinusoidal positional encodings in. Defaults to torch.float.
- property device
Device that the sinusoidal positional encodings reside on.
- property dtype
Dtype of the sinusoidal positional encodings.
- forward(src)[source]
Add sinusoidal positional encodings to a sequence of embeddings.
- Parameters:
src (Tensor) – Input sequence(s). Must be of dimensions (…, S, mod_dim), where the sequence length S must not exceed context.
- Returns:
The input sequence(s) with sinusoidal positional encodings added.
- Return type:
Tensor
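As an illustration of what adding sinusoidal positional encodings amounts to, here is a minimal sketch in plain PyTorch. It assumes the classic sine/cosine formulation with base 10000 and an even mod_dim; the helper names sinusoidal_encodings and add_positions are hypothetical, and the Sinusoidal class itself may differ in its details.

```python
import math
import torch


def sinusoidal_encodings(mod_dim: int, context: int) -> torch.Tensor:
    """Return a (context, mod_dim) table of sinusoidal positional encodings."""
    # Assumption: classic sine/cosine interleaving with an even mod_dim.
    positions = torch.arange(context, dtype=torch.float32).unsqueeze(1)
    # One frequency per pair of channels, geometrically spaced.
    freqs = torch.exp(
        torch.arange(0, mod_dim, 2, dtype=torch.float32)
        * (-math.log(10000.0) / mod_dim)
    )
    table = torch.zeros(context, mod_dim)
    table[:, 0::2] = torch.sin(positions * freqs)  # even channels
    table[:, 1::2] = torch.cos(positions * freqs)  # odd channels
    return table


def add_positions(src: torch.Tensor, table: torch.Tensor) -> torch.Tensor:
    """Add the first S rows of the table to inputs of shape (..., S, mod_dim)."""
    seq_len = src.size(-2)
    if seq_len > table.size(0):
        raise ValueError("Sequence length exceeds the context.")
    return src + table[:seq_len].to(device=src.device, dtype=src.dtype)


table = sinusoidal_encodings(mod_dim=8, context=16)
src = torch.zeros(2, 5, 8)        # (batch, S, mod_dim)
out = add_positions(src, table)   # same shape as src
```

Because the table is computed rather than learned, it can be created once and moved to whatever device and dtype the inputs use.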
- class Learnable(mod_dim, context, device='cpu', dtype=torch.float32, **_)[source]
Bases: Block
Learnable positional encodings for transformer-based sequence models.
- Parameters:
mod_dim (int) – The model dimension. Inputs are expected to be of that size in their last dimension.
context (int) – The maximum sequence length that can be processed. Inputs are expected to not exceed this size in their next-to-last dimension.
device (str or device, optional) – Torch device to create the learnable positional encodings on. Defaults to “cpu”.
dtype (dtype, optional) – Torch dtype of the learnable positional encodings. Defaults to torch.float.
Note
Make sure that the context reflects the maximum length of the sequences that your model sees at training time. In contrast to other types of positional encodings, which can reasonably be expected to generalize well beyond that during inference, positions that have never been encountered during training cannot be encoded at all with Learnable. Consequently, the user chat history can only be attended to up to that length.
- property device
Device that the learnable positional encodings reside on.
- property dtype
Dtype of the learnable positional encodings.
- forward(src)[source]
Add learnable positional encodings to a sequence of embeddings.
- Parameters:
src (Tensor) – Input sequence(s). Must be of dimensions (…, S, mod_dim), where the sequence length S must not exceed context.
- Returns:
The input sequence(s) with positional encodings added.
- Return type:
Tensor
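The limitation described in the note for Learnable can be made concrete with a small sketch. The class LearnablePositions below is a hypothetical stand-in, not the library's implementation; it assumes one trainable vector per position and an arbitrary normal initialization.

```python
import torch
from torch import nn


class LearnablePositions(nn.Module):
    """Hypothetical stand-in: one trainable vector per position."""

    def __init__(self, mod_dim: int, context: int) -> None:
        super().__init__()
        self.context = context
        # Only positions 0 .. context - 1 have a row in this table.
        self.table = nn.Parameter(torch.zeros(context, mod_dim))
        nn.init.normal_(self.table, std=0.02)  # assumed initialization

    def forward(self, src: torch.Tensor) -> torch.Tensor:
        seq_len = src.size(-2)
        if seq_len > self.context:
            raise ValueError(
                f"Sequence length {seq_len} exceeds context {self.context}."
            )
        return src + self.table[:seq_len]


pos_enc = LearnablePositions(mod_dim=8, context=4)
ok = pos_enc(torch.zeros(2, 4, 8))    # S == context works
try:
    pos_enc(torch.zeros(2, 5, 8))     # S > context cannot be encoded
    too_long_failed = False
except ValueError:
    too_long_failed = True
```

There is simply no table row for a position beyond the context, which is why the context has to be chosen with the longest expected sequence in mind.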
- class Rotary(mod_dim, context, n_heads, device='cpu', dtype=torch.float32, **_)[source]
Bases: Block
Rotary positional encodings for multi-head attention in sequence models.
- Parameters:
mod_dim (int) – The model dimension. Each vector in the original sequence is expected to be of that dimension.
context (int) – The maximum sequence length that can be processed. Inputs are expected to not exceed this size in their next-to-last dimension.
n_heads (int) – The number of attention heads. Must divide mod_dim evenly, and the resulting head dimension must itself be an even number.
device (str or device, optional) – Torch device to first create the rotary positional encodings on. Defaults to “cpu”.
dtype (dtype, optional) – Torch dtype to first create the rotary positional encodings in. Defaults to torch.float.
- Raises:
ValueError – If n_heads does not integer divide mod_dim or if the result is not an even number.
- property device
Device that the rotary positional encodings reside on.
- property dtype
Dtype of the rotary positional encodings.
- forward(src)[source]
Apply rotary positional encodings across all heads of the input.
- Parameters:
src (Tensor) – Input sequence(s) for all heads. Must be of dimensions (…, n_heads, S, head_dim), where the sequence length S must not exceed context and head_dim is the mod_dim divided by n_heads.
- Returns:
The input sequence(s) with rotary positional encodings applied to all heads.
- Return type:
Tensor
- property head_dim
The dimension of each attention head.
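To show why the head dimension has to be even and what shape the input takes, here is a minimal sketch of rotary encodings in plain PyTorch. It assumes base 10000 and the "rotate-half" channel pairing, which is only one of several common conventions; the function names are hypothetical and the Rotary class may be implemented differently.

```python
import torch


def rotary_tables(mod_dim: int, context: int, n_heads: int):
    """Return cos/sin tables of shape (context, head_dim)."""
    head_dim, remainder = divmod(mod_dim, n_heads)
    if remainder or head_dim % 2:
        raise ValueError("n_heads must divide mod_dim into an even head_dim.")
    # Assumption: base 10000, one rotation frequency per channel pair.
    freqs = 1.0 / (10000.0 ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(context).float(), freqs)
    angles = torch.cat([angles, angles], dim=-1)  # (context, head_dim)
    return angles.cos(), angles.sin()


def apply_rotary(src, cos, sin):
    """Rotate inputs of shape (..., n_heads, S, head_dim) by their position."""
    seq_len = src.size(-2)
    first, second = src.chunk(2, dim=-1)
    rotated = torch.cat([-second, first], dim=-1)
    return src * cos[:seq_len] + rotated * sin[:seq_len]


cos, sin = rotary_tables(mod_dim=32, context=16, n_heads=4)
queries = torch.randn(2, 4, 10, 8)    # (batch, n_heads, S, head_dim)
rotated = apply_rotary(queries, cos, sin)
```

Channels are rotated in pairs, so an odd head dimension would leave one channel without a partner. The rotation changes the direction of each vector but not its length.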
- class SelfAttention(mod_dim, n_heads, bias=False, dropout=0.1, pos_enc=Identity(), device='cpu', dtype=torch.float32)[source]
Bases: Block
Multi-headed self-attention with optional (rotary) positional encodings.
- Parameters:
mod_dim (int) – The model dimension. Inputs are expected to be of that size in their last dimension.
n_heads (int) – The number of attention heads. Must divide mod_dim evenly, and the resulting head dimension must itself be an even number.
bias (bool, optional) – Whether to add learnable bias vectors in the projections from input to query, key, and value and in the final output projection. Defaults to False.
dropout (float, optional) – Apply dropout to the attention weights with this probability during training. Defaults to 0.1.
pos_enc (Block, optional) – PyTorch Module that has a reset_parameters() method, has a new() method to make fresh copies of itself, has a context attribute specifying the maximum sequence length, and processes tensors with dimensions (…, n_heads, S, head_dim), where S is the sequence length and head_dim is the mod_dim divided by n_heads. If given, it will be called on queries and keys. Typically, this would be an instance of Rotary positional encodings. Defaults to an instance of Identity, which does nothing.
device (str or device, optional) – Torch device to compute self attention on. Defaults to “cpu”.
dtype (dtype, optional) – Torch dtype to compute self attention in. Defaults to torch.float.
- Raises:
ValueError – If n_heads does not integer divide mod_dim.
- property context
Maximum context length of the positional encodings, if present.
- property device
Device to compute self attention on.
- property dtype
Dtype to compute self attention in.
- forward(src, mask=None, is_causal=True)[source]
Forward pass through multi-headed self attention.
- Parameters:
src (Tensor) – Input sequence(s) of dimensions (…, S, mod_dim), with sequence length S.
mask (Tensor, optional) – Attention mask with a shape broadcastable to the shape of the attention weights (…, S, S). Two types of masks are supported: a boolean mask, where a value of True indicates that the element should take part in attention, or a float mask of the same dtype as src that is added to the product of queries and keys before taking the softmax. In the latter case, a value of 0.0 (resulting in unchanged attention weights) indicates that an element should take part in the attention and a value of “-inf” (resulting in a zero attention weight) that it should not. Defaults to None.
is_causal (bool, optional) – If set to True, inputs are masked with an S x S lower triangular matrix and mask is ignored. Defaults to True.
- Returns:
The output has the same shape as the input.
- Return type:
Tensor
Important
In adhering to the convention of scaled_dot_product_attention, the meaning of True and False (attend to and not attend to, respectively) in boolean attention masks is exactly the opposite of what it means in MultiheadAttention. Therefore, to stay compatible, use float masks!
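The two mask conventions can be compared directly with torch.nn.functional.scaled_dot_product_attention (available in PyTorch 2.0 and later). The sketch below uses random tensors in place of real queries, keys, and values and does not involve SelfAttention itself; it only shows that a boolean mask in this convention, its float counterpart, and the is_causal flag describe the same masking.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
q = k = v = torch.randn(1, 2, 4, 8)   # (batch, n_heads, S, head_dim)

# Boolean mask in the scaled_dot_product_attention convention:
# True means "take part in attention". Here: a lower triangular mask.
bool_mask = torch.ones(4, 4, dtype=torch.bool).tril()

# Equivalent float mask: 0.0 where attention is allowed, -inf where not.
float_mask = torch.zeros(4, 4, dtype=q.dtype).masked_fill(
    ~bool_mask, float("-inf")
)

out_bool = F.scaled_dot_product_attention(q, k, v, attn_mask=bool_mask)
out_float = F.scaled_dot_product_attention(q, k, v, attn_mask=float_mask)
out_causal = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# A boolean mask written for MultiheadAttention marks positions to ignore
# with True, so it is the logical inverse of bool_mask above.
mha_style_mask = ~bool_mask
converted = torch.zeros(4, 4, dtype=q.dtype).masked_fill(
    mha_style_mask, float("-inf")
)
```

A float mask carries the same meaning under both conventions, which is why it is the safer choice when code has to work with either.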
- property has_pos_enc
Whether a pos_enc module was provided at instantiation or not.
- property head_dim
The dimension of each attention head.
- property scale
The scaling factor for the per-head attention weights.
- class EncoderLayer(attention, feed_forward, pos_enc=Identity(), bias=True, dropout=0.1, norm_cls='layer', norm_first=True, eps=1e-05, device='cpu', dtype=torch.float32, **_)[source]
Bases: Block
Encoder layer (i.e., self-attention only) to use in a transformer.
- Parameters:
attention (SelfAttention) – A suitably parameterized instance of SelfAttention.
feed_forward (Block) – PyTorch Module that has a reset_parameters() method, has a new() method to make fresh copies of itself, and processes tensors with dimensions (…, S, D), where S is the sequence length and D is the model dimension specified in the attention.
pos_enc (Block, optional) – PyTorch Module that has a reset_parameters() method, has a new() method to make fresh copies of itself, has a context attribute specifying the maximum sequence length, and processes tensors with dimensions (…, S, D), where S is the sequence length and D is the model dimension specified in the attention. If given, it will be called on the input tensor first thing. Typically, this would be an instance of Sinusoidal or Learnable positional encodings. Defaults to an instance of Identity, which does nothing.
bias (bool, optional) – Whether to use a bias in the LayerNorm components. Defaults to True.
dropout (float, optional) – Fraction of dropout to apply after self-attention and feed-forward. Defaults to 0.1.
norm_cls (str, optional) – Which type of norm to use between (sub-)layers. Defaults to “layer”, but can also be “rms”.
norm_first (bool, optional) – Whether to normalize the inputs to attention and feed-forward (True) or the sum of their respective inputs and outputs (False). Defaults to True.
eps (float, optional) – Add this value to the denominator in the LayerNorm components. Defaults to 1e-5.
device (str or device, optional) – Torch device to first create the encoder layer on. Defaults to “cpu”.
dtype (dtype, optional) – Torch dtype to first create the layer in. Defaults to torch.float.
- property bias_kwarg
Extra keyword ‘bias’ for LayerNorm components, if requested.
- property context
Maximum context length given by the positional encodings.
- property device
The device all weights, biases, activations, etc. reside on.
- property dtype
The dtype of all weights, biases, activations, and parameters.
- forward(src, mask=None, is_causal=True)[source]
Forward pass of one encoder layer (i.e., with self-attention only).
- Parameters:
src (Tensor) – Input sequence(s) of dimensions (…, S, D), with sequence length S and model dimension D.
mask (Tensor, optional) – Attention mask with a shape broadcastable to the shape of the attention weights (…, S, S). Two types of masks are supported: a boolean mask, where a value of True indicates that the element should take part in attention, or a float mask of the same dtype as src that is added to the product of queries and keys before taking the softmax. In the latter case, a value of 0.0 (resulting in unchanged attention weights) indicates that an element should take part in the attention and a value of “-inf” (resulting in a zero attention weight) that it should not. Defaults to None.
is_causal (bool, optional) – If set to True, inputs are masked with an S x S lower triangular matrix and mask is ignored. Defaults to True.
- Returns:
The output has the same shape as the input.
- Return type:
Tensor
Important
In adhering to the convention of scaled_dot_product_attention, the meaning of True and False (attend to and not attend to, respectively) in boolean attention masks is exactly the opposite of what it means in Transformer. Therefore, to stay compatible, use float masks!
- property has_pos_enc
Whether positional encodings are applied.
- property mod_dim
The model dimension.
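The difference between the two settings of norm_first in EncoderLayer can be sketched as follows. The function sublayer and the small feed-forward stack are hypothetical stand-ins chosen for illustration, assuming the usual residual arrangement; they are not taken from the library.

```python
import torch
from torch import nn


def sublayer(src, block, norm, dropout, norm_first=True):
    """Residual connection around one sub-layer (attention or feed-forward)."""
    if norm_first:
        # Pre-norm: normalize the input to the block, then add the residual.
        return src + dropout(block(norm(src)))
    # Post-norm: normalize the sum of the block's input and output.
    return norm(src + dropout(block(src)))


mod_dim = 8
feed_forward = nn.Sequential(  # hypothetical stand-in for feed_forward
    nn.Linear(mod_dim, 4 * mod_dim),
    nn.GELU(),
    nn.Linear(4 * mod_dim, mod_dim),
)
norm = nn.LayerNorm(mod_dim, eps=1e-5)
dropout = nn.Dropout(0.1).eval()   # eval mode makes dropout a no-op here

src = torch.randn(2, 5, mod_dim)   # (batch, S, D)
pre = sublayer(src, feed_forward, norm, dropout, norm_first=True)
post = sublayer(src, feed_forward, norm, dropout, norm_first=False)
```

With post-norm, the output of every sub-layer is itself normalized, whereas with pre-norm the residual stream passes through unnormalized.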
- class Encoder(vocab, layer, n_layers=2, pad_id=0, pos_enc=Identity(), dropout=0.1, scale_grad_by_freq=True, device='cpu', dtype=torch.float32)[source]
Bases: Resettable
Flexible transformer encoder for natural language modeling.
- Parameters:
vocab (int) – The vocabulary size of the tokenizer, i.e., the highest possible token id plus one.
layer (EncoderLayer) – A suitably parameterized instance of EncoderLayer.
n_layers (int, optional) – How often the layer is repeated in the transformer stack. Defaults to 2, but must be at least 1.
pad_id (int, optional) – The id of the padding token. Defaults to 0.
pos_enc (Resettable, optional) – PyTorch Module that has a reset_parameters() method, has a context attribute specifying the maximum sequence length, and processes tensors with dimensions (…, S, D), where S is the sequence length and D is the model dimension specified in the layer. If given, it will be called on the input tensor first thing. Typically, this would be an instance of Sinusoidal or Learnable positional encodings. Defaults to an instance of Identity, which does nothing.
dropout (float, optional) – Apply dropout to the sum of token embedding and positional encodings with this probability during training. Defaults to 0.1.
scale_grad_by_freq (bool, optional) – Whether to scale the gradients on the token embeddings by the inverse frequency of their occurrence. Defaults to True.
device (str or device, optional) – Torch device to first create the transformer on. Defaults to “cpu”.
dtype (dtype, optional) – Torch dtype to first create the transformer encoder stack in. Defaults to torch.float.
- Raises:
TypeError – If neither the encoder itself nor the layer applies any positional encodings.
- property context
Maximum context length permitted by the positional encodings.
- property device
The device all weights, biases, activations, etc. reside on.
- property dtype
The dtype of all weights, biases, activations, and parameters.
- forward(src, attn_mask=None, src_mask=None, is_causal=True)[source]
Forward pass through the transformer encoder with optional masking.
- Parameters:
src (Tensor) – Input sequence(s) of token indices. Must be of dtype int64 (=long). Expected dimensions are (…, S), with S the sequence length.
attn_mask (Tensor, optional) – Floating-point attention mask with a shape broadcastable to the shape of the attention weights (…, S, S) to be added to the product of queries and keys, before taking the softmax. A value of 0.0 (resulting in unchanged attention weights) indicates that an element should be attended to and a value of “-inf” (resulting in a zero attention weight) that it should not be attended to. Defaults to None.
src_mask (Tensor, optional) – Floating-point attention mask with a shape broadcastable to the shape of src (…, S). A value of 0.0 indicates that an element should be attended to and a value of “-inf” that it should not be attended to. Defaults to None.
is_causal (bool, optional) – If set to True, inputs are masked with a causal S x S triangular matrix (as produced by generate_square_subsequent_mask) and both attn_mask and src_mask are ignored. Defaults to True.
- Returns:
Un-normalized logits over the next-token probabilities for each position with dimensions (…, vocab, S), where S is again the sequence length.
- Return type:
Tensor
Important
Boolean attention masks are not accepted!
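The following sketch shows how the floating-point masks described for Encoder.forward could be prepared and how an output with dimensions (…, vocab, S) fits torch.nn.functional.cross_entropy. It does not instantiate the Encoder: the logits are random stand-ins, and the token ids, vocabulary size, and targets are made up for illustration.

```python
import torch
import torch.nn.functional as F
from torch import nn

pad_id, vocab = 0, 11
src = torch.tensor([[5, 3, 7, 2], [4, 9, 0, 0]])    # (batch, S), int64
seq_len = src.size(-1)

# Float attention mask (S, S): 0.0 on and below the diagonal, -inf above.
attn_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)

# Float padding mask (batch, S): -inf wherever the token is padding.
src_mask = torch.zeros(src.shape).masked_fill(src == pad_id, float("-inf"))

# Random stand-in for the encoder output with dimensions (batch, vocab, S).
logits = torch.randn(src.size(0), vocab, seq_len)
targets = torch.tensor([[3, 7, 2, 1], [9, 0, 0, 0]])  # next tokens, (batch, S)
loss = F.cross_entropy(logits, targets, ignore_index=pad_id)
```

With the class dimension in second place, the logits match the layout cross_entropy expects for sequence targets, so no transposition is needed before computing the loss.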
- property mod_dim
The model dimension.
- property scale
Square root of model dimension for scaling the input embeddings.
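A minimal sketch of how the scale and the embedding-related constructor arguments could interact, assuming a plain torch.nn.Embedding serves as the token embedding. That is an assumption made for illustration, not a statement about how the Encoder is implemented.

```python
import math
import torch
from torch import nn

vocab, mod_dim, pad_id = 11, 8, 0

# Assumption: a plain nn.Embedding plays the role of the token embedding.
embed = nn.Embedding(
    vocab, mod_dim, padding_idx=pad_id, scale_grad_by_freq=True
)
scale = math.sqrt(mod_dim)

src = torch.tensor([[5, 3, 7, 0]])   # (batch, S) token ids, last is padding
embedded = embed(src) * scale        # (batch, S, mod_dim)
```

In nn.Embedding, the row belonging to padding_idx is initialized to zeros and receives no gradient, so padded positions contribute a zero vector before positional encodings are added.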