model
Now that the data is ready, on to the main course: the model. slangmod
trains a Transformer with causal self-attention, of which you can customize
almost every aspect. The corresponding settings go into a [model] section
in your config TOML file. Alternatively, you can set them on the command line
with --model.<KEY> <VALUE>.
- model.dim = 512
Model (embedding) dimension.
- model.scale_grad_by_freq = True
If True, scale the gradients of the token embeddings by the inverse of their frequency in the mini-batch.
- model.positions = “vanilla”
Where to use which positional encodings. The default “vanilla” uses sinusoidal positional encodings added to the raw token embeddings once, before they enter the first self-attention block, just like in the original Transformer paper. As also mentioned there, you can try “learnable” positional encodings at this location in the model. Further options are “rotary”, effectively changing the model into a RoFormer, and “sinusoidal” which adds sinusoidal positional encodings to the input of every transformer layer, not just the first.
- model.context = 4096
Positional encodings are never computed on-the-fly but are pre-computed and cached up to that length. This is, therefore, the maximum number of tokens that the model can handle.
Warning
If you choose “learnable” positional encodings, then the context must not be longer than data.seq_len because we can only learn positional encodings up to the length of the sequences that the model sees during training.
- model.n_heads = 8
The number of attention heads to use.
- model.n_layers = 8
The number of Transformer layers to stack.
- model.attn_bias = False
Whether to add bias to the in- and out-projections of the self-attention.
- model.dropout = 0.1
Amount of dropout to apply at various places throughout the model.
- model.norm_cls = “layer”
Which type of norm to use between transformer (sub-)layers. The other choice besides the default of “layer” is “rms”.
- model.norm_bias = True
Whether to add bias to the LayerNorm components (if they are used).
- model.norm_first = True
Whether to normalize the input before each Transformer layer or the sum of output and residual stream after each layer.
- model.compile = True
Whether to compile the model for faster training.
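To make the relationship between model.positions, model.dim, and model.context more concrete, here is a minimal sketch of a pre-computed sinusoidal table, following the formula from the original Transformer paper. It is not slangmod's actual implementation: the names sinusoidal_table, encodings_for, and check_context are hypothetical, the interleaved sine/cosine layout is an assumption, and the sizes are kept tiny for illustration.

```python
import math


def sinusoidal_table(context: int, dim: int) -> list:
    """Pre-compute sinusoidal positional encodings for positions 0..context-1."""
    table = []
    for pos in range(context):
        row = []
        for i in range(0, dim, 2):
            # PE(pos, 2k) = sin(pos / 10000^(2k/dim)), PE(pos, 2k+1) = cos(same angle)
            angle = pos / (10000.0 ** (i / dim))
            row.append(math.sin(angle))
            row.append(math.cos(angle))
        table.append(row[:dim])  # truncate in case dim is odd
    return table


# Tiny stand-ins for model.context and model.dim (assumed values, not defaults).
CONTEXT, DIM = 16, 8
TABLE = sinusoidal_table(CONTEXT, DIM)  # computed once, then only sliced


def encodings_for(seq_len: int) -> list:
    """Look up cached encodings; sequences beyond the cached length cannot be served."""
    if seq_len > len(TABLE):
        raise ValueError(f"sequence length {seq_len} exceeds context {len(TABLE)}")
    return TABLE[:seq_len]


def check_context(positions: str, context: int, seq_len: int) -> None:
    """Hypothetical check mirroring the warning about learnable encodings."""
    if positions == "learnable" and context > seq_len:
        raise ValueError("with learnable positions, context must not exceed data.seq_len")
```

Because the table is built once and only sliced afterwards, the cached length acts as a hard upper bound on the number of tokens, which is why model.context doubles as the maximum sequence length.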
feedforward
For the feed-forward part of the model, you choose the network architecture,
the involved non-linearities, as well as its size. These settings go into a
nested [model.feedforward] subsection under [model] in your config
TOML file. Alternatively, you can set them on the command line with
--model.feedforward.<KEY> <VALUE>.
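The correspondence between nested config sections and dotted command-line keys can be sketched as follows. This is only an illustration of the naming scheme described above: the helper to_cli_flags is hypothetical and not part of slangmod, and the values are arbitrary picks from the options documented here.

```python
def to_cli_flags(config: dict, prefix: str = "") -> list:
    """Flatten a nested config mapping into '--dotted.key value' pairs."""
    flags = []
    for key, value in config.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # A nested table such as [model.feedforward] extends the dotted prefix.
            flags.extend(to_cli_flags(value, dotted))
        else:
            flags.extend([f"--{dotted}", str(value)])
    return flags


# Equivalent of this TOML fragment:
#   [model]
#   dim = 256
#   positions = "rotary"
#
#   [model.feedforward]
#   flavor = "glu"
#   factor = 4
config = {
    "model": {
        "dim": 256,
        "positions": "rotary",
        "feedforward": {"flavor": "glu", "factor": 4},
    }
}

flags = to_cli_flags(config)
print(" ".join(flags))
```

Each key in the [model.feedforward] table simply gains the model.feedforward. prefix, matching the --model.feedforward.<KEY> <VALUE> pattern.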
- model.feedforward.flavor = “vanilla”
With the default setting “vanilla”, the input is projected up to a size wider than model.dim, passed through a non-linearity, and projected back down to model.dim, as in the original Transformer paper. Other options are “glu”, which projects the input up and uses one half to gate the other (in a Gated Linear Unit) before projecting back down to model.dim, and “grn”, which projects up first, applies a non-linearity, and then applies gating after projecting back down to model.dim, reminiscent of a Gated Residual Network.
- model.feedforward.activation = “gelu”
Non-linearity to use for the “vanilla” and “grn” up-projections. Other options are “elu”, “relu”, “swish”, and “mish”.
- model.feedforward.gate = “gelu”
Non-linearity to use for the gating in “glu” and “grn”. Other options are “sigmoid”, “elu”, “relu”, “swish”, “mish”, and also “none”, which results in a bilinear unit. This option is disregarded when using the “vanilla” feed-forward layer.
- model.feedforward.bias = False
Whether to use bias in the linear projections of the feed-forward network.
- model.feedforward.factor = 4
The width of the hidden layer in the feed-forward network expressed as a multiple of model.dim.
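The difference between the “vanilla” and “glu” flavors can be sketched in plain Python for a single token vector. This is an illustration under stated assumptions, not slangmod's code: the function names are hypothetical, biases are omitted (as with the default bias = False), which half gates which is a guess, and up-projecting to twice the hidden width for “glu” is an assumption about how factor is interpreted.

```python
import math
import random


def gelu(v: float) -> float:
    """Exact GELU via the Gaussian error function."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def linear(x, weight):
    """Bias-free linear map; weight is a list of rows (out_features x in_features)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def random_weight(rows, cols, rng):
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def vanilla_ff(x, w_up, w_down, activation=gelu):
    """Project up, apply the non-linearity, project back down."""
    hidden = [activation(h) for h in linear(x, w_up)]
    return linear(hidden, w_down)


def glu_ff(x, w_up, w_down, gate=gelu):
    """Project up, let one half gate the other, project back down.

    Passing gate=None skips the non-linearity, giving a bilinear unit.
    """
    up = linear(x, w_up)
    half = len(up) // 2
    value, gate_in = up[:half], up[half:]  # assumption: second half is the gate
    gated = [v * (gate(g) if gate is not None else g) for v, g in zip(value, gate_in)]
    return linear(gated, w_down)


dim, factor = 8, 4          # small stand-ins for model.dim and feedforward.factor
hidden = factor * dim       # hidden width as a multiple of the model dimension
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(dim)]

y_vanilla = vanilla_ff(x, random_weight(hidden, dim, rng), random_weight(dim, hidden, rng))
# Assumption: up-project to 2 * hidden so that each half has width hidden.
y_glu = glu_ff(x, random_weight(2 * hidden, dim, rng), random_weight(dim, hidden, rng))
```

Both flavors map a vector of size dim back to size dim; they differ only in what happens at the wide hidden layer.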