model
=====

Now that the data is ready, on to the main course, the model. :mod:`slangmod`
trains a Transformer with causal self-attention, of which you can customize
almost every aspect. The corresponding settings go into a ``[model]`` section
in your config TOML file. Alternatively, you can set them on the command line
with ``--model. ``.

model.dim = 512
    Model (embedding) dimension.

model.scale_grad_by_freq = True
    If ``True``, scale gradients of the token embeddings by the inverse of
    their frequency in the mini-batch.

model.positions = "vanilla"
    Where to use which positional encodings. The default "**vanilla**" uses
    sinusoidal positional encodings added to the raw token embeddings once,
    before they enter the first self-attention block, just like in the
    original `Transformer `_ paper. As also mentioned there, you can try
    "**learnable**" positional encodings at this location in the model.
    Further options are "**rotary**", effectively changing the model into a
    `RoFormer `_, and "**sinusoidal**", which adds sinusoidal positional
    encodings to the input of every Transformer layer, not just the first.

model.context = 4096
    Positional encodings are never computed on the fly but are pre-computed
    and cached up to this length. It is, therefore, the maximum number of
    tokens that the model can handle.

    .. warning::
       If you choose "**learnable**" positional encodings, then the context
       must *not* be longer than :ref:`data.seq_len ` because we can only
       learn positional encodings up to the length of the sequences that the
       model sees during training.

model.n_heads = 8
    The number of attention heads to use.

model.n_layers = 8
    The number of Transformer layers to stack.

model.attn_bias = False
    Whether to add bias to the in- and out-projections of the self-attention.

model.dropout = 0.1
    Amount of `dropout `_ to apply at various places throughout the model.

model.norm_cls = "layer"
    Which type of norm to use between Transformer (sub-)layers. The other
    choice besides the default of "layer" is "rms".

model.norm_bias = True
    Whether to add bias to the ``LayerNorm`` components (if they are used).

model.norm_first = True
    Whether to normalize the *input* before each Transformer layer or the sum
    of *output* and residual stream after each layer.

model.compile = True
    Whether to compile the model for faster training.

feedforward
-----------

For the feed-forward part of the model, you choose the network architecture,
the non-linearities involved, and its size. These settings go into a nested
``[model.feedforward]`` subsection under ``[model]`` in your config TOML file.
Alternatively, you can set them on the command line with
``--model.feedforward. ``.

model.feedforward.flavor = "vanilla"
    With the default setting "**vanilla**", the input is projected up to a
    size wider than ``model.dim``, passed through a non-linearity, and
    projected back down to ``model.dim``, as in the original `Transformer `_
    paper. Other options are "**glu**", which projects the input up and uses
    one half to gate the other (in a `Gated Linear Unit `_) *before*
    projecting back down to ``model.dim``, and "**grn**", which projects up
    first, applies a non-linearity, and then applies gating *after* projecting
    back down to ``model.dim``, reminiscent of a `Gated Residual Network `_.

model.feedforward.activation = "gelu"
    Non-linearity to use for the "**vanilla**" and "**grn**" up-projections.
    Other options are "**elu**", "**relu**", "**swish**", and "**mish**".

model.feedforward.gate = "gelu"
    Non-linearity to use for the gating in "**glu**" and "**grn**". Other
    options are "**sigmoid**", "**elu**", "**relu**", "**swish**", "**mish**",
    and also "**none**", which results in a *bilinear* unit. This option is
    disregarded when using the "**vanilla**" feed-forward layer.

model.feedforward.bias = False
    Whether to use bias in the linear projections of the feed-forward network.

model.feedforward.factor = 4
    The width of the hidden layer in the feed-forward network expressed as a
    multiple of ``model.dim``.
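
The role of ``model.context`` as a hard upper limit can be illustrated by pre-computing a table of sinusoidal encodings. The sketch below follows the formula from the original Transformer paper in plain Python; the function name ``sinusoidal_table`` is hypothetical, and the actual caching code in :mod:`slangmod` may well look different.

```python
import math


def sinusoidal_table(context: int, dim: int) -> list[list[float]]:
    """Pre-compute sinusoidal encodings for positions 0 .. context - 1."""
    table = []
    for pos in range(context):
        row = []
        # Each even index i pairs a sine at i with a cosine at i + 1.
        for i in range(0, dim, 2):
            angle = pos / (10000 ** (i / dim))
            row.append(math.sin(angle))
            row.append(math.cos(angle))
        table.append(row[:dim])  # trim in case dim is odd
    return table


# Small numbers for readability; the documented defaults are 4096 and 512.
table = sinusoidal_table(context=16, dim=8)
print(len(table), len(table[0]))  # 16 8
```

Because the table has exactly ``context`` rows, there is no encoding to look up for any position at or beyond that length, which is why the context bounds the number of tokens the model can handle.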
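
The gating step of the "**glu**" flavor, and the way ``gate = "none"`` turns it into a bilinear unit, can be sketched as follows. This is only the element-wise gating applied to an already up-projected vector; the linear projections are omitted, only a few of the gate options are included, and which half gates which is an assumption made for illustration.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU using the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# A subset of the documented gate options; "none" is the identity.
GATES = {
    "gelu": gelu,
    "sigmoid": lambda x: 1.0 / (1.0 + math.exp(-x)),
    "relu": lambda x: max(0.0, x),
    "none": lambda x: x,
}


def glu(up: list[float], gate: str = "gelu") -> list[float]:
    """Split the up-projected vector in half and gate one half with the other."""
    half = len(up) // 2
    values, gates = up[:half], up[half:]
    activation = GATES[gate]
    return [v * activation(g) for v, g in zip(values, gates)]


# With gate "none", the output is the plain product of the two halves.
print(glu([1.0, 2.0, 3.0, 4.0], gate="none"))  # [3.0, 8.0]
```

With ``gate = "none"``, each output is a product of two linear functions of the input, hence the term *bilinear*.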