train

Training any neural network efficiently and to full convergence is something of a black art. slangmod gives you full control over the process. The following options should go into the [train] section of your config TOML file, but they can also be configured on the command line with --train.<KEY> <VALUE>.
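To illustrate the --train.<KEY> <VALUE> convention, here is a minimal sketch that turns a handful of options into command-line arguments. The helper name train_overrides is hypothetical and not part of slangmod. It only formats the flags, so you still have to prepend whichever slangmod command you are running.

```python
def train_overrides(**options):
    """Format options as a flat list of --train.<KEY> <VALUE> arguments.

    Hypothetical helper for illustration only; not part of slangmod.
    """
    args = []
    for key, value in options.items():
        args.extend([f"--train.{key}", str(value)])
    return args


flags = train_overrides(batch_size=32, step_freq=2, optimizer="adamw")
print(" ".join(flags))
# prints: --train.batch_size 32 --train.step_freq 2 --train.optimizer adamw
```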

batch_size = 64

Size of mini-batches to process. Adjust so that you fully exploit your GPU’s memory.

step_freq = 1

Gradients are accumulated over this many batches before the optimizer takes a step. Use this when a memory shortage forces very small batch sizes. For example, to keep an effective batch size of 32, you could set it to 2 for a batch_size of 16, to 4 for a batch_size of 8, and so on.
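The interplay of batch_size and step_freq can be sketched in plain Python. This is a toy model with scalar "gradients", not slangmod's training loop. In particular, averaging the accumulated gradients (rather than summing them) is an assumption made here so that one step resembles a step on one larger batch.

```python
def accumulate(batch_grads, step_freq):
    """Return one averaged gradient per optimizer step.

    Toy sketch: gradients are plain floats, and averaging over
    step_freq mini-batches is an assumption, not slangmod's code.
    """
    steps = []
    running = 0.0
    for i, grad in enumerate(batch_grads, start=1):
        running += grad / step_freq
        if i % step_freq == 0:  # the optimizer steps only every step_freq batches
            steps.append(running)
            running = 0.0
    return steps


batch_size, step_freq = 16, 2
effective_batch_size = batch_size * step_freq  # samples seen per optimizer step
steps = accumulate([1.0, 3.0, 2.0, 4.0], step_freq)
print(effective_batch_size, steps)  # prints: 32 [2.0, 3.0]
```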

clip_grad = 0.6

When the overall norm of all parameter gradients in the model exceeds this value, they are scaled back accordingly. This helps stabilize convergence, especially at the beginning of a training run.
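As a rough illustration of clipping by the overall gradient norm, the sketch below treats all gradients as one flat list of floats. It assumes the L2 norm and a rescaling that brings the norm down to exactly the threshold. How slangmod implements this internally is not stated here.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by a common factor if their joint L2 norm is too large.

    Sketch under the assumptions named above; returns the (possibly scaled)
    gradients together with the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=0.6)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, round(norm_after, 6))  # prints: 5.0 0.6
```

Because every gradient is multiplied by the same factor, the direction of the update is preserved and only its length changes.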

label_smoothing = 0.1

This value is forwarded to the cross-entropy loss.
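If you are unsure what label smoothing does to the loss, the following sketch computes a smoothed cross-entropy for a single prediction in plain Python. It uses the common formulation in which the target distribution is a mix of the one-hot label and a uniform distribution over all classes. That the loss used by slangmod follows exactly this formulation is an assumption.

```python
import math


def smoothed_cross_entropy(logits, target, label_smoothing=0.1):
    """Cross-entropy against a target mixed with a uniform distribution.

    Sketch of the common formulation, for one sample and plain floats.
    """
    # Numerically stable log-softmax
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    log_probs = [x - log_z for x in logits]

    nll = -log_probs[target]                   # ordinary cross-entropy
    uniform = -sum(log_probs) / len(logits)    # cross-entropy against uniform
    return (1.0 - label_smoothing) * nll + label_smoothing * uniform


plain = smoothed_cross_entropy([5.0, 0.0, 0.0], target=0, label_smoothing=0.0)
smooth = smoothed_cross_entropy([5.0, 0.0, 0.0], target=0, label_smoothing=0.1)
print(plain < smooth)  # prints: True
```

The smoothed loss penalizes very confident predictions, which is why it is larger than the plain loss in this example even though the prediction is correct.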

optimizer = "adamw"

Optimizer to use. The choice is between adamw (the default) and adafactor (if memory is an issue).

max_epochs = 16

Maximum number of times that all training data will be shown to the model.

patience = None

Despite what None might seem to imply (that you have no patience), the default is to have no early stopping and to run for the full max_epochs. To activate early stopping, set this to an integer greater than 1: the number of consecutive epochs for which the test loss has to increase before training stops and the best checkpoint up to that point is retrieved.
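The effect of patience can be sketched with a list of per-epoch test losses. This toy function uses one possible reading of the rule, namely that an epoch counts against your patience whenever it fails to improve on the best loss seen so far. The function name and return values are hypothetical, and slangmod's exact criterion may differ.

```python
def run_with_patience(test_losses, patience=None):
    """Return (epochs_run, best_epoch) for a sequence of per-epoch test losses.

    Sketch only: an epoch is "bad" if it does not beat the best loss so far
    (an assumption). patience=None disables early stopping.
    """
    best_loss = float("inf")
    best_epoch = None
    bad_epochs = 0
    for epoch, loss in enumerate(test_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
        if patience is not None and bad_epochs >= patience:
            return epoch, best_epoch  # stop early, fall back to the best epoch
    return len(test_losses), best_epoch


losses = [2.0, 1.5, 1.6, 1.7, 1.4]
print(run_with_patience(losses, patience=2))     # prints: (4, 2)
print(run_with_patience(losses, patience=None))  # prints: (5, 5)
```

The example also shows the trade-off: with a patience of 2, training stops before the later improvement in epoch 5 is ever seen.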

learning_rate = 0.001

Determines the step size taken by the optimizer in model-parameter space.

warmup = 8_000

Number of initial mini-batches during which the learning rate is linearly ramped up to the specified value.
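A linear warm-up can be written as a multiplier on the configured learning_rate. The sketch below assumes that the ramp starts at zero and reaches the full value after exactly warmup mini-batches. The function name is hypothetical.

```python
def warmup_factor(step, warmup=8_000):
    """Fraction of the configured learning rate after `step` mini-batches.

    Sketch assuming a ramp from 0 that reaches 1.0 at step == warmup.
    """
    if warmup <= 0:
        return 1.0  # no warm-up requested
    return min(1.0, step / warmup)


learning_rate = 0.001
print(learning_rate * warmup_factor(4_000))  # prints: 0.0005
```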

scaling = "inverse"

Functional form of the learning-rate decay after the specified value has been reached. “inverse” scales with one over the number of mini-batches, “exponential” scales with one over a constant to the power of the number of mini-batches, and “cosine” scales as a quarter wave from 1 down to zero.

power = 0.5

Specifies the negative exponent of the number of mini-batches for “inverse” scaling. Must be a number from the interval [0.5, 1.0].
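To get a feeling for how power shapes the “inverse” decay, here is a minimal sketch of the decay as a multiplier on the learning rate. It assumes the decay starts once warmup mini-batches have passed and is normalized to 1 at that point. Neither the normalization nor the function name is taken from slangmod.

```python
def inverse_factor(step, warmup=8_000, power=0.5):
    """Decay multiplier proportional to step ** -power after the warm-up.

    Sketch under the assumptions above; constant 1.0 up to step == warmup.
    """
    if step <= warmup:
        return 1.0
    return (warmup / step) ** power


# Four times as many mini-batches as the warm-up:
print(inverse_factor(32_000, power=0.5))  # prints: 0.5
print(inverse_factor(32_000, power=1.0))  # prints: 0.25
```

A larger power makes the learning rate fall off faster, which is why the two ends of the allowed interval differ by a factor of two in this example.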

gamma = 0.95

The constant to take the (negative) power of for “exponential” scaling. Must be smaller than 1.

cooldown = 100_000

The number of mini-batches until “cosine” scaling reaches a learning rate of 0.
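A quarter cosine wave from 1 down to 0 can be sketched as follows. The argument decay_step counts mini-batches since the decay began. Whether slangmod counts cooldown from the start of training or from the end of the warm-up is not specified here, so the sketch leaves that choice to the caller. Holding the multiplier at zero after cooldown is likewise an assumption.

```python
import math


def cosine_factor(decay_step, cooldown=100_000):
    """Quarter-wave cosine multiplier, from 1 at decay_step == 0 toward 0.

    Sketch only; progress is clamped so the multiplier stays at
    (numerically almost) zero once cooldown mini-batches have passed.
    """
    progress = min(max(decay_step / cooldown, 0.0), 1.0)
    return math.cos(0.5 * math.pi * progress)


for decay_step in (0, 50_000, 100_000):
    print(decay_step, round(cosine_factor(decay_step), 4))
```

Compared to the other scalings, the cosine form decays slowly at first and fastest toward the end of the cooldown.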

cb_freq = 1

How often, in mini-batches, to log the training loss and the current learning rate.