data

Flexibly fold and pad input data into sequences of equal length and then wrap them into custom data producers to be used in the training loop as well as for validation.

Refer to the documentation of the swak package (available on PyPI) for how best to use these data producers in a training loop.

class TestSequenceFolder(seq_len, pad_id=0)[source]

Bases: ArgRepr

Fold and right-pad a single sequence into many with specified length.

Parameters:
  • seq_len (int) – The desired sequence length. Because, in next-token prediction, the target sequence is the source offset by one, the output tensor size will be one more than this number in its second and last dimension.

  • pad_id (int, optional) – Integer index to pad sequences with so that all parts have the same length, that is, seq_len + 1. Defaults to 0. Make sure that the same index is also used when training embeddings and in your loss function!

Raises:
  • TypeError – If seq_len is not an integer.

  • ValueError – If seq_len is smaller than 2.

__call__(sequence)[source]

Fold and right-pad a single sequence into many of the given length.

Each row of the output can then conveniently be split into source (row[:-1]) and target (row[1:]) for next-token prediction such that both have the given seq_len.

Parameters:

sequence (Tensor) – The input sequence to fold and pad. Must be a 1-dimensional tensor.

Returns:

Output tensor of size seq_len + 1 in its second and last dimension and as many rows as needed to accommodate all elements in its first dimension.

Return type:

Tensor

Note

Sequences that are folded into more than one row will never have any row that has fewer than 2 non-padding entries because these would be useless in evaluating next-token prediction. However, empty sequences or sequences of length 1 will still be padded to a tensor of sizes 1 and seq_len + 1 in their first and second dimension, respectively. Consequently, they should be filtered out beforehand!
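The folding described above can be illustrated with a minimal sketch in plain PyTorch. The function name fold_and_pad is hypothetical and this is not the class's actual implementation. In particular, it assumes that consecutive rows share one token, so that the last target of one row is the first source token of the next.

```python
import torch
import torch.nn.functional as F


def fold_and_pad(sequence: torch.Tensor, seq_len: int, pad_id: int = 0) -> torch.Tensor:
    """Hypothetical stand-in: fold a 1-D tensor into rows of width seq_len + 1."""
    width = seq_len + 1
    # Assumption of this sketch: rows advance by seq_len, so neighbours share one token.
    stride = seq_len
    n = sequence.numel()
    # ceil((n - 1) / stride) rows, but always at least one row.
    n_rows = max(1, -(-(n - 1) // stride))
    padded_len = (n_rows - 1) * stride + width
    # Right-pad with pad_id so that the last row is complete.
    padded = F.pad(sequence, (0, padded_len - n), value=pad_id)
    # Sliding windows of the given width, moved by stride positions each.
    return padded.unfold(0, width, stride)


rows = fold_and_pad(torch.arange(1, 11), seq_len=4)
source, target = rows[:, :-1], rows[:, 1:]  # both have seq_len columns
```

With ten tokens and seq_len=4, this sketch yields three rows of width 5, the last of which holds two real tokens followed by padding.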

property width

Size of the output tensor in its second and last dimension.

class TestData(seqs, device='cpu', dtype=torch.float32)[source]

Bases: TestDataBase

Wraps test and validation data to provide batches over a sample.

Parameters:
  • seqs (Tensor) – A PyTorch tensor with dimensions (N, S + 1) of dtype int64 or, equivalently, long, where N is the number of (padded) test/validation sequences and S is the (padded) sequence length. The “+1” is needed to provide the target for next-token prediction. Typically, this tensor will reside in CPU memory.

  • device (str or device, optional) – Torch device to push individual batches of data to. Defaults to “cpu”, but will typically be “cuda”.

property n

Total number of test or validation sequences.

sample(batch_size, max_n=None)[source]

Reproducible sample of test or validation data for model evaluation.

Parameters:
  • batch_size (int) – The desired batch size. If the number of sequences is not integer divisible by that number, one of the batches will be smaller.

  • max_n (int, optional) – The maximum number of sequences to provide in the sample, limited by how many there are in total. If not given, all sequences will be provided. Defaults to None.

Returns:

The items produced by the iterator are 2-tuples, with the first element being a one-tuple with a single batch of data with dimensions (batch_size, seq_len) and the second being a tensor of the same dimensions representing the target token ids, i.e., the input shifted by one position.

Return type:

Iterator
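To make the structure of the yielded items concrete, here is a rough sketch of an iterator with the same item layout. The function sample_batches is a hypothetical stand-in, not the method itself. It ignores the device transfer and assumes that max_n simply truncates the sequences from the start.

```python
import torch


def sample_batches(seqs: torch.Tensor, batch_size: int, max_n=None):
    """Hypothetical stand-in yielding ((source,), target) items."""
    n = seqs.shape[0] if max_n is None else min(max_n, seqs.shape[0])
    for batch in seqs[:n].split(batch_size):
        # Source drops the last column, target drops the first one.
        yield (batch[:, :-1],), batch[:, 1:]


seqs = torch.arange(35).reshape(7, 5)  # 7 sequences of width S + 1 = 5
batches = list(sample_batches(seqs, batch_size=3, max_n=5))
```

Because the first element is a one-tuple, it can be unpacked directly into a model call, for example model(*features).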

property seq_len

Length of (padded) test or validation sequences.

class TrainSequenceFolder(seq_len, pad_id=0, overlap=0.0, jitter=1)[source]

Bases: ArgRepr

Fold and right-pad a single sequence into many with specified length.

Parameters:
  • seq_len (int) – The desired sequence length.

  • pad_id (int, optional) – Integer index to pad sequences with so that all parts have the same length and can be stored in a single tensor. Defaults to 0. Make sure that the same index is also used when training embeddings and in your loss function!

  • overlap (int or float, optional) – To teach a language model longer consistent context lengths, training sequences can be made to overlap to some extent, such that the beginning of each sequence is the end of the last. If the overlap is strictly smaller than 1, then it is interpreted as a fraction of the sequence length. If it is 1 or larger, it is interpreted as an integer number of positions to overlap. Defaults to 0.0 which means no overlap between consecutive parts.

  • jitter (int, optional) – To introduce some variability into the training data, one can slightly and randomly shift sequences by a few positions every time they are used. That way, over-reliance on any specific positional alignment is avoided. If jitter > 1, then the sequence parts are extended such that they can be selected with an offset. Defaults to 1.

Raises:

ValidationErrors – If any of seq_len, overlap, and jitter have either the wrong type or a value that does not make sense.

__call__(sequence)[source]

Fold and right-pad a single sequence into many of the given length.

Each row of the output can then conveniently be split into source (row[:-1]) and target (row[1:]) for next-token prediction such that both have the given seq_len. If jitter > 1, shifted source (row[offset: seq_len + offset]) and target sequences (row[offset + 1: seq_len + offset + 1]) can be extracted from each row, with offset from the interval [0, jitter).

Parameters:

sequence (Tensor) – The input sequence to fold and pad. Must be a 1-dimensional tensor.

Returns:

Output tensor of size seq_len + jitter in its second and last dimension and as many rows as needed to accommodate all elements in its first dimension.

Return type:

Tensor

Note

Sequences that are folded into more than one row will never have any row that has fewer than 2 non-padding entries because these would be useless in evaluating next-token prediction. However, empty sequences or sequences of length 1 will still be padded to a tensor of sizes 1 and seq_len + jitter in their first and second dimension, respectively. Consequently, they should be filtered out beforehand!
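The offset slicing described for jitter > 1 can be tried on a single row. In the sketch below, the row is a made-up stand-in for one row of the folder's output and split_with_offset is a hypothetical helper that applies the slices quoted in the description.

```python
import torch

seq_len, jitter = 4, 3
# Stand-in for one folded row of width seq_len + jitter = 7.
row = torch.arange(10, 10 + seq_len + jitter)


def split_with_offset(row: torch.Tensor, seq_len: int, offset: int):
    """Hypothetical helper: shifted source and target for one offset."""
    source = row[offset: seq_len + offset]
    target = row[offset + 1: seq_len + offset + 1]
    return source, target


# Every offset in [0, jitter) gives a full-length source/target pair.
pairs = [split_with_offset(row, seq_len, offset) for offset in range(jitter)]
```

Even the largest offset, jitter - 1, still leaves enough columns for a target of length seq_len, which is why the row width is seq_len + jitter.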

property stride

The number of elements between the starts of two consecutive sequences.
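How the overlap parameter might translate into a stride can be sketched as follows. Both helpers are hypothetical. The truncating conversion of a fractional overlap and the relation stride = seq_len - overlap are assumptions of this sketch, not confirmed behavior of the class.

```python
def resolve_overlap(seq_len: int, overlap: float) -> int:
    """Hypothetical helper: turn overlap into a whole number of positions."""
    if overlap < 1:
        # Interpreted as a fraction of seq_len; truncation is an assumption.
        return int(overlap * seq_len)
    # Interpreted as an integer number of positions.
    return int(overlap)


def stride_for(seq_len: int, overlap: float) -> int:
    """Assumed relation: rows advance by seq_len minus the overlap."""
    return seq_len - resolve_overlap(seq_len, overlap)
```

Under these assumptions, overlap=0.25 and overlap=2 both give a stride of 6 for seq_len=8, while the default overlap=0.0 gives a stride of 8.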

property width

Size of the output tensor in its second and last dimension.

class TrainData(seqs, shuffle=True, jitter=1, device='cpu', dtype=torch.float32)[source]

Bases: TrainDataBase

Wraps training data to provide batches and samples.

Parameters:
  • seqs (Tensor or LazyCatDim0) – A PyTorch tensor or an instance of LazyCatDim0 with dimensions (N, S + jitter) of dtype int64 or, equivalently, long, where N is the number of (padded) train sequences and S is the (padded) sequence length. Evidently, jitter must be at least 1 to provide the target for next-token prediction. Typically, this tensor will reside in CPU memory.

  • shuffle (bool, optional) – Whether to randomize training data from one epoch to the next. If True, sequences will be shifted by a random offset of up to jitter and the ordering of batches will be randomized as well. Defaults to True.

  • jitter (int, optional) – Maximum position index to randomly shift the start of the training sequences to for the next epoch, provided that shuffle is True. Defaults to 1, which means that sequences will always start from the beginning. If sequences were not extended to account for this jitter, by using the TrainSequenceFolder, for example, then the length of the training sequences will also randomly change from one epoch to the next.

  • device (str or device, optional) – Torch device to push individual batches of data to. Defaults to “cpu”, but will typically be “cuda”.

__call__(batch_size, step_freq=1, _=0)[source]

Iterator over batches of training data for actual model training.

Parameters:
  • batch_size (int) – The desired batch size. If the number of sequences is not integer divisible by that number and step_freq is 1, one of the batches will be smaller than all others.

  • step_freq (int, optional) – In case this number is > 1, all batches will have the exact batch_size such that losses accumulated over multiple batches can be appropriately scaled before taking an optimizer step. Defaults to 1.

Returns:

  • n_batches (int) – Total number of batches the returned iterator will provide.

  • batches (Iterator) – The items produced by the iterator are 2-tuples, with the first element being a one-tuple with a single batch of data with dimensions (batch_size, seq_len) and the second being a tensor of the same dimensions representing the target token ids, i.e., the input shifted by one position.
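The returned pair of n_batches and batches could be consumed along the following lines when accumulating gradients. This is only a sketch: run_epoch, toy_data, the tiny model, and the loss function are all hypothetical stand-ins, and dividing the loss by step_freq is one common way to scale accumulated losses rather than something the package prescribes.

```python
import torch


def run_epoch(train_data, model, loss_fn, optimizer, batch_size, step_freq=1):
    """Sketch of one epoch, taking an optimizer step every step_freq batches."""
    n_batches, batches = train_data(batch_size, step_freq)
    optimizer.zero_grad()
    n_steps = 0
    for i, (features, target) in enumerate(batches, start=1):
        # features is a one-tuple, so it is unpacked into the model call.
        loss = loss_fn(model(*features), target) / step_freq
        loss.backward()
        if i % step_freq == 0:
            optimizer.step()
            optimizer.zero_grad()
            n_steps += 1
    return n_batches, n_steps


VOCAB = 11


def toy_data(batch_size, step_freq=1):
    """Hypothetical stand-in with the same return structure as __call__."""
    rows = torch.randint(1, VOCAB, (8, 5))  # 8 sequences of width seq_len + 1
    chunks = rows.split(batch_size)
    return len(chunks), (((c[:, :-1],), c[:, 1:]) for c in chunks)


def loss_fn(logits, target):
    # cross_entropy expects the class dimension in second position.
    return torch.nn.functional.cross_entropy(logits.transpose(1, 2), target)


model = torch.nn.Sequential(torch.nn.Embedding(VOCAB, 4), torch.nn.Linear(4, VOCAB))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
n_batches, n_steps = run_epoch(toy_data, model, loss_fn, optimizer, batch_size=2, step_freq=2)
```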

adjust_batches_for(batch_size, step_freq=1, n=None)

Number of batches reduced to be suitably integer-divisible.

This is a helper method for users to implement the __call__ method in the case of step_freq > 1. The returned number of batches is guaranteed to be integer-divisible by step_freq so that no batches are “left over” at the end of the epoch.

Parameters:
  • batch_size (int) – The desired number of data points in one batch.

  • step_freq (int, optional) – In case this number is > 1, the optimizer will accumulate gradients for that many batches before taking a step. All batches should be of the same size in this case and there shouldn’t be any “left-over” batches at the end of each epoch. Defaults to 1.

  • n (int, optional) – In rare cases, it might be useful to pass in the number of data points to adjust rather than taking the number of data points returned by self.n, which is the default.

Returns:

Reduced number of batches that is guaranteed to be integer divisible by step_freq.

Return type:

int

adjust_n_for(batch_size, step_freq=1, n=None)

Number of data points reduced to be suitably integer-divisible.

This is a helper method for users to implement the __call__ method in the case of step_freq > 1. Taking only the returned number of data points guarantees that all batches have the same size and that there will be no “left-over” batches at the end of the epoch.

Parameters:
  • batch_size (int) – The desired number of data points in one batch.

  • step_freq (int, optional) – In case this number is > 1, the optimizer will accumulate gradients for that many batches before taking a step. All batches should be of the same size in this case and there shouldn’t be any “left-over” batches at the end of each epoch. Defaults to 1.

  • n (int, optional) – In rare cases, it might be useful to pass in the number of data points to adjust rather than taking the number of data points returned by self.n, which is the default.

Returns:

Reduced number of data points that is guaranteed to be integer divisible by the product of batch_size and step_freq.

Return type:

int
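The divisibility guarantees of the two helper methods can be illustrated with plain integer arithmetic. The functions reduced_n and reduced_batches below are hypothetical and only cover the step_freq > 1 use case described above. They are not the methods' actual implementation.

```python
def reduced_n(n: int, batch_size: int, step_freq: int = 1) -> int:
    """Largest number of data points <= n divisible by batch_size * step_freq."""
    chunk = batch_size * step_freq
    return (n // chunk) * chunk


def reduced_batches(n: int, batch_size: int, step_freq: int = 1) -> int:
    """Number of full batches, a multiple of step_freq by construction."""
    return reduced_n(n, batch_size, step_freq) // batch_size


n_points = reduced_n(103, batch_size=8, step_freq=4)
n_full_batches = reduced_batches(103, batch_size=8, step_freq=4)
```

For 103 data points, a batch size of 8, and step_freq=4, this keeps 96 data points in 12 batches, that is, exactly 3 optimizer steps with nothing left over.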

property n

Total number of training sequences.

sample(batch_size, max_n=None)[source]

Reproducible sample of training data for model evaluation.

Parameters:
  • batch_size (int) – The desired batch size. If the number of sequences is not integer divisible by that number, one of the batches may be smaller.

  • max_n (int, optional) – Approximate maximum number of sequences to provide in the sample. If enough sequences are available, the last batch will also be filled up to the specified batch_size. If not given, all sequences will be provided. Defaults to None.

Returns:

The items produced by the iterator are 2-tuples, with the first element being a one-tuple with a single batch of data with dimensions (batch_size, seq_len) and the second being a tensor of the same dimensions representing the target token ids, i.e., the input shifted by one position.

Return type:

Iterator

Note

To reproducibly compute a training loss or error, sequences will never be randomized in any way, regardless of shuffle and jitter.

property seq_len

Length of (padded) training sequences.