data

Flexibly fold and pad input data into sequences of equal length and then wrap them into custom data producers to be used in the training loop as well as for validation.

Refer to the pertinent documentation of the swak package (available on PyPI) for how best to use these data producers in the training loop.

class TestSequenceFolder(seq_len, pad_id=0)

Bases: ArgRepr

Fold and right-pad a single sequence into many of the specified length.

Parameters:

seq_len (int) – The desired sequence length. Because, in next-token prediction, the target sequence is the source offset by one, the output tensor size will be one more than this number in its second and last dimension.
pad_id (int, optional) – Integer index to pad sequences with so that all parts have the same length, that is, seq_len + 1. Defaults to 0. Make sure that this index is consistently used also for training embeddings and in your loss function!

Raises:

TypeError – If seq_len is not an integer.
ValueError – If seq_len is smaller than 2.
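
To illustrate why pad_id has to match in the loss function, here is a minimal pure-Python sketch of a mean negative log-likelihood that skips padded target positions. The helper name masked_nll and all numbers are made up for the example and are not part of this package. In PyTorch, the corresponding settings are the ignore_index argument of torch.nn.CrossEntropyLoss and the padding_idx argument of torch.nn.Embedding.

```python
import math


def masked_nll(log_probs, targets, pad_id=0):
    """Mean negative log-likelihood over positions whose target is not pad_id.

    Hypothetical helper for illustration; not part of the documented API.
    """
    losses = [-row[t] for row, t in zip(log_probs, targets) if t != pad_id]
    return sum(losses) / len(losses) if losses else 0.0


# Made-up per-position log-probabilities over a vocabulary of 3 token ids.
log_probs = [[math.log(0.25), math.log(0.25), math.log(0.5)]] * 4
targets = [2, 2, 0, 0]  # the last two positions are padding

masked = masked_nll(log_probs, targets, pad_id=0)     # padding ignored
unmasked = masked_nll(log_probs, targets, pad_id=-1)  # padding counted as real
```

With a mismatched index, the padded positions are scored as if they were real tokens, so the reported loss no longer reflects the actual predictions.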

__call__(sequence)

Fold and right-pad a single sequence into many of the given length.

Each row of the output can then conveniently be split into source (row[:-1]) and target (row[1:]) for next-token prediction such that both have the given seq_len.
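
As a small illustration of this split, the following pure-Python sketch uses a plain list in place of one row of the output tensor. The token ids are made up for the example; the same slicing applies to a tensor row.

```python
# One hypothetical output row for seq_len = 4, so the row has 5 entries.
seq_len = 4
row = [11, 12, 13, 14, 15]

source = row[:-1]  # model input: every token except the last
target = row[1:]   # prediction target: every token except the first

# At each position i, target[i] is the token that follows source[i].
pairs = list(zip(source, target))
```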

Parameters: sequence (Tensor) – The input sequence to fold and pad. Must be a 1-dimensional tensor.
Returns: Output tensor of size seq_len + 1 in its second and last dimension, with as many rows in its first dimension as are needed to accommodate all elements.
Return type: Tensor

Note

Sequences that are folded into more than one row will never produce a row with fewer than 2 non-padding entries because such rows would be useless for evaluating next-token prediction. However, empty sequences and sequences of length 1 are still padded to a tensor of size 1 in its first and seq_len + 1 in its second dimension. Consequently, they should be filtered out beforehand!
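
The behavior described in this note can be illustrated with a pure-Python sketch that uses lists in place of tensors. It is not the library's implementation: the helper name fold_and_pad is hypothetical, and the one-token overlap between consecutive rows is an assumption chosen so that every next-token pair is covered.

```python
def fold_and_pad(sequence, seq_len, pad_id=0):
    """Fold a list of token ids into right-padded rows of seq_len + 1 entries.

    Hypothetical helper for illustration; not part of the documented API.
    """
    width = seq_len + 1
    # Assumption: rows start every seq_len tokens, so each row begins with
    # the last token of the previous row and no prediction step is lost.
    # Stopping before the final token avoids a trailing row holding only
    # one real token; max(..., 1) still yields one row for short inputs.
    starts = range(0, max(len(sequence) - 1, 1), seq_len)
    rows = [list(sequence[start:start + width]) for start in starts]
    return [row + [pad_id] * (width - len(row)) for row in rows]


folded = fold_and_pad(list(range(1, 11)), seq_len=4)
# folded == [[1, 2, 3, 4, 5], [5, 6, 7, 8, 9], [9, 10, 0, 0, 0]]

no_stub = fold_and_pad(list(range(1, 10)), seq_len=4)
# no_stub == [[1, 2, 3, 4, 5], [5, 6, 7, 8, 9]]; a third row would hold only 9

degenerate = fold_and_pad([7], seq_len=4)
# degenerate == [[7, 0, 0, 0, 0]]; a single real token offers nothing to predict
```

Under these assumptions, filtering out inputs with fewer than 2 elements before folding avoids rows like the last one.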

property width: Size of the output tensor in its second and last dimension.

class TestData(seqs, device='cpu', dtype=torch.float32)

Bases: TestDataBase

Wraps test and validation data to provide batches over a sample.

Parameters:

seqs (Tensor) – A PyTorch tensor with dimensions (N, S + 1) of dtype int64 or, equivalently, long, where N is the number of (padded) test/validation sequences and S is the (padded) sequence length. The “+1” is needed to provide the target for next-token prediction. Typically, this tensor will reside in CPU memory.
device (str or device, optional) – Torch device to push individual batches of data to. Defaults to “cpu”, but will typically be “cuda”.