Welcome to slangmod's documentation!
====================================

*Small language model.*

Ever wondered how large language models (LLMs) like ChatGPT, Claude, Llama, DeepSeek, *etc.*, actually work, like, *really* work? I did. And I figured there is only one way to find out: Make one yourself. From scratch. Of course, I wasn't expecting to beat the big players at their own game, but I wanted to know what you can do on consumer hardware (meaning a state-of-the-art gaming PC with a single graphics card supported by `PyTorch `_). So, naturally, it was going to be a *small* language model.

These hardware limitations are reflected in software design choices. Specifically, :mod:`slangmod` does *not* employ any type of parallelization that would keep multiple GPUs busy at the same time, and *all* training data are loaded into CPU RAM at once, to be drip-fed to the model on the GPU from there (stored as 64-bit integers at 8 bytes per token, 1 billion tokens take up about 7.5 GiB).

Having said that, :mod:`slangmod` provides everything you need to

- preprocess and clean your text corpus;
- choose and train one of the HuggingFace `tokenizers `_;
- specify a Transformer model including the type of positional encodings and the feedforward block;
- train your model with a choice of optimizers and learning-rate schedulers, employing early stopping if you like;
- monitor convergence and experiment with hyperparameters;
- explore text-generation algorithms like top-k, top-p or beam search;
- and, finally, chat with your model.

To do all these things, :mod:`slangmod` provides a command-line interface (CLI) with fine-grained configuration options on the one hand, and the raw building blocks it is made of on the other. Leveraging the foundational functionalities provided by the `swak `_ package, any other workflow can thus be quickly coded up.

.. toctree::
   :hidden:
   :maxdepth: 1
   :caption: Usage

   usage/installation
   usage/configuration
   usage/data
   usage/strategy

.. toctree::
   :hidden:
   :maxdepth: 1
   :caption: CLI Reference

   cli/clean
   cli/tokenize
   cli/encode
   cli/train
   cli/chat

.. toctree::
   :hidden:
   :maxdepth: 1
   :caption: API Reference

   api/io
   api/etl
   api/ml

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`