The llm-from-scratch repository offers an in-depth exploration of Large Language Models (LLMs) through both theoretical insights and practical implementations. By building models from the ground up, it aims to develop a thorough understanding of their architectures, training pipelines, and optimization techniques. Each component comes with detailed explanations of the mathematical principles, intuition, and design choices behind it.
Key Features
Architectures
- llama3: A baseline transformer decoder that incorporates grouped query attention and rotary embeddings, forming the foundation for modern LLMs (see the attention sketch after this list).
- mixtral8x7b: Introduces a sparse mixture of experts, activating only a few experts per token to increase model capacity without a proportional rise in compute (routing sketched below).
- deepseekv3: Implements multi-head latent attention to compress the key-value (KV) cache for efficient long-context processing, combined with a load-balanced mixture of experts.
- kimik2: Targets extremely long contexts, tuning RoPE parameters and attention patterns to manage sequences extending into millions of tokens.
- paligemma2: A vision-language model that processes both images and text through a unified transformer architecture.
- stable diffusion: Generates images from pure noise through an iterative denoising process.
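To make the first entry concrete, here is a minimal sketch of grouped query attention combined with rotary embeddings, assuming hypothetical head counts and dimensions; it illustrates the technique rather than the repository's implementation.

```python
import torch
import torch.nn.functional as F

def rotary_embedding(x, base=10000.0):
    # x: (batch, heads, seq, head_dim); rotate channel pairs by position-dependent angles
    *_, t, d = x.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2).float() / d))
    angles = torch.outer(torch.arange(t).float(), inv_freq)   # (seq, d/2)
    cos = torch.cat([angles.cos(), angles.cos()], dim=-1)     # (seq, d)
    sin = torch.cat([angles.sin(), angles.sin()], dim=-1)
    x1, x2 = x[..., : d // 2], x[..., d // 2 :]
    return x * cos + torch.cat([-x2, x1], dim=-1) * sin       # "rotate half" form of RoPE

def grouped_query_attention(q, k, v, n_kv_heads):
    # q: (batch, n_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim)
    group = q.shape[1] // n_kv_heads
    k = k.repeat_interleave(group, dim=1)   # each kv head is shared by a group of query heads
    v = v.repeat_interleave(group, dim=1)
    q, k = rotary_embedding(q), rotary_embedding(k)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

q = torch.randn(1, 8, 16, 64)   # 8 query heads
k = torch.randn(1, 2, 16, 64)   # 2 shared key/value heads
v = torch.randn(1, 2, 16, 64)
print(grouped_query_attention(q, k, v, n_kv_heads=2).shape)   # torch.Size([1, 8, 16, 64])
```

Sharing two key/value heads across eight query heads shrinks the KV cache fourfold while keeping full per-head query resolution.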
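Similarly, a minimal sketch of the top-k routing behind a sparse mixture of experts such as mixtral8x7b; the module name, expert sizes, and top-k value are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, dim=128, hidden=512, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )

    def forward(self, x):                          # x: (tokens, dim)
        logits = self.router(x)                    # (tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # renormalize over the selected experts only
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)   # tokens routed to expert e
            if token_ids.numel() == 0:
                continue
            w = weights[token_ids, slot].unsqueeze(1)
            out.index_add_(0, token_ids, w * expert(x[token_ids]))
        return out

moe = SparseMoE()
print(moe(torch.randn(10, 128)).shape)             # torch.Size([10, 128])
```

Only the selected experts run for each token, so per-token compute stays roughly constant as the number of experts grows.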
Training Pipelines
- Pretraining: Trains models on extensive text corpora to predict the next token, teaching them language patterns, factual knowledge, and reasoning capabilities.
- Supervised Finetuning (SFT): Trains on prompt-response pairs, computing loss only on the response tokens to teach models conversational skills and instruction following (see the loss-masking sketch after this list).
- Reinforcement Learning (RL): Uses direct preference optimization (DPO) to align model responses with human preferences, comparing chosen versus rejected responses for stability and simplicity (loss sketched below).
- Distillation: Compresses larger teacher models into smaller students by training on the teacher's soft probability distributions instead of hard labels (see the distillation sketch after this list).
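As a concrete view of the SFT loss masking mentioned above, the sketch below excludes prompt tokens from the cross-entropy loss; the sequence length, prompt length, and vocabulary size are hypothetical.

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 100, 8
logits = torch.randn(1, seq_len, vocab_size)          # model output for one sequence
tokens = torch.randint(0, vocab_size, (1, seq_len))   # prompt + response token ids
prompt_len = 5

# Next-token prediction: position t predicts token t + 1.
labels = tokens[:, 1:].clone()
labels[:, : prompt_len - 1] = -100                    # -100 is ignored by cross_entropy
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size), labels.reshape(-1), ignore_index=-100
)
print(loss.item())
```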
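The DPO objective referenced in the RL entry reduces to a logistic loss on log-probability ratios between the policy and a frozen reference model; the per-sequence log-probabilities and beta value below are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Log-ratios of policy vs. frozen reference for the chosen and rejected responses.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Maximize the margin between chosen and rejected via a logistic loss.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Summed token log-probs for a small batch of preference pairs.
loss = dpo_loss(torch.tensor([-10.0, -12.0]), torch.tensor([-15.0, -14.0]),
                torch.tensor([-11.0, -12.5]), torch.tensor([-14.0, -13.5]))
print(loss.item())
```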
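And a minimal sketch of the distillation loss, blending a temperature-scaled KL term on the teacher's soft distribution with the usual hard-label cross-entropy; the temperature and mixing weight are arbitrary example choices.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                    # rescale gradients to match the hard-label term
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student = torch.randn(4, 100)    # (batch, vocab)
teacher = torch.randn(4, 100)
labels = torch.randint(0, 100, (4,))
print(distillation_loss(student, teacher, labels).item())
```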
Optimization Techniques
- LoRA: Enables efficient finetuning by inserting small low-rank adapters into the attention layers, so training updates less than 1% of the parameters (see the adapter sketch after this list).
- Inference optimizations: Covers INT8 quantization to shrink model size, batched inference to improve throughput, and efficient KV cache management (quantization sketched below).
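To illustrate the LoRA adapters, here is a minimal sketch of a low-rank adapter wrapped around a frozen linear layer; the class name, rank, and scaling factor are assumptions rather than the repository's code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                # freeze the pretrained weights
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank                  # lora_b starts at zero, so the initial output is unchanged

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable parameters: {trainable} of {total}")   # only the low-rank adapters train
```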
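And a minimal sketch of symmetric per-channel INT8 weight quantization, one of the inference optimizations above; the random weight matrix is purely illustrative.

```python
import torch

def quantize_int8(w: torch.Tensor):
    # One scale per output channel keeps the error lower than a single per-tensor scale.
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor):
    return q.float() * scale

w = torch.randn(512, 512)
q, scale = quantize_int8(w)
print("max abs error:", (dequantize_int8(q, scale) - w).abs().max().item())
print("bytes:", q.numel() * q.element_size(), "vs", w.numel() * w.element_size())   # 4x smaller
```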
Fundamentals
- PyTorch: Builds the foundational concepts from scratch, including tensors, autograd, modules, optimizers, and training loops (see the minimal training loop after this list).
- CNNs: Covers convolutional neural networks for vision tasks, with practical implementations including convolution, pooling, and ResNet blocks for image classification (e.g., on CIFAR10).
- LSTMs: Explores recurrent networks, implementing vanilla RNNs and LSTM cells for tasks such as sequence prediction and sentiment analysis.
- VAEs: Covers variational autoencoders, which learn latent spaces for generative modeling and the creation of new samples (sketched below).
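As a reference for the PyTorch fundamentals listed above, here is a minimal training loop on a toy regression problem, showing tensors, autograd, a module, and an optimizer working together; the model and data are invented for the example.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

x = torch.linspace(-1, 1, 128).unsqueeze(1)
y = 3 * x + 0.5 + 0.1 * torch.randn_like(x)      # noisy line to fit

for step in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()                               # autograd computes gradients
    optimizer.step()                              # optimizer updates parameters
print(f"final loss: {loss.item():.4f}")
```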
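And a minimal sketch of a variational autoencoder with the reparameterization trick and an ELBO-style loss; the layer sizes and latent dimension are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, in_dim=784, latent=16):
        super().__init__()
        self.enc = nn.Linear(in_dim, 256)
        self.mu, self.logvar = nn.Linear(256, latent), nn.Linear(256, latent)
        self.dec = nn.Sequential(nn.Linear(latent, 256), nn.ReLU(), nn.Linear(256, in_dim))

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
        return self.dec(z), mu, logvar

def elbo_loss(x, recon, mu, logvar):
    recon_loss = F.mse_loss(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())   # KL to the unit Gaussian prior
    return recon_loss + kl

vae = VAE()
x = torch.rand(8, 784)
recon, mu, logvar = vae(x)
print(elbo_loss(x, recon, mu, logvar).item())
```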
This repository serves as a valuable resource for those seeking to deepen their machine learning expertise, offering intricate details about various models, training strategies, optimization techniques, and foundational concepts.