SCAO — Sparse Curvature-Aware Adaptive Optimizer
SCAO is a sparse, second-order optimizer for PyTorch, built as a drop-in replacement for AdamW. By exploiting correlations among model parameters that diagonal methods ignore, it delivers Shampoo-quality preconditioning with minimal overhead and, in our benchmarks, 54% higher training throughput than AdamW.
The Challenge of Current Optimizers
Training large neural networks today relies almost entirely on first-order methods, with AdamW as the de facto standard. AdamW, however, approximates the loss curvature with a diagonal matrix, which ignores the correlations between network parameters. That approximation becomes increasingly costly as models grow in size and complexity.
The SCAO Advantage
SCAO introduces significant innovations to help overcome these limitations:
- Adaptive Rank Selection: Instead of storing full curvature matrices, SCAO retains only the top-k eigenvectors that capture over 95% of the spectral mass, drastically reducing memory usage:

  k* = argmin k such that Σᵢ₌₁ᵏ λᵢ / Σⱼ λⱼ ≥ 1 − ε

- Sparse Block-Diagonal FIM: For layers whose dimensions exceed a configurable threshold, SCAO falls back to a diagonal curvature approximation, keeping memory and compute manageable while remaining adaptive at scale.
- Phase-Transition Stability: Switching from Adam to SCAO preconditioning is a delicate step. SCAO keeps it stable by initializing the Kronecker factors so they are not rank deficient and by gradually blending the Adam update with the preconditioned gradient.
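The adaptive rank rule above can be sketched in a few lines. This is an illustrative NumPy implementation of the k* formula, not SCAO's actual code; the function name `select_rank` and the example spectrum are assumptions:

```python
import numpy as np

def select_rank(eigvals, eps=0.05):
    """Smallest k whose top-k eigenvalues capture >= (1 - eps) of spectral mass.

    Illustrative sketch of SCAO's adaptive rank rule (not the library's code).
    `eigvals` are the non-negative eigenvalues of a curvature factor.
    """
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]  # descending
    cum = np.cumsum(lam) / lam.sum()                       # cumulative mass
    # argmin k s.t. cumulative spectral mass >= 1 - eps (1-indexed rank)
    return int(np.searchsorted(cum, 1.0 - eps) + 1)

# Example: a fast-decaying spectrum needs only a small k
k = select_rank([10.0, 5.0, 1.0, 0.5, 0.1], eps=0.05)
```

With eps = 0.05 this matches the "over 95% of spectral mass" criterion: the top-3 eigenvalues above hold about 96% of the total, so k* = 3.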
Algorithm Overview
SCAO operates in two distinct phases:
- Phase 1: A warmup period using standard Adam or AdamW updates to establish baseline gradient behavior; curvature estimates are accumulated during this phase.
- Phase 2: SCAO's curvature preconditioning takes over. The preconditioner is refreshed at a fixed interval (precond_freq), and Tikhonov regularization plus eigenvalue truncation keep the projected gradients well conditioned.
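The Phase 2 step can be sketched as a rank-k, damped inverse-curvature projection. The function below is a simplified NumPy illustration of eigenvalue truncation plus Tikhonov regularization under assumed names and shapes; SCAO's real update also maintains the curvature factors between refreshes:

```python
import numpy as np

def precondition(grad, curvature, k=2, damping=1e-4):
    """Apply a rank-k, Tikhonov-damped inverse-curvature projection to grad.

    Sketch only (names and shapes are assumptions): eigendecompose the
    curvature estimate, keep the top-k eigenpairs, and rescale each retained
    direction by 1 / (lambda + damping).
    """
    lam, V = np.linalg.eigh(curvature)          # ascending eigenvalues
    lam, V = lam[::-1], V[:, ::-1]              # sort descending
    lam_k, V_k = lam[:k], V[:, :k]              # eigenvalue truncation
    coeffs = V_k.T @ grad                       # project onto top-k basis
    return V_k @ (coeffs / (lam_k + damping))   # Tikhonov-regularized rescale

# Example: with identity curvature and no damping, the gradient is unchanged
g = np.array([1.0, 2.0, 3.0])
step = precondition(g, np.eye(3), k=3, damping=0.0)
```

The damping term plays the role of the Tikhonov regularizer: it bounds the per-direction step size when an eigenvalue is near zero, which is exactly where a raw inverse would blow up.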
Experimental Results
On benchmarks such as WikiText-2, SCAO outperformed AdamW on both speed and quality:
- 54% higher training throughput than AdamW.
- Consistently lower average training loss across iterations.
Together, these results indicate that SCAO reduces training time without sacrificing model quality.
Practical Implementation
Using SCAO is intuitive—transitioning from AdamW only requires minimal code changes:
optimizer = SCAO(model.parameters(), lr=3.5e-4, weight_decay=0.1,
                 warmup_steps=100, precond_freq=10)
Conclusion
By combining low-rank curvature estimation with careful stabilization, SCAO offers a strong alternative to first-order optimizers for large-scale neural network training: it speeds up convergence while preserving high-quality parameter updates.