Computational and Applied Mathematics Seminar
Oct 8, 2025

Description
Quantization compresses neural networks by representing their weights and activations with a small number of bits, reducing memory, computation time, and energy consumption while preserving inference accuracy.
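To make the basic operation concrete, here is a minimal sketch of round-to-nearest uniform quantization of a weight matrix. The 4-bit setting, the max-abs scale, and the name quantize_uniform are illustrative assumptions, not details from the talk.

import numpy as np

def quantize_uniform(W, bits=4):
    # Round each weight to the nearest level of a symmetric uniform grid.
    # Per-tensor max-abs scaling is an assumption made for illustration.
    levels = 2 ** bits
    scale = np.max(np.abs(W)) / (levels / 2 - 1)            # grid spacing
    Q = np.clip(np.round(W / scale), -(levels // 2), levels // 2 - 1)
    return Q * scale                                         # dequantized weights

W = np.random.randn(256, 256)
W_hat = quantize_uniform(W, bits=4)
print("relative error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))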
In general, though, the underlying optimization problems are NP-hard, so one must settle for computationally efficient approximate solutions, ideally ones with theoretical error guarantees. We analyze OPTQ, a widely used quantization algorithm, and provide new theory: an error-evolution identity, layerwise error bounds, and theoretical justification for heuristics used in practice, including feature ordering, regularization, and alphabet size. We further study a stochastic variant that yields entrywise control of the error. With these results in hand, we introduce Qronos, a new related algorithm that first corrects errors inherited from previous layers and thus attains stronger guarantees. We conclude with numerical results on modern language models.
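For flavor, the sketch below quantizes a single neuron's weights sequentially, using calibration data to feed the accumulated output error back into each rounding decision. It is in the spirit of the sequential, error-correcting algorithms discussed in the talk, but the specific update rule, the 3-bit alphabet, and the names greedy_quantize_layer, X, and w are assumptions; it is not the exact OPTQ or Qronos procedure.

import numpy as np

def greedy_quantize_layer(X, w, alphabet):
    # Quantize one neuron's weights w using calibration data X (n x d).
    # Each weight is rounded so that the running output residual
    # u = X @ w_seen - X @ q_seen stays small (error feedback).
    n, d = X.shape
    q = np.zeros(d)
    u = np.zeros(n)                                  # running output residual
    for t in range(d):
        u += X[:, t] * w[t]                          # add true contribution
        c = X[:, t] @ u / (X[:, t] @ X[:, t] + 1e-12)
        q[t] = alphabet[np.argmin(np.abs(alphabet - c))]   # nearest level
        u -= X[:, t] * q[t]                          # subtract quantized contribution
    return q, np.linalg.norm(u)                      # quantized weights, output error

# Tiny usage example with a 3-bit symmetric alphabet (all values assumed).
rng = np.random.default_rng(0)
X = rng.standard_normal((512, 64))
w = rng.standard_normal(64) / 8
alphabet = 0.25 * np.arange(-4, 4)
q, err = greedy_quantize_layer(X, w, alphabet)
print("output error ||Xw - Xq||:", err)

The point this sketch is meant to illustrate is that feeding earlier rounding errors into later decisions is what lets sequential schemes outperform plain round-to-nearest.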