Quantization

Neural Networks

Reducing numerical precision to shrink models and speed up inference

What is Quantization?

Quantization converts weights and activations from 32-bit floating point to lower-precision formats, typically 16-bit floats (FP16/BF16) or 8-bit integers (INT8). Hardware with native low-precision support can then boost throughput and cut memory use.
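To make the idea concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in NumPy. The function names (quantize_int8, dequantize) are illustrative, not from any particular library: each FP32 value is scaled onto the signed range [-127, 127], rounded, and later recovered approximately by multiplying back by the scale.

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Map FP32 weights onto the signed INT8 range [-127, 127]."""
    # One scale for the whole tensor; the epsilon guards against all-zero input.
    scale = max(float(np.abs(weights).max()) / 127.0, 1e-12)
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate FP32 values from INT8 codes."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, scale)).max())
```

The rounding step is where precision is lost; the error per value is bounded by half the scale, which is why tensors with large outliers quantize poorly under a single per-tensor scale.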

Real-World Examples

  • INT8 inference on CPUs/TPUs (see the sketch after this list)
  • Mixed-precision training using FP16
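As one concrete illustration of the first bullet, PyTorch's dynamic quantization API converts a model's Linear layers to INT8 for CPU inference; the toy model below is purely illustrative.

```python
import torch
import torch.nn as nn

# A small FP32 model standing in for a real network.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).eval()

# Replace Linear layers with INT8 versions; activations are
# quantized on the fly at inference time.
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(qmodel(x).shape)  # same interface, smaller and faster on CPU
```

Dynamic quantization needs no calibration data because activation scales are computed at runtime, which makes it a common first step before trying static (calibrated) quantization.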