Mixed Precision

Neural Networks

Using lower-precision arithmetic where it is numerically safe, to speed up training and inference

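Combines 16-bit (FP16) and 32-bit (FP32) floating point. Most computation and stored activations use FP16, which reduces memory use and increases throughput. A master copy of the weights and numerically sensitive operations stay in FP32. Loss scaling multiplies the loss, and therefore the gradients, by a large factor so that small gradient values do not underflow to zero in FP16. The gradients are divided by the same factor in FP32 before the weight update, which preserves model accuracy.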
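The two failure modes that mixed precision guards against can be reproduced on a CPU. The sketch below is a minimal illustration, not a real training loop. It emulates FP16 and FP32 storage by rounding Python floats through the standard `struct` module's `'e'` (binary16) and `'f'` (binary32) formats. It assumes a static loss scale of 2**16 and plain SGD, and the helper names (`to_fp16`, `to_fp32`, `scaled_fp16_grad`, `sgd_step`) are hypothetical. Real frameworks typically adjust the loss scale dynamically during training.

```python
import struct


def to_fp16(x: float) -> float:
    # Round-trip through IEEE 754 binary16 to emulate storing x in FP16.
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    # Round-trip through IEEE 754 binary32 to emulate storing x in FP32.
    return struct.unpack("<f", struct.pack("<f", x))[0]


def scaled_fp16_grad(grad: float, loss_scale: float) -> float:
    """Hypothetical static loss scaling: store grad * scale in FP16,
    then unscale in FP32 before it is used for the update."""
    scaled = to_fp16(grad * loss_scale)
    return to_fp32(scaled / loss_scale)


def sgd_step(weight: float, grad: float, lr: float, to_precision) -> float:
    # One plain SGD update, with the new weight rounded to the given precision.
    return to_precision(weight - lr * grad)


if __name__ == "__main__":
    tiny_grad = 1e-8
    # Without scaling, the gradient is below FP16's smallest subnormal and becomes 0.0.
    print(to_fp16(tiny_grad))
    # With scaling, it stays in FP16's representable range and survives unscaling.
    print(scaled_fp16_grad(tiny_grad, 2.0 ** 16))

    # Small updates to a weight near 1.0 are rounded away in FP16
    # but accumulate in an FP32 master copy.
    w16 = w32 = 1.0
    for _ in range(10):
        w16 = sgd_step(w16, -1.0, 1e-4, to_fp16)
        w32 = sgd_step(w32, -1.0, 1e-4, to_fp32)
    print(w16, w32)
```

Near 1.0, adjacent FP16 values are about 0.001 apart, so an update of 0.0001 rounds back to 1.0 on every step. This is why the weights are usually kept in FP32 even when the forward and backward passes run in FP16.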

Learn more about concepts related to Mixed Precision

Quantization

Reducing numerical precision to shrink models and speed inference

GPU

Graphics Processing Unit, a processor built for highly parallel computation

CUDA

Parallel computing platform and API for NVIDIA GPUs