Neural Network Quantization
This article is a short summary of quantization techniques.
- What is quantization?
- Fake Quant
- Min & Max
- Quantization for LSTM/RNN/GRU
What is quantization?
Quantization maps values from a 32-bit floating-point representation down to an 8-bit fixed-point representation.
- Arithmetic with lower bit-depth is faster
- In moving from 32 bits to 8 bits, we get an (almost) 4x reduction in memory straight away: less storage space and less bandwidth required.
- Floating point arithmetic is not supported on some embedded devices.
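The float-to-int8 mapping above can be sketched as an affine quantization scheme: pick a scale and zero-point from the observed min/max range, round into 8-bit integers, and map back. This is a minimal NumPy sketch under that assumption; the function names are illustrative, not from any particular library.

```python
import numpy as np

def quantize_params(x, num_bits=8):
    """Derive scale and zero-point from the observed min/max range."""
    qmin, qmax = 0, 2**num_bits - 1
    x_min = min(float(x.min()), 0.0)  # range must include zero
    x_max = max(float(x.max()), 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    return scale, zero_point

def quantize(x, scale, zero_point):
    """Map FP32 values onto the 8-bit integer grid."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, 0, 255).astype(np.uint8)

def dequantize(q, scale, zero_point):
    """Map 8-bit integers back to (approximate) FP32 values."""
    return scale * (q.astype(np.float32) - zero_point)

x = np.array([-3.0, -1.5, 0.0, 1.5, 3.0], dtype=np.float32)
scale, zp = quantize_params(x)
q = quantize(x, scale, zp)        # stored/computed as uint8
x_hat = dequantize(q, scale, zp)  # reconstruction error is at most one step
```

Storing `q` instead of `x` is where the ~4x memory saving comes from; the reconstruction `x_hat` differs from `x` by at most one quantization step.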
Why does it work?
- DNNs are known to be quite robust to noise and other small perturbations once trained.
- The weights and activations of a particular layer often tend to lie in a small range, which can be estimated beforehand. This means we don't need the ability to store 10⁶ and 1/10⁶ in the same data type, and can instead concentrate our few precious bits within a smaller range, say -3 to +3.
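A quick back-of-the-envelope check makes the second point concrete: the resolution of an 8-bit grid depends entirely on the range it has to cover. A small sketch (the helper name is illustrative):

```python
def step_size(x_min, x_max, num_bits=8):
    """Smallest representable difference when num_bits cover [x_min, x_max]."""
    return (x_max - x_min) / (2**num_bits - 1)

# 8 bits concentrated on a pre-estimated range of -3 to +3:
fine = step_size(-3.0, 3.0)      # ~0.0235, fine enough for DNN weights
# the same 8 bits stretched over a huge range:
coarse = step_size(-1e6, 1e6)    # ~7843, far too coarse to be useful
```

This is why estimating the range beforehand matters: the same 256 levels give a usable resolution only when the range is tight.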
Why still train in FP32?
Models are trained using very tiny gradient updates, for which we do need high precision.
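This is where the "Fake Quant" item from the outline fits in: during training, the forward pass simulates 8-bit rounding by quantizing and immediately dequantizing, while the stored weights stay in FP32 so the tiny gradient updates can still accumulate. A minimal NumPy sketch, assuming the same affine scheme as above (the function name is illustrative):

```python
import numpy as np

def fake_quant(x, x_min, x_max, num_bits=8):
    """Quantize then immediately dequantize, so the forward pass sees
    the 8-bit rounding error while values remain in floating point
    (and the master weights stay FP32 for gradient updates)."""
    qmax = 2**num_bits - 1
    scale = (x_max - x_min) / qmax
    zero_point = round(-x_min / scale)
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax)
    return scale * (q - zero_point)

w = np.array([-0.51, 0.02, 0.49], dtype=np.float32)
# Still floating point, but snapped to the 8-bit grid over [-1, 1]:
w_q = fake_quant(w, -1.0, 1.0)
```

In practice the gradient is passed straight through the rounding step (the straight-through estimator), so the network learns weights that remain accurate after real quantization at inference time.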