# Neural Network Quantization

This article is a short summary of quantization techniques.

- What is quantization?
- Fake Quant
- Min & Max
- Quantization for LSTM/RNN/GRU

## What is quantization?

Quantization means converting a network's weights and activations from a 32-bit floating-point representation to a lower-precision one, most commonly an 8-bit fixed-point representation.
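A common scheme is affine (asymmetric) quantization: pick a scale and a zero-point that map the tensor's observed float range onto the integer grid. Below is a minimal NumPy sketch; the function names and the uint8 target are illustrative choices, not a fixed API:

```python
import numpy as np

def quantize(x, num_bits=8):
    # Map the observed float range [x.min(), x.max()] onto the
    # integer range [0, 2^num_bits - 1] via a scale and a zero-point.
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = max((x.max() - x.min()) / (qmax - qmin), 1e-8)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    # Approximate reconstruction: r = scale * (q - zero_point).
    return scale * (q.astype(np.float32) - zero_point)

x = np.random.randn(5).astype(np.float32)
q, scale, zp = quantize(x)
print(x)
print(dequantize(q, scale, zp))  # matches x up to quantization error
```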

## Why Quantization?

- **Arithmetic with lower bit-depth is faster.**
- **(Almost) 4x reduction in memory.** Moving from 32 bits to 8 bits shrinks the model straight away: less storage space, less memory bandwidth required.
- **Floating-point arithmetic is not supported on some embedded devices.**

## Why does it work?

- First, DNNs are known to be quite robust to noise and other small perturbations once trained, and quantization error acts like just another small perturbation.
- Second, the weights and activations of a particular layer tend to lie in a small range, which can be estimated beforehand. This means we don't need the ability to store 10⁶ and 1/10⁶ in the same data type, allowing us to concentrate our precious few bits within a smaller range, say -3 to +3 (the quick numeric check after this list makes this concrete).
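A hedged sketch of the second point, comparing the worst-case 8-bit error when the range is matched to the data versus wildly oversized (the ranges and sample data are illustrative):

```python
import numpy as np

def max_quant_error(x, lo, hi, levels=256):
    # Uniformly quantize x to `levels` values over [lo, hi] and
    # report the worst-case reconstruction error.
    scale = (hi - lo) / (levels - 1)
    q = np.clip(np.round((x - lo) / scale), 0, levels - 1)
    return np.abs(x - (lo + q * scale)).max()

x = np.clip(np.random.randn(10_000), -3, 3).astype(np.float32)

print(max_quant_error(x, -3.0, 3.0))   # ~0.012: bits spent where the values live
print(max_quant_error(x, -1e6, 1e6))   # ~3900: bits wasted on an empty range
```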

## Why still train in FP32?

Models are trained using very tiny gradient updates, for which we *do* need high precision.
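A tiny demonstration of why: an update much smaller than the representable step simply disappears in low precision (the numbers here are illustrative):

```python
import numpy as np

w, step = 0.5, 1e-4   # a weight and a typical tiny gradient update

print(np.float32(w) + np.float32(step))   # 0.5001 -- FP32 resolves the step
print(np.float16(w) + np.float16(step))   # 0.5    -- FP16 rounds it away

# An 8-bit fixed-point grid over [-3, 3] is far coarser still:
# adjacent values are 6/255 ~= 0.024 apart, ~240x larger than this update.
```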

## Fake Quantization
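Fake quantization is the trick behind quantization-aware training: in the forward pass, weights and activations are quantized and immediately dequantized, so the network trains against the rounding error it will see at inference time while the stored weights stay FP32. In the backward pass, gradients flow through the rounding op as if it were the identity (the straight-through estimator). A minimal sketch of the forward-pass op, assuming a fixed min/max range:

```python
import numpy as np

def fake_quant(x, x_min, x_max, num_bits=8):
    # Quantize and immediately dequantize: the tensor stays float,
    # but it can now only take values on the 2^num_bits-level grid.
    levels = 2 ** num_bits - 1
    scale = (x_max - x_min) / levels
    q = np.clip(np.round((x - x_min) / scale), 0, levels)
    return x_min + q * scale

w = np.random.randn(4).astype(np.float32)
print(w)
print(fake_quant(w, -3.0, 3.0))  # snapped to the 8-bit grid, still float
```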

## Quantization in TensorFlow
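One building block in core TensorFlow is the fake-quant op, which implements the quantize-dequantize behavior above inside a graph. A short sketch (TF2 API; the [-3, 3] range is an illustrative choice):

```python
import tensorflow as tf

x = tf.random.normal([4])

# Simulate 8-bit quantization of a tensor; for gradients the op
# behaves as the identity (straight-through estimator).
y = tf.quantization.fake_quant_with_min_max_args(x, min=-3.0, max=3.0, num_bits=8)
print(y)  # values snapped to the 256-level grid over [-3, 3]
```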

## Quantization in TensorFlow Lite
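TensorFlow Lite exposes post-training quantization through its converter. A minimal sketch; the model path and output filename are placeholders:

```python
import tensorflow as tf

# Convert a SavedModel to TFLite with post-training quantization enabled.
converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

# The result is a byte buffer ready to ship to a mobile/embedded runtime.
with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```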

## References

- 8-Bit Quantization and TensorFlow Lite, Manas Sahni
- How to Quantize Neural Networks with TensorFlow, Pete Warden