# Neural Network Quantization

This article is a short summary of quantization techniques.

- What is quantization?
- Fake Quant
- Min & Max
- Quantization for LSTM/RNN/GRU

## What is quantization?

Quantization means converting a network's weights and activations from a 32-bit floating-point representation to a lower-precision one, most commonly an 8-bit fixed-point representation.
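A common scheme is affine (asymmetric) quantization: pick a scale and a zero-point that map the tensor's observed float range onto the integer grid. Below is a minimal NumPy sketch; the function names and the uint8 target are illustrative choices, not a fixed API:

```python
import numpy as np

def quantize(x, num_bits=8):
    # Map the observed float range [x.min(), x.max()] onto the
    # integer range [0, 2^num_bits - 1] via a scale and a zero-point.
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = max((x.max() - x.min()) / (qmax - qmin), 1e-8)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    # Approximate reconstruction: r = scale * (q - zero_point).
    return scale * (q.astype(np.float32) - zero_point)

x = np.random.randn(5).astype(np.float32)
q, scale, zp = quantize(x)
print(x)
print(dequantize(q, scale, zp))  # matches x up to quantization error
```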

## Why Quantization?

- **Arithmetic with lower bit-depth is faster.**
- **(Almost) 4x reduction in memory.** Moving from 32 bits to 8 bits shrinks the model straight away: less storage space, less memory bandwidth required.
- **Floating-point arithmetic is not supported on some embedded devices.**

## Why does it work?

- First, DNNs are known to be quite robust to noise and other small perturbations once trained, and quantization error acts like just another small perturbation.
- Second, the weights and activations of a particular layer tend to lie in a small range, which can be estimated beforehand. This means we don't need the ability to store 10⁶ and 1/10⁶ in the same data type, allowing us to concentrate our precious few bits within a smaller range, say -3 to +3 (the quick numeric check after this list makes this concrete).
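A hedged sketch of the second point, comparing the worst-case 8-bit error when the range is matched to the data versus wildly oversized (the ranges and sample data are illustrative):

```python
import numpy as np

def max_quant_error(x, lo, hi, levels=256):
    # Uniformly quantize x to `levels` values over [lo, hi] and
    # report the worst-case reconstruction error.
    scale = (hi - lo) / (levels - 1)
    q = np.clip(np.round((x - lo) / scale), 0, levels - 1)
    return np.abs(x - (lo + q * scale)).max()

x = np.clip(np.random.randn(10_000), -3, 3).astype(np.float32)

print(max_quant_error(x, -3.0, 3.0))   # ~0.012: bits spent where the values live
print(max_quant_error(x, -1e6, 1e6))   # ~3900: bits wasted on an empty range
```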

## Why still train in FP32?

Models are trained using very tiny gradient updates, for which we *do* need high precision.
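A tiny demonstration of why: an update much smaller than the representable step simply disappears in low precision (the numbers here are illustrative):

```python
import numpy as np

w, step = 0.5, 1e-4   # a weight and a typical tiny gradient update

print(np.float32(w) + np.float32(step))   # 0.5001 -- FP32 resolves the step
print(np.float16(w) + np.float16(step))   # 0.5    -- FP16 rounds it away

# An 8-bit fixed-point grid over [-3, 3] is far coarser still:
# adjacent values are 6/255 ~= 0.024 apart, ~240x larger than this update.
```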

## Fake Quantization
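Fake quantization is the trick behind quantization-aware training: in the forward pass, weights and activations are quantized and immediately dequantized, so the network trains against the rounding error it will see at inference time while the stored weights stay FP32. In the backward pass, gradients flow through the rounding op as if it were the identity (the straight-through estimator). A minimal sketch of the forward-pass op, assuming a fixed min/max range:

```python
import numpy as np

def fake_quant(x, x_min, x_max, num_bits=8):
    # Quantize and immediately dequantize: the tensor stays float,
    # but it can now only take values on the 2^num_bits-level grid.
    levels = 2 ** num_bits - 1
    scale = (x_max - x_min) / levels
    q = np.clip(np.round((x - x_min) / scale), 0, levels)
    return x_min + q * scale

w = np.random.randn(4).astype(np.float32)
print(w)
print(fake_quant(w, -3.0, 3.0))  # snapped to the 8-bit grid, still float
```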

## Quantization in TensorFlow
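One building block in core TensorFlow is the fake-quant op, which implements the quantize-dequantize behavior above inside a graph. A short sketch (TF2 API; the [-3, 3] range is an illustrative choice):

```python
import tensorflow as tf

x = tf.random.normal([4])

# Simulate 8-bit quantization of a tensor; for gradients the op
# behaves as the identity (straight-through estimator).
y = tf.quantization.fake_quant_with_min_max_args(x, min=-3.0, max=3.0, num_bits=8)
print(y)  # values snapped to the 256-level grid over [-3, 3]
```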

## Quantization in TensorFlow Lite
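TensorFlow Lite exposes post-training quantization through its converter. A minimal sketch; the model path and output filename are placeholders:

```python
import tensorflow as tf

# Convert a SavedModel to TFLite with post-training quantization enabled.
converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

# The result is a byte buffer ready to ship to a mobile/embedded runtime.
with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```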

## References

- 8-Bit Quantization and TensorFlow Lite, Manas Sahni
- How to Quantize Neural Networks with TensorFlow, Pete Warden