
Quantisation Notes

Comprehensive notes on quantisation

Introduction

There are several ways to optimise a model: GPTQ, activation-aware quantization (AWQ), bitsandbytes, higher-level packages such as Hugging Face Optimum, or the PyTorch quantization API itself.
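As a minimal sketch of the bitsandbytes route (the checkpoint name below is illustrative, and the transformers, accelerate, and bitsandbytes packages are assumed to be installed):

```python
# Minimal sketch: 8-bit loading with bitsandbytes via transformers.
# Assumes transformers, accelerate, and bitsandbytes are installed and a
# CUDA GPU is available; the checkpoint name is purely illustrative.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

config = BitsAndBytesConfig(load_in_8bit=True)  # quantise linear layers to int8
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",          # illustrative checkpoint
    quantization_config=config,   # bitsandbytes performs the 8-bit conversion
    device_map="auto",            # place layers on available devices
)
```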

Benefits of Scripting Models

Scripting the model can make inference faster for a few reasons:

  • Reduced Overhead: Scripted models run in the TorchScript interpreter rather than the Python interpreter, removing per-call Python overhead (including the GIL) from the model's forward pass. This can lead to faster execution times.
  • Optimizations: When you use torch.jit.script to script a model, PyTorch applies various optimizations to the script, such as constant folding and operator fusion, which can improve performance during inference.
  • Parallelization: Scripted models can take advantage of parallelization opportunities more effectively, especially when deployed on hardware accelerators like GPUs, due to the way the operations are organized and optimized in the script.
  • Serialization: Scripted models can be serialized and deserialized more efficiently, which is important for deployment scenarios where models need to be loaded quickly into memory.
  • Platform Independence: Once compiled, scripted models can be executed in any environment that provides the PyTorch runtime, including C++ via libtorch, without the original Python code, which is beneficial for deployment in different environments.

Overall, scripting a model can lead to faster inference times due to these optimizations and efficiencies, especially in production environments where speed and resource usage are critical.
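As a minimal sketch of scripting (the tiny model below is illustrative; any nn.Module with a scriptable forward works the same way):

```python
import torch
import torch.nn as nn

# Illustrative model; any nn.Module with a scriptable forward works.
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(16, 4)

    def forward(self, x):
        return torch.relu(self.fc(x))

model = TinyNet().eval()
scripted = torch.jit.script(model)      # compile to TorchScript
scripted.save("tiny_net.pt")            # serialise for deployment
loaded = torch.jit.load("tiny_net.pt")  # reload without the Python class
output = loaded(torch.randn(1, 16))
```

Because the saved file carries its own graph, it can also be loaded from C++ via libtorch, which is what the platform-independence point above refers to.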

 

Quantization

Quantization is the process of converting a model to use fewer bits to represent weights and activations, usually from 32-bit floating point to 8-bit integers (or even lower bit representations). This can significantly reduce the model size and improve inference speed, especially on hardware that supports low-precision operations efficiently.
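The quickest way to try this is dynamic quantization, which quantizes the weights ahead of time and the activations on the fly. A minimal sketch, assuming a plain float model:

```python
import torch
import torch.nn as nn

# Illustrative float model; dynamic quantization targets Linear/LSTM layers.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4)).eval()

# Replace Linear weights with int8; activations are quantized at runtime.
qmodel = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

print(qmodel(torch.randn(1, 16)).shape)  # same interface, smaller weights
```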

PyTorch provides tools like torch.quantization module to help with quantization, including functions to prepare the model for quantization (torch.quantization.prepare) and to actually quantize the model (torch.quantization.convert).
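A minimal sketch of that prepare/convert (static, eager-mode) workflow; the module and calibration inputs below are illustrative, and quant/dequant stubs mark where tensors cross the float/int8 boundary:

```python
import torch
import torch.nn as nn

class QuantReadyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()      # float -> int8 boundary
        self.fc = nn.Linear(16, 4)
        self.dequant = torch.quantization.DeQuantStub()  # int8 -> float boundary

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

model = QuantReadyNet().eval()
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")  # x86 backend

prepared = torch.quantization.prepare(model)      # insert observers
for _ in range(8):                                # calibrate on sample inputs
    prepared(torch.randn(1, 16))
quantized = torch.quantization.convert(prepared)  # swap in int8 modules
```

Here prepare inserts observers that record activation ranges during calibration, and convert uses those ranges to replace the float modules with int8 counterparts.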

In summary, scripting (or tracing) is about capturing a model's operations as a TorchScript representation for efficient inference, while quantization is about reducing the precision of the model's parameters and activations to improve performance and reduce memory footprint.