
Compressing LLMs with Low Rank Decomposition

Tricks for Compressing Large Language Models


According to LLM scaling laws, the larger a model is, the better it performs. LLMs in production today have over 100 billion parameters, and some are even known to approach a trillion parameters. And why wouldn't they? They are modelling a good chunk of natural language and making it available to us by storing this information in their parameters.

But the application of these LLMs is impeded by their computational requirements, often needing heavy-duty GPUs (with multiple cores and several GBs of RAM). Serving these models, which can occupy several GBs of compute memory, is a challenge.

This blog introduces a simple method to reduce the size of your LLM using low rank decomposition of the model weights. The method takes minutes to run (about 20 minutes for a 7B parameter model), can be done on commercially available CPUs, and retains better performance than compression techniques like pruning, which again need GPUs to perform.

Why compress LLMs

Not using OpenAI APIs for cost and safety reasons

Of course, integrating OpenAI APIs is always a viable option, but it is costly, and there is always a chance that OpenAI's outputs are not tailored to your specific use case. Product owners might therefore want to finetune an LLM on their own custom data, especially with the rise of strong open source models like LLaMA, Falcon and Mistral. Users might also have security and privacy concerns about sending their data to third party applications.

Smaller hardware requirements

Deploying a LLaMA 7B model, which itself occupies 13 GB of memory, is a challenge: you have to use a GPU with at least 16 GB of VRAM (an NVIDIA T4). Let's say you have used an NVIDIA P4 in your deployment to serve multiple instances of the LLaMA model. On GCP, the hourly rate of a P4 is nearly $0.60 per hour. Compressing your model by 50% means you would now be able to shift to a T4 machine ($0.35/hour), which is available at nearly half the cost of a P4.

Faster inference

LLM inference is autoregressive in nature: the output of one token initiates the generation of the next token. This sequential execution requires a forward pass through the model several times until the complete output is generated, which makes it important to use a comparatively smaller model. Moreover, due to the variable length of the output, generalising latency for an average user prompt becomes challenging. This is why it is best to keep the model size as small as possible!
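To make the sequential loop concrete, here is a minimal greedy-decoding sketch using the Hugging Face transformers API (the checkpoint name and generation length are placeholders; any causal LM will do):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; substitute the causal LM you are serving.
name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

prompt = "Low rank decomposition is"
ids = tokenizer(prompt, return_tensors="pt").input_ids

# One forward pass per generated token: each new token is appended
# to the sequence and fed back into the model.
with torch.no_grad():
    for _ in range(20):
        logits = model(ids).logits                      # (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)

print(tokenizer.decode(ids[0]))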

Tradeoff between compression and accuracy

Any model compression technique comes with a tradeoff in accuracy and performance, which is why it is important to choose an algorithm that gives good performance for as much compression as possible.

Model Compression Techniques

Pruning

Pruning is the process of identifying unnecessary parameters (those that have little impact or are redundant) and removing them. Sparsity, one form of pruning, replaces near-zero values with zero and represents the matrix in a condensed form (only non-zero values and their indices) that takes up less space than a full, dense matrix.

There are two types of Pruning:

  1. Structured Pruning — Structured pruning reduces the model size by removing entire structural components, like neurons, channels or layers. This leads to significant reductions in model size while keeping the overall LLM structure intact. Compared to unstructured pruning, structured pruning offers more control and scales well for larger models.
  2. Unstructured Pruning — Unstructured pruning is a simple technique that targets individual weights or neurons by applying a threshold and zeroing out parameters below it. It does not consider the overall LLM structure and results in an irregular sparse model that requires specialised techniques to store and execute.

Unstructured pruning often requires additional fine-tuning (retraining) to regain accuracy. For massive models with billions of parameters, this can become inefficient and time-consuming. Techniques such as iterative fine-tuning during pruning to minimise the training steps, combining parameter-efficient fine-tuning (PEFT) with pruning, and SparseGPT are used to address this issue.
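To illustrate what unstructured (magnitude) pruning looks like in practice, here is a minimal sketch that zeroes out the smallest weights of a linear layer; this is an illustrative toy, not the procedure used by SparseGPT or other production pruners:

import torch
import torch.nn as nn

def magnitude_prune_(layer: nn.Linear, sparsity: float = 0.5) -> None:
    """Zero out the smallest-magnitude weights of the layer, in place."""
    w = layer.weight.data
    k = int(sparsity * w.numel())
    if k == 0:
        return
    # Threshold = the k-th smallest absolute weight value.
    threshold = w.abs().flatten().kthvalue(k).values
    w.mul_((w.abs() > threshold).to(w.dtype))

layer = nn.Linear(1024, 1024)
magnitude_prune_(layer, sparsity=0.9)
nonzero = layer.weight.count_nonzero().item() / layer.weight.numel()
print(f"non-zero fraction after pruning: {nonzero:.2f}")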

Quantisation

Quantisation is a technique that reduces the precision of the model's weights (and sometimes also the activations) to significantly reduce the size of the model, leading to lower storage and bandwidth requirements and computational complexity. The typical 32-bit representation of the model is converted to 4-bit integers, 8-bit floating point numbers or 16-bit floating point numbers. With 8-bit quantisation, for example, the model requires roughly 1/4 of the memory of the unquantised version, which allows deployments with fewer GPUs and therefore makes it more affordable.
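As a minimal sketch of the idea (symmetric, per-tensor 8-bit quantisation; real schemes such as GPTQ or bitsandbytes are considerably more sophisticated):

import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric per-tensor quantisation: fp32 weights -> int8 values plus one fp32 scale."""
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4096, 4096)          # fp32: 64 MiB of weights
q, scale = quantize_int8(w)          # int8: 16 MiB plus a single scale value
error = (dequantize(q, scale) - w).abs().mean()
print(f"mean absolute quantisation error: {error:.5f}")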

Distillation

Knowledge distillation (KD) is a technique aimed at transferring knowledge from a large, complex model (the "teacher") to a smaller, simpler model (the "student") with a smaller footprint. The student is trained to mirror the teacher by using an additional loss function that measures the discrepancy between their outputs, on top of the original loss function against the ground-truth labels. The current trend is ensemble KD, where multiple teachers are used to train the student.
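A common way to write the student's training objective combines a soft term (KL divergence between temperature-scaled teacher and student outputs) with the usual cross-entropy against the labels; the temperature and weighting below are illustrative defaults, not values from any particular paper:

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """alpha weights the soft (teacher) term against the hard (label) term."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Toy example: batch of 4 samples, vocabulary of 10 tokens.
student = torch.randn(4, 10, requires_grad=True)
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student, teacher, labels))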

It is important to note that many state-of-the-art large language models (LLMs) have restrictive licenses that prohibit using their outputs to train other LLMs. Open source LLMs, limited-use licenses and synthetic data are the alternatives to find teacher models.

Although this technique ensures maximum retention of knowledge in the student model, we again need GPUs to train the model, along with data and time!

 

Reduced Order Modelling for LLMs

In the transformer architecture, we repeatedly encounter the simple MLP (fully connected) layer, which accounts for more than 95% of the model's parameters. This layer appears in two places: first, where the output of the previous layer is converted into queries, keys and values for the attention mechanism, and second, in the feed-forward layer, where the outputs from the different heads are mixed. Mathematically, the layer is written as y = Wx + b, where W is the weight matrix, x the input and b the bias.

Reduced Order Modelling simply uses low rank decomposition (matrix factorisation) of the model weights to reduce the number of parameters needed for a matrix multiplication. For example, a matrix of size 18x12 (216 parameters) could be written as the product of two matrices of sizes 18x4 and 4x12 (72 + 48 = 120 parameters).
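The snippet below reproduces this toy example with NumPy: a truncated SVD splits an 18x12 matrix into 18x4 and 4x12 factors (120 parameters instead of 216), at the cost of an approximation error that depends on how quickly the singular values decay:

import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((18, 12))        # 18 x 12 = 216 parameters

U, S, Vt = np.linalg.svd(W, full_matrices=False)
r = 4                                    # chosen rank
A = U[:, :r] * S[:r]                     # 18 x 4  ->  72 parameters
B = Vt[:r, :]                            # 4 x 12  ->  48 parameters

W_approx = A @ B
print("parameters:", W.size, "->", A.size + B.size)
print("relative error:", np.linalg.norm(W - W_approx) / np.linalg.norm(W))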


This decomposition is done through an eigenvalue decomposition of the weight matrix; with a pre-decided rank for the layer, the weight matrix is re-parameterised into its two factors. A code sketch of this re-parameterisation follows the notes below.

A few things to note here:

  • The ROM of the previous layer is used to generate the inputs for the next layer, so that subsequent layers have prior information about the error introduced by the decomposition of earlier layers.
  • The ROM operations are performed on CPU with no requirement for a GPU, and the computational cost associated with them is very small.
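As a simplified sketch of the re-parameterisation (using a plain truncated SVD of the weight matrix rather than the activation-aware, layer-by-layer procedure described above), a single nn.Linear layer can be replaced by two smaller ones, entirely on CPU; the layer dimensions are chosen to roughly match a LLaMA-7B MLP projection:

import torch
import torch.nn as nn

def low_rank_factorize(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Replace Linear(in, out) with Linear(in, rank) -> Linear(rank, out)."""
    W = layer.weight.data                     # shape (out_features, in_features)
    U, S, Vt = torch.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]                # (out_features, rank)
    B = Vt[:rank, :]                          # (rank, in_features)

    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data = B
    second.weight.data = A
    if layer.bias is not None:
        second.bias.data = layer.bias.data
    return nn.Sequential(first, second)

layer = nn.Linear(4096, 11008)                # roughly LLaMA-7B MLP-sized
compressed = low_rank_factorize(layer, rank=1024)

x = torch.randn(2, 4096)
print(layer(x).shape, compressed(x).shape)    # both torch.Size([2, 11008])

orig = sum(p.numel() for p in layer.parameters())
new = sum(p.numel() for p in compressed.parameters())
print(f"parameters: {orig} -> {new} ({new / orig:.0%})")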
 

Size and Performance Comparison

 

To evaluate the performance of the compressed model, we performed zero-shot classification across standard datasets like BoolQ, PIQA, HellaSwag, and ARC. Examples of the type of prompts in these datasets are shown below.

BoolQ consists of yes/no questions paired with a short passage; below is a sample from the PIQA dataset:
{
  "goal": "How do I ready a guinea pig cage for it's new occupants?",
  "sol1": "Provide the guinea pig with a cage full of a few inches of bedding made of ripped paper strips, you will also need to supply it with a water bottle and a food dish.",
  "sol2": "Provide the guinea pig with a cage full of a few inches of bedding made of ripped jeans material, you will also need to supply it with a water bottle and a food dish.",
  "label": 0,
}

Across these datasets, the performance of the compressed model is comparable to the original LLaMA model, with a significant gap over the pruned versions of these models.

We also found that using larger batch sizes and sequence lengths for the computation of the matrix factors leads to better generalisation for that layer. So, depending on the availability of compute resources, it is better to use a batch size of 512 or more and a sequence length of 128 to compute the ROM for your model.

We found that it takes 15.8 minutes, 21.8 minutes and 28.9 minutes at 90%, 80% and 50% compression rates respectively to compress the 7B parameter (13 GB) model.

As the computation has to be done layer-wise, calculate the memory requirement for a single layer of your model; that becomes the minimum RAM required for compression with this algorithm. After that, it takes on average around 13 seconds per layer for the compression (depending on the context length and batch size).
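As a rough back-of-the-envelope check, assuming LLaMA-7B-like dimensions (hidden size 4096, MLP intermediate size 11008) and fp16 storage, the per-layer weight memory can be estimated as follows; substitute your own model's dimensions:

# Rough per-block weight memory for a LLaMA-7B-like transformer layer, fp16.
hidden, intermediate, bytes_per_param = 4096, 11008, 2

attention = 4 * hidden * hidden          # Q, K, V and output projections
mlp = 3 * hidden * intermediate          # gate, up and down projections
total_params = attention + mlp

print(f"params per layer: {total_params / 1e6:.1f}M")
print(f"memory per layer: {total_params * bytes_per_param / 2**20:.0f} MiB")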

This indicates that, without a significant drop in performance, we can safely use LLM-ROM as an algorithm to compress LLMs for specific use cases.

 

Conclusion

We introduced a novel model compression algorithm based on low rank decomposition that can be run on consumer grade CPUs, while retaining more of the original model's performance than other compression techniques such as pruning and distillation, which also come with their own caveats.

Using the LLM-ROM technique to compress a larger, better performing version of a model and deploying it in place of the smaller version is also something product owners and chatbot builders should explore.

If you're looking for ways to compress your huge LLMs to cut down those huge cloud bills, without needing GPUs to do so, LLM Reduced Order Modelling is the way to go!