Quantisation in Large Language Models
By Satish Gupta • 12/6/2025
Non-functional requirements matter in any project because they give an overall picture of the infrastructure the project will need. This is especially true for Generative AI and large language models, where estimating infrastructure up front is essential. If you are self-hosting a model, a basic idea of the memory and CPU requirements gives you a good overview of the project and often of its cost, which is frequently the deciding factor in whether a project moves forward. Generative AI projects are computation-heavy, and understanding the infrastructure needs gives you a significant advantage when weighing cost against benefit.
I want to discuss the concept of quantisation in large language models—how it impacts inference and what can be done to reduce cost.
The transformer architecture governs how a model reads the prompt, understands the context, and how its layers, attention, and embeddings work together to predict the next word. If you want to learn more, read my article: https://antahai.com/articles/inference-in-large-language-models
Quantisation is about how the model is packaged: it is a technique that makes a model smaller and faster by reducing memory usage and speeding up inference. It changes only how the model's weights are stored, not what the model has learned. Since the weights are just numerical values, they can be stored at different levels of precision, which gives us different levels of quantisation.
FP32 stores each weight as a 32-bit floating-point value (highest precision).
FP16 halves that to 16-bit floating-point values while maintaining good accuracy.
INT8 and INT4 store weights as 8-bit and 4-bit integers respectively.
Even though the precision decreases, the core information in the weights remains, so the model's behaviour is largely preserved (aggressive quantisation such as INT4 can cost a little accuracy). What changes significantly is the computation and memory required to run the model.
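To make this concrete, here is a minimal sketch of symmetric per-tensor INT8 quantisation in Python with NumPy. The weights are random stand-ins for a real layer, not taken from any particular model: each FP32 value is mapped to an 8-bit integer through a single scale factor and can be mapped back with only a small rounding error.

```python
import numpy as np

# Random stand-ins for one layer's FP32 weights.
fp32_weights = np.random.randn(4, 4).astype(np.float32)

# Scale maps the largest absolute weight onto the INT8 range [-127, 127].
scale = np.abs(fp32_weights).max() / 127.0

# Quantise: round each weight to the nearest representable integer.
int8_weights = np.clip(np.round(fp32_weights / scale), -127, 127).astype(np.int8)

# Dequantise for use at inference time (or compute directly with INT8 kernels).
dequantised = int8_weights.astype(np.float32) * scale

# The reconstruction error is small relative to the weights themselves,
# while storage drops from 4 bytes per weight to 1 byte.
print("max absolute error:", np.abs(fp32_weights - dequantised).max())
```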
Below are the estimated memory requirements for the weights of a 7B-parameter model:
FP32 ≈ 28 GB (≈ 26.1 GiB)
FP16 ≈ 14 GB (≈ 13.0 GiB)
INT8 ≈ 7 GB (≈ 6.5 GiB)
INT4 ≈ 3.5 GB (≈ 3.3 GiB)
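These figures follow a simple rule of thumb: weight memory ≈ parameter count × bits per weight ÷ 8. The sketch below reproduces the numbers above; it deliberately ignores the KV cache, activations, and framework overhead, so treat it as a lower bound.

```python
def estimate_weight_memory(num_params: float, bits_per_weight: int) -> str:
    """Rough memory needed just to hold the model weights."""
    total_bytes = num_params * bits_per_weight / 8
    gb = total_bytes / 1e9      # decimal gigabytes
    gib = total_bytes / 2**30   # binary gibibytes
    return f"{gb:.1f} GB ({gib:.1f} GiB)"

# A 7B-parameter model at the precisions discussed above.
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(name, estimate_weight_memory(7e9, bits))
```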
If you are running a model at FP32 or FP16, you practically need a GPU. But here is the interesting part: INT8 and INT4 quantised models can run inference even on a CPU.
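As an illustration, a quantised checkpoint in GGUF format can be loaded on a plain CPU with llama-cpp-python. This is only a sketch: the model file name, thread count, and prompt below are placeholders, not the exact setup behind Medha.

```python
from llama_cpp import Llama

# Hypothetical INT8-quantised GGUF checkpoint on local disk.
llm = Llama(
    model_path="models/my-7b-q8_0.gguf",
    n_ctx=2048,     # context window
    n_threads=8,    # CPU threads used for inference
)

output = llm(
    "Explain quantisation in one sentence.",
    max_tokens=64,
)
print(output["choices"][0]["text"])
```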
You might be wondering how I have hosted my own model. It is an INT8 quantised model, and I am able to run LLM inference at almost no cost. The results are impressive: Medha can communicate in multiple languages. There is, of course, some latency, but it is completely acceptable considering the cost.