LLM Inferencing Basics

November 1, 2023

Over a model's lifetime, LLM inference often ends up costing far more than its training. One might think that by optimizing a model's training and architecture, its inference will also be optimal, but this is not the case. LLMs generate their output iteratively, and most of the time this process is memory bound rather than compute bound. Inference optimization techniques like the KV cache, continuous batching, and in-flight batching exist to close this gap, and this post walks through the basics behind them.

We're going to dive pretty deep into the rabbit hole, and a good understanding of machine learning and deep learning is a must for this blog. But if you think you're ready for it 😇, grab a cup of coffee and get ready to dive in deep, because this blog is on LLMs. My name is Rahul, and welcome to TechTalk Verse! 🤟

Parameters

First, let's calculate the approximate number of parameters in an LLM architecture. Consider \( n_{layers} \) as the number of decoder layers, where each decoder layer consists of a self-attention layer and a feed-forward network. \( d_{model} \) is the token embedding dimension, \( n_{heads} \) is the number of attention heads, and \( d_{head} \) is the dimension of each head. The weights per layer consist of the following:

  1. \( W_{q}, W_{k}, W_{v} \), i.e., the query, key and value projection matrices. Each matrix is of size \( d_{model} \cdot n_{heads} \cdot d_{head} \).
  2. \( W_{o} \), the output projection matrix applied to the output of the self-attention layer, before the feed-forward layer. \( W_{o} \) is also of size \( d_{model} \cdot n_{heads} \cdot d_{head} \).
  3. The feed-forward network consists of 2 layers, each of size \( 4 \cdot {d_{model}}^{2} \).

Also, in most transformer architectures, \( d_{model} = n_{heads} \cdot d_{head} \). So, the number of parameters in a typical LLM block is
=> \( 4 \cdot d_{model} \cdot n_{heads} \cdot d_{head} \) (Q, K, V and \( W_{o} \) matrices) + \( 2 \cdot 4 \cdot {d_{model}}^{2} \) (feed-forward layers) \( = 12 \cdot {d_{model}}^{2} \) per block
=> Summing over all layers, number of parameters \( P = 12 \cdot n_{layers} \cdot {d_{model}}^{2} \)
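
As a quick sanity check, here is a small Python sketch of this formula (the helper name and the comparison are mine; the GPT-3 dimensions of 96 layers and \( d_{model} = 12288 \) are publicly reported values):

```python
# Approximate parameter count using P ≈ 12 * n_layers * d_model^2
# (embeddings and layer norms are ignored, as in the derivation above).
def approx_params(n_layers: int, d_model: int) -> int:
    attention = 4 * d_model * d_model   # W_q, W_k, W_v, W_o, with d_model = n_heads * d_head
    ffn = 2 * 4 * d_model * d_model     # two feed-forward layers of size 4 * d_model^2 each
    return n_layers * (attention + ffn)

# GPT-3-like dimensions: 96 layers, d_model = 12288
print(approx_params(96, 12288))  # ≈ 1.74e11, close to the advertised 175B parameters
```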

Batching

FLOPs vs Memory Boundedness

KV Cache

Computing the KV cache entries for a token takes roughly 1/6th of the compute of passing that token through the full model.
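
A quick back-of-the-envelope derivation, using the per-layer counts from above and the standard approximation of ~2 FLOPs per parameter per token:
=> KV FLOPs per token \( \approx 2 \cdot 2 \cdot n_{layers} \cdot {d_{model}}^{2} = 4 \cdot n_{layers} \cdot {d_{model}}^{2} \) (computing K and V in every layer)
=> Forward-pass FLOPs per token \( \approx 2 \cdot P = 24 \cdot n_{layers} \cdot {d_{model}}^{2} \)
=> Ratio \( = 4/24 = 1/6 \)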

When you're doing auto-regressive text generation, you predict one token at a time. When predicting a given token, in the attention layer you need to compute the attention between the most recent token and all tokens generated so far: you use the query from the last token, but the keys and values from all tokens generated so far. This means there is no benefit in caching the query, but you save repeated computation if you cache the keys and values.
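
To make that concrete, here is a minimal NumPy sketch of single-head attention with a growing KV cache (the toy dimensions, random weights, and token embeddings below are purely illustrative, not from any real model):

```python
import numpy as np

d_model = 8
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Keys and values for all tokens processed so far.
k_cache, v_cache = [], []

def attend(x_new):
    """Attention output for the newest token, reusing the cached K/V."""
    q = x_new @ W_q                      # query is needed only for the new token
    k_cache.append(x_new @ W_k)          # this token's key ...
    v_cache.append(x_new @ W_v)          # ... and value are appended to the cache
    K, V = np.stack(k_cache), np.stack(v_cache)   # (seq_len, d_model)
    scores = softmax(K @ q / np.sqrt(d_model))    # attention over all cached tokens
    return scores @ V                    # weighted sum of cached values

# Feed tokens one at a time, as in auto-regressive decoding.
for step in range(5):
    x_new = rng.standard_normal(d_model)   # stand-in for the newest token's embedding
    print(step, attend(x_new)[:3])
```

At each decoding step only the new token's query, key, and value are computed; everything else is reused straight from the cache.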

Paged Attention

Flash Attention

Faster Transformer

Model Parallelism

There are two main flavors: tensor parallelism and pipeline parallelism. Tensor parallelism occurs when each tensor is split up into multiple chunks, and each chunk of the tensor is placed on a separate GPU. During computation, each chunk gets processed separately and in parallel on different GPUs, and the result (the final tensor) is obtained by combining the chunks from the multiple GPUs.
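
A minimal NumPy sketch of the idea behind column-wise tensor parallelism, with the GPUs only simulated on a single machine (the shapes and the two-way split are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))     # a small batch of activations
W = rng.standard_normal((16, 32))    # full weight matrix of one linear layer

# Split the weight column-wise into two chunks, one per (simulated) GPU.
W_gpu0, W_gpu1 = np.split(W, 2, axis=1)

# Each "GPU" computes its partial output independently, in parallel.
y_gpu0 = x @ W_gpu0
y_gpu1 = x @ W_gpu1

# Combining (concatenating) the partial outputs reproduces the full result.
y = np.concatenate([y_gpu0, y_gpu1], axis=1)
assert np.allclose(y, x @ W)
```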

Pipeline parallelism occurs when a model is split up depth-wise and different groups of full layers are placed onto different GPUs/nodes.
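
A corresponding NumPy sketch of pipeline parallelism, again with the GPUs only simulated (real implementations stream micro-batches through the stages so the devices are not sitting idle):

```python
import numpy as np

rng = np.random.default_rng(0)
layers = [rng.standard_normal((16, 16)) for _ in range(4)]   # 4 toy "layers"

# Stage 0 ("GPU 0") holds the first two layers, stage 1 ("GPU 1") the last two.
stage0, stage1 = layers[:2], layers[2:]

def run_stage(x, stage):
    for W in stage:
        x = np.maximum(x @ W, 0)   # linear + ReLU as a stand-in for a full transformer block
    return x

x = rng.standard_normal((4, 16))
h = run_stage(x, stage0)   # computed on "GPU 0"
y = run_stage(h, stage1)   # activations handed off to "GPU 1", which computes the rest
print(y.shape)
```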

https://developer.nvidia.com/blog/accelerated-inference-for-large-transformer-models-using-nvidia-fastertransformer-and-nvidia-triton-inference-server/

References