Now, let's get to the heart of this article: analyzing the Q, K, V, and O matrices of the Llama-3-8B-Instruct model through their singular values!
The code
Let's first import all the packages needed in this analysis.
import transformers
import torch
import numpy as np
from transformers import AutoConfig, LlamaModel
from safetensors import safe_open
import os
import matplotlib.pyplot as plt
Then, let's download the model and save it to our /tmp directory.
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
!huggingface-cli download {MODEL_ID} --quiet --local-dir /tmp/{MODEL_ID}
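If you prefer to stay inside Python rather than shelling out to the CLI, a roughly equivalent download can be done with huggingface_hub's snapshot_download. This is just an optional sketch; the CLI call above is what the rest of the article assumes.

from huggingface_hub import snapshot_download

# Same destination directory as the CLI call above
snapshot_download(repo_id=MODEL_ID, local_dir=f"/tmp/{MODEL_ID}")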
If you have a GPU, the following code may not be relevant to you. However, if, like me, you don't have enough GPU memory, the following code will be really useful for loading only specific layers of the Llama-3-8B model.
def load_specific_layers_safetensors(model, model_name, layer_to_load):
    state_dict = {}
    files = [f for f in os.listdir(model_name) if f.endswith('.safetensors')]
    for file in files:
        filepath = os.path.join(model_name, file)
        with safe_open(filepath, framework="pt") as f:
            for key in f.keys():
                if f"layers.{layer_to_load}." in key:
                    # Remap the checkpoint key so the weights land in layer 0 of our 1-layer model
                    new_key = key.replace(f"model.layers.{layer_to_load}.", 'layers.0.')
                    state_dict[new_key] = f.get_tensor(key)

    missing_keys, unexpected_keys = model.load_state_dict(state_dict, strict=False)
    if missing_keys:
        print(f"Missing keys: {missing_keys}")
    if unexpected_keys:
        print(f"Unexpected keys: {unexpected_keys}")
The reason we load only a single layer is that the free-tier Google Colab GPU is not enough to load Llama-3-8B even with fp16 precision. Furthermore, this analysis requires that we work in fp32 precision due to how np.linalg.svd is built.
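As a small illustration of that constraint, np.linalg.svd does not accept float16 inputs, so half-precision weights would have to be upcast anyway (the array below is just a random placeholder):

w_fp16 = np.random.randn(8, 8).astype(np.float16)  # stand-in for an fp16 weight matrix

try:
    np.linalg.svd(w_fp16, compute_uv=False)
except TypeError as err:
    print(err)  # NumPy's linalg routines do not support float16 arrays

# Upcasting to fp32 works fine
print(np.linalg.svd(w_fp16.astype(np.float32), compute_uv=False))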
Next, we can define the main function to obtain the singular values for a given matrix_type, layer_number, and head_number.
def get_singular_values(model_path, matrix_type, layer_number, head_number):
    """
    Computes the singular values of the specified matrix in the Llama-3 model.

    Parameters:
    model_path (str): Path to the model
    matrix_type (str): Type of matrix ('q', 'k', 'v', 'o')
    layer_number (int): Layer number (0 to 31)
    head_number (int): Head number (0 to 31)

    Returns:
    np.array: Array of singular values
    """
    assert matrix_type in ('q', 'k', 'v', 'o'), "Invalid matrix type"
    assert 0 <= layer_number < 32, "Invalid layer number"
    assert 0 <= head_number < 32, "Invalid head number"

    # Load the model only for that specific layer since we have limited RAM even after using fp16
    config = AutoConfig.from_pretrained(model_path)
    config.num_hidden_layers = 1
    model = LlamaModel(config)
    load_specific_layers_safetensors(model, model_path, layer_number)

    # Access the specified layer
    # Always index 0 since we have loaded only that specific layer
    layer = model.layers[0]

    # Determine the size of each head
    num_heads = layer.self_attn.num_heads
    head_dim = layer.self_attn.head_dim

    # Access the specified matrix
    weight_matrix = getattr(layer.self_attn, f"{matrix_type}_proj").weight.detach().numpy()

    if matrix_type in ('q', 'o'):
        start = head_number * head_dim
        end = (head_number + 1) * head_dim
    else:  # 'k', 'v' matrices
        # Adjust the head_number based on num_key_value_heads
        # This is done since Llama-3-8B uses Grouped Query Attention
        num_key_value_groups = num_heads // config.num_key_value_heads
        head_number_kv = head_number // num_key_value_groups
        start = head_number_kv * head_dim
        end = (head_number_kv + 1) * head_dim

    # Extract the weights for the specified head
    if matrix_type in ('q', 'k', 'v'):
        weight_matrix = weight_matrix[start:end, :]
    else:  # 'o' matrix
        weight_matrix = weight_matrix[:, start:end]

    # Compute singular values
    singular_values = np.linalg.svd(weight_matrix, compute_uv=False)

    del model, config

    return list(singular_values)
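With the function in place, a quick usage sketch looks like the following. The layer and head choices are arbitrary examples, and the plot simply reuses the matplotlib import from earlier.

model_path = f"/tmp/{MODEL_ID}"

# Singular values of the Q projection for head 0 of layer 0 (arbitrary example choices)
singular_values = get_singular_values(model_path, matrix_type='q', layer_number=0, head_number=0)

plt.plot(singular_values)
plt.xlabel("Singular value index")
plt.ylabel("Singular value")
plt.title("Singular values of W_Q (layer 0, head 0)")
plt.show()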
It is worth noting that we can extract the weights for the specified head from the K, Q, and V matrices by slicing rows, because of how they are implemented by Hugging Face.
As for the O matrix, we can slice columns instead to extract the weights for the specified head, thanks to linear algebra. The details can be seen in the following figure, and the short numpy check below illustrates the same point numerically.
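Here is a minimal numerical check of that column-slicing claim, using toy dimensions rather than the real model weights (the sizes and random values are just placeholders): projecting the concatenated head outputs through W_O gives the same result as summing each head's output multiplied by its own column block of W_O.

num_heads, head_dim, hidden = 4, 8, 32  # toy sizes, hidden = num_heads * head_dim
rng = np.random.default_rng(0)

head_outputs = [rng.standard_normal((1, head_dim)) for _ in range(num_heads)]  # per-head attention outputs
W_o = rng.standard_normal((hidden, num_heads * head_dim))  # same layout as o_proj.weight

# Full projection: concatenate the heads, then apply W_O (as a linear layer: x @ W^T)
full = np.concatenate(head_outputs, axis=-1) @ W_o.T

# Per-head projection: each head only touches its own column block of W_O
per_head = sum(
    head_outputs[h] @ W_o[:, h * head_dim:(h + 1) * head_dim].T
    for h in range(num_heads)
)

print(np.allclose(full, per_head))  # True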