Shanghai Jiao Tong University & Tsinghua University & William and Mary
Introduction
We introduce Diff-eRank, a rank-based metric rooted in information-theoretic and geometric principles. Diff-eRank evaluates LLMs by examining their hidden representations to quantify how much redundant information a model discards after training.
We demonstrate its applicability in both single-modal (language) and multi-modal settings. For language models, we find that Diff-eRank increases as the model scales up and shows a consistent relationship with conventional metrics such as loss and accuracy.
For multi-modal models, we also propose a rank-based evaluation method for assessing alignment quality, and we find that modern multi-modal large language models exhibit good alignment performance.
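For readers who prefer a formula, the quantity computed by the reference code in this README can be written as follows. This is a sketch read off from that code, not a statement of the paper's exact notation: the symbols $r_i$, $z_i$, $\Sigma_R$, $\sigma_j$, $R_0$ (untrained model) and $R_1$ (trained model) are labels assumed here for illustration.

```latex
% R \in \mathbb{R}^{N \times d}: hidden representations of N tokens, with rows r_i
\bar{r} = \frac{1}{N} \sum_{i=1}^{N} r_i ,
\qquad
z_i = \frac{r_i - \bar{r}}{\lVert r_i - \bar{r} \rVert_2}

% Covariance-style matrix of the centered, unit-norm token representations
\Sigma_R = \frac{1}{N} \sum_{i=1}^{N} z_i z_i^{\top}

% Effective rank: exponential of the entropy of the trace-normalized spectrum,
% where \sigma_j are the singular values of \Sigma_R
\mathrm{eRank}(\Sigma_R)
  = \exp\!\Bigl( - \sum_{j=1}^{d}
      \frac{\sigma_j}{\operatorname{tr}(\Sigma_R)}
      \log \frac{\sigma_j}{\operatorname{tr}(\Sigma_R)} \Bigr)

% Difference between the untrained (R_0) and trained (R_1) models
\text{Diff-eRank} = \mathrm{eRank}(\Sigma_{R_0}) - \mathrm{eRank}(\Sigma_{R_1})
```

Terms with $\sigma_j = 0$ are taken to contribute zero to the sum, which matches the use of `nansum` in the code.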
Calculation of Diff-eRank for LLMs
Setup
pip install transformers torch datasets
Calculation
from transformers import AutoTokenizer, AutoModel, AutoConfig
import torch
import math
# R input N*d
def normalize(R):
    with torch.no_grad():
        mean = R.mean(dim=0)
        R = R - mean
        norms = torch.norm(R, p=2, dim=1, keepdim=True)
        R = R/norms
    return R

def cal_cov(R):
    with torch.no_grad():
        Z = torch.nn.functional.normalize(R, dim=1)
        A = torch.matmul(Z.T, Z)/Z.shape[0]
    return A

def cal_erank(A):
    with torch.no_grad():
        eig_val = torch.svd(A / torch.trace(A))[1]
        entropy = - (eig_val * torch.log(eig_val)).nansum().item()
        erank = math.exp(entropy)
    return erank

def compute(R):
    return cal_erank(cal_cov(normalize(R)))

model_path = "facebook/opt-1.3b"  # for example
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path).cuda()
config = AutoConfig.from_pretrained(model_path)
untrained_model = AutoModel.from_config(config).to('cuda')
text = "We introduce a rank-based metric called Diff-eRank, which is rooted in information theory and geometry principles. Diff-eRank evaluates LLMs by examining their hidden representations to quantify how LLMs discard redundant information after training."  # for example
inputs = tokenizer(text, return_tensors="pt").to('cuda')
with torch.no_grad():
    R1 = model(inputs.input_ids)[0][0, :, :]
    R2 = untrained_model(inputs.input_ids)[0][0, :, :]
erank1 = compute(R1)
erank2 = compute(R2)
RD = erank2 - erank1
print(RD)
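
To build intuition for the entropy step inside `cal_erank`, the following minimal sketch computes an effective rank directly from a list of singular values using only the standard library. The helper name `erank_from_spectrum` and the toy spectra are hypothetical and are not part of this repository; zeros are skipped explicitly, which plays the role that `nansum` plays in the code above.

```python
import math

def erank_from_spectrum(singular_values):
    """Effective rank: exp of the Shannon entropy of the normalized spectrum.

    Hypothetical helper for illustration; not part of the Diff-eRank repo.
    """
    total = sum(singular_values)
    # Normalize to a probability distribution; drop zeros, since 0 * log(0) is treated as 0.
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)

flat = erank_from_spectrum([1.0, 1.0, 1.0, 1.0])    # energy spread evenly over 4 directions
peaked = erank_from_spectrum([1.0, 0.0, 0.0, 0.0])  # all energy in a single direction
print(flat, peaked)
```

An evenly spread spectrum over four directions gives an effective rank of about 4, while a spectrum concentrated in one direction gives about 1. Under this reading, a positive Diff-eRank means the trained model's representations are concentrated in fewer effective directions than the untrained model's.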
We also provide an example script for calculating eRank with Qwen2.5-VL; see utils/erank-qwen2_5_vl.py.
Citation
If you use Diff-eRank in your research or applications, please cite it using this BibTeX entry:
@inproceedings{weidiff,
title={Diff-eRank: A Novel Rank-Based Metric for Evaluating Large Language Models},
author={Wei, Lai and Tan, Zhiquan and Li, Chenghai and Wang, Jindong and Huang, Weiran},
booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
year={2024}
}
About
[NeurIPS 2024] A Novel Rank-Based Metric for Evaluating Large Language Models