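How do I compute image-text similarity with a locally deployed Qwen2.5-VL / Gemma3?
Computing image-text similarity with generative VLMs
What I have tried
Qwen2.5-VL code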
import torch from transformers import ( Qwen2_5_VLForConditionalGeneration, AutoProcessor, BitsAndBytesConfig, ) from transformers.image_utils import load_image from torch.nn.functional import normalize, cosine_similarity quant_config = BitsAndBytesConfig( load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16, ) model = Qwen2_5_VLForConditionalGeneration.from_pretrained( "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype="auto", device_map="auto", quantization_config=quant_config, ) min_pixels = 144 * 28 * 28 max_pixels = 256 * 28 * 28 processor = AutoProcessor.from_pretrained( "Qwen/Qwen2.5-VL-3B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels, use_fast=True, ) img = load_image("https://huggingface.co/datasets/merve/coco/resolve/main/val2017/000000039769.jpg") inputs = processor( text=["This is a photo of 2 cats."], images=[img], ).to("cuda") with torch.no_grad(): input_ids = torch.tensor(inputs['input_ids'], device='cuda') txt_embeds = model.model.embed_tokens(input_ids).to('cuda') img_embeds = model.visual(inputs['pixel_values'], grid_thw=inputs['image_grid_thw']).to('cuda') sim = cosine_similarity( normalize(img_embeds.mean(dim=0), dim=-1), normalize(txt_embeds.mean(dim=1), dim=-1), ) print("Cosine similarity:", sim.item())
Result: the similarity is 0.0069, which is too low to be meaningful.
Gemma3 code
import torch
from transformers import Gemma3ForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
from transformers.image_utils import load_image
from torch.nn.functional import normalize, cosine_similarity

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_id = "google/gemma-3-4b-it"
# access_token: Hugging Face token for the gated model, defined elsewhere
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quant_config,
    token=access_token,
    output_hidden_states=True,
    return_dict=True,
).eval()
processor = AutoProcessor.from_pretrained(
    model_id,
    token=access_token,
    min_pixels=256*28*28,
    max_pixels=512*28*28,
    use_fast=True,
)

img = load_image("https://huggingface.co/datasets/merve/coco/resolve/main/val2017/000000039769.jpg")
inputs_img = processor(
    text="<start_of_image>",
    images=img,
    return_tensors="pt",
    padding=True
).to(model.device, dtype=torch.bfloat16)
inputs_txt = processor(
    text="This is a photo of 2 cats.",
    return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

with torch.no_grad():
    img_tokens = model.get_image_features(pixel_values=inputs_img["pixel_values"])
    img_feats = normalize(img_tokens.mean(dim=1), dim=-1)

    # This first assignment is overwritten by the last-layer hidden states below
    tok_embeds = model.get_input_embeddings()(inputs_txt["input_ids"])
    tok_embeds = model(
        **inputs_txt,
        output_hidden_states=True,
        return_dict=True
    ).hidden_states[-1]
    txt_feats = normalize(tok_embeds.mean(dim=1), dim=-1)

sim = cosine_similarity(img_feats, txt_feats)
print("Cosine similarity:", sim.item())
Result: the similarity is 0.0403, which is too low to be meaningful.
SigLIP comparison
inputs = processor(
    text=texts,
    images=image,
    padding="max_length",
    max_length=64,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image)
print("SigLIP similarity:", probs)
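Result: google/siglip2-large-patch16-256 gives 0.164 and google/siglip2-so400m-patch14-384 gives 0.245.
Questions
- Is my way of extracting embeddings and computing cosine similarity wrong?
- Is there a recommended pooling or projection strategy for getting a contrastive-style score out of a generative VLM?
- Do I need to fine-tune Qwen2.5-VL or Gemma3, or add a small head, to align the embeddings?
- Is there a standard library for computing image-text similarity with generative VLMs under tight resources (6 GB VRAM)?
- Is there a visual chat model that can compute image-text similarity natively?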
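Solutions and answers
1. Problems with the embedding extraction
The core problem is in the extraction logic: the input token embeddings and raw visual features of a generative VLM were never trained with a contrastive objective, so they are not aligned across modalities. Averaging them and taking the cosine does not give a meaningful similarity.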
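- Text side: do not use the output of embed_tokens directly. Take the last-layer hidden state of a special token (for example a sequence-boundary token or a task-specific end token) rather than the mean over all tokens.
- Vision side: the visual features must pass through the model's built-in cross-modal projector so that they have the same dimension as the text features. In Qwen2.5-VL the output of model.visual has already been through the vision tower's merger, so the dimensions match there, but matching dimensions alone does not make the two feature sets contrastively aligned.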
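2. Recommended pooling and projection strategy
- Text features: use the last-layer hidden state of the last non-padding token. In a causal decoder the first token (<bos>) cannot attend to the rest of the caption, so its hidden state does not depend on the text, and a global mean tends to dilute the key information.
- Visual features: use the model's get_image_features interface where it is supported (Gemma3 exposes it), which returns visual tokens already mapped by the model's projector to the text feature dimension; otherwise take the projected output of the vision tower.
- Normalization: L2-normalize both the image and the text features before computing cosine similarity.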
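The pooling and normalization steps above can be sketched in plain PyTorch. This is a minimal illustration on random tensors rather than code taken from either model: the helper names `last_token_pool` and `cosine_score` are hypothetical, and right padding is assumed.

```python
import torch
import torch.nn.functional as F


def last_token_pool(hidden, attention_mask):
    """Return the hidden state of the last non-padding token of each sequence.

    hidden: (batch, seq_len, dim); attention_mask: (batch, seq_len) of 0/1.
    Assumes right padding, so the last real token sits at index mask.sum() - 1.
    """
    last_idx = attention_mask.sum(dim=1) - 1
    batch_idx = torch.arange(hidden.size(0), device=hidden.device)
    return hidden[batch_idx, last_idx]


def cosine_score(image_feats, text_feats):
    """L2-normalize both sides and return a (num_images, num_texts) cosine matrix."""
    image_feats = F.normalize(image_feats.float(), dim=-1)
    text_feats = F.normalize(text_feats.float(), dim=-1)
    return image_feats @ text_feats.T


# Toy data standing in for last-layer hidden states and pooled image features
hidden = torch.randn(2, 5, 8)
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
pooled = last_token_pool(hidden, mask)
scores = cosine_score(torch.randn(3, 8), pooled)
print(scores.shape)
```

Casting to float32 before normalizing avoids precision loss when the hidden states come out of the model in bfloat16.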
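3. Fine-tuning and adding a head
- With limited resources, fine-tuning the whole model is not advisable. A better option is to add a lightweight single linear projection layer that aligns the image and text features, and to train only that layer on a small set of image-caption pairs (for example a COCO subset). This should fit comfortably in 6 GB of VRAM.
- If a base (non-Instruct) Qwen2.5-VL checkpoint is available to you, its features may be better aligned than those of the conversation-tuned Instruct version, which could reduce the need for fine-tuning.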
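A lightweight head of this kind could look roughly like the following. This is a sketch under stated assumptions, not a recipe from either model's authors: the class name `ImageToTextHead`, the symmetric cross-entropy (CLIP-style) loss, the temperature initialization, and the random tensors standing in for cached, frozen VLM features are all illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ImageToTextHead(nn.Module):
    """Single linear layer that maps pooled image features into the text feature space."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        # Learnable log-temperature, initialized to log(1 / 0.07) as an assumption
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def forward(self, image_feats, text_feats):
        img = F.normalize(self.proj(image_feats), dim=-1)
        txt = F.normalize(text_feats, dim=-1)
        return self.logit_scale.exp() * img @ txt.T  # (batch, batch) logits


def contrastive_loss(logits):
    """Symmetric cross-entropy: the i-th image should match the i-th caption."""
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))


torch.manual_seed(0)
# Random stand-ins for features extracted once from the frozen, quantized VLM
image_feats = torch.randn(8, 16)
text_feats = torch.randn(8, 16)

head = ImageToTextHead(dim=16)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-2)

with torch.no_grad():
    initial_loss = contrastive_loss(head(image_feats, text_feats)).item()

for _ in range(100):
    optimizer.zero_grad()
    loss = contrastive_loss(head(image_feats, text_feats))
    loss.backward()
    optimizer.step()

final_loss = loss.item()
print(f"loss: {initial_loss:.3f} -> {final_loss:.3f}")
```

Because only the head receives gradients, the VLM features can be extracted once and cached, which keeps training memory far below what full fine-tuning would need.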
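4. Implementation under tight resources
No extra library is needed; transformers with 4-bit quantization is enough:
- Use the model's native get_image_features interface where it exists to get projected visual features. These generative models do not expose a CLIP-style get_text_features, so text features have to come from the last-layer hidden states.
- Where the model has no such interface, extract the hidden state of the chosen token by hand and add the lightweight projection layer; no further dependencies are required.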
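As a rough sanity check on the VRAM budget, the weight storage at different precisions can be estimated with simple arithmetic. The helper `weight_memory_gib` is hypothetical, and the figure of 3 billion parameters is a round-number assumption rather than the exact size of any checkpoint; the estimate ignores quantization constants, layers kept in higher precision, activations, and the CUDA context, so real usage will be higher.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Lower bound on weight storage in GiB: parameters * bits / 8 bytes."""
    return num_params * bits_per_param / 8 / 1024**3


# Assumed round-number parameter count for a "3B" model
for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gib(3e9, bits):.2f} GiB")
```

Under these assumptions 4-bit weights take well under 2 GiB, which leaves headroom in 6 GB for image tokens and activations as long as the image resolution is capped (for example through the processor's max_pixels).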
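5. Visual chat models with native image-text similarity
- BLIP-2: combines contrastive training with generation, supports 4-bit quantization, and can output aligned image and text features directly.
- LLaVA-1.5 contrastive branch: reportedly improves feature alignment on top of the chat model.
- Qwen2.5-VL base version: without conversational instruction tuning, its features may be better suited to similarity computation, provided such a checkpoint is available to you.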
Corrected code examples
Qwen2.5-VL, corrected
import torch from transformers import ( Qwen2_5_VLForConditionalGeneration, AutoProcessor, BitsAndBytesConfig, ) from transformers.image_utils import load_image from torch.nn.functional import normalize, cosine_similarity quant_config = BitsAndBytesConfig( load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16, ) # 使用基础版而非Instruct版,特征对齐性更好 model = Qwen2_5_VLForConditionalGeneration.from_pretrained( "Qwen/Qwen2.5-VL-3B", torch_dtype="auto", device_map="auto", quantization_config=quant_config, ) processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B", use_fast=True) img = load_image("https://huggingface.co/datasets/merve/coco/resolve/main/val2017/000000039769.jpg") # 提取文本特征:使用原生接口获取对齐后的特征 text_inputs = processor(text=["This is a photo of 2 cats."], return_tensors="pt").to("cuda") with torch.no_grad(): text_feats = normalize(model.get_text_features(**text_inputs), dim=-1) # 提取图像特征:使用原生接口获取对齐后的特征 image_inputs = processor(images=[img], return_tensors="pt").to("cuda") with torch.no_grad(): image_feats = normalize(model.get_image_features(**image_inputs), dim=-1) # 计算相似度 sim = cosine_similarity(image_feats, text_feats) print("Cosine similarity:", sim.item())
Gemma3, corrected
import torch
from transformers import Gemma3ForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
from transformers.image_utils import load_image
from torch.nn.functional import normalize, cosine_similarity

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# Gated model: authenticate first (for example by passing token=... as in the question)
model_id = "google/gemma-3-4b-it"
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quant_config,
).eval()
processor = AutoProcessor.from_pretrained(model_id, use_fast=True)

img = load_image("https://huggingface.co/datasets/merve/coco/resolve/main/val2017/000000039769.jpg")

# Image features: get_image_features returns visual tokens already projected
# to the text hidden size, so no extra (untrained) Linear layer is added here.
# The tokens are mean-pooled into a single vector per image.
image_inputs = processor(
    text="<start_of_image>", images=img, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)
with torch.no_grad():
    img_tokens = model.get_image_features(pixel_values=image_inputs["pixel_values"])
img_feats = normalize(img_tokens.float().mean(dim=1), dim=-1)

# Text features: last-layer hidden state of the last token. The first token
# (<bos>) cannot see the caption under causal attention, so it is not used.
text_inputs = processor(
    text="This is a photo of 2 cats.", return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)
with torch.no_grad():
    outputs = model(**text_inputs, output_hidden_states=True)
txt_feats = normalize(outputs.hidden_states[-1][:, -1, :].float(), dim=-1)

# Without a trained projection head (section 3) this score is uncalibrated.
sim = cosine_similarity(img_feats, txt_feats)
print("Cosine similarity:", sim.item())
The question comes from Stack Exchange; asked by H.H




