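How do I compute image-text similarity with a locally deployed Qwen2.5-VL or Gemma3?

Computing image-text similarity with generative VLMs

What I have tried

Qwen2.5-VL code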

import torch
from transformers import (
    Qwen2_5_VLForConditionalGeneration,
    AutoProcessor,
    BitsAndBytesConfig,
)
from transformers.image_utils import load_image
from torch.nn.functional import normalize, cosine_similarity

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    torch_dtype="auto",
    device_map="auto",
    quantization_config=quant_config,
)

min_pixels = 144 * 28 * 28
max_pixels = 256 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    min_pixels=min_pixels,
    max_pixels=max_pixels,
    use_fast=True,
)

img = load_image("https://huggingface.co/datasets/merve/coco/resolve/main/val2017/000000039769.jpg")

inputs = processor(
    text=["This is a photo of 2 cats."],
    images=[img],
).to("cuda")

with torch.no_grad():
    input_ids = torch.tensor(inputs['input_ids'], device='cuda')
    txt_embeds = model.model.embed_tokens(input_ids).to('cuda')
    img_embeds = model.visual(inputs['pixel_values'], grid_thw=inputs['image_grid_thw']).to('cuda')

sim = cosine_similarity(
    normalize(img_embeds.mean(dim=0), dim=-1),
    normalize(txt_embeds.mean(dim=1), dim=-1),
)
print("Cosine similarity:", sim.item())

Result: the similarity is 0.0069, which is too low to be meaningful.

Gemma3 code

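import torch
from transformers import Gemma3ForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
from transformers.image_utils import load_image
from torch.nn.functional import normalize, cosine_similarity

# access_token: a Hugging Face access token defined elsewhere (gemma-3 is a gated model)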

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_id = "google/gemma-3-4b-it"
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quant_config,
    token=access_token,
    output_hidden_states=True,
    return_dict=True,
).eval()
processor = AutoProcessor.from_pretrained(
    model_id,
    token=access_token,
    min_pixels=256*28*28,
    max_pixels=512*28*28,
    use_fast=True,
)
img = load_image("https://huggingface.co/datasets/merve/coco/resolve/main/val2017/000000039769.jpg")

inputs_img = processor(
    text="<start_of_image>", images=img, return_tensors="pt", padding=True
).to(model.device, dtype=torch.bfloat16)
inputs_txt = processor(
    text="This is a photo of 2 cats.", return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

with torch.no_grad():
    img_tokens = model.get_image_features(pixel_values=inputs_img["pixel_values"])
    img_feats  = normalize(img_tokens.mean(dim=1), dim=-1) 

    tok_embeds = model.get_input_embeddings()(inputs_txt["input_ids"])
    tok_embeds = model(
            **inputs_txt,
            output_hidden_states=True,
            return_dict=True
        ).hidden_states[-1]
    txt_feats  = normalize(tok_embeds.mean(dim=1), dim=-1)

    sim = cosine_similarity(img_feats, txt_feats)
    print("Cosine similarity:", sim.item())

Result: the similarity is 0.0403, again too low to be meaningful.

SigLIP for comparison

inputs = processor(
    text=texts,
    images=image,
    padding="max_length",
    max_length=64,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image)
print("SigLIP similarity:", probs)

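Result: google/siglip2-large-patch16-256 gives 0.164 and google/siglip2-so400m-patch14-384 gives 0.245.

Questions

  • Is my way of extracting embeddings and computing cosine similarity wrong?
  • Is there a recommended pooling or projection strategy for getting a contrastive-style score out of a generative VLM?
  • Do I need to fine-tune Qwen2.5-VL or Gemma3, or add a small head, to align the embeddings?
  • Is there a standard library for image-text similarity with generative VLMs that works under tight resources (6GB VRAM)?
  • Are there vision dialogue models that can compute image-text similarity natively?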

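Answers

1. Problems with the embedding extraction

The extraction logic has a core problem: a generative VLM's input token embeddings and its raw visual features were never aligned across modalities by contrastive training, so averaging each of them and taking the cosine does not yield a meaningful similarity.

  • Text side: do not use the output of embed_tokens directly. Take the model's last-layer hidden state at a single summary position rather than the mean over all tokens. In a causal decoder that position should be the last valid token (for example <eos> or a task-specific end token), because the first token (<bos>) cannot attend to the rest of the text;
  • Vision side: do not compare the output of model.visual with text features directly. Matching the feature dimension is not enough: the model's built-in cross-modal projection maps visual features into the language model's input-embedding space, which was trained for generation and not for cosine comparison. The visual feature has to end up in the same space as the text feature, for example by reading a last-layer hidden state for the image as well, or by training a projection for this purpose.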
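
To see why values such as 0.0069 and 0.0403 point to unaligned features rather than a weak match, note that two unrelated vectors in a high-dimensional space have a cosine close to zero. The sketch below is dependency-free and uses random Gaussian vectors as a stand-in for unaligned features. The dimension 2048, the noise level, and the seed are assumptions chosen for illustration, and real pooled embeddings are not Gaussian, so this shows the geometric effect only.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity of two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


random.seed(0)  # assumption: fixed seed so the run is repeatable
dim = 2048      # assumption: illustrative feature dimension

# Two independent random vectors stand in for features from unaligned spaces.
unrelated_a = [random.gauss(0.0, 1.0) for _ in range(dim)]
unrelated_b = [random.gauss(0.0, 1.0) for _ in range(dim)]

# A noisy copy stands in for a feature that IS aligned with unrelated_a.
aligned = [x + random.gauss(0.0, 0.5) for x in unrelated_a]

cos_unrelated = cosine(unrelated_a, unrelated_b)
cos_aligned = cosine(unrelated_a, aligned)
print(f"unrelated: {cos_unrelated:+.4f}")  # magnitude on the order of 1/sqrt(dim)
print(f"aligned:   {cos_aligned:+.4f}")    # far from zero
```

A near-zero cosine therefore says little about the image and the caption themselves; it mostly says that the two features live in unrelated spaces.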

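2. Recommended pooling and projection strategies

  • Text features: prefer the last-layer hidden state of the last valid (non-padding) token, such as <eos>, over a global mean that dilutes the key information. Using the first token (<bos>) only makes sense in a bidirectional encoder; under causal attention its hidden state does not depend on the rest of the text;
  • Visual features: use the model's get_image_features interface where it is available, or apply the model's multimodal projection layer yourself (its attribute name varies by architecture) so that the visual features have the same dimension as the text features;
  • Normalization: L2-normalize both the image feature and the text feature before computing the cosine similarity.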
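
The text-pooling advice can be sketched without any framework. The function below picks, for each sequence in a batch, the hidden state at the last position whose attention-mask entry is 1, which works for both left-padded and right-padded batches. The name last_token_pool and the toy numbers are illustrative assumptions, and plain Python lists stand in for tensors.

```python
def last_token_pool(hidden_states, attention_mask):
    """Select the hidden state of the last non-padding token per sequence.

    hidden_states: list of sequences, each a list of per-token vectors.
    attention_mask: list of sequences of 0/1 flags (1 = real token).
    """
    pooled = []
    for states, mask in zip(hidden_states, attention_mask):
        real_positions = [i for i, flag in enumerate(mask) if flag == 1]
        if not real_positions:
            raise ValueError("sequence contains only padding")
        pooled.append(states[real_positions[-1]])
    return pooled


# Toy batch: 2 sequences, 4 positions, 2-dimensional hidden states.
hidden = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [9.0, 9.0]],  # last position is padding
    [[1.0, 1.1], [1.2, 1.3], [1.4, 1.5], [1.6, 1.7]],  # no padding
]
mask = [
    [1, 1, 1, 0],
    [1, 1, 1, 1],
]

pooled = last_token_pool(hidden, mask)
print(pooled)  # [[0.5, 0.6], [1.6, 1.7]]
```

With right-padded torch tensors the same selection is commonly written as hidden[torch.arange(batch_size), attention_mask.sum(dim=1) - 1].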

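3. Fine-tuning and adding a head

  • With limited resources, fine-tuning the whole model is not recommended. A better option is to freeze the model, add a lightweight single linear projection layer that aligns the image and text features, and train only that layer on a small image-text pair dataset (such as a COCO subset). With the backbone loaded in 4-bit, this should fit within 6GB VRAM;
  • A base (non-Instruct) Qwen2.5-VL checkpoint, if one is available, may have better-aligned features than the dialogue-tuned Instruct version and so need less fine-tuning. Confirm that such a checkpoint exists on the Hugging Face Hub first; the published Qwen2.5-VL checkpoints are predominantly Instruct variants.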
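
A lightweight head of this kind is typically trained with a contrastive objective over a batch of matching pairs. The sketch below computes a symmetric cross-entropy (InfoNCE-style) loss from a similarity matrix whose entry [i][j] is the cosine between projected image i and projected text j. This CLIP-style objective is an assumption, since the answer does not name a loss; the temperature 0.07 and the toy matrices are illustrative too.

```python
import math


def cross_entropy_rows(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_sum - row[i]
    return total / len(logits)


def contrastive_loss(similarity, temperature=0.07):
    """Symmetric image-to-text and text-to-image contrastive loss."""
    logits = [[v / temperature for v in row] for row in similarity]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(transposed))


# Toy similarity matrices for a batch of 3 image-text pairs.
aligned = [
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.2],
    [0.0, 0.2, 0.9],
]
unaligned = [
    [0.01, 0.02, 0.00],
    [0.02, 0.01, 0.01],
    [0.00, 0.01, 0.02],
]

loss_aligned = contrastive_loss(aligned)
loss_unaligned = contrastive_loss(unaligned)
print("aligned loss:  ", loss_aligned)
print("unaligned loss:", loss_unaligned)
```

In training, the similarity matrix would come from the L2-normalized outputs of the trainable projection, and only the projection would receive gradients while the quantized backbone stays frozen.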

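4. Implementation under resource constraints

No extra library is needed; transformers combined with 4-bit quantization is sufficient:

  • Prefer the model's native get_text_features and get_image_features interfaces for aligned features where they exist. Contrastive models such as SigLIP and CLIP provide both; generative VLMs such as Qwen2.5-VL and Gemma3 do not provide get_text_features;
  • If the model has no such interface, extract the hidden state of the special token manually and add a lightweight projection layer; no extra dependency is required.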
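
A rough weight-memory estimate shows why 4-bit loading matters on a 6GB card. The calculation below assumes nominal parameter counts of 3 and 4 billion (read off the model names, not measured), 4 bits per parameter for NF4 and 16 bits for bfloat16. It ignores quantization constants, any layers left unquantized, activations, and the CUDA context, so real usage will be higher.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Assumption: nominal parameter counts taken from the checkpoint names.
nominal_params = {
    "Qwen2.5-VL-3B-Instruct": 3e9,
    "gemma-3-4b-it": 4e9,
}

estimates = {}
for name, count in nominal_params.items():
    estimates[name] = {
        "bf16": weight_memory_gib(count, 16),
        "nf4": weight_memory_gib(count, 4),
    }
    print(
        f"{name}: bf16 ~{estimates[name]['bf16']:.2f} GiB, "
        f"nf4 ~{estimates[name]['nf4']:.2f} GiB"
    )
```

Under these assumptions the bfloat16 weights alone already approach or exceed 6GB, while the 4-bit weights leave room for activations and a small trainable head.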

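5. Vision dialogue models with native image-text similarity

  • BLIP-2: its Q-Former is pretrained with image-text contrastive and matching objectives alongside generation, so it can output aligned image-text features directly; it can also be loaded with 4-bit quantization;
  • LLaVA-1.5 with a contrastive branch: a dialogue model further optimized for feature alignment. The standard LLaVA-1.5 checkpoints are generative only, so this requires a separately fine-tuned variant;
  • Qwen2.5-VL base version: without dialogue instruction tuning, its features may be better suited to similarity computation, provided such a checkpoint is available.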

Corrected code examples

Qwen2.5-VL, corrected

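import torch
from transformers import (
    Qwen2_5_VLForConditionalGeneration,
    AutoProcessor,
    BitsAndBytesConfig,
)
from transformers.image_utils import load_image
from torch.nn.functional import normalize, cosine_similarity

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# The Instruct checkpoint from the question is used here.
model_id = "Qwen/Qwen2.5-VL-3B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    quantization_config=quant_config,
).eval()
processor = AutoProcessor.from_pretrained(
    model_id, min_pixels=144 * 28 * 28, max_pixels=256 * 28 * 28, use_fast=True
)

img = load_image("https://huggingface.co/datasets/merve/coco/resolve/main/val2017/000000039769.jpg")

def last_token_feature(inputs):
    # Full forward pass, then the last-layer hidden state of the final token.
    # The batch holds one unpadded sequence, so index -1 is the last valid token.
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return normalize(out.hidden_states[-1][:, -1, :].float(), dim=-1)

# Text feature: Qwen2.5-VL has no get_text_features, so the text goes through the full model.
text_inputs = processor(text=["This is a photo of 2 cats."], return_tensors="pt").to(model.device)
text_feats = last_token_feature(text_inputs)

# Image feature: the prompt carries the image placeholder that the processor expands.
image_prompt = "<|vision_start|><|image_pad|><|vision_end|>"
image_inputs = processor(text=[image_prompt], images=[img], return_tensors="pt").to(model.device)
image_feats = last_token_feature(image_inputs)

# These features were not trained contrastively: use the value to rank candidates
# or as input to a trained projection head, not as a calibrated score.
sim = cosine_similarity(image_feats, text_feats)
print("Cosine similarity:", sim.item())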

Gemma3, corrected

import torch
from transformers import Gemma3ForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
from transformers.image_utils import load_image
from torch.nn.functional import normalize, cosine_similarity

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# Gated model: log in to the Hugging Face Hub first or pass token=... to from_pretrained.
model_id = "google/gemma-3-4b-it"
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quant_config,
).eval()
processor = AutoProcessor.from_pretrained(model_id, use_fast=True)

img = load_image("https://huggingface.co/datasets/merve/coco/resolve/main/val2017/000000039769.jpg")

def last_token_feature(inputs):
    # Last-layer hidden state of the final token, L2-normalized.
    # Under causal attention the first token (<bos>) never sees the rest of the input,
    # so the last token is used. Each batch here is a single unpadded sequence.
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return normalize(out.hidden_states[-1][:, -1, :].float(), dim=-1)

# Image feature: the processor expands <start_of_image> into the image token sequence.
image_inputs = processor(
    text="<start_of_image>", images=img, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)
img_feats = last_token_feature(image_inputs)

# Text feature: same pathway, so both features come from the same hidden-state space.
text_inputs = processor(
    text="This is a photo of 2 cats.", return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)
txt_feats = last_token_feature(text_inputs)

# Not a calibrated score: the model was not trained contrastively.
sim = cosine_similarity(img_feats, txt_feats)
print("Cosine similarity:", sim.item())

The question comes from Stack Exchange; asked by H.H