Differences in Calling nlp.update() on Blank vs. Pre-Trained spaCy Models, and Troubleshooting the ValueError Raised When Fine-Tuning en_core_web_trf

阿华AIGC实验室

2026-4-30

Fixing ValueError When Fine-Tuning spaCy's en_core_web_trf for Custom NER

Let's break down your problem step by step and fix that frustrating ValueError first, then clarify the key differences between training blank vs. pre-trained spaCy models.

Root Cause of the ValueError

Your error most likely comes from three issues in your code:

- Incomplete Example construction: the annotations were not passed when creating the Example objects, so the examples carry no gold-standard entities for the NER component to learn from.
- Wrong optimizer setup for pre-trained models: nlp.resume_training() is the call intended for continuing to train an already trained pipeline. It leaves the pre-trained weights as they are and returns an optimizer to pass to nlp.update(); use it instead of nlp.create_optimizer(), and never call nlp.initialize() on a loaded model, since that re-initializes the weights.
- Incorrect pipe disabling: when fine-tuning NER in a transformer pipeline, the transformer pipe must stay active because it produces the features the NER component consumes. Disabling it breaks the data flow to the NER model.
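
To make the first point concrete, here is a minimal sketch of building one Example with its annotations attached. It assumes spaCy v3 is installed; the sentence and the character offsets are made-up illustration data, not taken from your training set.

```python
import spacy
from spacy.training import Example

# A blank pipeline is enough to tokenize; no model download is needed.
nlp = spacy.blank("en")

# Hypothetical sample: (start_char, end_char, label) for each entity.
text = "Apple is opening an office in Berlin"
annotations = {"entities": [(0, 5, "ORG"), (30, 36, "GPE")]}

doc = nlp.make_doc(text)                        # unannotated "predicted" doc
example = Example.from_dict(doc, annotations)   # annotations become the reference doc

# The gold entities live on example.reference, which is what nlp.update() trains against.
gold = [(ent.text, ent.label_) for ent in example.reference.ents]
print(gold)
```

If the annotations dict is missing, the reference doc has no entities to learn from, so it is worth printing example.reference.ents for a few samples before training.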

Corrected Training Code

Here's revised code that handles both blank and pre-trained (including transformer) models:

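import random

import spacy
from spacy.training import Example
from spacy.util import minibatch  # in spaCy v3 the batching helper lives in spacy.util
from thinc.api import compounding  # batch-size schedule (thinc 8.x, as used by spaCy v3)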

def train_custom_ner(TRAIN_DATA, model_name=None, dropout=0.5, nIter=10):
    # Load pre-trained model or initialize blank
    if model_name:
        nlp = spacy.load(model_name)
        print(f"Loaded pre-trained model: {model_name}")
    else:
        nlp = spacy.blank("en")
        print("Created blank English model")

    # Add/access NER pipe
    if "ner" not in nlp.pipe_names:
        ner = nlp.add_pipe("ner", last=True)
    else:
        ner = nlp.get_pipe("ner")

    # Register custom labels with the NER component
    for text, annotations in TRAIN_DATA:
        for ent in annotations.get("entities", []):
            ner.add_label(ent[2])

    # Build training examples correctly
    examples_train = []
    for text, annotations in TRAIN_DATA:
        doc = nlp.make_doc(text)
        example = Example.from_dict(doc, annotations)  # Fixed: added annotations
        examples_train.append(example)

    # Define which pipes to keep active
    pipe_exceptions = ["ner"]
    if "transformer" in nlp.pipe_names:
        pipe_exceptions.append("transformer")  # Keep transformer for feature extraction
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]

    with nlp.select_pipes(disable=other_pipes):  # disable_pipes() is deprecated in spaCy v3
        # Initialize or resume training
        if model_name is None:
            optimizer = nlp.initialize()  # Fresh start for blank model
        else:
            optimizer = nlp.resume_training()  # Preserve pre-trained state

        # Training loop
        for itn in range(nIter):
            random.shuffle(examples_train)
            losses_train = {}
            batches = minibatch(examples_train, size=compounding(4.0, 32.0, 1.001))
            for batch in batches:
                nlp.update(
                    batch,
                    drop=dropout,
                    losses=losses_train,
                    sgd=optimizer
                )
            print(f"Iteration {itn+1}: Training Loss = {losses_train.get('ner', 0):.4f}")
    
    return nlp

# Usage examples
# Train blank model:
# nlp_blank = train_custom_ner(TRAIN_DATA, model_name=None)
# Fine-tune transformer model:
# nlp_trf = train_custom_ner(TRAIN_DATA, model_name="en_core_web_trf")
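
The training loop above uses compounding(4.0, 32.0, 1.001) as the batch size. As a rough illustration of what that schedule does, here is a pure-Python stand-in. The name compounding_sketch is hypothetical, and the sketch assumes the schedule multiplies the current value by the rate at every step and caps it at the stop value.

```python
from itertools import islice

def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    value = float(start)
    while True:
        yield min(value, stop)
        value *= compound

# Look at the first 3000 batch sizes the schedule would produce.
sizes = list(islice(compounding_sketch(4.0, 32.0, 1.001), 3000))

# Batch sizes are truncated to integers when batches are cut.
first_sizes = [int(s) for s in sizes[:3]]

# Index of the first batch whose size has reached the cap of 32.
steps_to_cap = next(i for i, s in enumerate(sizes) if s >= 32.0)
print(first_sizes, steps_to_cap)
```

With a rate of 1.001 the size needs roughly two thousand batches to grow from 4 to 32, so on a small dataset with nIter=10 the batch size effectively stays at 4. A larger rate or a fixed batch size may be a simpler choice in that situation.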

Key Differences Between `nlp.update()` for Blank vs. Pre-Trained Models

Let's clear up the core distinctions:

Blank Models:
- Require nlp.initialize() to set up random weights for all components and create a fresh optimizer from scratch.
- Pass the optimizer returned by nlp.initialize() to nlp.update() via the sgd argument so that every update step uses the same optimizer; there is no pre-existing training state to build on.
- All components start from zero, so you need to manually add custom labels and configure pipes before training.
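
The blank-model path can be sketched end to end as follows. This assumes spaCy v3; TOY_DATA is hypothetical illustration data, and the iteration count and dropout are arbitrary values chosen to keep the run short.

```python
import random

import spacy
from spacy.training import Example

# Hypothetical toy data, for illustration only.
TOY_DATA = [
    ("Apple is opening an office in Berlin",
     {"entities": [(0, 5, "ORG"), (30, 36, "GPE")]}),
    ("Google hired engineers in Paris",
     {"entities": [(0, 6, "ORG"), (26, 31, "GPE")]}),
]

nlp = spacy.blank("en")   # empty pipeline, tokenizer only
nlp.add_pipe("ner")       # untrained NER component

examples = [Example.from_dict(nlp.make_doc(text), ann) for text, ann in TOY_DATA]

# initialize() sets up fresh weights, infers the labels from the examples,
# and returns the optimizer to reuse on every update step.
optimizer = nlp.initialize(lambda: examples)

losses = {}
for _ in range(5):
    random.shuffle(examples)
    nlp.update(examples, drop=0.2, sgd=optimizer, losses=losses)

labels = nlp.get_pipe("ner").labels
print(labels, losses)
```

Passing the examples to nlp.initialize() lets the NER component pick up its labels from the data, which is an alternative to calling ner.add_label() for each label by hand.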

Pre-Trained Models (Including Transformers):
- Use nlp.resume_training() instead of nlp.initialize(). It leaves the pre-trained weights untouched and returns an optimizer for nlp.update(), whereas nlp.initialize() would reset the weights and overwrite the learned features.
- You only need to add new custom labels to the NER pipe (no need to reinitialize the entire pipeline).
- Transformer-based models depend on the transformer pipe for high-quality feature extraction—never disable this pipe during fine-tuning.
- nlp.update() adjusts weights incrementally on top of the pre-trained model, which typically means faster convergence and better results than training from scratch, especially with small datasets.
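
The rule about keeping the transformer pipe active can be captured in a small helper. This is a sketch only: pipes_to_disable is a hypothetical name, and the pipe list below is assumed to match what nlp.pipe_names reports for en_core_web_trf, so check it against your installed version.

```python
def pipes_to_disable(pipe_names, target="ner"):
    """Return the pipes that can be switched off while fine-tuning `target`."""
    keep = {target}
    if "transformer" in pipe_names:
        keep.add("transformer")  # the target component consumes its output
    return [name for name in pipe_names if name not in keep]

# Assumed pipe names of a transformer pipeline such as en_core_web_trf.
trf_pipes = ["transformer", "tagger", "parser", "attribute_ruler", "lemmatizer", "ner"]

print(pipes_to_disable(trf_pipes))   # everything except transformer and ner
print(pipes_to_disable(["ner"]))     # blank model with only NER: nothing to disable
```

The returned list is what you would pass as the disable argument when selecting pipes for training.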