Keras Dense Layer Dimension Mismatch: Wrong Output Shape for Phrase Synonym Tagging

阿华AIGC实验室

2026-5-13

Fixing Your Phrase Synonym Tagging Model Performance

Let's break down why your current setup is underperforming and how to fix it:

The Core Issue with Your Current Model

Your BiLSTM does not return sequences (return_sequences=False), so it collapses the entire 58-token sequence into a single 256-dimensional vector. Feeding that vector into a Dense(290) layer forces the model to produce all 58 token-level predictions (58 tokens × 5 classes = 290 values) from one global representation. By that point the position-specific context needed to predict per-token synonym tags is gone, which explains the poor performance.

The Correct Architecture for Token-Level Classification

To get your desired output shape (None, 58, 5), you need to preserve sequence information through the BiLSTM and apply the dense classification to each token independently. Here's how to adjust your model:

Enable sequence return in BiLSTM: Set return_sequences=True so the BiLSTM outputs a (None, 58, 256) tensor—one 256-d vector per token in the sequence.
Use TimeDistributed to wrap the Dense layer: This applies the Dense(5) layer to every time step (token) in the sequence, producing the exact (None, 58, 5) output you need.
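
To make the second step concrete, here is a minimal NumPy sketch of what applying one Dense(5) layer to every time step amounts to. It is an illustration under stated assumptions, not Keras internals: the random array stands in for the BiLSTM output, and W, b, and the softmax helper are hypothetical names introduced for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, timesteps, hidden, n_classes = 2, 58, 256, 5

# Stand-in for the BiLSTM output with return_sequences=True: (batch, 58, 256)
sequence_output = rng.normal(size=(batch, timesteps, hidden))

# One shared weight matrix and bias, as a single Dense(5) layer would hold
W = rng.normal(size=(hidden, n_classes)) * 0.01
b = np.zeros(n_classes)


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


# The matrix product broadcasts over the batch and time axes,
# so every token is classified with the same weights.
per_token_probs = softmax(sequence_output @ W + b)

# The same result, computed explicitly one token at a time
looped = np.stack(
    [softmax(sequence_output[:, t, :] @ W + b) for t in range(timesteps)],
    axis=1,
)

print(per_token_probs.shape)  # (2, 58, 5)
```

Each token gets its own 5-way probability distribution, and the weights are shared across all 58 positions.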

Example Keras Implementation

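from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import Input, Embedding, Bidirectional, LSTM, TimeDistributed, Dense, concatenate
from tensorflow.keras.models import Model

# Load pre-trained GloVe embedding matrix (replace with your actual matrix,
# assumed here to be a NumPy array of shape (vocab_size, 300))
glove_embedding_matrix = ...
vocab_size = glove_embedding_matrix.shape[0]  # assumes one row per vocabulary entry

# Define input layers for two sentences
input1 = Input(shape=(58,))
input2 = Input(shape=(58,))

# GloVe embedding layer, shared by both inputs (create two layers if you need separate weights).
# The sequence length (58) comes from the Input layers, so input_length is not needed.
embedding_layer = Embedding(
    input_dim=vocab_size,
    output_dim=300,
    embeddings_initializer=Constant(glove_embedding_matrix),
    trainable=False  # Set to True if you want to fine-tune GloVe
)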

# Embed both inputs
embedded1 = embedding_layer(input1)
embedded2 = embedding_layer(input2)

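# BiLSTM with return_sequences=True to keep per-token outputs.
# Note: Bidirectional concatenates the forward and backward outputs by default,
# so LSTM(256) yields 512 features per token; use LSTM(128) for a 256-d output.
bilstm1 = Bidirectional(LSTM(256, return_sequences=True))(embedded1)
bilstm2 = Bidirectional(LSTM(256, return_sequences=True))(embedded2)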

# Merge the two sequence outputs (concatenate, add, or multiply—choose what works best)
merged = concatenate([bilstm1, bilstm2], axis=-1)

# Apply Dense(5) to every token in the merged sequence
output = TimeDistributed(Dense(5, activation='softmax'))(merged)

# Build and compile the model
model = Model(inputs=[input1, input2], outputs=output)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
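
Before training on real data, it can help to confirm the output shape with a scaled-down copy of the model. The sketch below is only a smoke test under assumptions: the toy sizes (vocab_size=50, embed_dim=8, lstm_units=4) are hypothetical, and a randomly initialized embedding replaces the GloVe matrix so the snippet is self-contained.

```python
import numpy as np
from tensorflow.keras.layers import (
    Input, Embedding, Bidirectional, LSTM, TimeDistributed, Dense, concatenate,
)
from tensorflow.keras.models import Model

seq_len, n_classes = 58, 5
vocab_size, embed_dim, lstm_units = 50, 8, 4  # hypothetical toy sizes

in1 = Input(shape=(seq_len,), dtype="int32")
in2 = Input(shape=(seq_len,), dtype="int32")

# Randomly initialized embedding, standing in for the frozen GloVe layer
embed = Embedding(input_dim=vocab_size, output_dim=embed_dim)

h1 = Bidirectional(LSTM(lstm_units, return_sequences=True))(embed(in1))
h2 = Bidirectional(LSTM(lstm_units, return_sequences=True))(embed(in2))
merged_toy = concatenate([h1, h2], axis=-1)
out = TimeDistributed(Dense(n_classes, activation="softmax"))(merged_toy)
toy_model = Model(inputs=[in1, in2], outputs=out)

# Three random "sentence pairs" of token ids
x1 = np.random.randint(0, vocab_size, size=(3, seq_len), dtype="int32")
x2 = np.random.randint(0, vocab_size, size=(3, seq_len), dtype="int32")

probs = toy_model.predict([x1, x2], verbose=0)
print(probs.shape)  # (3, 58, 5): one 5-way distribution per token
```

If the printed shape has only two dimensions, return_sequences is most likely still False somewhere in the stack.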

Label Handling Adjustment

Instead of flattening your labels to (None, 290), keep them in the shape (None, 58, 5), one-hot encoded over the 5 classes for each token. If your labels are integer indices (0-4), keep them in the shape (None, 58), use sparse_categorical_crossentropy instead of categorical_crossentropy, and skip one-hot encoding.
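
As a rough sketch of the label conversions involved, the NumPy snippet below uses randomly generated tags as hypothetical data. It assumes the flattened labels were laid out token by token (the 5 class values for token 0, then token 1, and so on); if your flattening used a different order, the reshape needs adjusting.

```python
import numpy as np

n_samples, seq_len, n_classes = 4, 58, 5
rng = np.random.default_rng(0)

# Hypothetical integer tags (0-4), one per token: shape (4, 58)
int_labels = rng.integers(0, n_classes, size=(n_samples, seq_len))

# One-hot labels for categorical_crossentropy: shape (4, 58, 5)
one_hot = np.eye(n_classes, dtype="float32")[int_labels]

# Flattened labels as used with Dense(290): shape (4, 290)
flat = one_hot.reshape(n_samples, seq_len * n_classes)

# Undo the flattening to recover the per-token layout: shape (4, 58, 5)
restored = flat.reshape(n_samples, seq_len, n_classes)

# Integer labels for sparse_categorical_crossentropy: shape (4, 58)
recovered_ints = restored.argmax(axis=-1)

print(flat.shape, restored.shape, recovered_ints.shape)
```

The same argmax over the last axis turns the model's (None, 58, 5) predictions into one tag index per token.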

Why This Works

By retaining sequence outputs from the BiLSTM, each token's representation carries context from the entire sentence (both forward and backward, thanks to the bidirectional layer). The TimeDistributed layer ensures each token's prediction is based on its own context-rich vector, which is critical for accurate per-token synonym tagging.

The question discussed here comes from Stack Exchange; it was asked by Aladar Miao.