Keras Dense layer dimension mismatch: wrong output shape for phrase synonym tagging
Let's break down why your current setup is underperforming and how to fix it:
The Core Issue with Your Current Model
Your BiLSTM is set to not return sequences (return_sequences=False), which means it collapses the entire 58-token sequence into a single 256-dimensional vector. Feeding this vector into a Dense(290) layer forces the model to map one global representation to all 58 token-level predictions at once (58 tokens × 5 classes = 290 outputs). This throws away the position-specific context the model needs to predict per-token synonym tags, so it is no surprise that performance is poor.
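To make the shape difference concrete, here is a minimal sketch that runs the same random input through a BiLSTM with and without return_sequences. The batch size of 2, the 300-d random stand-in for the embedded input, and the choice of 128 units per direction (so that the two directions concatenate to 256) are assumptions for illustration, not details taken from your model.

```python
import numpy as np
from tensorflow.keras.layers import Bidirectional, LSTM

# Stand-in for embedded input: 2 sentences, 58 tokens, 300-d vectors (random, illustrative only)
embedded = np.random.rand(2, 58, 300).astype("float32")

# Assumed 128 units per direction -> 256 features after the default forward/backward concat
collapsed = Bidirectional(LSTM(128, return_sequences=False))(embedded)
per_token = Bidirectional(LSTM(128, return_sequences=True))(embedded)

collapsed_shape = tuple(collapsed.shape)  # one vector per sentence
per_token_shape = tuple(per_token.shape)  # one vector per token
print(collapsed_shape, per_token_shape)
```

With return_sequences=False the token axis is gone, so nothing downstream can tell position 3 from position 40; with return_sequences=True the 58-step axis survives.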
The Correct Architecture for Token-Level Classification
To get your desired output shape (None, 58, 5), you need to preserve sequence information through the BiLSTM and apply the dense classification to each token independently. Here's how to adjust your model:
- Enable sequence return in the BiLSTM: set return_sequences=True so the BiLSTM outputs a (None, 58, 256) tensor, one 256-d vector per token in the sequence.
- Use TimeDistributed to wrap the Dense layer: this applies the Dense(5) layer to every time step (token) in the sequence, producing the exact (None, 58, 5) output you need.
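If it helps to see what "apply Dense(5) to every time step" means numerically, the following NumPy sketch mimics it with one shared weight matrix. The random features, the weight scale, and the batch size of 2 are made up for illustration; this shows the arithmetic, not Keras internals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the BiLSTM output: 2 sentences, 58 tokens, 256 features per token
sequence_features = rng.normal(size=(2, 58, 256))

# One shared weight matrix and bias, reused at every token position
weights = rng.normal(scale=0.01, size=(256, 5))
bias = np.zeros(5)

# Multiplying along the last axis applies the same dense transform to each token
logits = sequence_features @ weights + bias  # shape (2, 58, 5)

# Softmax over the class axis, computed independently for each token
shifted = logits - logits.max(axis=-1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

print(probs.shape)
```

Note that a plain Keras Dense layer already operates on the last axis of a 3D input, so Dense(5) without the wrapper should give the same output shape; TimeDistributed mainly makes the per-token intent explicit.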
Example Keras Implementation
```python
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import Input, Embedding, Bidirectional, LSTM, TimeDistributed, Dense, concatenate
from tensorflow.keras.models import Model

# Load pre-trained GloVe embedding matrix (replace with your actual matrix)
glove_embedding_matrix = ...
vocab_size = ...  # number of rows in glove_embedding_matrix

# Define input layers for two sentences
input1 = Input(shape=(58,))
input2 = Input(shape=(58,))

# GloVe embedding layer (shared here; use separate layers if that suits your needs)
# Constant initializer is used instead of weights=/input_length=, which newer Keras versions may reject
embedding_layer = Embedding(
    input_dim=vocab_size,
    output_dim=300,
    embeddings_initializer=Constant(glove_embedding_matrix),
    trainable=False,  # Set to True if you want to fine-tune GloVe
)

# Embed both inputs
embedded1 = embedding_layer(input1)
embedded2 = embedding_layer(input2)

# BiLSTM with return_sequences=True to keep per-token outputs
# 128 units per direction -> (None, 58, 256) after the default forward/backward concat
bilstm1 = Bidirectional(LSTM(128, return_sequences=True))(embedded1)
bilstm2 = Bidirectional(LSTM(128, return_sequences=True))(embedded2)

# Merge the two sequence outputs (concatenate, add, or multiply; choose what works best)
merged = concatenate([bilstm1, bilstm2], axis=-1)  # (None, 58, 512)

# Apply Dense(5) to every token in the merged sequence -> (None, 58, 5)
output = TimeDistributed(Dense(5, activation='softmax'))(merged)

# Build and compile the model
model = Model(inputs=[input1, input2], outputs=output)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```
Label Handling Adjustment
Instead of flattening your labels to (None, 290), keep them in the shape (None, 58, 5) (one-hot encoded for 5 classes per token). If your labels are integer indices (0-4), use sparse_categorical_crossentropy instead of categorical_crossentropy and skip one-hot encoding.
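As a rough sketch of the label shapes involved, the NumPy snippet below builds hypothetical random tags and converts between the flattened, one-hot, and integer forms. It assumes the flattening was done token by token (the 5 class slots of token 0 first, then token 1, and so on); if your labels were flattened in a different order, the reshape would need to match that order.

```python
import numpy as np

num_sentences, seq_len, num_classes = 4, 58, 5

# Hypothetical integer tags in 0..4, one per token (random, for illustration only)
rng = np.random.default_rng(0)
int_labels = rng.integers(0, num_classes, size=(num_sentences, seq_len))

# One-hot form for categorical_crossentropy: shape (4, 58, 5)
onehot_labels = np.eye(num_classes, dtype="float32")[int_labels]

# Labels previously flattened to (4, 290) can be restored to the per-token layout
flat_labels = onehot_labels.reshape(num_sentences, seq_len * num_classes)
restored = flat_labels.reshape(num_sentences, seq_len, num_classes)

# Integer form for sparse_categorical_crossentropy: shape (4, 58)
sparse_labels = restored.argmax(axis=-1)

print(flat_labels.shape, restored.shape, sparse_labels.shape)
```

Either restored (with categorical_crossentropy) or sparse_labels (with sparse_categorical_crossentropy) can then be passed as the target for a model whose output is (None, 58, 5).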
Why This Works
By retaining sequence outputs from the BiLSTM, each token's representation carries context from the entire sentence (both forward and backward, thanks to the bidirectional layer). The TimeDistributed layer ensures each token's prediction is based on its own context-rich vector, which is critical for accurate per-token synonym tagging.
The question comes from Stack Exchange; asked by Aladar Miao.




