Troubleshooting Python KeyError: "@" During Machine Learning Model Training: Causes and Fixes
Let's break down why you're hitting this KeyError and walk through actionable fixes to avoid similar issues in the future.
Root Cause of the Error
Your KeyError: "@" happens because:
- One or more SMILES strings in your input batch x_batch contain the chirality marker "@" (SMILES uses "@" and "@@" to specify the configuration of a tetrahedral stereocenter).
- Your token-to-index mapping dictionary __t2i does not include "@" as a valid key. When the code tries to look up __t2i[tok] for the token "@", it can't find a match and throws the error.
Looking at your __t2i definition, you're missing several common SMILES tokens beyond "@", such as "Br", "I", the ring-closure digit "1", and "/" and "\" (which mark cis/trans geometry around double bonds). Any of these will cause the same error if they appear in your data.
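The failure can be reproduced in isolation. The snippet below is a minimal sketch, not your actual code: the vocabulary `t2i` and the character-level split are hypothetical stand-ins chosen only to show that plain dictionary indexing fails on the first token that is not a key.

```python
# Hypothetical, trimmed-down vocabulary. Note that "@" is absent.
t2i = {"C": 1, "N": 2, "O": 3, "[": 4, "]": 5, "H": 6, "(": 7, ")": 8}


def encode(tokens):
    # Plain indexing raises KeyError on the first token that is not a key.
    return [t2i[tok] for tok in tokens]


# Character-level split of a chiral SMILES fragment, for illustration only.
tokens = list("N[C@@H](C)O")

missing = None
try:
    encode(tokens)
except KeyError as err:
    missing = err.args[0]  # the key that could not be found
    print(f"KeyError for token: {missing!r}")
```

The lookup succeeds for "N", "[" and "C", then stops at the first "@", which is the same pattern as in your traceback.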
Step-by-Step Fixes
1. Update the Token-to-Index Mapping
First, add all missing SMILES tokens to __t2i. At minimum, include "@" and other frequent tokens to cover standard molecular structures:
__t2i = {
    '>': 1, '<': 2, '2': 3, 'F': 4, 'Cl': 5, 'N': 6, '[': 7, '6': 8,
    'O': 9, 'c': 10, ']': 11, '#': 12, '=': 13, '3': 14, ')': 15, '4': 16,
    '-': 17, 'n': 18, 'o': 19, '5': 20,
    # Added: previously missing common tokens
    '@': 21, 'Br': 22, 'I': 23, '1': 24, '/': 25, '\\': 26,
}
Pro tip: Cross-reference your mapping with a standard SMILES token list to ensure you don't miss any other critical tokens.
2. Add Error Handling for Unknown Tokens
Even with a complete mapping, unexpected tokens might slip into your data. Add safeguards to avoid crashes and debug quickly:
Option A: Use dict.get() with a Default Value
Assign a special index (e.g., 0) to unknown tokens so your code can run without crashing:
# Replace the error line with this
tokens = ([1] + [__t2i.get(tok, 0) for tok in smiles_tokenizer(s)])[:pad_size - 1]
Note: Make sure your model is trained to handle this "unknown" token (index 0) appropriately, and that index 0 is not already used for something else in your pipeline, such as padding.
Option B: Throw a Descriptive Error
If you want to catch unknown tokens early (before training), raise a clear error with context:
tokens = []
for tok in smiles_tokenizer(s):
    if tok not in __t2i:
        raise ValueError(f"Unrecognized SMILES token '{tok}' found in string: {s}")
    tokens.append(__t2i[tok])
tokens = ([1] + tokens)[:pad_size - 1]
This helps you quickly identify problematic SMILES strings in your dataset.
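Options A and B can also live in one function, with a flag that switches between them. The following is a minimal sketch under stated assumptions: `encode_smiles`, `UNK_INDEX`, `START_INDEX` and the `strict` flag are hypothetical names, index 0 is assumed to be free for unknown tokens, and `list` is used as a character-level stand-in for your real tokenizer.

```python
UNK_INDEX = 0    # assumption: index 0 is reserved for unknown tokens
START_INDEX = 1  # mirrors the [1] prefix in the original encoding line


def encode_smiles(s, t2i, tokenizer, pad_size, strict=False):
    """Encode one SMILES string, mapping unknown tokens to UNK_INDEX.

    With strict=True, raise ValueError listing every unknown token instead.
    """
    ids = []
    unknown = []
    for tok in tokenizer(s):
        if tok in t2i:
            ids.append(t2i[tok])
        else:
            unknown.append(tok)
            ids.append(UNK_INDEX)
    if strict and unknown:
        raise ValueError(f"Unrecognized SMILES tokens {unknown} in string: {s}")
    # Prepend the start index and truncate, as in the original line.
    return ([START_INDEX] + ids)[:pad_size - 1]


# Hypothetical vocabulary without "@".
vocab = {"C": 3, "O": 4, "(": 5, ")": 6, "[": 7, "]": 8, "H": 9}

ids = encode_smiles("C[C@H](O)C", vocab, list, pad_size=16)
print(ids)  # [1, 3, 7, 3, 0, 9, 8, 5, 4, 6, 3]
```

Running with strict=True during data validation and strict=False during training is one possible split; which default suits you depends on whether silently mapped tokens are acceptable for your model.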
3. Validate Your Tokenizer's Behavior
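Ensure your smiles_tokenizer is correctly splitting SMILES strings into tokens. For example, test it on a chiral SMILES like CC(=O)N[C@@H](C)O. The output should contain the chirality marker, either as two "@" tokens or as a single "@@" token depending on your regex, and whichever form it produces must exist as a key in __t2i. If the marker is missing or fused into a neighboring token, your _atoms_re regex is probably missing a pattern for it. Adjust the regex so that "@" is split out as a standalone token.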
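A quick check of this kind might look like the sketch below. The pattern `TOKEN_RE` and the function `sketch_tokenizer` are hypothetical stand-ins, since the real _atoms_re is not shown here. The pattern matches the two-letter halogens first and then falls back to single characters, so each "@" comes out as its own token.

```python
import re

# Hypothetical pattern, not the original _atoms_re: two-letter atoms first,
# then any single character (which covers "@", brackets, bonds and digits).
TOKEN_RE = re.compile(r"Cl|Br|.")


def sketch_tokenizer(s):
    return TOKEN_RE.findall(s)


tokens = sketch_tokenizer("CC(=O)N[C@@H](C)O")
print(tokens)

# The chirality marker should survive tokenization as standalone tokens.
print("@ count:", tokens.count("@"))  # @ count: 2
```

If your tokenizer instead emits "@@" as one token, that is also workable, as long as "@@" is added to the mapping alongside "@".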
Preventing Future KeyErrors
- Audit your dataset: Check for all unique tokens in your SMILES strings before finalizing __t2i. You can do this with a simple loop to collect all tokens from your training data.
- Use pre-built SMILES tooling: Consider leveraging established tools (for example, RDKit for parsing and validating SMILES, or the tokenizer that ships with ChemBERTa on Hugging Face) rather than maintaining a hand-written token list.
- Add unit tests: Write tests for your encoding function using edge-case SMILES (chiral centers, rare elements, unusual bonds) to catch missing tokens early.
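The dataset audit from the first bullet can be sketched as follows. All names here (`TOKEN_RE`, `audit_tokens`, `extend_vocab`) and the sample data are hypothetical, and the regex is a stand-in for your own tokenizer. The idea is to count every token in the training data, report those absent from the mapping, and assign them fresh indices.

```python
import re
from collections import Counter

# Hypothetical stand-in for the real tokenizer regex.
TOKEN_RE = re.compile(r"Cl|Br|.")


def audit_tokens(smiles_list, t2i):
    """Return (token counts, counts of tokens that are missing from t2i)."""
    counts = Counter(tok for s in smiles_list for tok in TOKEN_RE.findall(s))
    missing = {tok: n for tok, n in counts.items() if tok not in t2i}
    return counts, missing


def extend_vocab(t2i, missing):
    """Return a copy of t2i with missing tokens appended after the last index."""
    extended = dict(t2i)
    next_index = max(extended.values(), default=0) + 1
    for tok in sorted(missing):  # sorted for a reproducible index order
        extended[tok] = next_index
        next_index += 1
    return extended


# Hypothetical mapping and data, for illustration only.
t2i = {">": 1, "<": 2, "C": 3, "O": 4, "(": 5, ")": 6, "=": 7}
smiles = ["CC(=O)O", "C[C@H](Br)C", "C/C=C/C"]

counts, missing = audit_tokens(smiles, t2i)
print(sorted(missing))  # ['/', '@', 'Br', 'H', '[', ']']

extended = extend_vocab(t2i, missing)
print(extended["@"])  # 9
```

Extending the mapping changes the vocabulary size, so it is best done before training. A model already trained on the smaller vocabulary would need its embedding layer resized.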
This question originates from Stack Exchange; asked by fraulifang.




