Troubleshooting a Python KeyError: "@" During Machine Learning Model Training: Causes and Fixes

Ahua AIGC Lab (阿华AIGC实验室)

2026-4-29

Fixing KeyError: "@" in SMILES Encoding for ML Models

Let's break down why you're hitting this KeyError and walk through actionable fixes to avoid similar issues in the future.

Root Cause of the Error

Your KeyError: "@" happens because:

One or more SMILES strings in your input batch x_batch contain the chirality symbol "@" (written as "@" or "@@" inside a bracket atom to mark the configuration of a stereocenter).
Your token-to-index mapping dictionary __t2i has no "@" key, so the lookup __t2i[tok] raises KeyError as soon as tok is "@".
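
The failure can be reproduced in a few lines. This is only a minimal sketch: `t2i` is a hypothetical, trimmed-down stand-in for your `__t2i`, `encode` is an illustrative helper, and the token list is written by hand rather than produced by your tokenizer.

```python
# Hypothetical, trimmed-down mapping that has no entry for "@".
t2i = {"C": 1, "[": 2, "]": 3, "H": 4}

# Tokens of the bracket atom "[C@@H]", split one character at a time.
tokens = ["[", "C", "@", "@", "H", "]"]


def encode(tokens, mapping):
    """Look up each token directly; a missing key raises KeyError."""
    return [mapping[tok] for tok in tokens]


try:
    encode(tokens, t2i)
    missing = None
except KeyError as exc:
    # exc.args[0] is the key that was not found in the mapping.
    missing = exc.args[0]

print(f"Missing token: {missing!r}")
```

The lookup stops at the first token without an entry, which is why the traceback names only "@" even if other tokens are missing too.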

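Looking at your __t2i definition, it is also missing several other common SMILES tokens, such as "Br", "I", the ring-closure digit "1", and "/" (which, together with "\", marks bond direction for cis/trans double-bond stereochemistry rather than chirality). Each of these will raise the same error if it appears in your data.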

Step-by-Step Fixes

1. Update the Token-to-Index Mapping

First, add all missing SMILES tokens to __t2i. At minimum, include "@" and other frequent tokens to cover standard molecular structures:

__t2i = {
    '>': 1, '<': 2, '2': 3, 'F': 4, 'Cl': 5, 'N': 6, '[': 7, '6': 8, 'O': 9, 'c': 10,
    ']': 11, '#': 12, '=': 13, '3': 14, ')': 15, '4': 16, '-': 17, 'n': 18, 'o': 19, '5': 20,
    '@': 21, 'Br': 22, 'I': 23, '1': 24, '/': 25, '\\': 26,  # newly added common tokens
}

Pro tip: Cross-reference your mapping with a reference list of SMILES symbols (for example, the ones defined in the OpenSMILES specification) so that you don't miss other tokens that occur in your data.
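
If the model already depends on the existing indices, one option is to append each missing token at the next free index instead of renumbering. The following is a sketch under assumptions: `extend_mapping` is a hypothetical helper, `base` is only a small slice of the mapping, and `extra_tokens` is an illustrative list that you would derive from your own data.

```python
def extend_mapping(mapping, extra_tokens):
    """Return a copy of `mapping` with unseen tokens appended at new indices.

    Indices of tokens that are already present are left unchanged.
    """
    extended = dict(mapping)
    next_index = max(extended.values(), default=0) + 1
    for tok in extra_tokens:
        if tok not in extended:
            extended[tok] = next_index
            next_index += 1
    return extended


# Hypothetical starting point: a small slice of the original mapping.
base = {">": 1, "<": 2, "2": 3, "F": 4, "Cl": 5}

# Illustrative list only; "Cl" is included to show that duplicates are skipped.
extra_tokens = ["@", "Br", "I", "1", "/", "\\", "Cl"]

t2i = extend_mapping(base, extra_tokens)
print(t2i)
```

Keep in mind that a larger vocabulary usually means any embedding layer sized to the old vocabulary has to be resized as well.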

2. Add Error Handling for Unknown Tokens

Even with a complete mapping, unexpected tokens might slip into your data. Add safeguards to avoid crashes and debug quickly:

Option A: Use `dict.get()` with a Default Value

Assign a special index (e.g., 0) to unknown tokens so your code can run without crashing:

# Replace the error line with this
tokens = ([1] + [__t2i.get(tok, 0) for tok in smiles_tokenizer(s)])[:pad_size - 1]

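Note: Make sure your model is trained to handle this "unknown" index appropriately. Your mapping starts at 1, so if index 0 already serves as the padding value in your pipeline, reserve a separate unused index for unknown tokens; otherwise the model cannot tell padding from an unrecognized token.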
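
A sketch of how a dedicated unknown index could sit alongside padding. Everything named here is an assumption for illustration: the constants `PAD_INDEX` and `UNK_INDEX`, the helper `encode_tokens`, and the toy mapping. The helper takes an already tokenized string and mirrors the `[:pad_size - 1]` truncation from the line above.

```python
PAD_INDEX = 0  # assumption: 0 is reserved for padding


def encode_tokens(tokens, mapping, unk_index, pad_size, start_index=1):
    """Encode tokens, send unknown ones to `unk_index`, and pad to `pad_size`."""
    ids = [start_index] + [mapping.get(tok, unk_index) for tok in tokens]
    ids = ids[:pad_size - 1]  # same truncation as in the original line
    return ids + [PAD_INDEX] * (pad_size - len(ids))


# Hypothetical toy mapping; index 1 plays the role of the start token.
t2i = {">": 1, "<": 2, "C": 3, "O": 4}
UNK_INDEX = max(t2i.values()) + 1  # first index not used by any known token

encoded = encode_tokens(["C", "@", "O"], t2i, UNK_INDEX, pad_size=6)
print(encoded)
```

With this layout the unknown "@" gets its own index, distinct from both the padding value and every known token.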

Option B: Throw a Descriptive Error

If you want to catch unknown tokens early (before training), raise a clear error with context:

tokens = []
for tok in smiles_tokenizer(s):
    if tok not in __t2i:
        raise ValueError(f"Unrecognized SMILES token '{tok}' found in string: {s}")
    tokens.append(__t2i[tok])
tokens = ([1] + tokens)[:pad_size - 1]

This helps you quickly identify problematic SMILES strings in your dataset.

3. Validate Your Tokenizer's Behavior

Ensure your smiles_tokenizer is correctly splitting SMILES strings into tokens. For example, test a chiral SMILES such as CC(=O)N[C@@H](C)O: the output should contain "@" tokens (two of them if "@@" is split character by character, or a single "@@" token if your regex keeps the pair together, in which case "@@" needs its own entry in __t2i). If no "@" token appears at all, your _atoms_re regex is probably missing a pattern for chirality symbols; adjust it so that "@" is emitted as a standalone token.
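
A quick check of this kind might look like the sketch below. The pattern `TOKEN_RE` and the function `simple_smiles_tokenizer` are assumptions made for illustration, not your actual `_atoms_re` or `smiles_tokenizer`; the pattern splits bracket atoms character by character and only treats "Cl" and "Br" as two-letter tokens.

```python
import re

# Assumed pattern: two-letter halogens first, then any single letter,
# digit, or non-alphanumeric symbol. Your own _atoms_re may differ.
TOKEN_RE = re.compile(r"Cl|Br|[A-Za-z]|[0-9]|[^A-Za-z0-9\s]")


def simple_smiles_tokenizer(smiles):
    """Split a SMILES string into tokens using TOKEN_RE."""
    return TOKEN_RE.findall(smiles)


smiles = "CC(=O)N[C@@H](C)O"
tokens = simple_smiles_tokenizer(smiles)
print(tokens)

# Joining the tokens should reproduce the input if nothing was dropped.
round_trip_ok = "".join(tokens) == smiles
# "@@" is split into two separate "@" tokens by this pattern.
at_count = tokens.count("@")
print(round_trip_ok, at_count)
```

The round-trip comparison is a cheap way to detect characters that a regex silently skips. Note that this simplified pattern would also split other two-letter element symbols inside brackets, such as the "Na" in [Na+], into single letters.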

Preventing Future KeyErrors

Audit your dataset: Check for all unique tokens in your SMILES strings before finalizing __t2i. You can do this with a simple loop to collect all tokens from your training data.
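Use a pre-built SMILES tokenizer: Consider an established tokenizer, such as DeepChem's SmilesTokenizer or the tokenizers distributed with ChemBERTa models on Hugging Face, which cover a much wider range of SMILES tokens than a hand-written mapping. RDKit can additionally be used to parse and canonicalize SMILES strings before tokenization.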
Add unit tests: Write tests for your encoding function using edge-case SMILES (chiral centers, rare elements, unusual bonds) to catch missing tokens early.
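
The dataset audit from the first point can be sketched as follows. The names `tokenize` and `find_missing_tokens`, the regex, the toy mapping, and the two-string dataset are all assumptions for illustration; in practice you would pass in your own tokenizer, your `__t2i`, and your training data.

```python
import re
from collections import Counter

# Assumed pattern, used here only so the example is self-contained.
TOKEN_RE = re.compile(r"Cl|Br|[A-Za-z]|[0-9]|[^A-Za-z0-9\s]")


def tokenize(smiles):
    """Stand-in tokenizer; replace with your own smiles_tokenizer."""
    return TOKEN_RE.findall(smiles)


def find_missing_tokens(smiles_list, mapping, tokenizer=tokenize):
    """Count every token in `smiles_list` that has no entry in `mapping`."""
    missing = Counter()
    for smiles in smiles_list:
        for tok in tokenizer(smiles):
            if tok not in mapping:
                missing[tok] += 1
    return missing


# Hypothetical mapping and dataset, for illustration only.
t2i = {"C": 1, "(": 2, ")": 3, "=": 4, "O": 5, "N": 6, "[": 7, "]": 8, "H": 9}
dataset = ["CC(=O)N[C@@H](C)O", "C/C=C/Br"]

missing = find_missing_tokens(dataset, t2i)
print(missing.most_common())
```

The same function can back a unit test: asserting that the returned counter is empty for your edge-case SMILES fails as soon as a token is missing from the mapping.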

The question originally comes from Stack Exchange; it was asked by fraulifang.