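How can a Python text-extraction function be extended to extract sentences containing keywords, along with a specified number of sentences before and after them?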

阿华AIGC实验室

2026-4-27

Solution to Extract Key Sentences with Context

Let's fix both issues in your code: properly extracting full sentences and adding support for surrounding context sentences.

Why Your Original Code Failed

The root problem with your regex is that it can capture sentence fragments instead of complete sentences. Because [^.]* cannot cross a period, a match is confined to the text between the nearest period before your keyword and the nearest period after it. Any period inside the sentence (an abbreviation, an initial, a decimal number) therefore cuts the match short, and sentences ending in ? or ! are not recognized as boundaries at all. We'll fix this by splitting the text into full sentences first, then filtering for keywords.
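
To see the fragment problem concretely, here is a minimal sketch. The pattern is an assumed reconstruction of the kind of [^.]*-based regex discussed above, not your exact code, and the sample text is made up for illustration.

```python
import re

# Made-up sample text: the first sentence contains an abbreviation ("approx.")
text = "The press cost approx. 800 guilders to build. It changed Europe."
keyword = "guilders"

# Assumed pattern shape: any non-period characters, the keyword,
# more non-period characters, then a closing period
pattern = re.compile(r"[^.]*" + re.escape(keyword) + r"[^.]*\.")

fragment = pattern.search(text).group(0)
print(repr(fragment))  # the match starts after "approx." and loses the sentence start
```

The match cannot reach back past the period in "approx.", so the first half of the sentence is lost.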

Complete Improved Code

First, install NLTK (for accurate sentence splitting) if you haven't already:

pip install nltk

Here's the revised function and usage:

import pandas as pd
import re
import nltk
from nltk.tokenize import sent_tokenize

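# Download NLTK's sentence tokenizer data (run once).
# Recent NLTK releases (3.9 and later) load 'punkt_tab' instead of 'punkt',
# so fetching both covers old and new installations.
nltk.download('punkt')
nltk.download('punkt_tab')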

# Sample data: one [text_number, text] pair per row.
# The third text is split across adjacent string literals, because a
# single-quoted Python string cannot contain a raw line break.
data = [
    [0, 'Johannes Gensfleisch zur Laden zum Gutenberg was a German inventor, printer, publisher, and goldsmith who introduced printing to Europe with his mechanical movable-type printing press. His work started the Printing Revolution in Europe and is regarded as a milestone of the second millennium, ushering in the modern period of human history. It played a key role in the development of the Renaissance, Reformation, Age of Enlightenment, and Scientific Revolution, as well as laying the material basis for the modern knowledge-based economy and the spread of learning to the masses.'],
    [1, 'While not the first to use movable type in the world,[a] Gutenberg was the first European to do so. His many contributions to printing include the invention of a process for mass-producing movable type; the use of oil-based ink for printing books;[7] adjustable molds;[8] mechanical movable type; and the use of a wooden printing press similar to the agricultural screw presses of the period.[9] His truly epochal invention was the combination of these elements into a practical system that allowed the mass production of printed books and was economically viable for printers and readers alike. Gutenbergs method for making type is traditionally considered to have included a type metal alloy and a hand mould for casting type. The alloy was a mixture of lead, tin, and antimony that melted at a relatively low temperature for faster and more economical casting, cast well, and created a durable type.'],
    [2, 'The use of movable type was a marked improvement on the handwritten manuscript, which was the existing method of book production in Europe, and upon woodblock printing, and revolutionized European book-making. Gutenbergs printing technology spread rapidly throughout Europe and later the world. His major work, the Gutenberg Bible (also known as the 42-line Bible), was the first printed version of the Bible and has been acclaimed for its high aesthetic and technical quality. '
        'In Renaissance Europe, the arrival of mechanical movable type printing introduced the era of mass communication which permanently altered the structure of society. The relatively unrestricted circulation of information—including revolutionary ideas—transcended borders, captured the masses in the Reformation, and threatened the power of political and religious authorities; the sharp increase in literacy broke the monopoly of the literate elite on education and learning and bolstered the emerging middle class. Across Europe, the increasing cultural self-awareness of its people led to the rise of proto-nationalism, accelerated by the flowering of the European vernacular languages to the detriment of Latins status as lingua franca. In the 19th century, the replacement of the hand-operated Gutenberg-style press by steam-powered rotary presses allowed printing on an industrial scale, while Western-style printing was adopted all over the world, becoming practically the sole medium for modern bulk printing. '],
]

df = pd.DataFrame(data, columns=['text_number', 'text'])

def extract_key_sentences_with_context(text, word_list, before_n=0, after_n=0):
    # Split text into full sentences using NLTK (handles edge cases like abbreviations)
    sentences = sent_tokenize(text)
    
    # Fallback: use a regex if NLTK isn't available (less accurate: it splits
    # after every '.', '!' or '?' followed by whitespace, including abbreviations)
    # sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    
    # Find the index of every sentence containing at least one keyword.
    # Note: `word in sent` is a case-sensitive substring test; compare
    # word.lower() against sent.lower() for case-insensitive matching.
    # Each index is appended at most once and in ascending order,
    # so no de-duplication or sorting is needed.
    key_indices = []
    for idx, sent in enumerate(sentences):
        if any(word in sent for word in word_list):
            key_indices.append(idx)
    
    # Extract context for each key sentence
    result = []
    for idx in key_indices:
        # Calculate safe start/end indices to avoid index errors
        start_idx = max(0, idx - before_n)
        end_idx = min(len(sentences) - 1, idx + after_n)
        
        # Get the full context block
        context_block = sentences[start_idx:end_idx + 1]
        result.append({
            'key_sentence': sentences[idx],
            'context': context_block,
            'original_sentence_index': idx
        })
    
    return result

tools_list = ['printing press', 'paper', 'ink', 'woodblock', 'molds', 'method']

# Extract key sentences with 1 sentence before and 1 after
df['key_sentences_with_context'] = df['text'].apply(lambda x: extract_key_sentences_with_context(str(x), tools_list, before_n=1, after_n=1))

# Preview the results
for idx, row in df.iterrows():
    print(f"\n--- Text {row['text_number']} ---")
    for entry in row['key_sentences_with_context']:
        print(f"\nKey Sentence (Index {entry['original_sentence_index']}):")
        print(entry['key_sentence'])
        print("Context:")
        print(' '.join(entry['context']))
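
The test `word in sent` used in the function is a case-sensitive substring check, so a keyword like 'ink' would also match inside 'thinking'. If whole-word, case-insensitive matching is closer to what you want, a minimal sketch could look like this. The helper names build_keyword_pattern and contains_keyword are hypothetical, not part of the code above, and the sketch assumes a non-empty keyword list.

```python
import re

def build_keyword_pattern(word_list):
    # Escape each keyword so characters like '.' or '+' are matched literally
    escaped = (re.escape(word) for word in word_list)
    # \b anchors the match at word boundaries; IGNORECASE ignores letter case
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)

def contains_keyword(sentence, pattern):
    # True if any keyword occurs in the sentence as a whole word or phrase
    return pattern.search(sentence) is not None

pattern = build_keyword_pattern(['printing press', 'paper', 'ink'])
print(contains_keyword("Oil-based Ink was used.", pattern))
print(contains_keyword("He kept thinking about it.", pattern))
```

Inside the function, the condition would then become `if contains_keyword(sent, pattern):`. Note that whole-word matching is stricter about inflections: the keyword 'mold' would no longer match 'molds'.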

Key Improvements

Full Sentence Extraction: NLTK's sent_tokenize splits the text into complete sentences before any keyword matching happens, which avoids the fragment problem in most cases.
Context Support: The before_n and after_n parameters let you specify how many surrounding sentences to include.
Edge Case Handling: Key sentences at the start or end of the text are handled by clamping the window with max(0, ...) and min(len(sentences)-1, ...), so a negative start index never wraps around to the end of the list.
Clear Output: Returns structured data with the key sentence, its context, and original index for easy analysis.
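
One thing to be aware of: when two key sentences are close together, their context blocks overlap and the same sentences appear in more than one entry. If you would rather get one merged block per cluster, a possible sketch is shown below. The function merge_context_windows is a hypothetical helper, not part of the code above; it works purely on indices and treats touching windows as one block, which is an assumption about the desired output.

```python
def merge_context_windows(key_indices, n_sentences, before_n=0, after_n=0):
    """Return sorted, non-overlapping (start, end) index ranges, end inclusive."""
    merged = []
    for idx in sorted(key_indices):
        # Clamp the window to the valid sentence range
        start = max(0, idx - before_n)
        end = min(n_sentences - 1, idx + after_n)
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous window: extend that window
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Key sentences at indices 1, 2 and 6 in a text of 8 sentences
windows = merge_context_windows([1, 2, 6], n_sentences=8, before_n=1, after_n=1)
print(windows)
```

Each (start, end) pair can then be turned into text with ' '.join(sentences[start:end + 1]).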

The question discussed here comes from Stack Exchange; it was asked by sos.cott.