
How can I detect whether questions in a DOCX document are mentioned again later along with related information?

How to Identify Questions with Additional Details in Your DOCX Document

Here's a practical, code-based approach to this problem using Python, which is flexible for text processing and has good libraries for handling DOCX files:

Step 1: Extract Plain Text from the DOCX File

First, we need to convert the DOCX into plain text so we can easily parse questions and their context. Use the python-docx library for this:

from docx import Document

def get_docx_text(doc_path):
    doc = Document(doc_path)
    return "\n".join([para.text for para in doc.paragraphs])

This function reads the body paragraphs of the DOCX (text inside tables, headers, and footers is not included) and returns them as a single string, one paragraph per line.
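If installing python-docx is not an option, the same kind of extraction can be roughly sketched with the standard library, since a DOCX file is a ZIP archive whose main text lives in word/document.xml. The function name get_docx_text_stdlib is hypothetical, and the sketch assumes a simple document; unlike doc.paragraphs, it also picks up paragraphs that sit inside tables.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by WordprocessingML elements such as w:p and w:t
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def get_docx_text_stdlib(doc_path):
    """Return the text of every w:p paragraph in word/document.xml, one per line."""
    with zipfile.ZipFile(doc_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    lines = []
    for para in root.iter(W_NS + "p"):
        # Join the w:t text runs that make up this paragraph
        lines.append("".join(node.text or "" for node in para.iter(W_NS + "t")))
    return "\n".join(lines)
```

The result has the same shape as the string returned by get_docx_text, so the later steps can consume either one.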

Step 2: Identify All Questions in the Document

Next, we need to extract every question in the text. Since your questions end with ? (the full-width Chinese question mark), we can scan each line for text that ends in a question mark and collect the matches. The pattern below also accepts the ASCII ?:

import re

def extract_all_questions(text):
    # A question is a run of text on one line that ends with ? or ?
    pattern = re.compile(r"[^\n??]*[??]")
    questions = []
    for match in pattern.finditer(text):
        question = match.group().strip()
        if len(question) > 1:  # Skip a bare question mark
            questions.append(question)
    return questions

This gives us a list of every question in the document, in the order they appear.
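Because the list keeps duplicates, it can already tell you which questions are mentioned more than once. A minimal sketch, assuming the list holds exact question strings (the helper name find_repeated_questions and the sample list are made up for illustration):

```python
from collections import Counter

def find_repeated_questions(questions):
    """Return the questions that occur more than once, in first-seen order."""
    counts = Counter(questions)
    # Counter keeps insertion order (Python 3.7+), so first-seen order is preserved
    return [q for q, n in counts.items() if n > 1]

sample = ["问题1?", "问题2?", "问题1?"]
print(find_repeated_questions(sample))
```

Only the questions returned here can have details attached to a later mention, so this is a cheap pre-filter before the heavier check in the next step.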

Step 3: Track Question Occurrences and Check for Additional Info

Now, we’ll go through the list of questions and check if each one appears again later, followed by non-question content (your "additional details"). We’ll use a dictionary to track results:

def check_questions_for_details(text, questions):
    # Start with all questions marked as having no details
    question_details = {q: False for q in questions}

    for question in question_details:
        # Locate the first occurrence, then examine every later one
        first = text.find(question)
        if first == -1:
            continue
        pos = text.find(question, first + len(question))
        while pos != -1:
            question_end = pos + len(question)
            # Take the first non-empty line that follows this occurrence
            following = text[question_end:].strip().split("\n", 1)[0].strip()
            # Count it as details unless it looks like another question (adjust logic as needed)
            if following and not following.startswith("问题") and not following.endswith(("?", "?")):
                question_details[question] = True
                break  # Stop checking once we find valid details
            pos = text.find(question, question_end)
    return question_details
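If you also want the detail text itself rather than a Yes/No flag, one possible variation works paragraph by paragraph. This is a sketch under stated assumptions, not part of the approach above: the name collect_details is hypothetical, a question is taken to be any paragraph ending in ? or ?, and the document is assumed to list each question once before repeating it with its details.

```python
QUESTION_MARKS = ("?", "?")  # full-width and ASCII question marks

def collect_details(paragraphs):
    """Map each question to the non-question paragraphs that follow its repeats."""
    details = {}    # question -> list of detail paragraphs
    current = None  # repeated question that is currently collecting details
    for para in paragraphs:
        para = para.strip()
        if not para:
            continue  # ignore empty paragraphs
        if para.endswith(QUESTION_MARKS):
            if para in details:
                current = para      # repeat: start collecting what follows
            else:
                details[para] = []  # first mention: just register it
                current = None
        elif current is not None:
            details[current].append(para)
    return details
```

You could feed it the string from Step 1 split on "\n". A question whose list stays empty has no additional info.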

Step 4: Put It All Together

Run the functions and get your results:

doc_path = "your_document.docx"
full_text = get_docx_text(doc_path)
all_questions = extract_all_questions(full_text)
results = check_questions_for_details(full_text, all_questions)

# Print the final results
for question, has_details in results.items():
    print(f"Question: {question} | Has additional info: {'Yes' if has_details else 'No'}")

Customization Tips

  • Adjust Question Detection: If your questions use mixed Chinese/English question marks, make sure the question-mark characters used in extract_all_questions cover both ? and ?.
  • Refine Detail Check: If your additional info might include question marks (unlikely, but possible), tweak the logic. For example, check whether the next paragraph is longer than a typical question, or doesn’t follow your question naming pattern (like "问题X?").
  • Handle Duplicate Initial Questions: If the initial block has duplicate questions, use a set to get unique questions first, then check each unique one.
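The first and third tips can be sketched together. The helpers below are illustrative only (normalize_question and unique_questions are made-up names): they unify the trailing question mark, collapse whitespace, and drop duplicates while keeping the original order.

```python
import re

def normalize_question(question):
    """Collapse whitespace and convert a trailing ASCII '?' to full-width '?'."""
    question = re.sub(r"\s+", " ", question).strip()
    if question.endswith("?"):
        question = question[:-1] + "?"
    return question

def unique_questions(questions):
    """Normalize every question and drop duplicates, keeping first-seen order."""
    # dict.fromkeys keeps insertion order, unlike a plain set
    return list(dict.fromkeys(normalize_question(q) for q in questions))
```

dict.fromkeys is used instead of a set because a set would lose the order in which the questions appear. Keep in mind that normalized strings are meant for comparing questions with each other; searching the raw text for them may fail if the original used a different question mark or spacing.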

This approach works for any set of questions in your DOCX, provided each repeated question is worded exactly as in its first appearance, and it will tell you which questions have additional details attached later in the document.

The question comes from Stack Exchange; it was asked by kaushik.
