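Technical Q&A: Extracting MCQs and Fill-in-the-Blank Questions from docx Files with Python in a Django Project and Saving Them to the Database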

Ahua AIGC Lab (阿华AIGC实验室)

2026-5-26

How to Parse MCQs & Fill-in-the-Blanks from Docx Files in Django

Let's break down how to solve this problem. I've built similar quiz-parsing systems before, so here are practical steps you can take:

1. Ditch textract for `python-docx` (it keeps critical formatting data)

Textract is convenient for quick text pulls, but it strips out the formatting metadata you need to tell questions apart from options or formulas. python-docx exposes each paragraph's style, indentation, and run-level font attributes (bold, size, underline), and list numbering is reachable through the underlying XML. That is exactly the information you need to categorize content accurately.

Here’s a quick starter snippet to get you going:

from docx import Document

# Document() accepts a file path or a file-like object (such as an uploaded file)
doc = Document("user_uploaded_quiz.docx")

# Track the current question so options can be linked to it
current_question = None

for para in doc.paragraphs:
    text = para.text.strip()
    if not text:
        continue  # Skip empty paragraphs

    # Check for numbered, bolded questions (common quiz formatting).
    # run.bold is None when bold is inherited from the style, so this
    # only catches bold applied directly to a run.
    if para.style.name.startswith("List Number") and any(run.bold for run in para.runs):
        current_question = text
        print(f"Question: {text}")
        # Add logic to save this as a Question object in your Django DB

    # Check for MCQ options (usually start with A./B./C./D.)
    elif text.startswith(('A.', 'B.', 'C.', 'D.', 'a.', 'b.', 'c.', 'd.')) and current_question:
        print(f"Option for '{current_question}': {text}")
        # Link this option to the current question in your DB

    # Check for fill-in-the-blanks (three or more underscores, or empty brackets/parentheses)
    elif '___' in text or '【】' in text or '()' in text:
        print(f"Fill-in-the-blank: {text}")
        # Save as a FillBlank object in your DB
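
To make the "save to your DB" comments above more concrete, here is a minimal sketch of how the classified lines could be grouped into records before anything is written to Django models. The `ParsedQuestion` dataclass, the `group_lines` helper, and the `(kind, text)` tuples are hypothetical names introduced for illustration; no real model from your project is assumed, so adapt the fields to your own schema.

```python
from dataclasses import dataclass, field


@dataclass
class ParsedQuestion:
    """Hypothetical in-memory record; map it onto your own Django models."""
    text: str
    kind: str = "mcq"  # "mcq" or "fill_blank"
    options: list = field(default_factory=list)


def group_lines(classified):
    """Group (kind, text) tuples into ParsedQuestion records.

    `classified` is assumed to come from a loop like the one above, with
    kind being "question", "option", or "fill_blank".
    """
    questions = []
    for kind, text in classified:
        if kind == "question":
            questions.append(ParsedQuestion(text=text))
        elif kind == "fill_blank":
            questions.append(ParsedQuestion(text=text, kind="fill_blank"))
        elif kind == "option" and questions and questions[-1].kind == "mcq":
            # Attach the option to the most recent MCQ
            questions[-1].options.append(text)
        # Options that appear before any MCQ are ignored
    return questions


# Hand-written sample input for illustration
sample = [
    ("question", "1. Which keyword defines a function in Python?"),
    ("option", "A. func"),
    ("option", "B. def"),
    ("fill_blank", "2. Django's ORM maps models to database ____."),
]
parsed = group_lines(sample)
for q in parsed:
    print(q.kind, q.text, q.options)
```

Once the records look right, they could be saved inside a single `transaction.atomic()` block so that a half-parsed upload does not leave orphaned options in the database.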

2. Create custom parsing rules tailored to your docx format

Every quiz docx has its own style, so tweak these rules to match your users' uploads. Here are common patterns to target:

Questions:
- Start with numeric numbering (e.g., 1., (1), ①)
- Use bold text or a larger font size than options
- Have no indent (options often have a small indent)
MCQ Options:
- Start with letters + dots/parentheses (e.g., A., (B))
- Have a consistent left indent (e.g., 0.5 inches)
- Use regular (non-bold) font
Fill-in-the-blanks:
- Contain long underlines (______), empty brackets (【】), or empty parentheses ()
- May include prompt phrases like "Please fill in:"
Formulas:
- Docx formulas are usually stored as OMML (Office Math Markup Language) elements. python-docx has no high-level API for them, but you can detect them through each paragraph's underlying XML (look for m:oMath tags), then convert them to a storable format such as LaTeX, for example with the omml2latex library. For image-based formulas, use OCR tools (like pytesseract) to extract text, keeping in mind that OCR accuracy on mathematical notation is often limited, or store the image binary directly in your DB.
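
The formula detection described above can be sketched with the standard library alone. This is an illustration under assumptions, not a complete solution: it works on a paragraph's raw XML string (with python-docx that XML is usually reachable through the private `para._element` attribute, which is an implementation detail), the `contains_omml` and `math_text` helpers are hypothetical names, and the sample XML is hand-written. The two namespace URIs are the ones defined by the Office Open XML standard.

```python
import xml.etree.ElementTree as ET

# Namespace URIs defined by the Office Open XML standard
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"


def contains_omml(paragraph_xml: str) -> bool:
    """Return True if the paragraph XML holds at least one <m:oMath> element."""
    root = ET.fromstring(paragraph_xml)
    return root.find(f".//{{{M_NS}}}oMath") is not None


def math_text(paragraph_xml: str) -> str:
    """Concatenate the raw text of all math runs (<m:t>); no LaTeX conversion."""
    root = ET.fromstring(paragraph_xml)
    return "".join(t.text or "" for t in root.iter(f"{{{M_NS}}}t"))


# Hand-written sample paragraphs for illustration (not from a real upload)
sample_xml = (
    f'<w:p xmlns:w="{W_NS}" xmlns:m="{M_NS}">'
    "<w:r><w:t>Solve: </w:t></w:r>"
    "<m:oMath><m:r><m:t>x+1=2</m:t></m:r></m:oMath>"
    "</w:p>"
)
plain_xml = f'<w:p xmlns:w="{W_NS}"><w:r><w:t>No formula here</w:t></w:r></w:p>'

print(contains_omml(sample_xml))  # True
print(math_text(sample_xml))      # x+1=2
print(contains_omml(plain_xml))   # False
```

The flat text from `math_text` loses structure such as fractions and superscripts, so it is only useful as a quick check or search key; a proper OMML-to-LaTeX or OMML-to-MathML conversion is still needed before storing a formula for display.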

3. Use regex as a fallback for messy formats

If some users upload docs with inconsistent formatting, regex can catch edge cases:

import re

# Regex patterns; tweak as needed
question_re = re.compile(r"^\d+\.\s+.+$|^\(\d+\)\s+.+$")  # Matches "1. ..." or "(1) ..."
option_re = re.compile(r"^[A-Da-d]\.\s+.+$|^\([A-Da-d]\)\s+.+$")  # Matches "A. ..." or "(A) ..."
blank_re = re.compile(r"_{3,}|【】")  # Matches 3+ underscores or empty brackets anywhere in the line

# `doc` is the Document object loaded in the first snippet
for para in doc.paragraphs:
    text = para.text.strip()
    if not text:
        continue
    if question_re.match(text):
        print(f"Question (regex match): {text}")
    elif option_re.match(text):
        print(f"Option (regex match): {text}")
    elif blank_re.search(text):  # search(), because blanks can appear mid-sentence
        print(f"Fill-in-the-blank (regex match): {text}")
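
One way to combine the two approaches is to trust style metadata when it is present and fall back to regex otherwise. The sketch below is a plain function that takes the values you would read from python-docx (`para.text`, `para.style.name`, and whether any run is bold) as ordinary arguments, so it can be tried without a docx file. The `classify` name, its label strings, and the slightly widened patterns (which also accept "1)" and "A)") are assumptions made for illustration.

```python
import re

QUESTION_RE = re.compile(r"^(?:\d+[.)]|\(\d+\))\s+\S")
OPTION_RE = re.compile(r"^(?:[A-Da-d][.)]|\([A-Da-d]\))\s+\S")
BLANK_RE = re.compile(r"_{3,}|【\s*】")


def classify(text, style_name="", is_bold=False):
    """Label one paragraph, trusting style metadata first and regex second."""
    text = text.strip()
    if not text:
        return "empty"
    # Style-based rule: numbered-list style plus bold text
    if style_name.startswith("List Number") and is_bold:
        return "fill_blank" if BLANK_RE.search(text) else "question"
    # Regex fallback for documents without consistent styles
    if OPTION_RE.match(text):
        return "option"
    if QUESTION_RE.match(text):
        return "fill_blank" if BLANK_RE.search(text) else "question"
    if BLANK_RE.search(text):
        return "fill_blank"
    return "other"


print(classify("What is 2 + 2?", "List Number", True))     # question
print(classify("A. 4"))                                    # option
print(classify("2) The capital of France is ____."))       # fill_blank
print(classify("Read the passage carefully."))             # other
```

Returning a label instead of printing keeps the rules in one place, which makes them easier to adjust when a new document layout shows up.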

4. Test with real docs and iterate

Grab a few sample docs your users will upload, test your parsing logic, and adjust rules as needed. For example, if some questions use 1) instead of 1., update your regex or style checks to account for that. Small tweaks will make your parser way more reliable.
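
The iteration loop described above can be kept honest with a small list of sample lines and their expected results. In this sketch the sample strings are hand-written stand-ins for lines copied from real uploads, and `failures` is a hypothetical helper; the second pattern shows the "1)" adjustment mentioned in the paragraph.

```python
import re

# Version 1 only accepts "1. ..." and "(1) ..."
question_v1 = re.compile(r"^\d+\.\s+.+$|^\(\d+\)\s+.+$")
# Version 2 also accepts "1) ..." after a failing sample exposed the gap
question_v2 = re.compile(r"^\d+[.)]\s+.+$|^\(\d+\)\s+.+$")

# (line, should it be recognized as a question?)
samples = [
    ("1. What is the capital of France?", True),
    ("(2) Name the largest planet.", True),
    ("3) Which gas do plants absorb?", True),
    ("A. Paris", False),
]


def failures(pattern, cases):
    """Return the sample lines whose match result differs from the expectation."""
    return [text for text, expected in cases
            if bool(pattern.match(text)) != expected]


print(failures(question_v1, samples))  # ['3) Which gas do plants absorb?']
print(failures(question_v2, samples))  # []
```

Adding a line to `samples` every time a user upload breaks the parser turns each fix into a permanent regression check.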

The question discussed here comes from Stack Exchange and was asked by Nikhil Sardana.