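Technical question: using Python in a Django project to extract MCQ and fill-in-the-blank questions from a docx file and store them in the database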
Got it, let's break down how to solve this problem. I've built similar quiz parsing systems before, so here are practical, actionable steps you can take:
1. Ditch textract for python-docx (it keeps critical formatting data)
Textract is convenient for quick text pulls, but it strips out the formatting metadata you need to tell questions apart from options or formulas. python-docx exposes each paragraph's style, indentation, and run-level font attributes (numbering details live in the paragraph's underlying XML), which is exactly what you need to categorize content accurately.
Here’s a quick starter snippet to get you going:

    from docx import Document

    doc = Document("user_uploaded_quiz.docx")

    # Track the current question so options can be linked to it
    current_question = None

    for para in doc.paragraphs:
        text = para.text.strip()
        if not text:
            continue  # Skip empty lines

        # Numbered, bolded questions (common quiz formatting)
        if para.style.name.startswith("List Number") and any(run.bold for run in para.runs):
            current_question = text
            print(f"Question: {text}")
            # Add logic to save this as a Question object in your Django DB

        # MCQ options (usually start with A./B./C./D.)
        elif text.startswith(('A.', 'B.', 'C.', 'D.', 'a.', 'b.', 'c.', 'd.')) and current_question:
            print(f"Option for '{current_question}': {text}")
            # Link this option to the current question in your DB

        # Fill-in-the-blanks (look for underlines or placeholders)
        elif '_' * 3 in text or '【】' in text or '()' in text:
            print(f"Fill-in-the-blank: {text}")
            # Save as a FillBlank object in your DB

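Before writing anything to the database, it can help to group the classified paragraphs into plain records first. Here is a minimal sketch of that step using only the standard library. It works on a list of paragraph strings (for example `[p.text for p in doc.paragraphs]`), and the names `ParsedQuestion`, `group_paragraphs`, and the three regexes are hypothetical, not part of python-docx or Django:

```python
import re
from dataclasses import dataclass, field


# Hypothetical container; in Django this would map onto your own models.
@dataclass
class ParsedQuestion:
    text: str
    kind: str = "mcq"  # "mcq" or "fill_blank"
    options: list = field(default_factory=list)


# Assumed patterns; adjust them to the formats your users upload.
QUESTION_RE = re.compile(r"^\(?\d+[.)]\s*(.+)$")        # 1. / 1) / (1)
OPTION_RE = re.compile(r"^\(?([A-Da-d])[.)]\s*(.+)$")   # A. / a) / (B)
BLANK_RE = re.compile(r"_{3,}|【\s*】")                  # ___ or 【】


def group_paragraphs(lines):
    """Group plain paragraph strings into questions with their options."""
    questions = []
    for raw in lines:
        text = raw.strip()
        if not text:
            continue
        question = QUESTION_RE.match(text)
        option = OPTION_RE.match(text)
        if question:
            kind = "fill_blank" if BLANK_RE.search(text) else "mcq"
            questions.append(ParsedQuestion(text=question.group(1), kind=kind))
        elif option and questions:
            # Attach the option to the most recent question
            label, body = option.group(1).upper(), option.group(2)
            questions[-1].options.append((label, body))
    return questions


sample = [
    "1. What is 2 + 2?",
    "A. 3",
    "B. 4",
    "",
    "2. The capital of France is ______.",
]
for parsed in group_paragraphs(sample):
    print(parsed.kind, parsed.text, parsed.options)
```

Once the records look right, each one can be turned into a row of whatever question and option models your project defines, which keeps the parsing logic testable without a database.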
2. Create custom parsing rules tailored to your docx format
Every quiz docx has its own style, so tweak these rules to match your users' uploads. Here are common patterns to target:
- Questions:
  - Start with numeric numbering (e.g., 1., (1), ①)
  - Use bold text or larger font size than options
  - Have no indent (options often have a small indent)
- MCQ Options:
  - Start with letters + dots/parentheses (e.g., A., (B))
  - Have consistent left indent (e.g., 0.5 inches)
  - Use regular (non-bold) font
- Fill-in-the-blanks:
  - Contain long underlines (______), empty brackets (【】), or parentheses ()
  - May include phrases like "Please fill in: " or "_____"
- Formulas:
  - Docx formulas are usually stored as OMML objects. Use python-docx to detect these via XML tags, then convert them to a storable format (like LaTeX with the omml2latex library). For image-based formulas, use OCR tools (like pytesseract) to extract text, or store the image binary directly in your DB.
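To illustrate the formula point, here is a rough sketch of spotting OMML by its XML tags using only the standard library. It assumes you already have the contents of `word/document.xml` (a .docx is a zip archive, so `zipfile.ZipFile(path).read("word/document.xml")` would provide it). The function name `find_formula_paragraphs` and the sample XML are made up for the example, and it only detects formulas and pulls out their raw characters; it does not convert anything to LaTeX:

```python
import xml.etree.ElementTree as ET

# Namespaces used by WordprocessingML and Office Math Markup (OMML)
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"


def find_formula_paragraphs(document_xml):
    """Return (paragraph_index, math_text) for each paragraph containing OMML."""
    root = ET.fromstring(document_xml)
    hits = []
    for index, para in enumerate(root.iter(f"{{{W_NS}}}p")):
        math_nodes = list(para.iter(f"{{{M_NS}}}oMath"))
        if math_nodes:
            # m:t elements hold the literal characters of the formula
            pieces = [
                t.text or ""
                for node in math_nodes
                for t in node.iter(f"{{{M_NS}}}t")
            ]
            hits.append((index, "".join(pieces)))
    return hits


# Hand-written stand-in for word/document.xml, just for demonstration
SAMPLE_XML = (
    f'<w:document xmlns:w="{W_NS}" xmlns:m="{M_NS}"><w:body>'
    "<w:p><w:r><w:t>1. Solve for x:</w:t></w:r></w:p>"
    "<w:p><m:oMath><m:r><m:t>x+1=2</m:t></m:r></m:oMath></w:p>"
    "</w:body></w:document>"
)

print(find_formula_paragraphs(SAMPLE_XML))
```

The joined `m:t` text loses structure such as fractions and superscripts, so treat it as a detection aid and keep the original OMML fragment if you plan to convert it later.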
3. Use regex as a fallback for messy formats
If some users upload docs with inconsistent formatting, regex can catch edge cases:

    import re

    from docx import Document

    doc = Document("user_uploaded_quiz.docx")

    # Regex patterns; tweak as needed
    question_re = re.compile(r"^\d+\.\s+.+$|^\(\d+\)\s+.+$")  # Matches 1. ... or (1) ...
    option_re = re.compile(r"^[A-Da-d]\.\s+.+$|^\([A-Da-d]\)\s+.+$")  # Matches A. ... or (A) ...
    blank_re = re.compile(r"_{3,}|【】")  # Matches 3+ underscores or empty brackets

    for para in doc.paragraphs:
        text = para.text.strip()
        if not text:
            continue
        if question_re.match(text):
            print(f"Question (regex match): {text}")
        elif option_re.match(text):
            print(f"Option (regex match): {text}")
        elif blank_re.search(text):  # search: the blank can appear anywhere in the line
            print(f"Fill-in-the-blank (regex match): {text}")

4. Test with real docs and iterate
Grab a few sample docs your users will upload, test your parsing logic, and adjust rules as needed. For example, if some questions use 1) instead of 1., update your regex or style checks to account for that. Small tweaks will make your parser way more reliable.
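One lightweight way to iterate is to keep a small table of sample lines with the result you expect and rerun it after every rule change. The sketch below is only an illustration: `QUESTION_RE` is an assumed extension of the question pattern that also accepts `1)` and circled digits, and `CASES` and `run_cases` are hypothetical names.

```python
import re

# Assumed extension of the question pattern: 1. / 1) / (1) / circled digits
QUESTION_RE = re.compile(r"^(?:\d+[.)]\s+|\(\d+\)\s*|[①-⑳]\s*)\S")

# (sample line, should it be treated as a question?)
CASES = [
    ("1. What is Django?", True),
    ("1) What is Django?", True),
    ("(1) What is Django?", True),
    ("① What is Django?", True),
    ("A. A web framework", False),
    ("3.14 is pi", False),
    ("Plain paragraph text", False),
]


def run_cases(pattern, cases):
    """Return the cases whose match result differs from the expectation."""
    return [
        (text, expected)
        for text, expected in cases
        if bool(pattern.match(text)) != expected
    ]


failures = run_cases(QUESTION_RE, CASES)
print("failures:", failures)
```

Whenever a user's document breaks the parser, add the offending line to the table first, then adjust the pattern until the failure list is empty again.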
The question comes from Stack Exchange; it was asked by Nikhil Sardana.




