How to process movie script PDF/TXT files and extract the protagonists' lines to build a corpus for gender representation analysis
Hey there! Let's tackle your movie script corpus project for gender representation analysis, which is a compelling use case. Below is a practical, step-by-step approach to processing both .txt and .pdf scripts for your two core goals: separating dialogue from scene descriptions, and extracting your protagonists' lines (leveraging the fact that all character names are uppercase).
Text-based scripts are straightforward to parse since we can work directly with line breaks and pattern matching.
1. Separate Dialogue from Scene Descriptions
Scripts formatted this way follow a fairly consistent structure: scene descriptions are regular prose (not all-uppercase), character names are fully uppercase (assumed here to end with a colon, which the regex below requires), and their dialogue follows on subsequent lines. We'll use a regex to identify character lines, then split the content accordingly.
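Before parsing a whole script, it can help to check what the character-name pattern accepts and rejects. The snippet below is a small sketch based on the pattern used in this answer; `is_character_line` and the sample lines are made up for illustration.

```python
import re

# Based on the pattern used in this answer: uppercase words, an optional
# parenthetical such as (V.O.), and a required trailing colon.
CHARACTER_PATTERN = re.compile(
    r'^[A-Z]+(?:\s[A-Z]+)*?(?:\s\([A-Z\s.]+\))?:\s*$'
)

def is_character_line(line):
    """Return True if the stripped line looks like a speaker cue."""
    return bool(CHARACTER_PATTERN.match(line.strip()))

# Hypothetical sample lines, not taken from any real script
samples = [
    "ELENA:",
    "JAMES (V.O.):",
    "ELENA",
    "INT. KITCHEN - DAY",
    "Elena walks in.",
    "CUT TO:",
]
for sample in samples:
    print(f"{sample!r:25} -> {is_character_line(sample)}")
```

Two things worth noticing when you run it: a bare "ELENA" without a colon is rejected, and a transition like "CUT TO:" is accepted as if it were a speaker. If your scripts contain transitions, you will probably want to filter those names out afterwards.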
2. Extract Protagonists' Lines
Once we’ve parsed all dialogue, filtering for your target protagonists is just a matter of matching their uppercase names against the parsed speaker entries.
Python Implementation Example
import re

def parse_txt_script(file_path, target_protagonists):
    # Read the entire script
    with open(file_path, 'r', encoding='utf-8') as script_file:
        script_lines = script_file.readlines()

    scene_content = []
    all_dialogue = []
    current_speaker = None
    current_dialogue_lines = []

    # Match an uppercase character name followed by a colon. An optional
    # parenthetical such as (V.O.) is allowed but kept out of the captured
    # name, so "ELENA (V.O.):" is stored as "ELENA".
    character_pattern = re.compile(
        r'^(?P<name>[A-Z]+(?:\s[A-Z]+)*)(?:\s\([A-Z\s.]+\))?:\s*$'
    )

    for line in script_lines:
        cleaned_line = line.strip()
        if not cleaned_line:
            continue  # Skip empty lines

        # Check if this line is a character name
        match = character_pattern.match(cleaned_line)
        if match:
            # Save any existing dialogue from the previous speaker
            if current_speaker and current_dialogue_lines:
                all_dialogue.append({
                    'speaker': current_speaker,
                    'lines': '\n'.join(current_dialogue_lines)
                })
            current_dialogue_lines = []
            current_speaker = match.group('name')
        elif current_speaker:
            # If we have an active speaker, this line is dialogue
            current_dialogue_lines.append(cleaned_line)
        else:
            # Otherwise, it's part of the scene description
            scene_content.append(cleaned_line)

    # Add the final dialogue entry if it exists
    if current_speaker and current_dialogue_lines:
        all_dialogue.append({
            'speaker': current_speaker,
            'lines': '\n'.join(current_dialogue_lines)
        })

    # Filter dialogue to only include protagonists
    protag_dialogue = [
        entry for entry in all_dialogue
        if entry['speaker'] in target_protagonists
    ]

    return {
        'scene_descriptions': '\n'.join(scene_content),
        'full_dialogue': all_dialogue,
        'protagonist_dialogue': protag_dialogue
    }

# How to use this function
script_data = parse_txt_script('your_script.txt', ['ELENA', 'JAMES'])

# Print out protagonist lines for quick verification
for line_entry in script_data['protagonist_dialogue']:
    print(f"{line_entry['speaker']}: {line_entry['lines']}")
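One limitation of the parser above: once a speaker is active, every following non-empty line is treated as dialogue until the next character name, so a scene description that comes after a speech gets attached to that speech. If your scripts put a blank line between a speech and the action that follows (an assumption you should check against your own files), a variant like this can treat the blank line as the end of the speech. `parse_script_lines` is a hypothetical helper name, and this is a sketch rather than a drop-in replacement.

```python
import re

CHARACTER_PATTERN = re.compile(
    r'^(?P<name>[A-Z]+(?:\s[A-Z]+)*)(?:\s\([A-Z\s.]+\))?:\s*$'
)

def parse_script_lines(lines):
    """Split lines into (scene_lines, dialogue_entries).

    Assumes a blank line after some dialogue ends the current speech,
    so prose after that blank line counts as scene description again.
    """
    scene_lines = []
    dialogue = []
    speaker = None
    speech = []

    def flush():
        # Store the speech collected so far, if any, and reset the state.
        nonlocal speaker, speech
        if speaker and speech:
            dialogue.append({'speaker': speaker, 'lines': '\n'.join(speech)})
        speaker, speech = None, []

    for raw in lines:
        line = raw.strip()
        if not line:
            # Only end the speech if dialogue was collected; a blank line
            # directly after the character name is ignored.
            if speech:
                flush()
            continue
        match = CHARACTER_PATTERN.match(line)
        if match:
            flush()
            speaker = match.group('name')
        elif speaker:
            speech.append(line)
        else:
            scene_lines.append(line)
    flush()
    return scene_lines, dialogue
```

The trade-off is that a speech containing its own blank lines (several paragraphs) is cut after the first paragraph, with the rest landing in the scene descriptions. Which behavior is less wrong depends on how your scripts are laid out, so it is worth spot-checking a few pages either way.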
PDFs require an extra step: extracting the text while preserving layout as much as possible. pdfplumber is a good tool for this, and it tends to retain line breaks and structure better than older libraries like PyPDF2.
1. Convert PDF to Structured Text
First, extract the text from each page of the PDF. Then, process the extracted text using the same logic as we did for .txt scripts.
2. Extract Dialogue & Protagonist Lines
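Reuse the regex and parsing logic from the .txt workflow. Once the PDF is converted to text, the structure should be close enough for the same pattern to apply, though extracted text can carry extra whitespace and page artifacts.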
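Text extracted from a PDF often carries page furniture that a .txt script lacks. The sketch below assumes two kinds, bare page numbers and CONTINUED markers; your scripts may have others or none at all. `clean_pdf_lines` is a hypothetical helper that takes one string per page and returns cleaned lines ready for the same parsing loop.

```python
import re

# Assumed page furniture: bare page numbers like "12" or "12.",
# and markers like "(CONTINUED)" or "CONTINUED:".
PAGE_NUMBER = re.compile(r'^\d+\.?$')
CONTINUED = re.compile(r'^\(?CONTINUED\)?:?$')

def clean_pdf_lines(page_texts):
    """Join per-page text into one list of lines, dropping page furniture.

    page_texts: list of strings, one per page (empty or None entries
    are skipped).
    """
    cleaned = []
    for page_text in page_texts:
        if not page_text:
            continue
        for raw in page_text.split('\n'):
            line = raw.strip()
            if PAGE_NUMBER.match(line) or CONTINUED.match(line):
                continue
            cleaned.append(line)
        cleaned.append('')  # keep a break between pages
    return cleaned
```

With pdfplumber, the input could be built as `[page.extract_text() for page in pdf.pages]`. Note that "CONTINUED:" would otherwise match the character-name pattern and show up as a speaker, and that a line of dialogue consisting only of a number would be dropped by this filter.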
Python Implementation Example
import pdfplumber
import re

def parse_pdf_script(file_path, target_protagonists):
    # Extract text from the PDF, one page at a time
    page_texts = []
    with pdfplumber.open(file_path) as pdf:
        for page in pdf.pages:
            # layout=True asks pdfplumber to approximate the page layout,
            # which helps keep the script's line structure
            page_text = page.extract_text(layout=True)
            if page_text:
                page_texts.append(page_text)

    # Join pages with a newline so the last line of one page does not
    # merge with the first line of the next, then split into lines
    script_lines = '\n'.join(page_texts).split('\n')

    scene_content = []
    all_dialogue = []
    current_speaker = None
    current_dialogue_lines = []

    # Same pattern as the .txt workflow: the optional parenthetical such
    # as (V.O.) is kept out of the captured name
    character_pattern = re.compile(
        r'^(?P<name>[A-Z]+(?:\s[A-Z]+)*)(?:\s\([A-Z\s.]+\))?:\s*$'
    )

    for line in script_lines:
        cleaned_line = line.strip()
        if not cleaned_line:
            continue  # Skip empty lines

        match = character_pattern.match(cleaned_line)
        if match:
            # Save any existing dialogue from the previous speaker
            if current_speaker and current_dialogue_lines:
                all_dialogue.append({
                    'speaker': current_speaker,
                    'lines': '\n'.join(current_dialogue_lines)
                })
            current_dialogue_lines = []
            current_speaker = match.group('name')
        elif current_speaker:
            current_dialogue_lines.append(cleaned_line)
        else:
            scene_content.append(cleaned_line)

    # Add the final dialogue entry if it exists
    if current_speaker and current_dialogue_lines:
        all_dialogue.append({
            'speaker': current_speaker,
            'lines': '\n'.join(current_dialogue_lines)
        })

    # Filter dialogue to only include protagonists
    protag_dialogue = [
        entry for entry in all_dialogue
        if entry['speaker'] in target_protagonists
    ]

    return {
        'scene_descriptions': '\n'.join(scene_content),
        'full_dialogue': all_dialogue,
        'protagonist_dialogue': protag_dialogue
    }

# How to use this function
script_data = parse_pdf_script('your_script.pdf', ['ELENA', 'JAMES'])
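- Encoding Issues: If your .txt script throws an encoding error, try replacing encoding='utf-8' with encoding='latin-1' (common for older scripts).
- Odd Formatting: Some indie or older scripts might have character names indented. The code above strips each line before matching, so indentation alone won't break it; if you match against unstripped lines instead, adjust the regex to allow optional leading spaces: r'^\s*[A-Z]+(?:\s[A-Z]+)*?(?:\s\([A-Z\s.]+\))?:\s*$'.
- Long Dialogue Blocks: If dialogue spans multiple paragraphs, the code above will capture all lines until the next character name, which should work for most scripts. The flip side is that a scene description following a speech is also attached to that speech, so spot-check the output.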
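If you'd rather not edit the encoding by hand for each file, the fallback from the first bullet can be wrapped in a small helper. This is a minimal sketch; `read_script_lines` is a hypothetical name, and the encoding list is an assumption you can extend.

```python
def read_script_lines(file_path, encodings=('utf-8', 'latin-1')):
    """Return the file's lines, trying each encoding in order."""
    last_error = None
    for encoding in encodings:
        try:
            with open(file_path, 'r', encoding=encoding) as script_file:
                return script_file.read().splitlines()
        except UnicodeDecodeError as error:
            last_error = error  # remember the failure, try the next one
    if last_error is None:
        raise ValueError("no encodings were given")
    raise last_error
```

Keep 'latin-1' last in the list: it can decode any byte sequence, so it never raises, but it may silently produce wrong characters if the file was really in another encoding.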
Let me know if you hit snags with specific scripts. We can tweak the regex or parsing logic to handle edge cases!
The question originally comes from Stack Exchange, asked by Mageek101.




