How to process movie script PDF/TXT files and extract the protagonists' lines to build a corpus for gender representation analysis

Ahua AIGC Lab

2026-5-7

Hey there! Let's tackle your movie script corpus project for gender representation analysis; it's a compelling use case. Below is a practical, step-by-step approach to processing both .txt and .pdf scripts that covers your two core goals: separating dialogue from scene descriptions, and extracting your protagonists' lines (leveraging the fact that all character names are uppercase).

Processing .txt Movie Scripts

Text-based scripts are straightforward to parse since we can work directly with line breaks and pattern matching.

1. Separate Dialogue from Scene Descriptions

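Movie scripts follow a fairly consistent structure: scene descriptions are regular prose (not all-uppercase), character names are fully uppercase, and each character's dialogue follows on the subsequent lines. The regex used here expects a colon after the name; if your scripts omit it, make the colon optional (:?) in the pattern, at the cost of more false matches. Scene headings such as INT. KITCHEN - NIGHT are uppercase too, but their punctuation keeps them from matching the character pattern. Transitions such as CUT TO: do match it, so review the parsed speaker list for entries like that. We'll use the regex to identify character lines, then split the content accordingly.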
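To make the matching rule concrete, here is a minimal sketch that runs a cue pattern over a few sample lines. The pattern is modeled on the one in the implementation further down, the sample lines are invented for illustration, and classify_line is a hypothetical helper name rather than part of any library.

```python
import re

# Cue shape: uppercase name (captured), optional (V.O.)-style note, trailing colon.
CUE = re.compile(r"^([A-Z]+(?:\s[A-Z]+)*)(?:\s\([A-Z\s.']+\))?:\s*$")

def classify_line(line):
    """Return ('cue', name) for a character cue, else ('other', stripped text)."""
    text = line.strip()
    match = CUE.match(text)
    if match:
        return ('cue', match.group(1))
    return ('other', text)

# Invented sample lines: a scene heading, action, two cues, and a dialogue line.
samples = [
    "INT. KITCHEN - NIGHT",
    "Elena pours two cups of coffee.",
    "ELENA (V.O.):",
    "    JAMES:",
    "I never said that.",
]
results = [classify_line(sample) for sample in samples]
for sample, result in zip(samples, results):
    print(f"{sample!r} -> {result}")
```

Only the two cue lines are reported as cues; the scene heading falls through because of its period and hyphen.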

2. Extract Protagonists' Lines

Once we’ve parsed all dialogue, filtering for your target protagonists is just a matter of matching their uppercase names against the parsed speaker entries.

Python Implementation Example

import re

def parse_txt_script(file_path, target_protagonists):
    # Read the entire script
    with open(file_path, 'r', encoding='utf-8') as script_file:
        script_lines = script_file.readlines()
    
    scene_content = []
    all_dialogue = []
    current_speaker = None
    current_dialogue_lines = []
    
    # Regex for a character cue: an uppercase name (captured), an optional
    # parenthetical such as (V.O.) or (CONT'D), and a trailing colon
    character_pattern = re.compile(r"^([A-Z]+(?:\s[A-Z]+)*)(?:\s\([A-Z\s.']+\))?:\s*$")
    
    for line in script_lines:
        cleaned_line = line.strip()
        if not cleaned_line:
            continue  # Skip empty lines
        
        # Check if this line is a character cue
        cue_match = character_pattern.match(cleaned_line)
        if cue_match:
            # Save any existing dialogue from the previous speaker
            if current_speaker and current_dialogue_lines:
                all_dialogue.append({
                    'speaker': current_speaker,
                    'lines': '\n'.join(current_dialogue_lines)
                })
                current_dialogue_lines = []
            # Keep only the bare name so 'ELENA (V.O.):' is stored as 'ELENA'
            # and still matches the protagonist filter below
            current_speaker = cue_match.group(1)
        else:
            # If we have an active speaker, this line is dialogue
            if current_speaker:
                current_dialogue_lines.append(cleaned_line)
            # Otherwise, it's part of the scene description
            else:
                scene_content.append(cleaned_line)
    
    # Add the final dialogue entry if it exists
    if current_speaker and current_dialogue_lines:
        all_dialogue.append({
            'speaker': current_speaker.strip(':'),
            'lines': '\n'.join(current_dialogue_lines)
        })
    
    # Filter dialogue to only include protagonists
    protag_dialogue = [entry for entry in all_dialogue if entry['speaker'].strip() in target_protagonists]
    
    return {
        'scene_descriptions': '\n'.join(scene_content),
        'full_dialogue': all_dialogue,
        'protagonist_dialogue': protag_dialogue
    }

# How to use this function
script_data = parse_txt_script('your_script.txt', ['ELENA', 'JAMES'])
# Print out protagonist lines for quick verification
for line_entry in script_data['protagonist_dialogue']:
    print(f"{line_entry['speaker']}: {line_entry['lines']}")
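Once the entries are parsed, a quick per-speaker tally is a handy sanity check before any deeper analysis. The sketch below assumes entries shaped like the ones the parser returns (dicts with 'speaker' and 'lines'); summarize_dialogue is a hypothetical helper, the sample entries are invented, and splitting on whitespace is only a rough stand-in for real tokenization.

```python
from collections import defaultdict

def summarize_dialogue(entries):
    """Count speeches and whitespace-separated words per speaker."""
    summary = defaultdict(lambda: {'speeches': 0, 'words': 0})
    for entry in entries:
        stats = summary[entry['speaker']]
        stats['speeches'] += 1
        stats['words'] += len(entry['lines'].split())
    return dict(summary)

# Invented entries in the same shape as the parser's dialogue output.
sample_entries = [
    {'speaker': 'ELENA', 'lines': 'I never said that.\nNot once.'},
    {'speaker': 'JAMES', 'lines': 'You did.'},
    {'speaker': 'ELENA', 'lines': 'Prove it.'},
]
summary = summarize_dialogue(sample_entries)
for speaker, stats in summary.items():
    print(speaker, stats)
```

Passing script_data['protagonist_dialogue'] instead of the sample list would give the same tally for a real script.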

Processing .pdf Movie Scripts

PDFs require an extra step: extracting the text while preserving the layout as much as possible. pdfplumber is a good option for this (it generally retains line breaks and structure better than older libraries like PyPDF2). Note that this only works for PDFs with a text layer; scanned scripts need OCR first.

1. Convert PDF to Structured Text

First, extract the text from each page of the PDF. Then, process the extracted text using the same logic as we did for .txt scripts.

2. Extract Dialogue & Protagonist Lines

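Reuse the regex and parsing logic from the .txt workflow. Once the PDF is converted to text, the structure should be largely the same, although page numbers and continuation markers such as (MORE) or (CONTINUED) may come through as extra lines.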
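In practice, text extracted from a PDF can carry page furniture that a .txt script lacks. Here is a minimal cleanup sketch to run before parsing. drop_page_furniture is a hypothetical helper, the marker list is an assumption about common screenplay conventions, and a dialogue line consisting only of a number would also be dropped, so adjust the patterns to your scripts.

```python
import re

# Lines that are only a page number ("12" or "12.") or a continuation marker.
PAGE_NUMBER = re.compile(r'^\d{1,3}\.?$')
CONTINUATION = re.compile(r"^\((?:MORE|CONTINUED|CONT'D)\)$|^CONTINUED:$")

def drop_page_furniture(lines):
    """Remove page-number and continuation-marker lines; keep everything else."""
    kept = []
    for line in lines:
        text = line.strip()
        if PAGE_NUMBER.match(text) or CONTINUATION.match(text):
            continue
        kept.append(line)
    return kept

# Invented sample: a speech interrupted by a page break.
extracted = [
    "ELENA:",
    "I never said that.",
    "(MORE)",
    "12.",
    "ELENA (CONT'D):",
    "Not once.",
]
cleaned = drop_page_furniture(extracted)
print(cleaned)
```

The cue line ELENA (CONT'D): is kept on purpose, since the parser needs it to attribute the rest of the speech.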

Python Implementation Example

import pdfplumber
import re

def parse_pdf_script(file_path, target_protagonists):
    # Extract text from PDF
    full_script_text = ""
    with pdfplumber.open(file_path) as pdf:
        for page in pdf.pages:
            # Extract text with layout preserved (critical for script structure)
            page_text = page.extract_text(layout=True)
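            if page_text:
                # Add a newline so the last line of one page does not
                # merge with the first line of the next page
                full_script_text += page_text + '\n'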
    
    # Split into lines and process the same way as a .txt script
    script_lines = full_script_text.split('\n')
    scene_content = []
    all_dialogue = []
    current_speaker = None
    current_dialogue_lines = []
    
    # Character cue: uppercase name (captured), optional parenthetical, trailing colon
    character_pattern = re.compile(r"^([A-Z]+(?:\s[A-Z]+)*)(?:\s\([A-Z\s.']+\))?:\s*$")
    
    for line in script_lines:
        cleaned_line = line.strip()
        if not cleaned_line:
            continue
        
        cue_match = character_pattern.match(cleaned_line)
        if cue_match:
            if current_speaker and current_dialogue_lines:
                all_dialogue.append({
                    'speaker': current_speaker,
                    'lines': '\n'.join(current_dialogue_lines)
                })
                current_dialogue_lines = []
            # Keep only the bare name so 'ELENA (V.O.):' is stored as 'ELENA'
            current_speaker = cue_match.group(1)
        else:
            if current_speaker:
                current_dialogue_lines.append(cleaned_line)
            else:
                scene_content.append(cleaned_line)
    
    if current_speaker and current_dialogue_lines:
        all_dialogue.append({
            'speaker': current_speaker.strip(':'),
            'lines': '\n'.join(current_dialogue_lines)
        })
    
    protag_dialogue = [entry for entry in all_dialogue if entry['speaker'].strip() in target_protagonists]
    
    return {
        'scene_descriptions': '\n'.join(scene_content),
        'full_dialogue': all_dialogue,
        'protagonist_dialogue': protag_dialogue
    }

# How to use this function
script_data = parse_pdf_script('your_script.pdf', ['ELENA', 'JAMES'])

Quick Tips for Edge Cases

Encoding Issues: If your .txt script throws an encoding error, try replacing encoding='utf-8' with encoding='latin-1' (common for older scripts).
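Odd Formatting: Some indie or older scripts have indented character names. The code above already strips leading and trailing whitespace from each line before matching, so indentation alone is not a problem. If you match against raw lines instead, allow optional leading spaces: r'^\s*[A-Z]+(?:\s[A-Z]+)*?(?:\s\([A-Z\s.]+\))?:\s*$'.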
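Long Dialogue Blocks: The code above captures every non-empty line until the next character name. Multi-paragraph speeches are therefore kept together, but any scene description that follows a speech is captured as dialogue too, so spot-check the extracted lines for stray action text.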
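If stray action text becomes a problem, one possible refinement is to treat a blank line as the end of a speech. This is a minimal sketch, assuming your scripts separate speeches from action with a blank line (not all do). split_script is a hypothetical name and the sample lines are invented. The trade-off is that a speech containing a blank line is cut at that point.

```python
import re

CUE = re.compile(r"^([A-Z]+(?:\s[A-Z]+)*)(?:\s\([A-Z\s.']+\))?:\s*$")

def split_script(lines):
    """Split lines into (dialogue entries, scene lines); a blank line ends a speech."""
    dialogue, scene = [], []
    speaker, speech = None, []

    def flush():
        # Store the speech collected so far (if any) and reset the state.
        nonlocal speaker, speech
        if speaker and speech:
            dialogue.append({'speaker': speaker, 'lines': '\n'.join(speech)})
        speaker, speech = None, []

    for raw in lines:
        text = raw.strip()
        if not text:
            flush()  # blank line: the current speech is over
            continue
        cue = CUE.match(text)
        if cue:
            flush()
            speaker = cue.group(1)
        elif speaker:
            speech.append(text)
        else:
            scene.append(text)
    flush()
    return dialogue, scene

sample = [
    "INT. KITCHEN - NIGHT", "",
    "ELENA:", "I never said that.", "Not once.", "",
    "James slams the door.", "",
    "JAMES (O.S.):", "You did.",
]
dialogue, scene = split_script(sample)
print(dialogue)
print(scene)
```

Here the action line after Elena's speech lands in the scene list instead of being attached to her dialogue.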

Let me know if you hit snags with specific scripts—we can tweak the regex or parsing logic to handle edge cases!

The question in this post comes from Stack Exchange; it was asked by Mageek101.