Extracting .doc/.docx/.pdf Content and Cleaning Invalid Characters Using Only Core PHP

Ahua AIGC Lab

2026-5-14

Got it, let's break down your two problems and solve them with pure Core PHP, no third-party libraries required. First, we'll fix the invalid characters left behind by file_get_contents(), then build text extractors for .doc, .docx, and .pdf files.

Fixing Invalid Characters from `file_get_contents()`

The core issue here is that .doc, .docx, and .pdf are binary file formats, not plain text. When you use file_get_contents() directly, you're reading raw binary data: formatting markers, metadata, and non-printable bytes that simple preg_replace calls can't fully strip out.

Here's a better workflow:

1. Extract the actual text content first using format-specific logic (see below)
2. Run this cleanup routine to polish the extracted text:

function cleanExtractedText($text) {
    // Remove non-printable ASCII characters
    $text = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/', '', $text);
    // Collapse extra whitespace/line breaks into single spaces
    $text = preg_replace('/\s+/', ' ', $text);
    // Trim leading/trailing whitespace
    return trim($text);
}

If you were trying to clean raw binary data directly, that's why you had leftovers. Always extract the text first, then clean it.
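The cleanup routine above removes control characters, but it does not touch bytes that are not valid UTF-8, and those tend to show up as garbage when echoed on a UTF-8 page. As a rough sketch, assuming the output should be UTF-8, a hypothetical helper (dropInvalidUtf8 is an illustrative name, not part of the original answer) can keep only well-formed UTF-8 sequences using nothing but PCRE:

```php
<?php
// Hypothetical helper: keeps only well-formed UTF-8 sequences and silently
// drops every byte that does not belong to one.
function dropInvalidUtf8(string $text): string
{
    // Byte ranges of well-formed UTF-8, from 1-byte to 4-byte sequences.
    $wellFormed = '/(?:[\x00-\x7F]'
        . '|[\xC2-\xDF][\x80-\xBF]'
        . '|\xE0[\xA0-\xBF][\x80-\xBF]'
        . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
        . '|\xED[\x80-\x9F][\x80-\xBF]'
        . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'
        . '|[\xF1-\xF3][\x80-\xBF]{3}'
        . '|\xF4[\x80-\x8F][\x80-\xBF]{2})+/';

    // Collect every well-formed run; anything between runs is discarded.
    preg_match_all($wellFormed, $text, $matches);
    return implode('', $matches[0]);
}
```

Calling it before cleanExtractedText() is one option. Note that it discards invalid bytes rather than converting them, so text in a legacy single-byte encoding loses its accented characters instead of having them repaired.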

Pure Core PHP Text Extraction by File Type

1. .docx Files

Docx files are actually ZIP archives with XML inside. We can use PHP's ZipArchive class (provided by the bundled zip extension, which must be enabled) to pull out the main content:

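function extractDocxText($filePath) {
    $zip = new ZipArchive;
    if ($zip->open($filePath) !== true) {
        return false;
    }
    // Main text lives in word/document.xml
    $xmlContent = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xmlContent === false) {
        return false;
    }

    // Put a space at paragraph ends, line breaks, and tabs so that
    // neighbouring words do not run together once the tags are gone
    $xmlContent = preg_replace('/<\/w:p>|<w:br\b[^>]*>|<w:tab\b[^>]*>/', ' ', $xmlContent);
    // Strip XML tags, decode entities such as &amp;, and clean up
    $text = html_entity_decode(strip_tags($xmlContent), ENT_QUOTES | ENT_XML1, 'UTF-8');
    return cleanExtractedText($text);
}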

2. .doc Files

.doc is a proprietary binary format (OLE Compound Document), so full parsing is tricky in pure PHP. This lightweight method is a heuristic rather than a parser: it scans the file for runs of printable bytes, which recovers the main text of many basic documents:

function extractDocText($filePath) {
    // Read everything after the 512-byte OLE header in one pass, so a run
    // of text is never cut in half at a read-buffer boundary
    $data = file_get_contents($filePath, false, null, 512);
    if ($data === false) {
        return false;
    }

    // Keep runs of printable single-byte characters
    // (ASCII plus the \xA0-\xFF range)
    preg_match_all('/[\x20-\x7E\xA0-\xFF]+/', $data, $matches);
    return cleanExtractedText(implode(' ', $matches[0]));
}

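Note: This ignores document structure, so tables and embedded objects come out flattened or as noise, and stray metadata strings (font names, author fields) can end up in the output. It also misses text that Word stored as 16-bit Unicode instead of single-byte characters, and the bytes it keeps in the \xA0-\xFF range are not UTF-8.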
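Word can also store a document's text as UTF-16LE, in which case every basic Latin character is followed by a zero byte and a scan for plain printable runs finds little. A minimal sketch of a complementary scan, assuming only ASCII-range characters need to be recovered (extractUtf16AsciiRuns and its $minChars parameter are hypothetical, not part of the original answer):

```php
<?php
// Hypothetical helper: collects runs of printable ASCII characters that are
// stored as UTF-16LE, i.e. each character is followed by a zero byte.
function extractUtf16AsciiRuns(string $data, int $minChars = 4): array
{
    // Require at least $minChars characters per run to cut down on noise.
    $pattern = '/(?:[\x20-\x7E]\x00){' . $minChars . ',}/';
    preg_match_all($pattern, $data, $matches);

    // Dropping the zero bytes turns each run into plain ASCII.
    return array_map(function (string $run): string {
        return str_replace("\x00", '', $run);
    }, $matches[0]);
}
```

The runs could be joined with spaces and passed through cleanExtractedText(). Characters outside the ASCII range would need a real UTF-16 conversion (for example through the iconv or mbstring extensions), which this sketch deliberately leaves out.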

3. .pdf Files

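PDFs are binary, but text is drawn inside text objects delimited by the BT (begin text) and ET (end text) operators. This function scans the raw file for those objects and pulls out the text inside them: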

function extractPdfText($filePath) {
    $content = file_get_contents($filePath);
    if ($content === false) {
        return false;
    }
    $text = '';

    // Grab every text object (the operators between BT and ET)
    preg_match_all('/\bBT\b(.*?)\bET\b/s', $content, $matches);

    foreach ($matches[1] as $stream) {
        // Text is shown as string literals in parentheses, e.g. (Hello) Tj;
        // everything outside the parentheses is an operator or operand
        preg_match_all('/\((?:\\\\.|[^\\\\()])*\)/s', $stream, $strings);
        foreach ($strings[0] as $literal) {
            // Drop the outer parentheses, then unescape \( \) and \\
            $literal = substr($literal, 1, -1);
            $text .= preg_replace('/\\\\([()\\\\])/', '$1', $literal);
        }
        $text .= ' ';
    }

    return cleanExtractedText($text);
}

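Note: This won't work for encrypted PDFs or image-only PDFs (you can't extract text from images with pure PHP). It also only sees content streams that are stored uncompressed. Many PDF writers compress them with the FlateDecode filter, in which case BT and ET never appear in the raw bytes and the function returns an empty string.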
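When a PDF compresses its content streams with FlateDecode, one possible workaround is to inflate the streams first and search the result for BT and ET. The sketch below assumes PHP's zlib extension is available (gzuncompress() is not strictly core) and that the streams use plain FlateDecode with no further filters. inflatePdfStreams is a hypothetical helper name, not part of the original answer:

```php
<?php
// Hypothetical helper: returns the concatenated contents of every stream in
// the PDF that can be inflated as zlib data; other streams are skipped.
function inflatePdfStreams(string $pdf): string
{
    $out = '';
    // Each stream body sits between the "stream" and "endstream" keywords.
    preg_match_all('/(?<!end)stream\r?\n(.*?)endstream/s', $pdf, $matches);

    foreach ($matches[1] as $raw) {
        // The end-of-line marker before "endstream" is not part of the data,
        // so try the body as captured and again with that marker trimmed.
        foreach ([$raw, rtrim($raw, "\r\n")] as $candidate) {
            $data = @gzuncompress($candidate);
            if ($data !== false) {
                $out .= $data . "\n";
                break;
            }
        }
    }
    return $out;
}
```

Running the BT/ET matching from extractPdfText() over inflatePdfStreams($content) as well as over $content covers both compressed and uncompressed files. Text set in fonts with custom encodings, or written as hex strings, will still come out unreadable with this approach.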

Putting It All Together

Create a wrapper function to detect the file type and call the right extractor:

function extractDocumentText($filePath) {
    $extension = strtolower(pathinfo($filePath, PATHINFO_EXTENSION));
    
    switch ($extension) {
        case 'docx':
            return extractDocxText($filePath);
        case 'doc':
            return extractDocText($filePath);
        case 'pdf':
            return extractPdfText($filePath);
        default:
            // Match the extractors, which return false on failure
            return false;
    }
}

// Example usage
$text = extractDocumentText('path/to/your/document.pdf');
echo $text === false ? 'Could not extract text from this file' : $text;
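File extensions are easy to get wrong, especially on uploads, so it can help to check the file's leading bytes as well. The sketch below compares them against the well-known signatures of PDF, ZIP, and OLE compound files. detectTypeFromBytes is a hypothetical helper, and since the ZIP and OLE signatures are shared with other formats (.xlsx, .xls, ordinary archives), its result is only a hint:

```php
<?php
// Hypothetical helper: guesses the container format from a file's first
// bytes. Returns 'pdf', 'docx', 'doc', or null when nothing matches.
function detectTypeFromBytes(string $head): ?string
{
    $signatures = [
        'pdf'  => "%PDF-",                             // PDF header
        'docx' => "PK\x03\x04",                        // any ZIP archive
        'doc'  => "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1",  // any OLE compound file
    ];

    foreach ($signatures as $type => $magic) {
        if (strncmp($head, $magic, strlen($magic)) === 0) {
            return $type;
        }
    }
    return null;
}
```

The first eight bytes are enough, for example detectTypeFromBytes((string) file_get_contents($filePath, false, null, 0, 8)). The result can then be compared with the extension before an extractor is chosen.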

The question behind this content comes from Stack Exchange, where it was asked by Priyank.