Python Beginner Asking for Help: Steps to Extract .docx Text to CSV for K-means Clustering

阿华AIGC实验室

2026-5-22

Step-by-Step Guide for Your K-means Clustering Project (from .docx to CSV & Beyond)

Hey there! I’m a fellow Python learner who’s tackled similar tasks, so let’s break this down into easy-to-follow steps tailored for beginners.

1. Extract Text from .docx Files

First, we need to pull all the text content out of your .docx document. The python-docx library is perfect for this—it’s simple and beginner-friendly.

Install the library first:
```
pip install python-docx
```

Then use this code to extract text:

```
from docx import Document

def extract_docx_text(file_path):
    """Return the text of every body paragraph, one paragraph per line."""
    doc = Document(file_path)
    return '\n'.join(para.text for para in doc.paragraphs)

# Replace with your .docx file path
raw_text = extract_docx_text("your_document.docx")
print(raw_text[:500])  # Print first 500 chars to check if it worked
```

This will grab all paragraphs from your document and combine them into a single, readable string.
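
One catch: doc.paragraphs only covers body paragraphs, so text sitting inside tables is skipped. A .docx file is really a ZIP archive with the main text stored in word/document.xml, so here is a rough, dependency-free sketch that reads that XML directly and picks up table cells too. The function name extract_docx_text_stdlib and the stripped-down demo archive are made up for illustration, and the sketch ignores tabs, line breaks, headers, and footers.

```python
import os
import tempfile
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def extract_docx_text_stdlib(file_path):
    """Return the text of every paragraph, including those inside tables."""
    with zipfile.ZipFile(file_path) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    lines = []
    # iter() walks the whole tree, so paragraphs nested in table cells are visited too
    for para in root.iter(W_NS + "p"):
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        lines.append("".join(runs))
    return "\n".join(lines)

# Demo fixture (hypothetical): an archive holding only word/document.xml.
# It is enough for this function, but Word itself would not open it.
demo_xml = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body>"
    "<w:p><w:r><w:t>Body paragraph</w:t></w:r></w:p>"
    "<w:tbl><w:tr><w:tc><w:p><w:r><w:t>Table cell</w:t></w:r></w:p></w:tc></w:tr></w:tbl>"
    "</w:body></w:document>"
)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.docx")
    with zipfile.ZipFile(demo_path, "w") as archive:
        archive.writestr("word/document.xml", demo_xml)
    demo_text = extract_docx_text_stdlib(demo_path)

print(demo_text)
```

For a real file you would call extract_docx_text_stdlib("your_document.docx") and compare the result with the python-docx output to see whether any table text was missing.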

2. Text Preprocessing (The Most Critical Step!)

Raw text is messy—we need to clean it up so it’s usable for clustering. Here’s what you’ll need to do (adjust based on whether your text is Chinese or English):

For Chinese Text:

Segmentation (分词): Use jieba to split sentences into individual words.
```
pip install jieba
```
Remove Stopwords: Filter out meaningless words like "的", "了", "是"—you can find a Chinese stopword list online (save it as a .txt file, one word per line).
Clean Special Characters: Get rid of punctuation, numbers, and symbols.

Example code:

```
import re

import jieba

def load_stopwords(stopword_path):
    """Read one stopword per line, ignoring blank lines and stray whitespace."""
    with open(stopword_path, 'r', encoding='utf-8') as f:
        return {line.strip() for line in f if line.strip()}

# Named chinese_stopwords so it cannot clash with nltk's `stopwords` module
chinese_stopwords = load_stopwords("chinese_stopwords.txt")

def preprocess_chinese(text):
    # Remove punctuation, symbols, and digits
    # (\w keeps Chinese characters, Latin letters, and underscores)
    clean_text = re.sub(r'[^\w\s]|\d', '', text)
    # Segment the text into words
    words = jieba.lcut(clean_text)
    # Drop stopwords and whitespace-only tokens
    return [word for word in words if word.strip() and word not in chinese_stopwords]

processed_words = preprocess_chinese(raw_text)
print(processed_words[:20])  # Check first 20 cleaned words
```
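
Before trusting the cleaning step, it can help to try the regex on a small mixed string. The sketch below uses a pattern equivalent to the one in preprocess_chinese, plus a stricter variant that keeps only characters in the basic CJK range (U+4E00 to U+9FFF). The sample sentence is invented, and the stricter pattern is only one possible choice if you want Chinese characters and nothing else.

```python
import re

sample = "你好，世界！2024年 Python_3 is fun."

# Equivalent to the cleaning pattern above: drop punctuation, symbols, and digits
cleaned = re.sub(r'[^\w\s]|\d', '', sample)
print(cleaned)  # 你好世界年 Python_ is fun

# Stricter variant (assumption: only basic CJK characters and whitespace are wanted)
chinese_only = re.sub(r'[^\u4e00-\u9fff\s]', '', sample)
print(chinese_only.split())  # ['你好世界年']
```

Note that Latin letters and the underscore survive the first pattern, which matters if your document mixes Chinese and English.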

For English Text:

Tokenization: Split into words using nltk or basic string methods.
Lowercasing: Convert all text to lowercase to avoid treating "Cat" and "cat" as different words.
Remove Stopwords: Use nltk’s built-in stopword list.
Stemming/Lemmatization: Reduce words to their root form (e.g., "running" → "run").

Example code (using nltk):

```
pip install nltk
```

```
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download nltk resources (run once)
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases look for this instead of 'punkt'
nltk.download('stopwords')
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
# Use a separate name so the imported `stopwords` module is not overwritten
english_stopwords = set(stopwords.words('english'))

def preprocess_english(text):
    # Lowercase everything
    text = text.lower()
    # Remove punctuation, symbols, and digits
    clean_text = re.sub(r'[^\w\s]|\d', '', text)
    # Split into individual words
    words = word_tokenize(clean_text)
    # Drop stopwords, then lemmatize. lemmatize() treats words as nouns by default,
    # so "cats" becomes "cat" but "running" stays as is unless you pass pos='v'.
    return [lemmatizer.lemmatize(word) for word in words
            if word.strip() and word not in english_stopwords]

processed_words = preprocess_english(raw_text)
print(processed_words[:20])
```
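
If installing nltk is a hurdle, the "basic string methods" route mentioned above can look roughly like this. It is a minimal sketch: the function name preprocess_english_basic and the tiny TOY_STOPWORDS set are invented for illustration, there is no lemmatization, and contractions such as "don't" get split into "don" and "t".

```python
import re

# Tiny illustrative stopword set (hypothetical; NLTK's list is far longer)
TOY_STOPWORDS = {"the", "a", "an", "is", "are", "and", "of", "to", "in", "on"}

def preprocess_english_basic(text):
    """Lowercase, keep alphabetic runs only, and drop stopwords."""
    # findall with [a-z]+ tokenizes and strips punctuation and digits in one pass
    words = re.findall(r"[a-z]+", text.lower())
    return [word for word in words if word not in TOY_STOPWORDS]

sample_words = preprocess_english_basic(
    "The 3 cats are sitting on the mat, and the dog is running!"
)
print(sample_words)  # ['cats', 'sitting', 'mat', 'dog', 'running']
```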

3. Convert Processed Text to CSV Format

Now that we have clean words, we can save them to a CSV. Two common options work here—pick the one that fits your clustering needs:

Option 1: Save Cleaned Word List

Use pandas (super intuitive for CSV handling):

```
pip install pandas
```

```
import pandas as pd

# Create a DataFrame with one cleaned word per row
df = pd.DataFrame({'Cleaned_Words': processed_words})
# Save to CSV without the index column; switch to encoding='utf-8-sig'
# if Excel shows garbled Chinese text
df.to_csv("cleaned_words.csv", index=False, encoding='utf-8')
```

Option 2: Save Word Frequency Count (Better for Clustering)

Count how often each word appears. A frequency table says more about the document than a raw word list, though K-means itself still needs one row of numeric features per item to cluster (step 4 builds those):

```
from collections import Counter

import pandas as pd

# Count word occurrences
word_counts = Counter(processed_words)
# most_common() returns (word, count) pairs already sorted by descending count
count_df = pd.DataFrame(word_counts.most_common(), columns=['Word', 'Frequency'])
# Save to CSV
count_df.to_csv("word_frequencies.csv", index=False, encoding='utf-8')
```
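
Both options describe the document as a whole. If the goal is to cluster paragraphs, one possible CSV layout is a count matrix with one row per paragraph and one column per word. Here is a small sketch using only the standard library; toy_paragraphs stands in for the per-paragraph output of your preprocessing function, and the file name in the comment is just a placeholder.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for per-paragraph lists of cleaned words
toy_paragraphs = [
    ["cat", "dog", "cat"],
    ["dog", "bird"],
    ["cat", "bird", "bird"],
]

# Fixed column order: every distinct word, sorted
vocabulary = sorted({word for words in toy_paragraphs for word in words})

# One row of counts per paragraph; Counter returns 0 for words it has not seen
rows = []
for words in toy_paragraphs:
    counts = Counter(words)
    rows.append([counts[word] for word in vocabulary])

# Written to an in-memory buffer here; for a real file, use
# open("term_counts.csv", "w", newline="", encoding="utf-8") instead
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(vocabulary)
writer.writerows(rows)
print(buffer.getvalue())
```

Each row is now a numeric vector, which is the shape K-means expects.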

4. Prepare Data for K-means Clustering (Bonus!)

Since you mentioned K-means, you’ll need to convert text into numerical features (machines can’t read words directly). The most common method is TF-IDF:

```
import pandas as pd
from docx import Document
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Treat each non-empty paragraph as one item to cluster
doc = Document("your_document.docx")
paragraphs = [para.text for para in doc.paragraphs if para.text.strip()]
# Clean each paragraph (swap in preprocess_english for English text)
cleaned_paragraphs = [preprocess_chinese(p) for p in paragraphs]
# Join words back into space-separated strings for TF-IDF
cleaned_strings = [' '.join(words) for words in cleaned_paragraphs]

# Create TF-IDF numerical features. The default token pattern drops
# single-character tokens, which are common after Chinese segmentation,
# so this pattern keeps them.
vectorizer = TfidfVectorizer(token_pattern=r'(?u)\b\w+\b')
tfidf_matrix = vectorizer.fit_transform(cleaned_strings)

# Run K-means clustering
num_clusters = 3  # Adjust to your data; must not exceed the number of paragraphs
# n_init is set explicitly because its default differs across scikit-learn versions
kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=42)
kmeans.fit(tfidf_matrix)

# Add cluster labels to your data and save to CSV
result_df = pd.DataFrame({'Paragraph': paragraphs, 'Cluster': kmeans.labels_})
result_df.to_csv("clustered_paragraphs.csv", index=False, encoding='utf-8')
```
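
To make the TF-IDF step less of a black box, here is a hand-rolled sketch of the textbook formula: term frequency within a paragraph times the log of (number of paragraphs / number of paragraphs containing the word). The toy_docs data and the tfidf_by_hand name are invented for illustration. scikit-learn's TfidfVectorizer smooths the idf term and normalizes each row by default, so its numbers will differ, but the ranking idea is the same.

```python
import math
from collections import Counter

# Hypothetical cleaned paragraphs (lists of tokens)
toy_docs = [
    ["cat", "sat", "mat"],
    ["cat", "chased", "dog"],
    ["dog", "sat", "log"],
]

def tfidf_by_hand(docs):
    """Textbook TF-IDF: (count / doc length) * ln(N / document frequency)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights

weights = tfidf_by_hand(toy_docs)
for row in weights:
    print({word: round(value, 3) for word, value in row.items()})
```

In the first paragraph, "mat" appears in only one paragraph while "cat" appears in two, so "mat" ends up with the larger weight. A word that appears in every paragraph would get a weight of zero under this formula.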

The scikit-learn code in this step groups paragraphs with similar vocabulary into the same cluster, and the CSV records which cluster each paragraph landed in.
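
For a feel of what K-means does with those numeric rows, and how the number of clusters changes the result, here is a toy version of the loop in plain Python: assign every point to its nearest centroid, move each centroid to the mean of its points, repeat. The 2-D toy_points and the toy_kmeans function are invented for illustration; real TF-IDF rows have many more dimensions, and scikit-learn's KMeans uses smarter initialization.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
        for p in points
    ]

def toy_kmeans(points, k, n_iter=20, seed=42):
    """Plain K-means loop; returns (labels, inertia)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    for _ in range(n_iter):
        labels = assign(points, centroids)
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    labels = assign(points, centroids)
    # Inertia: total squared distance from each point to its centroid
    inertia = sum(squared_distance(p, centroids[label]) for p, label in zip(points, labels))
    return labels, inertia

# Two obvious groups: one near (1, 1), one near (8, 8)
toy_points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
for k in (1, 2, 3):
    labels, inertia = toy_kmeans(toy_points, k)
    print(k, labels, round(inertia, 3))

labels_k2, inertia_k2 = toy_kmeans(toy_points, 2)
inertia_k1 = toy_kmeans(toy_points, 1)[1]
```

Inertia drops sharply from k=1 to k=2 and much less after that, which is the usual hint (the "elbow") that two clusters fit this data. The fitted scikit-learn model exposes the same quantity as kmeans.inertia_, so you can try several values of num_clusters and compare.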

The question in this post comes from Stack Exchange; it was asked by bernard.