Python beginner asking for help: steps for extracting .docx text and converting it to CSV for K-means clustering
Hey there! As a fellow Python learner who's tackled similar tasks, I'll break this down into easy-to-follow steps tailored for beginners.
1. Extract Text from .docx Files
First, we need to pull all the text content out of your .docx document. The python-docx library is perfect for this—it’s simple and beginner-friendly.
- Install the library first:

    pip install python-docx

- Then use this code to extract text:

    from docx import Document

    def extract_docx_text(file_path):
        doc = Document(file_path)
        full_text = []
        for para in doc.paragraphs:
            full_text.append(para.text)
        return '\n'.join(full_text)

    # Replace with your .docx file path
    raw_text = extract_docx_text("your_document.docx")
    print(raw_text[:500])  # Print first 500 chars to check if it worked

This will grab all paragraphs from your document and combine them into a single, readable string.
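If your document keeps meaningful text in tables, note that `doc.paragraphs` does not return it. Here is a minimal sketch of one way to include table cells as well. `collect_text` and `extract_docx_text_with_tables` are hypothetical helper names of mine, not part of python-docx, and the table text is appended after the body paragraphs rather than in reading order:

```python
def collect_text(doc):
    """Gather body paragraphs first, then the text of every table cell."""
    parts = [para.text for para in doc.paragraphs]
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                parts.append(cell.text)
    # Drop blank entries so the result has no runs of empty lines
    return "\n".join(part for part in parts if part.strip())


def extract_docx_text_with_tables(file_path):
    # Imported here so collect_text can be tried without python-docx installed
    from docx import Document
    return collect_text(Document(file_path))
```

You would call `extract_docx_text_with_tables("your_document.docx")` in place of `extract_docx_text` if the plain version misses content you care about.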
2. Text Preprocessing (The Most Critical Step!)
Raw text is messy—we need to clean it up so it’s usable for clustering. Here’s what you’ll need to do (adjust based on whether your text is Chinese or English):
For Chinese Text:
- Segmentation (分词): Use jieba to split sentences into individual words.

    pip install jieba

- Remove Stopwords: Filter out meaningless words like "的", "了", "是". You can find a Chinese stopword list online (save it as a .txt file, one word per line).
- Clean Special Characters: Get rid of punctuation, numbers, and symbols.
Example code:

    import jieba
    import re

    # Load stopwords (one word per line, UTF-8)
    def load_stopwords(stopword_path):
        with open(stopword_path, 'r', encoding='utf-8') as f:
            stopwords = set(f.read().splitlines())
        return stopwords

    stopwords = load_stopwords("chinese_stopwords.txt")

    def preprocess_chinese(text):
        # Remove special chars and numbers
        clean_text = re.sub(r'[^\w\s]|[\d]', '', text)
        # Segment words
        words = jieba.lcut(clean_text)
        # Filter stopwords and empty strings
        filtered_words = [word for word in words if word not in stopwords and word.strip() != '']
        return filtered_words

    processed_words = preprocess_chinese(raw_text)
    print(processed_words[:20])  # Check first 20 cleaned words

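To see what the cleaning regex in `preprocess_chinese` actually removes, here is a small standalone check on a made-up sample string. In Python 3, `\w` matches Chinese characters as well as Latin letters, digits, and the underscore, so the pattern strips punctuation and digits but leaves words in both languages intact. `strip_symbols_and_digits` is just an illustrative name:

```python
import re

# Same pattern as in preprocess_chinese: anything that is neither a word
# character nor whitespace, or any digit
CLEAN_PATTERN = re.compile(r'[^\w\s]|[\d]')

def strip_symbols_and_digits(text):
    return CLEAN_PATTERN.sub('', text)

sample = "你好, 世界! Hello, world 2024."
cleaned = strip_symbols_and_digits(sample)
print(cleaned.split())
```

Because matches are deleted rather than replaced with a space, text on either side of a punctuation mark is glued together when there is no whitespace between them. That is harmless for Chinese, since jieba re-segments the string anyway, but worth keeping in mind for mixed text.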
For English Text:
- Tokenization: Split into words using nltk or basic string methods.
- Lowercasing: Convert all text to lowercase to avoid treating "Cat" and "cat" as different words.
- Remove Stopwords: Use nltk's built-in stopword list.
- Stemming/Lemmatization: Reduce words to their root form (e.g., "running" → "run").
Example code (using nltk):

    pip install nltk

    import nltk
    from nltk.tokenize import word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    import re

    # Download nltk resources (run once)
    nltk.download('punkt')
    nltk.download('punkt_tab')  # newer NLTK releases look for this tokenizer resource instead of 'punkt'
    nltk.download('stopwords')
    nltk.download('wordnet')

    lemmatizer = WordNetLemmatizer()
    # Use a separate name so the imported stopwords module is not overwritten
    english_stopwords = set(stopwords.words('english'))

    def preprocess_english(text):
        # Lowercase everything
        text = text.lower()
        # Remove special chars and numbers
        clean_text = re.sub(r'[^\w\s]|[\d]', '', text)
        # Split into individual words
        words = word_tokenize(clean_text)
        # Filter stopwords and simplify word forms.
        # lemmatize() treats words as nouns by default ("cats" -> "cat");
        # pass pos='v' if you also want verbs reduced ("running" -> "run").
        filtered_words = [lemmatizer.lemmatize(word) for word in words
                          if word not in english_stopwords and word.strip() != '']
        return filtered_words

    processed_words = preprocess_english(raw_text)
    print(processed_words[:20])

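If you would rather not download NLTK resources, the "basic string methods" route mentioned in the list can be sketched with the standard library alone. This is a rough approximation, not a drop-in replacement: `BASIC_STOPWORDS` is a tiny hypothetical stopword set for illustration, and there is no lemmatization, so "running" and "run" stay separate tokens.

```python
import re

# Tiny illustrative stopword set; a real list would be much longer
BASIC_STOPWORDS = {"a", "an", "and", "the", "is", "are", "of", "to", "in", "on"}

def preprocess_english_basic(text):
    # Lowercase, then replace anything that is not a-z or whitespace with a space
    lowered = text.lower()
    letters_only = re.sub(r"[^a-z\s]", " ", lowered)
    # str.split() with no arguments splits on any run of whitespace
    tokens = letters_only.split()
    return [tok for tok in tokens if tok not in BASIC_STOPWORDS]

print(preprocess_english_basic("The cats are running in the garden, 3 times!"))
```

One side effect of this simple approach is that contractions get split: "don't" becomes the two tokens "don" and "t".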
3. Convert Processed Text to CSV Format
Now that we have clean words, we can save them to a CSV. Two common options work here—pick the one that fits your clustering needs:
Option 1: Save Cleaned Word List
Use pandas (super intuitive for CSV handling):

    pip install pandas

    import pandas as pd

    # Create a DataFrame with your cleaned words
    df = pd.DataFrame({'Cleaned_Words': processed_words})
    # Save to CSV (no extra index column, use UTF-8 for compatibility)
    df.to_csv("cleaned_words.csv", index=False, encoding='utf-8')

Option 2: Save Word Frequency Count (Better for Clustering)
Count how often each word appears. A frequency table summarizes the document better than a raw word list, although K-means itself needs one numeric row per paragraph or document, which step 4 builds:

    from collections import Counter
    import pandas as pd

    # Count word occurrences
    word_counts = Counter(processed_words)
    # Convert to a DataFrame
    count_df = pd.DataFrame(list(word_counts.items()), columns=['Word', 'Frequency'])
    # Sort by frequency (optional but helpful)
    count_df = count_df.sort_values(by='Frequency', ascending=False)
    # Save to CSV
    count_df.to_csv("word_frequencies.csv", index=False, encoding='utf-8')

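One practical wrinkle if you plan to open the CSV in Excel: files saved as plain UTF-8 can show Chinese characters as garbled text there, and writing with the `utf-8-sig` encoding (UTF-8 plus a byte-order mark) usually avoids that. Below is a minimal sketch using only the standard library. `write_frequency_csv` is a hypothetical helper and the word list is made up:

```python
import csv
from collections import Counter

def write_frequency_csv(words, csv_path):
    """Write (word, frequency) rows, most frequent first."""
    counts = Counter(words)
    # newline='' stops the csv module from adding blank lines on Windows
    with open(csv_path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(["Word", "Frequency"])
        writer.writerows(counts.most_common())
    return counts

if __name__ == "__main__":
    write_frequency_csv(["数据", "聚类", "数据", "文本"], "word_frequencies_demo.csv")
```

The same encoding name also works in the pandas version: pass `encoding='utf-8-sig'` to `to_csv`.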
4. Prepare Data for K-means Clustering (Bonus!)
Since you mentioned K-means, you’ll need to convert text into numerical features (machines can’t read words directly). The most common method is TF-IDF:

    # Requires: pip install scikit-learn
    from docx import Document
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    # Treat each non-empty paragraph as one item to cluster
    doc = Document("your_document.docx")
    paragraphs = [para.text for para in doc.paragraphs if para.text.strip() != '']

    # Clean each paragraph (use preprocess_english if needed)
    cleaned_paragraphs = [preprocess_chinese(p) for p in paragraphs]
    # Join words back into strings for TF-IDF
    cleaned_strings = [' '.join(words) for words in cleaned_paragraphs]

    # Create TF-IDF numerical features.
    # The default token_pattern drops one-character tokens, which would discard
    # many Chinese words, so accept tokens of any length here.
    vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
    tfidf_matrix = vectorizer.fit_transform(cleaned_strings)

    # Run K-means clustering
    num_clusters = 3  # Adjust based on your data; must not exceed the number of paragraphs
    kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=42)
    kmeans.fit(tfidf_matrix)

    # Add cluster labels to your data and save to CSV
    result_df = pd.DataFrame({'Paragraph': paragraphs, 'Cluster': kmeans.labels_})
    result_df.to_csv("clustered_paragraphs.csv", index=False, encoding='utf-8')

This will group similar paragraphs into clusters based on their content!
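To make the "numerical features" idea concrete, here is a small pure-Python sketch of how a TF-IDF weight can be computed. It follows the smoothed formula that, as far as I know, scikit-learn's `TfidfVectorizer` uses by default: idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row. Treat it as an illustration, not a replacement for the library. `tfidf_rows` and the toy documents are made up for the example:

```python
import math
from collections import Counter

def tfidf_rows(tokenized_docs):
    """Return (vocabulary, rows); each row is an L2-normalized TF-IDF vector."""
    n_docs = len(tokenized_docs)
    vocabulary = sorted({tok for doc in tokenized_docs for tok in doc})
    # Document frequency: in how many documents each term appears
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocabulary}
    rows = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        raw = [tf[tok] * idf[tok] for tok in vocabulary]
        # "or 1.0" avoids dividing by zero when a document has no tokens
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocabulary, rows

docs = [["cat", "sat", "mat"], ["cat", "ate", "fish"], ["dog", "ate", "bone"]]
vocab, rows = tfidf_rows(docs)
for tok, weight in zip(vocab, rows[0]):
    print(f"{tok}: {weight:.3f}")
```

In the first toy document, "sat" and "mat" end up with a higher weight than "cat" because "cat" also appears in another document. K-means then groups paragraphs whose rows of weights are close to each other.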
The question comes from Stack Exchange; it was asked by bernard.




