Why is Python slow at reading large files? A question about reading a 10GB CSV
Great question—dealing with a 10GB CSV file (especially a 30k×30k float32 matrix) exposes some key bottlenecks in how Python handles text-based data and file I/O. Let’s break down the root causes first, then jump to practical fixes:
Key Reasons for Slow Performance
CSV Format Overhead: CSV is a text-based format, which means every np.float32 value was converted to a human-readable string when you wrote the file and now has to be parsed back to a float when reading. This serialization/deserialization round trip is extremely costly for large datasets: multiply it by 900 million values and you’re looking at massive CPU overhead. Splitting each line by commas (30k splits per line!) adds another layer of string-processing work, which Python isn’t optimized for compared to lower-level languages.
np.genfromtxt’s Inherent Inefficiency: This function wasn’t built for ultra-large files. It processes the file line by line, runs extensive type checks, and defaults to storing values as float64, which doubles the final array from about 3.6GB to about 7.2GB. The intermediate data structures it builds while parsing (Python lists and strings) come on top of that and are the likely reason memory usage climbs to 100GB. The same type-inference and bookkeeping overhead makes it painfully slow for big datasets.
Python GIL and Single-Threaded Parsing: If your custom read function uses pure Python code (like manual string splitting), it’s stuck running in a single thread due to the Global Interpreter Lock (GIL). This means you can’t leverage multiple CPU cores to speed up parsing, even if your system has them available.
Disk I/O (Secondary Factor): While mechanical hard drives can be a bottleneck, the bigger issue here is almost always the CPU cost of parsing text, not reading bytes from disk.
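The sizes quoted above follow from simple arithmetic. A quick sketch, assuming "30k×30k" means exactly 30,000 × 30,000 values and "10GB" means 10e9 bytes (both are assumptions; the real file may differ slightly):

```python
import numpy as np

rows = cols = 30_000                 # assumed exact shape of the matrix
n_values = rows * cols               # number of values to parse

# In-memory size of the final array for each dtype, in GB (1e9 bytes)
float32_gb = n_values * np.dtype(np.float32).itemsize / 1e9
float64_gb = n_values * np.dtype(np.float64).itemsize / 1e9

csv_bytes = 10e9                     # assumed size of the CSV on disk
chars_per_value = csv_bytes / n_values   # average text bytes per value, delimiter included

print(f"values to parse:      {n_values:,}")
print(f"float32 array:        {float32_gb:.1f} GB")
print(f"float64 array:        {float64_gb:.1f} GB")
print(f"text bytes per value: {chars_per_value:.1f}")
```

Under these assumptions, every 4-byte float costs about 11 bytes of text that has to be scanned, split, and converted, which is why the parsing step dominates.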
Practical Fixes to Speed Up Reading
1. Ditch CSV for Binary Formats (Best Option)
CSV is terrible for large numerical datasets. Instead, use numpy’s native binary formats, which store data exactly as it exists in memory (no string conversion):
When writing:
    import numpy as np

    # Replace your CSV write code with this
    np.save("large_matrix.npy", data)  # data is your np.float32 matrix
When reading:
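    data = np.load("large_matrix.npy", mmap_mode=None)  # Loads directly as float32, uses ~3.6GB memory
This will be orders of magnitude faster: there is no parsing, just a direct load of the raw bytes (or, with mmap_mode="r", a memory map that reads pages from disk only when they are accessed).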
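If you only need part of the matrix at a time, memory mapping avoids loading the whole file. A minimal sketch on a tiny stand-in array, assuming a temporary directory and reusing the file name large_matrix.npy purely for illustration:

```python
import os
import tempfile

import numpy as np

# Small stand-in for the real matrix; the 30k×30k case works the same way
data = np.arange(12, dtype=np.float32).reshape(3, 4)

tmp_dir = tempfile.mkdtemp()
npy_path = os.path.join(tmp_dir, "large_matrix.npy")
np.save(npy_path, data)

# mmap_mode="r" maps the file read-only; data is read from disk only when indexed
mapped = np.load(npy_path, mmap_mode="r")

# Copy just rows 1 and 2 into a regular in-memory array
row_block = np.array(mapped[1:3])

print(mapped.dtype, mapped.shape)
print(row_block)
```

The dtype and shape are stored in the .npy header, so nothing has to be inferred or converted on load.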
2. Use Pandas with C-Accelerated Parsing
If you must stick with CSV, use pandas.read_csv, which uses a C-based parser (way faster than pure Python):
    import pandas as pd
    import numpy as np

    # Read directly as float32 to save memory.
    # header=None assumes the file holds only numbers, with no header row;
    # without it, pandas would consume the first data row as column names.
    df = pd.read_csv("large_file.csv", header=None, dtype=np.float32)
    data = df.to_numpy()

    # For even better memory control, read in chunks
    chunk_size = 1000  # rows per chunk
    for chunk in pd.read_csv("large_file.csv", header=None, dtype=np.float32, chunksize=chunk_size):
        process_chunk(chunk.to_numpy())  # Replace with your processing logic
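When the final matrix still has to end up in memory as one array, the chunked read can write straight into a preallocated buffer instead of building one huge DataFrame first. A rough sketch, assuming the shape is known in advance and the file has no header row; read_csv_preallocated and the demo file are hypothetical names for illustration:

```python
import os
import tempfile

import numpy as np
import pandas as pd


def read_csv_preallocated(path, n_rows, n_cols, chunksize=1000):
    """Fill a preallocated float32 array chunk by chunk (shape must be known)."""
    out = np.empty((n_rows, n_cols), dtype=np.float32)
    start = 0
    reader = pd.read_csv(path, header=None, dtype=np.float32, chunksize=chunksize)
    for chunk in reader:
        stop = start + len(chunk)
        out[start:stop] = chunk.to_numpy()   # copy this chunk into its slot
        start = stop
    if start != n_rows:
        raise ValueError(f"expected {n_rows} rows, found {start}")
    return out


# Tiny demo file standing in for the 10GB CSV
demo = np.random.default_rng(0).random((25, 4)).astype(np.float32)
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
np.savetxt(demo_path, demo, delimiter=",", fmt="%.9g")

loaded = read_csv_preallocated(demo_path, n_rows=25, n_cols=4, chunksize=10)
print(loaded.shape, loaded.dtype)
```

Peak memory is then roughly the final array plus one chunk, rather than the array plus a full-size DataFrame.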
3. Optimize Your Custom Function
If you need to keep your custom reader, use these tweaks:
- Use the built-in csv.reader (it’s implemented in C, which avoids splitting strings by hand in Python)
- Memory-map the file with mmap to reduce I/O overhead:
    import csv
    import mmap
    import numpy as np

    def read_large_csv_mmap(path, dtype=np.float32):
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as mm:
                # csv.reader needs str lines, so decode each bytes line from the mmap
                lines = (line.decode("ascii") for line in iter(mm.readline, b""))
                # Convert row by row so only one row of strings is alive at a time
                rows = [np.array(row, dtype=dtype) for row in csv.reader(lines) if row]
        return np.vstack(rows)
- Use multiprocessing to parse chunks in parallel (bypasses the GIL): split the file into chunks and process each chunk in a separate process.
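The multiprocessing idea can be sketched as follows. This is one possible layout, not a tuned implementation: it assumes a purely numeric ASCII CSV with no header and no quoted fields, and line_aligned_ranges, parse_range, and read_csv_parallel are hypothetical helper names.

```python
import io
import os
from multiprocessing import Pool

import numpy as np


def line_aligned_ranges(path, n_parts):
    """Split the file into up to n_parts byte ranges that end on line breaks."""
    size = os.path.getsize(path)
    bounds = [0]
    with open(path, "rb") as f:
        for i in range(1, n_parts):
            f.seek(size * i // n_parts)
            f.readline()                  # move forward to the start of the next line
            pos = f.tell()
            if bounds[-1] < pos < size:   # keep boundaries strictly increasing
                bounds.append(pos)
    bounds.append(size)
    return list(zip(bounds[:-1], bounds[1:]))


def parse_range(task):
    """Parse one byte range of the file into a 2-D float32 array."""
    path, start, stop = task
    with open(path, "rb") as f:
        f.seek(start)
        text = f.read(stop - start).decode("ascii")
    return np.loadtxt(io.StringIO(text), delimiter=",", dtype=np.float32, ndmin=2)


def read_csv_parallel(path, n_workers=4):
    """Parse each byte range in its own process, then stack the results in order."""
    tasks = [(path, s, e) for s, e in line_aligned_ranges(path, n_workers)]
    with Pool(n_workers) as pool:
        parts = pool.map(parse_range, tasks)
    return np.vstack(parts)
```

Call read_csv_parallel("large_file.csv") from under an if __name__ == "__main__": guard so it also works on platforms that spawn worker processes. For a file this large you would likely want many more, smaller ranges than workers, since each worker holds its whole range as text while parsing and the parsed arrays are copied back to the parent process.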
The question originally comes from Stack Exchange; asked by Forrest Thumb.




