
Why is Python slow at reading large files? A question about reading a 10GB CSV

Why Is Reading Large CSV Files So Slow in Python?

Great question. A 10GB CSV file holding a 30k×30k float32 matrix exposes the main bottlenecks in how Python handles text-based data and file I/O. Let’s break down the root causes first, then move on to practical fixes:

Key Reasons for Slow Performance

  • CSV Format Overhead: CSV is a text-based format, so every np.float32 value was converted to a human-readable string when you wrote the file and now has to be parsed back into a float when you read it. That round trip is costly at scale: with 900 million values, the parsing alone is a large amount of CPU work. Splitting each line on commas (30k fields per line) adds another layer of string processing and creates a temporary Python string for every field.

  • np.genfromtxt’s Inherent Inefficiency: This function wasn’t built for very large files. It processes the file line by line in Python, runs type checks on every field, and defaults to float64, which doubles the final array from about 3.6GB to about 7.2GB. The 100GB peak you saw most likely comes from the intermediate Python lists and string objects it builds before converting to an array, not from the final array itself.

  • Python GIL and Single-Threaded Parsing: If your custom read function uses pure Python code (like manual string splitting), the work is CPU-bound, and the Global Interpreter Lock (GIL) lets only one thread execute Python bytecode at a time. Adding threads therefore does not spread the parsing across CPU cores, even if your system has them available.

  • Disk I/O (Secondary Factor): A mechanical hard drive can be a bottleneck, but here the bigger cost is almost always the CPU time spent parsing text, not reading bytes from disk.
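To put numbers on the points above, here is a small back-of-the-envelope sketch of the array sizes involved. It only assumes the 30k×30k shape mentioned in the question; the variable names are illustrative.

```python
import numpy as np

# Matrix shape taken from the question: 30k x 30k values.
rows = cols = 30_000
n_values = rows * cols

# Size of the final array for each dtype, in decimal gigabytes.
float32_gb = n_values * np.dtype(np.float32).itemsize / 1e9
float64_gb = n_values * np.dtype(np.float64).itemsize / 1e9

print(f"values:  {n_values:,}")
print(f"float32: {float32_gb:.1f} GB")
print(f"float64: {float64_gb:.1f} GB")
```

Anything far above these figures during loading is temporary parsing overhead rather than the data itself.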

Practical Fixes to Speed Up Reading

1. Ditch CSV for Binary Formats (Best Option)

CSV is terrible for large numerical datasets. Instead, use numpy’s native binary formats, which store data exactly as it exists in memory (no string conversion):

When writing:

import numpy as np
# Replace your CSV write code with this
np.save("large_matrix.npy", data)  # data is your np.float32 matrix

When reading:

data = np.load("large_matrix.npy", mmap_mode=None)  # Loads directly as float32, uses ~3.6GB memory

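This is typically far faster than parsing CSV because there is no text to parse: np.load reads the raw bytes straight into the array. Passing mmap_mode="r" instead memory-maps the file, so data is read from disk only when you access it.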
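If you only need part of the matrix at a time, a memory-mapped load may be enough. The sketch below is a minimal example on a tiny stand-in array; the helper name save_and_map and the file name demo_matrix.npy are made up for illustration.

```python
import os
import tempfile

import numpy as np


def save_and_map(data, path):
    """Write data as .npy, then reopen it memory-mapped and read-only."""
    np.save(path, data)
    return np.load(path, mmap_mode="r")


# Tiny stand-in for the real 30k x 30k matrix.
demo = np.arange(12, dtype=np.float32).reshape(3, 4)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_matrix.npy")
    mapped = save_and_map(demo, path)
    # Indexing touches only the requested part of the file;
    # np.array() copies that slice into ordinary memory.
    first_row = np.array(mapped[0])
    mapped_dtype = mapped.dtype
    del mapped  # release the mapping before the temp directory is removed
```

The dtype is stored in the .npy header, so the mapped array comes back as float32 without any extra arguments.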

2. Use Pandas with C-Accelerated Parsing

If you must stick with CSV, use pandas.read_csv, which uses a C-based parser (much faster than pure Python):

import pandas as pd
import numpy as np

# Read directly as float32 to save memory.
# header=None assumes the file is a bare matrix with no header row;
# drop it if your CSV does have one.
df = pd.read_csv("large_file.csv", header=None, dtype=np.float32)
data = df.to_numpy()

# For even better memory control, read in chunks
chunk_size = 1000  # rows per chunk
for chunk in pd.read_csv("large_file.csv", header=None, dtype=np.float32, chunksize=chunk_size):
    process_chunk(chunk.to_numpy())  # Replace with your processing logic
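If you need the whole matrix in memory rather than processing chunk by chunk, one option is to copy each chunk into a preallocated array so only one chunk of parser overhead exists at a time. This is a sketch under assumptions: the function name read_csv_preallocated is hypothetical, the shape must be known in advance, and the file is assumed to have no header row.

```python
import numpy as np
import pandas as pd


def read_csv_preallocated(path, n_rows, n_cols, chunk_rows=1000):
    """Fill a preallocated float32 array from a headerless CSV, chunk by chunk."""
    out = np.empty((n_rows, n_cols), dtype=np.float32)
    row = 0
    reader = pd.read_csv(path, header=None, dtype=np.float32, chunksize=chunk_rows)
    for chunk in reader:
        block = chunk.to_numpy()
        # Copy this chunk into its slot; the chunk can then be garbage-collected.
        out[row:row + len(block)] = block
        row += len(block)
    if row != n_rows:
        raise ValueError(f"expected {n_rows} rows, read {row}")
    return out
```

For the matrix in the question, the call would look like read_csv_preallocated("large_file.csv", 30_000, 30_000).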

3. Optimize Your Custom Function

If you need to keep your custom reader, use these tweaks:

  • Use the built-in csv.reader (it’s implemented in C and usually beats a hand-written Python loop over characters or fields)
  • Memory-map the file with mmap to reduce I/O overhead:
    import mmap
    import csv
    import numpy as np
    
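    def read_large_csv_mmap(path, dtype=np.float32):
        # mmap works on the raw bytes of the file, so open it in binary mode
        with open(path, 'rb') as f:
            with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as mm:
                # csv.reader needs str lines, so decode each bytes line from the mmap
                lines = (line.decode('ascii') for line in iter(mm.readline, b''))
                # Convert row by row so the fields are not all held as Python strings
                rows = [np.array(row, dtype=dtype) for row in csv.reader(lines) if row]
        return np.vstack(rows)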
    
  • Use multiprocessing to parse chunks in parallel (bypasses the GIL): split the file into chunks and process each chunk in a separate process.
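The multiprocessing idea in the last bullet could be sketched roughly as follows. All function names here are hypothetical, and the code assumes a headerless, comma-separated, purely numeric ASCII file. The key detail is cutting the file at line boundaries so no row is split between workers.

```python
import io
import os
from concurrent.futures import ProcessPoolExecutor

import numpy as np


def chunk_offsets(path, n_chunks):
    """Return (start, end) byte ranges that begin and end on line boundaries."""
    size = os.path.getsize(path)
    cuts = [0]
    with open(path, "rb") as f:
        for i in range(1, n_chunks):
            f.seek(max(size * i // n_chunks, cuts[-1]))
            f.readline()  # advance to the start of the next full line
            cuts.append(min(f.tell(), size))
    cuts.append(size)
    # Drop empty ranges, which can appear when the file has few lines.
    return [(a, b) for a, b in zip(cuts, cuts[1:]) if b > a]


def parse_chunk(task):
    """Parse one byte range of the file into a 2-D float32 array."""
    path, start, end = task
    with open(path, "rb") as f:
        f.seek(start)
        text = f.read(end - start).decode("ascii")
    return np.loadtxt(io.StringIO(text), delimiter=",", dtype=np.float32, ndmin=2)


def read_csv_parallel(path, n_workers=4):
    """Parse the chunks in separate processes and stack them in file order."""
    tasks = [(path, a, b) for a, b in chunk_offsets(path, n_workers)]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(parse_chunk, tasks))
    return np.vstack(parts)
```

When running this as a script, call read_csv_parallel from inside an `if __name__ == "__main__":` block, since worker processes may re-import the module on some platforms. Each worker also sends its parsed array back to the parent process, so the speedup depends on how that transfer cost compares with the parsing time.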

This question comes from Stack Exchange; asked by Forrest Thumb.
