How do I read a JSON file into a Pandas DataFrame? Help with errors when processing a large JSON file in Python 3.6

Fixing JSON-to-Pandas DataFrame Errors with 350MB Files in Python 3.6

Hey there! Let's break down the common issues and fixes when loading a 350MB JSON file into a Pandas DataFrame with Python 3.6. A file this size can trip up even seasoned devs, but there are several ways around it:

1. Memory Overload (Most Common)

A 350MB JSON file can take up several GB once loaded into memory, especially if it has nested structures or many repeated strings. Try these fixes:

  • Read in chunks: Use Pandas' chunksize parameter to process the file in smaller batches, then combine them if needed. Note that read_json only accepts chunksize together with lines=True, so the file must be line-delimited JSON (one object per line):
    import pandas as pd
    
    # Read 10,000 records at a time (adjust based on your memory)
    chunk_iter = pd.read_json("your_large_file.json", lines=True, chunksize=10000)
    df_list = []
    for chunk in chunk_iter:
        # Optional: filter rows or drop unneeded columns here so less data is kept
        df_list.append(chunk)
    final_df = pd.concat(df_list, ignore_index=True)
    
  • Convert to a compact binary format: Once you can read chunks, save the data to Parquet or Feather files. These formats store typed, compressed columns, so they take less disk space and reload much faster than JSON. Writing Parquet requires pyarrow or fastparquet to be installed. Later, you can load the Parquet files in one go:
    # Save chunks to Parquet (zero-padded names keep the files in order when sorted)
    chunk_iter = pd.read_json("your_large_file.json", lines=True, chunksize=10000)
    for i, chunk in enumerate(chunk_iter):
        chunk.to_parquet(f"chunk_{i:05d}.parquet")
    
    # Load all chunks back into a DataFrame, in the order they were written
    import glob
    parquet_files = sorted(glob.glob("chunk_*.parquet"))
    final_df = pd.concat([pd.read_parquet(file) for file in parquet_files], ignore_index=True)
    
  • Optimize DataFrame memory: After loading, shrink the DataFrame's footprint by downcasting data types:
    # Check current memory usage
    print(final_df.memory_usage(deep=True))
    
    # Convert numeric columns to smaller types
    final_df["numeric_col"] = pd.to_numeric(final_df["numeric_col"], downcast="integer")
    # Convert string columns to category if there are few unique values
    final_df["string_col"] = final_df["string_col"].astype("category")
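The downcasting step above can be applied to every column at once. Here is a minimal sketch, assuming a hypothetical helper named shrink and a hypothetical threshold max_unique_ratio (neither is part of Pandas); it only touches integer and string columns and leaves everything else alone:

```python
import pandas as pd
from pandas.api.types import is_integer_dtype, is_string_dtype


def shrink(df, max_unique_ratio=0.5):
    """Return a copy of df with smaller dtypes where that is safe.

    shrink and max_unique_ratio are illustrative names, not Pandas API.
    """
    out = df.copy()
    for col in out.columns:
        if is_integer_dtype(out[col]):
            # Pick the smallest integer type that still holds every value
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif is_string_dtype(out[col]):
            try:
                ratio = out[col].nunique() / max(len(out), 1)
            except TypeError:
                # Unhashable values such as lists or dicts: leave the column as-is
                continue
            if ratio <= max_unique_ratio:
                # Few distinct values: store each one once and keep small codes
                out[col] = out[col].astype("category")
    return out


example = pd.DataFrame({"n": [1, 2, 3, 4], "s": ["a", "a", "b", "b"]})
print(shrink(example).dtypes)
```

It is usually safer to call this on the combined DataFrame rather than on each chunk, because concatenating category columns whose categories differ falls back to the object dtype.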
    

2. JSON Format Issues

Pandas may throw parsing errors if the file's layout doesn't match what read_json expects, for example when it is line-delimited or nested rather than a flat array of records:

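  • Line-delimited JSON: If each line is a separate JSON object, use the lines=True parameter. It has been available since Pandas 0.19 (and chunksize for read_json since 0.21), so the Pandas 1.1.x releases that run on Python 3.6 support both:
    # With chunksize set, this returns an iterator of DataFrames, not a single DataFrame
    chunk_iter = pd.read_json("your_large_file.json", lines=True, chunksize=10000)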
    
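  • Nested JSON: Use pd.json_normalize() (Pandas 1.0+) to flatten nested structures, but do it in batches to avoid memory issues. This example assumes one JSON object per line:
    import json
    import pandas as pd
    
    chunks, batch = [], []
    with open("your_large_file.json", "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                batch.append(json.loads(line))
            if len(batch) >= 10000:
                # Flatten a whole batch at once; one call per record is very slow
                chunks.append(pd.json_normalize(batch))
                batch = []
    if batch:  # flatten whatever is left over
        chunks.append(pd.json_normalize(batch))
    final_df = pd.concat(chunks, ignore_index=True)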
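Before choosing between these options, it helps to know which layout the file actually uses. The sketch below is one rough way to check without loading the file; detect_json_layout and its return labels are hypothetical names made up for illustration, and the check only looks at the start of the file:

```python
import json


def detect_json_layout(path):
    """Guess the layout of a JSON file from its first characters.

    Returns "array" (top-level list), "lines" (one object per line),
    "multiline" (an object spread over several lines) or "unknown".
    The function name and labels are illustrative only.
    """
    with open(path, "r", encoding="utf-8") as f:
        # Skip leading whitespace one character at a time, so a huge
        # single-line file is never read into memory.
        ch = f.read(1)
        while ch and ch.isspace():
            ch = f.read(1)
        if ch == "[":
            return "array"
        if ch != "{":
            return "unknown"
        # If the rest of the first line completes a valid object,
        # treat the file as line-delimited JSON.
        try:
            json.loads(ch + f.readline())
        except ValueError:
            return "multiline"
        return "lines"
```

A result of "lines" suggests read_json with lines=True, "array" suggests a plain read_json call or a streaming parser, and "multiline" usually means the records sit under some key inside a larger object.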
    

3. Python 3.6 & Pandas Version Mismatch

Python 3.6 is end-of-life, but if you have to use it, make sure you're on the last Pandas release that supports it, which is 1.1.5 (Pandas 1.2 and later require Python 3.7 or newer). Older versions might have bugs with large JSON files:

pip install --upgrade pandas==1.1.5
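To see which of the features used in this answer your installed Pandas supports, a small check like the following may help. feature_report is a hypothetical helper written for this answer, and the version thresholds are the releases in which each feature first appeared:

```python
import sys

import pandas as pd


def feature_report(pandas_version):
    """Map each feature used in this answer to whether the version has it.

    feature_report is an illustrative helper, not part of Pandas.
    """
    # Compare only the major and minor numbers, e.g. "1.1.5" -> (1, 1)
    major_minor = tuple(int(part) for part in pandas_version.split(".")[:2])
    return {
        "read_json(lines=True)": major_minor >= (0, 19),
        "read_json(chunksize=...)": major_minor >= (0, 21),
        "pd.json_normalize": major_minor >= (1, 0),
    }


print("Python", sys.version_info[:2], "Pandas", pd.__version__)
for feature, available in feature_report(pd.__version__).items():
    print(feature, "available" if available else "missing")
```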

4. Stream with ijson (For Extra Large Files)

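If even chunked reading isn't working, or the file is one big JSON array rather than line-delimited, use the ijson library to parse it as a stream, so only one record is parsed at a time. Collecting every record in a list still keeps them all in memory, so drop the fields you don't need as you go:

import ijson
import pandas as pd

records = []
# ijson expects the file in binary mode and handles decoding itself
with open("your_large_file.json", "rb") as f:
    # Replace 'item' with the path to your records (e.g., 'data.item' if nested under 'data')
    for record in ijson.items(f, "item"):
        # Note: ijson returns non-integer numbers as decimal.Decimal
        records.append(record)
final_df = pd.DataFrame(records)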
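Because the loop above gathers every record into one list, peak memory still grows with the file size. One possible alternative is to turn the stream into small DataFrames, a batch at a time, and process or save each one before reading the next. This is a sketch only; frames_from_records is a hypothetical helper that works with any iterator of dict records, including the one returned by ijson.items:

```python
from itertools import islice

import pandas as pd


def frames_from_records(records, batch_size=10000):
    """Yield DataFrames of at most batch_size rows from an iterator of dicts.

    frames_from_records is an illustrative name, not a Pandas or ijson API.
    """
    iterator = iter(records)
    while True:
        # Pull the next batch_size records without consuming the rest
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield pd.DataFrame(batch)


# Demo with an in-memory generator; with ijson you would pass
# ijson.items(f, "item") instead, inside the open() block.
demo_records = ({"id": i, "value": i * 2} for i in range(5))
for frame in frames_from_records(demo_records, batch_size=2):
    print(len(frame), "rows")
```

Each yielded frame can be filtered, downcast, or written to Parquet before the next batch is read, so only one batch needs to be in memory at a time.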

The question in this post comes from Stack Exchange; it was asked by Alberto Alvarez.