How do I read a JSON file into a Pandas DataFrame? Help with errors when processing a large JSON file in Python 3.6
Hey there! Let's break down the common issues and fixes when loading a 350MB JSON file into a Pandas DataFrame with Python 3.6. A file this size can trip up even seasoned devs, but there are solutions:
1. Memory Overload (Most Common)
A 350MB JSON file can balloon to several GB when loaded into memory, especially if it has nested structures or repeated data. Try these fixes:
- Read in chunks: Use Pandas' chunksize parameter to process the file in smaller batches, then combine them if needed. Note that chunksize only works together with lines=True (one JSON object per line); without it, read_json raises a ValueError:

      import pandas as pd

      # Read 10,000 records at a time (adjust based on your memory)
      chunk_iter = pd.read_json("your_large_file.json", lines=True, chunksize=10000)
      df_list = []
      for chunk in chunk_iter:
          # Optional: Clean/process each chunk here to save memory
          df_list.append(chunk)
      final_df = pd.concat(df_list, ignore_index=True)

- Convert to a memory-efficient format: Once you can read chunks, save the data to a Parquet or Feather file. These columnar formats are compressed on disk, keep column dtypes, and load much faster than JSON (to_parquet needs pyarrow or fastparquet installed). Later, you can load the Parquet files in one go:

      import glob
      import pandas as pd

      # Save chunks to Parquet (zero-padded names keep the files in order)
      chunk_iter = pd.read_json("your_large_file.json", lines=True, chunksize=10000)
      for i, chunk in enumerate(chunk_iter):
          chunk.to_parquet(f"chunk_{i:05d}.parquet")

      # Load all chunks back into a DataFrame, in the order they were written
      parquet_files = sorted(glob.glob("chunk_*.parquet"))
      final_df = pd.concat([pd.read_parquet(file) for file in parquet_files], ignore_index=True)

- Optimize DataFrame memory: After loading, shrink the DataFrame's footprint by downcasting data types:

      # Check current memory usage
      print(final_df.memory_usage(deep=True))

      # Convert numeric columns to smaller types
      final_df["numeric_col"] = pd.to_numeric(final_df["numeric_col"], downcast="integer")

      # Convert string columns to category if there are few unique values
      final_df["string_col"] = final_df["string_col"].astype("category")
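To make the "clean each chunk" idea above concrete, here is a minimal sketch that drops unneeded columns and downcasts integer columns chunk by chunk, so the full-width data never sits in memory at once. It assumes the file is line-delimited JSON (which chunksize requires); the function name load_reduced and its parameters are hypothetical, not part of Pandas.

```python
import pandas as pd


def load_reduced(path, columns, chunksize=10000):
    """Read line-delimited JSON in chunks, keeping only `columns`.

    Hypothetical helper: integer columns are downcast per chunk so each
    stored piece is as small as possible before the final concat.
    """
    parts = []
    for chunk in pd.read_json(path, lines=True, chunksize=chunksize):
        # Keep only the requested columns (missing ones become NaN).
        chunk = chunk.reindex(columns=columns)
        for name in chunk.columns:
            if pd.api.types.is_integer_dtype(chunk[name]):
                chunk[name] = pd.to_numeric(chunk[name], downcast="integer")
        parts.append(chunk)
    # concat picks a common dtype if chunks were downcast differently.
    return pd.concat(parts, ignore_index=True)
```

You might call it as load_reduced("your_large_file.json", ["id", "value"]), where the column names are placeholders for whatever fields you actually need.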
2. JSON Format Issues
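With default settings, pd.read_json expects a single JSON document, such as a flat array of records. If your file is line-delimited or deeply nested instead, Pandas might throw parsing errors (for example, ValueError: Trailing data) or give you columns full of dicts: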
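- Line-delimited JSON: If each line is a separate JSON object, use the lines=True parameter (both lines and chunksize are available in pandas 1.1.5, the last release that supports Python 3.6):

      # Add chunksize=10000 to get an iterator of DataFrames instead of one DataFrame
      df = pd.read_json("your_large_file.json", lines=True)

- Nested JSON: Use pd.json_normalize() (top-level since pandas 1.0) to flatten nested structures, but do it in chunks to avoid memory issues:

      import json
      import pandas as pd

      chunks = []
      batch = []
      with open("your_large_file.json", "r", encoding="utf-8") as f:
          for line in f:  # assumes one JSON object per line
              if not line.strip():
                  continue
              batch.append(json.loads(line))
              if len(batch) == 10000:
                  # Flatten a whole batch at once instead of one record at a time
                  chunks.append(pd.json_normalize(batch))
                  batch = []
      if batch:
          chunks.append(pd.json_normalize(batch))
      final_df = pd.concat(chunks, ignore_index=True)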
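If you're not sure which layout your file uses, a quick check of the first non-blank line can tell you without loading the whole thing. This is only a heuristic sketch using the standard library; detect_json_layout and max_chars are hypothetical names, and unusual files (for example, line-delimited arrays) will be misclassified.

```python
import json


def detect_json_layout(path, max_chars=1 << 20):
    """Guess the layout of a JSON file from its first non-blank line.

    Returns "array" (top-level list), "lines" (one object per line),
    "object" (a single multi-line object), "empty", or "unknown".
    """
    with open(path, "r", encoding="utf-8") as f:
        # readline(max_chars) caps how much of a huge single line is read.
        line = f.readline(max_chars)
        while line and not line.strip():
            line = f.readline(max_chars)
    text = line.strip()
    if not text:
        return "empty"
    if text[0] == "[":
        return "array"
    if text[0] == "{":
        try:
            json.loads(text)
        except json.JSONDecodeError:
            # The first line is not a complete object on its own.
            return "object"
        return "lines"
    return "unknown"
```

A result of "lines" suggests lines=True will work, while "array" or "object" means chunksize is not an option and a streaming parser is the better fit.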
3. Python 3.6 & Pandas Version Mismatch
Python 3.6 is end-of-life, but if you have to use it, make sure you're on a compatible Pandas version (the latest supported for 3.6 is Pandas 1.1.5). Older versions might have bugs with large JSON files:
pip install --upgrade pandas==1.1.5
4. Stream with ijson (For Extra Large Files)
If even chunked reading isn't working (for example, because the file is one big JSON array rather than line-delimited), use the ijson library to parse the file as a stream, loading only one record at a time:
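    import ijson
    import pandas as pd

    records = []
    # Open in binary mode, which is what ijson expects
    with open("your_large_file.json", "rb") as f:
        # Replace 'item' with the path to your records (e.g., 'data.item' if nested under 'data')
        for record in ijson.items(f, "item"):
            records.append(record)
    # ijson yields decimal.Decimal for non-integer numbers; convert with astype(float) if needed
    final_df = pd.DataFrame(records)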
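If installing ijson isn't possible in your environment, the same one-record-at-a-time idea can be roughly approximated with the standard library's json.JSONDecoder.raw_decode. This is a sketch, not a replacement for ijson: iter_json_array and read_size are hypothetical names, it only handles a top-level array, and it does not strictly validate commas between elements.

```python
import json


def iter_json_array(path, read_size=65536):
    """Yield the elements of a top-level JSON array one at a time."""
    decoder = json.JSONDecoder()
    with open(path, "r", encoding="utf-8") as f:
        buf = f.read(read_size).lstrip()
        if not buf.startswith("["):
            raise ValueError("expected a top-level JSON array")
        buf = buf[1:]
        while True:
            buf = buf.lstrip()
            if buf.startswith(","):
                buf = buf[1:].lstrip()
            if buf.startswith("]"):
                return
            try:
                obj, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                end = None  # element is incomplete (or the file is malformed)
            if end is None or end == len(buf):
                # Read more before accepting: a value ending exactly at the
                # buffer edge (e.g. a number) may continue in the next read.
                more = f.read(read_size)
                if more:
                    buf += more
                    continue
                if end is None:
                    raise ValueError("truncated or malformed JSON array")
            yield obj
            buf = buf[end:]
```

Records from the generator can be collected in batches of a few thousand and turned into DataFrames one batch at a time, which keeps peak memory lower than building one giant list first.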
This question originally comes from Stack Exchange; asked by Alberto Alvarez.




