How to Convert JSON to a Wide-Format DataFrame with Pandas and Avoid Duplicates

阿华AIGC实验室

2026-5-9

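Solving wide-format expansion when converting JSON to a Pandas DataFrame

Here is a way to fix this. The duplicated rows most likely come from expanding the nested classes array into rows: when json_normalize is called with record_path pointing at classes (or when the list column is exploded afterwards), every array element becomes its own row, so the same text appears once per class. Without record_path, json_normalize does not expand the array at all; it leaves classes as a single column of lists. To get a wide format, where each class's confidence is its own column, first preprocess each entry in collection by turning its classes array into a key-value dictionary, and then normalize.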
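To make the cause concrete, the following is a minimal sketch of how the long format with repeated text values can arise. It assumes the original call used record_path and meta, which the question does not confirm; the names sample and long_df are made up for illustration.

```python
import pandas as pd

# Tiny sample with the same shape as the response discussed in this article.
sample = {
    "collection": [
        {
            "text": "How hot will it be today?",
            "top_class": "temperature",
            "classes": [
                {"class_name": "temperature", "confidence": 0.993},
                {"class_name": "conditions", "confidence": 0.006},
            ],
        }
    ]
}

# Assumption: the duplication came from a call like this one.
# record_path turns every element of "classes" into its own row;
# meta copies the parent fields onto each of those rows.
long_df = pd.json_normalize(
    sample["collection"],
    record_path="classes",
    meta=["text", "top_class"],
)
print(long_df)
```

One input entry yields two rows here, one per class, and both rows carry the same text. That is the repetition the wide format removes.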

Step 1: The original JSON data

First, here is the original data for reference:

response = {
    "classifier_id": "xxxxx-xx-1",
    "url": "/testers/xxxxx-xx-1",
    "collection": [
        {
            "text": "How hot will it be today?",
            "top_class": "temperature",
            "classes": [
                {"class_name": "temperature", "confidence": 0.993},
                {"class_name": "conditions", "confidence": 0.006}
            ]
        },
        {
            "text": "Is it hot outside?",
            "top_class": "temperature",
            "classes": [
                {"class_name": "temperature", "confidence": 1.0},
                {"class_name": "conditions", "confidence": 0.0}
            ]
        }
    ]
}

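Step 2: Preprocess the collection entries

Convert the classes array in each entry into a {class name: confidence} dictionary, then merge it with the existing text and top_class fields: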

import pandas as pd

# Preprocess each entry in collection
processed_collection = []
for item in response["collection"]:
    # Turn the classes array into a {class_name: confidence} dictionary
    class_dict = {cls["class_name"]: cls["confidence"] for cls in item["classes"]}
    # Merge the fields into a new flat entry
    processed_item = {
        "text": item["text"],
        "top_class": item["top_class"],
        **class_dict  # Unpack the dictionary so each class becomes its own field
    }
    processed_collection.append(processed_item)
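One edge case to keep in mind: if a class were ever named text or top_class, unpacking the dictionary would silently overwrite the original field. The sketch below shows one possible guard, a column prefix. The helper name flatten_item and the prefix "conf_" are hypothetical choices for illustration, not part of the original answer.

```python
def flatten_item(item, prefix="conf_"):
    """Flatten one collection entry into a single-level dictionary.

    `flatten_item` and `prefix` are hypothetical names for this sketch.
    Each class confidence is stored under prefix + class_name so it
    cannot collide with the "text" or "top_class" keys.
    """
    class_dict = {
        f"{prefix}{cls['class_name']}": cls["confidence"]
        for cls in item["classes"]
    }
    return {"text": item["text"], "top_class": item["top_class"], **class_dict}


entry = {
    "text": "Is it hot outside?",
    "top_class": "temperature",
    "classes": [
        {"class_name": "temperature", "confidence": 1.0},
        {"class_name": "conditions", "confidence": 0.0},
    ],
}
flat = flatten_item(entry)
print(flat)
```

With the sample data in this article there is no collision, so the unprefixed version works as well. The prefix only matters when class names are not under your control.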

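Step 3: Build the wide-format DataFrame

Now pass the preprocessed list to json_normalize to get a wide-format result with no duplicated rows. Because every entry is already a flat dictionary, pd.DataFrame(processed_collection) would give the same result: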

df = pd.json_normalize(processed_collection)
print(df)

Output

Running the code produces a DataFrame like this:

                        text    top_class  temperature  conditions
0  How hot will it be today?  temperature        0.993       0.006
1         Is it hot outside?  temperature        1.000       0.000
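As an alternative to preprocessing, the same wide table can be reached by first building the long format and then pivoting it. This is a sketch of a different technique, not the method used above. It assumes pandas 1.1 or newer (for a list-valued index in pivot) and that each (text, top_class) pair occurs only once, since pivot raises an error on duplicate index/column combinations.

```python
import pandas as pd

# Same entries as response["collection"] in Step 1, repeated so this runs alone.
collection = [
    {
        "text": "How hot will it be today?",
        "top_class": "temperature",
        "classes": [
            {"class_name": "temperature", "confidence": 0.993},
            {"class_name": "conditions", "confidence": 0.006},
        ],
    },
    {
        "text": "Is it hot outside?",
        "top_class": "temperature",
        "classes": [
            {"class_name": "temperature", "confidence": 1.0},
            {"class_name": "conditions", "confidence": 0.0},
        ],
    },
]

# Long format: one row per (entry, class) pair.
long_df = pd.json_normalize(
    collection, record_path="classes", meta=["text", "top_class"]
)

# Pivot class names into columns, then restore text/top_class as columns.
wide_df = long_df.pivot(
    index=["text", "top_class"], columns="class_name", values="confidence"
).reset_index()
wide_df.columns.name = None  # drop the leftover "class_name" axis label
print(wide_df)
```

With either approach, an entry that lacks a class present in other entries gets NaN in that column. Calling fillna(0.0) on the confidence columns is one way to handle this if a missing class should count as zero confidence.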