在PySpark中以编程方式指定Schema：从RDD创建DataFrame

阿华AIGC实验室

2026-5-19

问题分析与解决方案

首先，你遇到的核心问题是直接将JSON字符串组成的RDD传入createDataFrame并指定Schema的方式不正确——Spark并不会自动解析这些JSON字符串，无法将单个字符串映射到你定义的StructType字段；而你看到的eyeColor: strin...只是控制台输出的截断显示，真正的隐患是你的DataFrame数据会不符合预期（比如所有字段值都会是null）。

下面是两种正确的解决方案：

方法1：先解析JSON为Row对象，再创建DataFrame

我们可以先把RDD中的每个JSON字符串解析成字典，再转换为Row对象，最后传入createDataFrame并绑定自定义Schema：

from pyspark.sql.types import StructField, StructType , LongType, StringType
from pyspark.sql import Row
import json

# 初始化JSON字符串RDD
stringJsonRdd_new = sc.parallelize((
    '{"id": "123", "name": "Katie", "age": 19, "eyeColor": "brown" }',
    '{ "id": "234","name": "Michael", "age": 22, "eyeColor": "green" }',
    '{ "id": "345", "name": "Simone", "age": 23, "eyeColor": "blue" }'
))

# 定义目标Schema
mySchema = StructType([
    StructField("id", LongType(), True),
    StructField("age", LongType(), True),
    StructField("eyeColor", StringType(), True),
    StructField("name", StringType(), True)
])

# 解析JSON字符串为Row格式
parsed_rdd = stringJsonRdd_new.map(lambda x: Row(**json.loads(x)))

# 创建符合Schema的DataFrame
new_df = sqlContext.createDataFrame(parsed_rdd, mySchema)

# 验证结果
new_df.printSchema()
new_df.show()

方法2：使用Spark官方JSON读取API（更简洁高效）

Spark提供了专门的JSON数据处理工具，直接读取JSON字符串RDD并绑定Schema，这种方式是官方推荐的，代码更简洁且性能更优：

from pyspark.sql.types import StructField, StructType , LongType, StringType

# 初始化JSON字符串RDD
stringJsonRdd_new = sc.parallelize((
    '{"id": "123", "name": "Katie", "age": 19, "eyeColor": "brown" }',
    '{ "id": "234","name": "Michael", "age": 22, "eyeColor": "green" }',
    '{ "id": "345", "name": "Simone", "age": 23, "eyeColor": "blue" }'
))

# 定义目标Schema
mySchema = StructType([
    StructField("id", LongType(), True),
    StructField("age", LongType(), True),
    StructField("eyeColor", StringType(), True),
    StructField("name", StringType(), True)
])

# 直接读取RDD并绑定Schema
new_df = sqlContext.read.schema(mySchema).json(stringJsonRdd_new)

# 验证结果
new_df.printSchema()
new_df.show()