如何在PySpark Structured Streaming应用中统计单文件处理时长并自监控？

阿华AIGC实验室

2026-5-14

在PySpark Structured Streaming中测量执行时长与性能分析

当然可以！你完全可以在脚本内部实现执行时长的测量和性能分析，不用再手动复制文件后查看时长啦。下面给你几种实用的方案，适配你的Structured Streaming场景：

方案1：利用`StreamingQueryListener`监听查询生命周期

这是最贴合Structured Streaming特性的方式，能精准捕获每个批次的处理时长、输入行数等核心性能指标，非常适合持续监控流处理的运行状态。

你可以自定义一个监听器类，绑定到SparkSession上：

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType
from pyspark.sql.streaming import StreamingQueryListener
import time

class PerformanceListener(StreamingQueryListener):
    def onQueryStarted(self, event):
        print(f"查询启动时间: {time.strftime('%Y-%m-%d %H:%M:%S')}")
        self.start_time = time.time()

    def onQueryProgress(self, event):
        # 计算当前批次的处理时长
        batch_duration = time.time() - self.start_time
        progress = event.progress
        print(f"\n=== 批次 {progress.batchId} 性能指标 ===")
        print(f"处理时长: {batch_duration:.2f} 秒")
        print(f"输入行数: {progress.numInputRows}")
        print(f"处理速率: {progress.processedRowsPerSecond:.2f} 行/秒")
        self.start_time = time.time()  # 重置时间，准备监控下一个批次

    def onQueryTerminated(self, event):
        print(f"\n查询终止时间: {time.strftime('%Y-%m-%d %H:%M:%S')}")
        print(f"终止原因: {event.exception if event.exception else '正常终止'}")

if __name__ == "__main__":
    spark = SparkSession.builder.master('spark://localhost:7077').getOrCreate()
    # 注册自定义性能监听器
    listener = PerformanceListener()
    spark.streams.addListener(listener)

    # 替换成你实际的Schema定义
    pq_schema = StructType()
    df = spark.readStream.schema(pq_schema).parquet('../data/parquet')
    df.createOrReplaceTempView("vw_table")
    
    exec_query = spark.sql("""
        select sum(field_1), count(field_2), field_3
        from vw_table
        group by field_3
    """)
    
    result_q = exec_query.writeStream.outputMode("complete").format("console").start()
    result_q.awaitTermination()

这个监听器会自动监听流查询的启动、批次进度和终止事件，实时输出每个批次的处理细节，帮你直观掌握流处理的性能表现。

方案2：手动测量关键代码段的执行时间

如果你只需要关注特定环节的耗时（比如Schema加载、查询初始化等），可以用time模块手动打点，快速定位单个步骤的耗时：

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType
import time

if __name__ == "__main__":
    start_total = time.time()
    
    # 测量SparkSession初始化耗时
    start_spark = time.time()
    spark = SparkSession.builder.master('spark://localhost:7077').getOrCreate()
    print(f"SparkSession初始化耗时: {time.time() - start_spark:.2f} 秒")

    # 测量DataFrame与视图初始化耗时
    start_df = time.time()
    pq_schema = StructType()  # 替换成你的实际Schema
    df = spark.readStream.schema(pq_schema).parquet('../data/parquet')
    df.createOrReplaceTempView("vw_table")
    print(f"DataFrame与视图初始化耗时: {time.time() - start_df:.2f} 秒")

    # 测量查询编译耗时
    start_query = time.time()
    exec_query = spark.sql("""
        select sum(field_1), count(field_2), field_3
        from vw_table
        group by field_3
    """)
    print(f"查询编译耗时: {time.time() - start_query:.2f} 秒")

    # 启动流查询并等待终止
    result_q = exec_query.writeStream.outputMode("complete").format("console").start()
    result_q.awaitTermination()
    print(f"\n总执行时长: {time.time() - start_total:.2f} 秒")

这种方式适合离线调试阶段快速定位单个环节的性能瓶颈，但无法跟踪每个流批次的处理情况。

额外小提示：结合Spark UI做深度分析

虽然你要在脚本内实现测量，但也可以在脚本启动后，访问http://localhost:4040（默认端口）打开Spark UI，里面的Streaming标签页能看到更详细的批次统计、任务执行时间、Shuffle情况等。把脚本内的日志和Spark UI的数据结合起来，能更全面地分析性能瓶颈。

内容的提问来源于stack exchange，提问作者user307283