How do I export the application list on the Apache Spark History Server home page to CSV?

阿华AIGC实验室

2026-5-20

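I helped my team with a similar requirement a while back, so here are a few practical approaches that export exactly the fields you asked for:

Method 1: Use the Spark History Server REST API (recommended)

The Spark History Server ships with a structured REST API. This is the most stable and least error-prone route, and it requires no HTML parsing at all.

Steps:

First confirm your History Server address. The default is usually http://<your-history-server-host>:18080
Call the application list endpoint to get the data as JSON: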
```
curl "http://<your-history-server-host>:18080/api/v1/applications"
```
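The response is a JSON array with one object per application. id and name sit at the top level; the remaining fields sit inside each entry of the application's attempts array:
- App ID → id
- App Name → name
- Started → attempts[].startTime (a timestamp string; recent Spark versions also expose startTimeEpoch in epoch milliseconds)
- Completed → attempts[].endTime (for an application that is still running, check the completed flag, which is false, rather than relying on endTime)
- Duration → attempts[].duration (milliseconds)
- Spark User → attempts[].sparkUser
- Last Updated → attempts[].lastUpdated (a timestamp string; lastUpdatedEpoch is the epoch-millisecond form)
Use jq (a lightweight command-line JSON processor) to convert the response straight to CSV:
Make sure jq is installed, then run this command to produce the CSV file:
```
curl -s "http://<your-history-server-host>:18080/api/v1/applications" | jq -r '["App ID","App Name","Started","Completed","Duration","Spark User","Last Updated"], (.[] | .id as $id | .name as $name | .attempts[] | [$id, $name, (.startTimeEpoch / 1000 | floor | todate), (if .completed then (.endTimeEpoch / 1000 | floor | todate) else "Running" end), (.duration | tostring + " ms"), .sparkUser, (.lastUpdatedEpoch / 1000 | floor | todate)]) | @csv' > spark_apps.csv
```
Notes:
- todate converts epoch seconds into an ISO 8601 timestamp. The *Epoch fields are in milliseconds, hence the division by 1000 and the floor
- The if .completed ... else "Running" end branch handles applications that have not finished yet
- An application with several attempts produces one row per attempt
- To show Duration in seconds, replace (.duration | tostring + " ms") with (.duration / 1000 | tostring + " s")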
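
If jq is not available, the same export can be sketched in Python with only the standard library. This is a minimal sketch, not a definitive implementation. It assumes the response is an array of applications, each with an attempts array whose entries carry startTimeEpoch, endTimeEpoch, and lastUpdatedEpoch in milliseconds plus a completed flag. The helper names fetch_apps, to_rows, and write_csv are made up for illustration and are not part of Spark.

```python
import csv
import io
import json
import urllib.request
from datetime import datetime, timezone

HEADER = ["App ID", "App Name", "Started", "Completed",
          "Duration", "Spark User", "Last Updated"]


def fetch_apps(base_url):
    """Download the application list as parsed JSON (hypothetical helper)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/applications") as resp:
        return json.load(resp)


def fmt_epoch_ms(ms):
    """Format epoch milliseconds as an ISO 8601 UTC timestamp."""
    moment = datetime.fromtimestamp(ms / 1000, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


def fmt_duration(ms):
    """Format a duration in milliseconds as H:MM:SS."""
    minutes, seconds = divmod(ms // 1000, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


def to_rows(apps):
    """Yield one CSV row per application attempt."""
    for app in apps:
        for attempt in app.get("attempts", []):
            # Assumption: "completed" is false while the application runs
            completed = attempt.get("completed", False)
            yield [
                app["id"],
                app["name"],
                fmt_epoch_ms(attempt["startTimeEpoch"]),
                fmt_epoch_ms(attempt["endTimeEpoch"]) if completed else "Running",
                fmt_duration(attempt["duration"]),
                attempt["sparkUser"],
                fmt_epoch_ms(attempt["lastUpdatedEpoch"]),
            ]


def write_csv(apps, out):
    """Write the header and all attempt rows to an open text file object."""
    writer = csv.writer(out)
    writer.writerow(HEADER)
    writer.writerows(to_rows(apps))
```

To use it, call fetch_apps with your History Server base URL, open spark_apps.csv with newline="" and encoding="utf-8", and pass the result and the file object to write_csv. Keeping to_rows separate from the network call makes the conversion easy to check against a saved JSON response.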

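Method 2: Web scraping (fallback when the API is unavailable)

If permissions or an old Spark version keep you from using the API, you can parse the History Server home page HTML with a scraping tool. Here is an example Python script using BeautifulSoup:

Install the dependencies first: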
```
pip install requests beautifulsoup4
```

Write the scraping script:

```
import csv
import sys

import requests
from bs4 import BeautifulSoup

# Replace with your History Server address
HISTORY_SERVER_URL = "http://<your-history-server-host>:18080"
response = requests.get(HISTORY_SERVER_URL, timeout=30)
response.raise_for_status()  # stop early on HTTP errors
soup = BeautifulSoup(response.text, 'html.parser')

# Locate the application list table
app_table = soup.find('table', class_='table table-striped table-bordered')
if app_table is None:
    sys.exit("No application table found in the page HTML")

# Write the CSV file
with open('spark_apps.csv', 'w', newline='', encoding='utf-8') as csv_file:
    writer = csv.writer(csv_file)
    # Header row
    writer.writerow(["App ID", "App Name", "Started", "Completed", "Duration", "Spark User", "Last Updated"])

    for row in app_table.find_all('tr'):
        # Header rows use <th> cells, so they yield no <td> and are skipped
        columns = [td.get_text(strip=True) for td in row.find_all('td')]
        if len(columns) < 7:
            continue
        # Column order is assumed to match the page table; check it against your page
        writer.writerow(columns[:7])
```

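Note: this method depends on the HTML structure of the History Server page. If a Spark upgrade changes the page layout, the script will need adjusting. Newer Spark versions also build the application table in the browser with JavaScript, in which case the static HTML fetched by requests contains no table rows and the script finds nothing. Prefer the API approach whenever you can.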

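Method 3: Read the event log storage directly (cluster administrators only)

If you are a cluster administrator with access to the log directory the History Server reads from (the spark.history.fs.logDirectory setting, which can be a local directory or HDFS), you can read the event log files in it directly. Each application writes an event log named after its application ID, with an .inprogress suffix while it is still running; finished logs carry no .completed suffix. The file holds one JSON event per line rather than a single metadata record, so you take the application name, ID, user, and start time from the SparkListenerApplicationStart event and the end time from SparkListenerApplicationEnd, then write them out as CSV. This needs storage access and is more work than the API, so treat it as a last resort.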
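
As a rough illustration of the event log route, the sketch below pulls the application metadata out of the lines of one log. It assumes an uncompressed, single-file event log with one JSON event per line, and that the start and end events use the keys "Event", "App ID", "App Name", "User", and "Timestamp". Compressed or rolling logs would need extra handling that is not shown, and summarize_event_log is a hypothetical helper name.

```python
import json


def summarize_event_log(lines):
    """Extract application metadata from the lines of one event log."""
    info = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        kind = event.get("Event")
        if kind == "SparkListenerApplicationStart":
            info["id"] = event.get("App ID")
            info["name"] = event.get("App Name")
            info["user"] = event.get("User")
            info["start_ms"] = event.get("Timestamp")
        elif kind == "SparkListenerApplicationEnd":
            info["end_ms"] = event.get("Timestamp")
    # Duration is only known once both the start and end events were seen
    if "start_ms" in info and "end_ms" in info:
        info["duration_ms"] = info["end_ms"] - info["start_ms"]
    return info
```

For a local log, opening the file in text mode and passing the file object as lines is enough, since the function only iterates over it. A log that is still in progress has no end event, so the result simply lacks end_ms and duration_ms.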