How to reproduce a model-ranking chart that combines runtime and MAE

阿华AIGC实验室

2026-4-14

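First, let me pin down the core problem: you need to keep both **MAE (lower is better) and runtime (shorter is better)**, and you want a visualization that reflects each model's overall ranking. One fix to the original data before anything else: a Python list cannot contain a bare time value such as 0:00:43.387145. That is a SyntaxError, raised before any DataFrame is built, so store the runtimes as strings and parse them later.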
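
To make the string-to-seconds step concrete, here is a minimal standard-library sketch. The helper name `runtime_to_seconds` is hypothetical (it is not part of the original answer), and it assumes every runtime has the `H:MM:SS.ffffff` shape used in the data:

```python
from datetime import timedelta


def runtime_to_seconds(text: str) -> float:
    """Convert an 'H:MM:SS.ffffff' string to a number of seconds."""
    # Assumption: exactly three colon-separated fields, as in the sample data
    hours, minutes, seconds = text.split(":")
    delta = timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))
    return delta.total_seconds()


print(runtime_to_seconds("0:00:43.387145"))
print(runtime_to_seconds("0:28:11.761681"))
```

The pandas call `pd.to_timedelta(...).dt.total_seconds()` performs the same conversion on a whole column at once.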

Now let's build a solution that gets close to the target, step by step:

1. Data preprocessing: unify the format and compute an overall rank

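First convert the runtimes to seconds, then rank MAE and runtime separately, and finally average the two ranks to get an overall rank (both metrics are lower-is-better, so both are ranked in ascending order):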

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Fixed data format: runtimes are stored as strings
data1 = {
    'Models': ['LinearRegression', 'Random Forest', 'XGBoost', 'MLPRegressor', 'SVR', 'CatBoostRegressor', 'LGBMRegressor'],
    'MAE': [4.906, 2.739, 2.826, 5.234, 5.061, 2.454, 2.76],
    'Runtime': ['0:00:43.387145', '0:28:11.761681', '0:03:58.883474', '0:01:44.252276' , '0:04:52.754769', '0:19:36.925169', '0:04:51.223103']
}
data2 = {
    'Models': ['LinearRegression', 'Random Forest', 'XGBoost', 'MLPRegressor', 'SVR', 'CatBoostRegressor', 'LGBMRegressor'],
    'MAE': [4.575, 2.345, 2.129, 4.414, 4.353, 2.281, 2.511],
    'Runtime': ['0:00:45.055854', '0:10:55.468473', '0:01:01.575033' , '0:00:31.231719' , '0:02:12.258870', '0:08:16.526615' , '0:15:25.084937']
}
data3 = {
    'Models': ['LinearRegression', 'Random Forest', 'XGBoost', 'MLPRegressor', 'SVR', 'CatBoostRegressor', 'LGBMRegressor'],
    'MAE': [4.575, 2.345, 2.129, 4.414, 4.353, 2.281, 2.511],
    'Runtime': ['0:00:40.055854', '0:11:55.468473', '0:01:03.575033' , '0:00:29.231719' , '0:02:02.258870', '0:07:16.526615' , '0:13:25.084937']
}

# Build the DataFrames and derive the runtime metric
dfs = [pd.DataFrame(data) for data in [data1, data2, data3]]
for df in dfs:
    # Convert the runtime strings to total seconds so they can be ranked
    df['Runtime_seconds'] = pd.to_timedelta(df['Runtime']).dt.total_seconds()
    # Per-metric ranks: rank 1 is the lowest MAE and the shortest runtime
    df['MAE_Rank'] = df['MAE'].rank(ascending=True)
    df['Runtime_Rank'] = df['Runtime_seconds'].rank(ascending=True)
    # Overall rank: the mean of the two ranks; lower means better overall
    df['Overall_Rank'] = (df['MAE_Rank'] + df['Runtime_Rank']) / 2

# Tag each frame with its dataset name, then merge the three
for idx, df in enumerate(dfs, 1):
    df['Dataset'] = f'Dataset {idx}'
combined_df = pd.concat(dfs, ignore_index=True)
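
To see what the rank averaging does to the numbers, here is a small pure-Python sketch applied to Dataset 1. The helper `average_rank` is hypothetical; it is meant to mimic the default tie handling of pandas `Series.rank` (tied values share the mean of their positions), and the runtimes are the Dataset 1 strings already converted to seconds:

```python
def average_rank(values):
    """Rank ascending (1 = smallest); tied values share the mean of their positions."""
    ordered = sorted(values)
    ranks = []
    for value in values:
        first = ordered.index(value) + 1   # 1-based position of the first occurrence
        ties = ordered.count(value)        # how many entries share this value
        ranks.append(first + (ties - 1) / 2)
    return ranks


models = ['LinearRegression', 'Random Forest', 'XGBoost', 'MLPRegressor',
          'SVR', 'CatBoostRegressor', 'LGBMRegressor']
mae = [4.906, 2.739, 2.826, 5.234, 5.061, 2.454, 2.76]
# Dataset 1 runtimes in seconds
runtime_seconds = [43.387145, 1691.761681, 238.883474, 104.252276,
                   292.754769, 1176.925169, 291.223103]

mae_rank = average_rank(mae)
runtime_rank = average_rank(runtime_seconds)
overall_rank = [(m + r) / 2 for m, r in zip(mae_rank, runtime_rank)]

for name, m, r, o in zip(models, mae_rank, runtime_rank, overall_rank):
    print(f"{name:<18} MAE rank {m:.1f}  runtime rank {r:.1f}  overall {o:.1f}")
```

With these numbers LinearRegression gets the best overall rank (3.0) despite its high MAE, because it is by far the fastest model. Equal weighting treats speed and accuracy as equally important, which is worth keeping in mind when reading the charts.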

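2. Three ways to reproduce the chart

Depending on what your reference chart looks like, here are three implementations in different styles. Each one conveys MAE, runtime, and overall rank together:

Option 1: Dual-axis scatter plot (shows how the two metrics relate)

This option shows the MAE and runtime of every model on each dataset, with the models ordered by overall rank: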

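sns.set_style("whitegrid")
fig, ax1 = plt.subplots(figsize=(12, 8))

# Order models by median overall rank so the x-axis runs from best to worst
model_order = combined_df.groupby('Models')['Overall_Rank'].median().sort_values().index

# Sort the rows into that order. A categorical axis follows order of first appearance,
# so relabeling the ticks afterwards would attach the wrong names to the points.
order_map = {model: pos for pos, model in enumerate(model_order)}
plot_df = combined_df.sort_values('Models', key=lambda s: s.map(order_map), kind='stable')

# MAE on the left axis (circles)
sns.scatterplot(data=plot_df, x='Models', y='MAE', hue='Dataset', s=150, marker='o', ax=ax1)
# Runtime on a second y-axis (squares)
ax2 = ax1.twinx()
sns.scatterplot(data=plot_df, x='Models', y='Runtime_seconds', hue='Dataset', s=150, marker='s', ax=ax2, legend=False)
ax2.grid(False)  # avoid drawing a second, misaligned grid

# Style the chart; labels are set on explicit axes because twinx() changes the current axes
plt.setp(ax1.get_xticklabels(), rotation=45, ha='right')
ax1.set_ylabel('MAE (lower is better, circles)')
ax2.set_ylabel('Runtime in seconds (lower is better, squares)')
ax1.set_title('Model accuracy and runtime across datasets')
plt.tight_layout()
plt.show()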

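Option 2: Overall-rank heatmap (summarizes performance across datasets)

If the reference chart is more of a ranking summary, this heatmap shows each model's overall rank on each dataset, accompanied by a table of metric statistics: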

# Build the rank matrix: rows are models, columns are datasets
rank_matrix = combined_df.pivot(index='Models', columns='Dataset', values='Overall_Rank').reindex(model_order)

plt.figure(figsize=(10, 6))
sns.heatmap(rank_matrix, annot=True, cmap='coolwarm_r', fmt='.1f', cbar_kws={'label': 'Overall rank (lower is better)'})
plt.title('Overall rank of each model per dataset')
plt.tight_layout()
plt.show()

# Print a summary table that keeps both MAE and runtime
summary_table = combined_df.groupby('Models').agg(
    Mean_MAE=('MAE', 'mean'),
    MAE_Std=('MAE', 'std'),
    Mean_Runtime=('Runtime_seconds', lambda x: str(pd.to_timedelta(x.mean(), unit='s'))),
    Runtime_Std=('Runtime_seconds', lambda x: str(pd.to_timedelta(x.std(), unit='s')))
).reindex(model_order)
print("Model performance and runtime summary:")
print(summary_table.round(2))

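Option 3: Normalized two-metric bar chart (a combined view similar to the reference chart)

If the reference chart sorts the models by overall rank and shows how they compare on both metrics, normalized bars can do it: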

# Min-max normalize MAE and runtime to 0-1 across all datasets (0 = best, 1 = worst)
normalized_df = combined_df.copy()
normalized_df['MAE_norm'] = (normalized_df['MAE'] - normalized_df['MAE'].min()) / (normalized_df['MAE'].max() - normalized_df['MAE'].min())
normalized_df['Runtime_norm'] = (normalized_df['Runtime_seconds'] - normalized_df['Runtime_seconds'].min()) / (normalized_df['Runtime_seconds'].max() - normalized_df['Runtime_seconds'].min())

sns.set_style("whitegrid")
plt.figure(figsize=(12, 8))

# Three datasets per model: 3 * 0.25 keeps neighboring groups from overlapping
bar_width = 0.25
x = range(len(model_order))

# One stacked bar per dataset, grouped by model
for idx, dataset in enumerate(normalized_df['Dataset'].unique()):
    subset = normalized_df[normalized_df['Dataset'] == dataset].set_index('Models').reindex(model_order)
    # Normalized MAE at the bottom
    plt.bar([xi + idx*bar_width for xi in x], subset['MAE_norm'], width=bar_width, label=f'{dataset} - MAE (normalized)')
    # Normalized runtime stacked on top of MAE
    plt.bar([xi + idx*bar_width for xi in x], subset['Runtime_norm'], width=bar_width, bottom=subset['MAE_norm'], label=f'{dataset} - Runtime (normalized)')

# Style the chart; ticks sit under the middle bar of each group
plt.xticks([xi + bar_width for xi in x], model_order, rotation=45, ha='right')
plt.ylabel('Normalized value (0 = best, 1 = worst)')
plt.title('Two-metric comparison, models sorted by overall rank')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
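
The min-max scaling behind Option 3 can be sketched as a standalone function. The name `min_max` is hypothetical, and the zero-range guard is an added assumption: the one-line formula divides by zero when every value in a column is identical, so this sketch maps that case to 0.0 instead:

```python
def min_max(values):
    """Scale values to the 0-1 range; the smallest maps to 0, the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: treat a constant column as "all equally good"
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


print(min_max([2, 4, 6]))
print(min_max([3, 3]))
```

Because both metrics are lower-is-better, a scaled value of 0 is the best observed result and 1 is the worst, which is why the two scaled values can be stacked in one bar.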

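3. Ranking and frequency statistics that keep runtime

Your earlier ranking table dropped the runtime information. You can now build a ranking summary that includes every key metric, and also compute how often the top models appear: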

# Summary table sorted by mean overall rank
ranked_summary = combined_df.groupby('Models').agg(
    Mean_MAE=('MAE', 'mean'),
    Mean_Runtime=('Runtime_seconds', lambda x: str(pd.to_timedelta(x.mean(), unit='s'))),
    Mean_Overall_Rank=('Overall_Rank', 'mean')
).sort_values('Mean_Overall_Rank').reset_index()

# Share of datasets in which each model places in the top 3 by overall rank.
# method='min' lets tied models share a place, so a dataset can have more than three.
# (The earlier version counted datasets, not placements, and always reported 100%.)
in_top3 = combined_df.groupby('Dataset')['Overall_Rank'].rank(method='min') <= 3
top3_freq = in_top3.groupby(combined_df['Models']).mean() * 100
ranked_summary['Top3_Frequency_%'] = ranked_summary['Models'].map(top3_freq)

print("Overall model ranking with runtime retained:")
print(ranked_summary.round(2))

That gives you charts that show accuracy, speed, and overall rank together, without dropping either metric.

Note: content sourced from Stack Exchange; the question was asked by Mario.