生产环境中可能出现部分大查询耗尽整个集群资源的情况,这会影响其他查询的性能与可用性,需迅速定位最近占用资源较多的大查询或大作业并处理,以降低对业务的整体影响。本文将指导您如何快速定位并处理这些大查询。
产品形态 | 引擎 | 部署方式 | 适用版本 |
|---|---|---|---|
Serverless 实例 | Doris | 存算一体 |
|
您可以通过 Doris 自身的审计日志进行诊断,对运行结束的作业进行定位,判断是否为需要治理的大查询作业。诊断时:
说明
cpuCost 可理解为占用的 CPU 时间片,例如集群为 8 * 48core,如果查询的 cpuCost = 3840秒,意味着需要将集群 CPU 打满 100% 持续 10s 才能查询结束。
-- 统计 Insert/no-Insert 查询类型每小时的相关指标 SELECT DATE_FORMAT(time, '%Y-%m-%d %H:00:00') AS dt, CASE WHEN (stmt LIKE '%INSERT%' OR stmt LIKE '%insert%') THEN 'isInsert' ELSE 'isQuery' END AS `type`, ROUND(SUM(scan_bytes / 1024 / 1024), 0) AS sum_scan_mb, ROUND(AVG(scan_bytes / 1024 / 1024), 0) AS avg_scan_mb, ROUND(SUM(cpu_time_ms / 1000), 0) AS sum_cpu_time_sec, ROUND(AVG(cpu_time_ms / 1000), 0) AS avg_cpu_time_sec, ROUND(SUM(query_time / 1000), 0) AS sum_query_time_sec, ROUND(AVG(query_time / 1000), 0) AS avg_query_time_sec, COUNT(*) AS total_query_count FROM __internal_schema.audit_log WHERE time BETWEEN '2025-06-19 00:00:00' AND '2025-06-25 23:59:00' AND stmt != 'SELECT 1' AND stmt NOT LIKE '%@@%' AND user != 'root' GROUP BY 1, 2 ORDER BY 2, 1; -- 统计 cpuCost 最高的 SQL SELECT time, user, db, state, error_code, error_message, query_time / 1000 AS queryTimeSec, ROUND(scan_bytes / 1024 / 1024 / 1024, 2) AS scanGB, scan_rows, ROUND(cpu_time_ms / 1000, 0) AS cpuCostSec, ROUND(peak_memory_bytes / 1024 / 1024 / 1024, 2) AS memCostGB, stmt FROM __internal_schema.audit_log WHERE time BETWEEN '2025-06-21 10:00:00' AND '2025-06-21 11:00:00' ORDER BY cpu_time_ms DESC LIMIT 50; -- 按 digest 统计 cpuCost(digest: sql 指纹,通常可以代表一类 sql,可以判断出用户的接口) -- Doris 默认开启 SELECT sql_digest, AVG(query_time) / 1000 AS avgQueryTimeSec, ROUND(AVG(scan_bytes) / 1024 / 1024 / 1024, 2) AS avgScanGB, AVG(scan_rows) AS avgScanRows, ROUND(AVG(cpu_time_ms) / 1000, 0) AS avgCpuCostSec, ROUND(MAX(peak_memory_bytes) / 1024 / 1024 / 1024, 2) AS maxMemCostGB FROM __internal_schema.audit_log WHERE time BETWEEN '2025-06-21 10:00:00' AND '2025-06-21 11:00:00' GROUP BY sql_digest ORDER BY avgCpuCostSec DESC LIMIT 50;
查看当前查询的资源消耗。详情请参考:Doris 查看运行中的查询,以下为几个典型场景的查询示例。
CPU 使用 topN 的 SQL
SELECT t1.query_id AS be_query_id, t1.query_type, t2.query_id, t2.workload_group_id, t2.`database`, t1.cpu_time, t2.`sql` FROM ( SELECT query_id, query_type, SUM(task_cpu_time_ms) AS cpu_time FROM information_schema.backend_active_tasks GROUP BY query_id, query_type ) t1 LEFT JOIN active_queries t2 ON t1.query_id = t2.query_id WHERE query_type != 'LOAD' ORDER BY cpu_time DESC LIMIT 10;
内存使用 topN 的 SQL
SELECT t1.query_id AS be_query_id, t1.query_type, t2.query_id, t2.workload_group_id, t1.mem_used FROM ( SELECT query_id, query_type, SUM(current_used_memory_bytes) AS mem_used FROM information_schema.backend_active_tasks GROUP BY query_id, query_type ) t1 LEFT JOIN active_queries t2 ON t1.query_id = t2.query_id WHERE query_type != 'LOAD' ORDER BY mem_used DESC LIMIT 10;
扫描数据量 topN 的 SQL
SELECT t1.query_id AS be_query_id, t1.query_type, t2.query_id, t2.workload_group_id, t1.scan_rows, t1.scan_bytes FROM ( SELECT query_id, query_type, SUM(scan_rows) AS scan_rows, SUM(scan_bytes) AS scan_bytes FROM information_schema.backend_active_tasks GROUP BY query_id, query_type ) t1 LEFT JOIN active_queries t2 ON t1.query_id = t2.query_id WHERE query_type != 'LOAD' ORDER BY scan_rows DESC, scan_bytes DESC LIMIT 10;
返回结果通常包括以下几列。
终止 CPU 占用时间(CPUTime)较高的查询(query)。详情请参考:Doris Kill。
基于 queryId 终止大查询。
-- kill对应的queryId kill query {queryId};
如果版本较低,可能只支持直接 kill connection。那就需要先执行 show processlist 获取连接信息,再根据 query_id 找到对应的连接 Id,命令详情请参见SHOW PROCESSLIST、终止查询。以下为一个简单的示例。
-- 比如我们拿到 query_id 1710f53deaed42b6-aa97fd39292eef56 SET SHOW_ALL_FE_CONNECTION = TRUE; SHOW FULL PROCESSLIST; +------------------+------+--------------------------------------+---------------------+---------------------+----------+----------+---------+-------+-------+-----------------------------------+------+--------------+ | CurrentConnected | Id | User | Host | LoginTime | Catalog | Db | Command | Time | State | QueryId | FE | CloudCluster | +------------------+------+--------------------------------------+---------------------+---------------------+----------+----------+---------+-------+-------+-----------------------------------+------+--------------+ | No | 4676 | doris_ad_high_dim_local_ssd_emr_lf_r | [fdbd:dc01:a:391:0:0:0:173]:37462 | 2026-01-27 18:07:54 | internal | information_schema | Sleep | 48 | ERR | errCode = 2, detailMessage = mismatched input '{' expecting {')', ',', ';', 'ORDER'}(line 1, pos 532) | 2605:340:cd51:4500:632a:3213:457f:88ef | NULL | | No | 4358 | root | 127.0.0.1:16210 | 2026-01-27 16:12:23 | internal | ad_high_dim_local | Sleep | 231 | EOF | 1710f53deaed42b6-aa97fd39292eef56 | 2605:340:cd51:4500:632a:3213:457f:88ef | NULL | +------------------+------+--------------------------------------+---------------------+---------------------+----------+----------+---------+-------+-------+-----------------------------------+------+--------------+ -- 可以看到 1710f53deaed42b6-aa97fd39292eef56 对应第二条记录,Connection Id 为 4358 -- 直接 kill connection KILL 4358;