You need to enable JavaScript to run this app.
文档中心
E-MapReduce

E-MapReduce

复制全文
下载 pdf
大查询防御建设
大查询定位
复制全文
下载 pdf
大查询定位

生产环境中可能出现部分大查询耗尽整个集群资源的情况,这会影响其他查询的性能与可用性,需迅速定位最近占用资源较多的大查询或大作业并处理,以降低对业务的整体影响。本文将指导您如何快速定位并处理这些大查询。

适用版本

产品形态

引擎

部署方式

适用版本

Serverless 实例
(全托管)

Doris

存算一体

  • 2.1.x版本及以上
  • 1.2.x版本及以上

操作步骤

大查询定位:运行结束大查询/作业

您可以通过 Doris 自身的审计日志进行诊断,对运行结束的作业进行定位,判断是否为需要治理的大查询作业。诊断时:

  • 您可以主要关注 cpuCost、scanBytes、memCostBytes、queryTime 指标。

    说明

    cpuCost 可理解为占用的 CPU 时间片,例如集群为 8 * 48core,如果查询的 cpuCost = 3840秒,意味着需要将集群 CPU 打满 100% 持续 10s 才能查询结束。

  • 通过以下查询可定位 Top 任务,同时结合调度周期 / SQL 进行优化。即,结合监控初步定位硬件资源的瓶颈(cpu/mem/disk),再根据审计日志表定为到 cpu/mem/disk 使用资源最大的 Top 任务。
    以下为 Doris 审计日志分析的示例代码。
    -- 统计 Insert/no-Insert 查询类型每小时的相关指标
    SELECT
        DATE_FORMAT(time, '%Y-%m-%d %H:00:00') AS dt,
        CASE
            WHEN (stmt LIKE '%INSERT%' OR stmt LIKE '%insert%') THEN 'isInsert'
            ELSE 'isQuery'
        END AS `type`,
        ROUND(SUM(scan_bytes / 1024 / 1024), 0) AS sum_scan_mb,
        ROUND(AVG(scan_bytes / 1024 / 1024), 0) AS avg_scan_mb,
        ROUND(SUM(cpu_time_ms / 1000), 0) AS sum_cpu_time_sec,
        ROUND(AVG(cpu_time_ms / 1000), 0) AS avg_cpu_time_sec,
        ROUND(SUM(query_time / 1000), 0) AS sum_query_time_sec,
        ROUND(AVG(query_time / 1000), 0) AS avg_query_time_sec,
        COUNT(*) AS total_query_count
    FROM __internal_schema.audit_log
    WHERE time BETWEEN '2025-06-19 00:00:00' AND '2025-06-25 23:59:00'
      AND stmt != 'SELECT 1'
      AND stmt NOT LIKE '%@@%'
      AND user != 'root'
    GROUP BY 1, 2
    ORDER BY 2, 1;
    
    -- 统计 cpuCost 最高的 SQL
    SELECT
        time,
        user,
        db,
        state,
        error_code,
        error_message,
        query_time / 1000 AS queryTimeSec,
        ROUND(scan_bytes / 1024 / 1024 / 1024, 2) AS scanGB,
        scan_rows,
        ROUND(cpu_time_ms / 1000, 0) AS cpuCostSec,
        ROUND(peak_memory_bytes / 1024 / 1024 / 1024, 2) AS memCostGB,
        stmt
    FROM __internal_schema.audit_log
    WHERE time BETWEEN '2025-06-21 10:00:00' AND '2025-06-21 11:00:00'
    ORDER BY cpu_time_ms DESC
    LIMIT 50;
    
    -- 按 digest 统计 cpuCost(digest: sql 指纹,通常可以代表一类 sql,可以判断出用户的接口)
    -- Doris 默认开启
    SELECT
        sql_digest,
        AVG(query_time) / 1000 AS avgQueryTimeSec,
        ROUND(AVG(scan_bytes) / 1024 / 1024 / 1024, 2) AS avgScanGB,
        AVG(scan_rows) AS avgScanRows,
        ROUND(AVG(cpu_time_ms) / 1000, 0) AS avgCpuCostSec,
        ROUND(MAX(peak_memory_bytes) / 1024 / 1024 / 1024, 2) AS maxMemCostGB
    FROM __internal_schema.audit_log
    WHERE time BETWEEN '2025-06-21 10:00:00' AND '2025-06-21 11:00:00'
    GROUP BY sql_digest
    ORDER BY avgCpuCostSec DESC
    LIMIT 50;
    

大查询定位:运行中大查询/作业

  1. 查看当前查询的资源消耗。详情请参考:Doris 查看运行中的查询,以下为几个典型场景的查询示例。

    • CPU 使用 topN 的 SQL

      SELECT
          t1.query_id AS be_query_id,
          t1.query_type,
          t2.query_id,
          t2.workload_group_id,
          t2.`database`,
          t1.cpu_time,
          t2.`sql`
      FROM (
          SELECT
              query_id,
              query_type,
              SUM(task_cpu_time_ms) AS cpu_time
          FROM information_schema.backend_active_tasks
          GROUP BY query_id, query_type
      ) t1
      LEFT JOIN active_queries t2
          ON t1.query_id = t2.query_id
      WHERE query_type != 'LOAD'
      ORDER BY cpu_time DESC
      LIMIT 10;
      
    • 内存使用 topN 的 SQL

      SELECT
          t1.query_id AS be_query_id,
          t1.query_type,
          t2.query_id,
          t2.workload_group_id,
          t1.mem_used
      FROM (
          SELECT
              query_id,
              query_type,
              SUM(current_used_memory_bytes) AS mem_used
          FROM information_schema.backend_active_tasks
          GROUP BY query_id, query_type
      ) t1
      LEFT JOIN active_queries t2
          ON t1.query_id = t2.query_id
      WHERE query_type != 'LOAD'
      ORDER BY mem_used DESC
      LIMIT 10;
      
    • 扫描数据量 topN 的 SQL

      SELECT
          t1.query_id AS be_query_id,
          t1.query_type,
          t2.query_id,
          t2.workload_group_id,
          t1.scan_rows,
          t1.scan_bytes
      FROM (
          SELECT
              query_id,
              query_type,
              SUM(scan_rows) AS scan_rows,
              SUM(scan_bytes) AS scan_bytes
          FROM information_schema.backend_active_tasks
          GROUP BY query_id, query_type
      ) t1
      LEFT JOIN active_queries t2
          ON t1.query_id = t2.query_id
      WHERE query_type != 'LOAD'
      ORDER BY scan_rows DESC, scan_bytes DESC
      LIMIT 10;
      

    返回结果通常包括以下几列。

    • be_query_id
    • query_type:SELECT、LOAD,表明查询、导入等
    • workload_group_id:workload group 负载隔离分组信息
    • database:库名
    • cpu_time:当前查询已使用的CPU时间,单位秒。此为多个线程累加的CPU时间,举个例子,如果有两个线程分别占用 1 秒和 2 秒的CPU时间,那么累加起来的 CPU 时间为 3 秒
    • mem_used:当前占用的内存。如果查询涉及到多个BE节点,此值为该查询在所有BE节点上占用内存之和
  2. 终止 CPU 占用时间(CPUTime)较高的查询(query)。详情请参考:Doris Kill

    • 基于 queryId 终止大查询。

      -- kill对应的queryId
      kill query {queryId};
      
    • 如果版本较低,可能只支持直接 kill connection。那就需要先执行 show processlist 获取连接信息,再根据 query_id 找到对应的连接 Id,命令详情请参见SHOW PROCESSLIST终止查询。以下为一个简单的示例。

      -- 比如我们拿到 query_id 1710f53deaed42b6-aa97fd39292eef56
      SET SHOW_ALL_FE_CONNECTION = TRUE;
      SHOW FULL PROCESSLIST;
      +------------------+------+--------------------------------------+---------------------+---------------------+----------+----------+---------+-------+-------+-----------------------------------+------+--------------+
      | CurrentConnected | Id   | User                                 | Host                | LoginTime           | Catalog  | Db       | Command | Time  | State | QueryId                           | FE   | CloudCluster |
      +------------------+------+--------------------------------------+---------------------+---------------------+----------+----------+---------+-------+-------+-----------------------------------+------+--------------+
      | No               | 4676 | doris_ad_high_dim_local_ssd_emr_lf_r | [fdbd:dc01:a:391:0:0:0:173]:37462 | 2026-01-27 18:07:54 | internal | information_schema | Sleep | 48 | ERR | errCode = 2, detailMessage = mismatched input '{' expecting {')', ',', ';', 'ORDER'}(line 1, pos 532) | 2605:340:cd51:4500:632a:3213:457f:88ef | NULL |
      | No               | 4358 | root                                 | 127.0.0.1:16210     | 2026-01-27 16:12:23 | internal | ad_high_dim_local | Sleep | 231 | EOF | 1710f53deaed42b6-aa97fd39292eef56 | 2605:340:cd51:4500:632a:3213:457f:88ef | NULL |
      +------------------+------+--------------------------------------+---------------------+---------------------+----------+----------+---------+-------+-------+-----------------------------------+------+--------------+
      
      -- 可以看到 1710f53deaed42b6-aa97fd39292eef56 对应第二条记录,Connection Id 为 4358
      -- 直接 kill connection
      KILL 4358;
      

若无法快速定位大查询且需要迅速止损

  1. 若无法快速定位运行中的大查询,建议重启BE以终止所有查询恢复整体业务。
  2. 重启止损后,可参考 SQL诊断使用说明 或审计日志表对历史查询进行统计,结合业务排查大查询的来源并进行查询优化,详细操作请参考:查询调优
    -- 查看最近1h内CPU消耗最大的查询Top10
    SELECT * FROM __internal_schema.audit_log
    where time >= DATE_SUB(NOW(), INTERVAL 1 HOUR)
    ORDER BY cpu_time_ms desc limit 10;
    
最近更新时间:2026.05.13 16:59:16
这个页面对您有帮助吗?
有用
有用
无用
无用