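Apache Doris FE cluster BDB environment failure: node clock skew exceeds the allowed threshold
Handling a BDBEnvironment clock synchronization failure in an Apache Doris FE cluster
Error log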
2024-01-09 14:46:23,840 WARN (UNKNOWN fe_f78cf069_b094_4d9d_ac9c_ddc521dd494d(-1)|1) [BDBEnvironment.getDatabaseNames():332] bdb environment failure exception. will retry com.sleepycat.je.EnvironmentFailureException: (JE 18.3.12) Environment must be closed, caused by: com.sleepycat.je.EnvironmentFailureException: Environment invalid because of previous exception: (JE 18.3.12) fe_f78cf069_b094_4d9d_ac9c_ddc521dd494d(2147483647):/data/doris/fe/doris-meta/bdb Clock delta: -6647 ms. between Feeder: fe_06ad3169_b1f6_448f_b9df_5da1b440eb95 and this Replica exceeds max permissible delta: 5000 ms. HANDSHAKE_ERROR: Error during the handshake between two nodes. Some validity or compatibility check failed, preventing further communication between the nodes. Environment is invalid and must be closed. Originally thrown by HA thread: RepNode fe_f78cf069_b094_4d9d_ac9c_ddc521dd494d(-1)
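Root cause
The clock skew between FE nodes (-6647 ms in the log) exceeds the maximum that BDB JE allows by default (5000 ms). The handshake between this Replica and its Feeder therefore fails, and the BDBEnvironment is marked invalid and must be closed.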
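To make the comparison in the log concrete, here is a minimal sketch that pulls the two numbers out of a handshake error line and checks them. The function name check_clock_delta is hypothetical, and the sketch assumes the limit applies to the absolute value of the delta, which is what the log message suggests.

```shell
# Extract the reported clock delta and the permitted maximum (both in ms)
# from a BDB JE handshake error line, then report whether the limit is exceeded.
# check_clock_delta is a hypothetical helper, not part of Doris or BDB JE.
check_clock_delta() {
  local line="$1" delta max abs
  # "Clock delta: -6647 ms." -> -6647
  delta=$(printf '%s\n' "$line" | sed -n 's/.*Clock delta: \(-\{0,1\}[0-9][0-9]*\) ms.*/\1/p')
  # "max permissible delta: 5000 ms." -> 5000
  max=$(printf '%s\n' "$line" | sed -n 's/.*max permissible delta: \([0-9][0-9]*\) ms.*/\1/p')
  if [ -z "$delta" ] || [ -z "$max" ]; then
    echo "no clock delta found"
    return 2
  fi
  abs=${delta#-}   # absolute value: strip a leading minus sign
  if [ "$abs" -gt "$max" ]; then
    echo "EXCEEDED delta=${delta}ms max=${max}ms"
    return 1
  fi
  echo "OK delta=${delta}ms max=${max}ms"
}
```

Passing the log line above to check_clock_delta reports a delta of -6647 ms against a limit of 5000 ms, so the check fails by 1647 ms.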
Solution
1. Check the clock on each node
Run the following command on every FE node to see the current time and confirm the skew:
date
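Comparing date output by eye is error-prone when the skew is only a few seconds. The sketch below measures it in milliseconds from one machine. It assumes GNU date (for the %3N format), passwordless SSH to each FE host, and hypothetical host names; the function names are illustrative only. The SSH round trip adds some error, so treat the result as an approximation.

```shell
# Difference in ms between a remote timestamp and a local one (both epoch ms).
skew_ms() { echo $(( $1 - $2 )); }

# Succeeds when |skew| <= max. Usage: within_limit <skew_ms> <max_ms>
within_limit() {
  local abs=${1#-}
  [ "$abs" -le "$2" ]
}

# Usage: check_fe_clocks <max_ms> <host>...
# Hosts are placeholders; SSH latency is included in the measured skew.
check_fe_clocks() {
  local max="$1"; shift
  local host remote_ts local_ts skew
  for host in "$@"; do
    local_ts=$(date +%s%3N)
    remote_ts=$(ssh -o ConnectTimeout=5 "$host" date +%s%3N) || { echo "$host unreachable"; continue; }
    skew=$(skew_ms "$remote_ts" "$local_ts")
    if within_limit "$skew" "$max"; then
      echo "$host skew=${skew}ms OK"
    else
      echo "$host skew=${skew}ms TOO LARGE"
    fi
  done
}
```

For example, check_fe_clocks 5000 fe1 fe2 fe3 (with fe1, fe2, fe3 standing in for the real FE hosts) prints one line per host.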
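2. Configure a time synchronization service
Deploy NTP or Chrony so that the clocks on all FE nodes stay in sync:
- Install Chrony (CentOS/RHEL example):
yum install chrony -y
systemctl start chronyd
systemctl enable chronyd
- Force an immediate sync manually: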
chronyc -a makestep
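After the step, chronyc tracking shows how far the local clock is from its time source. The following sketch converts the "System time" line of that output into a signed offset in milliseconds. The helper name chrony_offset_ms is hypothetical. Note that this is the offset against the NTP source, not directly against the other FE nodes, so it is only an indirect check.

```shell
# Read `chronyc tracking` output on stdin and print the system clock offset in ms.
# Positive means the local clock is ahead of NTP time, negative means behind.
# Expects a line such as:
#   System time     : 0.000020390 seconds slow of NTP time
chrony_offset_ms() {
  awk '/^System time/ {
    ms = $4 * 1000              # field 4 is the offset in seconds
    if ($6 == "slow") ms = -ms  # field 6 is "fast" or "slow"
    printf "%.3f\n", ms
  }'
}
```

Usage: chronyc tracking | chrony_offset_ms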
3. Restore the FE service
Once the clock skew across all nodes is within 5000 ms, restart the failed FE node:
# Stop the FE
sh bin/stop_fe.sh
# Start the FE
sh bin/start_fe.sh --daemon
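Once the FE is back up, it is worth confirming that it has rejoined the cluster. One way is to run SHOW FRONTENDS through a MySQL client and look at the Alive column. The sketch below filters that output. The helper name dead_frontends is hypothetical; the address column is assumed to be called Host or IP depending on the Doris version, and the connection settings in the usage line (127.0.0.1, port 9030, user root) are assumptions to adapt.

```shell
# Read `SHOW FRONTENDS\G` output on stdin and print the address of every
# frontend whose Alive column is not "true". Prints nothing if all are alive.
dead_frontends() {
  awk '$1 == "Host:" || $1 == "IP:" { host = $2 }
       $1 == "Alive:" && $2 != "true" { print host }'
}
```

Usage: mysql -h 127.0.0.1 -P 9030 -uroot -e 'SHOW FRONTENDS\G' | dead_frontends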
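4. Optional: raise the clock skew threshold (not recommended)
If a special scenario makes strict clock synchronization impossible, you can increase the allowed skew with the max_bdbje_clock_delta_ms parameter in the FE configuration file fe.conf (default 5000). The FE must be restarted for the change to take effect:
max_bdbje_clock_delta_ms = 10000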
This question originally came from Stack Exchange; it was asked by user8589466.




