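Looking for a solution to the RocksDB "File name too long" crash in Flink application mode
A lasting fix for the RocksDB "File name too long" crash on Flink 2.2.0 + Kubernetes Operator 1.4.0 (without shortening the application name)
The crash happens because the temporary RocksDB file name generated during incremental restore is too long. The fix for FLINK-31743 does not cover every scenario, so here are several approaches that do not require shortening the application name: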
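1. Customize the RocksDB local directory with short-ID placeholders
Set state.backend.rocksdb.localdir directly in the Flink configuration and use short-ID placeholders to shrink the path while keeping it unique:
    flinkConfiguration:
      state.backend.rocksdb.localdir: /tmp/rdb/tm_${taskmanager.id:short}_job_${job.id:short}_op_${operator.id:short}
The :short suffix is meant to truncate each ID to its first 8 characters, so the long UUIDs and operator IDs collapse into short strings and the file name shrinks at the source. Confirm that your Flink version actually expands these placeholders: if the literal ${...} text shows up in the directory name, they are not supported and this option will not help.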
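2. Turn off RocksDB log files entirely
If lowering the log level has no effect, disable log generation through RocksDB's native options:
    flinkConfiguration:
      state.backend.rocksdb.log.level: OFF  # force RocksDB log file generation off
      state.backend.rocksdb.extended-options: "keep_log_file_num=0;log_file_time_to_roll=0;log_file_size_to_roll=0"
keep_log_file_num=0 is intended to stop RocksDB from keeping any log files, which would avoid the over-long log file name altogether. Verify both keys against the configuration reference for your Flink version before relying on them: the documented values for state.backend.rocksdb.log.level are RocksDB's InfoLogLevel names (for example ERROR_LEVEL or HEADER_LEVEL), and RocksDB may reject keep_log_file_num=0 when the database is opened.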
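3. Map a short job name at the Operator level
In the FlinkDeployment YAML, set a short name with spec.jobName; this does not affect the application's business identifier or what the UI shows:
    apiVersion: flink.apache.org/v1beta1
    kind: FlinkDeployment
    spec:
      jobName: short-job-id  # this name is used to build the RocksDB path
      name: hydra-sql-adr-assoc-device-and-login-features  # original application name, shown in the Flink UI
      flinkVersion: v1_22  # adjust to the Flink version you actually run
      # other deployment settings...
This keeps the original, descriptive application name while shortening the job-name part of the RocksDB path. It assumes your Operator version accepts spec.jobName and spec.name, so check the FlinkDeployment CRD reference for the release you have deployed.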
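4. Patch the temporary file naming used by incremental restore
If none of the settings above take effect, the remaining option is to modify the Flink code and complete the FLINK-31743 fix yourself:
- Find the code in the RocksDBIncrementalRestoreOperation class that builds the temporary DB path, and replace the long operator ID and UUID that get concatenated into it with a short hash (for example, the first 8 characters of an MD5 digest).
- Build a custom flink-statebackend-rocksdb jar and replace the corresponding dependency in the cluster.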
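The short-hash idea above can be sketched as follows. This is a minimal, standalone illustration and not Flink's code: the class ShortPathHash and the method shortHash are hypothetical names, and wiring the result into the place where RocksDBIncrementalRestoreOperation assembles the temporary path is left out because that code may differ between Flink versions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper for illustration only; not part of Flink. */
final class ShortPathHash {

    /** Returns the first 8 hex characters of the MD5 digest of the given ID. */
    static String shortHash(String longId) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(longId.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(8);
            // The first 4 bytes of the digest give 8 hex characters.
            for (int i = 0; i < 4; i++) {
                hex.append(String.format("%02x", digest[i] & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Path fragments copied from the stack trace in this post.
        String operatorPart = "KeyedProcessOperator_d4d5e8c74c3d05d8a9a53a9c312a6161__1_5_";
        String uuidPart = "aadf2786-a3dd-4fa9-acaa-59d560e05ce3";
        System.out.println("op_" + shortHash(operatorPart) + "_uuid_" + shortHash(uuidPart));
    }
}
```

Eight hex characters carry only 32 bits, so collisions are possible. Keeping another distinguishing part in the path, or using more characters of the digest, lowers that risk.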
Stack trace of the problem
[...]
java.io.IOException: Error while opening RocksDB instance.
    at org.apache.flink.state.rocksdb.RocksDBOperationUtils.openDB(RocksDBOperationUtils.java:101)
    at org.apache.flink.state.rocksdb.restore.RestoredDBInstance.restoreTempDBInstanceFromLocalState(RestoredDBInstance.java:121)
    at org.apache.flink.state.rocksdb.restore.RocksDBIncrementalRestoreOperation.copyToBaseDBUsingTempDBs(RocksDBIncrementalRestoreOperation.java:788)
    at org.apache.flink.state.rocksdb.restore.RocksDBIncrementalRestoreOperation.mergeStateHandlesWithCopyFromTemporaryInstance(RocksDBIncrementalRestoreOperation.java:628)
    at org.apache.flink.state.rocksdb.restore.RocksDBIncrementalRestoreOperation.restoreFromMultipleStateHandles(RocksDBIncrementalRestoreOperation.java:446)
    at org.apache.flink.state.rocksdb.restore.RocksDBIncrementalRestoreOperation.restoreFromLocalState(RocksDBIncrementalRestoreOperation.java:326)
    at org.apache.flink.state.rocksdb.restore.RocksDBIncrementalRestoreOperation.lambda$restore$1(RocksDBIncrementalRestoreOperation.java:253)
    at org.apache.flink.state.rocksdb.restore.RocksDBIncrementalRestoreOperation.runAndReportDuration(RocksDBIncrementalRestoreOperation.java:893)
    at org.apache.flink.state.rocksdb.restore.RocksDBIncrementalRestoreOperation.restore(RocksDBIncrementalRestoreOperation.java:252)
    at org.apache.flink.state.rocksdb.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:390)
    ... 19 more
Caused by: org.rocksdb.RocksDBException: While open a file for appending: /tmp/rdb/tmp_tm_hydra-sql-adr-assoc-device-and-login-features-taskmanager-1-10_tmp_job_41471278f6601d1a7ab05da6958d83f7_op_KeyedProcessOperator_d4d5e8c74c3d05d8a9a53a9c312a6161__1_5__uuid_aadf2786-a3dd-4fa9-acaa-59d560e05ce3_b5ea62d0-713f-46c4-bd4e-a4526f117f33_LOG: File name too long
    at org.rocksdb.RocksDB.open(Native Method)
    at org.rocksdb.RocksDB.open(RocksDB.java:315)
    at org.apache.flink.state.rocksdb.RocksDBOperationUtils.openDB(RocksDBOperationUtils.java:89)
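To see why the open fails, it can help to measure the last component of the path in the trace above against the file-name limit. The sketch below assumes a limit of 255 bytes per name, which is NAME_MAX on common Linux filesystems such as ext4 and XFS; your volume may differ. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative check only; names are hypothetical. */
final class FileNameLengthCheck {

    /** Assumed per-component limit (NAME_MAX on ext4, XFS and similar). */
    static final int NAME_MAX_BYTES = 255;

    /** The LOG file path copied from the "Caused by" line of the trace. */
    static final String FAILING_PATH =
            "/tmp/rdb/tmp_tm_hydra-sql-adr-assoc-device-and-login-features-taskmanager-1-10"
                    + "_tmp_job_41471278f6601d1a7ab05da6958d83f7"
                    + "_op_KeyedProcessOperator_d4d5e8c74c3d05d8a9a53a9c312a6161__1_5_"
                    + "_uuid_aadf2786-a3dd-4fa9-acaa-59d560e05ce3"
                    + "_b5ea62d0-713f-46c4-bd4e-a4526f117f33_LOG";

    /** Returns the UTF-8 byte length of the part after the last '/'. */
    static int lastComponentBytes(String path) {
        String name = path.substring(path.lastIndexOf('/') + 1);
        return name.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        int length = lastComponentBytes(FAILING_PATH);
        System.out.println("file name bytes: " + length
                + ", assumed limit: " + NAME_MAX_BYTES
                + ", too long: " + (length > NAME_MAX_BYTES));
    }
}
```

The limit applies to each path component separately, not to the whole path, which is why a flattened name that packs the task manager ID, job ID, operator ID and two UUIDs into one component is the part that overflows.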
The question comes from Stack Exchange; it was asked by Clemens Valiente.




