SQL Server并行ETL管控存储过程提前无错终止问题排查求助
SQL Server并行ETL管控存储过程提前无错终止问题排查求助
各位大佬好,最近碰到个百思不得其解的诡异问题,想请大家帮忙排查下:
我们团队在SQL Server上搭建了一套端到端的落地库数据加载流程,原来串行跑各个数据源的重导脚本,后来实在太慢,改成用并行执行+管控存储过程的模式。管控过程叫run_e2e_landing,核心逻辑是循环等待所有并行重导子过程完成,每分钟在landing_parallel_log表写一条心跳日志,还加了个安全兜底:如果总运行时间超过2小时就自动终止(至今没触发过,触发也会打日志)。改成并行后大部分时候都很顺利,直接把总运行时间砍了1小时以上,效果拉满。
但最近偶尔会出状况:某次运行里有一个重导过程跑了超1小时(比如72分钟,远早于2小时的自终止阈值),这时候管控存储过程会毫无征兆地突然终止——没有任何错误日志,就像进程被直接干掉了一样,而且肯定不是触发了自终止逻辑(时间差太多了)。更奇怪的是,那个跑慢的重导过程自己会正常完成,它有独立的日志能证明执行成功,只是比平时久了一点。而且每次出问题的重导过程都不固定,唯一的共同点就是运行时间超1小时。
我们已经排查了这些方向:
- 确认SQL Server存储过程本身没有超时设置
- 调用管控过程的SQL Agent作业也没有配置超时
- 问题发生时间远早于2小时的自终止代码触发时机
- 慢的重导过程不固定,没有指向特定数据源的规律
现在我们怀疑会不会是操作系统层面的问题?比如Windows系统里有什么后台清理进程,误以为长时间运行的管控存储过程挂起了,把它强制杀掉了?
下面是整个管控存储过程的代码,问题100%发生在监控并行任务的阶段(也就是代码里的循环等待逻辑段,原代码粘贴时可能有截断,但核心监控逻辑就在后续循环里):
create procedure integration.run_e2e_landing @mode nvarchar(10) = 'execute' , @TablePattern nvarchar(50) = '%' as BEGIN set nocount on set xact_abort on ---------------------------------------- --Initialisation ---------------------------------------- declare @source_system nvarchar(255) = 'XXX_Landing' declare @procedure_name nvarchar(255) = object_schema_name(@@procid) + '.' + object_name(@@procid) declare @subprocedure_name nvarchar(255) declare @etl_start datetime2 = sysdatetime() declare @etl_end datetime2 declare @error_log nvarchar(MAX) = 'Successful' declare @error_status int = 0 declare @missing_job nvarchar(128) = '' declare @parallel_job_name nvarchar(128) declare @running_job_count int declare @running_job_cutoff datetime2 declare @update_key int declare @Duration int declare @Minutes int declare @Seconds int declare @SQL nvarchar(MAX) declare @ExcludedProcedureList table (ProcedureName nvarchar(128)) insert @ExcludedProcedureList (ProcedureName) values ('repopulate_sits_integration') raiserror ('--==============================================================================', 0, 1) with nowait raiserror ('--Run all end-to-end tasks on Landing', 0, 1) with nowait if @mode = 'debug' raiserror ( '--Debug mode on: statements will be generated but not executed', 0, 1) with nowait raiserror ('--==============================================================================', 0, 1) with nowait --Start Job Log insert meta.landing_job_log(source_system, procedure_name, etl_start, run_mode) values (@source_system, @procedure_name, @etl_start, @mode) set @update_key = ( select max(job_log_id) from meta.landing_job_log where etl_start = @etl_start and procedure_name = @procedure_name and etl_end is null ) --Delete Parallel Logs older than a month to keep size down delete from meta.landing_parallel_log where etl_start < dateadd(day, -30, sysdatetime()) ---------------------------------------- --Sanity checks ---------------------------------------- --Check all parallel jobs exist [0.2.03] if object_id('meta.landing_parallel_job') is null begin set @error_log = 'meta.landing_parallel_job does not exist' goto Housekeeping end set @missing_job = isnull( ( select top 1 job_name from meta.landing_parallel_job where job_name not in (select name from msdb.dbo.sysjobs) ) , '') if @missing_job <> '' begin set @error_log = @missing_job + ' does not exist' goto Housekeeping end ---------------------------------------- --Repopulate all other Landing tables ---------------------------------------- declare ProcedureCursor cursor for select distinct sch.name + '.' + obj.name as SchemaProcedureName from WaikatoODS_Landing.sys.objects obj join WaikatoODS_Landing.sys.schemas sch on (obj.schema_id = sch.schema_id) where obj.Type = 'P' --Procedure and obj.name like 'repopulate_%' and obj.name not in (select ProcedureName from @ExcludedProcedureList) --Ignore excluded procedures and ( (@mode = 'debug') or (@mode = 'execute' and obj.name not in (select job_name from meta.landing_parallel_job)) --Ignore parallel jobs ) order by SchemaProcedureName --Log repopulate procedure start insert meta.landing_parallel_log(etl_action, etl_start, error_log) values ('Monitoring repopulate procedures', sysdatetime(), 'Started') -- 监控并行任务的循环逻辑在此处(原代码粘贴截断)
想请教下各位:有没有碰到过类似的情况?会不会是Windows的什么机制导致的?或者SQL Server层面有什么我们没注意到的隐藏设置?求指点排查方向!




