Help needed: Kohya-SS SDXL LoRA training resets the step count to 0 even though the state loads successfully
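I have been training an SDXL LoRA with Kohya's sd-scripts and accelerate, with --save_state enabled so that the training state is saved. When I resume, I hit a strange problem: the log clearly shows that the state files were loaded successfully, yet the training step count always resets to 0. The progress bar and the epoch counter also start over, so I cannot continue from the step where the last save happened (for example, my previous run stopped at step 500).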
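Environment and key configuration
- Training script: /workspace/kohya_ss/sd-scripts/sdxl_train_network.py
- Optimizer: Prodigy
- State save/resume path: /workspace/LoRA/LoRA_Output/Navya/ (I have confirmed that a last-state subdirectory exists under this path; the resume path I specify is /workspace/LoRA/LoRA_Output/Navya/at-step00000500-state)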
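For anyone who wants to double-check which state folders actually exist under the output path, here is a minimal sketch. It assumes that step-based state folders follow the at-step<digits>-state naming seen in my resume path; the function name list_step_states is hypothetical and not part of sd-scripts.

```python
import re
from pathlib import Path

# Assumption: step-based state folders are named like "at-step00000500-state",
# matching the directory name quoted in this post.
STATE_DIR_PATTERN = re.compile(r"at-step(\d+)-state$")


def list_step_states(output_dir):
    """Return (step, path) pairs for state folders in output_dir, sorted by step."""
    found = []
    for child in Path(output_dir).iterdir():
        match = STATE_DIR_PATTERN.search(child.name)
        if child.is_dir() and match:
            found.append((int(match.group(1)), child))
    return sorted(found, key=lambda pair: pair[0])
```

Calling list_step_states("/workspace/LoRA/LoRA_Output/Navya/") would then show every saved step, and the last entry is the most recent step-based state. Folders such as last-state are skipped because they carry no step number in their name.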
The Python code I use to launch training is as follows:
from Tools.check_dependencys import check_dependencies
from Tools.makedir import makedir
import os

print("====================================")
print("🚀 LoRA Training with Realviz XL 5.0")
print("Checking Dependencies...")
print("====================================")
check_dependencies()

print("====================================")
print("🚀 Setting up directories...")
print("====================================")
DATASET_DIR, REG_DIR, LOG_DIR = makedir()
OUTPUT_DIR = "/workspace/LoRA/LoRA_Output/Navya/"

print("====================================")
print("Training Configuration")
print("====================================")
PRETRAINED_MODEL = "/workspace/RealVisXL_V5.0"

# ----------------------
# RESUMPTION PARAMETERS
# ----------------------
RESUME_PATH = "/workspace/LoRA/LoRA_Output/Navya/at-step00000500-state"
STARTING_STEP = 500  # step at which the previous run stopped (defined here, but not passed to the command below)

# ----------------------
# TRAINING COMMAND
# ----------------------
RESOLUTION = 1024
BATCH_SIZE = 4
GRAD_ACC_STEPS = 1
MAX_STEPS = 600
NETWORK_DIM = 96
NETWORK_ALPHA = 96
LEARNING_RATE = 0.7

train_cmd = f'''
accelerate launch --mixed_precision=bf16 /workspace/kohya_ss/sd-scripts/sdxl_train_network.py \\
    --pretrained_model_name_or_path="{PRETRAINED_MODEL}" \\
    --train_data_dir={DATASET_DIR} \\
    --reg_data_dir="{REG_DIR}" \\
    --output_dir="{OUTPUT_DIR}" \\
    --logging_dir="{LOG_DIR}" \\
    --resolution={RESOLUTION} \\
    --network_module=networks.lora \\
    --network_dim={NETWORK_DIM} \\
    --network_alpha={NETWORK_ALPHA} \\
    --learning_rate={LEARNING_RATE} \\
    --train_batch_size={BATCH_SIZE} \\
    --gradient_accumulation_steps={GRAD_ACC_STEPS} \\
    --max_train_steps={MAX_STEPS} \\
    --save_every_n_steps=150 \\
    --text_encoder_lr=0.7 \\
    --noise_offset=0.1 \\
    --min_snr_gamma=5 \\
    --save_last_n_steps=3 \\
    --save_last_n_epochs=3 \\
    --save_state \\
    --save_precision=bf16 \\
    --optimizer_type=Prodigy \\
    --mem_eff_attn \\
    --caption_extension=.txt \\
    --max_data_loader_n_workers=4 \\
    --log_prefix="LoRA_Logs" \\
    --enable_bucket \\
    --bucket_reso_steps=64 \\
    --log_with tensorboard \\
    --resume="{RESUME_PATH}" \\
    2>&1 | tee /workspace/train.log
'''

print("====================================")
print("🚀 Starting Training...")
print("====================================")
print("🚀 Resuming LoRA training with Colab Pro A100/L4...\n")
exit_code = os.system(train_cmd)
print("\n✅ Training finished with exit code:", exit_code)
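Contradictory behavior in the logs
1. Log records showing that the state loaded successfully
When training resumes, the log explicitly reports that the state files were found and that the model weights and optimizer states were loaded: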
2025-10-05 12:50:23 INFO resume training from local state: /workspace/LoRA/LoRA_Output/Navya/
train_util.py:4684 INFO Loading states from /workspace/LoRA/LoRA_Output/Navya/
accelerator.py:3678 INFO All model weights loaded successfully
INFO All optimizer states loaded successfully
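2. Step and epoch counters reset anyway
The progress bar and the epoch counter start completely from scratch instead of continuing from the earlier 500 steps:
epoch 0/700 # should continue from the corresponding epoch, not 0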
...
steps: 0%| | 3/700 [00:20<1:19:40, 6.86s/it, avr_loss=0.0538]
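To make the symptom checkable from the saved log rather than by eye, here is a rough sketch that pulls the current and total step out of a progress line like the one above and compares it with the step I expected to resume from. The function names parse_progress and counter_was_reset are hypothetical helpers of my own, and the regular expression only assumes the "current/total [" layout visible in this log.

```python
import re

# Matches the "current/total [" part of a tqdm-style progress line,
# e.g. "steps: 0%| | 3/700 [00:20<1:19:40, ...]".
PROGRESS_PATTERN = re.compile(r"(\d+)/(\d+)\s*\[")


def parse_progress(line):
    """Return (current_step, total_steps) from a progress line, or None if absent."""
    match = PROGRESS_PATTERN.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def counter_was_reset(line, expected_start):
    """True if the logged step is below the step the run was expected to resume from."""
    progress = parse_progress(line)
    return progress is not None and progress[0] < expected_start


line = "steps: 0%| | 3/700 [00:20<1:19:40, 6.86s/it, avr_loss=0.0538]"
print(parse_progress(line))          # (3, 700)
print(counter_was_reset(line, 500))  # True
```

With the line from my log, the parsed step is 3, well below the expected 500, which is exactly the reset I am describing.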
The full log also contains accelerate's warning about default parameters, but it does not appear to affect state loading:
The following values were not passed to `accelerate launch` and had defaults used instead:
    `--num_processes` was set to a value of `1`
    `--num_machines` was set to a value of `1`
    `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
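Question
The log clearly shows that the state was loaded successfully, so why does the training step count still reset to 0? Is there a way to make training continue from the 500 steps saved last time instead of starting over? I would appreciate it if someone could help me figure out what is going wrong. Thanks!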




