How can I keep a Google Colab GPU running for more than 12 hours to finish model training?

阿华AIGC实验室

2026-5-21

Fixing Colab GPU 12-Hour Disconnect Issues for Long Training Runs

Ah, the infamous Colab GPU timeout. Trust me, I’ve lost count of how many late-night training runs I’ve had die right at the 12-hour mark. Here are practical ways to work around it and get your model fully trained:

1. Upgrade to Colab Pro/Pro+ (Most Reliable Official Fix)

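If you’re on the free tier, Google caps a GPU session at roughly 12 hours, and the actual limit can be shorter depending on availability and your recent usage. The paid plans (Colab Pro and Pro+) offer longer-running sessions and priority access to faster GPUs; Pro+ also supports background execution, which Google documents as running for up to 24 hours while you have compute units left. It’s a paid option, but it removes most of the hassle of workarounds if you regularly run long training jobs.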

2. Implement Checkpointing (Non-Negotiable for Long Runs)

Even if you can’t extend the session, you can pick up right where you left off by saving model checkpoints at regular intervals. Here’s how to do it for the two most common frameworks:

TensorFlow/Keras

Use the ModelCheckpoint callback to save full models or weights after every epoch:

from tensorflow.keras.callbacks import ModelCheckpoint
from google.colab import drive
drive.mount('/content/drive')

# Save checkpoints to your Google Drive so they persist after disconnect
checkpoint_path = "/content/drive/MyDrive/colab_checkpoints/model_epoch_{epoch}.h5"
checkpoint_callback = ModelCheckpoint(
    filepath=checkpoint_path,
    save_weights_only=False,  # True saves weights only (Keras 3 then expects a ".weights.h5" path)
    save_freq="epoch",        # Save after every epoch
    verbose=1
)

# Pass the callback to model.fit()
model.fit(
    train_data,
    epochs=100,
    validation_data=val_data,
    callbacks=[checkpoint_callback]
)

To resume training later:

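from tensorflow.keras.models import load_model
model = load_model("/content/drive/MyDrive/colab_checkpoints/model_epoch_50.h5")
# Continue from epoch 51; epochs=100 is the final epoch index, not the number of extra epochs
model.fit(train_data, initial_epoch=50, epochs=100,
          validation_data=val_data, callbacks=[checkpoint_callback])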
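Hard-coding the epoch number in the resume step gets tedious once sessions die at unpredictable points. The sketch below is one way to pick the newest checkpoint automatically. It assumes the `model_epoch_{epoch}.h5` naming used above; `find_latest_checkpoint` is a hypothetical helper, not part of Keras.

```python
import re
from pathlib import Path

# Assumption: files follow the "model_epoch_{epoch}.h5" naming from the example above.
CHECKPOINT_NAME = re.compile(r"model_epoch_(\d+)\.h5")


def find_latest_checkpoint(checkpoint_dir):
    """Return (path, epoch) for the highest-numbered checkpoint, or None if there is none."""
    directory = Path(checkpoint_dir)
    if not directory.is_dir():
        return None

    best = None
    for path in directory.iterdir():
        match = CHECKPOINT_NAME.fullmatch(path.name)
        if match is None:
            continue  # Ignore files that are not checkpoints
        epoch = int(match.group(1))
        if best is None or epoch > best[1]:
            best = (str(path), epoch)
    return best


# Possible use in a new session (model loading itself is unchanged):
#   latest = find_latest_checkpoint("/content/drive/MyDrive/colab_checkpoints")
#   if latest is not None:
#       path, epoch = latest
#       model = load_model(path)
#       model.fit(train_data, initial_epoch=epoch, epochs=100, ...)
```

The numeric comparison matters: sorting file names as strings would rank `model_epoch_9.h5` above `model_epoch_50.h5`.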

PyTorch

Manually save model state, optimizer state, and epoch number at regular intervals:

import os
import torch
from google.colab import drive
drive.mount('/content/drive')

# Drive folder for checkpoints; torch.save does not create missing directories
CHECKPOINT_DIR = "/content/drive/MyDrive/colab_checkpoints"
os.makedirs(CHECKPOINT_DIR, exist_ok=True)

# Define checkpoint save function
def save_checkpoint(epoch, model, optimizer, loss):
    checkpoint = {
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss
    }
    torch.save(checkpoint, os.path.join(CHECKPOINT_DIR, f"checkpoint_epoch_{epoch}.pth"))
    print(f"Checkpoint saved for epoch {epoch}")

# In your training loop
for epoch in range(100):
    # ... training steps ...
    loss = ...  # Your training loss

    # Save every 5 epochs (adjust as needed)
    if epoch % 5 == 0:
        save_checkpoint(epoch, model, optimizer, loss)

To resume:

checkpoint = torch.load("/content/drive/MyDrive/colab_checkpoints/checkpoint_epoch_50.pth")
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch'] + 1

# Continue training from start_epoch
for epoch in range(start_epoch, 100):
    # ... training steps ...
    pass  # Placeholder so the loop body is valid Python
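A session can also die while a checkpoint is being written, leaving a truncated file as the newest one. A common precaution, sketched here, is to write to a temporary name and rename it into place, then delete older checkpoints to save Drive space. `atomic_save` and `prune_checkpoints` are hypothetical helpers, and the file pattern assumes the `checkpoint_epoch_{epoch}.pth` naming above. `os.replace` is atomic on a local filesystem; whether the mounted Drive folder gives the same guarantee is an assumption.

```python
import os
import re


def atomic_save(save_fn, final_path):
    """Call save_fn(tmp_path), then rename the finished file to final_path."""
    tmp_path = final_path + ".tmp"
    save_fn(tmp_path)                 # e.g. lambda p: torch.save(checkpoint, p)
    os.replace(tmp_path, final_path)  # A half-written file never carries the final name


def prune_checkpoints(checkpoint_dir, keep_last=3):
    """Delete all but the keep_last highest-numbered checkpoints; return removed names."""
    pattern = re.compile(r"checkpoint_epoch_(\d+)\.pth")
    found = []
    for name in os.listdir(checkpoint_dir):
        match = pattern.fullmatch(name)
        if match is not None:
            found.append((int(match.group(1)), name))
    found.sort()  # Oldest (lowest epoch) first

    stale = found[:-keep_last] if keep_last > 0 else found
    for _, name in stale:
        os.remove(os.path.join(checkpoint_dir, name))
    return [name for _, name in stale]
```

Inside `save_checkpoint`, the `torch.save(...)` line could become `atomic_save(lambda p: torch.save(checkpoint, p), path)`, followed by a call to `prune_checkpoints`.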

3. Prevent Idle Disconnects

Colab will disconnect your session if the browser tab is inactive or there’s no user interaction for too long. To avoid this:

Browser Console Script: Open your browser’s developer tools (F12), go to the Console tab, and run this snippet to simulate periodic clicks:

function keepSessionAlive() {
    // The selector depends on Colab's current page markup and may need updating
    const button = document.querySelector("#top-toolbar > colab-connect-button");
    if (button) {
        console.log("Keeping Colab session alive...");
        button.click();
    }
}
setInterval(keepSessionAlive, 60000); // Click every minute

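Colab Cell Loop: Run a cell that prints periodic updates to keep the session active. A plain while True loop would block every cell queued after it, including your training cell, so start it in a background thread:

import threading
import time
def heartbeat():
    while True:
        print(f"Session active at: {time.ctime()}")
        time.sleep(300)  # Print every 5 minutes
threading.Thread(target=heartbeat, daemon=True).start()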

Note: This only prevents idle timeouts; it won’t bypass the 12-hour hard limit on the free tier.
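Because the hard limit is measured in wall-clock time, it can help to checkpoint on a timer and stop cleanly shortly before the limit instead of being cut off mid-epoch. Below is a minimal sketch. `SessionBudget` is a hypothetical class, the 12-hour figure and the margins are assumptions to adjust, and the timer starts when the object is created, which may be later than when the runtime itself started.

```python
import time


class SessionBudget:
    """Tracks elapsed time to decide when to checkpoint and when to stop."""

    def __init__(self, limit_hours=12.0, safety_margin_minutes=30.0,
                 checkpoint_every_minutes=20.0, clock=time.monotonic):
        self._clock = clock
        self._start = clock()
        # Stop early enough to leave time for a final save
        self._deadline = self._start + limit_hours * 3600 - safety_margin_minutes * 60
        self._interval = checkpoint_every_minutes * 60
        self._last_checkpoint = self._start

    def should_checkpoint(self):
        """True once per interval; resets the interval timer when it fires."""
        now = self._clock()
        if now - self._last_checkpoint >= self._interval:
            self._last_checkpoint = now
            return True
        return False

    def should_stop(self):
        """True once the safety deadline has been reached."""
        return self._clock() >= self._deadline


# Possible use inside a training loop:
#   budget = SessionBudget()
#   for epoch in range(start_epoch, 100):
#       ...train one epoch...
#       if budget.should_checkpoint() or budget.should_stop():
#           save_checkpoint(epoch, model, optimizer, loss)
#       if budget.should_stop():
#           break
```

The `clock` parameter exists only so the timing logic can be exercised without waiting; in a notebook the default is enough.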

4. Split Your Training Task

If upgrading isn’t an option, break your training into smaller chunks:

Train in stages: For example, first train the base layers of a transfer learning model, save weights, then load and train the top layers in a new session.
Use incremental learning: Train on subsets of your data sequentially, updating the model each time and saving checkpoints between subsets.
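To size the chunks, a quick calculation from your measured time per epoch shows how many sessions a run will need. The sketch below assumes you have timed one epoch yourself; `plan_sessions` is a hypothetical helper, and the default session length and safety margin are assumptions.

```python
def plan_sessions(total_epochs, seconds_per_epoch,
                  session_hours=12.0, safety_margin_hours=0.5):
    """Split training into (start_epoch, end_epoch) ranges that each fit in one session.

    end_epoch is exclusive, so each range maps directly onto range(start, end).
    """
    usable_seconds = (session_hours - safety_margin_hours) * 3600
    epochs_per_session = int(usable_seconds // seconds_per_epoch)
    if epochs_per_session < 1:
        raise ValueError("A single epoch does not fit in one session; "
                         "checkpoint within the epoch instead.")

    sessions = []
    start = 0
    while start < total_epochs:
        end = min(start + epochs_per_session, total_epochs)
        sessions.append((start, end))
        start = end
    return sessions


# Example: 100 epochs at 30 minutes each, 12-hour sessions with a 30-minute margin
print(plan_sessions(100, 1800))
# [(0, 23), (23, 46), (46, 69), (69, 92), (92, 100)]
```

Each tuple is one session: load the checkpoint from the previous range, train `range(start, end)`, save, and reconnect for the next one.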

5. Always Mount Google Drive

Make sure to mount your Google Drive at the start of every session. This ensures your checkpoints, datasets, and trained models aren’t lost when the session disconnects. The code to mount is simple:

from google.colab import drive
drive.mount('/content/drive')
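Mounting Drive does not create the checkpoint folder, and saving into a missing folder can fail. One way to handle both at the top of a notebook is sketched below. `prepare_checkpoint_dir` is a hypothetical helper; the `colab_checkpoints` folder name follows the earlier examples, and the local fallback outside Colab is an assumption added for convenience.

```python
import os


def prepare_checkpoint_dir(subdir="colab_checkpoints", local_root="./checkpoints"):
    """Mount Drive when running in Colab and return an existing checkpoint folder."""
    try:
        from google.colab import drive  # Only importable inside Colab
    except ImportError:
        root = local_root               # Fallback for local runs
    else:
        drive.mount('/content/drive')
        root = "/content/drive/MyDrive"

    path = os.path.join(root, subdir)
    os.makedirs(path, exist_ok=True)    # Safe to call on every session start
    return path


# Possible use at the top of the notebook:
#   CHECKPOINT_DIR = prepare_checkpoint_dir()
```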

The question comes from Stack Exchange; asked by shivin saluja.
