How can I keep Google Colab's GPU running for more than 12 hours to finish model training?

Fixing Colab GPU 12-Hour Disconnect Issues for Long Training Runs

Ah, the infamous Colab GPU timeout—trust me, I’ve lost count of how many late-night training runs I’ve had die right at the 12-hour mark. Let’s go through practical, actionable ways to work around this and get your model fully trained:

1. Upgrade to Colab Pro/Pro+ (Most Reliable Official Fix)

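If you’re on the free tier, Google caps a GPU session at roughly 12 hours, and the real limit can be shorter depending on availability and your usage. The paid tiers relax this: Colab Pro gives longer runtimes and priority access to faster GPUs, and Pro+ adds background execution with sessions of up to about 24 hours. Exact limits change over time and aren’t guaranteed, so check the current Colab FAQ before you plan around them. It’s a paid option, but it removes most of the hassle of workarounds if you regularly run long training jobs.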

2. Implement Checkpointing (Non-Negotiable for Long Runs)

Even if you can’t extend the session, you can pick up right where you left off by saving model checkpoints at regular intervals. Here’s how to do it for the two most common frameworks:

TensorFlow/Keras

Use the ModelCheckpoint callback to save full models or weights after every epoch:

from tensorflow.keras.callbacks import ModelCheckpoint
from google.colab import drive
drive.mount('/content/drive')

# Save checkpoints to your Google Drive so they persist after disconnect
checkpoint_path = "/content/drive/MyDrive/colab_checkpoints/model_epoch_{epoch}.h5"
checkpoint_callback = ModelCheckpoint(
    filepath=checkpoint_path,
    save_weights_only=False,  # Set to True for weights only (Keras 3 then needs a ".weights.h5" file name)
    save_freq="epoch",        # Save after every epoch
    verbose=1
)

# Pass the callback to model.fit()
model.fit(
    train_data,
    epochs=100,
    validation_data=val_data,
    callbacks=[checkpoint_callback]
)

To resume training later:

from tensorflow.keras.models import load_model
model = load_model("/content/drive/MyDrive/colab_checkpoints/model_epoch_50.h5")
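# Continue training from epoch 51; re-create checkpoint_callback as above first
model.fit(train_data, initial_epoch=50, epochs=100,
          validation_data=val_data, callbacks=[checkpoint_callback])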
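
The resume snippet above hard-codes epoch 50. If you'd rather not look up the last saved epoch by hand after every disconnect, something like the following sketch could pick the newest file for you. It assumes the `model_epoch_{epoch}.h5` naming used above; `find_latest_checkpoint` and `CHECKPOINT_PATTERN` are hypothetical names, not part of Keras.

```python
import os
import re

# Assumed to match the file names produced by the checkpoint_path pattern above,
# e.g. "model_epoch_50.h5" -> epoch 50.
CHECKPOINT_PATTERN = re.compile(r"model_epoch_(\d+)\.h5$")


def find_latest_checkpoint(checkpoint_dir):
    """Return (path, epoch) of the newest checkpoint, or (None, 0) if none exist."""
    best_path, best_epoch = None, 0
    if not os.path.isdir(checkpoint_dir):
        return best_path, best_epoch
    for name in os.listdir(checkpoint_dir):
        match = CHECKPOINT_PATTERN.match(name)
        if match and int(match.group(1)) > best_epoch:
            best_epoch = int(match.group(1))
            best_path = os.path.join(checkpoint_dir, name)
    return best_path, best_epoch
```

The returned epoch can be passed straight to `initial_epoch` in `model.fit()`. When no checkpoint exists yet it comes back as 0, which starts training from the beginning.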

PyTorch

Manually save model state, optimizer state, and epoch number at regular intervals:

import os
import torch
from google.colab import drive
drive.mount('/content/drive')

CHECKPOINT_DIR = "/content/drive/MyDrive/colab_checkpoints"

# Define checkpoint save function
def save_checkpoint(epoch, model, optimizer, loss):
    os.makedirs(CHECKPOINT_DIR, exist_ok=True)  # torch.save does not create folders
    checkpoint = {
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss
    }
    torch.save(checkpoint, os.path.join(CHECKPOINT_DIR, f"checkpoint_epoch_{epoch}.pth"))
    print(f"Checkpoint saved for epoch {epoch}")

# In your training loop
for epoch in range(100):
    # ... training steps ...
    loss = ...  # Your training loss
    
    # Save every 5 epochs (adjust as needed)
    if epoch % 5 == 0:
        save_checkpoint(epoch, model, optimizer, loss)

To resume:

checkpoint = torch.load("/content/drive/MyDrive/colab_checkpoints/checkpoint_epoch_50.pth")
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch'] + 1

# Continue training from start_epoch
for epoch in range(start_epoch, 100):
    # ... training steps ...
    pass  # Placeholder so the snippet runs; replace with your training steps
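
The loop above saves every 5 epochs. If your epochs are long, you may lose less work by saving on a wall-clock interval instead. Here's a minimal sketch of that idea; `TimedCheckpointTrigger` is a hypothetical helper, and the interval is whatever you choose, not a Colab setting.

```python
import time


class TimedCheckpointTrigger:
    """Signals when at least `interval_seconds` have passed since the last save."""

    def __init__(self, interval_seconds, clock=time.monotonic):
        self.interval_seconds = interval_seconds
        self.clock = clock              # injectable so the logic can be tested
        self.last_save = clock()

    def due(self):
        """Return True (and reset the timer) when a checkpoint should be written."""
        now = self.clock()
        if now - self.last_save >= self.interval_seconds:
            self.last_save = now
            return True
        return False
```

You'd create one trigger before the loop, for example `trigger = TimedCheckpointTrigger(30 * 60)`, and call `save_checkpoint(...)` at the end of an epoch whenever `trigger.due()` is true. Checking only at epoch boundaries keeps the `start_epoch = checkpoint['epoch'] + 1` resume logic valid.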

3. Prevent Idle Disconnects

Colab will disconnect your session if the browser tab is inactive or there’s no user interaction for too long. To avoid this:

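  • Browser Console Script: Open your browser’s developer tools (F12), go to the Console tab, and run this snippet to simulate periodic clicks. The selector depends on Colab’s current page markup, so it may need updating:
    function keepSessionAlive() {
        const button = document.querySelector("#top-toolbar > colab-connect-button");
        if (button) {
            console.log("Keeping Colab session alive...");
            button.click();
        } else {
            console.warn("Connect button not found; update the selector.");
        }
    }
    setInterval(keepSessionAlive, 60000); // Click every minute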
    
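  • Colab Cell Loop: Run a cell that prints periodic updates. A plain while True loop would block every cell after it, so start it in a background thread before you launch training:
    import threading
    import time

    def heartbeat():
        while True:
            print(f"Session active at: {time.ctime()}")
            time.sleep(300)  # Print every 5 minutes
    threading.Thread(target=heartbeat, daemon=True).start()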
    

Note: These tricks only prevent idle timeouts; they won’t bypass the 12-hour hard limit on the free tier.
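
Since the hard limit still applies, it can help to have the training loop watch the clock and stop cleanly before the session is cut off. The sketch below is one way to do that. The 12-hour default comes from the free-tier limit discussed above, while the 20-minute safety margin and the `SessionBudget` name are assumptions for illustration.

```python
import time


class SessionBudget:
    """Tracks elapsed time against an assumed session limit."""

    def __init__(self, limit_hours=12.0, margin_minutes=20.0, clock=time.monotonic):
        self.clock = clock
        # The timer starts when this object is created, not when the VM started.
        self.start = clock()
        # Seconds available before we want to stop and save.
        self.budget_seconds = limit_hours * 3600 - margin_minutes * 60

    def remaining_seconds(self):
        return self.budget_seconds - (self.clock() - self.start)

    def should_stop(self, next_step_seconds=0.0):
        """True if the next unit of work would not fit in the remaining time."""
        return self.remaining_seconds() < next_step_seconds
```

Create the budget in the first cell you run so the timer roughly tracks the session. Then, before each epoch, call `should_stop(estimated_epoch_seconds)` and save a final checkpoint and break out of the loop when it returns true.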

4. Split Your Training Task

If upgrading isn’t an option, break your training into smaller chunks:

  • Train in stages: For example, first train the base layers of a transfer learning model, save weights, then load and train the top layers in a new session.
  • Use incremental learning: Train on subsets of your data sequentially, updating the model each time and saving checkpoints between subsets.
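
To make staged or subset-by-subset training survive a disconnect, each session needs to know which pieces are already done. One simple approach is a small JSON progress file stored next to your checkpoints. This is only a sketch: the file layout and the helper names (`load_progress`, `mark_done`, `next_stage`) are made up for illustration.

```python
import json
import os


def load_progress(path):
    """Return the list of completed stage names, or [] on the first run."""
    if not os.path.exists(path):
        return []
    with open(path) as f:
        return json.load(f)["completed"]


def mark_done(path, stage):
    """Record a finished stage; write to a temp file first, then swap it in."""
    completed = load_progress(path)
    if stage not in completed:
        completed.append(stage)
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"completed": completed}, f)
    os.replace(tmp_path, path)


def next_stage(path, stages):
    """Return the first stage not yet completed, or None when all are done."""
    completed = load_progress(path)
    for stage in stages:
        if stage not in completed:
            return stage
    return None
```

At the start of a session you'd call `next_stage(progress_path, ["base_layers", "top_layers"])`, train that stage, save the weights, and then call `mark_done`. Writing to a temporary file before swapping it in is meant to avoid a half-written progress file if the session dies mid-write.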

5. Always Mount Google Drive

Make sure to mount your Google Drive at the start of every session, and keep your checkpoints, datasets, and trained models under /content/drive/MyDrive. Anything saved elsewhere on the Colab VM is lost when the session disconnects. The code to mount is simple:

from google.colab import drive
drive.mount('/content/drive')
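
Before kicking off a long run, it may be worth confirming that the checkpoint folder really sits on the mounted Drive and accepts writes. The sketch below checks that. `prepare_checkpoint_dir` is a hypothetical helper, and it assumes the Drive root is missing when Drive isn't mounted, so it never creates that root itself.

```python
import os
import uuid


def prepare_checkpoint_dir(drive_root, subdir="colab_checkpoints"):
    """Return a writable checkpoint directory under drive_root.

    Raises RuntimeError if drive_root does not exist (for example because
    Drive was not mounted) or if a probe file cannot be written.
    """
    if not os.path.isdir(drive_root):
        raise RuntimeError(f"{drive_root} not found; is Google Drive mounted?")
    target = os.path.join(drive_root, subdir)
    os.makedirs(target, exist_ok=True)
    probe = os.path.join(target, f".write_test_{uuid.uuid4().hex}")
    try:
        with open(probe, "w") as f:
            f.write("ok")
        os.remove(probe)
    except OSError as err:
        raise RuntimeError(f"Cannot write to {target}") from err
    return target
```

In Colab you'd call it as `prepare_checkpoint_dir("/content/drive/MyDrive")` right after mounting and use the returned path for every checkpoint you save.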

The question originally comes from Stack Exchange; asked by shivin saluja.
