How can I keep a Google Colab GPU running for more than 12 hours to finish training a model?
Ah, the infamous Colab GPU timeout—trust me, I’ve lost count of how many late-night training runs I’ve had die right at the 12-hour mark. Let’s go through practical, actionable ways to work around this and get your model fully trained:
1. Upgrade to Colab Pro/Pro+ (Most Reliable Official Fix)
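If you're on the free tier, Google caps a GPU session at roughly 12 hours, and the real limit can be shorter depending on availability and your usage. Colab Pro and Pro+ raise that ceiling (Google has advertised runtimes of up to 24 hours), and Pro+ adds priority access to faster GPUs. The exact limits change over time, so check the current Colab FAQ before paying. It's a paid option, but it eliminates the hassle of workarounds if you regularly run long training jobs.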
2. Implement Checkpointing (Non-Negotiable for Long Runs)
Even if you can’t extend the session, you can pick up right where you left off by saving model checkpoints at regular intervals. Here’s how to do it for the two most common frameworks:
TensorFlow/Keras
Use the ModelCheckpoint callback to save full models or weights after every epoch:
    from tensorflow.keras.callbacks import ModelCheckpoint
    from google.colab import drive

    drive.mount('/content/drive')

    # Save checkpoints to your Google Drive so they persist after disconnect
    checkpoint_path = "/content/drive/MyDrive/colab_checkpoints/model_epoch_{epoch}.h5"

    checkpoint_callback = ModelCheckpoint(
        filepath=checkpoint_path,
        save_weights_only=False,  # Set to True if you only want weights
        save_freq="epoch",        # Save after every epoch
        verbose=1
    )

    # Pass the callback to model.fit()
    model.fit(
        train_data,
        epochs=100,
        validation_data=val_data,
        callbacks=[checkpoint_callback]
    )
To resume training later:
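    from tensorflow.keras.models import load_model

    model = load_model("/content/drive/MyDrive/colab_checkpoints/model_epoch_50.h5")

    # Continue training from epoch 51.
    # Re-create checkpoint_callback as above so the new epochs keep being saved.
    model.fit(
        train_data,
        initial_epoch=50,
        epochs=100,
        validation_data=val_data,
        callbacks=[checkpoint_callback]
    )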
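Hardcoding the epoch number in the filename gets tedious once you have resumed a few times. Here is a minimal sketch, assuming the `model_epoch_{epoch}.h5` naming used above. The helper name `find_latest_checkpoint` is hypothetical and not part of Keras; it just scans a folder and picks the highest epoch number.

```python
import os
import re


def find_latest_checkpoint(directory, pattern=r"model_epoch_(\d+)\.h5$"):
    """Return (path, epoch) of the highest-numbered checkpoint, or (None, 0)."""
    if not os.path.isdir(directory):
        return None, 0
    found = []
    for name in os.listdir(directory):
        match = re.match(pattern, name)
        if match:
            # Compare epochs as integers so epoch 10 sorts after epoch 9
            found.append((int(match.group(1)), os.path.join(directory, name)))
    if not found:
        return None, 0
    epoch, path = max(found)
    return path, epoch
```

You would then pass the returned path to `load_model` and the returned epoch to `initial_epoch`. When nothing is found the helper returns `(None, 0)`, so a fresh run starts at epoch 0.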
PyTorch
Manually save model state, optimizer state, and epoch number at regular intervals:
    import os
    import torch
    from google.colab import drive

    drive.mount('/content/drive')

    checkpoint_dir = "/content/drive/MyDrive/colab_checkpoints"
    os.makedirs(checkpoint_dir, exist_ok=True)  # torch.save does not create missing folders

    # Define checkpoint save function
    def save_checkpoint(epoch, model, optimizer, loss):
        checkpoint = {
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'loss': loss
        }
        torch.save(checkpoint, f"{checkpoint_dir}/checkpoint_epoch_{epoch}.pth")
        print(f"Checkpoint saved for epoch {epoch}")

    # In your training loop
    for epoch in range(100):
        # ... training steps ...
        loss = ...  # Your training loss

        # Save every 5 epochs (adjust as needed)
        if epoch % 5 == 0:
            save_checkpoint(epoch, model, optimizer, loss)
To resume:
    checkpoint = torch.load("/content/drive/MyDrive/colab_checkpoints/checkpoint_epoch_50.pth")
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    start_epoch = checkpoint['epoch'] + 1

    # Continue training from start_epoch
    for epoch in range(start_epoch, 100):
        # ... training steps ...
        pass
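Saving a checkpoint every few epochs can eat through Drive storage on a long run. One option is to keep only the most recent few files. This is a rough sketch, assuming the `checkpoint_epoch_{epoch}.pth` naming from the save function above. `prune_checkpoints` is a hypothetical helper name, not a PyTorch API.

```python
import os
import re


def prune_checkpoints(directory, keep=3, pattern=r"checkpoint_epoch_(\d+)\.pth$"):
    """Delete all but the `keep` highest-numbered checkpoints; return removed paths."""
    numbered = []
    for name in os.listdir(directory):
        match = re.match(pattern, name)
        if match:
            numbered.append((int(match.group(1)), os.path.join(directory, name)))
    numbered.sort()  # lowest epoch first
    stale = numbered[:-keep] if keep > 0 else numbered
    removed = []
    for _, path in stale:
        os.remove(path)
        removed.append(path)
    return removed
```

Calling it right after each save keeps the folder at a fixed size. Files that don't match the pattern are left alone.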
3. Prevent Idle Disconnects
Colab will disconnect your session if the browser tab is inactive or there’s no user interaction for too long. To avoid this:
- Browser Console Script: Open your browser’s developer tools (F12), go to the Console tab, and run this snippet to simulate periodic clicks:
    function keepSessionAlive() {
      console.log("Keeping Colab session alive...");
      document.querySelector("#top-toolbar > colab-connect-button").click();
    }
    setInterval(keepSessionAlive, 60000); // Click every minute

- Colab Cell Loop: Run a background cell that prints updates to keep the session active:
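    import threading
    import time

    def heartbeat():
        while True:
            print(f"Session active at: {time.ctime()}")
            time.sleep(300)  # Print every 5 minutes

    # A bare while-True cell would block every cell queued after it, so use a daemon thread
    threading.Thread(target=heartbeat, daemon=True).start()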
Note: This only prevents idle timeouts. It won't bypass the 12-hour hard limit on the free tier.
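Since the hard limit can't be bypassed, it helps to stop cleanly before it hits instead of getting cut off mid-epoch. Below is a small sketch of a wall-clock budget. `SessionBudget` is a hypothetical class, and the 12-hour limit and 20-minute safety margin are assumptions you should adjust. The timer starts when the object is created, not when Colab allocated the VM, so create it right after connecting.

```python
import time


class SessionBudget:
    """Tracks elapsed wall-clock time against an assumed session limit."""

    def __init__(self, limit_hours=12.0, margin_minutes=20.0, clock=time.monotonic):
        self._clock = clock
        self._start = clock()
        # Usable time is the limit minus a safety margin for the final save
        self._budget_seconds = limit_hours * 3600 - margin_minutes * 60

    def remaining_seconds(self):
        return self._budget_seconds - (self._clock() - self._start)

    def should_stop(self):
        return self.remaining_seconds() <= 0
```

In a training loop, check `budget.should_stop()` at the end of each epoch. When it returns True, save a checkpoint and break out of the loop.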
4. Split Your Training Task
If upgrading isn’t an option, break your training into smaller chunks:
- Train in stages: For example, first train the base layers of a transfer learning model, save weights, then load and train the top layers in a new session.
- Use incremental learning: Train on subsets of your data sequentially, updating the model each time and saving checkpoints between subsets.
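To make the staged approach concrete, you can plan up front which epochs each session covers. Here is a quick sketch. `plan_sessions` is a hypothetical helper, and the number of epochs that fit in one session is something you have to measure for your own model.

```python
def plan_sessions(total_epochs, epochs_per_session):
    """Split training into (start_epoch, end_epoch) pairs, one pair per session."""
    if total_epochs < 0 or epochs_per_session <= 0:
        raise ValueError("total_epochs must be >= 0 and epochs_per_session > 0")
    return [
        (start, min(start + epochs_per_session, total_epochs))
        for start in range(0, total_epochs, epochs_per_session)
    ]
```

For example, `plan_sessions(100, 40)` gives `[(0, 40), (40, 80), (80, 100)]`. Each pair maps onto `initial_epoch` and `epochs` in Keras, or onto `range(start, end)` in a PyTorch loop.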
5. Always Mount Google Drive
Make sure to mount your Google Drive at the start of every session. This ensures your checkpoints, datasets, and trained models aren’t lost when the session disconnects. The code to mount is simple:
    from google.colab import drive
    drive.mount('/content/drive')
The question comes from Stack Exchange; asked by shivin saluja.




