How Do You Autoscale GCP AI Platform Unified Based on GPU Usage?

Ahua AIGC Lab (阿华AIGC实验室)

2026-4-29

Great question. This is a common gotcha with Vertex AI (formerly AI Platform Unified) prediction services that use GPUs. The default autoscaling logic prioritizes CPU utilization, which can leave you with maxed-out GPUs and no new nodes spinning up. Here's how to set up GPU-based autoscaling to fix this:

Solution: GPU-Driven Autoscaling for Vertex AI Predictions

Step 1: Confirm GPU Metrics Are Being Tracked
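Vertex AI automatically exports GPU metrics to Cloud Monitoring once you deploy a GPU-backed prediction service. The metric used in this walkthrough is aiplatform.googleapis.com/prediction/accelerator/utilization. Confirm the exact name before relying on it: open Monitoring > Metrics Explorer in the Cloud Console and filter for accelerator metrics on your endpoint. Vertex AI’s documentation lists GPU duty cycle as aiplatform.googleapis.com/prediction/online/accelerator/duty_cycle, so use whichever name actually appears for your project.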
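
If you would rather check from a script than from the console, the lookup can be sketched as below. This is a minimal sketch, assuming the google-cloud-monitoring client library is installed and application default credentials are configured. The helper names descriptor_filter and list_prediction_metrics are hypothetical, and the metric prefix is an assumption you may need to adjust.

```python
def descriptor_filter(prefix: str) -> str:
    """Build a Cloud Monitoring filter matching metric types that start with `prefix`."""
    return f'metric.type = starts_with("{prefix}")'


def list_prediction_metrics(
    project_id: str,
    prefix: str = "aiplatform.googleapis.com/prediction",  # assumed prefix
) -> list[str]:
    """Return the sorted metric type names under `prefix` for a project."""
    # Imported lazily so descriptor_filter works without the client library.
    from google.cloud import monitoring_v3

    client = monitoring_v3.MetricServiceClient()
    descriptors = client.list_metric_descriptors(
        request={
            "name": f"projects/{project_id}",
            "filter": descriptor_filter(prefix),
        }
    )
    return sorted(d.type for d in descriptors)
```

Calling list_prediction_metrics("<your-project-id>") should return the metric names available to you; look for the accelerator entry and use that exact string in the steps that follow.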

Step 2: Deploy a Custom Horizontal Pod Autoscaler (HPA)
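The default autoscaling policy doesn’t use GPU metrics, so you’ll need to define a custom HPA that targets GPU utilization. An HPA cannot read Cloud Monitoring metrics on its own; the cluster needs a metrics adapter (for example, the Custom Metrics Stackdriver Adapter) that exposes them through the Kubernetes metrics APIs. Create a YAML file (e.g., gpu-autoscaler.yaml) with the following configuration, then apply it with kubectl apply -f gpu-autoscaler.yaml: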

apiVersion: autoscaling/v2 # autoscaling/v2beta2 was removed in Kubernetes 1.26
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-prediction-autoscaler
  namespace: <your-namespace> # Usually "default" unless you set a custom one
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: <your-prediction-deployment> # Get this via `kubectl get deployments`
  minReplicas: 1
  maxReplicas: 10 # Adjust to match your GPU quota and workload needs
  metrics:
  - type: Pods
    pods:
      metric:
        name: aiplatform.googleapis.com/prediction/accelerator/utilization
      target:
        type: AverageValue
        averageValue: 70 # Target GPU utilization percentage (tweak as needed)

Replace the placeholders with your actual resource names. To find your prediction deployment name, run kubectl get deployments in the GKE cluster linked to your Vertex AI service.
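
To get a feel for how the target value of 70 translates into replica counts, here is a rough sketch of the scaling rule described in the Kubernetes HPA documentation: desired = ceil(current replicas × current metric / target), skipped when the ratio is within a tolerance (0.1 by default). The function name desired_replicas is hypothetical, and the real controller also applies stabilization windows and readiness checks that this sketch leaves out.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_avg: float,
    target_avg: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Approximate the HPA replica calculation for a single metric."""
    ratio = current_avg / target_avg
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: the HPA leaves the replica count alone.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp to the minReplicas / maxReplicas bounds from the HPA spec.
    return max(min_replicas, min(max_replicas, desired))
```

For example, 2 replicas averaging 95 against a target of 70 gives 3 replicas, while 3 replicas averaging 72 stay at 3 because the ratio falls inside the tolerance band.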

Step 3: Validate and Test the Autoscaler
After applying the HPA, check its status with kubectl describe hpa gpu-prediction-autoscaler. You should see the GPU metric listed under the "Metrics" section. Then simulate a workload that pushes GPU usage above your target threshold. You should observe new replicas spinning up within 2-5 minutes as the autoscaler reacts.
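
During a load test it can be easier to read the HPA status as JSON than to scan the describe output. The sketch below assumes you save the output of kubectl get hpa gpu-prediction-autoscaler -o json and load it into a dict; the function name summarize_hpa is hypothetical, and it only reads Pods-type metrics, matching the configuration above.

```python
def summarize_hpa(hpa: dict) -> dict:
    """Extract replica counts and current Pods-metric values from an HPA object."""
    status = hpa.get("status", {})
    metrics = {}
    # currentMetrics can be missing or null before the first metric sample arrives.
    for entry in status.get("currentMetrics") or []:
        if entry.get("type") == "Pods":
            pods = entry["pods"]
            metrics[pods["metric"]["name"]] = pods["current"].get("averageValue")
    return {
        "current_replicas": status.get("currentReplicas"),
        "desired_replicas": status.get("desiredReplicas"),
        "metrics": metrics,
    }
```

If desired_replicas stays equal to current_replicas while the metric value is well above the target, check maxReplicas and GPU quota first. An empty metrics dict usually means the HPA is not receiving the metric at all.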

Important Considerations
- Ensure your GKE cluster has enough GPU quota to support the maxReplicas you set. If not, you’ll need to request a quota increase from GCP.
- The HPA relies on Cloud Monitoring metrics, so confirm your cluster has the necessary IAM permissions to access these metrics (the default Vertex AI setup includes this, but double-check if you hit permission errors).
- You can combine GPU and CPU metrics in the HPA if you want dual triggers: add a second metric block for CPU utilization alongside the GPU one. With multiple metrics, the HPA computes a replica count for each and uses the highest.
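
If your model runs on a managed Vertex AI endpoint where you have no kubectl access to the serving cluster, the same GPU and CPU triggers can be set through the endpoint's own deployment settings instead. The Vertex AI REST API accepts an autoscalingMetricSpecs list inside dedicatedResources when a model is deployed. The sketch below only builds that payload as a plain dict. The function name dedicated_resources, the machine and accelerator types, and the target values are illustrative assumptions, not values from your setup.

```python
GPU_DUTY_CYCLE = "aiplatform.googleapis.com/prediction/online/accelerator/duty_cycle"
CPU_UTILIZATION = "aiplatform.googleapis.com/prediction/online/cpu/utilization"


def dedicated_resources(
    machine_type: str,
    accelerator_type: str,
    accelerator_count: int,
    min_replicas: int,
    max_replicas: int,
    gpu_target: int = 70,  # illustrative target, in percent
    cpu_target: int = 60,  # illustrative target, in percent
) -> dict:
    """Build a dedicatedResources payload with GPU and CPU autoscaling targets."""
    if not 1 <= min_replicas <= max_replicas:
        raise ValueError("need 1 <= min_replicas <= max_replicas")
    for target in (gpu_target, cpu_target):
        if not 0 < target <= 100:
            raise ValueError("targets are percentages in (0, 100]")
    return {
        "machineSpec": {
            "machineType": machine_type,
            "acceleratorType": accelerator_type,
            "acceleratorCount": accelerator_count,
        },
        "minReplicaCount": min_replicas,
        "maxReplicaCount": max_replicas,
        "autoscalingMetricSpecs": [
            {"metricName": GPU_DUTY_CYCLE, "target": gpu_target},
            {"metricName": CPU_UTILIZATION, "target": cpu_target},
        ],
    }
```

A call such as dedicated_resources("n1-standard-8", "NVIDIA_TESLA_T4", 1, 1, 10) produces a dict you can place under deployedModel.dedicatedResources in a deployModel request. Check the current API reference before using it, since these settings are fixed at deployment time and changing them means redeploying the model.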