GPU Device-Level Event Tracing: Collection Requirements and a CPU Scenario Analogy

Ahua AIGC Lab (阿华AIGC实验室)

2026-5-21

GPU Device-Level Event Tracing with Timestamping: A CPU Scheduler Analog

Let me start by mapping your CPU scheduler scenario onto the GPU world, since the analogy frames exactly what we're chasing. On CPUs, you're tracking interrupts, kernel processes, and user process preemption via kernel patches. For GPUs, the equivalent "preemption" and device-level events include:

User-space kernel launches from different processes competing for streaming multiprocessors (SMs, NVIDIA) or compute units (CUs, AMD)
Driver-initiated background tasks (e.g., memory page migrations, ECC error handling)
Hardware interrupts (power management triggers, thermal throttling, hardware errors)
vGPU context switches in virtualized environments

Here are the most practical approaches to capture these events with precise timestamps, ordered by ease of implementation:

1. Vendor-Native Profiling Tools (No Kernel Patching Required)

These tools capture GPU events out of the box and typically report timestamps at nanosecond resolution. How accurate those timestamps are relative to the CPU clock depends on the tool's own clock correlation:

NVIDIA Nsight Systems: Run nsys profile --trace=cuda,nvtx,osrt to capture CUDA API calls, kernel launches, memory copies, NVTX ranges, and OS runtime calls on one timeline with start/end timestamps. The GUI shows how GPU work from the profiled processes interleaves on the device. GPU context-switch tracing is a separate option (--gpuctxsw in recent releases; check nsys profile --help for your version) and, like system-wide collection, may require elevated privileges.
AMD ROCtracer/ROCprof: Use rocprof --hip-trace or the ROCtracer API to capture HIP API calls and kernel dispatches with timestamps. Whether activity from other processes on the same device is visible depends on the tool version and on how tracing is attached, so verify that before relying on it to spot cross-process contention.
Intel VTune Profiler: For Intel Xe GPUs, the GPU-oriented analysis types (GPU Offload and GPU Compute/Media Hotspots) collect GPU hardware metrics and runtime activity and show timestamped GPU tasks next to CPU activity. The exact option names vary between VTune releases.
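
Once a trace has been exported from any of these tools, cross-process contention can be checked by looking for kernels from different PIDs whose execution intervals overlap. The following is a minimal sketch, assuming the trace has already been reduced to (pid, start_ns, end_ns) records. The record layout and the function name are hypothetical and are not the export format of Nsight Systems, rocprof, or VTune.

```python
from typing import List, NamedTuple, Tuple


class KernelInterval(NamedTuple):
    pid: int        # owning process (hypothetical field layout)
    start_ns: int   # kernel start on the device timeline
    end_ns: int     # kernel end on the device timeline


def cross_process_overlaps(
    intervals: List[KernelInterval],
) -> List[Tuple[KernelInterval, KernelInterval, int]]:
    """Return (earlier, later, overlap_ns) for overlapping kernels of different PIDs."""
    ordered = sorted(intervals, key=lambda k: k.start_ns)
    active: List[KernelInterval] = []
    overlaps = []
    for cur in ordered:
        # Drop kernels that had already finished when the current one started.
        active = [k for k in active if k.end_ns > cur.start_ns]
        for prev in active:
            if prev.pid != cur.pid:
                overlap_ns = min(prev.end_ns, cur.end_ns) - cur.start_ns
                overlaps.append((prev, cur, overlap_ns))
        active.append(cur)
    return overlaps


# Made-up sample data for illustration only.
trace = [
    KernelInterval(pid=100, start_ns=0, end_ns=500),
    KernelInterval(pid=200, start_ns=300, end_ns=900),
    KernelInterval(pid=100, start_ns=600, end_ns=700),
    KernelInterval(pid=200, start_ns=1000, end_ns=1200),
]
for earlier, later, ns in cross_process_overlaps(trace):
    print(f"pid {earlier.pid} and pid {later.pid} overlap for {ns} ns")
```

An overlap here only says that two processes had work in flight on the device at the same time. Whether that was true concurrency or time-slicing has to be read from the tool's own context-switch data.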

2. Programmatic Tracing (Custom Event Markers)

If you need to embed custom timestamped markers alongside device-level events, use vendor-specific APIs:

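NVIDIA CUDA Events: cudaEventRecord and cudaEventElapsedTime only time work inside a single process. To place those measurements next to global device state, poll NVML (for example nvmlDeviceGetComputeRunningProcesses or nvmlDeviceGetUtilizationRates) and log each sample with a CPU timestamp. Note that nvmlDeviceGetClockInfo reports clock frequencies in MHz, not timestamps, so it cannot synchronize GPU and CPU clocks on its own. For timestamp correlation, CUPTI's cuptiGetTimestamp and cuptiDeviceGetTimestamp are the closer fit.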
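AMD libdrm/ROCm APIs: amdgpu_query_info belongs to libdrm's amdgpu interface rather than ADL. Queried with AMDGPU_INFO_TIMESTAMP, it returns the GPU timestamp counter, which you can log next to HIP event markers to line up your process's work with other device activity.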
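
Correlating per-process events with global device state comes down to a timestamp lookup: for each kernel event, find the most recent device-state sample taken before it. The following is a minimal sketch, assuming a separate poller has already logged samples against the same CPU clock as the kernel events. The DeviceSample fields and the state_at helper are hypothetical placeholders for whatever a management API such as NVML reports.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Optional


class DeviceSample(NamedTuple):
    t_ns: int            # CPU-clock time at which the sample was taken
    sm_clock_mhz: int    # placeholder for a polled device metric
    process_count: int   # placeholder: processes on the device at that time


def state_at(samples: List[DeviceSample], t_ns: int) -> Optional[DeviceSample]:
    """Return the latest sample taken at or before t_ns (samples sorted by t_ns)."""
    times = [s.t_ns for s in samples]
    idx = bisect_right(times, t_ns) - 1
    return samples[idx] if idx >= 0 else None


# Made-up sample data for illustration only.
samples = [
    DeviceSample(t_ns=0, sm_clock_mhz=1400, process_count=1),
    DeviceSample(t_ns=1000, sm_clock_mhz=1400, process_count=2),
    DeviceSample(t_ns=2000, sm_clock_mhz=1200, process_count=2),
]
kernel_starts_ns = [1500, 2000]
for t in kernel_starts_ns:
    print(t, state_at(samples, t))
```

The lookup is only as fine-grained as the polling interval, so short-lived contention between two samples will not show up.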

3. Kernel-Level Hooks/Patching (Your CPU Patch Equivalent)

If you need full visibility into every device-level event (including internal driver tasks and hardware interrupts), you’ll need to hook or patch the GPU kernel driver:

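Linux GPU Driver Hooks: For NVIDIA's nvidia.ko or AMD's amdgpu.ko, write a loadable kernel module (or attach kprobes) to hook key scheduling functions. For example:
- On NVIDIA, hook the job-submission path to log when a GPU job is submitted, along with its PID, task ID, and CPU timestamp. The name nv_sched_submit_job is illustrative: the symbols actually available depend on the driver version and on whether the proprietary or open kernel modules are loaded, so confirm them against /proc/kallsyms first.
- On AMD, hook amdgpu_job_submit in the in-tree amdgpu driver to track job queuing and execution timestamps. The existing amdgpu and DRM GPU scheduler tracepoints may already cover this without a custom module.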
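Timestamp Synchronization: GPU hardware timestamps must be aligned with CPU timestamps before events from the two sides can be ordered. Read the GPU clock through a vendor interface (on NVIDIA, for example, CUPTI's cuptiDeviceGetTimestamp or the %globaltimer PTX register), sample it together with a CPU clock at regular intervals, and correct for both offset and drift.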
Note: This approach requires deep knowledge of the GPU driver's internal structure and is likely to break when the driver is updated, so plan for maintenance overhead.
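
The offset-and-drift correction needed for timestamp alignment can be sketched as a straight-line fit over paired clock readings. This is a minimal sketch, assuming you have already collected (gpu_ticks, cpu_ns) pairs by reading both clocks back to back. The function names are hypothetical, and a least-squares line is one simple choice, not what any particular driver does internally.

```python
from typing import List, Tuple


def fit_clock_mapping(pairs: List[Tuple[int, int]]) -> Tuple[float, float]:
    """Least-squares fit of cpu_ns ~= slope * gpu_ticks + intercept."""
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two (gpu_ticks, cpu_ns) pairs")
    mean_g = sum(g for g, _ in pairs) / n
    mean_c = sum(c for _, c in pairs) / n
    var_g = sum((g - mean_g) ** 2 for g, _ in pairs)
    if var_g == 0:
        raise ValueError("gpu_ticks values must differ")
    cov = sum((g - mean_g) * (c - mean_c) for g, c in pairs)
    slope = cov / var_g                    # drift: CPU ns per GPU tick
    intercept = mean_c - slope * mean_g    # offset between the two clocks
    return slope, intercept


def gpu_to_cpu_ns(gpu_ticks: int, slope: float, intercept: float) -> float:
    """Map a GPU timestamp onto the CPU timeline using the fitted line."""
    return slope * gpu_ticks + intercept


# Made-up readings: the GPU clock runs 0.1% slow relative to the CPU clock.
pairs = [(0, 5000), (1000, 6001), (2000, 7002)]
slope, intercept = fit_clock_mapping(pairs)
print(round(gpu_to_cpu_ns(1500, slope, intercept), 3))
```

Real timestamps are large (often around 10^18 ns), so subtract a base value from both clocks before fitting to avoid losing precision in floating point, and refit periodically because drift is not constant.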

4. Virtualized vGPU Tracing

If you're working with vGPUs (e.g., NVIDIA vGPU, AMD MxGPU), you’ll need to combine hypervisor-level tracing with vGPU driver hooks:

For KVM-based vGPUs, modify the vGPU front-end/back-end driver code to log context switches between VMs, along with timestamped event data.
Use host-side tools such as perf (for example perf kvm) to correlate CPU-level VM scheduling with vGPU task execution.
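
Once context switches between VMs are being logged, a first useful summary is how long each VM held the GPU. The following is a minimal sketch, assuming the modified driver emits one (timestamp_ns, vm_id) record per switch. That log format, the VM names, and the function name are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def gpu_time_per_vm(
    switches: List[Tuple[int, str]], trace_end_ns: int
) -> Dict[str, int]:
    """Sum how long each VM held the GPU.

    switches: (timestamp_ns, vm_id) records, each marking the moment the GPU
    was switched to vm_id. The list must be sorted by timestamp.
    """
    held: Dict[str, int] = defaultdict(int)
    # Pair each switch with the next one; the last slice ends at trace_end_ns.
    following = switches[1:] + [(trace_end_ns, "")]
    for (t, vm), (t_next, _) in zip(switches, following):
        if t_next < t:
            raise ValueError("switch log is not sorted by timestamp")
        held[vm] += t_next - t
    return dict(held)


# Made-up switch log for illustration only.
switch_log = [(0, "vm-a"), (400, "vm-b"), (1000, "vm-a"), (1300, "vm-b")]
totals = gpu_time_per_vm(switch_log, trace_end_ns=2000)
for vm, ns in sorted(totals.items()):
    print(f"{vm}: {ns} ns")
```

Comparing these totals with the vCPU run times that perf reports for the same VMs is one way to see whether a VM was waiting on the GPU or on the CPU.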

Final Notes

Start with vendor tools if you can: they're maintained, accurate, and require no low-level coding. Reserve kernel-level patching for cases where you need visibility into events that tools don't expose (e.g., internal driver interrupt handling). Always validate timestamp accuracy by syncing GPU and CPU clocks to avoid misinterpreting event sequences.

The question behind this content was originally asked on Stack Exchange by user complextea.