How do you measure GPU execution time with profiling in OpenCL, SYCL, and DPC++? Confusion about using Intel PTI-GPU Device Activity Tracing, and a request for unified CPU/GPU performance profiling

阿华AIGC实验室

2026-4-29

How to Measure CPU & GPU Execution Time with Intel PTI-GPU Device Activity Tracing

Hey there! I see you're trying to profile both CPU (TBB) and GPU (SYCL) execution times, and you're confused about leveraging Intel PTI-GPU's Device Activity Tracing for the GPU side. Let's break this down and fix your code to get accurate, simultaneous measurements.

First, Fix Small Issues in Your Current Code

Before diving into profiling, let's address a couple of bugs that might cause unexpected behavior:

Loop Bound Error: In main(), your loop condition i < sizeof(functions) is incorrect. sizeof(functions) returns the total byte size of the array, not the number of elements, so the loop reads past the end of the array. Divide by the size of one element instead:
```
for (std::size_t i = 0; i < sizeof(functions) / sizeof(functions[0]); i++)
```
GPU Branch Missing Timing: Right now, your GPU path doesn't track execution time at all. We'll fix this by integrating PTI-GPU profiling.
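To see the corrected bound in isolation, here is a minimal standalone sketch. The names `identity`, `timesTwo`, `countBySizeof`, and `applyAll` are made up for illustration and are not part of your program; C++17's `std::size` is shown as an alternative that avoids the manual division.

```cpp
#include <cassert>
#include <cstddef>   // std::size_t
#include <iterator>  // std::size (C++17)

typedef int (*func)(int x);

// Hypothetical stand-ins for the functions in the original table.
inline int identity(int x) { return x; }
inline int timesTwo(int x) { return x * 2; }

// Classic idiom: total bytes divided by the bytes of one element.
template <class T, std::size_t N>
constexpr std::size_t countBySizeof(T (&arr)[N]) {
    return sizeof(arr) / sizeof(arr[0]);
}

// Chains every function in a small table, using std::size for the loop bound.
inline int applyAll(int x) {
    func table[] = { identity, timesTwo, timesTwo };
    assert(std::size(table) == countBySizeof(table));  // both counts must agree
    for (std::size_t i = 0; i < std::size(table); ++i) {
        x = table[i](x);
    }
    return x;
}
```

On a typical 64-bit build, `sizeof(table)` for a table of function pointers is eight times the element count, which is why the uncorrected loop runs far past the end.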

Using Intel PTI-GPU for GPU Execution Time Profiling

Intel PTI-GPU's Device Activity Tracing captures low-level GPU events (like kernel execution, memory transfers) with precise hardware timestamps. Here's how to integrate it into your SYCL code:

Step 1: Initialize PTI Profiler

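Include the PTI headers and set up a controller to monitor GPU activity. We'll use a callback function to capture kernel start/end events. Treat the pti::device_activity names used in this answer as illustrative and check them against the headers shipped with your PTI version.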

Step 2: Capture GPU Kernel Timestamps

The callback will record timestamps when a GPU kernel starts and finishes, allowing us to calculate exact execution time.
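As a rough sketch of the bookkeeping behind this step, the snippet below keeps one start/end pair per kernel instead of overwriting a single pair, which matters as soon as more than one kernel runs. `KernelRecord`, `kernelDurationNs`, and `totalKernelTimeNs` are hypothetical names for illustration, not PTI types, and the timestamps are assumed to be nanosecond counts taken from the same clock.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record of one traced kernel (not a PTI type).
struct KernelRecord {
    std::string name;
    std::uint64_t start_ns;
    std::uint64_t end_ns;
};

// Duration of a single kernel; returns 0 for incomplete or inverted records.
inline std::uint64_t kernelDurationNs(const KernelRecord& r) {
    return r.end_ns > r.start_ns ? r.end_ns - r.start_ns : 0;
}

// Sum of the per-kernel durations; idle gaps between kernels are not counted.
inline std::uint64_t totalKernelTimeNs(const std::vector<KernelRecord>& records) {
    std::uint64_t total = 0;
    for (const KernelRecord& r : records) {
        const std::uint64_t d = kernelDurationNs(r);
        assert(total + d >= total);  // guard against unsigned wrap-around
        total += d;
    }
    return total;
}
```

A callback would append a `KernelRecord` when a kernel finishes, and the reporting code would call `totalKernelTimeNs` once tracing has stopped.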

Modified Code with Integrated Profiling

Here's your updated code that measures both TBB CPU time (via chrono) and SYCL GPU time (via PTI-GPU):

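#include <sycl/sycl.hpp>  // SYCL 2020 header (replaces the deprecated <CL/sycl.hpp>)
#include <cstdint>        // uint64_t
#include <cstdio>         // printf
#include <iostream>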
#include <tbb/tbb.h>
#include <tbb/parallel_for.h>
#include <vector>
#include <string>
#include <queue>
#include <tbb/blocked_range.h>
#include <tbb/global_control.h>
#include <chrono>
#include <pti/pti.h>  // Include PTI headers

using namespace tbb;
using namespace sycl;

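// PTI profiler state, written by the callback below.
// NOTE: verify the pti::device_activity names against your installed PTI headers.
pti::device_activity::Controller* controller = nullptr;
uint64_t gpu_start_time = 0;  // timestamp of the last kernel start, assumed nanoseconds
uint64_t gpu_end_time = 0;    // timestamp of the last kernel end, assumed nanoseconds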

// Callback to capture GPU kernel start/end events
void device_activity_callback(const pti::device_activity::Event& event, void*) {
    switch (event.type) {
        case pti::device_activity::EventType::KERNEL_START:
            gpu_start_time = event.timestamp;
            break;
        case pti::device_activity::EventType::KERNEL_END:
            gpu_end_time = event.timestamp;
            break;
        default:
            break;
    }
}

template<class Tin, class Tout, class Function>
class Map {
private:
    Function fun;
public:
    Map() {}
    Map(Function f):fun(f) {}
    std::vector<Tout> operator()(bool use_tbb, std::vector<Tin>& v) {
        std::vector<Tout> r(v.size());
        if(use_tbb){
            // Measure CPU execution time with a monotonic clock
            auto begin = std::chrono::steady_clock::now();
            tbb::parallel_for(tbb::blocked_range<size_t>(0, v.size()), [&](const tbb::blocked_range<size_t>& t) {
                for (size_t index = t.begin(); index < t.end(); ++index){
                    r[index] = fun(v[index]);
                }
            });
            auto end = std::chrono::steady_clock::now();
            auto elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin);
            // Print nine decimals; %.3f would round a tiny workload to 0.000
            printf("CPU (TBB) Time measured: %.9f seconds.\n", elapsed.count() * 1e-9);
            return r;
        } else {
            sycl::queue gpuQueue{sycl::gpu_selector_v};  // SYCL 2020 selector; gpu_selector() is deprecated
            sycl::range<1> n_item{v.size()};
            sycl::buffer<Tin, 1> in_buffer(v.data(), n_item);
            sycl::buffer<Tout, 1> out_buffer(r.data(), n_item);

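            // Reset the timestamps so a previous run cannot leak into this measurement
            gpu_start_time = 0;
            gpu_end_time = 0;

            // Start PTI device activity tracing
            controller = new pti::device_activity::Controller(
                pti::device_activity::Domain::OPENCL,  // Must match the backend the SYCL runtime uses; DPC++ defaults to Level Zero on Intel GPUs
                device_activity_callback,
                nullptr
            );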

            // Submit GPU kernel
            gpuQueue.submit([&](sycl::handler& h){
                auto f = fun;
                sycl::accessor in_accessor(in_buffer, h, sycl::read_only);
                sycl::accessor out_accessor(out_buffer, h, sycl::write_only);
                h.parallel_for(n_item, [=](sycl::id<1> index) {
                    out_accessor[index] = f(in_accessor[index]);
                });
            }).wait();

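            // Stop tracing, then compute the kernel duration from the recorded timestamps
            delete controller;
            controller = nullptr;
            double gpu_elapsed = (gpu_end_time - gpu_start_time) * 1e-9;  // Convert nanoseconds to seconds
            // Print nine decimals; %.3f would round a short kernel to 0.000
            printf("GPU (SYCL) Time measured: %.9f seconds.\n", gpu_elapsed);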
        }
        return r;
    }
};

template<class Tin, class Tout, class Function>
Map<Tin, Tout, Function> make_map(Function f) {
    return Map<Tin, Tout, Function>(f);
}

typedef int(*func)(int x);

// Define different functions
auto function = [](int x){ return x; };
auto functionTimesTwo = [](int x){ return (x*2); };
auto functionDivideByTwo = [](int x){ return (x/2); };
auto lambdaFunction = [](int x){return (++x);};

int main(int argc, char *argv[]) {
    std::vector<int> v = {1,2,3,4,5,6,7,8,9};
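    // Caveat: storing the lambdas as plain function pointers works for the TBB path,
    // but SYCL device code may not call through function pointers, so the GPU path
    // needs each callable's concrete type (e.g. pass each lambda to make_map directly).
    func functions[] = { function, functionTimesTwo, functionDivideByTwo, lambdaFunction };

    // Fixed loop bound: divide the array's byte size by the size of one element
    for (std::size_t i = 0; i < sizeof(functions) / sizeof(functions[0]); i++) {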
        auto m1 = make_map<int, int>(functions[i]);
        
        // Test CPU path
        std::cout << "\n--- CPU (TBB) Execution for Function " << i << " ---" << std::endl;
        std::vector<int> r_cpu = m1(true, v);
        for(auto &e: r_cpu) {
            std::cout << e << " ";
        }
        std::cout << std::endl;

        // Test GPU path
        std::cout << "\n--- GPU (SYCL) Execution for Function " << i << " ---" << std::endl;
        std::vector<int> r_gpu = m1(false, v);
        for(auto &e: r_gpu) {
            std::cout << e << " ";
        }
        std::cout << std::endl;
    }
    return 0;
}

Key Notes for Compilation & Usage

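Link PTI Libraries: SYCL code has to be built with a SYCL-capable compiler, so plain g++ will not work here. With the oneAPI DPC++ compiler, link TBB and the PTI library from your installation. The library name below assumes the PTI SDK's libpti_view; adjust it if yours differs:

icpx -fsycl -std=c++17 your_code.cpp -o profiling_app -ltbb -lpti_view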

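Timestamp Units: The code assumes PTI's timestamps are in nanoseconds and converts them to seconds for readability.
Backend Choice: The tracing domain passed to the controller has to match the backend the SYCL runtime actually uses. DPC++ defaults to Level Zero on Intel GPUs, so the OPENCL domain only sees the kernel if you run on the OpenCL backend, for example by setting ONEAPI_DEVICE_SELECTOR=opencl:gpu.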
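The unit conversion can also be left to `std::chrono` rather than a hand-written `1e-9` factor. This is a minimal sketch, assuming the raw timestamps really are nanosecond counts (worth confirming for your PTI version); `toSeconds` and `toMicroseconds` are hypothetical helper names.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Converts a start/end pair of nanosecond timestamps to fractional seconds.
inline double toSeconds(std::uint64_t start_ns, std::uint64_t end_ns) {
    assert(end_ns >= start_ns);  // an inverted pair means the trace is incomplete
    const std::chrono::nanoseconds elapsed(
        static_cast<std::chrono::nanoseconds::rep>(end_ns - start_ns));
    return std::chrono::duration<double>(elapsed).count();
}

// Same interval in microseconds, often a more readable unit for short kernels.
inline double toMicroseconds(std::uint64_t start_ns, std::uint64_t end_ns) {
    assert(end_ns >= start_ns);
    const std::chrono::nanoseconds elapsed(
        static_cast<std::chrono::nanoseconds::rep>(end_ns - start_ns));
    return std::chrono::duration<double, std::micro>(elapsed).count();
}
```

For a nine-element vector the kernel will likely finish in microseconds, so the microsecond form tends to print something more useful than a seconds value full of leading zeros.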

Why This Works

CPU Measurement: We keep using std::chrono for TBB loops, which gives accurate wall-clock time for CPU execution.
GPU Measurement: PTI's Device Activity Tracing hooks directly into the GPU runtime to capture kernel start/end events from hardware, providing precise timestamps that reflect actual GPU execution time (not just host-side waiting time).
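The CPU-side measurement can be factored into a small reusable helper. This is a sketch rather than a drop-in replacement: `timeSeconds` is a hypothetical name, and it uses `std::chrono::steady_clock`, which is monotonic and therefore generally the safer choice for measuring intervals.

```cpp
#include <cassert>
#include <chrono>
#include <utility>

// Runs `work` once and returns the elapsed wall-clock time in seconds.
template <class Work>
double timeSeconds(Work&& work) {
    const auto begin = std::chrono::steady_clock::now();
    std::forward<Work>(work)();
    const auto end = std::chrono::steady_clock::now();
    const double seconds = std::chrono::duration<double>(end - begin).count();
    assert(seconds >= 0.0);  // steady_clock never goes backwards
    return seconds;
}
```

Wrapping the tbb::parallel_for call in a lambda and passing it to `timeSeconds` measures the same interval as the inline chrono code. Used around a SYCL submit-and-wait, the same helper would report host-side wall time, which includes submission overhead and is therefore not a substitute for device timestamps.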

The question originates from Stack Exchange; it was asked by Sahar Sa.