How do I measure GPU execution time with Profiling + OpenCL + SYCL + DPC++? Confusion about using Intel PTI-GPU Device Activity Tracing, and a request for unified CPU/GPU performance profiling
Hey there! I see you're trying to profile both CPU (TBB) and GPU (SYCL) execution times, and you're confused about leveraging Intel PTI-GPU's Device Activity Tracing for the GPU side. Let's break this down and fix your code to get accurate, simultaneous measurements.
First, Fix Small Issues in Your Current Code
Before diving into profiling, let's address a couple of bugs that might cause unexpected behavior:
- Loop Bound Error: In main(), your loop condition i < sizeof(functions) is incorrect. sizeof(functions) returns the total byte size of the array, not the number of elements. Change it to: for(int i = 0; i < sizeof(functions)/sizeof(func); i++)
- GPU Branch Missing Timing: Right now, your GPU path doesn't track execution time at all. We'll fix this by integrating PTI-GPU profiling.
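To make the loop-bound point concrete, here is a minimal sketch in standard C++17. The three functions and the kFunctions table are hypothetical stand-ins for the entries in your functions[] array; std::size is the standard library's way to get the element count of a built-in array.

```cpp
#include <cstddef>   // std::size_t
#include <iterator>  // std::size (C++17)

typedef int (*func)(int x);

// Hypothetical stand-ins for the entries of the original functions[] array.
inline int identity(int x) { return x; }
inline int times_two(int x) { return x * 2; }
inline int halve(int x) { return x / 2; }

static const func kFunctions[] = {identity, times_two, halve};

// Byte size of the table: what the buggy loop condition compared against.
inline std::size_t table_bytes() { return sizeof(kFunctions); }

// Element count of the table: the correct loop bound.
inline std::size_t table_elements() { return std::size(kFunctions); }

// Applies every function in the table to x and sums the results,
// iterating over elements rather than bytes.
inline int apply_all(int x) {
    int sum = 0;
    for (std::size_t i = 0; i < std::size(kFunctions); ++i) {
        sum += kFunctions[i](x);
    }
    return sum;
}
```

On a typical 64-bit platform table_bytes() is 24 while table_elements() is 3, which is why the original condition ran far past the end of the array.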
Using Intel PTI-GPU for GPU Execution Time Profiling
Intel PTI-GPU's Device Activity Tracing captures low-level GPU events (like kernel execution, memory transfers) with precise hardware timestamps. Here's how to integrate it into your SYCL code:
Step 1: Initialize PTI Profiler
Include PTI headers and set up a controller to monitor GPU activity. We'll use a callback function to capture kernel start/end events.
Step 2: Capture GPU Kernel Timestamps
The callback will record timestamps when a GPU kernel starts and finishes, allowing us to calculate exact execution time.
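The start/end pairing described in these two steps can be sketched without any PTI dependency. Everything below (EventType, Event, KernelTimer) is a hypothetical model for illustration only and is not the PTI API. It stores one duration per completed kernel, so submitting several kernels does not overwrite earlier measurements.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical event model for illustration only; these are NOT PTI types.
enum class EventType { KernelStart, KernelEnd, Other };

struct Event {
    EventType type;
    std::uint64_t timestamp_ns;
};

// Pairs each kernel-start event with the next kernel-end event and keeps
// one duration per completed kernel.
class KernelTimer {
public:
    void on_event(const Event& e) {
        switch (e.type) {
        case EventType::KernelStart:
            pending_start_ = e.timestamp_ns;
            has_start_ = true;
            break;
        case EventType::KernelEnd:
            // Skip an end that has no matching start or that precedes it.
            if (has_start_ && e.timestamp_ns >= pending_start_) {
                durations_ns_.push_back(e.timestamp_ns - pending_start_);
            }
            has_start_ = false;
            break;
        default:
            break;  // Other event kinds do not affect kernel timing.
        }
    }

    // Sum of all recorded kernel durations, in nanoseconds.
    std::uint64_t total_ns() const {
        std::uint64_t sum = 0;
        for (std::uint64_t d : durations_ns_) sum += d;
        return sum;
    }

    const std::vector<std::uint64_t>& durations_ns() const {
        return durations_ns_;
    }

private:
    std::uint64_t pending_start_ = 0;
    bool has_start_ = false;
    std::vector<std::uint64_t> durations_ns_;
};
```

If the real tracing callback runs on a different thread than the one reading the results, access to the timer would also need synchronization; this sketch assumes single-threaded use.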
Modified Code with Integrated Profiling
Here's your updated code that measures both TBB CPU time (via chrono) and SYCL GPU time (via PTI-GPU):
    #include <CL/sycl.hpp>
    #include <iostream>
    #include <tbb/tbb.h>
    #include <tbb/parallel_for.h>
    #include <vector>
    #include <string>
    #include <queue>
    #include <tbb/blocked_range.h>
    #include <tbb/global_control.h>
    #include <chrono>
    #include <cstdint>   // uint64_t
    #include <cstdio>    // printf
    #include <pti/pti.h> // Include PTI headers

    using namespace tbb;
    using namespace sycl;

    // PTI Profiler variables.
    // The pti::device_activity names are kept as given in this answer;
    // check them against the headers of your PTI-GPU installation.
    pti::device_activity::Controller* controller = nullptr;
    uint64_t gpu_start_time = 0;
    uint64_t gpu_end_time = 0;

    // Callback to capture GPU kernel start/end events
    void device_activity_callback(const pti::device_activity::Event& event, void*) {
        switch (event.type) {
            case pti::device_activity::EventType::KERNEL_START:
                gpu_start_time = event.timestamp;
                break;
            case pti::device_activity::EventType::KERNEL_END:
                gpu_end_time = event.timestamp;
                break;
            default:
                break;
        }
    }

    template<class Tin, class Tout, class Function>
    class Map {
    private:
        Function fun;
    public:
        Map() {}
        Map(Function f) : fun(f) {}

        std::vector<Tout> operator()(bool use_tbb, std::vector<Tin>& v) {
            std::vector<Tout> r(v.size());
            if (use_tbb) {
                // Measure CPU execution time with chrono
                auto begin = std::chrono::high_resolution_clock::now();
                tbb::parallel_for(tbb::blocked_range<size_t>(0, v.size()),
                    [&](tbb::blocked_range<size_t> t) {
                        for (size_t index = t.begin(); index < t.end(); ++index) {
                            r[index] = fun(v[index]);
                        }
                    });
                auto end = std::chrono::high_resolution_clock::now();
                auto elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin);
                printf("CPU (TBB) Time measured: %.3f seconds.\n", elapsed.count() * 1e-9);
                return r;
            } else {
                sycl::queue gpuQueue{sycl::gpu_selector()};
                sycl::range<1> n_item{v.size()};
                sycl::buffer<Tin, 1> in_buffer(v.data(), n_item);
                sycl::buffer<Tout, 1> out_buffer(r.data(), n_item);

                // Reset the timestamps so a previous call's values are not reused
                gpu_start_time = 0;
                gpu_end_time = 0;

                // Start PTI device activity tracing
                controller = new pti::device_activity::Controller(
                    pti::device_activity::Domain::OPENCL, // traces the OpenCL backend only
                    device_activity_callback,
                    nullptr
                );

                // Submit GPU kernel.
                // Caveat: SYCL device code cannot call through function pointers,
                // so Function must be a lambda or functor type, not func.
                gpuQueue.submit([&](sycl::handler& h) {
                    auto f = fun;
                    sycl::accessor in_accessor(in_buffer, h, sycl::read_only);
                    sycl::accessor out_accessor(out_buffer, h, sycl::write_only);
                    h.parallel_for(n_item, [=](sycl::id<1> index) {
                        out_accessor[index] = f(in_accessor[index]);
                    });
                }).wait();

                // Stop tracing and calculate GPU time
                delete controller;
                controller = nullptr;
                double gpu_elapsed = (gpu_end_time - gpu_start_time) * 1e-9; // Convert nanoseconds to seconds
                printf("GPU (SYCL) Time measured: %.3f seconds.\n", gpu_elapsed);
            }
            // The buffers are destroyed at the end of the else block,
            // which copies the results back into r.
            return r;
        }
    };

    template<class Tin, class Tout, class Function>
    Map<Tin, Tout, Function> make_map(Function f) {
        return Map<Tin, Tout, Function>(f);
    }

    typedef int(*func)(int x);

    // Define different functions
    auto function = [](int x){ return x; };
    auto functionTimesTwo = [](int x){ return (x*2); };
    auto functionDivideByTwo = [](int x){ return (x/2); };
    auto lambdaFunction = [](int x){ return (++x); };

    int main(int argc, char *argv[]) {
        std::vector<int> v = {1,2,3,4,5,6,7,8,9};
        func functions[] = {
            function,
            functionTimesTwo,
            functionDivideByTwo,
            lambdaFunction
        };

        // Fixed loop bound to avoid out-of-bounds access
        for (std::size_t i = 0; i < sizeof(functions)/sizeof(func); i++) {
            auto m1 = make_map<int, int>(functions[i]);

            // Test CPU path
            std::cout << "\n--- CPU (TBB) Execution for Function " << i << " ---" << std::endl;
            std::vector<int> r_cpu = m1(true, v);
            for (auto &e : r_cpu) {
                std::cout << e << " ";
            }
            std::cout << std::endl;

            // Test GPU path
            std::cout << "\n--- GPU (SYCL) Execution for Function " << i << " ---" << std::endl;
            std::vector<int> r_gpu = m1(false, v);
            for (auto &e : r_gpu) {
                std::cout << e << " ";
            }
            std::cout << std::endl;
        }
        return 0;
    }
Key Notes for Compilation & Usage
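- Link PTI Libraries: When compiling, link against the Intel PTI-GPU libraries. Plain g++ cannot compile SYCL kernels, so build with the oneAPI DPC++ compiler and -fsycl. For example (the PTI library names depend on how PTI was built, so adjust them to your installation):
icpx -fsycl -std=c++17 your_code.cpp -o profiling_app -ltbb -lpti_device_activity -lpti_common
- Timestamp Units: PTI's timestamps are in nanoseconds, so we convert them to seconds for readability.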
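- SYCL Compatibility: The OPENCL domain in PTI's controller only sees kernels that run through the OpenCL backend. Recent DPC++ releases default to the Level Zero backend on Intel GPUs, so select the OpenCL backend explicitly (for example with ONEAPI_DEVICE_SELECTOR=opencl:gpu) before tracing with this domain.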
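The nanosecond-to-second conversion from the notes above is worth guarding, because subtracting unsigned timestamps wraps around if the end value is missing or smaller than the start. Below is a minimal sketch in standard C++17. The helper name elapsed_seconds is hypothetical, and treating a pair of zeros as "nothing captured" is an assumption that matches timestamps initialized to 0.

```cpp
#include <chrono>
#include <cstdint>
#include <optional>

// Converts two nanosecond timestamps to elapsed seconds.
// Returns std::nullopt when no interval was captured (both values still 0)
// or when the end precedes the start, which would otherwise wrap around
// in unsigned arithmetic.
inline std::optional<double> elapsed_seconds(std::uint64_t start_ns,
                                             std::uint64_t end_ns) {
    if ((start_ns == 0 && end_ns == 0) || end_ns < start_ns) {
        return std::nullopt;
    }
    const std::chrono::nanoseconds delta(
        static_cast<std::chrono::nanoseconds::rep>(end_ns - start_ns));
    // duration<double> has a period of one second, so count() is in seconds.
    return std::chrono::duration<double>(delta).count();
}
```

A caller can then print the time only when elapsed_seconds(gpu_start_time, gpu_end_time) has a value, and report that no kernel events were captured otherwise.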
Why This Works
- CPU Measurement: We keep using std::chrono for TBB loops, which gives accurate wall-clock time for CPU execution.
- GPU Measurement: PTI's Device Activity Tracing hooks directly into the GPU runtime to capture kernel start/end events from hardware, providing precise timestamps that reflect actual GPU execution time (not just host-side waiting time).
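The CPU-side measurement can be factored into a small reusable helper. This is a sketch in standard C++ with a hypothetical name, time_seconds. It uses std::chrono::steady_clock, which is guaranteed to be monotonic, whereas high_resolution_clock may be an alias for the adjustable system clock on some standard libraries.

```cpp
#include <chrono>

// Runs work() once and returns the wall-clock time it took, in seconds.
template <class Work>
double time_seconds(Work&& work) {
    const auto begin = std::chrono::steady_clock::now();
    work();
    const auto end = std::chrono::steady_clock::now();
    // duration<double> has a period of one second.
    return std::chrono::duration<double>(end - begin).count();
}
```

Wrapping the tbb::parallel_for call in a lambda and passing it to time_seconds would give the same number as the inline chrono code, with less repetition if you time several code paths.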
The question comes from Stack Exchange; asked by Sahar Sa.




