How Do GPU L2 Transactions Map to DRAM? Bus-Width Adaptation and Measurement Questions

Ahua AIGC Lab (阿华AIGC实验室)

2026-5-28

GPU L2 Cache Transactions, DRAM Mapping, and Transaction Measurement

Great question. This is a common point of confusion when digging into GPU memory performance, because the path from L2 cache operations to actual DRAM accesses depends heavily on the memory controller and the underlying DRAM architecture. Let’s break it down step by step:

1. How L2 Transactions Map to DRAM Access

First, let’s clarify the layers at play:

The gst_transactions/gld_transactions metrics from nvprof count global store/load transactions on the GPU side of the memory hierarchy (requests issued by the SMs and serviced through L1/L2), which come in 32B/64B/128B sizes depending on the architecture. This is a GPU-internal transfer granularity defined by the cache design, not by the DRAM.
The memory controller acts as a translator: it takes the requests that miss in L2 and converts them into DRAM-native requests, which are tied to two key DRAM properties:
- Bus width: The total data width of the GPU’s global memory bus (e.g., 384 bits for Titan Xp, 3072 bits for P100).
- Burst length (BL): The number of consecutive data transfers a DRAM chip performs for a single read or write command to an already-open row (8 is assumed for GDDR5X in the examples below; the real value depends on the memory type and operating mode).

For example:

A 64B L2 read transaction on Titan Xp (384-bit bus = 48B per transfer):
Assuming a burst length of 8 and treating the whole bus as one channel (a simplification: the 384-bit bus is physically split into narrower independent channels, so real bursts are smaller), a single read burst spans 8 transfers and moves 48B * 8 = 384B. The memory controller reads the full burst and returns only the 64B that L2 asked for; the unused data may be held in the memory controller for adjacent requests, which avoids another read to the same row.
If multiple L2 transactions target the same DRAM row, the memory controller can group them so that the row is activated only once, reducing row activation overhead and improving efficiency. If a transaction crosses a DRAM row boundary, it gets split into multiple row requests.
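The arithmetic in this example can be sketched in a few lines of Python. This is only a toy model under the same assumptions as the text: the whole bus is treated as one channel and the burst length is 8. The function names are made up for illustration and do not correspond to any profiler API.

```python
def burst_bytes(bus_width_bits: int, burst_length: int) -> int:
    """Bytes moved by one burst if the whole bus acted as a single channel."""
    return (bus_width_bits // 8) * burst_length


def burst_usage(request_bytes: int, bus_width_bits: int, burst_length: int):
    """Return (bursts_needed, bytes_fetched, bytes_unused) for one request."""
    per_burst = burst_bytes(bus_width_bits, burst_length)
    bursts = -(-request_bytes // per_burst)  # ceiling division
    fetched = bursts * per_burst
    return bursts, fetched, fetched - request_bytes


# 64B L2 read on a 384-bit bus with an assumed burst length of 8
bursts, fetched, unused = burst_usage(64, 384, 8)
print(bursts, fetched, unused)  # 1 384 320
```

In this model, 320 of the 384 fetched bytes are not needed by the 64B request, which is why holding on to the rest of the burst for neighbouring requests matters.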

2. Adapting to Different Bus Widths (Titan Xp vs. P100)

Let’s use your examples to make this concrete:

Titan Xp (GDDR5X, 384-bit bus):
Its narrower bus means the memory controller needs more bus transfers to move a given L2 transaction. A 128B transaction needs 3 transfers (48B * 3 = 144B covers the 128B), but since DRAM delivers data in full bursts, the controller pulls the entire 384B burst (in the single-channel model above) and keeps the excess.
P100 (HBM2, 3072-bit bus):
3072 bits = 384B per transfer, so a single bus transfer can cover a 128B L2 transaction. HBM2 also defines short burst lengths, so the memory controller can keep bursts close to the L2 transaction size (or combine several small transactions into one burst to exploit the bus’s bandwidth).

The core idea is that the memory controller abstracts L2’s transaction size from DRAM’s physical constraints, using merging, caching, and burst adjustment to maximize bandwidth utilization regardless of bus width.
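To make the bus-width comparison concrete, here is a small sketch that counts how many full-width bus transfers each L2 transaction size would need on the two GPUs. It assumes the bus behaves as one wide channel, which is the simplified picture used in this section; the dictionary and function names are illustrative only.

```python
import math

# Bus widths quoted in this article, in bits
BUS_WIDTH_BITS = {"Titan Xp": 384, "P100": 3072}


def bus_cycles(transaction_bytes: int, bus_width_bits: int) -> int:
    """Number of full-width bus transfers needed to cover one transaction."""
    bytes_per_cycle = bus_width_bits // 8
    return math.ceil(transaction_bytes / bytes_per_cycle)


def cycles_table(sizes=(32, 64, 128)):
    """Map each GPU to {transaction size: transfers needed}."""
    return {
        gpu: {size: bus_cycles(size, width) for size in sizes}
        for gpu, width in BUS_WIDTH_BITS.items()
    }


for gpu, row in cycles_table().items():
    print(gpu, row)
# Titan Xp {32: 1, 64: 2, 128: 3}
# P100 {32: 1, 64: 1, 128: 1}
```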

3. Measuring DRAM Controller-Generated Transactions

You’re right that dram_read_transactions won’t show you physical DRAM bus activity. This metric counts logical requests between the memory controller and DRAM in a fixed unit (32B per transaction, as far as I know, for nvprof on these GPUs) rather than bus-sized bursts. These requests often align with L2 transaction counts, which is why you saw the same number on Titan Xp and P100 during sequential access.

To measure actual physical DRAM transactions or bus activity, use these approaches:

Count row activations: If your profiler exposes a row-activation counter (referred to here as dram_row_activate; run nvprof --query-metrics to confirm what your GPU and toolkit actually provide), it counts how many times DRAM rows are opened, which is a direct measure of physical DRAM operations. Wider buses (like P100’s) should generally need fewer row activations for the same total data, since each activation can serve more data.
Calculate physical burst transfers: Take the total bytes read from DRAM (the dram_read_data metric in this write-up; the name may differ in your nvprof version, so check nvprof --query-metrics) and divide by the burst size of your GPU’s DRAM. Burst size = (bus width / 8) * burst length. For Titan Xp: (384/8)*8 = 384B per burst; for P100: (3072/8)*8 = 3072B per burst. Both figures assume a burst length of 8 and a single channel spanning the whole bus, so substitute your memory’s real burst length and channel width.
Measure transaction merging efficiency: Compare l2_read_transactions and dram_read_transactions. A large gap means fewer requests reach DRAM than arrive at L2, either because reads hit in L2 or because the memory controller merges several L2 transactions into fewer DRAM requests. Both are good for performance.
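The two calculations above are simple enough to script once you have the counter values. The sketch below assumes you have already collected the total bytes read from DRAM and the two transaction counts (for example with something like nvprof --metrics l2_read_transactions,dram_read_transactions on your application). The numbers used here are made up for illustration, not measurements, and the helper names are hypothetical.

```python
def estimated_bursts(dram_bytes: int, bus_width_bits: int, burst_length: int) -> float:
    """Estimate burst count as total bytes / ((bus width / 8) * burst length)."""
    per_burst = (bus_width_bits // 8) * burst_length
    return dram_bytes / per_burst


def l2_to_dram_ratio(l2_read_transactions: int, dram_read_transactions: int) -> float:
    """Ratio > 1 means fewer requests reached DRAM than arrived at L2."""
    if dram_read_transactions == 0:
        raise ValueError("dram_read_transactions must be non-zero")
    return l2_read_transactions / dram_read_transactions


# Made-up counter values, for illustration only
dram_bytes = 1_572_864  # 1.5 MiB read from DRAM
print(estimated_bursts(dram_bytes, 384, 8))   # 4096.0 (Titan Xp model)
print(estimated_bursts(dram_bytes, 3072, 8))  # 512.0  (P100 model)
print(l2_to_dram_ratio(60_000, 48_000))       # 1.25
```

The burst length of 8 is the value assumed throughout this article; pass a different value if your memory uses one.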

Quick Note on `dram_read_transactions`

The consistency you saw in sequential access makes sense: when accessing memory in order, L2 generates a steady stream of transactions (e.g., one 128B transaction per cache line miss), and the memory controller may not need to merge them (since they’re already contiguous). So the logical DRAM request count matches the L2 transaction count, even though the physical bus transfer size varies drastically between the two GPUs.
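A toy model shows why the two counts can diverge. It assumes one 128B L2 transaction per cache line, a burst length of 8, and a single channel spanning the whole bus, all of which are simplifications taken from this article rather than measured behaviour; the function name is illustrative.

```python
def sequential_read_counts(total_bytes: int, l2_transaction_bytes: int,
                           bus_width_bits: int, burst_length: int):
    """Return (logical_transactions, physical_bursts) for a sequential read."""
    logical = total_bytes // l2_transaction_bytes
    per_burst = (bus_width_bits // 8) * burst_length
    bursts = -(-total_bytes // per_burst)  # ceiling division
    return logical, bursts


total = 3 * 1024 * 1024  # 3 MiB read sequentially
for name, width in (("Titan Xp", 384), ("P100", 3072)):
    print(name, sequential_read_counts(total, 128, width, 8))
# Titan Xp (24576, 8192)
# P100 (24576, 1024)
```

The logical count is the same on both GPUs because it depends only on the transaction size, while the modelled burst count scales with the bus width.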

The question discussed here comes from Stack Exchange; it was asked by Johns Paul.