Confusion About Bank Conflicts, Memory Padding, and Coalesced Memory Access: A Request for Technical Clarification

Ahua AIGC Lab (阿华AIGC实验室)

2026-4-28

Clarifying Bank Conflicts, Memory Padding, and Coalesced Access in CUDA

Great question. This is a very common point of confusion when diving into CUDA memory optimizations, so let’s break it down clearly.

First, let’s anchor the two concepts to their respective memory layers, because that’s where the "apparent conflict" dissolves:

Coalesced memory access is all about how threads in a warp read from or write to global memory. The goal here is to have consecutive threads access consecutive memory addresses, so the GPU can batch the request into a single (or minimal number of) memory transactions instead of multiple scattered ones. This optimizes data transfer between the device’s global memory and streaming multiprocessors (SMs).
Bank conflicts and memory padding apply to shared memory, the fast on-SM memory. Shared memory is divided into banks (typically 32 banks of 4-byte words on modern GPUs). If multiple threads in a warp access different addresses that map to the same bank, the accesses conflict and are serialized. Threads that read the same address do not conflict, because the value is broadcast. Padding adds unused elements to shared memory arrays to shift addresses so that concurrent accesses land in different banks.
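The bank rule can be made concrete with a small host-side model. The sketch below is plain C++ rather than device code. It assumes 32 banks, one 4-byte word per bank, and successive words mapping to successive banks. The names conflict_degree and strided_access are made up for illustration.

```cpp
#include <array>
#include <cstddef>
#include <set>
#include <vector>

// Assumed model: 32 banks, each serving one 4-byte word per pass.
constexpr int kBanks = 32;
constexpr int kWarpSize = 32;

// Number of serialized passes one warp-wide shared memory request needs:
// the largest count of DISTINCT words that fall into the same bank.
// Threads touching the very same word are served by a broadcast,
// so duplicates are collapsed by the std::set.
int conflict_degree(const std::vector<int>& word_index) {
    std::array<std::set<int>, kBanks> words_per_bank;
    for (int w : word_index) {
        words_per_bank[w % kBanks].insert(w);
    }
    std::size_t worst = 0;
    for (const auto& words : words_per_bank) {
        if (words.size() > worst) worst = words.size();
    }
    return static_cast<int>(worst);
}

// Word indices touched by a warp in which thread t accesses word t * stride.
std::vector<int> strided_access(int stride) {
    std::vector<int> idx(kWarpSize);
    for (int t = 0; t < kWarpSize; ++t) idx[t] = t * stride;
    return idx;
}
```

Under these assumptions, conflict_degree(strided_access(1)) gives 1 (conflict-free) and a stride of 32 gives 32 (fully serialized). A stride of 33, which is what padding a 32-word row by one element produces, gives 1 again. A stride of 0, where every thread reads the same word, also gives 1 because of the broadcast.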

Why Padding Shared Memory Doesn’t Break Coalesced Global Access

Let’s use your 16×16 matrix example to walk through the flow:

Global Memory Load: When your thread block loads the 16×16 tile from global memory, consecutive threads along a row access consecutive elements. For a 4-byte element (like float), one 16-element row is 64 bytes, which is two 32-byte sectors, or half of a 128-byte cache line. A 32-thread warp covers two such rows. As long as each row starts at a suitably aligned address, every byte fetched is used, so this is a fully coalesced access with no issues.
Storing to Padded Shared Memory: You then write each 16-element row into a 16×17 shared memory array. The extra padding element per row lives in shared memory only; it doesn’t come from global memory. The global memory addresses you accessed are still consecutive and aligned. You’re just adding a dummy element to the shared memory layout to reduce bank conflicts in subsequent shared memory accesses, most notably column-wise ones (for example in a transpose, or when a matrix multiplication walks down a column of the tile).
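The two steps above can be sketched as a pair of index formulas, again as a host-side C++ model rather than a kernel. It assumes the tile starts at a 32-byte-aligned address, the row pitch is a multiple of 8 floats, and global memory is fetched in 32-byte sectors. All function names and the row_pitch_elems parameter are hypothetical.

```cpp
#include <cstdint>
#include <set>

constexpr int kTile = 16;                 // tile edge from the example
constexpr std::int64_t kElemBytes = 4;    // sizeof(float)
constexpr std::int64_t kSectorBytes = 32; // assumed global transaction granularity

// Byte offset in global memory of the element loaded by thread (tx, ty).
// The shared memory pitch does not appear anywhere in this formula.
std::int64_t global_byte_offset(int tx, int ty, int row_pitch_elems) {
    return (static_cast<std::int64_t>(ty) * row_pitch_elems + tx) * kElemBytes;
}

// Word index in shared memory where the same thread stores its element.
// shared_pitch is 16 for the unpadded tile and 17 for the padded one.
int shared_word_index(int tx, int ty, int shared_pitch) {
    return ty * shared_pitch + tx;
}

// Distinct 32-byte sectors touched by the first warp of a 16x16 block
// (the threads with ty = 0 and ty = 1) while loading its two tile rows.
int sectors_for_first_warp(int row_pitch_elems) {
    std::set<std::int64_t> sectors;
    for (int ty = 0; ty < 2; ++ty) {
        for (int tx = 0; tx < kTile; ++tx) {
            sectors.insert(global_byte_offset(tx, ty, row_pitch_elems) / kSectorBytes);
        }
    }
    return static_cast<int>(sectors.size());
}
```

For row pitches of 16, 64, or 1024 floats, sectors_for_first_warp returns 4. The warp requests 128 bytes and touches exactly four 32-byte sectors, whichever pitch is later used for the shared memory store.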

Addressing Your Alignment Concern

The key point here is that padding affects the shared memory layout, not global memory alignment. When you load from global memory, you’re still accessing a contiguous, aligned block of data. The padding is a local adjustment within the shared memory buffer, so you’re not shifting the global memory addresses you read from. The dummy padding element isn’t loaded from global memory at all; it’s just an unused slot in shared memory that shifts each row by one bank so that strided accesses are spread across banks instead of piling onto the same one.

Are They Contradictory? Absolutely Not

These two optimizations work together, not against each other:

Coalesced access ensures you’re moving data from global memory to the SM as efficiently as possible.
Shared memory padding ensures that once the data is in the fast shared memory, your threads can access it with few or no bank conflicts, maximizing shared memory throughput.
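To put rough numbers on the shared memory side for the 16×16 example, here is one more host-side C++ sketch. It assumes 32 banks of 4-byte words and a 16×16 thread block whose first warp holds the threads with ty = 0 and ty = 1. The function name first_warp_conflict_degree is illustrative only.

```cpp
#include <algorithm>
#include <array>

constexpr int kTile = 16;
constexpr int kBanks = 32;

// Worst-case number of serialized passes for the first warp (ty = 0, 1).
// column_wise = false: thread (tx, ty) touches tile[ty][tx] (row-wise store).
// column_wise = true:  thread (tx, ty) touches tile[tx][ty] (transposed read).
// Every thread touches a distinct word for pitches >= 16, so counting hits
// per bank is enough; no broadcast case can occur.
int first_warp_conflict_degree(int shared_pitch, bool column_wise) {
    std::array<int, kBanks> hits{};
    for (int ty = 0; ty < 2; ++ty) {
        for (int tx = 0; tx < kTile; ++tx) {
            int word = column_wise ? tx * shared_pitch + ty
                                   : ty * shared_pitch + tx;
            ++hits[word % kBanks];
        }
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

In this model the unpadded tile (pitch 16) needs 1 pass for the row-wise store and 8 for the column-wise read. With a pitch of 17 both come out at 2, because a 32-thread warp spans two tile rows and one word wraps around onto a bank that is already in use. Under these assumptions the padded 16-wide tile is a large improvement for column-wise access but is not perfectly conflict-free.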

Think of it like this: you first use coalesced access to get the data to the SM quickly, then use padding to make sure you can work with that data efficiently once it’s there. They’re complementary steps in optimizing memory performance.

The question discussed here comes from Stack Exchange; it was asked by SimonH.