How Can NVIDIA MPS Be Used to Reduce CUDA Context Size? A Question About Actual Memory Usage

阿华AIGC实验室

2026-5-14

Why Isn't NVIDIA MPS Reducing Per-Process GPU Memory Usage?

Great question. This is a very common gotcha with NVIDIA MPS, so let's unpack what's going on and how you can get the memory savings you're expecting.

First, let's clarify a key point from the docs: without MPS, every CUDA process allocates its own storage and scheduling resources on the GPU; with MPS, the server allocates one copy of those resources and shares it among its clients. That does not eliminate all per-process GPU memory overhead. Your 300MB per-process footprint is likely a mix of memory that is attributed to each process in the stats but backed by shared resources, and private per-process context data that MPS cannot remove.

Here’s how to dig into this and reduce your per-process memory usage:

1. Verify MPS is Actually Active for Your Clients

It’s easy to start the MPS server but forget to configure your client processes to use it. Follow these steps to confirm:

Start the MPS control daemon (it spawns the actual MPS server when the first client connects) with:

export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-mps-log
nvidia-cuda-mps-control -d

For every client process you launch, set the same CUDA_MPS_PIPE_DIRECTORY environment variable. This ensures they connect to the MPS server instead of creating standalone CUDA contexts.
Check nvidia-smi: once at least one client has connected, you should see an nvidia-cuda-mps-server process in the process list. On Volta and newer GPUs the clients are still listed individually, but typically with type M+C instead of plain C. If your clients show up as type C, they are probably not going through MPS.
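
The check above can be scripted. Here is a minimal shell sketch, assuming the control daemon was started with the pipe directory shown earlier and that nvidia-smi marks MPS clients with the type M+C (typical on Volta and newer, but treat it as an assumption for your driver version). The helper name count_mps_clients is hypothetical and not part of any NVIDIA tool.

```shell
# Clients must see the same pipe directory as the control daemon.
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps

# Hypothetical helper: read nvidia-smi's process table from stdin and count
# the rows that contain the field "M+C", i.e. processes attached as MPS clients.
count_mps_clients() {
    awk '{ for (i = 1; i <= NF; i++) if ($i == "M+C") { n++; break } } END { print n + 0 }'
}

# Only call the GPU tools when they are installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    # Launch your client processes before this point.
    echo "MPS clients seen by nvidia-smi: $(nvidia-smi 2>/dev/null | count_mps_clients)"
fi
if command -v nvidia-cuda-mps-control >/dev/null 2>&1; then
    # Ask the control daemon for the PIDs of running MPS servers.
    echo get_server_list | nvidia-cuda-mps-control \
        || echo "MPS control daemon not reachable" >&2
fi
```

If get_server_list prints nothing while your clients are running, no server has been spawned, which usually means the clients never connected.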

2. Distinguish Between Shared vs. Private Memory

The 300MB per-process number you're seeing may include memory that is backed by shared MPS resources, so it is attributed to several processes while being stored only once on the GPU. Instead of adding up per-process stats, check the total GPU memory in use (reported in MiB) with:

nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits

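With MPS enabled, total memory should grow more slowly than the per-process figure suggests (for example, 10 processes might use around 1GB in total rather than 10 × 300MB = 3GB; these numbers are purely illustrative). If the total still grows by roughly the full per-process amount for every client you add, first confirm that the clients are really connecting to MPS. If they are, that overhead is private to each process.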
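
One way to apply this is to record total memory before and after launching your clients and divide the difference by the client count. A minimal sketch, assuming a single GPU at index 0; gpu_used_mib and per_client_mib are hypothetical helper names.

```shell
# Total memory in use on GPU 0, in MiB (same query as above, pinned to one GPU).
gpu_used_mib() {
    nvidia-smi --id=0 --query-gpu=memory.used --format=csv,noheader,nounits
}

# Hypothetical helper: average growth in total GPU memory per client.
# Args: baseline MiB (before launch), total MiB (after launch), client count.
# Uses integer division, so the result is rounded down.
per_client_mib() {
    echo $(( ($2 - $1) / $3 ))
}

# Usage sketch:
#   before=$(gpu_used_mib)
#   ... launch N clients and wait until they have initialised CUDA ...
#   after=$(gpu_used_mib)
#   per_client_mib "$before" "$after" "$N"
```

If the result stays close to the per-process figure you saw without MPS, the overhead is genuinely private rather than double-counted.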

3. Optimize Per-Process CUDA Usage

Even with MPS, some private per-process overhead is unavoidable, but you can minimize it:

Avoid redundant CUDA context initialization: Don’t create/destroy CUDA contexts multiple times per process; initialize once and reuse.
Reuse CUDA objects: Share streams, events, and library handles (like cuBLAS/cuDNN handles) within a process instead of creating new ones. Each of these adds small overhead that adds up.
Trim implicit library allocations: Libraries like cuBLAS and cuDNN often allocate temporary workspace memory. Where the API allows it, give them a smaller workspace or reuse one buffer across operations; cuBLAS, for instance, accepts a user-provided workspace through cublasSetWorkspace.

4. Tune MPS Configuration for Memory Efficiency

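You can adjust MPS settings, although the documented knobs limit what each client may use rather than shrink the context itself:

MPS is configured through environment variables (such as CUDA_MPS_PIPE_DIRECTORY and CUDA_MPS_LOG_DIRECTORY above) and through commands piped to nvidia-cuda-mps-control, not through a dedicated memory-pool setting.
To cap how much device memory a single client can allocate, recent CUDA releases (11.5 or later, as far as I know) provide the CUDA_MPS_PINNED_DEVICE_MEM_LIMIT environment variable and the set_default_device_pinned_mem_limit control command. This keeps one client from crowding out the others, but it does not reduce the fixed context overhead.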
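
As far as I can tell, the MPS settings that are actually documented are exposed as environment variables and as commands piped to nvidia-cuda-mps-control. The sketch below assumes CUDA 11.5 or later, where per-client device memory limits are available. It uses 2G as an arbitrary example value, and mps_mem_limit_spec is a hypothetical helper that only builds the device=limit list. Note that this caps what a client may allocate; it does not by itself shrink the context.

```shell
# Hypothetical helper: build "dev=limit" pairs joined by commas.
# Usage: mps_mem_limit_spec LIMIT DEV [DEV...]
mps_mem_limit_spec() {
    limit=$1
    shift
    spec=""
    for dev in "$@"; do
        # Prepend a comma only when spec already holds an entry.
        spec="${spec:+$spec,}$dev=$limit"
    done
    echo "$spec"
}

# Per-client cap for GPU 0, set in the client's environment before it starts.
CUDA_MPS_PINNED_DEVICE_MEM_LIMIT=$(mps_mem_limit_spec 2G 0)
export CUDA_MPS_PINNED_DEVICE_MEM_LIMIT

# Alternative: set a default for all future clients via the control daemon.
# echo "set_default_device_pinned_mem_limit 0 2G" | nvidia-cuda-mps-control
```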

5. Use Modern CUDA Versions

Older CUDA versions had less efficient MPS memory sharing. If you're on a version before CUDA 11.x, upgrading might give you better memory savings, since newer releases are reported to handle context sharing more efficiently. Measure total GPU memory before and after the upgrade to confirm the effect on your workload.
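
One concrete reason newer toolkits can use less memory per process is lazy module loading, which CUDA 11.7 introduced behind the CUDA_MODULE_LOADING environment variable. The sketch below checks the installed toolkit version and opts in. version_ge is a hypothetical helper, the parsing assumes the usual "release X.Y," wording in nvcc --version output, and the toolkit that nvcc reports is not necessarily the runtime your application was built against.

```shell
# Hypothetical helper: succeed when version $1 >= version $2.
# Relies on GNU sort -V for version-aware ordering.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

if command -v nvcc >/dev/null 2>&1; then
    # Extract e.g. "12.2" from a line such as "... release 12.2, V12.2.140".
    cuda_ver=$(nvcc --version | sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p')
    if [ -n "$cuda_ver" ] && version_ge "$cuda_ver" 11.7; then
        # Load kernels on first use instead of all at context creation.
        export CUDA_MODULE_LOADING=LAZY
    fi
fi
```

How much this helps depends on how many kernels your binaries and libraries embed, so compare total GPU memory with and without the variable set.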

Final Note

Remember: MPS exists mainly to let multiple processes share a GPU efficiently. Any memory benefit comes from sharing core resources across clients, not from making individual process memory stats look smaller. The per-process 300MB count can be misleading if it includes memory that is stored only once on the hardware. Judge MPS's effectiveness by total GPU memory used, not by per-process numbers.

The question comes from Stack Exchange; it was asked by alex.