如何用CUDA加速基于scipy.sparse.csc_matrix.dot的稀疏矩阵-稠密向量乘积

阿华AIGC实验室

2026-5-19

Alright, let's figure out how to get that sparse-dense matrix-vector multiplication (Ax) blazing fast on your CUDA GPU—especially since you're running it repeatedly and need top performance. Here's a practical, step-by-step guide with concrete code examples and optimization tips tailored to your 15k×15k, 5% density matrix:

1. Prerequisites First

Before diving in, make sure you have:

A CUDA-capable GPU (compute capability ≥ 3.5, which covers most modern GPUs)
The right GPU-accelerated Python libraries installed (we'll cover two popular options below)

2. Implementation Options

2.1 CuPy: Drop-in Replacement for Scipy (Lowest Migration Cost)

CuPy is designed to mirror Scipy's API almost exactly, so you won't have to rewrite much code. It offloads all computations to the GPU automatically.

Setup

Install CuPy matching your CUDA version (replace cuda12x with your version, e.g., cuda11x for CUDA 11):

pip install cupy-cuda12x

Code Example

The key here is to transfer data to the GPU once—don't do this inside your loop (PCIe data transfer is way slower than GPU computation):

import cupy as cp
import numpy as np
from scipy.sparse import csc_matrix

# Your existing Scipy CSC matrix and NumPy vector (example values here)
A_scipy = csc_matrix((15000, 15000), density=0.05)
x_np = np.random.rand(15000)

# Move data to GPU ONCE
A_cupy = cp.sparse.csc_matrix(A_scipy)
x_cupy = cp.array(x_np)

# Repeated multiplication (runs entirely on GPU)
for _ in range(1000):
    y_cupy = A_cupy.dot(x_cupy)

# Transfer result back to CPU only if you need it
y_np = cp.asnumpy(y_cupy)

CuPy's sparse dot implementation is optimized for CUDA, so this will give you a huge speedup over Scipy's CPU version for repeated runs.

2.2 PyTorch: Great for Deep Learning Workflows

If you're already working in the PyTorch ecosystem, or need to integrate this with other GPU-based tasks, PyTorch's sparse tensor support is a solid choice.

Setup

Install PyTorch with CUDA support:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

Code Example

PyTorch uses COO format for sparse tensors (best supported for GPU operations), so we'll convert your Scipy CSC matrix to COO first:

import torch
import numpy as np
from scipy.sparse import csc_matrix

# Your existing Scipy matrix and NumPy vector
A_scipy = csc_matrix((15000, 15000), density=0.05)
x_np = np.random.rand(15000)

# Convert Scipy CSC to PyTorch sparse CUDA tensor
indices = torch.tensor(np.vstack((A_scipy.row, A_scipy.col)), dtype=torch.long)
values = torch.tensor(A_scipy.data, dtype=torch.float32)
A_torch = torch.sparse_coo_tensor(indices, values, size=A_scipy.shape).cuda()

# Move vector to GPU
x_torch = torch.tensor(x_np, dtype=torch.float32).cuda()

# Repeated multiplication (note: we unsqueeze to make x a column vector for sparse.mm)
for _ in range(1000):
    y_torch = torch.sparse.mm(A_torch, x_torch.unsqueeze(1)).squeeze()

# Transfer back to CPU if needed
y_np = y_torch.cpu().numpy()

3. Critical Performance Tips for Repeated Runs

Avoid GPU-CPU Data Transfers in Loops: As mentioned earlier, moving data between CPU and GPU is a major bottleneck. Keep your matrix and vectors on the GPU for all repeated computations.
Use Lower Precision if Possible: If your use case allows it, switch to float16 (half-precision) or even float8—GPUs process these much faster. For CuPy, use cp.float16; for PyTorch, use torch.float16.
Batch Your Vectors: If you have multiple x vectors to compute Ax for, stack them into a dense matrix and run a single sparse-dense matrix multiplication instead of multiple vector multiplies. This is far more efficient.
Pin Host Memory: If you need to frequently send new data from CPU to GPU, use pinned memory (CuPy's cp.cuda.alloc_pinned_memory or PyTorch's torch.cuda.pin_memory()) to speed up transfers.

4. Benchmark to Verify Speedups

Always test to make sure you're getting the performance you expect. Use timeit to measure per-iteration times:

import timeit

# Test CuPy performance
def cupy_mult():
    return A_cupy.dot(x_cupy)

print(f"CuPy average time per iteration: {timeit.timeit(cupy_mult, number=1000)/1000:.6f} seconds")

# Test PyTorch performance
def torch_mult():
    return torch.sparse.mm(A_torch, x_torch.unsqueeze(1)).squeeze()

print(f"PyTorch average time per iteration: {timeit.timeit(torch_mult, number=1000)/1000:.6f} seconds")

Compare these times to your original Scipy CPU implementation to see the speedup—you should see anywhere from 10x to 100x depending on your GPU.

内容的提问来源于stack exchange，提问作者rkp