如何用CUDA加速基于scipy.sparse.csc_matrix.dot的稀疏矩阵-稠密向量乘积
Alright, let's figure out how to get that sparse-dense matrix-vector multiplication (Ax) blazing fast on your CUDA GPU—especially since you're running it repeatedly and need top performance. Here's a practical, step-by-step guide with concrete code examples and optimization tips tailored to your 15k×15k, 5% density matrix:
Before diving in, make sure you have:
- A CUDA-capable GPU (compute capability ≥ 3.5, which covers most modern GPUs)
- The right GPU-accelerated Python libraries installed (we'll cover two popular options below)
2.1 CuPy: Drop-in Replacement for Scipy (Lowest Migration Cost)
CuPy is designed to mirror Scipy's API almost exactly, so you won't have to rewrite much code. It offloads all computations to the GPU automatically.
Setup
Install CuPy matching your CUDA version (replace cuda12x with your version, e.g., cuda11x for CUDA 11):
pip install cupy-cuda12x
Code Example
The key here is to transfer data to the GPU once—don't do this inside your loop (PCIe data transfer is way slower than GPU computation):
import cupy as cp import numpy as np from scipy.sparse import csc_matrix # Your existing Scipy CSC matrix and NumPy vector (example values here) A_scipy = csc_matrix((15000, 15000), density=0.05) x_np = np.random.rand(15000) # Move data to GPU ONCE A_cupy = cp.sparse.csc_matrix(A_scipy) x_cupy = cp.array(x_np) # Repeated multiplication (runs entirely on GPU) for _ in range(1000): y_cupy = A_cupy.dot(x_cupy) # Transfer result back to CPU only if you need it y_np = cp.asnumpy(y_cupy)
CuPy's sparse dot implementation is optimized for CUDA, so this will give you a huge speedup over Scipy's CPU version for repeated runs.
2.2 PyTorch: Great for Deep Learning Workflows
If you're already working in the PyTorch ecosystem, or need to integrate this with other GPU-based tasks, PyTorch's sparse tensor support is a solid choice.
Setup
Install PyTorch with CUDA support:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
Code Example
PyTorch uses COO format for sparse tensors (best supported for GPU operations), so we'll convert your Scipy CSC matrix to COO first:
import torch import numpy as np from scipy.sparse import csc_matrix # Your existing Scipy matrix and NumPy vector A_scipy = csc_matrix((15000, 15000), density=0.05) x_np = np.random.rand(15000) # Convert Scipy CSC to PyTorch sparse CUDA tensor indices = torch.tensor(np.vstack((A_scipy.row, A_scipy.col)), dtype=torch.long) values = torch.tensor(A_scipy.data, dtype=torch.float32) A_torch = torch.sparse_coo_tensor(indices, values, size=A_scipy.shape).cuda() # Move vector to GPU x_torch = torch.tensor(x_np, dtype=torch.float32).cuda() # Repeated multiplication (note: we unsqueeze to make x a column vector for sparse.mm) for _ in range(1000): y_torch = torch.sparse.mm(A_torch, x_torch.unsqueeze(1)).squeeze() # Transfer back to CPU if needed y_np = y_torch.cpu().numpy()
- Avoid GPU-CPU Data Transfers in Loops: As mentioned earlier, moving data between CPU and GPU is a major bottleneck. Keep your matrix and vectors on the GPU for all repeated computations.
- Use Lower Precision if Possible: If your use case allows it, switch to
float16(half-precision) or evenfloat8—GPUs process these much faster. For CuPy, usecp.float16; for PyTorch, usetorch.float16. - Batch Your Vectors: If you have multiple x vectors to compute Ax for, stack them into a dense matrix and run a single sparse-dense matrix multiplication instead of multiple vector multiplies. This is far more efficient.
- Pin Host Memory: If you need to frequently send new data from CPU to GPU, use pinned memory (CuPy's
cp.cuda.alloc_pinned_memoryor PyTorch'storch.cuda.pin_memory()) to speed up transfers.
Always test to make sure you're getting the performance you expect. Use timeit to measure per-iteration times:
import timeit # Test CuPy performance def cupy_mult(): return A_cupy.dot(x_cupy) print(f"CuPy average time per iteration: {timeit.timeit(cupy_mult, number=1000)/1000:.6f} seconds") # Test PyTorch performance def torch_mult(): return torch.sparse.mm(A_torch, x_torch.unsqueeze(1)).squeeze() print(f"PyTorch average time per iteration: {timeit.timeit(torch_mult, number=1000)/1000:.6f} seconds")
Compare these times to your original Scipy CPU implementation to see the speedup—you should see anywhere from 10x to 100x depending on your GPU.
内容的提问来源于stack exchange,提问作者rkp




