Keras model.fit() Source Code Lookup and Adapting It for Parallel Training with PyCUDA: Technical Q&A

Ahua AIGC Lab

2026-5-11

Understanding Keras' model.fit() and Integrating PyCUDA for Parallel Training

Hey there! Let's break down your questions step by step, since you're looking to dig into Keras' training loop and leverage PyCUDA for parallelizing epoch-level training. First, here's your existing code for reference:

# TensorFlow and tf.keras
import tensorflow as tf
from tensorflow import keras
# Helper libraries
import numpy as np
import matplotlib.pyplot as plt
import cv2

print(tf.__version__)

fashion_mnist = keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat', 
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

train_images.shape
len(train_labels)
train_labels
test_images.shape
len(test_labels)

plt.figure()
plt.imshow(train_images[0])
plt.colorbar()
plt.grid(False)
plt.show()

train_images = train_images / 255.0
test_images = test_images / 255.0

plt.figure(figsize=(10,10))
for i in range(25):
    plt.subplot(5,5,i+1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(train_images[i], cmap=plt.cm.binary)
    plt.xlabel(class_names[train_labels[i]])
plt.show()

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation=tf.nn.relu),
    keras.layers.Dense(10, activation=tf.nn.softmax)
])

model.compile(optimizer=tf.train.AdamOptimizer(),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(train_images, train_labels, epochs=100)

1. What's the Source Code of `model.fit()`?

Since you're using tf.keras (TensorFlow's implementation of Keras), the fit() method lives in TensorFlow's core codebase. Here's a breakdown of where to find it and what it does:

Where to Locate the Source

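If you have TensorFlow installed locally, the file that defines fit() depends on the version. In releases that bundle Keras inside the TensorFlow package (which matches the tf.train.AdamOptimizer API your code uses), you can find it in:
{your_python_env}/lib/pythonX.X/site-packages/tensorflow/python/keras/engine/training.py
In newer releases, where Keras ships as its own package, look under the installed keras package instead.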
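If you would rather not guess the path, Python's standard inspect module can report where a function is defined. The snippet below is a minimal sketch: locate_source is a hypothetical helper name, and json.dumps is used only as a stand-in target so the example runs without TensorFlow installed.

```python
import inspect
import json  # stand-in target so the sketch runs without TensorFlow


def locate_source(obj):
    """Return (file path, first line number) of the code that defines obj."""
    # Follow __wrapped__ links so a decorated function resolves to the
    # original definition rather than to the decorator's wrapper.
    target = inspect.unwrap(obj)
    path = inspect.getsourcefile(target)
    _, first_line = inspect.getsourcelines(target)
    return path, first_line


path, first_line = locate_source(json.dumps)
print(path, first_line)

# With TensorFlow installed, the same helper should work on the method itself:
#   from tensorflow import keras
#   print(locate_source(keras.Model.fit))
```

Running it against keras.Model.fit should print the file for whichever version you actually have installed, which is more reliable than a hard-coded path.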

Core Functionality of `fit()`

At a high level, model.fit() is a high-level wrapper that orchestrates the entire training loop. Its key steps are:

Input Handling: Converts your training data (numpy arrays, in your case) into a tf.data.Dataset or iterable batches, handling shuffling, batching, and prefetching. Because your call does not pass batch_size, the default of 32 samples per batch applies.
Training Initialization: Sets up loss trackers, metric calculators, and initializes optimizer states (like momentum for Adam).
Epoch Loop: Iterates over each epoch you specify:
- Batch Loop: For each batch of data:
  1. Runs forward propagation: Passes the batch through the model to compute predictions.
  2. Calculates loss: Compares predictions to labels using your specified loss function.
  3. Runs backward propagation: Computes gradients of the loss with respect to model weights.
  4. Updates weights: Uses the optimizer to adjust weights based on the computed gradients.
- Metrics & Callbacks: Tracks accuracy/loss, runs callbacks (like early stopping or model saving), and prints progress.
Finalization: Returns training history and cleans up resources.

The actual low-level computation (like matrix multiplications for Dense layers) is delegated to TensorFlow's optimized operations under the hood, which run on the GPU when one is available.
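The loop described above can be sketched in plain NumPy. This is not TensorFlow's implementation: fit_sketch is a hypothetical name, the model is a single linear layer with softmax, and plain SGD stands in for Adam to keep the example short. It only illustrates the epoch loop, the batch loop, and the four per-batch steps.

```python
import numpy as np


def softmax(logits):
    # Subtract the row-wise max for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def fit_sketch(x, y, num_classes, epochs=5, batch_size=32, lr=0.1, seed=0):
    """Toy stand-in for the loop: one linear layer, softmax, plain SGD."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    w = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    history = {"loss": [], "accuracy": []}

    for epoch in range(epochs):                  # epoch loop
        order = rng.permutation(n)               # reshuffle every epoch
        total_loss, correct = 0.0, 0
        for start in range(0, n, batch_size):    # batch loop
            idx = order[start:start + batch_size]
            xb, yb = x[idx], y[idx]
            rows = np.arange(len(idx))
            probs = softmax(xb @ w + b)          # 1. forward pass
            total_loss += -np.log(probs[rows, yb] + 1e-12).sum()  # 2. loss
            correct += (probs.argmax(axis=1) == yb).sum()
            grad = probs.copy()                  # 3. backward pass
            grad[rows, yb] -= 1.0
            grad /= len(idx)
            w -= lr * (xb.T @ grad)              # 4. weight update
            b -= lr * grad.sum(axis=0)
        history["loss"].append(total_loss / n)   # per-epoch metrics
        history["accuracy"].append(correct / n)
    return w, b, history


# Tiny synthetic problem: two well-separated 2-D clusters.
data_rng = np.random.default_rng(1)
x = np.vstack([data_rng.normal(-2, 0.5, (100, 2)),
               data_rng.normal(2, 0.5, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
w, b, history = fit_sketch(x, y, num_classes=2)
print(history["loss"])
```

Note that each batch update reads the weights written by the previous batch, which is why the two loops run in order.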

2. How to Extract Content for PyCUDA Kernel Functions?

First, a quick reality check: TensorFlow/Keras already uses GPU acceleration automatically if you have the GPU-enabled version installed. But if you need to customize the training loop with PyCUDA (for research or specific optimization needs), here's how to break down the components to feed into PyCUDA:

Key Components to Extract

To replicate model.fit() with PyCUDA, you need to isolate the core tensor operations that run on the GPU. These include:

a. Model Weights & Biases

First, extract your pre-compiled model's initial weights as numpy arrays:

# Extract weights from each layer that actually has parameters
layer_weights = []
for layer in model.layers:
    params = layer.get_weights()  # empty list for Flatten
    if len(params) == 2:          # Dense layers return [kernel, bias]
        weights, biases = params
        layer_weights.append((weights, biases))

You'll need to transfer these arrays to GPU memory using PyCUDA's cuda.mem_alloc() and cuda.memcpy_htod().

b. Training Data

Your normalized train_images (shape (60000, 28, 28)) are float64 after the division by 255.0, so convert them to float32, the type GPU kernels typically work in. Your train_labels (shape (60000,)) should be converted to int32. Both then need to be transferred to GPU memory:

import numpy as np
import pycuda.autoinit  # creates a CUDA context on import
import pycuda.driver as cuda

# Convert to the dtypes the kernels will expect (contiguous host arrays)
train_images_host = np.ascontiguousarray(train_images, dtype=np.float32)
train_labels_host = np.ascontiguousarray(train_labels, dtype=np.int32)

# Allocate GPU memory
d_images = cuda.mem_alloc(train_images_host.nbytes)
d_labels = cuda.mem_alloc(train_labels_host.nbytes)

# Copy data from CPU to GPU
cuda.memcpy_htod(d_images, train_images_host)
cuda.memcpy_htod(d_labels, train_labels_host)

c. Core Training Operations (to Implement in PyCUDA)

You'll need to write PyCUDA kernels for each of these steps, or use PyCUDA's GPU array operations for simpler tasks:

Flatten Layer: Reshape (28,28) images to (784,) vectors. For contiguous data this only changes the shape, so no data has to move; a GPUArray's reshape() method is enough.
Dense Layer with ReLU: Compute output = relu(input @ weights + bias). Note that pycuda.gpuarray.dot() computes an inner product, not a matrix product, and PyCUDA itself does not ship BLAS bindings. Either write a matrix-multiplication kernel or call cuBLAS through a wrapper library such as scikit-cuda. Then add the bias and apply the ReLU activation with a small elementwise kernel that sets negative values to 0.
Dense Layer with Softmax: Compute stable softmax (subtract the max logit first to avoid numerical overflow) using a custom kernel.
Loss Calculation: Implement sparse_categorical_crossentropy: for each sample, compute -log(softmax_output[label]) and average across the batch.
Backward Propagation: Calculate gradients for weights/biases. With batches stored as rows (the input @ weights layout above), the gradient of the final Dense layer's weights is input.T @ (softmax_output - one_hot_labels), divided by the batch size, where input is the ReLU output feeding that layer. You'll need kernels to compute these gradients, then propagate them back through the ReLU layer (multiply by 1 where the ReLU output was positive, 0 otherwise).
Adam Optimizer Update: Track the first-moment (momentum) and second-moment estimates for each weight, then apply the Adam update rule to adjust the weights. This requires storing the optimizer state on the GPU alongside the weights.
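Before writing any kernels, it helps to have a CPU reference to compare GPU results against. The sketch below implements the steps listed above in NumPy for the 784-hidden-10 network, using the batch-as-rows layout of output = relu(input @ weights + bias). The function names, the hidden width of 16 in the demo, and the Adam hyperparameter defaults are assumptions made for illustration; it is a checking aid, not a PyCUDA implementation.

```python
import numpy as np


def forward(x, w1, b1, w2, b2):
    """Flatten -> Dense+ReLU -> Dense+softmax, with batches stored as rows."""
    flat = x.reshape(x.shape[0], -1)                     # (B, 28, 28) -> (B, 784)
    hidden = np.maximum(flat @ w1 + b1, 0.0)             # ReLU
    logits = hidden @ w2 + b2
    logits = logits - logits.max(axis=1, keepdims=True)  # stable softmax
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return flat, hidden, probs


def loss_fn(probs, labels):
    """Mean sparse categorical cross-entropy."""
    return -np.log(probs[np.arange(len(labels)), labels]).mean()


def backward(flat, hidden, probs, labels, w2):
    """Gradients of the mean loss with respect to w1, b1, w2, b2."""
    batch = len(labels)
    d_logits = probs.copy()
    d_logits[np.arange(batch), labels] -= 1.0            # softmax - one_hot
    d_logits /= batch
    g_w2 = hidden.T @ d_logits
    g_b2 = d_logits.sum(axis=0)
    d_hidden = (d_logits @ w2.T) * (hidden > 0)          # ReLU mask
    g_w1 = flat.T @ d_hidden
    g_b1 = d_hidden.sum(axis=0)
    return g_w1, g_b1, g_w2, g_b2


def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update; returns the new (param, m, v). Step count t starts at 1."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)                         # bias correction
    v_hat = v / (1 - beta2 ** t)
    return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v


# Small random problem in float64 so a finite-difference check is meaningful.
rng = np.random.default_rng(0)
x = rng.random((4, 28, 28))
labels = np.array([3, 0, 9, 1])
w1, b1 = rng.normal(0, 0.05, (784, 16)), np.zeros(16)
w2, b2 = rng.normal(0, 0.05, (16, 10)), np.zeros(10)

flat, hidden, probs = forward(x, w1, b1, w2, b2)
g_w1, g_b1, g_w2, g_b2 = backward(flat, hidden, probs, labels, w2)

# Central finite difference on one output-layer weight.
h = 1e-6
w2_plus, w2_minus = w2.copy(), w2.copy()
w2_plus[0, 0] += h
w2_minus[0, 0] -= h
numeric = (loss_fn(forward(x, w1, b1, w2_plus, b2)[2], labels)
           - loss_fn(forward(x, w1, b1, w2_minus, b2)[2], labels)) / (2 * h)
print(numeric, g_w2[0, 0])

new_w2, m2, v2 = adam_step(w2, g_w2, np.zeros_like(w2), np.zeros_like(w2), t=1)
```

Once a kernel exists, copying its output back to the host and comparing it with the matching array here (for example with np.allclose at float32 tolerances) is a quick way to catch indexing mistakes.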

Parallelizing Epochs

Wait a second: epochs are inherently sequential, because each epoch starts from the weights the previous one produced. The same is true of the batches within an epoch, since every weight update feeds into the next batch. What can run in parallel is the work inside a single batch, which is what TensorFlow does automatically. PyCUDA can handle this by launching kernels whose threads process multiple samples (or output elements) in parallel.

If you really want several training loops running at once (e.g., training multiple independent models with different initializations), you'd need to keep an independent set of weights and optimizer state for each model and launch each model's training loop on its own PyCUDA stream, optionally giving each one a disjoint subset of the training data. Each model still steps through its own epochs in order; the parallelism is across models, not across epochs.

Final Note

Unless you have a very specific use case requiring low-level GPU control, sticking with TensorFlow/Keras' built-in GPU acceleration is almost always more efficient—they've already optimized these operations for performance and stability. But if you're doing research or learning about GPU programming, breaking down the training loop with PyCUDA is a great exercise!

The question comes from Stack Exchange; asked by jonny.