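How to fix the cuDNN initialization failure when a Keras CNN model is run a second time
I have run into this quite a few times: the first run of a CNN model works fine, the second run crashes, and only a reboot clears it, which is a real headache. Based on your environment (Python 3.6, GTX 1050 Ti, i7-7700K) and the error message (UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize), here are several fixes that do not require rebooting: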
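1. Manually clean up TF/Keras session resources
After a run, TensorFlow may keep holding GPU memory and session state for as long as the Python process stays alive, which can cause a resource conflict on the second run. Add the following code after each training run (or before the second run):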
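    import tensorflow as tf
    from tensorflow.keras import backend as K  # the traceback shows tf.keras is in use

    # Clear the Keras session
    K.clear_session()
    # Reset the default TF graph
    tf.compat.v1.reset_default_graph()

    # Enable memory growth (same call in TF 1.14+ and 2.x). Note: this does not
    # free memory by itself and only takes effect if set before the GPU is
    # initialized; otherwise TF raises the RuntimeError caught below.
    gpus = tf.config.experimental.list_physical_devices('GPU')
    if gpus:
        try:
            for gpu in gpus:
                tf.config.experimental.set_memory_growth(gpu, True)
        except RuntimeError as e:
            print(e)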
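This code clears leftover Keras and graph state, which makes a resource conflict during cuDNN initialization less likely. Keep in mind that TensorFlow generally does not hand GPU memory back to the driver until the Python process exits, so if the error persists, combine this with the other options in this answer.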
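2. Enable GPU memory growth
By default TensorFlow reserves nearly all GPU memory up front, so if an earlier process or kernel is still alive, the second run has nothing left for cuDNN. Add the memory growth configuration at the very top of your code so that TF allocates GPU memory on demand: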
    import tensorflow as tf

    gpus = tf.config.experimental.list_physical_devices('GPU')
    if gpus:
        try:
            # Enable memory growth on every GPU
            for gpu in gpus:
                tf.config.experimental.set_memory_growth(gpu, True)
            logical_gpus = tf.config.experimental.list_logical_devices('GPU')
            print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
        except RuntimeError as e:
            # Memory growth must be set before the GPUs have been initialized
            print(e)
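With this setting, TF grows its GPU memory usage only as the model needs it, so the second run is far less likely to fail cuDNN initialization because the memory is already taken.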
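If editing every script is inconvenient, memory growth can also be requested through the TF_FORCE_GPU_ALLOW_GROWTH environment variable, as long as it is set before TensorFlow initializes the GPU. A minimal sketch, assuming your TF version honours this variable; the helper name enable_gpu_growth_via_env is made up for illustration:

```python
import os


def enable_gpu_growth_via_env():
    """Ask TensorFlow to allocate GPU memory on demand.

    Must run before TensorFlow first touches the GPU (safest: before
    importing it). Returns the value now in effect.
    """
    # setdefault keeps any value the user already exported in the shell
    return os.environ.setdefault("TF_FORCE_GPU_ALLOW_GROWTH", "true")


enable_gpu_growth_via_env()
# import tensorflow as tf  # import TensorFlow only after the variable is set
```

In a notebook, this belongs in the very first cell, before any cell that imports TensorFlow or Keras.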
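3. Check CUDA and cuDNN version compatibility
The core of your error is that cuDNN failed to initialize, and a CUDA/cuDNN version mismatch is one possible cause. Python 3.6 works with many TensorFlow releases; the tensorflow_core and training_v2.py paths in your traceback suggest TF 2.0 or 2.1. Make sure that:
- The CUDA version matches the TensorFlow version (for example, TF 1.15 and TF 2.0 are built against CUDA 10.0, while TF 2.1 is built against CUDA 10.1)
- The cuDNN build matches the CUDA version (for example, cuDNN 7.6.5 ships separate builds for CUDA 10.0 and CUDA 10.1; install the one made for your CUDA)
A mismatch more commonly fails on every run rather than only on the second, but it is still worth ruling out. Check your TF version with pip show tensorflow, compare it against the matching CUDA/cuDNN versions, and reinstall if they do not line up.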
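The version check can be sketched as a small lookup. The table below covers only a few releases and reflects TensorFlow's published tested build configurations as I recall them, so treat it as an assumption and confirm against the official table. The names TESTED_BUILDS and expected_cuda_cudnn are hypothetical.

```python
# Partial table: TF minor version -> (CUDA, cuDNN) it was built and tested with.
TESTED_BUILDS = {
    "1.15": ("10.0", "7.4"),
    "2.0": ("10.0", "7.4"),
    "2.1": ("10.1", "7.6"),
    "2.2": ("10.1", "7.6"),
    "2.3": ("10.1", "7.6"),
}


def expected_cuda_cudnn(tf_version):
    """Return (cuda, cudnn) for a TF version such as '2.1.0', or None if unknown."""
    minor = ".".join(tf_version.split(".")[:2])
    return TESTED_BUILDS.get(minor)


if __name__ == "__main__":
    for version in ("1.15.0", "2.0.0", "2.1.0"):
        print(version, "->", expected_cuda_cudnn(version))
```

In practice you would pass tf.__version__ and compare the result with what is actually installed on your machine.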
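4. Restart the Python kernel, not the computer
You do not need to reboot every time. In Jupyter Notebook or an IDE, restarting the Python kernel is enough: in Jupyter, click "Kernel" -> "Restart" and rerun the code. Ending the process releases the GPU resources it was holding, which is much quicker than rebooting.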
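Outside a notebook, the same effect can be had by running each training job in its own Python process, since the GPU memory is reclaimed when that process exits. A minimal sketch using only the standard library; run_in_fresh_process is a hypothetical helper, and train.py stands in for whatever script holds your model.fit call:

```python
import subprocess
import sys


def run_in_fresh_process(args, timeout=None):
    """Run a Python command in a new interpreter; return (exit_code, output)."""
    completed = subprocess.run(
        [sys.executable] + list(args),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,  # text mode, spelled this way for Python 3.6
        timeout=timeout,
    )
    return completed.returncode, completed.stdout


if __name__ == "__main__":
    # Stand-in for: run_in_fresh_process(["train.py"])  (train.py is hypothetical)
    for run in range(2):
        code, output = run_in_fresh_process(["-c", "print('training run finished')"])
        print(run, code, output.strip())
```

Each call starts from a clean GPU state, at the cost of re-importing TensorFlow and rebuilding the model every time.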
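5. Close other programs that are using the GPU
Open Task Manager (Windows) or run nvidia-smi in a terminal to see whether other programs (other Python processes, games, video rendering software) are holding GPU memory. Close those processes and rerun the model to avoid cuDNN initialization failures caused by insufficient GPU memory.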
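The nvidia-smi check can also be scripted. The sketch below assumes the query fields pid, process_name and used_memory, which are the names listed by nvidia-smi --help-query-compute-apps on the drivers I have seen; verify them on yours. The function names are hypothetical.

```python
import shutil
import subprocess

QUERY = ["nvidia-smi",
         "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv,noheader,nounits"]


def parse_compute_apps(csv_text):
    """Parse nvidia-smi CSV rows into (pid, name, used_mib_or_None) tuples."""
    rows = []
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) < 3 or not parts[0].isdigit():
            continue  # blank line or an error message instead of a data row
        pid, name, mem = parts[0], ",".join(parts[1:-1]), parts[-1]
        # Memory can be reported as "[N/A]" (for example under WDDM on Windows)
        rows.append((int(pid), name, int(mem) if mem.isdigit() else None))
    return rows


def gpu_processes():
    """Return current GPU compute processes, or [] if nvidia-smi is not on PATH."""
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(QUERY, stdout=subprocess.PIPE,
                         universal_newlines=True).stdout
    return parse_compute_apps(out)


if __name__ == "__main__":
    for pid, name, mem in gpu_processes():
        print(pid, name, mem)
```

A leftover python.exe entry from an earlier run is the usual suspect; ending that process frees its GPU memory without a reboot.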
Details from your question, for reference:
Training code snippet:
    history = model.fit(
        train_generator,
        steps_per_epoch=train_steps,
        epochs=epochs,
        validation_data=validation_generator,
        validation_steps=valid_steps)
Error log:
UnknownError Traceback (most recent call last)
in
22 epochs=epochs,
23 validation_data=validation_generator,
---> 24 validation_steps=valid_steps)
~.conda\envs\myenv-36\lib\site-packages\tensorflow_core\python\keras\engine\training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing, **kwargs)
817 max_queue_size=max_queue_size,
818 workers=workers,
---> 819 use_multiprocessing=use_multiprocessing)
820
821 def evaluate(self,
~.conda\envs\myenv-36\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py in fit(self, model, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing, **kwargs)
340 mode=ModeKeys.TRAIN,
341 training_context=training_context,
---> 342 total_epochs=epochs)
343 cbks.make_logs(model, epoch_logs, training_result, ModeKeys.TRAIN)
344
~.conda\envs\myenv-36\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py in run_one_epoch(model, iterator, execution_function, dataset_size, batch_size, strategy, steps_per_epoch, num_samples, mode, training_context, total_epochs)
126 step=step, mode=mode, size=current_batch_size) as batch_logs:
127 try:
---> 128 batch_outs = execution_function(iterator)
129 except (StopIteration, errors.OutOfRangeError):
130 # TODO(kaftan): File bug about tf function and errors.OutOfRangeError?
~.conda\envs\myenv-36\lib\site-packages\tensorflow_core\python\keras\engine\training_v2_utils.py in execution_function(input_fn)
96 # `numpy` translates Tensors to values in Eager mode.
97 return nest.map_structure(_non_none_constant_value,
---> 98 distributed_function(input_fn))
99
100 return execution_function
~.conda\envs\myenv-36\lib\site-packages\tensorflow_core\python\eager\def_function.py in __call__(self, *args, **kwds)
566 xla_context.Exit()
567 else:
---> 568 result = self._call(*args, **kwds)
569
570 if tracing_count == self._get_tracing_count():
~.conda\envs\myenv-36\lib\site-packages\tensorflow_core\python\eager\def_function.py in _call(self, *args, **kwds)
630 # Lifting succeeded, so variables are initialized and we can run the
631 # stateless function.
---> 632 return self._stateless_fn(*args, **kwds)
633 else:
634 canon_args, canon_kwds = \
~.conda\envs\myenv-36\lib\site-packages\tensorflow_core\python\eager\function.py in __call__(self, *args, **kwargs)
2361 with self._lock:
2362 graph_function, args, kwargs = self._maybe_define_function(args, kwargs)
-> 2363 return graph_function._filtered_call(args, kwargs) # pylint: disable=protected-access
2364
2365 @property
~.conda\envs\myenv-36\lib\site-packages\tensorflow_core\python\eager\function.py in _filtered_call(self, args, kwargs)
1609 if isinstance(t, (ops.Tensor,
1610 resource_variable_ops.BaseResourceVariable))),
-> 1611 self.captured_inputs)
1612
1613 def _call_flat(self, args, captured_inputs, cancellation_manager=None):
~.conda\envs\myenv-36\lib\site-packages\tensorflow_core\python\eager\function.py in _call_flat(self, args, captured_inputs, cancellation_manager)
1690 # No tape is watching; skip to running the function.
1691 return self._build_call_outputs(self._inference_function.call(
-> 1692 ctx, args, cancellation_manager=cancellation_manager))
1693 forward_backward = self._select_forward_and_backward_functions(
1694 args,
~.conda\envs\myenv-36\lib\site-packages\tensorflow_core\python\eager\function.py in call(self, ctx, args, cancellation_manager)
543 inputs=args,
544 attrs=("executor_type", executor_type, "config_proto", config),
-> 545 ctx=ctx)
546 else:
547 outputs = execute.execute_with_cancellation(
~.conda\envs\myenv-36\lib\site-packages\tensorflow_core\python\eager\execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
65 else:
66 message = e.message
---> 67 six.raise_from(core._status_to_exception(e.code, message), None)
68 except TypeError as e:
69 keras_symbolic_tensors = [
~.conda\envs\myenv-36\lib\site-packages\six.py in raise_from(value, from_value)
UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. [[node model/conv2d/Conv2D (defined at:24) ]] [Op:__inference_distributed_function_2996]
Function call stack:
distributed_function
Environment:
- Python version: 3.6
- GPU: NVIDIA GeForce GTX 1050 Ti
- RAM: 16GB
- CPU: i7-7700K @ 4.20GHz
This question comes from Stack Exchange; asked by Sam.




