Troubleshooting and Resolving SSH Session Drops During Long-Running Ansible Tasks

Ahua AIGC Lab

2026-5-15

SSH Session Dropped During Long-Running Ansible Task (6h vs 26h Duration)

Let's break down why your SSH connection is closing after 6 hours, even though the target server's SSH config looks like it should support longer sessions, plus actionable fixes to get your task across the finish line.

Root Cause Analysis

The target server's SSH settings alone are unlikely to be the cause. The more probable explanations are:

Intermediate Network Device Timeout
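Firewalls, NAT gateways, and load balancers usually enforce an idle timeout on established TCP connections. Defaults vary widely by vendor, from a few minutes to several hours, and a connection that carries no traffic for that long is dropped. The server-side setting does not help here: with ClientAliveInterval set to 172000, sshd waits almost 48 hours before sending its first probe, so the connection looks idle for as long as the task produces no output. A failure at a consistent 6-hour mark is what a fixed idle timer of that length would produce.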
Control Node SSH Client Settings
The target server's sshd_config only controls server-side behavior. The OpenSSH client on your Ansible control node sends no keepalives by default (ServerAliveInterval defaults to 0), so unless ansible.cfg or ~/.ssh/config sets one, nothing on the client side generates traffic while the task runs.
Unapplied SSHD Config (Edge Case)
Check that the sshd service was restarted after you updated the target server's config. Until it is, the new ClientAliveInterval value is not in effect.
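One way to check which keepalive values are actually in effect is to dump the effective configuration on each side and filter it. This is a minimal sketch that assumes OpenSSH on both hosts; `keepalive_settings` is a hypothetical helper name and `target.example.com` is a placeholder hostname.

```shell
# Keep only the keepalive-related lines from an effective-config dump.
# Both `sshd -T` and `ssh -G` print one "keyword value" pair per line.
keepalive_settings() {
    grep -iE '^(clientalive|serveralive|tcpkeepalive)'
}

# On the target server (effective sshd settings, needs root):
#   sudo sshd -T | keepalive_settings
# On the Ansible control node (effective client settings for one host):
#   ssh -G target.example.com | keepalive_settings
```

Note that `ssh -G` reflects the SSH client configuration files only. Options that Ansible passes on the command line through ssh_args will not show up there.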

Fixes & Workarounds

1. Force Client-Side SSH Keepalives

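Add these settings to your ansible.cfg (or pass the same -o options through ansible_ssh_common_args in your inventory) so the Ansible control node sends regular keepalive packets. This prevents network devices from marking the connection as idle. Setting ssh_args replaces Ansible's default SSH arguments, so the defaults (-C, ControlMaster, ControlPersist) are repeated here:

[ssh_connection]
ssh_args = -C -o ControlMaster=auto -o ControlPersist=60s -o ServerAliveInterval=60 -o ServerAliveCountMax=1000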

ServerAliveInterval=60: Sends a keepalive every 60 seconds to keep the connection "active".
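ServerAliveCountMax=1000: The client closes the connection only after 1000 consecutive unanswered keepalives, nominally 1000 × 60 s ≈ 16.7 hours. This rides out long network blips, but a much smaller value is usually enough and detects a dead connection sooner.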
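The tolerance window is the product of the two settings, which is easy to misjudge by a factor of 60. A quick sketch of the arithmetic, where `keepalive_budget_seconds` is a hypothetical helper name:

```shell
# Nominal time the client tolerates consecutive unanswered keepalives.
keepalive_budget_seconds() {
    interval=$1    # ServerAliveInterval, in seconds
    count_max=$2   # ServerAliveCountMax
    echo $(( interval * count_max ))
}

budget=$(keepalive_budget_seconds 60 1000)
echo "budget: ${budget} s ($(( budget / 60 )) min)"
# prints "budget: 60000 s (1000 min)"
```

For comparison, `keepalive_budget_seconds 60 30` gives 1800 seconds, or 30 minutes.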

2. Verify & Apply Target Server SSHD Config

Confirm the target server's /etc/ssh/sshd_config has the correct values, then restart the service to apply changes (on Debian and Ubuntu the unit is named ssh rather than sshd):

sudo systemctl restart sshd

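Your ClientAliveInterval 172000 (≈48 hours) will not cause a disconnect by itself, but it also means the server sends no probe at any point during the 26-hour run. The client-side keepalives from step 1 are what generate the regular traffic that stops intermediate devices from treating the connection as idle.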
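Restarting sshd with a broken config can lock you out of the host, so it is worth validating the file first. A minimal sketch, assuming systemd and sudo access; `restart_sshd_checked` is a hypothetical helper name:

```shell
# Validate sshd_config with sshd's test mode, and restart only if it passes.
restart_sshd_checked() {
    unit=${1:-sshd}   # pass "ssh" on Debian/Ubuntu
    sudo sshd -t && sudo systemctl restart "$unit"
}

# Usage:
#   restart_sshd_checked        # RHEL-style unit name
#   restart_sshd_checked ssh    # Debian/Ubuntu unit name
```

If `sshd -t` reports an error, the function returns non-zero and the running service is left untouched.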

3. Run the Task Asynchronously (Best Practice for Long Tasks)

Instead of holding the SSH connection open for 26 hours, let Ansible start the task, disconnect, and check back periodically. Modify your task like this:

- name: Executing script asynchronously
  remote_user: "{{admin_user}}"
  become: yes
  shell: sudo -u test bash ./customscript.sh > /log_dir/customscript.log 2>&1
  args:
    chdir: "deployment_source/common"
  async: 93600  # 26 hours in seconds
  poll: 300     # Check task status every 5 minutes (300 seconds)
  tags:
    - custom-test

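With this setup Ansible does not hold one SSH session open for 26 hours. It starts the script in the background on the target server and then runs a short status check every 300 seconds, so a connection dropped between polls does not interrupt the script. Note that async is a hard upper bound on runtime: if the script might run longer than 26 hours, set async above 93600 so the job is not killed when the limit is reached.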
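The async and poll values are plain second counts, so a small calculation helps when picking them. The sketch below uses hypothetical helper names, and the 2-hour headroom is only an illustrative assumption:

```shell
# Convert an expected runtime in hours, plus optional headroom, to seconds.
async_seconds() {
    hours=$1
    headroom_hours=${2:-0}
    echo $(( (hours + headroom_hours) * 3600 ))
}

# Number of status checks Ansible makes over the full async window.
poll_count() {
    async=$1
    poll=$2
    echo $(( async / poll ))
}

async_seconds 26               # the value used in the task above
async_seconds 26 2             # the same runtime with 2 hours of headroom
poll_count "$(async_seconds 26)" 300
```

The first call confirms the 93600 used in the task, and the last shows that a 5-minute poll interval means a few hundred short status checks instead of one 26-hour session.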

4. Adjust Network Device Timeouts (If Accessible)

If you manage the firewalls or load balancers between your control node and target server, raise their idle connection timeout for this traffic to more than 26 hours. This matters less once client-side keepalives are in place, since a packet every 60 seconds should keep the connection from ever counting as idle.

The question in this post comes from Stack Exchange; it was asked by Ifti.