Troubleshooting and Fixing SSH Session Drops During Long-Running Ansible Tasks
Let's break down why your SSH connection is closing after 6 hours, even though the target server's SSH config looks like it should support longer sessions, plus actionable fixes to get your task across the finish line.
Root Cause Analysis
The issue isn't likely the target server's SSH settings alone—here's what's probably happening:
- Intermediate Network Device Timeout
Firewalls, NAT gateways, and load balancers usually enforce an idle connection timeout and silently drop connections that carry no traffic, regardless of the SSH server's ClientAliveInterval setting. Defaults vary by device and vendor, so a failure that always lands at the 6-hour mark is consistent with a fixed timer somewhere on the path. Because your script redirects all of its output to a log file, nothing flows over the SSH channel while it runs, and the connection looks idle to those devices.
- Control Node SSH Client Timeout
Your target server's config only controls server-side behavior. The SSH client on your Ansible control node has its own settings, and OpenSSH sends no client keepalives by default (ServerAliveInterval is 0), so nothing on the client side keeps the connection active either.
- Unapplied SSHD Config (Edge Case)
Double-check that you restarted the sshd service after updating the target server's config. If you did not, the new ClientAliveInterval values are not active.
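To narrow down which of these causes applies, it can help to look at the effective client settings rather than the config files. The following is a minimal sketch, assuming an OpenSSH client recent enough to support `ssh -G`; the helper name `client_keepalive` and the host `target.example.com` are placeholders, not part of the original setup.

```shell
# Hypothetical helper: print the keepalive-related options the SSH client
# would actually use for a host. `ssh -G` dumps the resolved client
# configuration as lowercase "key value" lines without connecting.
client_keepalive() {
  ssh -G "$@" | grep -iE '^(serveralive|tcpkeepalive)'
}

# Example calls (commented out because the host is a placeholder):
#   client_keepalive target.example.com
#   client_keepalive -o ServerAliveInterval=60 target.example.com
```

A value of `serveraliveinterval 0` means the client sends no keepalives. Note that this reflects `~/.ssh/config` and the system-wide client config only; options Ansible adds through `ssh_args` show up only if you pass them in, as in the second call.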
Fixes & Workarounds
1. Force Client-Side SSH Keepalives
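Add these settings to your ansible.cfg so the Ansible control node sends regular keepalive packets (the inventory equivalent is the ansible_ssh_common_args variable). This prevents network devices from marking the connection as idle. Be aware that setting ssh_args replaces Ansible's default SSH arguments (-C -o ControlMaster=auto -o ControlPersist=60s), so append those if you rely on them:

    [ssh_connection]
    ssh_args = -o ServerAliveInterval=60 -o ServerAliveCountMax=1000

- ServerAliveInterval=60: Sends a keepalive every 60 seconds of inactivity to keep the connection "active".
- ServerAliveCountMax=1000: Tolerates up to 1000 consecutive unanswered keepalives, roughly 16.7 hours (60 s × 1000), before closing the connection. That rides out long network blips, but it also means a truly dead connection takes that long to be detected.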
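The tolerance implied by the two options is simply their product. As a small worked example (the function name `keepalive_window` is made up for illustration):

```shell
# Seconds of unanswered keepalives the SSH client tolerates before it
# gives up: ServerAliveInterval * ServerAliveCountMax.
keepalive_window() {
  echo $(( $1 * $2 ))
}

keepalive_window 60 1000   # 60000 seconds, about 16.7 hours
keepalive_window 60 3      # 180 seconds, using OpenSSH's default count of 3
```

If you would rather have a dead connection fail fast than hang, a much smaller ServerAliveCountMax is a reasonable choice; the keepalive traffic that keeps firewalls happy comes from ServerAliveInterval alone.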
2. Verify & Apply Target Server SSHD Config
Confirm the target server's /etc/ssh/sshd_config has the correct values, then restart the service to apply changes:
sudo systemctl restart sshd
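Note that ClientAliveInterval 172000 (≈47.8 hours) is not a session lifetime. It is how long the server waits on an idle connection before sending its own keepalive probe, so at that value the server sends nothing at all during a 26-hour run and does not keep the connection alive through intermediate devices. Client-side keepalives (fix 1) cover that gap.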
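One way to combine the verify and restart steps is sketched below. It assumes a systemd host where the service is named `sshd` (on Debian and Ubuntu it is often `ssh`) and that you have sudo rights; the function name `apply_sshd_config` is hypothetical.

```shell
# Validate the sshd config, restart the service, then print the
# ClientAlive* values the daemon's configuration actually resolves to.
apply_sshd_config() {
  sudo sshd -t || return 1                  # stop here on syntax errors
  sudo systemctl restart sshd || return 1   # service name is an assumption
  sudo sshd -T | grep -iE '^clientalive'    # effective, lowercase key/value pairs
}

# Run on the target server:
#   apply_sshd_config
```

Running `sshd -t` first avoids restarting into a broken config, and `sshd -T` shows the values sshd resolves after parsing, which is more reliable than reading the file by eye.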
3. Run the Task Asynchronously (Best Practice for Long Tasks)
Instead of holding the SSH connection open for 26 hours, let Ansible start the task, disconnect, and check back periodically. Modify your task like this:
    - name: Executing script asynchronously
      remote_user: "{{admin_user}}"
      become: yes
      shell: sudo -u test bash ./customscript.sh > /log_dir/customscript.log 2>&1
      args:
        chdir: "deployment_source/common"
      async: 93600  # 26 hours in seconds
      poll: 300     # Check task status every 5 minutes (300 seconds)
      tags:
        - custom-test
This way, even if SSH connections drop during polling intervals, Ansible will just reconnect to check status—your long-running script continues uninterrupted on the target server.
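The `async` value is a runtime budget in seconds and should comfortably exceed the script's expected duration, since a job that outlives it is treated as timed out. A quick sanity check of the numbers used above (both helper names are illustrative only):

```shell
# Convert a runtime budget in whole hours to the seconds `async` expects.
hours_to_seconds() {
  echo $(( $1 * 3600 ))
}

# Upper bound on the number of status checks over the whole budget.
max_polls() {
  echo $(( $1 / $2 ))
}

hours_to_seconds 26    # 93600
max_polls 93600 300    # 312
```

If the script sometimes runs past 26 hours, add headroom to `async` rather than setting it to the exact expected runtime.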
4. Adjust Network Device Timeouts (If Accessible)
If you manage the firewalls or load balancers between your control node and target server, raise their idle connection timeout above 26 hours. If you cannot change it, the client-side keepalives from fix 1 already count as ordinary traffic on the connection, which keeps the idle timer from expiring.
The question answered here comes from Stack Exchange; it was asked by Ifti.




