Help with periodic AWS EC2 instance failures: Apache unresponsive, SSH unreachable
Troubleshooting Periodic EC2 Instance Failures on AWS
Hey there, let's work out why your 5 t2.medium EC2 instances periodically go unresponsive: Apache stops answering, SSH stops connecting, and only a console reboot brings the instance back. You included monitoring screenshots from the past 12 hours, but external links aren't accessible here, so I'll focus on the most common root causes for your setup (EFS-shared /var/www, RDS MariaDB, Application Load Balancer) and walk through structured troubleshooting steps:
1. First, dig into EC2 system-level issues (SSH unresponsiveness points to system blockages)
- Pull system logs right after a reboot:
  - Grab the system log via the AWS Console (EC2 > Your Instance > Actions > Monitor and troubleshoot > Get system log) or check CloudWatch Logs if you've set up log aggregation. Look for:
    - OOM Killer messages: t2.medium only has 4GB of RAM. If Apache, database clients, or EFS-related processes are gobbling up all memory, the kernel might kill critical processes or freeze the system entirely. Search for lines like `Out of memory: Killed process`.
    - Disk I/O bottlenecks: Check `dmesg` entries for disk thrashing, or review `iostat` data (if you have monitoring enabled) showing 100% utilization on your root EBS volume. Even though `/var/www` is on EFS, the root volume could be hitting limits from logs or other system files.
    - Kernel panics or hardware errors: Look for `Kernel panic` or hardware timeout messages. These would indicate a low-level system or underlying hardware failure.
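The log checks above can be automated with a small scanner. This is a minimal sketch, not a complete detector: the patterns follow common Linux kernel message shapes, but the exact wording varies between kernel versions, and the hung-task pattern is an addition of mine that often accompanies I/O stalls. The function name `scan_system_log` is hypothetical.

```python
import re

# Message shapes to look for. Wording differs between kernel versions,
# so treat these as starting points rather than an exhaustive list.
PATTERNS = {
    "oom": re.compile(r"Out of memory: Kill(ed)? process \d+"),
    "panic": re.compile(r"Kernel panic"),
    "hung_task": re.compile(r"task \S+ blocked for more than \d+ seconds"),
}


def scan_system_log(text):
    """Return {category: [matching lines]} for a console log or syslog dump."""
    hits = {name: [] for name in PATTERNS}
    for line in text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits[name].append(line.strip())
    return hits
```

Paste the output of "Get system log" into a file, read it into a string, and pass it to `scan_system_log`; any non-empty category tells you which of the three causes to chase first.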
- Enable detailed monitoring and the CloudWatch agent:
  - Turn on EC2 detailed monitoring to get 1-minute metrics instead of the default 5-minute ones, and install the CloudWatch agent to collect memory and disk usage, which EC2 does not report on its own. This will help you catch spikes right before the failure that basic CloudWatch metrics might miss.
- Check t2 burst credit exhaustion:
  - t2 instances rely on CPU burst credits. A t2.medium has a baseline of 20% per vCPU; if your instances consistently run above it, they'll exhaust their credits and get throttled down to the baseline, which under load looks like unresponsiveness. Check the `CPUCreditBalance` metric in CloudWatch; if it drops to 0 regularly, consider upgrading to a non-burstable instance type like m5.large.
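To get a feel for how quickly credits run out, here is a rough worked calculation. It uses the published t2.medium figures (2 vCPUs, 24 credits earned per hour, balance capped at 576, one credit being one vCPU at 100% for one minute) and assumes a steady average `CPUUtilization`; real workloads are burstier, so treat the result as an estimate. The function names are hypothetical.

```python
import math

# Published t2.medium figures. One CPU credit = one vCPU at 100% for one minute.
VCPUS = 2
CREDITS_EARNED_PER_HOUR = 24
MAX_CREDIT_BALANCE = 576


def credits_spent_per_hour(avg_cpu_percent):
    """Credits burned per hour at a steady average CPUUtilization (0-100)."""
    return VCPUS * avg_cpu_percent * 60 / 100


def hours_until_exhausted(balance, avg_cpu_percent):
    """Hours until the balance reaches zero, or math.inf if it never drains."""
    net_drain = credits_spent_per_hour(avg_cpu_percent) - CREDITS_EARNED_PER_HOUR
    if net_drain <= 0:
        return math.inf
    return balance / net_drain
```

At exactly 20% average utilization the instance spends 24 credits per hour, which matches what it earns. At a steady 50% it spends 60 per hour, so even a full balance of 576 credits is gone after 16 hours.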
2. Rule out EFS-related problems (since /var/www is mounted to EFS)
- Check EFS performance metrics:
  - In CloudWatch, look at metrics like `PercentIOLimit`, `BurstCreditBalance`, and `ClientConnections`. If EFS is hitting its I/O limit (`PercentIOLimit` is reported for General Purpose mode file systems), EC2 instances can hang when accessing files on `/var/www`. That would block Apache, and potentially even SSH if system processes depend on EFS (unlikely, but possible if you moved other directories there).
- Verify EFS mount stability:
  - Check `/var/log/messages` or `/var/log/syslog` on your EC2 instances for NFS-related errors (EFS uses NFS). If the mount drops or has intermittent connectivity, processes can freeze waiting for I/O.
  - Improve mount resilience by updating your `/etc/fstab` EFS entry to include these options: `rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2`. These settings help with NFS reliability during temporary blips.
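To confirm which options a live mount is actually using, you can compare the options column of `/proc/mounts` against the list above. A minimal sketch, assuming the file system is mounted at `/var/www` as in your setup; the function names are hypothetical.

```python
# Mount options suggested in the text above.
RECOMMENDED = ["rsize=1048576", "wsize=1048576", "hard", "timeo=600", "retrans=2"]


def nfs_mounts(proc_mounts_text, mount_point="/var/www"):
    """Return the /proc/mounts lines for NFS mounts at the given mount point."""
    result = []
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point and fields[2].startswith("nfs"):
            result.append(line)
    return result


def missing_mount_options(proc_mounts_line):
    """List the recommended options that one /proc/mounts line does not carry."""
    # /proc/mounts columns: device, mount point, fs type, options, dump, pass
    active = set(proc_mounts_line.split()[3].split(","))
    return [opt for opt in RECOMMENDED if opt not in active]
```

On an instance, read `/proc/mounts` into a string, pass it to `nfs_mounts`, and call `missing_mount_options` on each returned line. An empty list means the mount already carries every recommended option.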
3. Check if RDS MariaDB is cascading issues to EC2
- Review RDS metrics during failure windows:
  - Look at CloudWatch metrics for your r4.xlarge RDS instance: `CPUUtilization`, `FreeableMemory`, `ReadLatency`, `WriteLatency`, and `DatabaseConnections`. If RDS is under heavy load (e.g., 100% CPU, long query latencies), Apache processes can hang waiting for database responses, leading to resource exhaustion on EC2.
  - Enable slow query logs in MariaDB to identify long-running queries that might be locking up the database and causing backpressure on your app servers.
- Validate database connection pooling:
  - If your Apache app isn't using proper connection pooling, it might open too many connections to RDS and hit the connection limit. New requests then hang, eventually overwhelming the EC2 instances. Check your app's connection settings and the `max_connections` parameter in the RDS parameter group.
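A quick back-of-the-envelope check helps here. The sketch below assumes the worst case where every Apache worker holds its own database connection, and it also estimates how many workers fit in the 4GB of a t2.medium. All input values in the usage note (workers per instance, per-worker memory, reserved memory, connection limit) are hypothetical placeholders to replace with your own measurements.

```python
def worst_case_db_connections(instances, workers_per_instance, conns_per_worker=1):
    """Upper bound on simultaneous connections the web tier can open."""
    return instances * workers_per_instance * conns_per_worker


def connection_headroom(max_connections, instances, workers_per_instance,
                        conns_per_worker=1):
    """Connections left on the database in the worst case (negative = overcommitted)."""
    return max_connections - worst_case_db_connections(
        instances, workers_per_instance, conns_per_worker)


def max_workers_for_memory(total_mb, reserved_mb, per_worker_mb):
    """How many Apache workers fit in RAM after reserving room for the OS."""
    return (total_mb - reserved_mb) // per_worker_mb
```

For example, 5 instances with 256 workers each could open up to 1280 connections, so a limit of 1000 would be overcommitted by 280. On the memory side, with 4096 MB total, 1024 MB reserved, and 60 MB per worker, only 51 workers fit, so a higher `MaxRequestWorkers` setting would invite the OOM killer.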
4. Analyze ALB and traffic patterns
- Check ALB metrics around failure time:
  - Look at the `RequestCount`, `TargetResponseTime`, and `HTTPCode_Target_5XX_Count` metrics. A sudden traffic spike could overwhelm your EC2 instances, or the ALB might start marking instances as unhealthy (though this is a symptom of the instances being unresponsive, not the cause).
  - Enable ALB access logs to check for unusual requests (e.g., large payloads, repetitive malicious traffic) that could trigger the failure.
5. Set up automated tools to catch issues early
- Create CloudWatch Alarms:
  - Set up alarms for critical metrics:
    - `CPUUtilization` > 90% for 5 minutes
    - `MemoryUtilization` > 95% (memory is not a built-in EC2 metric; it has to be published by the CloudWatch agent, which names it `mem_used_percent` by default)
    - `CPUCreditBalance` < 100
    - EFS `PercentIOLimit` > 80%
  - These will alert you shortly before a failure, giving you real-time context to diagnose.
- Use AWS Systems Manager Run Command:
  - If you can't SSH into instances during a failure, set up Run Command to run diagnostic commands (like `top`, `free`, `iostat`) remotely. A fully hung instance may not answer Run Command either, so also schedule periodic checks to capture data before the instance goes down.
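For the periodic checks, something as small as a memory snapshot can show whether usage climbs steadily before each hang. This is a minimal sketch that parses the contents of `/proc/meminfo`; it assumes a kernel recent enough to report `MemAvailable`, and the function names are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo content into {field name: integer value}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_used_percent(meminfo):
    """Share of RAM in use, based on MemTotal and MemAvailable."""
    total = meminfo["MemTotal"]
    available = meminfo["MemAvailable"]
    return round(100 * (total - available) / total, 1)
```

Run on a schedule (for example through Run Command) and appended to a log file with a timestamp, this gives you a memory trend that survives the reboot.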
- Enable EC2 Instance Recovery:
  - For t2.medium instances, enable Instance Recovery in the AWS Console. It automatically recovers an instance whose system status check fails because of underlying hardware issues, so it will not catch a hang inside the OS itself. It's a temporary workaround to reduce downtime while you find the root cause.
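The alarm thresholds and the recovery setup from this step can be written down as CloudWatch alarm definitions. The sketch below only builds the parameter dictionaries, in the shape the PutMetricAlarm API expects, and does not call AWS. A few things are my assumptions rather than part of your setup: the alarm names, the evaluation periods for the status checks, a CloudWatch agent that publishes `mem_used_percent` with an `InstanceId` dimension, and the extra reboot alarm on the instance status check as a counterpart for OS-level hangs.

```python
def alarm(name, namespace, metric, comparison, threshold, dimensions,
          statistic="Average", period=300, evaluation_periods=1, actions=()):
    """Build keyword arguments in the shape CloudWatch's PutMetricAlarm expects."""
    return {
        "AlarmName": name,
        "Namespace": namespace,
        "MetricName": metric,
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        "Statistic": statistic,
        "Period": period,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": threshold,
        "ComparisonOperator": comparison,
        "AlarmActions": list(actions),
    }


def alarms_for(instance_id, file_system_id, region):
    """Alarm definitions for one instance plus the shared EFS file system."""
    ec2 = {"InstanceId": instance_id}
    efs = {"FileSystemId": file_system_id}
    gt, lt = "GreaterThanThreshold", "LessThanThreshold"
    return [
        alarm(f"{instance_id}-cpu-high", "AWS/EC2", "CPUUtilization", gt, 90, ec2),
        alarm(f"{instance_id}-credits-low", "AWS/EC2", "CPUCreditBalance", lt, 100, ec2),
        # Assumption: the CloudWatch agent publishes this metric per InstanceId.
        alarm(f"{instance_id}-mem-high", "CWAgent", "mem_used_percent", gt, 95, ec2),
        alarm(f"{file_system_id}-io-limit", "AWS/EFS", "PercentIOLimit", gt, 80, efs),
        # Hardware-level failure: move the instance to healthy hardware.
        alarm(f"{instance_id}-recover", "AWS/EC2", "StatusCheckFailed_System", gt, 0, ec2,
              statistic="Maximum", period=60, evaluation_periods=2,
              actions=[f"arn:aws:automate:{region}:ec2:recover"]),
        # OS-level hang: reboot the instance in place.
        alarm(f"{instance_id}-reboot", "AWS/EC2", "StatusCheckFailed_Instance", gt, 0, ec2,
              statistic="Maximum", period=60, evaluation_periods=3,
              actions=[f"arn:aws:automate:{region}:ec2:reboot"]),
    ]
```

Each dictionary can be passed to boto3 as `boto3.client("cloudwatch").put_metric_alarm(**params)`. You would normally also add an SNS topic ARN to `AlarmActions` so the threshold alarms notify someone.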
Question sourced from Stack Exchange; asked by Ulises.




