Help request: periodic AWS EC2 instance failures (Apache unresponsive, SSH unreachable)

阿华AIGC实验室

2026-5-27

Troubleshooting Periodic EC2 Instance Failures on AWS

Hey there, let's work through why your five t2.medium EC2 instances periodically become unresponsive: Apache stops answering, SSH fails, and only a reboot from the console brings them back. You included monitoring screenshots from the past 12 hours, but external links aren't accessible here, so I'll focus on the most common root causes for your setup (EFS-shared /var/www, RDS MariaDB, Application Load Balancer) and walk you through structured troubleshooting steps:

1. First, dig into EC2 system-level issues (SSH unresponsiveness points to system blockages)

Pull system logs right after a reboot:
- Grab the system log via the AWS Console (EC2 > Your Instance > Actions > Monitor and troubleshoot > Get system log) or check CloudWatch Logs if you've set up log aggregation. Look for:
  - OOM Killer messages: t2.medium only has 4GB of RAM. If Apache, database clients, or EFS-related processes are gobbling up all memory, the kernel might kill critical processes or freeze the system entirely. Search for lines like Out of memory: Killed process.
  - Disk I/O bottlenecks: Check dmesg entries for disk thrashing, or review iostat data (if you have monitoring enabled) showing 100% utilization on your root EBS volume. Even though /var/www is on EFS, the root volume could be hitting limits from logs or other system files.
  - Kernel panics or hardware errors: Look for Kernel panic or hardware timeout messages—these would indicate a low-level system or underlying hardware failure.
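To make the log review above repeatable across all five instances, here is a minimal Python sketch that scans a saved console or dmesg log for those signatures. The category names and the two extra patterns (hung tasks and NFS stalls) are my own additions for illustration, not something the AWS console produces.

```python
import re

# Signatures to look for. The OOM and panic strings are the kernel's own wording;
# the hung-task and NFS patterns are extra hints for blocked I/O.
PATTERNS = {
    "oom_kill": re.compile(r"Out of memory: Kill(ed)? process"),
    "kernel_panic": re.compile(r"Kernel panic"),
    "hung_task": re.compile(r"blocked for more than \d+ seconds"),
    "nfs_stall": re.compile(r"nfs: server \S+ not responding"),
}


def scan_console_log(text):
    """Return {category: [matching lines]} for a saved console or dmesg log."""
    hits = {name: [] for name in PATTERNS}
    for line in text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits[name].append(line.strip())
    return hits


def summarize(hits):
    """Count matches per category, dropping categories with no matches."""
    return {name: len(lines) for name, lines in hits.items() if lines}
```

Save each instance's log to a file (the name is up to you) and run `summarize(scan_console_log(open(path).read()))` on it. If the same category shows up on every instance around the failure times, that is a strong hint about where to dig next.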
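Enable detailed monitoring and the CloudWatch agent:
- Basic EC2 monitoring reports at 5-minute intervals and has no memory or disk-usage metrics at all. Turn on detailed monitoring to get 1-minute CPU, network, and status-check data, and install the CloudWatch agent to publish memory and disk usage from inside the OS. This helps you catch spikes right before the failure that 5-minute data would smooth over. (Enhanced Monitoring is an RDS feature, not an EC2 one; it is worth enabling on the database side.)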
Check t2 burst credit exhaustion:
- t2 instances rely on CPU burst credits. A t2.medium has a baseline of 20% per vCPU; if your instances consistently run above that, they exhaust their credits and are throttled back to the baseline, which under load can look like an unresponsive host. Check the CPUCreditBalance metric in CloudWatch; if it regularly drops to 0, consider a non-burstable instance type like m5.large, or T2 Unlimited as a stopgap.
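The credit arithmetic is easy to check by hand. The sketch below uses the published t2.medium figures (2 vCPUs, 24 credits earned per hour, a maximum balance of 576) and assumes standard mode rather than unlimited mode. The function names are hypothetical.

```python
# Published t2.medium figures. One CPU credit is one vCPU at 100% for one minute.
VCPUS = 2
CREDITS_EARNED_PER_HOUR = 24
MAX_CREDIT_BALANCE = 576


def net_credits_per_hour(avg_cpu_percent):
    """Net credit change per hour at a given average CPUUtilization (0-100)."""
    spent = avg_cpu_percent * VCPUS * 60 / 100
    return CREDITS_EARNED_PER_HOUR - spent


def hours_until_exhausted(avg_cpu_percent, balance=MAX_CREDIT_BALANCE):
    """Hours until the balance reaches zero, or None if it never drains."""
    net = net_credits_per_hour(avg_cpu_percent)
    if net >= 0:
        return None
    return balance / -net
```

At a sustained 50% CPUUtilization the instance spends 60 credits per hour against 24 earned, so a full balance of 576 lasts 16 hours. If your failures recur on a roughly regular cycle, compare that cycle against the CPUCreditBalance graph.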

2. Rule out EFS-related problems (since `/var/www` is mounted to EFS)

Check EFS performance metrics:
- In CloudWatch, look at PercentIOLimit, BurstCreditBalance, and ClientConnections. PercentIOLimit is only reported for General Purpose performance mode; if it sits near 100%, or if BurstCreditBalance runs out under bursting throughput mode, file access on /var/www slows or stalls. That blocks Apache, and it can block SSH too if the login path touches EFS (unlikely, but possible if you moved other directories there).
Verify EFS mount stability:
- Check /var/log/messages or /var/log/syslog on your EC2 instances for NFS-related errors (EFS uses NFS). If the mount drops or has intermittent connectivity, processes can freeze waiting for I/O.
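- Improve mount resilience by updating your /etc/fstab EFS entry to include these options: rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2. With hard, clients keep retrying through a temporary network blip instead of returning I/O errors to Apache. The trade-off is that processes block until EFS answers, so a long stall still shows up as a hang.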
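To check what each instance is actually mounted with, you can compare the options column from /proc/mounts (or your fstab entry) against the list above. This is a small sketch with hypothetical helper names, and it only checks the five options named in this answer.

```python
# Options named in the text above for the EFS (NFS) mount.
RECOMMENDED = {
    "rsize": "1048576",
    "wsize": "1048576",
    "hard": None,  # flag option, carries no value
    "timeo": "600",
    "retrans": "2",
}


def parse_mount_options(option_string):
    """Turn 'a=1,b,c=2' into {'a': '1', 'b': None, 'c': '2'}."""
    options = {}
    for item in option_string.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        options[key] = value if sep else None
    return options


def missing_or_different(option_string, recommended=RECOMMENDED):
    """List recommended options that are absent or set to another value."""
    actual = parse_mount_options(option_string)
    problems = []
    for key, wanted in recommended.items():
        if key not in actual:
            problems.append(f"{key}: missing")
        elif actual[key] != wanted:
            problems.append(f"{key}: {actual[key]} (expected {wanted})")
    return problems
```

An empty list means the mount already matches. A `soft` mount or a short `timeo` is worth fixing before you look any further at EFS.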

3. Check if RDS MariaDB is cascading issues to EC2

Review RDS metrics during failure windows:
- Look at CloudWatch metrics for your db.r4.xlarge RDS instance: CPUUtilization, FreeableMemory, ReadLatency, WriteLatency, and DatabaseConnections. If RDS is under heavy load (e.g., 100% CPU, long query latencies), Apache processes can hang waiting for database responses, leading to resource exhaustion on EC2.
- Enable slow query logs in MariaDB to identify long-running queries that might be locking up the database and causing backpressure on your app servers.
Validate database connection pooling:
- If your app behind Apache isn't pooling or reusing connections properly, it can open too many connections to RDS and hit the connection limit. New requests then hang or fail while workers pile up, which eventually overwhelms the EC2 instances. Check your app's connection settings, the max_connections value in the RDS DB parameter group, and the DatabaseConnections metric during failure windows.
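A quick way to sanity-check this is to compare the worst-case number of connections your fleet can open against the database limit. The sketch below is plain arithmetic; the function names are hypothetical, and you would take the real worker count from your Apache MPM settings and the real limit from `SHOW VARIABLES LIKE 'max_connections'`.

```python
def worst_case_connections(instances, workers_per_instance, conns_per_worker=1):
    """Upper bound on simultaneous DB connections if every worker is busy."""
    return instances * workers_per_instance * conns_per_worker


def headroom(max_connections, instances, workers_per_instance, conns_per_worker=1):
    """Connections left over at worst case; negative means the limit can be hit."""
    return max_connections - worst_case_connections(
        instances, workers_per_instance, conns_per_worker
    )
```

For example, five instances with 256 workers each (an assumed figure, not taken from your config) can open up to 1280 connections. Against a limit of 1000 that leaves a headroom of -280, meaning a traffic burst alone could exhaust the database.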

4. Analyze ALB and traffic patterns

Check ALB metrics around failure time:
- Look at the RequestCount, TargetResponseTime, HTTPCode_Target_5XX_Count, and UnHealthyHostCount metrics. A sudden traffic spike could overwhelm your EC2 instances, or the ALB might start marking instances as unhealthy (though this is a symptom of the instances being unresponsive, not the cause).
- Enable ALB access logs to check for unusual requests (e.g., large payloads, repetitive malicious traffic) that could trigger the failure.

5. Set up automated tools to catch issues early

Create CloudWatch Alarms:
- Set up alarms for critical metrics:
  - CPUUtilization > 90% for 5 minutes
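  - Memory utilization > 95% (not a built-in EC2 metric; it requires the CloudWatch agent, which publishes it as mem_used_percent)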
  - CPUCreditBalance < 100
  - EFS PercentIOLimit > 80%
- If one of these resources is the cause, the matching alarm should fire shortly before a failure and tell you where to look first.
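The alarm list above can be written down as data before you create anything. This sketch builds keyword arguments shaped for CloudWatch's PutMetricAlarm API. The alarm names and the helper functions are hypothetical, and the memory alarm assumes the CloudWatch agent is publishing `mem_used_percent` under the `CWAgent` namespace with an `InstanceId` dimension, which depends on your agent configuration.

```python
def alarm(name, namespace, metric, dimensions, operator, threshold,
          periods=1, period_seconds=300, statistic="Average"):
    """Build keyword arguments shaped for CloudWatch's PutMetricAlarm API."""
    return {
        "AlarmName": name,
        "Namespace": namespace,
        "MetricName": metric,
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        "Statistic": statistic,
        "Period": period_seconds,
        "EvaluationPeriods": periods,
        "ComparisonOperator": operator,
        "Threshold": threshold,
    }


def alarms_for(instance_id, file_system_id):
    """The four alarms from the list above, for one instance and one file system."""
    ec2 = {"InstanceId": instance_id}
    efs = {"FileSystemId": file_system_id}
    return [
        alarm(f"{instance_id}-cpu-high", "AWS/EC2", "CPUUtilization",
              ec2, "GreaterThanThreshold", 90),
        # Assumption: memory comes from the CloudWatch agent, not from EC2 itself.
        alarm(f"{instance_id}-mem-high", "CWAgent", "mem_used_percent",
              ec2, "GreaterThanThreshold", 95),
        alarm(f"{instance_id}-credits-low", "AWS/EC2", "CPUCreditBalance",
              ec2, "LessThanThreshold", 100),
        alarm(f"{file_system_id}-io-limit", "AWS/EFS", "PercentIOLimit",
              efs, "GreaterThanThreshold", 80),
    ]
```

With boto3 installed and credentials configured, each dictionary can be passed to `boto3.client("cloudwatch").put_metric_alarm(**kwargs)`. Add an `AlarmActions` entry with your own SNS topic ARN to get notified.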
Use AWS Systems Manager Run Command:
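- If you can't SSH into instances during a failure, Run Command gives you another way to run diagnostic commands (like top, free, iostat) remotely. It needs the SSM Agent running and an IAM instance profile that allows Systems Manager, and it won't answer either if the OS is completely hung. For that reason the more useful setup is a scheduled check (for example, a State Manager association) that captures this data every few minutes, so you have a record from just before the instance goes down.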
Enable EC2 Instance Recovery:
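- t2.medium supports automatic recovery: create a CloudWatch alarm on StatusCheckFailed_System with the recover action, and EC2 will move the instance to healthy hardware when the underlying host fails. That does not cover an OS-level hang like the one you describe, so also add an alarm on StatusCheckFailed_Instance with the reboot action. Both are temporary workarounds to reduce downtime while you find the root cause.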
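Automatic recover and reboot are wired up as CloudWatch alarm actions on the EC2 status-check metrics. Below is a minimal sketch of the two alarm definitions, again shaped for the PutMetricAlarm API. The alarm names and the three-period evaluation window are assumptions you should tune.

```python
# Which status-check metric drives which automated EC2 action.
METRIC_FOR_ACTION = {
    "recover": "StatusCheckFailed_System",   # underlying host or hardware problem
    "reboot": "StatusCheckFailed_Instance",  # OS-level problem inside the instance
}


def status_check_alarm(instance_id, region, action):
    """Build PutMetricAlarm arguments for an automated recover or reboot."""
    return {
        "AlarmName": f"{instance_id}-auto-{action}",
        "Namespace": "AWS/EC2",
        "MetricName": METRIC_FOR_ACTION[action],
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 3,  # assumed: act after three failed minutes in a row
        "ComparisonOperator": "GreaterThanThreshold",
        "Threshold": 0,
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:{action}"],
    }
```

Calling `status_check_alarm(instance_id, region, "reboot")` for each of the five instances gives you an automated version of the console reboot you are doing by hand today.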