同一服务器下Nginx+PHP-FPM发起多curl请求偶发504 Gateway Timeout问题排查求助
同一服务器下Nginx+PHP-FPM发起多curl请求偶发504 Gateway Timeout问题排查求助
各位大佬,这个问题已经折腾我5个小时了,快疯了😭。我们的一台服务器突然开始偶发504网关超时错误,场景有点特殊,我把细节都列出来,求帮忙分析下!
服务器环境
- 系统:Debian 10(运行在Proxmox 7上)
- Nginx版本:1.14
- PHP版本:7.2.34 (39+020230609.84+debian101.gbpf63844)
- curl版本:7.64.0
问题复现场景
两台虚拟主机都在同一台机器上:
- 虚拟主机VH1(example.com/sd.php):代码极其简单,仅输出字符串:
<?php echo 'Response';
- 虚拟主机VH2(example1.com/sd.php):循环2次用curl请求VH1的接口:
<?php $host= 'https://example.com/sd.php'; for ( $i = 0; $i<2; $i++) { $ch = curl_init(); curl_setopt($ch, CURLOPT_URL, $host); curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); $season_data = curl_exec($ch); if (curl_errno($ch)) { print_r(curl_error($ch)); print "Error: " . curl_error($ch); exit(); } curl_close($ch); echo "Responded" . $i."<br />"; }
现在的问题是:在浏览器访问VH2的sd.php,每刷新5-9次就会出现一次504 Gateway Timeout。
相关日志信息
VH1的Nginx错误日志
出现504时,VH1的日志会报:
2023/07/29 21:16:04 [error] 27138#27138: *851 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 95.94.77.55, server: example.com, request: "GET /sd.php HTTP/1.1", upstream: "fastcgi://unix:/run/php/php7.2-fpm.sock:", host: "example.com"
PHP-FPM状态与日志
一开始以为是PHP-FPM进程不足,但监控显示进程完全充足:
[29-Jul-2023 21:26:32.662531] DEBUG: pid 15827, fpm_pctl_perform_idle_server_maintenance(), line 378: [pool www] currently 0 active children, 25 spare children, 25 running children. Spawning rate 1 [29-Jul-2023 21:26:33.663691] DEBUG: pid 15827, fpm_pctl_perform_idle_server_maintenance(), line 378: [pool www] currently 0 active children, 25 spare children, 25 running children. Spawning rate 1 [29-Jul-2023 21:26:34.664941] DEBUG: pid 15827, fpm_pctl_perform_idle_server_maintenance(), line 378: [pool www] currently 0 active children, 25 spare children, 25 running children. Spawning rate 1
PHP-FPM关键配置:
listen = /run/php/php7.2-fpm.sock request_terminate_timeout = 30s
(之前用的TCP socket,今天改成unix sock后问题依旧)
启用慢日志后,日志指向VH2的sd.php里的curl_exec调用:
[29-Jul-2023 21:23:24.315874] [pool www] pid 15840 script_filename = /var/www/html/vh2/public/sd.php [0x00007fb8fd21e110] curl_exec() (...)/vh1/public/sd.php:18
VH2的Nginx日志
VH2的Nginx即使开了debug级别的错误日志,出现504时也没有任何记录。
Nginx相关配置
我已经调大了超时时间,但问题依旧:
client_body_timeout 10; client_header_timeout 10; keepalive_timeout 30; send_timeout 10; proxy_connect_timeout 600s; proxy_send_timeout 600s; proxy_read_timeout 600s; fastcgi_send_timeout 600s; fastcgi_read_timeout 600s; fastcgi_temp_file_write_size 256k;
压力测试结果
用ab工具直接压VH1的接口,不管是本地还是远程,结果都完全正常,没有失败请求:
tio ~ $ ab -n 5000 -c 100 https://example.com/sd.php This is ApacheBench, Version 2.3 <$Revision: 1879490 $> Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/ Licensed to The Apache Software Foundation, http://www.apache.org/ Benchmarking example.com (be patient) Completed 500 requests Completed 1000 requests Completed 1500 requests Completed 2000 requests Completed 2500 requests Completed 3000 requests Completed 3500 requests Completed 4000 requests Completed 4500 requests Completed 5000 requests Finished 5000 requests Server Software: pp7-2 Server Hostname: example.com Server Port: 443 SSL/TLS Protocol: TLSv1.2,ECDHE-RSA-AES256-GCM-SHA384,2048,256 Server Temp Key: X25519 253 bits TLS Server Name: example.com Document Path: /sd.php Document Length: 5 bytes Concurrency Level: 100 Time taken for tests: 9.910 seconds Complete requests: 5000 Failed requests: 0 Total transferred: 1940000 bytes HTML transferred: 25000 bytes Requests per second: 504.54 [#/sec] (mean) Time per request: 198.200 [ms] (mean) Time per request: 1.982 [ms] (mean, across all concurrent requests) Transfer rate: 191.17 [Kbytes/sec] received Connection Times (ms) min mean[+/-sd] median max Connect: 113 136 23.5 133 452 Processing: 41 57 20.3 52 425 Waiting: 41 56 19.5 52 424 Total: 161 192 34.0 186 594 Percentage of the requests served within a certain time (ms) 50% 186 66% 192 75% 196 80% 199 90% 208 95% 219 98% 261 99% 334 100% 594 (longest request)
进一步测试:用stream_socket替代curl
我换了一段用stream_socket发起请求的代码(sd2.php),结果还是偶发504:
<?php function connect($host, $port, $timeout = 1) { $conn_str = "tcp://{$host}:{$port}"; $opts = STREAM_CLIENT_CONNECT | STREAM_CLIENT_ASYNC_CONNECT | STREAM_CLIENT_PERSISTENT; $sock = stream_socket_client($conn_str, $errno, $errstr, $timeout, $opts); return $sock; } $sock = connect("google.com", 80); $req = "GET / HTTP/1.0\r\nHost: www.google.com\r\nAccept: */*\r\n\r\n"; $len = fwrite($sock, $req); $data = stream_get_contents($sock); echo $data;
此时VH2的Nginx错误日志:
2023/07/29 21:52:34 [error] 27138#27138: *35680 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 95.94.77.55, server: example1.com, request: "GET /sd2.php HTTP/1.1", upstream: "fastcgi://unix:/run/php/php7.2-fpm.sock:", host: "www.example1.com" 2023/07/29 21:53:12 [info] 27138#27138: *35925 epoll_wait() reported that client prematurely closed connection, so upstream connection is closed too (104: Connection reset by peer) while sending request to upstream, client: 95.94.77.55, server: example1.com, request: "GET /sd2.php HTTP/1.1", upstream: "fastcgi://unix:/run/php/php7.2-fpm.sock:", host: "www.example1.com" 2023/07/29 21:54:14 [error] 27138#27138: *36093 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 95.94.77.55, server: example1.com, request: "GET /sd2.php HTTP/1.1", upstream: "fastcgi://unix:/run/php/php7.2-fpm.sock:", host: "www.example1.com"
对应的PHP-FPM日志显示进程超时被终止:
[29-Jul-2023 21:53:54.440363] DEBUG: pid 15827, fpm_pctl_perform_idle_server_maintenance(), line 378: [pool www] currently 2 active children, 27 spare children, 29 running children. Spawning rate 1 [29-Jul-2023 21:53:54.510612] WARNING: pid 15827, fpm_request_check_timed_out(), line 278: [pool www] child 16562, script '/var/www/html/vh2//public/sd2.php' (request: "GET /sd2.php") execution timed out (42.699082 sec), terminating [29-Jul-2023 21:53:54.512034] DEBUG: pid 15827, fpm_got_signal(), line 75: received SIGCHLD [29-Jul-2023 21:53:54.512087] WARNING: pid 15827, fpm_children_bury(), line 256: [pool www] child 16562 exited on signal 15 (SIGTERM) after 702.886545 seconds from start [29-Jul-2023 21:53:54.512861] NOTICE: pid 15827, fpm_children_make(), line 425: [pool www] child 17005 started [29-Jul-2023 21:53:54.512895] DEBUG: pid 15827, fpm_event_loop(), line 423: event module triggered 1 events [29-Jul-2023 21:53:55.441977] DEBUG: pid 15827, fpm_pctl_perform_idle_server_maintenance(), line 378: [pool www] currently 1 active children, 28 spare children, 29 running children. Spawning rate 1
我的困惑
现在完全搞不懂哪里出问题了,感觉像是curl/php-fpm/nginx之间有什么奇怪的交互bug?有没有大佬遇到过类似的情况,或者能给点排查方向?
备注:内容来源于stack exchange,提问作者Tio




