You need to enable JavaScript to run this app.
最新活动
大模型
产品
解决方案
定价
生态与合作
支持与服务
开发者
了解我们

同一服务器下Nginx+PHP-FPM发起多curl请求偶发504 Gateway Timeout问题排查求助

同一服务器下Nginx+PHP-FPM发起多curl请求偶发504 Gateway Timeout问题排查求助

各位大佬,这个问题已经折腾我5个小时了,快疯了😭。我们的一台服务器突然开始偶发504网关超时错误,场景有点特殊,我把细节都列出来,求帮忙分析下!

服务器环境

  • 系统:Debian 10(运行在Proxmox 7上)
  • Nginx版本:1.14
  • PHP版本:7.2.34 (39+020230609.84+debian101.gbpf63844)
  • curl版本:7.64.0

问题复现场景

两台虚拟主机都在同一台机器上:

  1. 虚拟主机VH1(example.com/sd.php):代码极其简单,仅输出字符串:
<?php
echo 'Response';
  1. 虚拟主机VH2(example1.com/sd.php):循环2次用curl请求VH1的接口:
<?php
$host= 'https://example.com/sd.php';
for ( $i = 0; $i<2; $i++) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $host);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $season_data = curl_exec($ch);
    if (curl_errno($ch)) {
        print_r(curl_error($ch));
        print "Error: " . curl_error($ch);
        exit();
    }
    curl_close($ch);
    echo "Responded" . $i."<br />";
}

现在的问题是:在浏览器访问VH2的sd.php,每刷新5-9次就会出现一次504 Gateway Timeout。

相关日志信息

VH1的Nginx错误日志

出现504时,VH1的日志会报:

2023/07/29 21:16:04 [error] 27138#27138: *851 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 95.94.77.55, server: example.com, request: "GET /sd.php HTTP/1.1", upstream: "fastcgi://unix:/run/php/php7.2-fpm.sock:", host: "example.com"

PHP-FPM状态与日志

一开始以为是PHP-FPM进程不足,但监控显示进程完全充足:

[29-Jul-2023 21:26:32.662531] DEBUG: pid 15827, fpm_pctl_perform_idle_server_maintenance(), line 378: [pool www] currently 0 active children, 25 spare children, 25 running children. Spawning rate 1
[29-Jul-2023 21:26:33.663691] DEBUG: pid 15827, fpm_pctl_perform_idle_server_maintenance(), line 378: [pool www] currently 0 active children, 25 spare children, 25 running children. Spawning rate 1
[29-Jul-2023 21:26:34.664941] DEBUG: pid 15827, fpm_pctl_perform_idle_server_maintenance(), line 378: [pool www] currently 0 active children, 25 spare children, 25 running children. Spawning rate 1

PHP-FPM关键配置:

listen = /run/php/php7.2-fpm.sock
request_terminate_timeout = 30s

(之前用的TCP socket,今天改成unix sock后问题依旧)

启用慢日志后,日志指向VH2的sd.php里的curl_exec调用:

[29-Jul-2023 21:23:24.315874]  [pool www] pid 15840
script_filename = /var/www/html/vh2/public/sd.php
[0x00007fb8fd21e110] curl_exec() (...)/vh1/public/sd.php:18

VH2的Nginx日志

VH2的Nginx即使开了debug级别的错误日志,出现504时也没有任何记录。

Nginx相关配置

我已经调大了超时时间,但问题依旧:

client_body_timeout   10;
client_header_timeout 10;
keepalive_timeout     30;
send_timeout          10;
proxy_connect_timeout  600s;
proxy_send_timeout  600s;
proxy_read_timeout  600s;
fastcgi_send_timeout 600s;
fastcgi_read_timeout 600s;
fastcgi_temp_file_write_size 256k;

压力测试结果

用ab工具直接压VH1的接口,不管是本地还是远程,结果都完全正常,没有失败请求:

tio ~  $ ab -n 5000 -c 100 https://example.com/sd.php
This is ApacheBench, Version 2.3 <$Revision: 1879490 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking example.com (be patient)
Completed 500 requests
Completed 1000 requests
Completed 1500 requests
Completed 2000 requests
Completed 2500 requests
Completed 3000 requests
Completed 3500 requests
Completed 4000 requests
Completed 4500 requests
Completed 5000 requests
Finished 5000 requests

Server Software:        pp7-2
Server Hostname:        example.com
Server Port:            443
SSL/TLS Protocol:       TLSv1.2,ECDHE-RSA-AES256-GCM-SHA384,2048,256
Server Temp Key:        X25519 253 bits
TLS Server Name:        example.com
Document Path:          /sd.php
Document Length:        5 bytes

Concurrency Level:      100
Time taken for tests:   9.910 seconds
Complete requests:      5000
Failed requests:        0
Total transferred:      1940000 bytes
HTML transferred:       25000 bytes
Requests per second:    504.54 [#/sec] (mean)
Time per request:       198.200 [ms] (mean)
Time per request:       1.982 [ms] (mean, across all concurrent requests)
Transfer rate:          191.17 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:      113  136  23.5    133     452
Processing:    41   57  20.3     52     425
Waiting:       41   56  19.5     52     424
Total:        161  192  34.0    186     594

Percentage of the requests served within a certain time (ms)
  50%    186
  66%    192
  75%    196
  80%    199
  90%    208
  95%    219
  98%    261
  99%    334
 100%    594 (longest request)

进一步测试:用stream_socket替代curl

我换了一段用stream_socket发起请求的代码(sd2.php),结果还是偶发504:

<?php
function connect($host, $port, $timeout = 1) {
    $conn_str = "tcp://{$host}:{$port}";
    $opts = STREAM_CLIENT_CONNECT | STREAM_CLIENT_ASYNC_CONNECT | STREAM_CLIENT_PERSISTENT;
    $sock = stream_socket_client($conn_str, $errno, $errstr, $timeout, $opts);
    return $sock;
}
$sock = connect("google.com", 80);
$req = "GET / HTTP/1.0\r\nHost: www.google.com\r\nAccept: */*\r\n\r\n";
$len = fwrite($sock, $req);
$data = stream_get_contents($sock);
echo $data;

此时VH2的Nginx错误日志:

2023/07/29 21:52:34 [error] 27138#27138: *35680 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 95.94.77.55, server: example1.com, request: "GET /sd2.php HTTP/1.1", upstream: "fastcgi://unix:/run/php/php7.2-fpm.sock:", host: "www.example1.com"
2023/07/29 21:53:12 [info] 27138#27138: *35925 epoll_wait() reported that client prematurely closed connection, so upstream connection is closed too (104: Connection reset by peer) while sending request to upstream, client: 95.94.77.55, server: example1.com, request: "GET /sd2.php HTTP/1.1", upstream: "fastcgi://unix:/run/php/php7.2-fpm.sock:", host: "www.example1.com"
2023/07/29 21:54:14 [error] 27138#27138: *36093 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 95.94.77.55, server: example1.com, request: "GET /sd2.php HTTP/1.1", upstream: "fastcgi://unix:/run/php/php7.2-fpm.sock:", host: "www.example1.com"

对应的PHP-FPM日志显示进程超时被终止:

[29-Jul-2023 21:53:54.440363] DEBUG: pid 15827, fpm_pctl_perform_idle_server_maintenance(), line 378: [pool www] currently 2 active children, 27 spare children, 29 running children. Spawning rate 1
[29-Jul-2023 21:53:54.510612] WARNING: pid 15827, fpm_request_check_timed_out(), line 278: [pool www] child 16562, script '/var/www/html/vh2//public/sd2.php' (request: "GET /sd2.php") execution timed out (42.699082 sec), terminating
[29-Jul-2023 21:53:54.512034] DEBUG: pid 15827, fpm_got_signal(), line 75: received SIGCHLD
[29-Jul-2023 21:53:54.512087] WARNING: pid 15827, fpm_children_bury(), line 256: [pool www] child 16562 exited on signal 15 (SIGTERM) after 702.886545 seconds from start
[29-Jul-2023 21:53:54.512861] NOTICE: pid 15827, fpm_children_make(), line 425: [pool www] child 17005 started
[29-Jul-2023 21:53:54.512895] DEBUG: pid 15827, fpm_event_loop(), line 423: event module triggered 1 events
[29-Jul-2023 21:53:55.441977] DEBUG: pid 15827, fpm_pctl_perform_idle_server_maintenance(), line 378: [pool www] currently 1 active children, 28 spare children, 29 running children. Spawning rate 1

我的困惑

现在完全搞不懂哪里出问题了,感觉像是curl/php-fpm/nginx之间有什么奇怪的交互bug?有没有大佬遇到过类似的情况,或者能给点排查方向?

备注:内容来源于stack exchange,提问作者Tio

火山引擎 最新活动