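Help troubleshooting a connection timeout between services in the same Kubernetes namespace
I deployed two applications: config-server and alpha-app, the business-logic application, which has to fetch its configuration data from config-server. Both run in the kubernetes-learning namespace of a local Kubernetes environment, but alpha-app gets a connect timeout when it connects to config-server, which in turn makes its livenessProbe and readinessProbe fail. The YAML manifests of both applications and the exception stack trace follow.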
config-server.yaml configuration
# Config server
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kubernetes-learning-config-server
  namespace: kubernetes-learning
  labels:
    app: kubernetes-learning-config-server
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
  selector:
    matchLabels:
      app: kubernetes-learning-config-server
  template:
    metadata:
      name: kubernetes-learning-config-server
      labels:
        app: kubernetes-learning-config-server
    spec:
      containers:
        - name: kubernetes-learning-config-server
          image: ghcr.io/kubernetes/learning.config-server
          imagePullPolicy: Always
          ports:
            - containerPort: 8888
              protocol: TCP
            - containerPort: 48888
              protocol: TCP
          env:
            - name: BPL_JVM_THREAD_COUNT
              value: "50"
            - name: BPL_DEBUG_ENABLED
              value: "true"
            - name: BPL_DEBUG_PORT
              value: "48888"
            - name: GITHUB_CONFIG_DATA_URL
              value: https://github.com/kubernetes/config-data
            - name: GITHUB_CONFIG_DATA_USERNAME
              value: github_user
            - name: GITHUB_CONFIG_DATA_PERSONAL_ACCESS_TOKEN
              value: github_sampletoken
          livenessProbe:
            httpGet:
              path: /alpha-app/local
              port: 8888
            initialDelaySeconds: 30
            periodSeconds: 20
            timeoutSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /alpha-app/local
              port: 8888
            initialDelaySeconds: 30
            periodSeconds: 20
            timeoutSeconds: 10
            successThreshold: 1
            failureThreshold: 3
      restartPolicy: Always
# Expose Config server
---
apiVersion: v1
kind: Service
metadata:
  name: kubernetes-learning-config-server
  labels:
    app: kubernetes-learning-config-server
spec:
  type: ClusterIP
  selector:
    app: kubernetes-learning-config-server
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8888
alpha-app.yaml configuration
# app applications
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kubernetes-learning-app
  namespace: kubernetes-learning
  labels:
    app: kubernetes-learning-app
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
  selector:
    matchLabels:
      app: kubernetes-learning-app
  template:
    metadata:
      name: kubernetes-learning-app
      labels:
        app: kubernetes-learning-app
    spec:
      containers:
        - name: kubernetes-learning-alpha-app
          image: ghcr.io/kubernetes/learning.alpha-app
          imagePullPolicy: Always
          ports:
            - containerPort: 8441
              protocol: TCP
            - containerPort: 48441
              protocol: TCP
          env:
            - name: BPL_JVM_THREAD_COUNT
              value: "50"
            - name: BPL_DEBUG_ENABLED
              value: "true"
            - name: BPL_DEBUG_PORT
              value: "48441"
            - name: SPRING_PROFILES_ACTIVE
              value: kube
            - name: SPRING_CLOUD_CONFIG_FAIL_FAST
              value: "true"
            - name: SPRING_CLOUD_CONFIG_RETRY_INITIAL_INTERVAL
              value: "1000"
            - name: SPRING_CLOUD_CONFIG_RETRY_MAX_INTERVAL
              value: "10000"
            - name: SPRING_CLOUD_CONFIG_RETRY_MULTIPLIER
              value: "2"
            - name: SPRING_CLOUD_CONFIG_RETRY_MAX_ATTEMPTS
              value: "5"
            - name: SPRING_CLOUD_CONFIG_URI
              value: http://kubernetes-learning-config-server:8888
          livenessProbe:
            httpGet:
              path: /info
              port: 8441
            initialDelaySeconds: 60
            timeoutSeconds: 15
            periodSeconds: 30
            successThreshold: 1
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /info
              port: 8441
            initialDelaySeconds: 60
            timeoutSeconds: 15
            periodSeconds: 30
            successThreshold: 1
            failureThreshold: 3
      restartPolicy: Always
# Expose Config server
---
apiVersion: v1
kind: Service
metadata:
  name: kubernetes-learning-app
  labels:
    app: kubernetes-learning-app
spec:
  type: ClusterIP
  selector:
    app: kubernetes-learning-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8441
Exception stack trace
Caused by: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://kubernetes-learning-config-server:8888/alpha-app/kube": Connect timed out
    at org.springframework.web.client.RestTemplate.createResourceAccessException(RestTemplate.java:926) ~[spring-web-6.2.1.jar:6.2.1]
    at org.springframework.web.client.RestTemplate.doExecute(RestTemplate.java:906) ~[spring-web-6.2.1.jar:6.2.1]
    at org.springframework.web.client.RestTemplate.execute(RestTemplate.java:801) ~[spring-web-6.2.1.jar:6.2.1]
    at org.springframework.web.client.RestTemplate.exchange(RestTemplate.java:683) ~[spring-web-6.2.1.jar:6.2.1]
    at org.springframework.cloud.config.client.ConfigServerConfigDataLoader.getRemoteEnvironment(ConfigServerConfigDataLoader.java:349) ~[spring-cloud-config-client-4.2.0.jar:4.2.0]
    at org.springframework.cloud.config.client.ConfigServerConfigDataLoader.doLoad(ConfigServerConfigDataLoader.java:130) ~[spring-cloud-config-client-4.2.0.jar:4.2.0]
    ... 37 common frames omitted
Caused by: java.net.SocketTimeoutException: Connect timed out
    at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(Unknown Source) ~[na:na]
    at java.base/sun.nio.ch.NioSocketImpl.connect(Unknown Source) ~[na:na]
    at java.base/java.net.Socket.connect(Unknown Source) ~[na:na]
    at java.base/sun.net.NetworkClient.doConnect(Unknown Source) ~[na:na]
    at java.base/sun.net.www.http.HttpClient.openServer(Unknown Source) ~[na:na]
    at java.base/sun.net.www.http.HttpClient.openServer(Unknown Source) ~[na:na]
    at java.base/sun.net.www.http.HttpClient.<init>(Unknown Source) ~[na:na]
    at java.base/sun.net.www.http.HttpClient.New(Unknown Source) ~[na:na]
    at java.base/sun.net.www.http.HttpClient.New(Unknown Source) ~[na:na]
    at java.base/sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(Unknown Source) ~[na:na]
    at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect0(Unknown Source) ~[na:na]
    at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect(Unknown Source) ~[na:na]
    at java.base/sun.net.www.protocol.http.HttpURLConnection.connect(Unknown Source) ~[na:na]
    at org.springframework.http.client.SimpleClientHttpRequest.executeInternal(SimpleClientHttpRequest.java:79) ~[spring-web-6.2.1.jar:6.2.1]
    at org.springframework.http.client.AbstractStreamingClientHttpRequest.executeInternal(AbstractStreamingClientHttpRequest.java:71) ~[spring-web-6.2.1.jar:6.2.1]
    at org.springframework.http.client.AbstractClientHttpRequest.execute(AbstractClientHttpRequest.java:81) ~[spring-web-6.2.1.jar:6.2.1]
    at org.springframework.web.client.RestTemplate.doExecute(RestTemplate.java:900) ~[spring-web-6.2.1.jar:6.2.1]
    ... 41 common frames omitted
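Troubleshooting and fixes
1. Service port mismatch
alpha-app's SPRING_CLOUD_CONFIG_URI environment variable is set to http://kubernetes-learning-config-server:8888, but the config-server Service exposes port 80 inside the cluster, not the container port 8888. In a Kubernetes Service, the port field is the port clients use inside the cluster, while targetPort is the container port that traffic is forwarded to. Nothing is mapped on port 8888 of the Service, so the connection attempt times out.
Fix: change alpha-app's SPRING_CLOUD_CONFIG_URI to http://kubernetes-learning-config-server:80, or change the port field of the config-server Service to 8888 so that it matches the container port.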
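As an illustration of the second option, the Service could be rewritten roughly as follows. This is a sketch based on the manifest in the question, not a verified fix; the `namespace` line is an assumption, since the original Service manifest does not set it and may rely on `kubectl apply -n`.

```yaml
# Sketch: expose the Service on the port that alpha-app already uses.
apiVersion: v1
kind: Service
metadata:
  name: kubernetes-learning-config-server
  namespace: kubernetes-learning   # assumption: not present in the original Service manifest
  labels:
    app: kubernetes-learning-config-server
spec:
  type: ClusterIP
  selector:
    app: kubernetes-learning-config-server
  ports:
    - protocol: TCP
      port: 8888        # port clients use inside the cluster
      targetPort: 8888  # container port of config-server
```

With this change the existing SPRING_CLOUD_CONFIG_URI value can stay as it is. If the Service keeps port 80 instead, only the URI in alpha-app.yaml needs to change.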
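2. Verify the config-server probe path
The liveness and readiness probes of config-server use the path /alpha-app/local, whereas alpha-app requests /alpha-app/kube, the path for the kube profile. Confirm that config-server actually responds on /alpha-app/local. If that path does not exist, config-server's own probes fail and its pods cannot serve traffic.
How to verify: open a shell in any config-server container, run curl http://localhost:8888/alpha-app/local, and check that it returns valid configuration data. If the path is wrong, change the probe path to a configuration endpoint that exists.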
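If that check shows /alpha-app/local is not a dependable endpoint, one alternative is to probe a health endpoint instead of a configuration path. The fragment below is only a sketch: it assumes Spring Boot Actuator is on the config-server classpath and exposed at its default /actuator/health path, which the manifests in the question do not confirm.

```yaml
# Hypothetical probe settings for the config-server container.
# Assumes Spring Boot Actuator serves /actuator/health on port 8888.
livenessProbe:
  httpGet:
    path: /actuator/health
    port: 8888
  initialDelaySeconds: 30
  periodSeconds: 20
  timeoutSeconds: 10
  successThreshold: 1
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /actuator/health
    port: 8888
  initialDelaySeconds: 30
  periodSeconds: 20
  timeoutSeconds: 10
  successThreshold: 1
  failureThreshold: 3
```

Probing a configuration path ties the pod's health to the Git backend being reachable, so a health endpoint may give a more stable signal. That trade-off depends on what the probes are meant to detect.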
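3. Manually verify network connectivity
If the problem persists after the fixes above, run the following commands inside the alpha-app container to check the network:
- Resolve the Service name: nslookup kubernetes-learning-config-server, and confirm that it resolves to the ClusterIP of the Service.
- Test port connectivity: telnet kubernetes-learning-config-server 80 (or the port you changed it to), and confirm that the connection is established.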
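The application image may not ship nslookup or telnet. In that case the same checks can be run from a throwaway Pod in the same namespace. The manifest below is a sketch under assumptions: the Pod name net-debug and the busybox:1.36 image are illustrative choices, and wget stands in for telnet.

```yaml
# Hypothetical one-shot debug Pod; name and image are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: net-debug
  namespace: kubernetes-learning
spec:
  restartPolicy: Never
  containers:
    - name: net-debug
      image: busybox:1.36
      command:
        - sh
        - -c
        - |
          # Resolve the Service name; the answer should be the Service's ClusterIP.
          nslookup kubernetes-learning-config-server
          # Request the config endpoint through the Service port (80 as declared
          # in the Service manifest), giving up after 5 seconds.
          wget -q -T 5 -O - http://kubernetes-learning-config-server:80/alpha-app/kube
```

The results appear in the Pod's logs. If the name does not resolve at all, it is worth checking which namespace the Service was created in, because the Service manifests in the question do not set metadata.namespace.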
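4. Startup order and retry tuning
alpha-app sets SPRING_CLOUD_CONFIG_RETRY_MAX_ATTEMPTS to 5. With an initial interval of 1 second and a multiplier of 2, the waits between the five attempts add up to about 15 seconds (1 + 2 + 4 + 8, each below the 10-second maximum interval), plus the connect timeout spent on every failed attempt. The probes' initial delay of 60 seconds is in theory long enough to cover this, but if config-server starts too slowly, alpha-app can still exhaust its retries and, because SPRING_CLOUD_CONFIG_FAIL_FAST is true, fail at startup with a connect timeout. Consider raising the retry count (for example to 10) or lengthening the probes' initial delay (for example to 90 seconds).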
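The suggested values could be applied to the alpha-app container roughly as shown below. This is a sketch that lists only the changed entries. Assuming the client backs off exponentially and caps each wait at the maximum interval, ten attempts would wait about 1 + 2 + 4 + 8 + 5 × 10 = 65 seconds in total between attempts.

```yaml
# Sketch: changed entries only; everything else stays as in alpha-app.yaml.
env:
  - name: SPRING_CLOUD_CONFIG_RETRY_MAX_ATTEMPTS
    value: "10"              # was "5"
livenessProbe:
  httpGet:
    path: /info
    port: 8441
  initialDelaySeconds: 90    # was 60
  timeoutSeconds: 15
  periodSeconds: 30
  successThreshold: 1
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /info
    port: 8441
  initialDelaySeconds: 90    # was 60
  timeoutSeconds: 15
  periodSeconds: 30
  successThreshold: 1
  failureThreshold: 3
```

Neither setting helps while the Service port and the URI still disagree, so they are worth tuning only after the port mismatch in step 1 is resolved.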
The question comes from Stack Exchange; it was asked by JMD.




