Docker Swarm: service name resolves to the virtual IP but is unreachable, while the task IP works fine
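This question was flagged as not a good fit for StackOverflow because it is more of a networking question. If there is a better place to ask, please let me know.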
TL;DR
In short: from inside a service container in the same stack as the nginx service:

    nslookup nginx          # resolves to the nginx service's virtual IP
    nslookup tasks.nginx    # resolves to the nginx container's actual IP (10.0.17.20)
    ping 10.0.17.20         # works
    ping nginx              # fails
    curl http://10.0.17.20  # works
    curl http://nginx       # fails
However, curl http://tasks.nginx resolves and connects without problems. My nodes are all LXC containers; the issue does not occur on Digital Ocean VMs.
Details
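In my Docker Swarm environment, containers cannot reach other services by service name, even though the name itself resolves. Here is my stack configuration: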
    version: "3.9"

    networks:
      elk7:
        name: elk7
        driver: overlay
        attachable: true
        ipam:
          driver: default
          config:
            - subnet: "10.0.17.0/24"

    services:
      setup:
        ...
        networks:
          - elk7
      es01: # deployed on manager node 1
        ...
        networks:
          - elk7
      # ... other services such as es02/es03
      nginx: # deployed on manager node 2
        ...
        networks:
          - elk7
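As a quick sanity check on the addressing, the service VIP (10.0.17.16), the es01 task (10.0.17.17) and the nginx task (10.0.17.20) that appear in this post should all sit inside the overlay subnet declared in the stack file. The sketch below verifies that with plain shell arithmetic; `ip_to_int` and `ip_in_cidr` are hypothetical helper names, and the code assumes a shell with 64-bit arithmetic and octets written without leading zeros.

```shell
# ip_to_int: convert a dotted-quad IPv4 address to a 32-bit integer.
# Runs in a subshell so the IFS change does not leak to the caller.
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# ip_in_cidr IP CIDR: succeed if IP lies inside CIDR (e.g. 10.0.17.0/24).
ip_in_cidr() {
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  bits=${2#*/}
  # Build the netmask from the prefix length, truncated to 32 bits.
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( ip & mask )) -eq $(( net & mask )) ]
}

for addr in 10.0.17.16 10.0.17.17 10.0.17.20; do
  if ip_in_cidr "$addr" "10.0.17.0/24"; then
    echo "$addr is inside 10.0.17.0/24"
  else
    echo "$addr is outside 10.0.17.0/24"
  fi
done
```

All three addresses being on the same subnet suggests the traffic never has to be routed out of the overlay network, so the difference between the VIP and the task IP is probably not a plain routing problem.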
Manager Node 1
docker network inspect elk7 shows the es01 service's container on this node (I assume only containers on the current node are listed?):

    "9c1a019a5c83c466615819b5401bbb0e58c31f078a96f13ed4af3905c837d565": {
        "Name": "elk7_es01.1.meux9ctcmnfwdejiqmxftyeq8",
        "EndpointID": "e0b102827e93eb3d0439513778c308eeb1201cd1e8e252f1361692c2f9981cc5",
        "MacAddress": "02:42:0a:00:11:11",
        "IPv4Address": "10.0.17.17/24",
        "IPv6Address": ""
    },
The IPAM section, for reference:

    "IPAM": {
        "Driver": "default",
        "Options": null,
        "Config": [
            {
                "Subnet": "10.0.17.0/24",
                "Gateway": "10.0.17.1"
            }
        ]
    },
Manager Node 2
After entering the Nginx container (docker container exec -it <container id> bash), I cannot reach es01 by service name, but I can reach it by IP address:
Pinging the service name:

    root@2d6f42945a18:/# ping es01
    PING es01 (10.0.17.16) 56(84) bytes of data.
    From 2d6f42945a18 (10.0.17.20) icmp_seq=1 Destination Host Unreachable
    From 2d6f42945a18 (10.0.17.20) icmp_seq=2 Destination Host Unreachable
    From 2d6f42945a18 (10.0.17.20) icmp_seq=3 Destination Host Unreachable

    --- es01 ping statistics ---
    5 packets transmitted, 0 received, +3 errors, 100% packet loss, time 4090ms
Pinging the IP address:

    root@2d6f42945a18:/# ping 10.0.17.17
    PING 10.0.17.17 (10.0.17.17) 56(84) bytes of data.
    64 bytes from 10.0.17.17: icmp_seq=1 ttl=64 time=0.220 ms
    64 bytes from 10.0.17.17: icmp_seq=2 ttl=64 time=0.128 ms

    --- 10.0.17.17 ping statistics ---
    2 packets transmitted, 2 received, 0% packet loss, time 1016ms
    rtt min/avg/max/mdev = 0.128/0.174/0.220/0.046 ms
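The address pinged here, 10.0.17.17, is the es01 task endpoint listed in the network inspect output on Manager Node 1. As a cross-check, Docker has traditionally derived a container's MAC address from its IPv4 address (the `02:42` prefix followed by the four IP octets in hex), so the MAC in that output can be decoded back to an IP. A minimal sketch, assuming that derivation applies to the Docker version in use; `mac_to_ip` is a hypothetical helper name:

```shell
# mac_to_ip: decode the last four octets of a MAC address as an IPv4 address.
# Assumes the MAC was derived from the IP (02:42 + four IP bytes); a randomly
# assigned MAC would decode to a meaningless address.
# The function body is a subshell so the IFS change stays local.
mac_to_ip() (
  IFS=:
  # Split the MAC on ':' into positional parameters $1..$6.
  set -- $1
  # printf accepts 0x-prefixed hex constants for %d.
  printf '%d.%d.%d.%d\n' "0x$3" "0x$4" "0x$5" "0x$6"
)

mac_to_ip "02:42:0a:00:11:11"   # prints 10.0.17.17
```

The decoded address matches the IPv4Address field of the es01 endpoint, which is consistent with 10.0.17.17 being the task (container) IP rather than the service VIP.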
Both nslookup and dig return an answer, so the service name es01 does resolve:
nslookup output:

    root@2d6f42945a18:/# nslookup es01
    Server: 127.0.0.11
    Address: 127.0.0.11#53

    Non-authoritative answer:
    Name: es01
    Address: 10.0.17.16
dig output:

    root@2d6f42945a18:/# dig es01

    ; <<>> DiG 9.18.24-1-Debian <<>> es01
    ;; global options: +cmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 17322
    ;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

    ;; QUESTION SECTION:
    ;es01. IN A

    ;; ANSWER SECTION:
    es01. 600 IN A 10.0.17.16

    ;; Query time: 0 msec
    ;; SERVER: 127.0.0.11#53(127.0.0.11) (UDP)
    ;; WHEN: Tue Mar 26 07:10:16 UTC 2024
    ;; MSG SIZE rcvd: 42
10.0.17.16 is the virtual IP of the es01 service, as docker service inspect elk7_es01 shows:

    ...
    "Endpoint": {
        "Spec": {
            "Mode": "vip"
        },
        "VirtualIPs": [
            {
                "NetworkID": "rdtgyz97aahhsrlwz8u2mm8fy",
                "Addr": "10.0.17.16/24"
            }
        ]
    }
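To make the comparison between the DNS answer and the service VIP less error-prone than reading it by eye, it can be scripted. The sketch below assumes BIND-style nslookup output (as in this post; BusyBox prints a different layout) and uses two hypothetical helpers, `nslookup_answers` and `strip_prefix_len`, fed with the sample values copied from the outputs above.

```shell
# nslookup_answers: read BIND-style nslookup output on stdin and print only
# the answer addresses (the "Address" lines that follow a "Name:" line),
# skipping the resolver's own address at the top.
nslookup_answers() {
  awk '/^Name:/ { seen = 1; next } seen && /^Address/ { print $NF }'
}

# strip_prefix_len: turn "10.0.17.16/24" into "10.0.17.16".
strip_prefix_len() {
  printf '%s\n' "${1%/*}"
}

# Sample data copied from the nslookup and service inspect outputs.
dns_ip=$(nslookup_answers <<'EOF'
Server: 127.0.0.11
Address: 127.0.0.11#53

Non-authoritative answer:
Name: es01
Address: 10.0.17.16
EOF
)
vip=$(strip_prefix_len "10.0.17.16/24")

if [ "$dns_ip" = "$vip" ]; then
  echo "es01 resolves to the service VIP ($vip)"
else
  echo "es01 resolves to $dns_ip, which is not the VIP ($vip)"
fi
```

Inside a container the live answer could be obtained with `nslookup es01 | nslookup_answers`, and on a manager node the VIP can be read with `docker service inspect --format '{{range .Endpoint.VirtualIPs}}{{.Addr}} {{end}}' elk7_es01`.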
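What I don't understand is why the service's task (container) cannot be reached by service name, when the name clearly resolves to the service's virtual IP inside the container. Where might the problem be? My Swarm nodes are all Ubuntu 20.04 LXC containers provisioned through Proxmox 7.0.11, and the hosts are on 10.8.66.0/24 (not sure whether that matters).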
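Since the nodes are LXC containers, one thing that may be worth ruling out (a guess, not a confirmed cause): Swarm's VIP mode load-balances through the kernel's IPVS, and an LXC container shares the Proxmox host's kernel and generally cannot load kernel modules on its own. The sketch below checks whether a module is listed in `/proc/modules`; `module_loaded` is a hypothetical helper, the sample lines are illustrative rather than taken from these nodes, and a kernel with IPVS built in (not as a module) would not show up in that file at all.

```shell
# module_loaded NAME [FILE]: succeed if NAME is listed as a loaded kernel
# module. FILE defaults to /proc/modules; it is a parameter only so the
# check can also be run against saved output.
module_loaded() {
  awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}

# Example against a saved sample (illustrative lines, not from the post).
sample=$(mktemp)
cat > "$sample" <<'EOF'
ip_vs_rr 16384 0 - Live 0x0000000000000000
ip_vs 155648 2 ip_vs_rr, Live 0x0000000000000000
EOF

if module_loaded ip_vs "$sample"; then
  echo "ip_vs is loaded"
else
  echo "ip_vs is not listed"
fi
rm -f "$sample"
```

On a Swarm node the check would simply be `module_loaded ip_vs`. If it is not listed, loading it on the Proxmox host (for example with `modprobe ip_vs`) and retrying the ping by service name would show whether this is related.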
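I have looked at some similar questions, but the suggested fixes did not work in my case. One of them mentioned that nslookup tasks.es01 explicitly resolves the DNS entries of the service's tasks; I tried it, and it does return the correct container IP:

    root@2d6f42945a18:/# nslookup tasks.es01
    Server: 127.0.0.11
    Address: 127.0.0.11#53

    Non-authoritative answer:
    Name: tasks.es01
    Address: 10.0.17.17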
Note: content sourced from Stack Exchange; question asked by Zach Smith.




