After upgrading the TiDB cluster, TiDB keeps reporting the error write: connection reset by peer

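【TiDB environment】Production / Test / PoC
【TiDB version】
【Reproduction path】What operations led to the problem
【Problem: symptoms and impact】
After upgrading the cluster from v4.0.9 to v5.4.3, the TiDB log is flooded with errors: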

[stack="github.com/pingcap/tidb/parser/terror.Log\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/parser/terror/terror.go:307\ngithub.com/pingcap/tidb/server.(*Server).onConn\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/server.go:516"]
[2023/04/21 11:06:12.400 +08:00] [ERROR] [terror.go:307] ["encountered error"] [error="write tcp 192.168.241.72:4000->192.168.241.55:21118: write: connection reset by peer"] [stack="github.com/pingcap/tidb/parser/terror.Log\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/parser/terror/terror.go:307\ngithub.com/pingcap/tidb/server.(*Server).onConn\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/server.go:516"]
[2023/04/21 11:06:12.444 +08:00] [ERROR] [terror.go:307] ["encountered error"] [error="write tcp 192.168.241.72:4000->192.168.241.55:21123: write: connection reset by peer"] [stack="github.com/pingcap/tidb/parser/terror.Log\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/parser/terror/terror.go:307\ngithub.com/pingcap/tidb/server.(*Server).onConn\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/server.go:516"]
[2023/04/21 11:06:12.507 +08:00] [ERROR] [terror.go:307] ["encountered error"] [error="write tcp 192.168.241.72:4000->192.168.241.54:40415: write: connection reset by peer"] [stack="github.com/pingcap/tidb/parser/terror.Log\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/parser/terror/terror.go:307\ngithub.com/pingcap/tidb/server.(*Server).onConn\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/server.go:516"]
[2023/04/21 11:06:12.519 +08:00] [ERROR] [terror.go:307] ["encountered error"] [error="write tcp 192.168.241.72:4000->192.168.241.54:40416: write: connection reset by peer"] [stack="github.com/pingcap/tidb/parser/terror.Log\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/parser/terror/terror.go:307\ngithub.com/pingcap/tidb/server.(*Server).onConn\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/server.go:516"]

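I searched AskTUG and found many threads about this problem. In most of them the fix was to turn off the load balancer's health check, but none seem to reach a final solution.
Someone on AskTUG suggested changing HAProxy's health-check port, but did not say how to do it, and the HAProxy configuration in the official documentation does not cover it either.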
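
One way to check whether the resets really come from the load balancer is to tally the client addresses in the error lines. The sketch below is only an illustration: the function name count_reset_peers and the regular expression are assumptions written against the log format quoted in this post, not anything shipped with TiDB.

```python
import re
import sys
from collections import Counter

# Captures the client IP in:
#   error="write tcp <tidb_ip>:4000-><client_ip>:<port>: write: connection reset by peer"
RESET_RE = re.compile(
    r"->(\d+\.\d+\.\d+\.\d+):\d+: write: connection reset by peer"
)


def count_reset_peers(lines):
    """Count 'connection reset by peer' write errors per client IP."""
    counts = Counter()
    for line in lines:
        match = RESET_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # The log path is supplied by the caller, e.g. the tidb.log of one TiDB node.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log_file:
        for ip, n in count_reset_peers(log_file).most_common():
            print(ip, n)
```

For the four complete log lines quoted above, this counts two resets from 192.168.241.55 and two from 192.168.241.54. The latter is the address that admin_stats binds to in the HAProxy configuration in this post, which is consistent with the health check being the source.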

I did not hit this problem on v4.0.9; it appeared after upgrading to v5.4.3. Is this a bug? Has it been fixed in v6.5.1? My target version is v6.5.1.

Below is my HAProxy configuration file, which follows the official configuration.

# cat /etc/haproxy/haproxy.cfg
global                                     
   log         127.0.0.1 local2            
   chroot      /var/lib/haproxy            
   pidfile     /var/run/haproxy.pid        
   maxconn     4000                        
   user        haproxy                     
   group       haproxy                     
   nbproc      10                          
   daemon                                  
   stats socket /var/lib/haproxy/stats     

defaults                                   
   log global                              
   retries 2                               
   timeout connect  2s                     
   timeout client 30000s                   
   timeout server 30000s                   

listen admin_stats                         
   bind 192.168.241.54:18080                       
   mode http                               
   option httplog                          
   maxconn 10                              
   stats refresh 30s                       
   stats uri /haproxy                      
   stats realm HAProxy                     
   stats auth admin:UXnxFu5Mxxxxxxxxxxxx
   stats hide-version                      
   stats  admin if TRUE                    

listen tidb-xxxxx
   bind 0.0.0.0:14000
   mode tcp                                
   balance leastconn                       
   server tidb-71 192.168.241.71:4000 send-proxy  check inter 2000 rise 2 fall 3
   server tidb-72 192.168.241.72:4000 send-proxy  check inter 2000 rise 2 fall 3
   server tidb-73 192.168.241.73:4000 send-proxy  check inter 2000 rise 2 fall 3

v4.0.9 can be upgraded directly to v6.5.1.

I know a direct upgrade is possible. Is this problem fixed in v6.5.1?

I checked my logs and this error does not appear.

An expert's reply in this thread solved my problem.
It reads:
Using the “port” parameter, it becomes possible to use a different port to send health-checks. On some servers, it may be desirable to dedicate a port to a specific component able to perform complex tests which are more suitable to health-checks than the application. It is common to run a simple script in inetd for instance. This parameter is ignored if the “check” parameter is not set. See also the “addr” parameter.

server tidb-1 192.168.0.1:4000 port 10080 check ??

My configuration change: add the port 10080 parameter to specify the health-check port

   server tidb-71 192.168.241.71:4000 send-proxy  check port 10080 inter 2000 rise 2 fall 3
   server tidb-72 192.168.241.72:4000 send-proxy  check port 10080 inter 2000 rise 2 fall 3
   server tidb-73 192.168.241.73:4000 send-proxy  check port 10080 inter 2000 rise 2 fall 3
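
Before reloading HAProxy with this change, it may be worth confirming that port 10080 (TiDB's default status port) is reachable from the HAProxy host, since an unreachable check port would mark every backend as down. The following is a minimal sketch that only opens and closes a TCP connection; the helper names tcp_probe and probe_all are hypothetical, and it does not reproduce HAProxy's rise/fall logic.

```python
import socket


def tcp_probe(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # Connect, then close immediately without sending any data.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def probe_all(hosts, port, timeout=2.0):
    """Probe the same port on several hosts; returns {host: reachable}."""
    return {host: tcp_probe(host, port, timeout) for host in hosts}
```

Run on the HAProxy host, probe_all(["192.168.241.71", "192.168.241.72", "192.168.241.73"], 10080) should report True for every TiDB server before the new check is relied on.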

HAProxy provides a way to specify a separate health-check port:
http://docs.haproxy.org/1.7/configuration.html#5.2-port