KV node errors after upgrading from 7.1.4 to 7.1.5

【TiDB Deployment Environment】Testing
【TiDB Version】7.1.5
【Reproduction Path】Upgrade from 7.1.4 to 7.1.5
【Problem Encountered: Symptoms and Impact】
A KV node reports errors while the cluster status still shows as normal:


However, when checking the running threads, they appear to be stuck committing the whole time:

Then I checked the log of one of the KV nodes:

> [2024/06/11 04:04:01.635 -04:00] [INFO] [<unknown>] ["Subchannel 0x7f08e136c580: Retry in 1000 milliseconds"]
> [2024/06/11 04:04:01.635 -04:00] [WARN] [raft_client.rs:296] ["RPC batch_raft fail"] [err="Some(RpcFailure(RpcStatus { status: 14-UNAVAILABLE, details: Some(\"failed to connect to all addresses\") }))"] [sink_err="Some(RpcFinished(Some(RpcStatus { status: 14-UNAVAILABLE, details: Some(\"failed to connect to all addresses\") })))"] [to_addr=192.168.0.241:20161]
> [2024/06/11 04:04:01.638 -04:00] [WARN] [raft_client.rs:199] ["send to 192.168.0.241:20161 fail, the gRPC connection could be broken"]
> [2024/06/11 04:04:01.638 -04:00] [ERROR] [transport.rs:163] ["send raft msg err"] [err="Other(\"[src/server/raft_client.rs:208]: RaftClient send fail\")"]
> [2024/06/11 04:04:01.638 -04:00] [INFO] [transport.rs:144] ["resolve store address ok"] [addr=192.168.0.241:20161] [store_id=1]
> [2024/06/11 04:04:01.638 -04:00] [INFO] [raft_client.rs:48] ["server: new connection with tikv endpoint"] [addr=192.168.0.241:20161]
> [2024/06/11 04:04:01.638 -04:00] [INFO] [<unknown>] ["Connect failed: {\"created\":\"@1718093041.638676386\",\"description\":\"Failed to connect to remote host: Connection refused\",\"errno\":111,\"file\":\"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.5.3/grpc/src/core/lib/iomgr/tcp_client_posix.cc\",\"file_line\":200,\"os_error\":\"Connection refused\",\"syscall\":\"connect\",\"target_address\":\"ipv4:192.168.0.241:20161\"}"]

The log is full of this same error and is written extremely fast; there are already several hundred GB of logs.
Checking the log on the .241 host:

[2024/06/11 04:04:59.416 -04:00] [INFO] [apply.rs:1699] ["execute admin command"] [command="cmd_type: ChangePeerV2 change_peer_v2 { changes { peer { id: 144727509 store_id: 1 } } changes { change_type: AddLearnerNode peer { id: 144053190 store_id: 874002 role: Learner } } }"] [index=147] [term=85] [peer_id=144558754] [region_id=144053189]
[2024/06/11 04:04:59.416 -04:00] [INFO] [apply.rs:2292] ["exec ConfChangeV2"] [epoch="conf_ver: 96 version: 445"] [kind=EnterJoint] [peer_id=144558754] [region_id=144053189]
[2024/06/11 04:04:59.416 -04:00] [INFO] [apply.rs:2473] ["conf change successfully"] ["current region"="id: 144053189 start_key: 7480000000000001FFED5F698000000000FF00000A0380000000FF00001D5303800000FF000000005E0419A5FFED17030000000380FF000000058B0CF600FE end_key: 7480000000000001FFED5F698000000000FF00000A0380000000FF00001D7503800000FF000000005E0419A7FF1B47030000000380FF00000003181D2300FE region_epoch { conf_ver: 98 version: 445 } peers { id: 144053190 store_id: 874002 role: DemotingVoter } peers { id: 144558754 store_id: 887001 } peers { id: 144718902 store_id: 4 } peers { id: 144727509 store_id: 1 role: IncomingVoter }"] ["original region"="id: 144053189 start_key: 7480000000000001FFED5F698000000000FF00000A0380000000FF00001D5303800000FF000000005E0419A5FFED17030000000380FF000000058B0CF600FE end_key: 7480000000000001FFED5F698000000000FF00000A0380000000FF00001D7503800000FF000000005E0419A7FF1B47030000000380FF00000003181D2300FE region_epoch { conf_ver: 96 version: 445 } peers { id: 144053190 store_id: 874002 } peers { id: 144558754 store_id: 887001 } peers { id: 144718902 store_id: 4 } peers { id: 144727509 store_id: 1 role: Learner }"] [changes="[peer { id: 144727509 store_id: 1 }, change_type: AddLearnerNode peer { id: 144053190 store_id: 874002 role: Learner }]"] [peer_id=144558754] [region_id=144053189]
[2024/06/11 04:04:59.417 -04:00] [INFO] [raft.rs:2660] ["switched to configuration"] [config="Configuration { voters: Configuration { incoming: Configuration { voters: {144727509, 144558754, 144718902} }, outgoing: Configuration { voters: {144558754, 144718902, 144053190} } }, learners: {}, learners_next: {144053190}, auto_leave: false }"] [raft_id=144558754] [region_id=144053189]
[2024/06/11 04:04:59.419 -04:00] [INFO] [apply.rs:1699] ["execute admin command"] [command="cmd_type: ChangePeerV2 change_peer_v2 {}"] [index=148] [term=85] [peer_id=144558754] [region_id=144053189]
[2024/06/11 04:04:59.419 -04:00] [INFO] [apply.rs:2292] ["exec ConfChangeV2"] [epoch="conf_ver: 98 version: 445"] [kind=LeaveJoint] [peer_id=144558754] [region_id=144053189]
[2024/06/11 04:04:59.419 -04:00] [INFO] [apply.rs:2503] ["leave joint state successfully"] [region="id: 144053189 start_key: 7480000000000001FFED5F698000000000FF00000A0380000000FF00001D5303800000FF000000005E0419A5FFED17030000000380FF000000058B0CF600FE end_key: 7480000000000001FFED5F698000000000FF00000A0380000000FF00001D7503800000FF000000005E0419A7FF1B47030000000380FF00000003181D2300FE region_epoch { conf_ver: 100 version: 445 } peers { id: 144053190 store_id: 874002 role: Learner } peers { id: 144558754 store_id: 887001 } peers { id: 144718902 store_id: 4 } peers { id: 144727509 store_id: 1 }"] [peer_id=144558754] [region_id=144053189]

These logs don't look like much of a problem to me. How can this be resolved now? I have already restarted once, and it is still the same.
What is strange is that my configuration files all use port 20160, so why is it trying to access port 20161?

Take a look at that TiKV node's status and logs.

There is no 192.168.0.241:20161, only 192.168.0.241:20160. The cluster view also shows 192.168.0.241:20160, so I have no idea where 192.168.0.241:20161 comes from.

Have you done any scaling in or out before? Check with pd-ctl store.
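For reference, a minimal sketch of that check (assuming the cluster is managed with tiup, and using the PD address 192.168.0.246:2379 that appears in the ps output below; adjust to your environment):

# list every store PD has registered, with its address, state and region counts
tiup ctl:v7.1.5 pd -u http://192.168.0.246:2379 store

If 192.168.0.241:20161 shows up in this output, then PD itself has a store registered on that port.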

No scaling was done; we have only installed another cluster on these machines before:

What is the full output of that? Run ps -ef|grep tikv to see the processes on this node.

There is also information about other nodes, but this node has only this one store.

[root@node2 tikv-20160]#  ps -ef|grep tikv
tidb      1757     1 41 02:52 ?        00:41:56 bin/tikv-server --addr 0.0.0.0:20160 --advertise-addr 192.168.0.241:20160 --status-addr 0.0.0.0:20180 --advertise-status-addr 192.168.0.241:20180 --pd 192.168.0.246:2379 --data-dir /home/tidb/Data/data/tikv-20160 --config conf/tikv.toml --log-file /home/tidb/Data/log/tikv-20160/tikv.log
root     27214   496  0 04:32 pts/0    00:00:00 grep --color=auto tikv

Found the cause: it is the other cluster. I have already stopped it, but that hung transaction still exists, so data still cannot be written.

This cluster's log contains information about a node from the other cluster, so your cluster metadata is probably already corrupted. Do the two clusters' PD nodes use the same IP and port?
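If both deployments were done with tiup from the same control machine, a quick way to compare their topologies is the following sketch (the cluster names are placeholders):

# every cluster managed by this tiup control machine
tiup cluster list
# full topology of each cluster, including PD and TiKV addresses and ports
tiup cluster display <cluster-a>
tiup cluster display <cluster-b>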

They are not the same; the ports are all staggered. Before the upgrade both clusters ran side by side, and they have been running together for several years.

Also, queries and the like all still work; only writes to one table are problematic and keep hanging, and right now I just don't know why it keeps getting stuck.
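To see what the hung writes are waiting on, you could check the lock-wait and transaction views from the TiDB side (a sketch; the TiDB host and the default port 4000 are assumptions, and these views exist in v7.1):

# pessimistic lock waits currently blocking DML
mysql -h <tidb-host> -P 4000 -u root -p -e "SELECT * FROM information_schema.data_lock_waits\G"
# long-running or stuck transactions across the cluster
mysql -h <tidb-host> -P 4000 -u root -p -e "SELECT * FROM information_schema.cluster_tidb_trx ORDER BY start_time\G"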

Aren't the log output directories kept separate?

They are separate.

If this TiKV recognized that other TiKV, it means the two sides got connected and the metadata is corrupted. When things still worked it was presumably effectively 1 replica; after that node joined it presumably became multi-replica, so simply shutting it down afterwards is, as I understand it, problematic. Either rebuild the cluster, or try Online Unsafe Recovery to remove the corresponding store: https://docs.pingcap.com/zh/tidb/stable/online-unsafe-recovery
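For reference, Online Unsafe Recovery is driven through pd-ctl; a minimal sketch (the store ID to drop is a placeholder, take it from pd-ctl store, and only do this after confirming that store's data can be abandoned):

# tell PD to forcibly recover Regions without the failed store(s)
tiup ctl:v7.1.5 pd -u http://192.168.0.246:2379 unsafe remove-failed-stores <store_id>
# check the progress of the recovery
tiup ctl:v7.1.5 pd -u http://192.168.0.246:2379 unsafe remove-failed-stores show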
