Some questions about TiKV high-availability testing

OS version: CentOS Linux release 7.3.1611 (Core)
TiDB version: 2.1.14
TiDB cluster server list:
192.168.144.199 (ansible + monitor)
10.111.9.248 (kv1)
10.111.22.39 (kv2)
10.111.9.251 (kv3)
10.111.22.49 (kv4)
10.111.10.7 (kv5)
10.111.9.247 (pd1)
10.111.22.40 (pd2)
10.111.9.253 (pd3)
10.111.9.254 (tidb1)
10.111.22.41 (tidb2)

Problem: after manually shutting down the tikv4 and tikv5 servers, the TiDB cluster becomes unavailable.

tikv.log entries:

2019/11/04 16:49:27.248 INFO raft.rs:858: [region 8359] 12920 received MsgRequestPreVoteResponse from 12920 at term 8
2019/11/04 16:49:27.248 INFO raft.rs:832: [region 8359] 12920 [logterm: 8, index: 354680] sent MsgRequestPreVote request to 8360 at term 8
2019/11/04 16:49:27.248 INFO raft.rs:832: [region 8359] 12920 [logterm: 8, index: 354680] sent MsgRequestPreVote request to 8362 at term 8
2019/11/04 16:49:27.248 INFO transport.rs:258: resolve store 5 address ok, addr 10.111.10.7:20160
2019/11/04 16:49:27.248 INFO raft_client.rs:54: server: new connection with tikv endpoint: 10.111.10.7:20160
2019/11/04 16:49:27.248 INFO raft_client.rs:54: server: new connection with tikv endpoint: 10.111.10.7:20160
2019/11/04 16:49:27.248 INFO raft_client.rs:54: server: new connection with tikv endpoint: 10.111.10.7:20160
2019/11/04 16:49:27.248 INFO raft_client.rs:54: server: new connection with tikv endpoint: 10.111.10.7:20160
2019/11/04 16:49:27.249 WARN raft_client.rs:92: send raftmessage to 10.111.10.7:20160 failed: Grpc(RemoteStopped)
2019/11/04 16:49:27.249 WARN raft_client.rs:92: send raftmessage to 10.111.10.7:20160 failed: Grpc(RemoteStopped)
2019/11/04 16:49:27.249 WARN raft_client.rs:92: send raftmessage to 10.111.10.7:20160 failed: Grpc(RemoteStopped)
2019/11/04 16:49:27.249 WARN raft_client.rs:92: send raftmessage to 10.111.10.7:20160 failed: Grpc(RemoteStopped)
2019/11/04 16:49:27.249 ERRO raft_client.rs:176: server: drop conn with tikv endpoint 10.111.10.7:20160 flush conn error: SendError("…")
2019/11/04 16:49:27.249 WARN raft_client.rs:92: send raftmessage to 10.111.10.7:20160 failed: Grpc(RemoteStopped)
2019/11/04 16:49:27.249 WARN raft_client.rs:92: send raftmessage to 10.111.10.7:20160 failed: Grpc(RemoteStopped)

tidb.log entries:

[2019/11/04 16:48:22.337 +08:00] [INFO] [coprocessor.go:723] ["[TIME_COP_PROCESS] resp_time:501.997046ms txnStartTS:412315101059088385 region_id:16 store_addr:10.111.22.39:20160 backoff_ms:13700 backoff_types:[regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss]"]
[2019/11/04 16:48:22.374 +08:00] [INFO] [coprocessor.go:723] ["[TIME_COP_PROCESS] resp_time:501.803919ms txnStartTS:412315100796944385 region_id:16 store_addr:10.111.22.39:20160 backoff_ms:14700 backoff_types:[regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss]"]
[2019/11/04 16:48:22.476 +08:00] [INFO] [coprocessor.go:723] ["[TIME_COP_PROCESS] resp_time:502.089507ms txnStartTS:412315101622697986 region_id:16 store_addr:10.111.22.39:20160 backoff_ms:11700 backoff_types:[regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss,regionMiss]"]

With 3 replicas on 5 servers, label settings let TiDB guarantee that no server holds two replicas of the same region at the same time. But when 2 of the 5 servers fail, there is still a chance that some region loses two of its three replicas; such a region is left with only one replica, cannot form a Raft majority, and manual intervention is then required.
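For reference, a minimal sketch of the label setup this relies on, assuming a single host label level and the pd1 address from the server list above (with a tidb-ansible deployment the labels themselves are normally declared in inventory.ini rather than set by hand):

```
# Each TiKV instance carries a host label; with tidb-ansible this is usually set
# via the labels field in inventory.ini, which is equivalent to starting TiKV as:
#   bin/tikv-server --labels host=kv1 ...

# Tell PD that "host" is the isolation level for replica placement, so that two
# replicas of one region never land on TiKV instances with the same host label:
pd-ctl -u http://10.111.9.247:2379 -d config set location-labels host

# Check the resulting replication settings:
pd-ctl -u http://10.111.9.247:2379 -d config show
```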

So if the replica count is set equal to the number of TiKV nodes, would that avoid the problem?

5 replicas on 5 nodes would work, but the space overhead is too high.
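For completeness, the replica count is just a PD configuration item. A hedged sketch using the pd1 address from the list above (with 5 replicas, losing any 2 nodes still leaves a 3-out-of-5 majority, which is why this layout survives the same failure):

```
# Raise the number of replicas per region from the default 3 to 5:
pd-ctl -u http://10.111.9.247:2379 -d config set max-replicas 5

# Confirm the new setting (look for "max-replicas" in the output):
pd-ctl -u http://10.111.9.247:2379 -d config show
```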

In that case, what is a reasonable balance between the replica count and the number of TiKV nodes?

Our usual practice is 3 or 5 replicas; the more TiKV instances you have, the lower the chance of hitting this problem. Also keep each TiKV data disk under 4 TB.

OK~~~ One more question. In the earlier test, after the two nodes went down, the TiDB cluster was unavailable. Treating those two nodes as unrecoverable, I tried taking the failed nodes offline, leaving only the 3 remaining TiKV nodes, but the cluster still did not recover. In a situation like this, what should the manual intervention look like?

You can search the documentation for tikv-ctl and use it to repair the affected regions.

I've looked through the official manual. Do you mean the section "Forcibly Recover the Service from Multi-Replica Failure" (强制 Region 从多副本失败状态恢复服务)? tikv-ctl has two modes of operation; does the local mode mean running it on the TiKV server? But I couldn't seem to find the tikv-ctl binary there.

For this, tikv-ctl only supports local execution, which means running the unsafe-recover remove-fail-stores command on the TiKV server itself. If the tikv-ctl binary is not on that machine, you can find it in the tidb-ansible/resources/bin directory of tidb-ansible; copy it to the target TiKV server and run it there.~~~
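A rough sketch of that recovery flow, under the following assumptions: store IDs 4 and 5 are placeholders for whatever IDs PD actually reports for the failed kv4/kv5 stores, /data/tikv/db is a placeholder for each instance's real data path, and the PD address is pd1 from the server list above:

```
# 1. Find the store IDs of the failed TiKV nodes (10.111.22.49 and 10.111.10.7)
#    in the PD store list:
pd-ctl -u http://10.111.9.247:2379 -d store

# 2. Stop the tikv-server process on every surviving TiKV node, then run tikv-ctl
#    locally on each of them against that node's own data directory (example path):
tikv-ctl --db /data/tikv/db unsafe-recover remove-fail-stores -s 4,5 --all-regions
#    (depending on the tikv-ctl version, --all-regions may not be available;
#     in that case pass the affected region IDs explicitly with -r)

# 3. Restart the surviving TiKV instances; regions that had lost their majority can
#    now elect leaders from the remaining replicas and the cluster should serve again.
```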