switch region leader to specific leader due to kv return NotLeader 异常,重启选举

为提高效率,提问时请提供以下信息,问题描述清晰可优先响应。

  • 【TiDB 版本】:
    3.0.12
  • 【问题描述】:
    [2020/09/29 08:00:35.196 +08:00] [INFO] [region_cache.go:548] [“switch region leader to specific leader due to kv return NotLeader”] [regionID=421103] [currIdx=1] [leaderStoreID=20530]

2020/09/29 08:00:56.268 +08:00] [INFO] [coprocessor.go:743] ["[TIME_COP_WAIT] resp_time:1.728901582s txnStartTS:419781058080538629 region_id:539987 store_addr: kv_wait_ms:1728"]
[2020/09/29 08:00:56.268 +08:00] [INFO] [coprocessor.go:743] ["[TIME_COP_WAIT] resp_time:1.728806652s txnStartTS:419781058080538629 region_id:539987 store_addr: kv_wait_ms:1728"]
[2020/09/29 08:00:56.269 +08:00] [INFO] [coprocessor.go:743] ["[TIME_COP_WAIT] resp_time:1.633645594s txnStartTS:419781058067431430 region_id:419943 store_addr: kv_wait_ms:1633"]
[2020/09/29 08:00:56.269 +08:00] [INFO] [coprocessor.go:743] ["[TIME_COP_WAIT] resp_time:1.634110326s txnStartTS:419781058067431430 region_id:419923 store_addr: kv_wait_ms:1633"]
[2020/09/29 08:00:56.269 +08:00] [INFO] [coprocessor.go:743] ["[TIME_COP_PROCESS] resp_time:1.743473423s txnStartTS:419781058080538629 region_id:355992 store_addr: kv_process_ms:269 scan_total_write:70069 scan_processed_write:70068 scan_total_data:70068 scan_processed_data:70068 scan_total_lock:1 scan_processed_lock:0 kv_wait_ms:1474"]
[2020/09/29 08:01:16.044 +08:00] [WARN] [main.go:166] [“config file conf/tidb.toml contained unknown configuration options: pessimistic-txn.ttl”]
[2020/09/29 08:01:16.044 +08:00] [INFO] [printer.go:41] [“Welcome to TiDB.”] [“Release Version”=v3.0.12] [“Git Commit Hash”=8c4696b3f3408c61dd7454204ddd67c93501609a] [“Git Branch”=heads/refs/tags/v3.0.12] [“UTC Build Time”=“2020-03-16 09:56:22”] [GoVersion=“go version go1.13 linux/amd64”] [“Race Enabled”=false] [“Check Table Before Drop”=false] [“TiKV Min Version”=v3.0.0-60965b006877ca7234adaced7890d7b029ed1306]
[2020/09/29 08:01:16.045 +08:00] [INFO] [printer.go:54] [“loaded config”] [config="{“host”:“0.0.0.0”,"advertise-a ]
[2020/09/29 08:01:16.045 +08:00] [INFO] [main.go:255] [“disable Prometheus push client”]
[2020/09/29 08:01:16.045 +08:00] [INFO] [store.go:69] [“new store”] [path=tikv:/]
[2020/09/29 08:01:16.045 +08:00] [INFO] [client.go:162] ["[pd] create pd client with endpoints"] [pd-address="[
]"]
[2020/09/29 08:01:16.045 +08:00] [INFO] [systime_mon.go:26] [“start system time monitor”]
[2020/09/29 08:01:16.047 +08:00] [INFO] [client.go:291] ["[pd] switch leader"] [new-leader=http://
:2379] [old-leader=]
[2020/09/29 08:01:16.047 +08:00] [INFO] [client.go:186] ["[pd] init cluster id"] [cluster-id=6819249905545873576]
[2020/09/29 08:01:16.049 +08:00] [INFO] [store.go:75] [“new store with retry success”]
[2020/09/29 08:01:16.057 +08:00] [INFO] [tidb.go:72] [“new domain”] [store=tikv-6819249905545873576] [“ddl lease”=45s] [“stats lease”=3s]
[2020/09/29 08:01:16.061 +08:00] [INFO] [ddl.go:434] ["[ddl] start DDL"] [ID=32e1cb7c-c5ee-4320-bb37-d0758c845b2b] [runWorker=true]
[2020/09/29 08:01:16.100 +08:00] [INFO] [ddl.go:422] ["[ddl] start delRangeManager OK"] [“is a emulator”=false]
[2020/09/29 08:01:16.100 +08:00] [INFO] [ddl_worker.go:120] ["[ddl] start DDL worker"] [worker=“worker 1, tp general”]
[2020/09/29 08:01:16.100 +08:00] [INFO] [ddl_worker.go:120] ["[ddl] start DDL worker"] [worker=“worker 2, tp add index”]
[2020/09/29 08:01:17.502 +08:00] [INFO] [domain.go:141] [“full load InfoSchema success”] [usedSchemaVersion=0] [neededSchemaVersion=8209] [“start time”=1.26001042s]
[2020/09/29 08:01:17.516 +08:00] [INFO] [domain.go:350] [“full load and reset schema validator”]
[2020/09/29 08:01:17.555 +08:00] [INFO] [gc_worker.go:155] ["[gc worker] start"] [uuid=5d35ca75b44000c]
[2020/09/29 08:01:17.559 +08:00] [INFO] [server.go:220] [“server is running MySQL protocol”] [addr=0.0.0.0:4000]
[2020/09/29 08:01:17.561 +08:00] [INFO] [http_status.go:255] [“for status and metrics report”] [“listening on addr”=0.0.0.0:10080]
[2020/09/29 08:01:17.578 +08:00] [INFO] [server.go:367] [“new connection”] [conn=1] [remoteAddr=
****:46512]

  1. switch region leader to specific leader due to kv return NotLeader 这个是 TiDB 发送 region 请求到 TiKV 上的时候,因为 region cache 过期导致 region 信息不正确,访问到的 region 不是 leader ,这个没有影响的,TiDB 内部会自动重试处理这个错误,找到最新的 leader 所在位置并重新发送请求
  2. tidb 重启这个问题
  • 可以看下 tidb_stderr.log 中有没有一些信息
  • grep -i ‘out of memory’ /var/log/messages 看下 tidb 有没有 OOM 的情况

但是后面又发生了 pd switch leader

查了下,没有看到OOM的情况

我们先梳理一下问题:

1、tidb log 中出现 region cache not leader 的信息向上面那个同学的回复,tidb 会 backoff 理论上不影响使用

2、重启的问题具体是指什么?是 tidb server 因某些异常自动重启了吗?

3、pd switch leader 可以 pd-ctl member 确认下当前的 leader,并且过滤下所有 pd 节点的信息看下 raft 选举的日志,确认下当前 pd leader 是否真正的发生过切换~