pd server角色切换导致max_execution_time参数失效

【 TiDB 使用环境】生产环境
【 TiDB 版本】V5.3.3
【遇到的问题:问题现象及影响】
集群设置max_execution_time=3000,凌晨4点35秒底层网络AVS组件进行有损升级,产生秒级网络抖动,导致PD Server角色进行了切换。角色切换后,集群产生了大量大于3秒的慢查,怀疑max_execute_time参数时效。
疑问点:
如果PD Server导致参数失效,那什么时候这个参数会生效?
【资源配置】
PD Server角色切换后产生大量慢查


通过慢查分析,都是获在PD 上获取时间戳耗时增加

参数设置

PD Server切换前,并没有大于3秒的慢查,如果存在,则会被max_execute_time参数进行KILL

【附件:截图/日志/监控】
PD Server日志:
[2024/03/14 04:35:08.926 +08:00] [WARN] [grpc_service.go:1282] [“trying to update gc safe point”] [old-safe-point=448358726736543772] [new-safe-point=0]
[2024/03/14 04:35:12.678 +08:00] [WARN] [util.go:144] [“apply request took too long”] [took=999.920525ms] [expected-duration=100ms] [prefix=“read-only range “] [request=“key:"/tidb/server/info" range_end:"/tidb/server/infp" “] [response=] [error=“context deadline exceeded”]
[2024/03/14 04:35:12.678 +08:00] [INFO] [trace.go:145] [“trace[1056177004] range”] [detail=”{range_begin:/tidb/server/info; range_end:/tidb/server/infp; }”] [duration=1.000001586s] [start=2024/03/14 04:35:11.678 +08:00] [end=2024/03/14 04:35:12.678 +08:00] [steps=”["trace[1056177004] ‘agreement among raft nodes before linearized reading’ (duration: 999.919526ms)"]”]
[2024/03/14 04:35:13.005 +08:00] [WARN] [raft.go:1011] [“c533035a5e8fb2a1 stepped down to follower since quorum is not active”]
[2024/03/14 04:35:13.005 +08:00] [INFO] [raft.go:700] [“c533035a5e8fb2a1 became follower at term 10”]
[2024/03/14 04:35:13.005 +08:00] [INFO] [node.go:331] [“raft.node: c533035a5e8fb2a1 lost leader c533035a5e8fb2a1 at term 10”]
[2024/03/14 04:35:13.038 +08:00] [INFO] [server.go:1316] [“etcd leader changed, resigns pd leadership”] [old-pd-leader-name=pd-10.252.122.33-2379]
[2024/03/14 04:35:13.038 +08:00] [INFO] [lease.go:153] [“stop lease keep alive worker”] [purpose=“pd leader election”]
[2024/03/14 04:35:13.040 +08:00] [INFO] [raft.go:1365] [“c533035a5e8fb2a1 no leader at term 10; dropping index reading msg”]
[2024/03/14 04:35:13.135 +08:00] [INFO] [raft.go:1365] [“c533035a5e8fb2a1 no leader at term 10; dropping index reading msg”]
[2024/03/14 04:35:13.168 +08:00] [ERROR] [middleware.go:104] [“redirect but server is not leader”] [from=pd-10.252.122.32-2379] [server=pd-10.252.122.33-2379] [error=“[PD:apiutil:ErrRedirect]redirect failed”]
[2024/03/14 04:35:13.239 +08:00] [ERROR] [middleware.go:104] [“redirect but server is not leader”] [from=pd-10.252.122.32-2379] [server=pd-10.252.122.33-2379] [error=“[PD:apiutil:ErrRedirect]redirect failed”]
[2024/03/14 04:35:13.240 +08:00] [ERROR] [middleware.go:104] [“redirect but server is not leader”] [from=pd-10.252.122.33-2379] [server=pd-10.252.122.33-2379] [error=“[PD:apiutil:ErrRedirect]redirect failed”]
[2024/03/14 04:35:13.241 +08:00] [ERROR] [middleware.go:104] [“redirect but server is not leader”] [from=pd-10.252.122.31-2379] [server=pd-10.252.122.33-2379] [error=“[PD:apiutil:ErrRedirect]redirect failed”]
[2024/03/14 04:35:13.270 +08:00] [ERROR] [middleware.go:104] [“redirect but server is not leader”] [from=pd-10.252.122.32-2379] [server=pd-10.252.122.33-2379] [error=“[PD:apiutil:ErrRedirect]redirect failed”]
[2024/03/14 04:35:13.371 +08:00] [ERROR] [middleware.go:104] [“redirect but server is not leader”] [from=pd-10.252.122.32-2379] [server=pd-10.252.122.33-2379] [error=“[PD:apiutil:ErrRedirect]redirect failed”]
[2024/03/14 04:35:13.473 +08:00] [ERROR] [middleware.go:104] [“redirect but server is not leader”] [from=pd-10.252.122.32-2379] [server=pd-10.252.122.33-2379] [error=“[PD:apiutil:ErrRedirect]redirect failed”]
[2024/03/14 04:35:13.574 +08:00] [ERROR] [middleware.go:104] [“redirect but server is not leader”] [from=pd-10.252.122.32-2379] [server=pd-10.252.122.33-2379] [error=“[PD:apiutil:ErrRedirect]redirect failed”]
[2024/03/14 04:35:13.574 +08:00] [ERROR] [middleware.go:104] [“redirect but server is not leader”] [from=pd-10.252.122.32-2379] [server=pd-10.252.122.33-2379] [error=“[PD:apiutil:ErrRedirect]redirect failed”]
[2024/03/14 04:35:14.039 +08:00] [INFO] [cluster.go:373] [“coordinator is stopping”]
[2024/03/14 04:35:14.039 +08:00] [INFO] [cluster.go:329] [“metrics are reset”]
[2024/03/14 04:35:14.039 +08:00] [INFO] [coordinator.go:799] [“scheduler has been stopped”] [scheduler-name=balance-hot-region-scheduler] [error=“context canceled”]
[2024/03/14 04:35:14.039 +08:00] [INFO] [server.go:111] [“region syncer has been stopped”]
[2024/03/14 04:35:14.039 +08:00] [INFO] [coordinator.go:799] [“scheduler has been stopped”] [scheduler-name=balance-leader-scheduler] [error=“context canceled”]
[2024/03/14 04:35:14.039 +08:00] [INFO] [cluster.go:331] [“background jobs has been stopped”]
[2024/03/14 04:35:14.039 +08:00] [INFO] [coordinator.go:220] [“check suspect key ranges has been stopped”]
[2024/03/14 04:35:14.039 +08:00] [INFO] [coordinator.go:799] [“scheduler has been stopped”] [scheduler-name=balance-region-scheduler] [error=“context canceled”]
[2024/03/14 04:35:14.039 +08:00] [INFO] [cluster.go:351] [“statistics background jobs has been stopped”]
[2024/03/14 04:35:14.039 +08:00] [INFO] [coordinator.go:285] [“drive push operator has been stopped”]
[2024/03/14 04:35:14.039 +08:00] [INFO] [coordinator.go:110] [“patrol regions has been stopped”]
[2024/03/14 04:35:14.039 +08:00] [INFO] [cluster.go:369] [“coordinator has been stopped”]
[2024/03/14 04:35:14.039 +08:00] [INFO] [cluster.go:402] [“raftcluster is stopped”]
[2024/03/14 04:35:14.039 +08:00] [INFO] [tso.go:406] [“reset the timestamp in memory”]
[2024/03/14 04:35:14.039 +08:00] [ERROR] [member.go:160] [“no etcd leader, check pd leader later”] [error=“[PD:member:ErrEtcdLeaderNotFound]etcd leader not found”]
[2024/03/14 04:35:14.240 +08:00] [ERROR] [member.go:160] [“no etcd leader, check pd leader later”] [error=“[PD:member:ErrEtcdLeaderNotFound]etcd leader not found”]
[2024/03/14 04:35:14.311 +08:00] [WARN] [util.go:144] [“apply request took too long”] [took=999.954906ms] [expected-duration=100ms] [prefix=“read-only range “] [request=“key:"/tidb/server/info" range_end:"/tidb/server/infp" “] [response=] [error=“context deadline exceeded”]
[2024/03/14 04:35:14.312 +08:00] [INFO] [trace.go:145] [“trace[438333249] range”] [detail=”{range_begin:/tidb/server/info; range_end:/tidb/server/infp; }”] [duration=1.000021585s] [start=2024/03/14 04:35:13.311 +08:00] [end=2024/03/14 04:35:14.312 +08:00] [steps=”["trace[438333249] ‘agreement among raft nodes before linearized reading’ (duration: 999.953899ms)"]”]
[2024/03/14 04:35:14.327 +08:00] [WARN] [util.go:144] [“apply request took too long”] [took=1.000090807s] [expected-duration=100ms] [prefix=“read-only range “] [request=“key:"/tidb/server/info" range_end:"/tidb/server/infp" “] [response=] [error=“context deadline exceeded”]
[2024/03/14 04:35:14.328 +08:00] [INFO] [trace.go:145] [“trace[917517654] range”] [detail=”{range_begin:/tidb/server/info; range_end:/tidb/server/infp; }”] [duration=1.000146307s] [start=2024/03/14 04:35:13.327 +08:00] [end=2024/03/14 04:35:14.328 +08:00] [steps=”["trace[917517654] ‘agreement among raft nodes before linearized reading’ (duration: 1.000089774s)"]”]
[2024/03/14 04:35:14.440 +08:00] [ERROR] [member.go:160] [“no etcd leader, check pd leader later”] [error=“[PD:member:ErrEtcdLeaderNotFound]etcd leader not found”]
[2024/03/14 04:35:14.595 +08:00] [ERROR] [middleware.go:104] [“redirect but server is not leader”] [from=pd-10.252.122.32-2379] [server=pd-10.252.122.33-2379] [error=“[PD:apiutil:ErrRedirect]redirect failed”]
[2024/03/14 04:35:14.640 +08:00] [ERROR] [member.go:160] [“no etcd leader, check pd leader later”] [error=“[PD:member:ErrEtcdLeaderNotFound]etcd leader not found”]
[2024/03/14 04:35:14.696 +08:00] [ERROR] [middleware.go:104] [“redirect but server is not leader”] [from=pd-10.252.122.32-2379] [server=pd-10.252.122.33-2379] [error=“[PD:apiutil:ErrRedirect]redirect failed”]
[2024/03/14 04:35:14.797 +08:00] [ERROR] [middleware.go:104] [“redirect but server is not leader”] [from=pd-10.252.122.32-2379] [server=pd-10.252.122.33-2379] [error=“[PD:apiutil:ErrRedirect]redirect failed”]

我改成 bug 反馈了。辛苦完善下现象描述。

1 个赞

感觉你得先检查你现在其他节点到pd之间的网络是不是有问题

修改失败了吧 怎么还重定向leader了

会不会是出现了频繁的切换。。。。

看日志pd一秒多,实际上tidb到能正确获取tso可能不止一秒多

可以关注下这个 ISSUE:https://github.com/pingcap/tidb/issues/51893