TiKV goes down frequently; the logs show RPC-related errors

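First use --force to upgrade the cluster to v4.0.13; the slow query problem can be dealt with afterwards. The priority is to get the whole cluster onto a single version. Judging from the monitoring screenshot, TiKV is still running multiple versions, which is fairly risky. I suggest upgrading first and investigating the other problems once the versions are aligned.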
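To confirm whether TiKV is still running mixed versions, one option is to group the stores by the version they report. A minimal Python sketch, assuming JSON in the shape printed by pd-ctl's `store` command (a `stores` list whose entries carry `store.address` and `store.version`). The sample addresses and versions are hypothetical, and `versions_by_store` is an illustrative helper name, not part of any TiDB tool:

```python
import json
from collections import defaultdict

# Hypothetical sample in the shape of pd-ctl `store` output.
SAMPLE = """
{
  "count": 3,
  "stores": [
    {"store": {"address": "10.0.0.1:20160", "version": "4.0.13"}},
    {"store": {"address": "10.0.0.2:20160", "version": "4.0.13"}},
    {"store": {"address": "10.0.0.3:20160", "version": "4.0.6"}}
  ]
}
"""


def versions_by_store(store_json: str) -> dict:
    """Map each reported TiKV version to the store addresses running it."""
    grouped = defaultdict(list)
    for item in json.loads(store_json)["stores"]:
        store = item["store"]
        grouped[store["version"]].append(store["address"])
    return dict(grouped)


if __name__ == "__main__":
    grouped = versions_by_store(SAMPLE)
    for version, addresses in sorted(grouped.items()):
        print(version, addresses)
    if len(grouped) > 1:
        print("mixed TiKV versions detected")
```

If more than one version key comes back, the cluster is still in the mixed state described above.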

Did it eventually exit with an error? When it exits with an error it produces a debug log; you can post that and we will analyze it.

There was no error at all, and the process is still there whenever I check.

After killing the process and retrying, the error shown in the screenshot appeared. Should I add a timeout parameter?

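I tried several times, and adding the timeout parameter does not seem to help. All the TiKV nodes are now on v4.0.13, but TiDB is still on 4.0.6. What should I do next?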

You can try replay to finish upgrading the remaining TiDB nodes. See the official documentation for the procedure.

https://docs.pingcap.com/zh/tidb/stable/tiup-component-cluster-replay#tiup-cluster-replay


It says the replay command does not exist.

You need to update tiup. Upgrade tiup to the latest version first.



The whole cluster has now been upgraded.

Is the cluster healthy now?

It has not recovered. The cluster status is as shown in the screenshot.

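The TiKV log from node 236 is below.
Link: Baidu Netdisk (link no longer exists), password: 0hto
The clock on this node is 8 hours behind the actual time: 4 o'clock in the log corresponds to 12 o'clock now, so these are the latest log entries.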
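When reading that log, the 8-hour offset can be applied mechanically. A small Python sketch, assuming the offset is exactly 8 hours; the timestamp format, the sample timestamp, and the helper name `to_wall_clock` are hypothetical and not taken from the actual log:

```python
from datetime import datetime, timedelta

# Offset reported in the thread: the node's log time is 8 hours behind.
LOG_OFFSET = timedelta(hours=8)


def to_wall_clock(log_time: str) -> datetime:
    """Convert a log timestamp ("YYYY/MM/DD HH:MM:SS") to the actual time."""
    return datetime.strptime(log_time, "%Y/%m/%d %H:%M:%S") + LOG_OFFSET


if __name__ == "__main__":
    # Hypothetical log timestamp; 04:00 in the log maps to 12:00 actual time.
    print(to_wall_clock("2021/06/17 04:00:00"))
```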

What is the TiKV service status now? Judging from the latest log, it appears to be producing normal output, with no errors being thrown.

After waiting a while, the cluster status is now basically normal. We plan to keep observing it during use.

Sounds good ~

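The cluster upgrade does not seem to have helped. Node 236 still went down after running for a while, and the errors and OOM are still occurring.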

I suggest checking whether there are slow queries or large concurrent queries; these are the more likely cause of the TiKV OOM. You can refer to the relevant troubleshooting documentation.

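If the OOM were caused only by SQL statements, how would that explain why it is always node 236 that goes down and never any other node? Below is the store status; you can see that node 236's score is relatively low. Could the problem be analyzed from that angle?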
store
{
  "count": 6,
  "stores": [
    {
      "store": {
        "id": 24480822,
        "address": "10.12.5.239:20160",
        "version": "4.0.13",
        "status_address": "10.12.5.239:20180",
        "git_hash": "a448d617f79ddf545be73931525bb41af0f790f3",
        "start_timestamp": 1623896824,
        "deploy_path": "/home/tidb/deploy/bin",
        "last_heartbeat": 1624033241606373301,
        "state_name": "Up"
      },
      "status": {
        "capacity": "5.952TiB",
        "available": "3.184TiB",
        "used_size": "2.632TiB",
        "leader_count": 24073,
        "leader_weight": 2,
        "leader_score": 12036.5,
        "leader_size": 1899360,
        "region_count": 99337,
        "region_weight": 2,
        "region_score": 3798208,
        "region_size": 7596416,
        "start_ts": "2021-06-17T02:27:04Z",
        "last_heartbeat_ts": "2021-06-18T16:20:41.606373301Z",
        "uptime": "37h53m37.606373301s"
      }
    },
    {
      "store": {
        "id": 24590972,
        "address": "10.12.5.240:20160",
        "version": "4.0.13",
        "status_address": "10.12.5.240:20180",
        "git_hash": "a448d617f79ddf545be73931525bb41af0f790f3",
        "start_timestamp": 1623897040,
        "deploy_path": "/home/tidb/deploy/bin",
        "last_heartbeat": 1624033246028686561,
        "state_name": "Up"
      },
      "status": {
        "capacity": "5.952TiB",
        "available": "3.701TiB",
        "used_size": "2.056TiB",
        "leader_count": 39611,
        "leader_weight": 2,
        "leader_score": 19805.5,
        "leader_size": 3176923,
        "region_count": 81485,
        "region_weight": 2,
        "region_score": 3216358.5,
        "region_size": 6432717,
        "start_ts": "2021-06-17T02:30:40Z",
        "last_heartbeat_ts": "2021-06-18T16:20:46.028686561Z",
        "uptime": "37h50m6.028686561s"
      }
    },
    {
      "store": {
        "id": 38833310,
        "address": "10.12.5.147:20160",
        "version": "4.0.13",
        "status_address": "10.12.5.147:20180",
        "git_hash": "a448d617f79ddf545be73931525bb41af0f790f3",
        "start_timestamp": 1623897114,
        "deploy_path": "/home/tidb/deploy/bin",
        "last_heartbeat": 1624033246068167980,
        "state_name": "Up"
      },
      "status": {
        "capacity": "5.952TiB",
        "available": "3.523TiB",
        "used_size": "2.143TiB",
        "leader_count": 38076,
        "leader_weight": 2,
        "leader_score": 19038,
        "leader_size": 2990560,
        "region_count": 78377,
        "region_weight": 2,
        "region_score": 3095615,
        "region_size": 6191230,
        "start_ts": "2021-06-17T02:31:54Z",
        "last_heartbeat_ts": "2021-06-18T16:20:46.06816798Z",
        "uptime": "37h48m52.06816798s"
      }
    },
    {
      "store": {
        "id": 262397455,
        "address": "10.12.5.13:20160",
        "version": "4.0.13",
        "status_address": "10.12.5.13:20180",
        "git_hash": "a448d617f79ddf545be73931525bb41af0f790f3",
        "start_timestamp": 1623926310,
        "deploy_path": "/home/tidb/deploy/bin",
        "last_heartbeat": 1624033242917893711,
        "state_name": "Up"
      },
      "status": {
        "capacity": "5.952TiB",
        "available": "4.591TiB",
        "used_size": "1.315TiB",
        "leader_count": 19806,
        "leader_weight": 1,
        "leader_score": 19806,
        "leader_size": 1368229,
        "region_count": 50661,
        "region_weight": 1,
        "region_score": 3797779,
        "region_size": 3797779,
        "start_ts": "2021-06-17T10:38:30Z",
        "last_heartbeat_ts": "2021-06-18T16:20:42.917893711Z",
        "uptime": "29h42m12.917893711s"
      }
    },
    {
      "store": {
        "id": 268391998,
        "address": "10.12.5.119:20160",
        "version": "4.0.13",
        "status_address": "10.12.5.119:20180",
        "git_hash": "a448d617f79ddf545be73931525bb41af0f790f3",
        "start_timestamp": 1623897869,
        "deploy_path": "/home/tidb/deploy/bin",
        "last_heartbeat": 1624033249294417713,
        "state_name": "Up"
      },
      "status": {
        "capacity": "320TiB",
        "available": "287.9TiB",
        "used_size": "1.287TiB",
        "leader_count": 18547,
        "leader_weight": 1,
        "leader_score": 18547,
        "leader_size": 1460594,
        "region_count": 50159,
        "region_weight": 1,
        "region_score": 3798241,
        "region_size": 3798241,
        "start_ts": "2021-06-17T02:44:29Z",
        "last_heartbeat_ts": "2021-06-18T16:20:49.294417713Z",
        "uptime": "37h36m20.294417713s"
      }
    },
    {
      "store": {
        "id": 24478148,
        "address": "10.12.5.236:20160",
        "version": "4.0.13",
        "status_address": "10.12.5.236:20180",
        "git_hash": "a448d617f79ddf545be73931525bb41af0f790f3",
        "start_timestamp": 1624003221,
        "deploy_path": "/home/tidb/deploy/bin",
        "last_heartbeat": 1624033241733199445,
        "state_name": "Up"
      },
      "status": {
        "capacity": "5.952TiB",
        "available": "3.529TiB",
        "used_size": "1.75TiB",
        "leader_count": 5,
        "leader_weight": 2,
        "leader_score": 2.5,
        "leader_size": 527,
        "region_count": 60365,
        "region_weight": 2,
        "region_score": 2436932.5,
        "region_size": 4873865,
        "start_ts": "2021-06-18T08:00:21Z",
        "last_heartbeat_ts": "2021-06-18T16:20:41.733199445Z",
        "uptime": "8h20m20.733199445s"
      }
    }
  ]
}
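The numbers in this output can be cross-checked with a short script. In the values posted here, `leader_score` equals `leader_count / leader_weight` and `region_score` equals `region_size / region_weight` for every store; that relationship is an observation from this particular output, not a statement about how PD computes scores in general. A minimal Python sketch with the six stores' values copied from the output (the `Store` tuple and the `summarize` helper are illustrative names):

```python
from typing import NamedTuple


class Store(NamedTuple):
    address: str
    leader_count: int
    leader_weight: float
    region_size: int
    region_weight: float
    start_ts: str


# Values copied from the store output in this thread.
STORES = [
    Store("10.12.5.239:20160", 24073, 2, 7596416, 2, "2021-06-17T02:27:04Z"),
    Store("10.12.5.240:20160", 39611, 2, 6432717, 2, "2021-06-17T02:30:40Z"),
    Store("10.12.5.147:20160", 38076, 2, 6191230, 2, "2021-06-17T02:31:54Z"),
    Store("10.12.5.13:20160", 19806, 1, 3797779, 1, "2021-06-17T10:38:30Z"),
    Store("10.12.5.119:20160", 18547, 1, 3798241, 1, "2021-06-17T02:44:29Z"),
    Store("10.12.5.236:20160", 5, 2, 4873865, 2, "2021-06-18T08:00:21Z"),
]


def summarize(store: Store) -> tuple:
    """Return (address, leader_score, region_score, start_ts) for one store."""
    leader_score = store.leader_count / store.leader_weight
    region_score = store.region_size / store.region_weight
    return (store.address, leader_score, region_score, store.start_ts)


if __name__ == "__main__":
    # Print stores from lowest to highest leader score.
    for row in sorted(map(summarize, STORES), key=lambda r: r[1]):
        print(row)
```

Sorted this way, 10.12.5.236 comes first with a leader score of 2.5, and it is also the store with the most recent `start_ts`. The low leader count may therefore be a consequence of the node's recent restart rather than an explanation for it, which would need to be confirmed against the monitoring data.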
