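TiFlash deployed via Operator keeps hitting OOM and cannot start. How can I repair this broken node without affecting the tables that have already been replicated?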

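[TiDB environment]

Production environment; TiFlash deployed via tidb-operator
[Overview] Scenario + problem summary
OOM while replicating tables to TiFlash (some tables had already finished replicating)
[Background] What operations were performed
OOM while replicating tables to TiFlash (some tables had already finished replicating)
[Symptoms] Business and database symptoms
OOM while replicating tables to TiFlash (some tables had already finished replicating)
[Problem] The problem currently encountered
OOM while replicating tables to TiFlash (some tables had already finished replicating)


[Business impact]

[TiDB version]
tiflash:4.0.8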

Same problem as the thread "TiFlash OOM, cannot start (production node)" - #16, from Lawrence

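I would like to know how to take this TiFlash node offline. It is tiflash-0, so scaling in to remove it would take out all the nodes. Is there a command that can take only tiflash-0 offline, wipe the data on tiflash-0, and then rebuild it?
Please suggest a workable procedure.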

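  1. Is there only one TiFlash node? If so, then yes: clean up the data, scale in, then scale out again.
  2. If there are other nodes, please collect the following information.
    In the upper-left corner, select one of the nodes that hit OOM and export the following monitoring dashboards:
    TiFlash-Summary
    TiFlash-Proxy-Details
    Node_exporter
    as well as the logs of the OOM node: tiflash.log/tiflash_tikv.log/tiflash_error.log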

There are 2 TiFlash nodes.

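What is the purpose of exporting this data? To locate the problem, or to look at the data distribution? Getting at the monitoring data is fairly troublesome on our side, and the problem should be the one shown in the screenshot I posted above.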

Have you already located the problem?
This is a k8s environment; do you need to delete the tiflash-0 pod?

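Judging from older threads I found by searching, it is most likely the same problem.
I do want to delete tiflash-0, but I hope this will not affect reads on the tables that have already been replicated.
There is also tiflash-1, which is running normally.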

Which operator version and TiDB version are you running?

tiflash.zip (1.8 MB)
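You can also check whether this is really the same problem.
I would still like guidance first on how to rebuild tiflash-0.
After that, please explain how to avoid hitting this problem again until we upgrade.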

tidb-operator is v1.1.6, and TiDB is v4.0.8.

Refer to this document:
https://docs.pingcap.com/zh/tidb-in-kubernetes/dev/advanced-statefulset#操作-tidbcluster-对象指定-pod-进行缩容

:+1:

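After scaling in, I have to delete the PVC, right? Otherwise the new pod will just keep hitting OOM.
If I delete it, the data on tiflash-0 is lost, isn't it?
Do the already-replicated tables then have to be replicated again?
How do the 2 TiFlash replicas store data? Once tiflash-0 went down, was all the data re-replicated to tiflash-1?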

The supported annotations do not include tiflash :sweat_smile:

I force-deleted the tiflash-0 pod and PVC, and tiflash-0 still will not start:
tiflash

[2021/08/31 11:02:54.066 +00:00] [INFO] [util.rs:419] ["connecting to PD endpoint"] [endpoints=http://tidb-pkx070i9ww-pd-2.tidb-pkx070i9ww-pd-peer.tidb-pkx070i9ww.svc:2379]
[2021/08/31 11:02:54.067 +00:00] [INFO] [<unknown>] ["New connected subchannel at 0x7f930fe54870 for subchannel 0x7f930fe3b540"]
[2021/08/31 11:02:54.068 +00:00] [INFO] [util.rs:419] ["connecting to PD endpoint"] [endpoints=http://tidb-pkx070i9ww-pd-1.tidb-pkx070i9ww-pd-peer.tidb-pkx070i9ww.svc:2379]
[2021/08/31 11:02:54.070 +00:00] [INFO] [util.rs:484] ["connected to PD leader"] [endpoints=http://tidb-pkx070i9ww-pd-1.tidb-pkx070i9ww-pd-peer.tidb-pkx070i9ww.svc:2379]
[2021/08/31 11:02:54.070 +00:00] [INFO] [util.rs:190] ["heartbeat sender and receiver are stale, refreshing ..."]
[2021/08/31 11:02:54.070 +00:00] [WARN] [util.rs:209] ["updating PD client done"] [spend=4.603743ms]
[2021/08/31 11:02:54.072 +00:00] [ERROR] [util.rs:347] ["request failed"] [err_code=KV-PD-gRPC] [err="Grpc(RpcFailure(RpcStatus { status: 2-UNKNOWN, details: Some(\"duplicated store address: id:2272860 address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:3930\\\" labels:<key:\\\"engine\\\" value:\\\"tiflash\\\" > version:\\\"v4.0.8\\\" peer_address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:20170\\\" status_address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:20292\\\" git_hash:\\\"f0a78d93e440dac7c7935ea7e67c656b1bb5f913\\\" start_timestamp:1630407774 deploy_path:\\\"/tiflash\\\" , already registered by id:88 address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:3930\\\" labels:<key:\\\"engine\\\" value:\\\"tiflash\\\" > labels:<key:\\\"failure-domain.beta.kubernetes.io/jke-fd\\\" value:\\\"2\\\" > labels:<key:\\\"failure-domain.beta.kubernetes.io/zone\\\" value:\\\"cn-north-1a\\\" > labels:<key:\\\"kubernetes.io/hostname\\\" value:\\\"k8s-node-vmilzu-k62y83l2f1\\\" > version:\\\"v4.0.8\\\" peer_address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:20170\\\" status_address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:20292\\\" git_hash:\\\"f0a78d93e440dac7c7935ea7e67c656b1bb5f913\\\" start_timestamp:1630407038 deploy_path:\\\"/tiflash\\\" last_heartbeat:1629718058744395420 \") }))"]
[2021/08/31 11:02:54.073 +00:00] [ERROR] [util.rs:347] ["request failed"] [err_code=KV-PD-gRPC] [err="Grpc(RpcFailure(RpcStatus { status: 2-UNKNOWN, details: Some(\"duplicated store address: id:2272860 address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:3930\\\" labels:<key:\\\"engine\\\" value:\\\"tiflash\\\" > version:\\\"v4.0.8\\\" peer_address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:20170\\\" status_address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:20292\\\" git_hash:\\\"f0a78d93e440dac7c7935ea7e67c656b1bb5f913\\\" start_timestamp:1630407774 deploy_path:\\\"/tiflash\\\" , already registered by id:88 address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:3930\\\" labels:<key:\\\"engine\\\" value:\\\"tiflash\\\" > labels:<key:\\\"failure-domain.beta.kubernetes.io/jke-fd\\\" value:\\\"2\\\" > labels:<key:\\\"failure-domain.beta.kubernetes.io/zone\\\" value:\\\"cn-north-1a\\\" > labels:<key:\\\"kubernetes.io/hostname\\\" value:\\\"k8s-node-vmilzu-k62y83l2f1\\\" > version:\\\"v4.0.8\\\" peer_address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:20170\\\" status_address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:20292\\\" git_hash:\\\"f0a78d93e440dac7c7935ea7e67c656b1bb5f913\\\" start_timestamp:1630407038 deploy_path:\\\"/tiflash\\\" last_heartbeat:1629718058744395420 \") }))"]
[2021/08/31 11:02:54.074 +00:00] [ERROR] [util.rs:347] ["request failed"] [err_code=KV-PD-gRPC] [err="Grpc(RpcFailure(RpcStatus { status: 2-UNKNOWN, details: Some(\"duplicated store address: id:2272860 address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:3930\\\" labels:<key:\\\"engine\\\" value:\\\"tiflash\\\" > version:\\\"v4.0.8\\\" peer_address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:20170\\\" status_address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:20292\\\" git_hash:\\\"f0a78d93e440dac7c7935ea7e67c656b1bb5f913\\\" start_timestamp:1630407774 deploy_path:\\\"/tiflash\\\" , already registered by id:88 address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:3930\\\" labels:<key:\\\"engine\\\" value:\\\"tiflash\\\" > labels:<key:\\\"failure-domain.beta.kubernetes.io/jke-fd\\\" value:\\\"2\\\" > labels:<key:\\\"failure-domain.beta.kubernetes.io/zone\\\" value:\\\"cn-north-1a\\\" > labels:<key:\\\"kubernetes.io/hostname\\\" value:\\\"k8s-node-vmilzu-k62y83l2f1\\\" > version:\\\"v4.0.8\\\" peer_address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:20170\\\" status_address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:20292\\\" git_hash:\\\"f0a78d93e440dac7c7935ea7e67c656b1bb5f913\\\" start_timestamp:1630407038 deploy_path:\\\"/tiflash\\\" last_heartbeat:1629718058744395420 \") }))"]
[2021/08/31 11:02:54.075 +00:00] [ERROR] [util.rs:347] ["request failed"] [err_code=KV-PD-gRPC] [err="Grpc(RpcFailure(RpcStatus { status: 2-UNKNOWN, details: Some(\"duplicated store address: id:2272860 address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:3930\\\" labels:<key:\\\"engine\\\" value:\\\"tiflash\\\" > version:\\\"v4.0.8\\\" peer_address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:20170\\\" status_address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:20292\\\" git_hash:\\\"f0a78d93e440dac7c7935ea7e67c656b1bb5f913\\\" start_timestamp:1630407774 deploy_path:\\\"/tiflash\\\" , already registered by id:88 address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:3930\\\" labels:<key:\\\"engine\\\" value:\\\"tiflash\\\" > labels:<key:\\\"failure-domain.beta.kubernetes.io/jke-fd\\\" value:\\\"2\\\" > labels:<key:\\\"failure-domain.beta.kubernetes.io/zone\\\" value:\\\"cn-north-1a\\\" > labels:<key:\\\"kubernetes.io/hostname\\\" value:\\\"k8s-node-vmilzu-k62y83l2f1\\\" > version:\\\"v4.0.8\\\" peer_address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:20170\\\" status_address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:20292\\\" git_hash:\\\"f0a78d93e440dac7c7935ea7e67c656b1bb5f913\\\" start_timestamp:1630407038 deploy_path:\\\"/tiflash\\\" last_heartbeat:1629718058744395420 \") }))"]
[2021/08/31 11:02:54.076 +00:00] [ERROR] [util.rs:347] ["request failed"] [err_code=KV-PD-gRPC] [err="Grpc(RpcFailure(RpcStatus { status: 2-UNKNOWN, details: Some(\"duplicated store address: id:2272860 address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:3930\\\" labels:<key:\\\"engine\\\" value:\\\"tiflash\\\" > version:\\\"v4.0.8\\\" peer_address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:20170\\\" status_address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:20292\\\" git_hash:\\\"f0a78d93e440dac7c7935ea7e67c656b1bb5f913\\\" start_timestamp:1630407774 deploy_path:\\\"/tiflash\\\" , already registered by id:88 address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:3930\\\" labels:<key:\\\"engine\\\" value:\\\"tiflash\\\" > labels:<key:\\\"failure-domain.beta.kubernetes.io/jke-fd\\\" value:\\\"2\\\" > labels:<key:\\\"failure-domain.beta.kubernetes.io/zone\\\" value:\\\"cn-north-1a\\\" > labels:<key:\\\"kubernetes.io/hostname\\\" value:\\\"k8s-node-vmilzu-k62y83l2f1\\\" > version:\\\"v4.0.8\\\" peer_address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:20170\\\" status_address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:20292\\\" git_hash:\\\"f0a78d93e440dac7c7935ea7e67c656b1bb5f913\\\" start_timestamp:1630407038 deploy_path:\\\"/tiflash\\\" last_heartbeat:1629718058744395420 \") }))"]
[2021/08/31 11:02:54.077 +00:00] [ERROR] [util.rs:347] ["request failed"] [err_code=KV-PD-gRPC] [err="Grpc(RpcFailure(RpcStatus { status: 2-UNKNOWN, details: Some(\"duplicated store address: id:2272860 address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:3930\\\" labels:<key:\\\"engine\\\" value:\\\"tiflash\\\" > version:\\\"v4.0.8\\\" peer_address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:20170\\\" status_address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:20292\\\" git_hash:\\\"f0a78d93e440dac7c7935ea7e67c656b1bb5f913\\\" start_timestamp:1630407774 deploy_path:\\\"/tiflash\\\" , already registered by id:88 address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:3930\\\" labels:<key:\\\"engine\\\" value:\\\"tiflash\\\" > labels:<key:\\\"failure-domain.beta.kubernetes.io/jke-fd\\\" value:\\\"2\\\" > labels:<key:\\\"failure-domain.beta.kubernetes.io/zone\\\" value:\\\"cn-north-1a\\\" > labels:<key:\\\"kubernetes.io/hostname\\\" value:\\\"k8s-node-vmilzu-k62y83l2f1\\\" > version:\\\"v4.0.8\\\" peer_address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:20170\\\" status_address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:20292\\\" git_hash:\\\"f0a78d93e440dac7c7935ea7e67c656b1bb5f913\\\" start_timestamp:1630407038 deploy_path:\\\"/tiflash\\\" last_heartbeat:1629718058744395420 \") }))"]
[2021/08/31 11:02:54.078 +00:00] [ERROR] [util.rs:347] ["request failed"] [err_code=KV-PD-gRPC] [err="Grpc(RpcFailure(RpcStatus { status: 2-UNKNOWN, details: Some(\"duplicated store address: id:2272860 address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:3930\\\" labels:<key:\\\"engine\\\" value:\\\"tiflash\\\" > version:\\\"v4.0.8\\\" peer_address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:20170\\\" status_address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:20292\\\" git_hash:\\\"f0a78d93e440dac7c7935ea7e67c656b1bb5f913\\\" start_timestamp:1630407774 deploy_path:\\\"/tiflash\\\" , already registered by id:88 address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:3930\\\" labels:<key:\\\"engine\\\" value:\\\"tiflash\\\" > labels:<key:\\\"failure-domain.beta.kubernetes.io/jke-fd\\\" value:\\\"2\\\" > labels:<key:\\\"failure-domain.beta.kubernetes.io/zone\\\" value:\\\"cn-north-1a\\\" > labels:<key:\\\"kubernetes.io/hostname\\\" value:\\\"k8s-node-vmilzu-k62y83l2f1\\\" > version:\\\"v4.0.8\\\" peer_address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:20170\\\" status_address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:20292\\\" git_hash:\\\"f0a78d93e440dac7c7935ea7e67c656b1bb5f913\\\" start_timestamp:1630407038 deploy_path:\\\"/tiflash\\\" last_heartbeat:1629718058744395420 \") }))"]
[2021/08/31 11:02:54.079 +00:00] [ERROR] [util.rs:347] ["request failed"] [err_code=KV-PD-gRPC] [err="Grpc(RpcFailure(RpcStatus { status: 2-UNKNOWN, details: Some(\"duplicated store address: id:2272860 address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:3930\\\" labels:<key:\\\"engine\\\" value:\\\"tiflash\\\" > version:\\\"v4.0.8\\\" peer_address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:20170\\\" status_address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:20292\\\" git_hash:\\\"f0a78d93e440dac7c7935ea7e67c656b1bb5f913\\\" start_timestamp:1630407774 deploy_path:\\\"/tiflash\\\" , already registered by id:88 address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:3930\\\" labels:<key:\\\"engine\\\" value:\\\"tiflash\\\" > labels:<key:\\\"failure-domain.beta.kubernetes.io/jke-fd\\\" value:\\\"2\\\" > labels:<key:\\\"failure-domain.beta.kubernetes.io/zone\\\" value:\\\"cn-north-1a\\\" > labels:<key:\\\"kubernetes.io/hostname\\\" value:\\\"k8s-node-vmilzu-k62y83l2f1\\\" > version:\\\"v4.0.8\\\" peer_address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:20170\\\" status_address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:20292\\\" git_hash:\\\"f0a78d93e440dac7c7935ea7e67c656b1bb5f913\\\" start_timestamp:1630407038 deploy_path:\\\"/tiflash\\\" last_heartbeat:1629718058744395420 \") }))"]
[2021/08/31 11:02:54.081 +00:00] [ERROR] [util.rs:347] ["request failed"] [err_code=KV-PD-gRPC] [err="Grpc(RpcFailure(RpcStatus { status: 2-UNKNOWN, details: Some(\"duplicated store address: id:2272860 address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:3930\\\" labels:<key:\\\"engine\\\" value:\\\"tiflash\\\" > version:\\\"v4.0.8\\\" peer_address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:20170\\\" status_address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:20292\\\" git_hash:\\\"f0a78d93e440dac7c7935ea7e67c656b1bb5f913\\\" start_timestamp:1630407774 deploy_path:\\\"/tiflash\\\" , already registered by id:88 address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:3930\\\" labels:<key:\\\"engine\\\" value:\\\"tiflash\\\" > labels:<key:\\\"failure-domain.beta.kubernetes.io/jke-fd\\\" value:\\\"2\\\" > labels:<key:\\\"failure-domain.beta.kubernetes.io/zone\\\" value:\\\"cn-north-1a\\\" > labels:<key:\\\"kubernetes.io/hostname\\\" value:\\\"k8s-node-vmilzu-k62y83l2f1\\\" > version:\\\"v4.0.8\\\" peer_address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:20170\\\" status_address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:20292\\\" git_hash:\\\"f0a78d93e440dac7c7935ea7e67c656b1bb5f913\\\" start_timestamp:1630407038 deploy_path:\\\"/tiflash\\\" last_heartbeat:1629718058744395420 \") }))"]
[2021/08/31 11:02:54.081 +00:00] [FATAL] [server.rs:620] ["failed to start node: Grpc(RpcFailure(RpcStatus { status: 2-UNKNOWN, details: Some(\"duplicated store address: id:2272860 address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:3930\\\" labels:<key:\\\"engine\\\" value:\\\"tiflash\\\" > version:\\\"v4.0.8\\\" peer_address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:20170\\\" status_address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:20292\\\" git_hash:\\\"f0a78d93e440dac7c7935ea7e67c656b1bb5f913\\\" start_timestamp:1630407774 deploy_path:\\\"/tiflash\\\" , already registered by id:88 address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:3930\\\" labels:<key:\\\"engine\\\" value:\\\"tiflash\\\" > labels:<key:\\\"failure-domain.beta.kubernetes.io/jke-fd\\\" value:\\\"2\\\" > labels:<key:\\\"failure-domain.beta.kubernetes.io/zone\\\" value:\\\"cn-north-1a\\\" > labels:<key:\\\"kubernetes.io/hostname\\\" value:\\\"k8s-node-vmilzu-k62y83l2f1\\\" > version:\\\"v4.0.8\\\" peer_address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:20170\\\" status_address:\\\"tidb-pkx070i9ww-tiflash-0.tidb-pkx070i9ww-tiflash-peer.tidb-pkx070i9ww.svc:20292\\\" git_hash:\\\"f0a78d93e440dac7c7935ea7e67c656b1bb5f913\\\" start_timestamp:1630407038 deploy_path:\\\"/tiflash\\\" last_heartbeat:1629718058744395420 \") }))"]
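
In the error above, PD rejects the newly started store (id:2272860) because the same address is still registered to the old store (id:88), so the old store is the one that has to be cleared from PD. As a rough sketch, the stale ID can be pulled out of the proxy log with `sed`; the function name `stale_store_id` is made up for illustration, and it assumes the log text contains the phrase `already registered by id:` as shown above.

```shell
#!/bin/sh
# Hypothetical helper: read TiFlash proxy log lines on stdin and print the
# ID(s) of the store that PD reports as already owning the address.
stale_store_id() {
  # Keep only lines with the "already registered by id:<N>" phrase,
  # print <N>, and collapse the repeated retries into one line per ID.
  sed -n 's/.*already registered by id:\([0-9][0-9]*\).*/\1/p' | sort -u
}
```

Assumed usage: `stale_store_id < tiflash_tikv.log`. For the log above this should print the old store ID, which can then be checked against the store list in pd-ctl before removing anything.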

serverlog:

[2021/08/31 11:05:12.145 +00:00] [INFO] [<unknown>] ["SchemaBuilder: Altering table lbs_counter(45).voucher_snapshot_10(65)"] [thread_id=1]
[2021/08/31 11:05:12.145 +00:00] [INFO] [<unknown>] ["SchemaBuilder: No schema change detected for table lbs_counter(45).voucher_snapshot_10(65), not altering"] [thread_id=1]
[2021/08/31 11:05:12.145 +00:00] [INFO] [<unknown>] ["SchemaBuilder: Altering table lbs_counter(45).voucher_snapshot_11(67)"] [thread_id=1]
[2021/08/31 11:05:12.145 +00:00] [INFO] [<unknown>] ["SchemaBuilder: No schema change detected for table lbs_counter(45).voucher_snapshot_11(67), not altering"] [thread_id=1]
[2021/08/31 11:05:12.145 +00:00] [INFO] [<unknown>] ["SchemaBuilder: Altering table lbs_counter(45).voucher_snapshot_12(69)"] [thread_id=1]
[2021/08/31 11:05:12.145 +00:00] [INFO] [<unknown>] ["SchemaBuilder: No schema change detected for table lbs_counter(45).voucher_snapshot_12(69), not altering"] [thread_id=1]
[2021/08/31 11:05:12.145 +00:00] [INFO] [<unknown>] ["SchemaBuilder: Altering table lbs_counter(45).voucher_snapshot_13(71)"] [thread_id=1]
[2021/08/31 11:05:12.145 +00:00] [INFO] [<unknown>] ["SchemaBuilder: No schema change detected for table lbs_counter(45).voucher_snapshot_13(71), not altering"] [thread_id=1]
[2021/08/31 11:05:12.145 +00:00] [INFO] [<unknown>] ["SchemaBuilder: Altering table lbs_counter(45).voucher_snapshot_14(73)"] [thread_id=1]
[2021/08/31 11:05:12.145 +00:00] [INFO] [<unknown>] ["SchemaBuilder: No schema change detected for table lbs_counter(45).voucher_snapshot_14(73), not altering"] [thread_id=1]
[2021/08/31 11:05:12.145 +00:00] [INFO] [<unknown>] ["SchemaBuilder: Altering table lbs_counter(45).voucher_snapshot_15(75)"] [thread_id=1]
[2021/08/31 11:05:12.145 +00:00] [INFO] [<unknown>] ["SchemaBuilder: No schema change detected for table lbs_counter(45).voucher_snapshot_15(75), not altering"] [thread_id=1]
[2021/08/31 11:05:12.145 +00:00] [INFO] [<unknown>] ["SchemaBuilder: Altering table lbs_counter(45).voucher_snapshot_16(77)"] [thread_id=1]
[2021/08/31 11:05:12.145 +00:00] [INFO] [<unknown>] ["SchemaBuilder: No schema change detected for table lbs_counter(45).voucher_snapshot_16(77), not altering"] [thread_id=1]
[2021/08/31 11:05:12.145 +00:00] [INFO] [<unknown>] ["SchemaBuilder: Altering table lbs_counter(45).voucher_snapshot_17(79)"] [thread_id=1]
[2021/08/31 11:05:12.145 +00:00] [INFO] [<unknown>] ["SchemaBuilder: No schema change detected for table lbs_counter(45).voucher_snapshot_17(79), not altering"] [thread_id=1]
[2021/08/31 11:05:12.145 +00:00] [INFO] [<unknown>] ["SchemaBuilder: Altering table lbs_counter(45).voucher_snapshot_18(81)"] [thread_id=1]
[2021/08/31 11:05:12.145 +00:00] [INFO] [<unknown>] ["SchemaBuilder: No schema change detected for table lbs_counter(45).voucher_snapshot_18(81), not altering"] [thread_id=1]
[2021/08/31 11:05:12.145 +00:00] [INFO] [<unknown>] ["SchemaBuilder: Altering table lbs_counter(45).voucher_snapshot_19(83)"] [thread_id=1]
[2021/08/31 11:05:12.145 +00:00] [INFO] [<unknown>] ["SchemaBuilder: No schema change detected for table lbs_counter(45).voucher_snapshot_19(83), not altering"] [thread_id=1]
[2021/08/31 11:05:12.145 +00:00] [INFO] [<unknown>] ["SchemaBuilder: Altering table lbs_counter(45).voucher_snapshot_20(85)"] [thread_id=1]
[2021/08/31 11:05:12.145 +00:00] [INFO] [<unknown>] ["SchemaBuilder: No schema change detected for table lbs_counter(45).voucher_snapshot_20(85), not altering"] [thread_id=1]
[2021/08/31 11:05:12.145 +00:00] [INFO] [<unknown>] ["SchemaBuilder: Altering table lbs_counter(45).voucher_snapshot_21(87)"] [thread_id=1]
[2021/08/31 11:05:12.145 +00:00] [INFO] [<unknown>] ["SchemaBuilder: No schema change detected for table lbs_counter(45).voucher_snapshot_21(87), not altering"] [thread_id=1]
[2021/08/31 11:05:12.145 +00:00] [INFO] [<unknown>] ["SchemaBuilder: Altering table lbs_counter(45).voucher_snapshot_22(89)"] [thread_id=1]
[2021/08/31 11:05:12.145 +00:00] [INFO] [<unknown>] ["SchemaBuilder: No schema change detected for table lbs_counter(45).voucher_snapshot_22(89), not altering"] [thread_id=1]
[2021/08/31 11:05:12.145 +00:00] [INFO] [<unknown>] ["SchemaBuilder: Altering table lbs_counter(45).voucher_snapshot_23(91)"] [thread_id=1]
[2021/08/31 11:05:12.145 +00:00] [INFO] [<unknown>] ["SchemaBuilder: No schema change detected for table lbs_counter(45).voucher_snapshot_23(91), not altering"] [thread_id=1]
[2021/08/31 11:05:12.145 +00:00] [INFO] [<unknown>] ["SchemaBuilder: Altering table lbs_counter(45).voucher_snapshot_24(93)"] [thread_id=1]
[2021/08/31 11:05:12.145 +00:00] [INFO] [<unknown>] ["SchemaBuilder: No schema change detected for table lbs_counter(45).voucher_snapshot_24(93), not altering"] [thread_id=1]
[2021/08/31 11:05:12.145 +00:00] [INFO] [<unknown>] ["SchemaBuilder: Altering table lbs_counter(45).voucher_snapshot_25(95)"] [thread_id=1]
[2021/08/31 11:05:12.145 +00:00] [INFO] [<unknown>] ["SchemaBuilder: No schema change detected for table lbs_counter(45).voucher_snapshot_25(95), not altering"] [thread_id=1]
[2021/08/31 11:05:12.145 +00:00] [INFO] [<unknown>] ["SchemaBuilder: Altering table lbs_counter(45).voucher_snapshot_26(97)"] [thread_id=1]
[2021/08/31 11:05:12.145 +00:00] [INFO] [<unknown>] ["SchemaBuilder: No schema change detected for table lbs_counter(45).voucher_snapshot_26(97), not altering"] [thread_id=1]
[2021/08/31 11:05:12.145 +00:00] [INFO] [<unknown>] ["SchemaBuilder: Altering table lbs_counter(45).voucher_snapshot_27(99)"] [thread_id=1]
[2021/08/31 11:05:12.145 +00:00] [INFO] [<unknown>] ["SchemaBuilder: No schema change detected for table lbs_counter(45).voucher_snapshot_27(99), not altering"] [thread_id=1]
[2021/08/31 11:05:12.149 +00:00] [INFO] [<unknown>] ["SchemaBuilder: Altering table lp_voucher(572).lp_voucher(580)"] [thread_id=1]
[2021/08/31 11:05:12.149 +00:00] [INFO] [<unknown>] ["SchemaBuilder: No schema change detected for table lp_voucher(572).lp_voucher(580), not altering"] [thread_id=1]
[2021/08/31 11:05:12.149 +00:00] [INFO] [<unknown>] ["SchemaBuilder: Altering table lp_voucher(572).lp_voucher_new_0(589)"] [thread_id=1]
[2021/08/31 11:05:12.149 +00:00] [INFO] [<unknown>] ["SchemaBuilder: No schema change detected for table lp_voucher(572).lp_voucher_new_0(589), not altering"] [thread_id=1]
[2021/08/31 11:05:12.151 +00:00] [INFO] [<unknown>] ["SchemaBuilder: Altering table lp_transnew(576).trans_standard_new(578)"] [thread_id=1]
[2021/08/31 11:05:12.151 +00:00] [INFO] [<unknown>] ["SchemaBuilder: No schema change detected for table lp_transnew(576).trans_standard_new(578), not altering"] [thread_id=1]
[2021/08/31 11:05:12.151 +00:00] [INFO] [<unknown>] ["SchemaBuilder: Loaded all schemas."] [thread_id=1]
[2021/08/31 11:05:12.151 +00:00] [INFO] [<unknown>] ["SchemaSyncer: end sync schema, version has been updated to 679"] [thread_id=1]
[2021/08/31 11:05:12.151 +00:00] [INFO] [<unknown>] ["KVStore: Restored 0 regions. "] [thread_id=1]
[2021/08/31 11:05:12.151 +00:00] [INFO] [<unknown>] ["RegionTable: Start to restore"] [thread_id=1]
[2021/08/31 11:05:12.151 +00:00] [INFO] [<unknown>] ["RegionTable: Restore 0 tables"] [thread_id=1]
[2021/08/31 11:05:12.151 +00:00] [INFO] [<unknown>] ["BackgroundService: Configuration raft.disable_bg_flush is set to true, background flush tasks are disabled."] [thread_id=1]
[2021/08/31 11:05:12.151 +00:00] [INFO] [<unknown>] ["Application: Flash service registered"] [thread_id=1]
[2021/08/31 11:05:12.151 +00:00] [INFO] [<unknown>] ["Application: Diagnostics service registered"] [thread_id=1]
[2021/08/31 11:05:12.151 +00:00] [INFO] [<unknown>] ["grpc: /root/grpc/src/cpp/server/server_builder.cc, line number : 309, log msg : Synchronous server. Num CQs: 1, Min pollers: 1, Max Pollers: 2, CQ timeout (msec): 10000"] [thread_id=1]
[2021/08/31 11:05:12.151 +00:00] [INFO] [<unknown>] ["grpc: /root/grpc/src/core/lib/iomgr/tcp_server_posix.cc, line number : 325, log msg : Failed to add :: listener, the environment may not support IPv6: {\"created\":\"@1630407912.151512814\",\"description\":\"Address family not supported by protocol\",\"errno\":97,\"file\":\"/root/grpc/src/core/lib/iomgr/socket_utils_common_posix.cc\",\"file_line\":382,\"os_error\":\"Address family not supported by protocol\",\"syscall\":\"socket\",\"target_address\":\"[::]:3930\"}"] [thread_id=1]
[2021/08/31 11:05:12.151 +00:00] [INFO] [<unknown>] ["Application: Flash grpc server listening on [0.0.0.0:3930]"] [thread_id=1]
[2021/08/31 11:05:12.152 +00:00] [INFO] [<unknown>] ["Application: Listening http://0.0.0.0:8123"] [thread_id=1]
[2021/08/31 11:05:12.153 +00:00] [INFO] [<unknown>] ["Application: Listening tcp: 0.0.0.0:9000"] [thread_id=1]
[2021/08/31 11:05:12.153 +00:00] [INFO] [<unknown>] ["Application: Listening interserver http: 0.0.0.0:9009"] [thread_id=1]
[2021/08/31 11:05:12.154 +00:00] [INFO] [<unknown>] ["Application: Available RAM = 342.21 GiB; physical cores = 46; threads = 92."] [thread_id=1]
[2021/08/31 11:05:12.154 +00:00] [INFO] [<unknown>] ["Application: Ready for connections."] [thread_id=1]
[2021/08/31 11:05:12.154 +00:00] [INFO] [<unknown>] ["Prometheus: Config: status.metrics_interval = 15"] [thread_id=1]
[2021/08/31 11:05:12.154 +00:00] [INFO] [<unknown>] ["Prometheus: Disable prometheus push mode, cause status.metrics_addr is not set!"] [thread_id=1]
[2021/08/31 11:05:12.154 +00:00] [INFO] [<unknown>] ["Prometheus: Enable prometheus pull mode; Metrics Port = 8234"] [thread_id=1]
[2021/08/31 11:05:12.154 +00:00] [INFO] [<unknown>] ["ClusterManagerService: Registered timed cluster manager task at rate 10 seconds"] [thread_id=1]
[2021/08/31 11:05:12.154 +00:00] [INFO] [<unknown>] ["Application: let tiflash proxy start all services"] [thread_id=1]

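Here is roughly the procedure I worked out:
The TiFlash replica count for this cluster is 1. tiflash-0 keeps crashing while tiflash-1 is healthy, so tiflash-1 should already hold the replicated data, and tiflash-0 can simply be deleted. Before deleting it, though, the store has to be removed from PD, using
curl -v -X DELETE "http://pdip:2379/pd/api/v1/store/xx?force=true" where xx is the store ID
Then delete the PVC
Delete the pod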

Once tiflash-0 is rebuilt, it replicates data again. When replication finishes, tiflash-1 shrinks back and the cluster returns to normal.
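
The steps above can be sketched as a small script. This is only an illustration under several assumptions: the function name, the namespace, and the PVC and pod names are placeholders to be replaced with the real ones from `kubectl get pvc,pod`; `--wait=false` is added so that the PVC deletion does not block while the pod still mounts the volume; and the script defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Print the command when DRY_RUN=1 (the default here), run it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical helper. Arguments: PD address, stale store ID, namespace,
# PVC name, pod name. All values are supplied by the operator running it.
rebuild_tiflash_pod() {
  pd_addr=$1; store_id=$2; namespace=$3; pvc=$4; pod=$5
  # 1. Remove the stale store from PD (the force delete from the post above).
  run curl -s -X DELETE "http://${pd_addr}/pd/api/v1/store/${store_id}?force=true"
  # 2. Mark the PVC for deletion without waiting for it to disappear.
  run kubectl -n "$namespace" delete pvc "$pvc" --wait=false
  # 3. Delete the pod so that it is recreated with a fresh volume.
  run kubectl -n "$namespace" delete pod "$pod"
}
```

Assumed usage: `rebuild_tiflash_pod pdip:2379 88 my-namespace my-pvc my-pod` prints the three commands; review them, then rerun with `DRY_RUN=0` to execute.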

If anything here is wrong, corrections are welcome.

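Thanks to everyone for sharing. I eventually solved my problem by following the experience above. One Kubernetes-specific addition:

My PV policy is "reclaim", meaning the volume's data is kept (I use SSD instance store on AWS; see "Using NVMe on AWS as the TiFlash store"). So after I deleted the PV, the original data was in fact still on the PV's physical disk. As a result, even after: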

  1. ./pd-ctl -u http://localhost:2379 store remove-tombstone
  2. unbind the PVC
  3. delete the PV that has the reclaim policy
  4. delete the pod
  5. kubectl edit to set spec.tiflash.replicas=0
  6. delete the statefulset

and then bringing TiFlash back up with replicas=2, TiFlash still kept restarting, and the tombstone stores were still there:

tiflash logs

[store="id: 65 address: \"tidb-pods-pro-tiflash-0.tidb-pods-pro-tiflash-peer.tidb-cluster-pro.svc:3930\" labels { key: \"engine\" value: \"tiflash\" } version: \"v5.0.0-rc\" peer_address: \"tidb-pods-pro-tiflash-0.tidb-pods-pro-tiflash-peer.tidb-cluster-pro.svc:20170\" status_address: \"tidb-pods-pro-tiflash-0.tidb-pods-pro-tiflash-peer.tidb-cluster-pro.svc:20292\" git_hash: \"06fbf2ac0d494a9a567d077623685410e5dfc10d\" start_timestamp: 1636903391 deploy_path: \"/tiflash\""]
[2021/11/14 15:23:11.954 +00:00] [FATAL] [server.rs:683] ["failed to start node: StoreTombstone(\"store is tombstone\")"]
          
[2021/11/14 15:55:46.929 +00:00] [INFO] [node.rs:184] ["put store to PD"] [store="id: 65 address: \"tidb-pods-pro-tiflash-0.tidb-pods-pro-tiflash-peer.tidb-cluster-pro.svc:3930\" labels { key: \"engine\" value: \"tiflash\" } version: \"v5.0.0-rc\" peer_address: \"tidb-pods-pro-tiflash-0.tidb-pods-pro-tiflash-peer.tidb-cluster-pro.svc:20170\" status_address: \"tidb-pods-pro-tiflash-0.tidb-pods-pro-tiflash-peer.tidb-cluster-pro.svc:20292\" git_hash: \"06fbf2ac0d494a9a567d077623685410e5dfc10d\" start_timestamp: 1636905346 deploy_path: \"/tiflash\""]
[2021/11/14 15:55:46.929 +00:00] [FATAL] [server.rs:683] ["failed to start node: StoreTombstone(\"store is tombstone\")"]

In the end I had to manually change the PV's policy to Delete, as follows:

kubectl patch pv ${pv_name} -p '{"spec":{"persistentVolumeReclaimPolicy":"Delete"}}'

Then I repeated steps 1-5 above. Only then were the tombstone stores completely gone and TiFlash back to normal.
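
To spot this situation before deleting anything, it may help to list the PVs bound to TiFlash claims whose reclaim policy is not Delete. The sketch below is an assumption-laden illustration: the function name is made up, it expects whitespace-separated rows of PV name, reclaim policy, and claim name on stdin, and it assumes the claim name contains the string "tiflash".

```shell
#!/bin/sh
# Hypothetical helper: read rows of "<pv-name> <reclaim-policy> <claim-name>"
# on stdin and print the PVs bound to a TiFlash claim that would keep their
# data after deletion (any policy other than Delete).
retained_tiflash_pvs() {
  awk '$3 ~ /tiflash/ && $2 != "Delete" { print $1 }'
}
```

Assumed usage: `kubectl get pv --no-headers -o custom-columns=NAME:.metadata.name,POLICY:.spec.persistentVolumeReclaimPolicy,CLAIM:.spec.claimRef.name | retained_tiflash_pvs`. Each PV it prints is a candidate for the `kubectl patch pv` command above.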

