Upgrading from 7.1.1 to 7.5.3

During a no-downtime rolling upgrade from 7.1.1 to 7.5.3, TiFlash kept hitting CompactLog messages and could never be restarted, so the upgrade timed out and failed.

[command.rs:169] ["can't flush data, filter CompactLog"] [compact_term=7312] [compact_index=21512882] [term=7312] [index=21512884] [region_epoch="conf_ver: 393569 version: 24647"] [region_id=7185774]

Does anyone have any good suggestions? Do I need to stop TiFlash first and then do the rolling upgrade?

Have you observed whether TiFlash can eventually start? Is it just slow to start?

If it is just slow to start, raise --transfer-timeout.
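
For example, a sketch of the retry (the cluster name datacenter-tidb and the 1200s values are assumptions; tune them to your cluster):

tiup cluster upgrade datacenter-tidb v7.5.3 --transfer-timeout 1200 --wait-timeout 1200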

It just never shuts down.

The TiFlash upgrade docs mention the tiup version; try upgrading tiup?
TiFlash Upgrade Guide | PingCAP Docs
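
If tiup itself is outdated, upgrading it and the cluster component looks roughly like:

tiup update --self
tiup update cluster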

tiup already reports: component cluster version v1.16.0 is already installed

The TiKV leader periodically sends a message urging TiFlash to flush its data to disk, so that raft logs that have already been applied can be cleaned up. This process is called CompactLog.
CompactLog requests can be rejected by TiFlash, usually because the currently cached writes are too small and flushing them would waste resources.
Logs like this are normal and do not indicate an error.
The problem you are hitting is that TiFlash cannot be restarted, which has no direct relation to CompactLog. Please share the exact upgrade steps so we can diagnose.
Also, please provide the tiflash.log of the node that cannot be shut down, so we can check whether a thread is stuck (see the sketch below).
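
To check for a stuck thread while the process refuses to exit, one option is to dump its stacks (assuming pstack or gdb is installed on the host, and that the process is named tiflash):

ps -ef | grep tiflash    # find the TiFlash PID
pstack <PID>             # or: gdb -p <PID> --batch -ex 'thread apply all bt'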

It just won't shut down... how bizarre.

1. Built the offline upgrade package following the docs, then upgraded.
2. Used the no-downtime rolling upgrade: tiup cluster upgrade datacenter-tidb v7.5.3 --wait-timeout 600
Observed problem: TiFlash could never be shut down, so the upgrade eventually timed out and failed.

So tiup cluster display shows the corresponding TiFlash's status as Up, and ps also shows the TiFlash process is still there?
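
That is, checked with something like (cluster name taken from your command above):

tiup cluster display datacenter-tidb
ps -ef | grep tiflash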

Yes. The service is completely normal, with no sign of a restart at all.

What does tiup output? Any errors?

2024-08-29T14:22:28.545+0800 INFO Execute command finished {"code": 1, "error": "failed to restart: 10.2.16.155 tiflash-9000.service, please check the instance's log(/data1/tidb-deploy/tiflash-9000/log) for more detail.: timed out waiting for tiflash 10.2.16.155:9000 to be ready after 600s: Get \"http://10.2.16.155:20292/tiflash/store-status\": dial tcp 10.2.16.155:20292: connect: connection refused", "errorVerbose": "Get \"http://10.2.16.155:20292/tiflash/store-status\": dial tcp 10.2.16.155:20292: connect: connection refused
timed out waiting for tiflash 10.2.16.155:9000 to be ready after 600s
github.com/pingcap/tiup/pkg/cluster/spec.(*TiFlashInstance).Ready
    github.com/pingcap/tiup/pkg/cluster/spec/tiflash.go:954
github.com/pingcap/tiup/pkg/cluster/operation.restartInstance
    github.com/pingcap/tiup/pkg/cluster/operation/action.go:366
github.com/pingcap/tiup/pkg/cluster/operation.upgradeInstance
    github.com/pingcap/tiup/pkg/cluster/operation/upgrade.go:257
github.com/pingcap/tiup/pkg/cluster/operation.Upgrade
    github.com/pingcap/tiup/pkg/cluster/operation/upgrade.go:187
github.com/pingcap/tiup/pkg/cluster/manager.(*Manager).Upgrade.func2
    github.com/pingcap/tiup/pkg/cluster/manager/upgrade.go:320
github.com/pingcap/tiup/pkg/cluster/task.(*Func).Execute
    github.com/pingcap/tiup/pkg/cluster/task/func.go:34
github.com/pingcap/tiup/pkg/cluster/task.(*Serial).Execute
    github.com/pingcap/tiup/pkg/cluster/task/task.go:86
github.com/pingcap/tiup/pkg/cluster/manager.(*Manager).Upgrade
    github.com/pingcap/tiup/pkg/cluster/manager/upgrade.go:324
github.com/pingcap/tiup/components/cluster/command.newUpgradeCmd.func1
    github.com/pingcap/tiup/components/cluster/command/upgrade.go:60
github.com/spf13/cobra.(*Command).execute
    github.com/spf13/cobra@v1.6.1/command.go:916
github.com/spf13/cobra.(*Command).ExecuteC
    github.com/spf13/cobra@v1.6.1/command.go:1044
github.com/spf13/cobra.(*Command).Execute
    github.com/spf13/cobra@v1.6.1/command.go:968
github.com/pingcap/tiup/components/cluster/command.Execute
    github.com/pingcap/tiup/components/cluster/command/root.go:297
main.main
    github.com/pingcap/tiup/components/cluster/main.go:23
runtime.main
    runtime/proc.go:267
runtime.goexit
    runtime/asm_amd64.s:1650
failed to restart: 10.2.16.155 tiflash-9000.service, please check the instance's log(/data1/tidb-deploy/tiflash-9000/log) for more detail."}

tiflash log:

[2024/08/29 14:11:29.042 +08:00] [ERROR] [kv.rs:753] ["KvService::batch_raft send response fail"] [err=RemoteStopped]
[2024/08/29 14:11:29.042 +08:00] [ERROR] [kv.rs:753] ["KvService::batch_raft send response fail"] [err=RemoteStopped]
[2024/08/29 14:11:29.042 +08:00] [ERROR] [kv.rs:753] ["KvService::batch_raft send response fail"] [err=RemoteStopped]
[2024/08/29 14:11:29.042 +08:00] [ERROR] [kv.rs:753] ["KvService::batch_raft send response fail"] [err=RemoteStopped]
[2024/08/29 14:11:29.042 +08:00] [ERROR] [kv.rs:753] ["KvService::batch_raft send response fail"] [err=RemoteStopped]
[2024/08/29 14:11:29.044 +08:00] [ERROR] [kv.rs:753] ["KvService::batch_raft send response fail"] [err=RemoteStopped]
[2024/08/29 14:11:29.044 +08:00] [ERROR] [kv.rs:753] ["KvService::batch_raft send response fail"] [err=RemoteStopped]
[2024/08/29 14:11:29.044 +08:00] [ERROR] [kv.rs:753] ["KvService::batch_raft send response fail"] [err=RemoteStopped]
[2024/08/29 14:11:29.044 +08:00] [INFO] [mod.rs:256] ["Storage stopped."]
[2024/08/29 14:11:29.822 +08:00] [INFO] [node.rs:497] ["stop raft store thread"] [store_id=223]
[2024/08/29 14:11:29.823 +08:00] [INFO] [batch.rs:585] ["shutdown batch system apply"]
[2024/08/29 14:11:29.826 +08:00] [INFO] [router.rs:333] ["broadcasting shutdown"]
[2024/08/29 14:11:29.931 +08:00] [INFO] [batch.rs:598] ["batch system apply is stopped."]
[2024/08/29 14:11:29.931 +08:00] [INFO] [batch.rs:585] ["shutdown batch system raftstore-223"]
[2024/08/29 14:11:29.931 +08:00] [INFO] [router.rs:333] ["broadcasting shutdown"]
[2024/08/29 14:11:30.055 +08:00] [INFO] [batch.rs:598] ["batch system raftstore-223 is stopped."]
[2024/08/29 14:11:30.055 +08:00] [INFO] [forwarder.rs:113] ["shutdown tiflash observer"] [store_id=223]
[2024/08/29 14:11:30.105 +08:00] [INFO] [forwarder.rs:113] ["shutdown tiflash observer"] [store_id=223]
[2024/08/29 14:11:30.134 +08:00] [INFO] [run.rs:231] ["all services in raft-store proxy are stopped"]
[2024/08/29 14:11:30.134 +08:00] [INFO] [run.rs:233] ["wait for engine-store server to stop"]
[2024/08/29 14:11:36.406 +08:00] [INFO] [store.rs:625] ["failed to schedule store tick, are we shutting down?"] [err="\"SendError(…)\""] [tick=CompactLockCf]
[2024/08/29 14:11:37.775 +08:00] [INFO] [run.rs:239] ["engine-store server is stopped"]
[2024/08/29 14:11:38.660 +08:00] [INFO] [lib.rs:55] ["Welcome To RaftStore Proxy"]
[2024/08/29 14:11:38.685 +08:00] [INFO] [lib.rs:57] ["Git Commit Hash: 521fd9dbc55e58646045d88f91c3c35db50b5981"]
[2024/08/29 14:11:38.685 +08:00] [INFO] [lib.rs:57] ["Git Commit Branch: HEAD"]
[2024/08/29 14:11:38.685 +08:00] [INFO] [lib.rs:57] ["UTC Build Time: 2024-08-01 04:24:59"]
[2024/08/29 14:11:38.685 +08:00] [INFO] [lib.rs:57] ["Rust Version: rustc 1.67.0-nightly (96ddd32c4 2022-11-14)"]
[2024/08/29 14:11:38.685 +08:00] [INFO] [lib.rs:57] ["Storage Engine: tiflash"]
[2024/08/29 14:11:38.685 +08:00] [INFO] [lib.rs:57] ["Prometheus Prefix: tiflash_proxy_"]
[2024/08/29 14:11:38.685 +08:00] [INFO] [lib.rs:57] ["Profile: release"]
[2024/08/29 14:11:38.685 +08:00] [INFO] [lib.rs:57] ["Enable Features: portable sse test-engine-kv-rocksdb test-engine-raft-raft-engine cloud-aws cloud-gcp cloud-azure"]
[2024/08/29 14:11:38.685 +08:00] [INFO] [mod.rs:80] ["cgroup quota: memory=Some(9223372036854771712), cpu=None, cores={10, 13, 18, 23, 37, 15, 45, 2, 44, 12, 5, 21, 26, 4, 39, 1, 32, 41, 6, 42, 47, 9, 11, 0, 25, 19, 30, 7, 27, 34, 28, 35, 16, 46, 29, 17, 31, 33, 40, 24, 22, 38, 8, 14, 3, 20, 36, 43}"]
[2024/08/29 14:11:38.685 +08:00] [INFO] [mod.rs:87] ["memory limit in bytes: 134317207552, cpu cores quota: 48"]
[2024/08/29 14:11:38.685 +08:00] [INFO] [config.rs:726] ["kernel parameters"] [value=32768] [param=net.core.somaxconn]
[2024/08/29 14:11:38.685 +08:00] [INFO] [config.rs:726] ["kernel parameters"] [value=0] [param=net.ipv4.tcp_syncookies]
[2024/08/29 14:11:38.685 +08:00] [INFO] [config.rs:726] ["kernel parameters"] [value=0] [param=vm.swappiness]
[2024/08/29 14:11:38.685 +08:00] [INFO] [run.rs:389] ["bootstrap with tikv-raft-engine"]
[2024/08/29 14:11:38.685 +08:00] [INFO] [util.rs:637] ["connecting to PD endpoint"] [endpoints=10.2.16.155:2379]
[2024/08/29 14:11:38.685 +08:00] [INFO] [] ["TCP_USER_TIMEOUT is available. TCP_USER_TIMEOUT will be used thereafter"]
[2024/08/29 14:11:38.774 +08:00] [INFO] [util.rs:637] ["connecting to PD endpoint"] [endpoints=10.2.17.130:2379]

It feels like the stop was sent but the process never died. Yet when I issue a stop manually, it does take the service down (I tried that separately afterwards).
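
The manual stop/start was along these lines (-N targets the single node from the tiup error above):

tiup cluster stop datacenter-tidb -N 10.2.16.155:9000
tiup cluster start datacenter-tidb -N 10.2.16.155:9000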

  1. Is 14:11 the last entry in tiflash_tikv.log? If so, tail the other log file, named something like tiflash.log or server.log, e.g. as shown below.
  2. 14:22 shows the restart failed. What happened after that? For example, is it in the Up state now?
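
For example (log path taken from the tiup error above):

tail -f /data1/tidb-deploy/tiflash-9000/log/tiflash.log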

14:11 is when I kicked off the upgrade. There are logs continuously after that, and the service was Up the whole time. Later I stopped it manually once (about 30s) and then brought it back up (about 1 minute).

tiflash.log (124.1 KB)
These are the logs from 14:11 to 14:13. In any case the service never died; after that it's just more of the same.

Judging from the log, TiFlash restarted at the moment below. In other words, TiFlash was in fact shut down:
[2024/08/29 14:11:38.660 +08:00] [INFO] [lib.rs:55] ["Welcome To RaftStore Proxy"]

From the tiup error, TiFlash was not ready to serve within 600s after the restart. This is not a failed restart either; it is because TiFlash has to catch up after restarting. For example, if the cluster has a heavy write workload, or TiFlash was already lagging before the restart, or the TiFlash node hosts many Regions, then after the restart TiFlash spends a long time catching up with the latest data from TiKV. During that window its data is still behind, so it cannot serve requests, which is exactly the error tiup reported.

So the real cause is that after the first restart, TiFlash needed a long time to restore and catch up with data. Your second, manual restart recovered quickly precisely because the data had already caught up. And as you later saw, the cluster being in the Up state means the restart had in fact already happened.

So can I fix this just by increasing the timeout?

Generally, when tiup errors out like the above, if you run display right away, TiFlash may well not be in the Up state. At that point you can indeed just keep waiting for TiFlash to catch up.

You can track TiFlash's startup state and replication progress through the TiFlash Grafana panels.
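
You can also check replica availability and catch-up progress with SQL, for example (a sketch; <tidb-host> is a placeholder and 4000 is the default TiDB port):

mysql -h <tidb-host> -P 4000 -u root -e "SELECT TABLE_SCHEMA, TABLE_NAME, AVAILABLE, PROGRESS FROM information_schema.tiflash_replica;"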