TiKV goes offline during tiup bench tpcc prepare

To help resolve things faster, please provide the following information; a clear problem description gets answered more quickly:
【TiDB environment】
tidb v5.2.1
【Overview】Scenario + problem summary
Load-test data was built with: tiup bench tpcc -H 127.0.0.1 -P 4000 -D tpcc --warehouses 5000 -T 2000 run
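(For reference, a minimal sketch of the usual tiup bench tpcc flow with the connection parameters and warehouse count above; prepare loads the data that run then benchmarks, and the --time value below is illustrative, not taken from this thread:)

```bash
# Load the TPC-C data set (same connection parameters as above)
tiup bench tpcc -H 127.0.0.1 -P 4000 -D tpcc --warehouses 5000 -T 2000 prepare

# Run the benchmark for a fixed duration (duration is an example value)
tiup bench tpcc -H 127.0.0.1 -P 4000 -D tpcc --warehouses 5000 -T 2000 --time 10m run

# Optional: consistency check of the loaded data
tiup bench tpcc -H 127.0.0.1 -P 4000 -D tpcc --warehouses 5000 check
```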

【Background】Operations performed
【Symptoms】Application and database symptoms
One TiKV node went offline and keeps failing to reconnect with errors; its logs cannot be viewed from the TiDB monitoring dashboard.
【Business impact】
【TiDB version】
【Attachments】

[2021/10/25 16:31:35.783 +08:00] [INFO] [] ["New connected subchannel at 0x7f814442e3c0 for subchannel 0x7f80ac36a200"]
[2021/10/25 16:31:35.783 +08:00] [INFO] [util.rs:544] ["connecting to PD endpoint"] [endpoints=http://192.168.10.147:2379]
[2021/10/25 16:31:35.784 +08:00] [INFO] [util.rs:668] ["connected to PD member"] [endpoints=http://192.168.10.147:2379]
[2021/10/25 16:31:35.784 +08:00] [INFO] [util.rs:202] ["heartbeat sender and receiver are stale, refreshing ..."]
[2021/10/25 16:31:35.784 +08:00] [INFO] [tso.rs:148] ["TSO worker terminated"] [receiver_cause=None] [sender_cause=None]
[2021/10/25 16:31:35.784 +08:00] [INFO] [client.rs:136] ["TSO stream is closed, reconnect to PD"]
[2021/10/25 16:31:35.785 +08:00] [INFO] [util.rs:230] ["update pd client"] [via=] [leader=http://192.168.10.147:2379] [prev_via=] [prev_leader=http://192.168.10.147:2379]
[2021/10/25 16:31:35.785 +08:00] [INFO] [util.rs:357] ["trying to update PD client done"] [spend=2.570008ms]
[2021/10/25 16:31:35.785 +08:00] [WARN] [client.rs:138] ["failed to update PD client"] [error="Other(\"[components/pd_client/src/util.rs:306]: cancel reconnection due to too small interval\")"]
[2021/10/25 16:31:35.785 +08:00] [WARN] [mod.rs:521] ["failed to register addr to pd"] [body=Body(Streaming)] ["status code"=400]
[2021/10/25 16:31:35.785 +08:00] [WARN] [mod.rs:528] ["failed to reconnect pd client"] [err="Other(\"[components/pd_client/src/util.rs:306]: cancel reconnection due to too small interval\")"]
[2021/10/25 16:31:36.086 +08:00] [INFO] [util.rs:544] ["connecting to PD endpoint"] [endpoints=http://192.168.10.146:2381]
[2021/10/25 16:31:36.086 +08:00] [INFO] [] ["New connected subchannel at 0x7f814382e210 for subchannel 0x7f814305c000"]
[2021/10/25 16:31:36.087 +08:00] [INFO] [util.rs:544] ["connecting to PD endpoint"] [endpoints=http://192.168.10.147:2379]
[2021/10/25 16:31:36.088 +08:00] [INFO] [util.rs:668] ["connected to PD member"] [endpoints=http://192.168.10.147:2379]
[2021/10/25 16:31:36.088 +08:00] [INFO] [util.rs:202] ["heartbeat sender and receiver are stale, refreshing ..."]
[2021/10/25 16:31:36.088 +08:00] [INFO] [tso.rs:148] ["TSO worker terminated"] [receiver_cause=None] [sender_cause=None]
[2021/10/25 16:31:36.088 +08:00] [INFO] [util.rs:230] ["update pd client"] [via=] [leader=http://192.168.10.147:2379] [prev_via=] [prev_leader=http://192.168.10.147:2379]
[2021/10/25 16:31:36.088 +08:00] [INFO] [util.rs:357] ["trying to update PD client done"] [spend=2.259162ms]
[2021/10/25 16:31:36.174 +08:00] [FATAL] [lib.rs:465] ["[region 80460] 80462 applying snapshot failed"] [backtrace="stack backtrace:
0: tikv_util::set_panic_hook::{{closure}}
at components/tikv_util/src/lib.rs:464
1: std::panicking::rust_panic_with_hook
at library/std/src/panicking.rs:626
2: std::panicking::begin_panic_handler::{{closure}}
at library/std/src/panicking.rs:519
3: std::sys_common::backtrace::__rust_end_short_backtrace
at library/std/src/sys_common/backtrace.rs:141
4: rust_begin_unwind
at library/std/src/panicking.rs:515
5: std::panicking::begin_panic_fmt
at library/std/src/panicking.rs:457
6: raftstore::store::peer_storage::PeerStorage<EK,ER>::check_applying_snap
at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tikv/components/raftstore/src/store/peer_storage.rs:1388
7: raftstore::store::peer::Peer<EK,ER>::handle_raft_ready_append
at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tikv/components/raftstore/src/store/peer.rs:1580
8: raftstore::store::fsm::peer::PeerFsmDelegate<EK,ER,T>::collect_ready
at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tikv/components/raftstore/src/store/fsm/peer.rs:1058
<raftstore::store::fsm::store::RaftPoller<EK,ER,T> as batch_system::batch::PollHandler<raftstore::store::fsm::peer::PeerFsm<EK,ER>,raftstore::store::fsm::store::StoreFsm>>::handle_normal
at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tikv/components/raftstore/src/store/fsm/store.rs:926
9: batch_system::batch::Poller<N,C,Handler>::poll
at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tikv/components/batch-system/src/batch.rs:414
10: batch_system::batch::BatchSystem<N,C>::start_poller::{{closure}}
at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tikv/components/batch-system/src/batch.rs:541
std::sys_common::backtrace::__rust_begin_short_backtrace
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/sys_common/backtrace.rs:125
11: std::thread::Builder::spawn_unchecked::{{closure}}::{{closure}}
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/thread/mod.rs:476
<std::panic::AssertUnwindSafe as core::ops::function::FnOnce<()>>::call_once
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panic.rs:347
std::panicking::try::do_call
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:401
std::panicking::try
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:365
std::panic::catch_unwind
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panic.rs:434
std::thread::Builder::spawn_unchecked::{{closure}}
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/thread/mod.rs:475
core::ops::function::FnOnce::call_once{{vtable.shim}}
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/core/src/ops/function.rs:227
12: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce>::call_once
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/alloc/src/boxed.rs:1572
<alloc::boxed::Box<F,A> as core::ops::function::FnOnce>::call_once
at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/alloc/src/boxed.rs:1572
std::sys::unix::thread::thread::new::thread_start
at library/std/src/sys/unix/thread.rs:91
13: start_thread
14: __clone
"] [location=/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tikv/components/raftstore/src/store/peer_storage.rs:1388] [thread_name=raftstore-1-0]

  1. TiUP Cluster Display information

  2. TiUP Cluster Edit Config information

  3. TiDB-Overview monitoring

  • Logs of the relevant modules (covering the hour before and after the issue)

For this issue, could you provide the complete TiKV log around the panic error? Also, what does this cluster's topology look like?
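(For reference, a minimal sketch of how that information could be collected; the cluster name tidb-online is taken from later in the thread, and the log path is a placeholder for the instance's actual deploy directory:)

```bash
# Show the cluster topology and node status
tiup cluster display tidb-online

# Extract the panic context from the affected TiKV instance's log;
# the path below is a placeholder -- use the instance's real deploy_dir/log path.
grep -n -B 100 -A 100 'FATAL' /path/to/deploy/tikv-20160/log/tikv.log > tikv-panic.log
```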


Thanks for the reply.
How do I get the panic log?
The topology is as follows; in short:
three machines, each with 80 cores and 256 GB of RAM, and each machine hosts 2 PD, 4 TiKV, 2 TiDB, and 1 TiFlash.
A screenshot can't capture all of it.

What I have tried so far:
I stopped the load test and then ran tiup cluster scale-in tidb-online --node 192.168.10.146:20160
to scale in the problematic node.
Its log still showed the same error, but after a while the problematic TiKV node's status changed to Tombstone.
After running tiup cluster prune, the scale-in completed and the cluster was back to normal.
Will data be lost in a situation like this?
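(For reference, a minimal sketch of that scale-in sequence with a status check in between; these are standard tiup cluster subcommands and the node address is the one above:)

```bash
# Scale in the problematic TiKV instance
tiup cluster scale-in tidb-online --node 192.168.10.146:20160

# Watch the store drain: the node should go Pending Offline -> Tombstone
# once PD has moved its Region replicas and leaders to other stores.
tiup cluster display tidb-online

# Remove Tombstone nodes once the status has flipped
tiup cluster prune tidb-online
```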


Other nodes also dropped offline along the way, but they could be restarted and recovered right away. For now I have limited:
rocksdb.defaultcf.block-cache-size 20000MiB
rocksdb.writecf.block-cache-size 13000MiB
and scaled each machine down to 3 TiKV instances; I'll rerun the load test and see whether TiKV still crashes.
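(For reference, a hedged sketch of how limits like these are usually applied with TiUP; the YAML to add under server_configs -> tikv is shown as comments, the values are the ones above, and note that on v5.x the shared block cache cap storage.block-cache.capacity is the usual knob for total cache memory:)

```bash
# Edit the topology and add the limits under server_configs -> tikv, e.g.:
#   server_configs:
#     tikv:
#       rocksdb.defaultcf.block-cache-size: "20000MiB"
#       rocksdb.writecf.block-cache-size: "13000MiB"
#       # On v5.x the shared block cache is enabled by default; its cap is:
#       # storage.block-cache.capacity: "20GiB"
tiup cluster edit-config tidb-online

# Roll the change out to the TiKV instances
tiup cluster reload tidb-online -R tikv
```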


This log shows that the TiKV node panicked, so I suggest grabbing more log information from that TiKV node to see what caused the panic ~


Hi, since I scaled in that abnormal node, its directory has been deleted and the logs can no longer be retrieved (will this kind of abnormal situation lose data?). I ran another round of load testing last night, and by this morning TiDB was basically unable to serve requests. The tpcc script reported:
exec statement error: context canceled, may try again later…
After stopping the load-test script, TiDB returned to normal, but the server logs keep reporting errors.
The logs are attached:
tikv.log (14.6 MB)
Since the logs were rotated, the earlier logs were uploaded to Baidu Cloud.
Link: Baidu Netdisk (link no longer available), extraction code: t7ku


With the default 3 replicas and a multi-instance-per-host + label deployment, scaling in one TiKV node theoretically will not cause data loss, because PD balances and transfers the Region replicas and leaders on the node being scaled in to the other stores ~
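(For reference, a minimal sketch of how to confirm this from PD's side after the scale-in; the PD address is the one that appears in the logs above, and store / region check are standard pd-ctl subcommands:)

```bash
# Check store states and region/leader counts: the removed store should end up
# Tombstone with 0 regions, and the remaining stores should hold its replicas.
tiup ctl:v5.2.1 pd -u http://192.168.10.147:2379 store

# Look for Regions missing replicas; an empty result means every Region
# still has its full replica set.
tiup ctl:v5.2.1 pd -u http://192.168.10.147:2379 region check miss-peer
```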

While you were running the load test, what did the CPU, IO, and NIC consumption of the whole TiDB cluster look like?


CPU and NIC usage were both very low, but IO was completely saturated; most of it was dm/jbd2.
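(For reference, a minimal sketch of how that kind of IO saturation is usually confirmed; iostat and pidstat come from the sysstat package and are not TiDB-specific:)

```bash
# Per-device utilization and latency -- look for %util pinned near 100%
iostat -x 1

# Per-process IO to see what is actually writing; jbd2 itself is the ext4
# journal thread, so heavy jbd2 usually points at fsync-heavy writers on that filesystem.
pidstat -d 1
```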

Let's clarify the question: for the latest round of load testing, is your goal to see how the TiDB cluster performs under tpcc, and whether the current topology and configuration still leave room for performance improvement?

Yes. I don't know how many TiKV instances I should deploy per machine given the specs of these three physical machines, but as soon as I apply load, instances drop offline or report errors. After lowering the thread count to 500, the final tpcc result was only a bit over 40,000, which feels far below the hardware's ceiling (it seems to be limited by IO).

When installing and deploying a TiDB cluster for production, co-locating TiDB, TiKV, TiFlash, and PD on the same server is not recommended. If performance matters but you also need to control cost, you can refer to the documents below for deployment and configuration, e.g. NUMA core binding, and giving PD, TiKV, and TiFlash dedicated physical disks to avoid IO contention (a hypothetical topology fragment is sketched after the two links):

https://docs.pingcap.com/zh/tidb/stable/three-nodes-hybrid-deployment#三节点混合部署的最佳实践

https://docs.pingcap.com/zh/tidb/stable/hybrid-deployment-topology#混合部署拓扑
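(For reference, a hypothetical TiUP topology fragment illustrating those points for one host: NUMA binding, one data disk per instance, and host labels so PD spreads replicas across machines. Hosts, ports, paths, and NUMA node IDs are placeholders, not this cluster's actual values; the YAML is shown as comments to be applied via edit-config:)

```bash
# Hypothetical topology fragment (added via tiup cluster edit-config):
#   tikv_servers:
#     - host: 192.168.10.146
#       port: 20160
#       status_port: 20180
#       data_dir: /data1/tikv-20160      # dedicated physical disk per instance
#       numa_node: "0"                   # bind this instance to one NUMA node
#       config:
#         server.labels: { host: "host-146" }
#     - host: 192.168.10.146
#       port: 20161
#       status_port: 20181
#       data_dir: /data2/tikv-20161
#       numa_node: "1"
#       config:
#         server.labels: { host: "host-146" }
#   server_configs:
#     pd:
#       replication.location-labels: ["host"]   # matching location labels for PD
tiup cluster edit-config tidb-online
tiup cluster reload tidb-online
```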

When you run into write or query performance problems, you can refer to the posts below.

For the high jbd2 usage, there are related posts on this site, for example:

The IO on all three TiKV servers is very high, especially jbd2/vdb1-8, with IO usage above 50%.

Hidden esc pitfall: the jbd2 process's IO usage is extremely high, and system IO stays at 100% for long periods

tidb 4.0-rc: TiKV node data disk IO usage at 100%
