A lot of errors were reported during TiDB cluster stress testing; the database status currently looks normal

【TiDB Environment】Production / Test / PoC
【TiDB Version】7.5.1
【Reproduction Path】Long-running stress test under moderate load
【Problem: Symptoms and Impact】
Around 3 a.m. on April 13 the application team's stress test failed and was interrupted, and the database logged a pile of errors.
tikv.zip (21.8 KB)
tidb.zip (4.2 KB)
pd.zip (2.4 KB)

Some errors I can't make sense of:
TiDB-related:

[2024/04/13 04:02:57.080 +08:00] [Error] [pd_service_discovery.go:534] ["[pd] failed to update member"] [urls="[http://192.168.19.205:2379,http://192.168.19.206:2379,http://192.168.19.207:2379]"] [error="[PD:client:ErrClientGetMember]get member failed"]
[2024/04/13 04:03:51.257 +08:00] [Error] [domain.go:1894] ["update bindinfo failed"] [error="[tikv:9005]Region is unavailable"]
[2024/04/13 04:03:53.263 +08:00] [Error] [domain.go:901] ["reload schema in loop failed"] [error="[tikv:9005]Region is unavailable"]
[2024/04/13 04:04:10.490 +08:00] [Error] [domain.go:1713] ["load privilege failed"] [error="[tikv:9005]Region is unavailable"]
[2024/04/13 04:04:33.392 +08:00] [Error] [domain.go:901] ["reload schema in loop failed"] [error="[tikv:9005]Region is unavailable"]
[2024/04/13 04:04:40.521 +08:00] [Error] [2pc.go:1544] ["Async commit/1PC result undetermined"] [conn=4020837744] [session_alias=] [error="region unavailable"] [rpcErr="rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout"] [txnStartTS=449040128571081007]
[2024/04/13 04:04:40.522 +08:00] [Error] [conn.go:1132] ["result undetermined, close this connection"] [conn=4020837744] [session_alias=] [error="previous statement: update clerk_term_daily_report set all_clerk_paid_chances=0.000000000000, all_rdc_paid_chances=0.000000000000, clerk_cancel_chances=0, clerk_cancel_ticket_cnt=0, clerk_canceled_chances=0, clerk_canceled_ticket_cnt=0, clerk_paid_amt=0.00, clerk_paid_ticket_cnt=0, rdc_cancel_chances=0, rdc_cancel_ticket_cnt=0, rdc_paid_amt=0.00, rdc_paid_ticket_cnt=0, rdc_withdrawed_amt=0, sale_chances=1476, sale_ticket_cnt=1476, withdraw_amt=0, withdrawed_amt=0 where term_id=2097411 and rpt_date='2024-04-13' and game_id=200 and clerk_id=42004: [global:2]execution result undetermined"]
[2024/04/13 04:05:11.525 +08:00] [Error] [domain.go:1894] ["update bindinfo failed"] [error="[tikv:9005]Region is unavailable"]

PD component:

[2024/04/13 04:02:41.920 +08:00] [Error] [etcdutil.go:157] ["load from etcd meet error"] [key=/pd/7330093184493721719/gc/safe_point] [error="[PD:etcd:ErrEtcdKVGet]context deadline exceeded: context deadline exceeded"]
[2024/04/13 04:02:43.218 +08:00] [Error] [etcdutil.go:157] ["load from etcd meet error"] [key=/pd/7330093184493721719/timestamp] [error="[PD:etcd:ErrEtcdKVGet]context deadline exceeded: context deadline exceeded"]
[2024/04/13 04:02:43.224 +08:00] [Error] [middleware.go:217] ["redirect but server is not leader"] [from=pd-192.168.19.205-2379] [server=pd-192.168.19.205-2379] [error="[PD:apiutil:ErrRedirect]redirect failed"]
[2024/04/13 04:03:13.285 +08:00] [Error] [cluster.go:2047] ["get members error"] [error="[PD:etcd:ErrEtcdMemberList]context deadline exceeded: context deadline exceeded"]

TiKV component:

[2024/04/13 04:02:48.599 +08:00] [Error] [kv.rs:781] ["KvService::batch_raft send response fail"] [err=RemoteStopped] [thread_id=0x5]
[2024/04/13 04:02:48.599 +08:00] [Error] [kv.rs:781] ["KvService::batch_raft send response fail"] [err=RemoteStopped] [thread_id=0x5]
[2024/04/13 04:02:48.599 +08:00] [Error] [kv.rs:781] ["KvService::batch_raft send response fail"] [err=RemoteStopped] [thread_id=0x5]
[2024/04/13 04:02:48.659 +08:00] [Error] [kv.rs:774] ["dispatch raft msg from gRPC to raftstore fail"] [err=Grpc(RpcFinished(None))] [thread_id=0x5]
[2024/04/13 04:02:48.659 +08:00] [Error] [kv.rs:781] ["KvService::batch_raft send response fail"] [err=RemoteStopped] [thread_id=0x5]
[2024/04/13 04:02:48.672 +08:00] [Error] [kv.rs:956] ["batch_commands error"] [err="RpcFinished(Some(RpcStatus { code: 0-OK, message: \"\", details: [] }))"] [thread_id=0x5]
[2024/04/13 04:02:48.672 +08:00] [Error] [kv.rs:956] ["batch_commands error"] [err="RpcFinished(Some(RpcStatus { code: 0-OK, message: \"\", details: [] }))"] [thread_id=0x5]
[2024/04/13 04:02:48.672 +08:00] [Error] [util.rs:496] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [thread_id=0x5]
[2024/04/13 04:02:48.731 +08:00] [Error] [raft_client.rs:585] ["connection aborted"] [addr=192.168.19.206:20161] [receiver_err="Some(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"keepalive watchdog timeout\", details: [] }))"] [sink_error="Some(RpcFinished(Some(RpcStatus { code: 14-UNAVAILABLE, message: \"keepalive watchdog timeout\", details: [] })))"] [store_id=2] [thread_id=0x5]
[2024/04/13 04:02:48.733 +08:00] [Error] [kv.rs:956] ["batch_commands error"] [err="RpcFinished(Some(RpcStatus { code: 0-OK, message: \"\", details: [] }))"] [thread_id=0x5]
[2024/04/13 04:02:48.772 +08:00] [Error] [raft_client.rs:904] ["connection abort"] [addr=192.168.19.206:20161] [store_id=2] [thread_id=0x5]
[2024/04/13 04:02:48.790 +08:00] [Error] [kv.rs:774] ["dispatch raft msg from gRPC to raftstore fail"] [err=Grpc(RpcFinished(None))] [thread_id=0x5]
[2024/04/13 04:02:48.805 +08:00] [Error] [kv.rs:781] ["KvService::batch_raft send response fail"] [err=RemoteStopped] [thread_id=0x5]
[2024/04/13 04:02:48.837 +08:00] [Error] [kv.rs:956] ["batch_commands error"] [err="RpcFinished(Some(RpcStatus { code: 0-OK, message: \"\", details: [] }))"] [thread_id=0x5]

[Monitoring screenshots omitted: system metrics, PD, TiDB, and TiKV panels]

The TiKV regions have problems, and PD's etcd has problems. You can check with pd-ctl whether any regions are still unhealthy (see below).
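A minimal sketch of the check, assuming a TiUP deployment; adjust the ctl version and PD address to your cluster:

# List regions that still have down or pending peers
tiup ctl:v7.5.1 pd -u http://192.168.19.205:2379 region check down-peer
tiup ctl:v7.5.1 pd -u http://192.168.19.205:2379 region check pending-peer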

Would restarting the cluster bring it back to normal?

Marking this thread to follow.

We didn't restart. It looks normal at the moment, so it seems to have recovered on its own.

It looks like these errors were thrown because the network or the disks hit their limits during the stress test.

The load during the stress test wasn't that high; we were pushing at less than half of the rated maximum.

Its endurance under sustained load is still a bit lacking. Can the hardware configuration be upgraded further?

Please post the topology. It feels like either the deployment layout is unreasonable or the resources are insufficient.

It's a mixed deployment on 3 nodes: 6 TiKV, 3 PD, and 3 TiDB instances.

Er, what are the specs of each machine? Three servers hosting 6 TiKV, 3 PD, and 3 TiDB?

They are physical machines with 128 GB of RAM each, and overall CPU load isn't high either.

So each machine runs 2 TiKV, 1 PD, and 1 tidb-server?
What is TiKV's storage.block-cache.capacity set to?
SHOW CONFIG WHERE NAME LIKE '%storage.block-cache.capacity%';
I suggest changing it to 14G:
SET CONFIG tikv `storage.block-cache.capacity`='14G';
What is TiDB's tidb_server_memory_limit set to?
SHOW GLOBAL VARIABLES LIKE '%tidb_server_memory_limit%';
I suggest changing it to 32G:
SET GLOBAL tidb_server_memory_limit='32G';
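A quick way to double-check the values after changing them (a minimal sketch; the exact name filters are just examples). Note that SET CONFIG only changes the running instances; to keep the value across restarts, the same setting should also be updated under server_configs in the topology (tiup cluster edit-config, then reload):

-- Show the block cache setting on every TiKV instance
SHOW CONFIG WHERE TYPE='tikv' AND NAME='storage.block-cache.capacity';
-- Show the TiDB memory limit
SHOW GLOBAL VARIABLES LIKE 'tidb_server_memory_limit';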

Configuration file:

global:
  user: tidb
  ssh_port: 22
  ssh_type: builtin
  deploy_dir: /tidb-deploy
  data_dir: /tidb-data
  os: linux
  arch: amd64
monitored:
  node_exporter_port: 9100
  blackbox_exporter_port: 9115
  deploy_dir: /tidb-deploy/monitor-9100
  data_dir: /tidb-data/monitor-9100
  log_dir: /tidb-deploy/monitor-9100/log
server_configs:
  tidb:
    log.level: error
  tikv:
    rocksdb.max-background-jobs: 30
    rocksdb.max-sub-compactions: 20
    rocksdb.rate-bytes-per-sec: 100G
    storage.block-cache.capacity: 24G
  pd:
    replication.enable-placement-rules: true
    replication.location-labels:
    - host
  tidb_dashboard: {}
  tiflash: {}
  tiproxy: {}
  tiflash-learner: {}
  pump: {}
  drainer: {}
  cdc: {}
  kvcdc: {}
  grafana: {}
tidb_servers:
- host: 192.168.19.206
  ssh_port: 22
  port: 4000
  status_port: 10080
  deploy_dir: /tidb-deploy/tidb-4000
  log_dir: /tidb-deploy/tidb-4000/log
  arch: amd64
  os: linux
- host: 192.168.19.207
  ssh_port: 22
  port: 4000
  status_port: 10080
  deploy_dir: /tidb-deploy/tidb-4000
  log_dir: /tidb-deploy/tidb-4000/log
  arch: amd64
  os: linux
- host: 192.168.19.205
  ssh_port: 22
  port: 4000
  status_port: 10080
  deploy_dir: /tidb-deploy/tidb-4000
  log_dir: /tidb-deploy/tidb-4000/log
  arch: amd64
  os: linux
tikv_servers:
- host: 192.168.19.206
  ssh_port: 22
  port: 20160
  status_port: 20180
  deploy_dir: /tidb-deploy/tikv-20160
  data_dir: /tidb-data/tikv-20160
  log_dir: /tidb-deploy/tikv-20160/log
  config:
    server.labels:
      host: logic-host-1
  arch: amd64
  os: linux
- host: 192.168.19.207
  ssh_port: 22
  port: 20160
  status_port: 20180
  deploy_dir: /tidb-deploy/tikv-20160
  data_dir: /tidb-data/tikv-20160
  log_dir: /tidb-deploy/tikv-20160/log
  config:
    server.labels:
      host: logic-host-2
  arch: amd64
  os: linux
- host: 192.168.19.206
  ssh_port: 22
  port: 20161
  status_port: 20181
  deploy_dir: /tidb-deploy/tikv-20161
  data_dir: /tidb-data1/tikv-20161
  log_dir: /tidb-deploy/tikv-20161/log
  config:
    server.labels:
      host: logic-host-3
  arch: amd64
  os: linux
- host: 192.168.19.207
  ssh_port: 22
  port: 20161
  status_port: 20181
  deploy_dir: /tidb-deploy/tikv-20161
  data_dir: /tidb-data1/tikv-20161
  log_dir: /tidb-deploy/tikv-20161/log
  config:
    server.labels:
      host: logic-host-4
  arch: amd64
  os: linux
- host: 192.168.19.206
  ssh_port: 22
  port: 20162
  status_port: 20182
  deploy_dir: /tidb-deploy/tikv-20162
  data_dir: /tidb-data2/tikv-20162
  log_dir: /tidb-deploy/tikv-20162/log
  config:
    server.labels:
      host: logic-host-5
  arch: amd64
  os: linux
- host: 192.168.19.207
  ssh_port: 22
  port: 20162
  status_port: 20182
  deploy_dir: /tidb-deploy/tikv-20162
  data_dir: /tidb-data2/tikv-20162
  log_dir: /tidb-deploy/tikv-20162/log
  config:
    server.labels:
      host: logic-host-6
  arch: amd64
  os: linux
tiflash_servers: []
tiproxy_servers: []
pd_servers:
- host: 192.168.19.206
  ssh_port: 22
  name: pd-192.168.19.206-2379
  client_port: 2379
  peer_port: 2380
  deploy_dir: /tidb-deploy/pd-2379
  data_dir: /tidb-data/pd-2379
  log_dir: /tidb-deploy/pd-2379/log
  arch: amd64
  os: linux
- host: 192.168.19.207
  ssh_port: 22
  name: pd-192.168.19.207-2379
  client_port: 2379
  peer_port: 2380
  deploy_dir: /tidb-deploy/pd-2379
  data_dir: /tidb-data/pd-2379
  log_dir: /tidb-deploy/pd-2379/log
  arch: amd64
  os: linux
- host: 192.168.19.205
  ssh_port: 22
  name: pd-192.168.19.205-2379
  client_port: 2379
  peer_port: 2380
  deploy_dir: /tidb-deploy/pd-2379
  data_dir: /tidb-data/pd-2379
  log_dir: /tidb-deploy/pd-2379/log
  arch: amd64
  os: linux
monitoring_servers:
- host: 192.168.19.207
  ssh_port: 22
  port: 9090
  ng_port: 12020
  deploy_dir: /tidb-deploy/prometheus-9090
  data_dir: /tidb-data/prometheus-9090
  log_dir: /tidb-deploy/prometheus-9090/log
  external_alertmanagers: []
  arch: amd64
  os: linux
grafana_servers:
- host: 192.168.19.207
  ssh_port: 22
  port: 3000
  deploy_dir: /tidb-deploy/grafana-3000
  arch: amd64
  os: linux
  username: admin
  password: admin
  anonymous_enable: false
  root_url: ""
  domain: ""

Three TiKV instances sharing one server? Then storage.block-cache.capacity is set far too high. With storage.block-cache.capacity at 24G, a single TiKV can end up using around 50G of memory, so three of them can consume roughly 150G. On top of that, 192.168.19.206 and 192.168.19.207 also host tidb-server and PD. With memory being squeezed like that, if your PD leader sits on one of those two machines it can simply be starved to death, and then the whole cluster becomes unusable...
If changing the topology is inconvenient, at least lower TiKV's storage.block-cache.capacity first, say to 10G, so that PD doesn't get crushed...
SET CONFIG tikv `storage.block-cache.capacity`='10G';
monitoring_servers and grafana_servers can both be moved to 192.168.19.205 first. Also, if resources are tight, I wouldn't keep 3 tidb-servers; scale in one of the tidb-servers on 192.168.19.206 or 192.168.19.207. And if possible, transfer the PD leader to the 192.168.19.205 machine first (rough commands below).
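For reference, the PD leader transfer and the tidb-server scale-in roughly map to the commands below (a sketch only, not verified against this cluster; <cluster-name> is a placeholder and the ctl version should match your deployment):

# Transfer the PD leader to the PD member on 192.168.19.205
tiup ctl:v7.5.1 pd -u http://192.168.19.206:2379 member leader transfer pd-192.168.19.205-2379
# Scale in one tidb-server, e.g. the one on 192.168.19.206
tiup cluster scale-in <cluster-name> --node 192.168.19.206:4000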

In my tests each TiKV never used more than 30G of memory, but I'm planning to lower the setting anyway, and I'll also consider moving things to 205.

Mixed deployment (co-locating components on the same machines) is not recommended.