After upgrading from 8.1.1 to 8.5, full-table scans on a memory-constrained host hang the whole cluster

[TiDB Environment] PoC
[TiDB Version] 8.5 LTS
[Reproduction Path] Any scan-and-sort of a table larger than available memory.

[Problem: Symptoms and Impact]
After upgrading to 8.5, prepare a table (anything large enough to trigger an OOM on earlier versions). Run select * from xx order by on a non-indexed column to force an unfiltered full-table sort.
Once memory usage hits the ceiling, neither tikv nor tidb-server restarts, but the whole cluster hangs: the Dashboard becomes unreachable, although the shutdown command still completes successfully.
Observed from the host, once the scan fills memory the cluster stops reading from the kv and pd disks entirely. It keeps reading the system disk, i.e. where deploy and log live, and never touches the data disks. I have this on video. The topology file is below; the kv1/2/3 and "other" mounts correspond to the disk names visible on the host, and everything outside those four disks reads and writes the system disk.
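A minimal sketch of the repro query shape described above; the table and column names (big_t, c) are hypothetical stand-ins:

-- big_t: any table larger than available memory; c: a column with no index on it
SELECT * FROM big_t ORDER BY c;
-- No filter and no usable index, so the Sort operator must materialize (or spill) the whole table.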

With the same data and the same config on 8.1.1:
tikv and tidb-server do get restarted. From the host I can see tikv reading its data disk again after the restart. The first SQL execution fails, but after retrying, the query eventually completes.

On 8.5 it never recovers. I have tried many times; nothing comes back before the connection times out. But if I raise memory to the point where it can't be filled, say 32 GB, both 8.1.1 and 8.5 return results quickly.

[Resource Configuration] (TiDB Dashboard → Cluster Info → Hosts, screenshot of that page)

Deployment: single host, 3 tikv (on three separate physical disks), 1 tidb-server, 1 pd.
Host spec: 8c16t VM with 16 GB RAM.

After the 8.5 upgrade I scaled in TiFlash to save resources, since it was rarely used. The table being scanned has no TiFlash replica, so TiFlash is not the issue.

The only other difference is the OS: 8.5 no longer supports CentOS 7, so I upgraded to Rocky 8.10.


[Attachments: Screenshots/Logs/Monitoring]

global:
  user: tidb
  ssh_port: 22
  ssh_type: builtin
  deploy_dir: /tidb-deploy
  data_dir: /data_other
  os: linux
  systemd_mode: system
monitored:
  node_exporter_port: 9100
  blackbox_exporter_port: 9115
  deploy_dir: /tidb-deploy/monitor-9100
  data_dir: /data_other/monitor-9100
  log_dir: /tidb-deploy/monitor-9100/log
server_configs:
  tidb:
    log.slow-threshold: 300
  tikv:
    readpool.coprocessor.use-unified-pool: true
    readpool.storage.use-unified-pool: false
  pd:
    replication.enable-placement-rules: true
    replication.location-labels:
      - host
  tso: {}
  scheduling: {}
  tidb_dashboard: {}
  tiflash:
    logger.level: warn
  tiproxy: {}
  tiflash-learner: {}
  pump: {}
  drainer: {}
  cdc: {}
  kvcdc: {}
  grafana: {}
tidb_servers:
  - host: 127.0.0.1
    ssh_port: 22
    port: 4000
    status_port: 10080
    deploy_dir: /tidb-deploy/tidb-4000
    log_dir: /tidb-deploy/tidb-4000/log
    resource_control:
      memory_limit: 5G
    arch: amd64
    os: linux
tikv_servers:
  - host: 127.0.0.1
    ssh_port: 22
    port: 20160
    status_port: 20180
    deploy_dir: /tidb-deploy/tikv-20160
    data_dir: /data_kv1
    log_dir: /tidb-deploy/tikv-20160/log
    config:
      server.labels:
        host: logic-host-1
    arch: amd64
    os: linux
  - host: 127.0.0.1
    ssh_port: 22
    port: 20161
    status_port: 20181
    deploy_dir: /tidb-deploy/tikv-20161
    data_dir: /data_kv2
    log_dir: /tidb-deploy/tikv-20161/log
    config:
      server.labels:
        host: logic-host-2
    arch: amd64
    os: linux
  - host: 127.0.0.1
    ssh_port: 22
    port: 20162
    status_port: 20182
    deploy_dir: /tidb-deploy/tikv-20162
    data_dir: /data_kv3
    log_dir: /tidb-deploy/tikv-20162/log
    config:
      server.labels:
        host: logic-host-3
    arch: amd64
    os: linux
tiflash_servers: []
tiproxy_servers: []
pd_servers:
  - host: 127.0.0.1
    ssh_port: 22
    name: pd-127.0.0.1-2379
    client_port: 2379
    peer_port: 2380
    deploy_dir: /tidb-deploy/pd-2379
    data_dir: /data_other/pd-2379
    log_dir: /tidb-deploy/pd-2379/log
    arch: amd64
    os: linux
monitoring_servers:
  - host: 127.0.0.1
    ssh_port: 22
    port: 9090
    ng_port: 12020
    deploy_dir: /tidb-deploy/prometheus-9090
    data_dir: /data_other/prometheus-9090
    log_dir: /tidb-deploy/prometheus-9090/log
    external_alertmanagers: []
    resource_control:
      memory_limit: 1G
    arch: amd64
    os: linux
grafana_servers:
  - host: 127.0.0.1
    ssh_port: 22
    port: 3000
    deploy_dir: /tidb-deploy/grafana-3000
    resource_control:
      memory_limit: 1G
    arch: amd64
    os: linux
    username: admin
    password: admin
    anonymous_enable: false
    root_url: ""
    domain: ""

Deployment: single host, 3 tikv (on three separate physical disks), 1 tidb-server, 1 pd.

Do all nodes and instances get this spec (8c16t, 16 GB) each, or is that all the resources the physical node has in total?


  1. Is spill-to-disk enabled for queries? (Items 1 and 2 can be checked as sketched below.)
  2. Is resource control enabled?
  3. After tidb hangs, do the PD node logs show anything abnormal, and what do the tidb node logs say?
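One way to check items 1 and 2 from a SQL client; these are standard TiDB system variables rather than anything specific to this cluster:

SHOW VARIABLES LIKE 'tidb_enable_tmp_storage_on_oom';  -- ON: memory-hungry operators may spill to disk at the quota
SHOW VARIABLES LIKE 'tidb_mem_quota_query';            -- per-query memory quota (bytes) that triggers spill/cancel
SHOW VARIABLES LIKE 'tidb_enable_resource_control';    -- ON: resource control is active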

https://docs.pingcap.com/zh/tidb/stable/hybrid-deployment-topology#混合部署的关键参数介绍

storage.block-cache.capacity = (MEM_TOTAL * 0.5 / number of TiKV instances)

Without setting this parameter, each tikv assumes it has all 16 GB to itself and OOMs; that part I can understand.
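Plugging this topology into the formula gives 16 GB × 0.5 / 3 tikv ≈ 2.6 GB per instance. A sketch of setting it online from a SQL client, with the value rounded down to stay conservative (it could equally go under server_configs.tikv in the topology file):

-- 16 GB total × 0.5 / 3 TiKV instances ≈ 2.6 GiB; round down to 2 GiB
SET CONFIG tikv `storage.block-cache.capacity` = '2GiB';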

Once memory usage hits the ceiling, neither tikv nor tidb-server restarts, but the whole cluster hangs: the Dashboard becomes unreachable, although the shutdown command still completes successfully.
Observed from the host, once the scan fills memory the cluster stops reading from the kv and pd disks entirely. It keeps reading the system disk, i.e. where deploy and log live, and never touches the data disks. I have this on video. The topology file is below; the kv1/2/3 and "other" mounts correspond to the disk names visible on the host, and everything outside those four disks reads and writes the system disk.

This is the part I can't make sense of, but my vague feeling is that the problem may be OS-related.

I think you could try colocating pd with tidb: keep the 3 tikv on the one box, give pd + tidb 2c4g and the 3 tikv 6c12g, then retry. With total resources unchanged, I'd expect at least the Dashboard to stay reachable under that layout, and once the Dashboard is up you can run all kinds of profiling analyses.

https://docs.pingcap.com/zh/tidb/stable/dashboard-profiling#tidb-dashboard-实例性能分析---手动分析页面

Otherwise there's nothing to go on.


Thanks for the ideas; I'll take a look when I get back later. This cluster has been upgraded step by step from 4.0, and parameters that didn't exist in the early days have never been changed. Just mindless tiup upgrades.

Thanks for the ideas. I'll try splitting them apart.

I colocated everything in the first place so resources could be preempted from each other: whatever had finished, or had blown up, would get squeezed out for a while, and on old versions that worked especially well. Once instances are split apart, spare cores can no longer be fully used; being short on hardware, I reached for poor man's tricks. It does make troubleshooting harder, though.


1. Spill-to-disk support is enabled.
2. Resource control is enabled.
3. Logs are below; it feels like PD died.

PD log:

[2024/12/30 23:02:32.286 +08:00] [WARN] [retry_interceptor.go:63] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc004b1a3c0/127.0.0.1:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[2024/12/30 23:02:35.734 +08:00] [WARN] [retry_interceptor.go:63] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc004b1a3c0/127.0.0.1:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[2024/12/30 23:02:36.174 +08:00] [WARN] [retry_interceptor.go:63] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc004b1a1e0/127.0.0.1:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[2024/12/30 23:02:39.253 +08:00] [WARN] [retry_interceptor.go:63] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc004b1a3c0/127.0.0.1:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[2024/12/30 23:02:40.142 +08:00] [WARN] [retry_interceptor.go:63] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc004b1a3c0/127.0.0.1:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[2024/12/30 23:02:43.666 +08:00] [WARN] [retry_interceptor.go:63] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc004b1a3c0/127.0.0.1:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[2024/12/30 23:02:47.535 +08:00] [WARN] [retry_interceptor.go:63] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc004b1a3c0/127.0.0.1:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[2024/12/30 23:02:48.508 +08:00] [INFO] [client.go:210] ["Auto sync endpoints failed."] [error="context deadline exceeded"]
[2024/12/30 23:02:49.321 +08:00] [WARN] [retry_interceptor.go:63] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc004b1a000/127.0.0.1:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[2024/12/30 23:02:49.832 +08:00] [WARN] [retry_interceptor.go:63] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc004b1a1e0/127.0.0.1:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[2024/12/30 23:02:50.156 +08:00] [WARN] [retry_interceptor.go:63] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc004b1a3c0/127.0.0.1:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[2024/12/30 23:02:57.116 +08:00] [WARN] [retry_interceptor.go:63] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc004b1a3c0/127.0.0.1:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[2024/12/30 23:03:01.473 +08:00] [WARN] [retry_interceptor.go:63] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc004b1a1e0/127.0.0.1:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[2024/12/30 23:03:04.256 +08:00] [WARN] [retry_interceptor.go:63] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc004b1a000/127.0.0.1:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[2024/12/30 23:03:13.216 +08:00] [WARN] [retry_interceptor.go:63] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc00323e000/127.0.0.1:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[2024/12/30 23:03:10.745 +08:00] [WARN] [retry_interceptor.go:63] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc004b1a3c0/127.0.0.1:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[2024/12/30 23:03:11.438 +08:00] [WARN] [retry_interceptor.go:63] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc002a461e0/127.0.0.1:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[2024/12/30 23:03:13.949 +08:00] [WARN] [retry_interceptor.go:63] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc004b1a000/127.0.0.1:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[2024/12/30 23:03:14.030 +08:00] [WARN] [retry_interceptor.go:63] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc004b1a000/127.0.0.1:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[2024/12/30 23:03:16.707 +08:00] [WARN] [retry_interceptor.go:63] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc004b1a000/127.0.0.1:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[2024/12/30 23:03:19.849 +08:00] [WARN] [retry_interceptor.go:63] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc004b1a1e0/127.0.0.1:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[2024/12/30 23:03:36.173 +08:00] [WARN] [retry_interceptor.go:63] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc004b1a000/127.0.0.1:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[2024/12/30 23:03:38.574 +08:00] [WARN] [retry_interceptor.go:63] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc004b1a000/127.0.0.1:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[2024/12/30 23:03:39.093 +08:00] [WARN] [retry_interceptor.go:63] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc004b1a3c0/127.0.0.1:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[2024/12/30 23:03:37.749 +08:00] [WARN] [retry_interceptor.go:63] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc004b1a1e0/127.0.0.1:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[2024/12/30 23:03:39.564 +08:00] [WARN] [retry_interceptor.go:63] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc004b1a3c0/127.0.0.1:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[2024/12/30 23:03:39.804 +08:00] [INFO] [client.go:210] ["Auto sync endpoints failed."] [error="context deadline exceeded"]
[2024/12/30 23:03:40.532 +08:00] [WARN] [retry_interceptor.go:63] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc004b1a3c0/127.0.0.1:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[2024/12/30 23:03:47.322 +08:00] [WARN] [retry_interceptor.go:63] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc00323e000/127.0.0.1:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[2024/12/30 23:03:47.549 +08:00] [WARN] [retry_interceptor.go:63] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc004b1a3c0/127.0.0.1:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]

The tikv log has a lot in it; what looks relevant is below:

[2024/12/30 23:03:46.094 +08:00] [INFO] [service.rs:70] ["pd meta client creating watch stream."] [rev=102864810] [path=resource_group/settings] [thread_id=18]
[2024/12/30 23:03:46.241 +08:00] [WARN] [client.rs:155] ["failed to update PD client"] [error="Other(\"[components/pd_client/src/util.rs:377]: cancel reconnection due to too small interval\")"] [thread_id=12]
[2024/12/30 23:03:46.248 +08:00] [ERROR] [util.rs:497] ["request failed, retry"] [err_code=KV:Pd:StreamDisconnect] [err="StreamDisconnect(SendError { kind: Disconnected })"] [thread_id=34]
[2024/12/30 23:03:46.248 +08:00] [ERROR] [util.rs:497] ["request failed, retry"] [err_code=KV:Pd:StreamDisconnect] [err="StreamDisconnect(SendError { kind: Disconnected })"] [thread_id=34]
[2024/12/30 23:03:46.248 +08:00] [ERROR] [util.rs:497] ["request failed, retry"] [err_code=KV:Pd:StreamDisconnect] [err="StreamDisconnect(SendError { kind: Disconnected })"] [thread_id=34]
[2024/12/30 23:03:46.248 +08:00] [ERROR] [util.rs:497] ["request failed, retry"] [err_code=KV:Pd:StreamDisconnect] [err="StreamDisconnect(SendError { kind: Disconnected })"] [thread_id=34]
[2024/12/30 23:03:46.248 +08:00] [ERROR] [util.rs:497] ["request failed, retry"] [err_code=KV:Pd:StreamDisconnect] [err="StreamDisconnect(SendError { kind: Disconnected })"] [thread_id=34]
[2024/12/30 23:03:46.248 +08:00] [ERROR] [util.rs:497] ["request failed, retry"] [err_code=KV:Pd:StreamDisconnect] [err="StreamDisconnect(SendError { kind: Disconnected })"] [thread_id=34]
[2024/12/30 23:03:46.248 +08:00] [ERROR] [util.rs:497] ["request failed, retry"] [err_code=KV:Pd:StreamDisconnect] [err="StreamDisconnect(SendError { kind: Disconnected })"] [thread_id=34]
[2024/12/30 23:03:46.248 +08:00] [ERROR] [util.rs:497] ["request failed, retry"] [err_code=KV:Pd:StreamDisconnect] [err="StreamDisconnect(SendError { kind: Disconnected })"] [thread_id=34]
[2024/12/30 23:03:46.248 +08:00] [ERROR] [util.rs:497] ["request failed, retry"] [err_code=KV:Pd:StreamDisconnect] [err="StreamDisconnect(SendError { kind: Disconnected })"] [thread_id=34]
[2024/12/30 23:03:46.248 +08:00] [ERROR] [util.rs:497] ["request failed, retry"] [err_code=KV:Pd:StreamDisconnect] [err="StreamDisconnect(SendError { kind: Disconnected })"] [thread_id=34]
[2024/12/30 23:03:46.248 +08:00] [ERROR] [util.rs:497] ["request failed, retry"] [err_code=KV:Pd:StreamDisconnect] [err="StreamDisconnect(SendError { kind: Disconnected })"] [thread_id=34]
[2024/12/30 23:03:46.248 +08:00] [ERROR] [util.rs:497] ["request failed, retry"] [err_code=KV:Pd:StreamDisconnect] [err="StreamDisconnect(SendError { kind: Disconnected })"] [thread_id=34]
[2024/12/30 23:03:46.248 +08:00] [ERROR] [util.rs:497] ["request failed, retry"] [err_code=KV:Pd:StreamDisconnect] [err="StreamDisconnect(SendError { kind: Disconnected })"] [thread_id=34]
[2024/12/30 23:03:46.248 +08:00] [ERROR] [util.rs:497] ["request failed, retry"] [err_code=KV:Pd:StreamDisconnect] [err="StreamDisconnect(SendError { kind: Disconnected })"] [thread_id=34]
[2024/12/30 23:03:46.248 +08:00] [ERROR] [util.rs:497] ["request failed, retry"] [err_code=KV:Pd:StreamDisconnect] [err="StreamDisconnect(SendError { kind: Disconnected })"] [thread_id=34]
[2024/12/30 23:03:46.248 +08:00] [ERROR] [util.rs:497] ["request failed, retry"] [err_code=KV:Pd:StreamDisconnect] [err="StreamDisconnect(SendError { kind: Disconnected })"] [thread_id=34]
[2024/12/30 23:03:46.248 +08:00] [ERROR] [service.rs:97] ["failed to watch resource groups"] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"not leader\", details: [] }))"] [thread_id=18]
[2024/12/30 23:03:46.248 +08:00] [ERROR] [util.rs:497] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"not leader\", details: [] }))"] [thread_id=171]
[2024/12/30 23:03:46.248 +08:00] [ERROR] [pd.rs:1352] ["store heartbeat failed"] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"not leader\", details: [] }))"] [thread_id=34]
[2024/12/30 23:03:46.258 +08:00] [WARN] [pd.rs:1785] ["report min resolved_ts failed"] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"not leader\", details: [] }))"] [thread_id=34]
[2024/12/30 23:03:46.494 +08:00] [WARN] [client.rs:155] ["failed to update PD client"] [error="Other(\"[components/pd_client/src/util.rs:377]: cancel reconnection due to too small interval\")"] [thread_id=12]
[2024/12/30 23:03:46.543 +08:00] [ERROR] [util.rs:497] ["request failed, retry"] [err_code=KV:Pd:StreamDisconnect] [err="StreamDisconnect(SendError { kind: Disconnected })"] [thread_id=34]
[2024/12/30 23:03:46.543 +08:00] [ERROR] [util.rs:497] ["request failed, retry"] [err_code=KV:Pd:StreamDisconnect] [err="StreamDisconnect(SendError { kind: Disconnected })"] [thread_id=34]
[2024/12/30 23:03:46.543 +08:00] [ERROR] [util.rs:497] ["request failed, retry"] [err_code=KV:Pd:StreamDisconnect] [err="StreamDisconnect(SendError { kind: Disconnected })"] [thread_id=34]
[2024/12/30 23:03:46.543 +08:00] [ERROR] [util.rs:497] ["request failed, retry"] [err_code=KV:Pd:StreamDisconnect] [err="StreamDisconnect(SendError { kind: Disconnected })"] [thread_id=34]
[2024/12/30 23:03:46.637 +08:00] [ERROR] [util.rs:497] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"not leader\", details: [] }))"] [thread_id=134]
[2024/12/30 23:03:46.648 +08:00] [ERROR] [util.rs:497] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"not leader\", details: [] }))"] [thread_id=132]
[2024/12/30 23:03:46.649 +08:00] [ERROR] [util.rs:497] ["request failed, retry"] [err_code=KV:Pd:StreamDisconnect] [err="StreamDisconnect(SendError { kind: Disconnected })"] [thread_id=34]
[2024/12/30 23:03:46.674 +08:00] [ERROR] [util.rs:497] ["request failed, retry"] [err_code=KV:Pd:StreamDisconnect] [err="StreamDisconnect(SendError { kind: Disconnected })"] [thread_id=34]
[2024/12/30 23:03:46.764 +08:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=http://127.0.0.1:2379] [thread_id=12]
[2024/12/30 23:03:46.936 +08:00] [WARN] [pd.rs:1785] ["report min resolved_ts failed"] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"not leader\", details: [] }))"] [thread_id=34]
[2024/12/30 23:03:47.329 +08:00] [ERROR] [util.rs:497] ["request failed, retry"] [err_code=KV:Pd:StreamDisconnect] [err="StreamDisconnect(SendError { kind: Disconnected })"] [thread_id=34]
[2024/12/30 23:03:47.393 +08:00] [ERROR] [util.rs:497] ["request failed, retry"] [err_code=KV:Pd:StreamDisconnect] [err="StreamDisconnect(SendError { kind: Disconnected })"] [thread_id=34]
[2024/12/30 23:03:47.393 +08:00] [ERROR] [util.rs:497] ["request failed, retry"] [err_code=KV:Pd:StreamDisconnect] [err="StreamDisconnect(SendError { kind: Disconnected })"] [thread_id=34]
[2024/12/30 23:03:47.393 +08:00] [ERROR] [util.rs:497] ["request failed, retry"] [err_code=KV:Pd:StreamDisconnect] [err="StreamDisconnect(SendError { kind: Disconnected })"] [thread_id=34]
[2024/12/30 23:03:47.393 +08:00] [ERROR] [util.rs:497] ["request failed, retry"] [err_code=KV:Pd:StreamDisconnect] [err="StreamDisconnect(SendError { kind: Disconnected })"] [thread_id=34]
[2024/12/30 23:03:47.393 +08:00] [ERROR] [util.rs:497] ["request failed, retry"] [err_code=KV:Pd:StreamDisconnect] [err="StreamDisconnect(SendError { kind: Disconnected })"] [thread_id=34]
[2024/12/30 23:03:47.393 +08:00] [ERROR] [util.rs:497] ["request failed, retry"] [err_code=KV:Pd:StreamDisconnect] [err="StreamDisconnect(SendError { kind: Disconnected })"] [thread_id=34]
[2024/12/30 23:03:47.393 +08:00] [ERROR] [util.rs:497] ["request failed, retry"] [err_code=KV:Pd:StreamDisconnect] [err="StreamDisconnect(SendError { kind: Disconnected })"] [thread_id=34]
[2024/12/30 23:03:47.393 +08:00] [ERROR] [util.rs:497] ["request failed, retry"] [err_code=KV:Pd:StreamDisconnect] [err="StreamDisconnect(SendError { kind: Disconnected })"] [thread_id=34]
[2024/12/30 23:03:47.393 +08:00] [ERROR] [util.rs:497] ["request failed, retry"] [err_code=KV:Pd:StreamDisconnect] [err="StreamDisconnect(SendError { kind: Disconnected })"] [thread_id=34]
[2024/12/30 23:03:47.393 +08:00] [INFO] [service.rs:70] ["pd meta client creating watch stream."] [rev=102864810] [path=resource_group/settings] [thread_id=18]
[2024/12/30 23:03:47.393 +08:00] [ERROR] [util.rs:497] ["request failed, retry"] [err_code=KV:Pd:StreamDisconnect] [err="StreamDisconnect(SendError { kind: Disconnected })"] [thread_id=34]
[2024/12/30 23:03:47.394 +08:00] [ERROR] [util.rs:497] ["request failed, retry"] [err_code=KV:Pd:StreamDisconnect] [err="StreamDisconnect(SendError { kind: Disconnected })"] [thread_id=34]
[2024/12/30 23:03:47.394 +08:00] [ERROR] [util.rs:497] ["request failed, retry"] [err_code=KV:Pd:StreamDisconnect] [err="StreamDisconnect(SendError { kind: Disconnected })"] [thread_id=34]
[2024/12/30 23:03:47.394 +08:00] [ERROR] [util.rs:497] ["request failed, retry"] [err_code=KV:Pd:StreamDisconnect] [err="StreamDisconnect(SendError { kind: Disconnected })"] [thread_id=34]
[2024/12/30 23:03:47.394 +08:00] [ERROR] [util.rs:497] ["request failed, retry"] [err_code=KV:Pd:StreamDisconnect] [err="StreamDisconnect(SendError { kind: Disconnected })"] [thread_id=34]
[2024/12/30 23:03:47.408 +08:00] [ERROR] [service.rs:97] ["failed to watch resource groups"] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"not leader\", details: [] }))"] [thread_id=17]
[2024/12/30 23:03:47.452 +08:00] [ERROR] [util.rs:497] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"not leader\", details: [] }))"] [thread_id=171]

tidb-server log:

{"level":"warn","ts":1735570956.9086597,"caller":"v3@v3.5.12/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc003c4ca80/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
{"level":"info","ts":1735570969.7414784,"caller":"v3@v3.5.12/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"}
{"level":"warn","ts":"2024-12-30T23:02:50.168129+0800","logger":"etcd-client","caller":"v3@v3.5.12/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc001065880/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
{"level":"info","ts":"2024-12-30T23:03:21.138379+0800","logger":"etcd-client","caller":"v3@v3.5.12/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"}
{"level":"warn","ts":"2024-12-30T23:03:32.097583+0800","logger":"etcd-client","caller":"v3@v3.5.12/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc001065880/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
{"level":"warn","ts":1735571023.3198137,"caller":"v3@v3.5.12/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc003c4ca80/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
{"level":"info","ts":1735571024.2877212,"caller":"v3@v3.5.12/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"}

How many PD instances are there? Only one?

Yes. It's a PoC, so there is only one pd.
The physical node doesn't have much in the way of resources, which is why everything is squeezed onto one machine.

Front-row seat, reading every word.

No wonder, then. Even a PoC needs to be given enough resources.
Mixed deployment is hard to tune parameters for as it is; add resource starvation and the slightest pressure breaks something.

I've tried giving it more memory and it still happens, right up until memory is so large that a full scan can't fill it. 8.1.1 never showed this, so I read it as 8.5 being more demanding about the environment: the minimum spec now really matters.
I'll try splitting PD out on its own and see whether that helps.

[2024/12/30 23:03:39.564 +08:00] [WARN] [retry_interceptor.go:63] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc004b1a3c0/127.0.0.1:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]

It can't connect to PD.

[2024/12/30 23:03:46.248 +08:00] [ERROR] [pd.rs:1352] ["store heartbeat failed"] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"not leader\", details: [] }))"] [thread_id=34]

And tikv's store heartbeats can't be sent either.

Judging from the log output, it doesn't look like PD is wedged; rather, everything keeps retrying its connection to PD and simply never gets through.

I still feel the OS should be ruled out; try Ubuntu.

I happened to have a snapshot from after the Rocky upgrade but before the TiDB upgrade. Tested it: under Rocky, 8.1.1 blows up too, with exactly the same symptoms as 8.5.


Then the probability that the problem is on the OS side just went up.

I'll try whether I can move to Red Hat first, since that's also a TiDB-recommended environment; if not, then Ubuntu. Although going from CentOS to Ubuntu doesn't seem very easy.

Off topic: what OS are you running? CentOS 7 is completely unsupported after V8; going from pre-V8 to V8.1, how did you transition to a new OS?

I went centos7 -> rocky8.
CentOS 7 was only dropped after 8.4; from pre-v8 you can tiup straight to 8.0 / 8.1 / 8.1.1.
Also, a thread says 8.5.1 will restore CentOS 7 support; see "tidb8.5不支持centos7.x" - #7, from residentevil.


Upgraded all the way from 4 to 8? This must be a dedicated test cluster. Do you jump one major version with every upgrade?

Just saw that. Wow, that's really important.


When you went from V7 to V8.1, what OS were you on? I'm currently on V7 with CentOS 7.