TiDB 4000 service keeps restarting

We have three TiDB servers and one of them keeps restarting. Could anyone help take a look?
The log is as follows:
Aug 10 21:12:37 TIDB-PD1 bash: [2022/08/10 21:12:37.001 +08:00] [WARN] [config.go:1004] ["Some configuration options should be moved to [instance] section. Please use the latter config options in [instance] instead: (slow-threshold, tidb_slow_log_threshold)."]
Aug 10 21:18:31 TIDB-PD1 kernel: audit: audit_lost=5113587 audit_rate_limit=512 audit_backlog_limit=16384
Aug 10 21:18:31 TIDB-PD1 kernel: audit: rate limit exceeded
Aug 10 21:21:51 TIDB-PD1 systemd: tidb-4000.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Aug 10 21:21:51 TIDB-PD1 systemd: Unit tidb-4000.service entered failed state.
Aug 10 21:21:51 TIDB-PD1 systemd: tidb-4000.service failed.
Aug 10 21:22:06 TIDB-PD1 systemd: tidb-4000.service holdoff time over, scheduling restart.
Aug 10 21:22:06 TIDB-PD1 systemd: Stopped tidb service.
Aug 10 21:22:06 TIDB-PD1 systemd: Started tidb service.
Aug 10 21:22:06 TIDB-PD1 bash: [2022/08/10 21:22:06.749 +08:00] [WARN] [config.go:1004] ["Some configuration options should be moved to [instance] section. Please use the latter config options in [instance] instead: (slow-threshold, tidb_slow_log_threshold)."]
Aug 10 21:24:00 TIDB-PD1 systemd-logind: New session 5981 of user root.
Aug 10 21:24:00 TIDB-PD1 systemd: Started Session 5981 of user root.
Aug 10 21:24:01 TIDB-PD1 systemd-logind: New session 5982 of user root.
Aug 10 21:24:01 TIDB-PD1 systemd: Started Session 5982 of user root.
Aug 10 21:26:54 TIDB-PD1 systemd: tidb-4000.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Aug 10 21:26:54 TIDB-PD1 systemd: Unit tidb-4000.service entered failed state.
Aug 10 21:26:54 TIDB-PD1 systemd: tidb-4000.service failed.
Aug 10 21:27:09 TIDB-PD1 systemd: tidb-4000.service holdoff time over, scheduling restart.
Aug 10 21:27:09 TIDB-PD1 systemd: Stopped tidb service.
Aug 10 21:27:09 TIDB-PD1 systemd: Started tidb service.
Aug 10 21:27:09 TIDB-PD1 bash: [2022/08/10 21:27:09.751 +08:00] [WARN] [config.go:1004] ["Some configuration options should be moved to [instance] section. Please use the latter config options in [instance] instead: (slow-threshold, tidb_slow_log_threshold)."]
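
To see how long tidb-server survives between crashes, the systemd lines in a log like this can be paired up. Below is a rough Python sketch that assumes the syslog format shown in this post. The function name `tidb_uptimes` and the `year` argument are made up for illustration (syslog lines carry no year), and `SAMPLE` is copied from the lines above.

```python
import re
from datetime import datetime

# Syslog prefix as it appears in this thread, e.g.
# "Aug 10 21:21:51 TIDB-PD1 systemd: tidb-4000.service: main process exited, ..."
LINE_RE = re.compile(r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2}) \S+ systemd: (.*)$")


def tidb_uptimes(lines, year=2022):
    """Return (started, exited, uptime_seconds) for each start/exit pair."""
    started = None
    runs = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # not a systemd line (kernel, bash, systemd-logind, ...)
        ts = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
        msg = m.group(2)
        if msg.startswith("Started tidb service"):
            started = ts
        elif "tidb-4000.service: main process exited" in msg and started:
            runs.append((started, ts, int((ts - started).total_seconds())))
            started = None
    return runs


SAMPLE = [
    "Aug 10 21:22:06 TIDB-PD1 systemd: Started tidb service.",
    "Aug 10 21:24:00 TIDB-PD1 systemd: Started Session 5981 of user root.",
    "Aug 10 21:26:54 TIDB-PD1 systemd: tidb-4000.service: main process exited, code=exited, status=2/INVALIDARGUMENT",
    "Aug 10 21:27:09 TIDB-PD1 systemd: Started tidb service.",
]

if __name__ == "__main__":
    for start, end, secs in tidb_uptimes(SAMPLE):
        print(start.time(), "->", end.time(), f"{secs}s")
```

For the excerpt above, the instance started at 21:22:06 and exited at 21:26:54, so it ran for 288 seconds before the next crash.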

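  1. Are connections balanced across the three TiDB nodes?

  2. Is resource usage roughly the same on the three TiDB nodes, or does it differ a lot?

  3. Use the Dashboard to check the top 10 slow SQL statements and see whether any of them needs enough memory to cause an OOM.

  4. Turn on diagnostics: set thresholds on SQL execution time and on the memory a SQL statement may use, so that problem SQL gets caught.

  5. You can try setting a global maximum execution time per SQL statement… statements that time out are killed, which eases the TiDB OOM problem.

Also, please share the exact cluster configuration and version information to help with the diagnosis.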
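
Points 4 and 5 above come down to a couple of `SET GLOBAL` statements. Here is a small Python sketch that only builds the statement text, so it can be reviewed before anyone runs it on production. The helper name `guardrail_statements` and the 60-second limit are illustrative assumptions. `max_execution_time` (milliseconds) and `tidb_mem_quota_query` (bytes) are TiDB system variables, but whether they can be set at global scope depends on the TiDB version, so please check the documentation for your release first.

```python
def guardrail_statements(max_exec_ms, mem_quota_bytes):
    """Build SET GLOBAL statements capping per-statement runtime and memory.

    Only strings are produced; nothing is sent to a database.
    """
    if max_exec_ms <= 0 or mem_quota_bytes <= 0:
        raise ValueError("limits must be positive")
    return [
        f"SET GLOBAL max_execution_time = {int(max_exec_ms)};",
        f"SET GLOBAL tidb_mem_quota_query = {int(mem_quota_bytes)};",
    ]


if __name__ == "__main__":
    # Illustrative values: a 60 s time limit and a 4 GiB memory quota.
    for stmt in guardrail_statements(60_000, 4 * 1024**3):
        print(stmt)
```

Start with generous limits and tighten them once the slow-query data shows what normal statements need.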

global:
  user: tidb
  ssh_port: 22
  ssh_type: builtin
  deploy_dir: /data/tidb-deploy
  data_dir: /data/tidb-data
  os: linux
monitored:
  node_exporter_port: 9100
  blackbox_exporter_port: 9115
  deploy_dir: /data/tidb-deploy/monitor-9100
  data_dir: /data/tidb-data/monitor-9100
  log_dir: /data/tidb-deploy/monitor-9100/log
server_configs:
  tidb:
    log.slow-threshold: 300
    mem-quota-query: 4294967296
tidb_servers:
- host: 192.168.16.196
  ssh_port: 22
  port: 4000
  status_port: 10080
  deploy_dir: /data/tidb-deploy/tidb-4000
  log_dir: /data/tidb-deploy/tidb-4000/log
  arch: amd64
  os: linux

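Please take a look.

Of the three tidb-server instances, the entry for 196 shows duplicated configuration items, three identical copies? Is this a configuration problem somewhere?

Have you made the change this warning asks for?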
Some configuration options should be moved to [instance] section. Please use the latter config options in [instance] instead: (slow-threshold, tidb_slow_log_threshold)

No, I haven't changed it.

  1. Which database version is this?
  2. Please post tidb.toml; it is under the tidb deploy/config directory

Please take a look.

# WARNING: This file is auto-generated. Do not edit! All your modification will be overwritten!
# You can use 'tiup cluster edit-config' and 'tiup cluster reload' to update the configuration
# All configuration items you want to change can be added to:
# server_configs:
#   tidb:
#     aa.b1.c3: value
#     aa.b2.c4: value
mem-quota-query = 4294967296
new_collations_enabled_on_first_bootstrap = true

[log]
slow-threshold = 300

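Which version?

TiDB 6.1

Oh, then the restarts have nothing to do with this setting. I tested it: the parameter is configured incorrectly, but that should be unrelated to the TiDB restarts.
I suggest first working through the points xfworld listed. If that turns up nothing, post the tidb.log covering 10 minutes before and after a restart.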

tidb_192.168.16.196_4000.log (111.8 KB)

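Please take a look at the log.

  1. This is probably caused by the use of prepare. Is this a production environment? –> from https://docs.pingcap.com/zh/tidb/stable/sql-prepared-plan-cache#执行计划缓存
  2. The SQL statements that were prepared earlier need to be executed again.

Yes, it is a production environment. Let me go through your reply first.

The system log is as follows: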
Aug 16 17:48:15 TIDB-PD1 systemd: Created slice User Slice of tidb.
Aug 16 17:48:15 TIDB-PD1 systemd-logind: New session 6512 of user tidb.
Aug 16 17:48:15 TIDB-PD1 systemd: Started Session 6512 of user tidb.
Aug 16 17:48:15 TIDB-PD1 systemd-logind: Removed session 6512.
Aug 16 17:48:15 TIDB-PD1 systemd: Removed slice User Slice of tidb.
Aug 16 17:48:15 TIDB-PD1 systemd: Created slice User Slice of tidb.
Aug 16 17:48:15 TIDB-PD1 systemd-logind: New session 6513 of user tidb.
Aug 16 17:48:15 TIDB-PD1 systemd: Started Session 6513 of user tidb.
Aug 16 17:48:15 TIDB-PD1 systemd: Reloading.
Aug 16 17:48:15 TIDB-PD1 systemd: Started blackbox_exporter service.
Aug 16 17:48:15 TIDB-PD1 systemd-logind: Removed session 6513.
Aug 16 17:48:15 TIDB-PD1 systemd: Removed slice User Slice of tidb.
Aug 16 17:48:15 TIDB-PD1 bash: level=info ts=2022-08-16T09:48:15.669186844Z caller=main.go:213 msg="Starting blackbox_exporter" version="(version=0.12.0, branch=HEAD, revision=4a22506cf0cf139d9b2f9cde099f0012d9fcabde)"
Aug 16 17:48:15 TIDB-PD1 bash: level=info ts=2022-08-16T09:48:15.669802622Z caller=main.go:220 msg="Loaded config file"
Aug 16 17:48:15 TIDB-PD1 bash: level=info ts=2022-08-16T09:48:15.669927186Z caller=main.go:324 msg="Listening on address" address=:9115
Aug 16 17:48:15 TIDB-PD1 systemd: Created slice User Slice of tidb.
Aug 16 17:48:15 TIDB-PD1 systemd-logind: New session 6514 of user tidb.
Aug 16 17:48:15 TIDB-PD1 systemd: Started Session 6514 of user tidb.
Aug 16 17:48:15 TIDB-PD1 systemd-logind: Removed session 6514.
Aug 16 17:48:15 TIDB-PD1 systemd: Removed slice User Slice of tidb.
Aug 16 17:51:25 TIDB-PD1 systemd: tidb-4000.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Aug 16 17:51:25 TIDB-PD1 systemd: Unit tidb-4000.service entered failed state.
Aug 16 17:51:25 TIDB-PD1 systemd: tidb-4000.service failed.
Aug 16 17:51:40 TIDB-PD1 systemd: tidb-4000.service holdoff time over, scheduling restart.
Aug 16 17:51:40 TIDB-PD1 systemd: Stopped tidb service.
Aug 16 17:51:40 TIDB-PD1 systemd: Started tidb service.
Aug 16 17:51:40 TIDB-PD1 bash: [2022/08/16 17:51:40.508 +08:00] [WARN] [config.go:1004] ["Some configuration options should be moved to [instance] section. Please use the latter config options in [instance] instead: (slow-threshold, tidb_slow_log_threshold)."]
Aug 16 17:51:46 TIDB-PD1 systemd: tidb-4000.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Aug 16 17:51:46 TIDB-PD1 systemd: Unit tidb-4000.service entered failed state.
Aug 16 17:51:46 TIDB-PD1 systemd: tidb-4000.service failed.
Aug 16 17:52:01 TIDB-PD1 systemd: tidb-4000.service holdoff time over, scheduling restart.
Aug 16 17:52:01 TIDB-PD1 systemd: Stopped tidb service.
Aug 16 17:52:01 TIDB-PD1 systemd: Started tidb service.
Aug 16 17:52:01 TIDB-PD1 bash: [2022/08/16 17:52:01.256 +08:00] [WARN] [config.go:1004] ["Some configuration options should be moved to [instance] section. Please use the latter config options in [instance] instead: (slow-threshold, tidb_slow_log_threshold)."]
Aug 16 18:00:01 TIDB-PD1 systemd: Started Session 6515 of user root.

Time 17:48:15 -> time 17:51:25

What was this TiDB node executing during that window?
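
One way to answer that is to cut the window out of tidb.log and read only that slice. Below is a rough Python sketch, assuming every entry starts with a timestamp in the `[2022/08/16 17:51:40.508 +08:00]` form seen in the log lines in this thread. The function name `lines_in_window` is made up, and the UTC offset is ignored, so all lines are assumed to share one time zone.

```python
import re
from datetime import datetime

# Leading timestamp of a tidb.log entry, e.g. "[2022/08/16 17:51:40.508 +08:00]".
TS_RE = re.compile(
    r"^\[(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\.\d{3} [+-]\d{2}:\d{2}\]"
)
FMT = "%Y/%m/%d %H:%M:%S"


def lines_in_window(lines, start, end):
    """Yield log lines whose entry timestamp falls within [start, end]."""
    lo, hi = datetime.strptime(start, FMT), datetime.strptime(end, FMT)
    keep = False
    for line in lines:
        m = TS_RE.match(line)
        if m:
            ts = datetime.strptime(m.group(1), FMT)
            keep = lo <= ts <= hi
        # Lines without a timestamp (stack traces, wrapped output) are
        # treated as part of the previous entry.
        if keep:
            yield line


if __name__ == "__main__":
    # "tidb.log" is a placeholder path; point it at the real log file.
    with open("tidb.log", encoding="utf-8", errors="replace") as f:
        for line in lines_in_window(f, "2022/08/16 17:48:15", "2022/08/16 17:51:25"):
            print(line, end="")
```

Continuation lines are kept with their entry so that a stack trace printed just before the exit is not dropped from the slice.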

Still not solved? After restarting the application, is the error in tidb.log still the same?
If it is, turn on debug logging for TiDB and post another copy of tidb.log.

  1. Change the TiDB log level to debug
tiup cluster edit-config szp-test
Edit the parameter
tiup cluster reload szp-test -R tidb

  2. Collect tidb.log and post it here

By the way, please grab tidb_stderr.log as well.
