TiFlash restarts caused by a rising file handle count

Hacker_ojLJ8Ndr · July 12, 2022, 03:42

[TiDB environment] Production
[TiDB version]
6.1.0
[Problem encountered]
TiFlash restarts
[Symptoms and impact]
tiflash_error.log:

tiflash-summary:

tidb-cluster-node_exporter:

Operating system configuration:
[image]

数据小黑 · July 12, 2022, 06:36

Is there anything relevant in the system messages log? Anything like an oom-killer entry?

Hacker_ojLJ8Ndr · July 12, 2022, 06:46

It was not caused by OOM. The messages log shows: main process exited, code=killed

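Aric · July 18, 2022, 02:31

Root cause investigation:

The open file count is indeed fairly high, but does the messages log contain anything like "open too many file descriptor …"?
6770 is not necessarily the true peak. The value may have been averaged down, so the actual maximum could be higher.

As for a workaround, if this is the root cause, could you set the limit to unlimited and see whether that avoids the problem?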
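Since the dashboard value may be averaged, one way to see the real peak is to poll the descriptor count of the TiFlash process directly. The following is only a minimal sketch, assuming a Linux host where `/proc/<pid>/fd` is readable (run it as the same user as TiFlash or as root). The function names and the `proc_root` parameter are illustrative and not part of any TiFlash tooling.

```python
import os
import time


def count_open_fds(pid, proc_root="/proc"):
    """Count the entries in <proc_root>/<pid>/fd (Linux only)."""
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def sample_peak_fds(pid, samples=60, interval=1.0, proc_root="/proc"):
    """Poll the descriptor count and return the highest value seen."""
    peak = 0
    for i in range(samples):
        peak = max(peak, count_open_fds(pid, proc_root))
        if i < samples - 1:
            # Wait between polls; a shorter interval catches shorter spikes.
            time.sleep(interval)
    return peak
```

Calling `sample_peak_fds(<pid>)` with the PID of the TiFlash process polls once per second for a minute. The result can then be compared with the number shown on the panel.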

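Hacker_ojLJ8Ndr · July 18, 2022, 02:56

1. In /var/log/messages there are no errors about file descriptors, only this: main process exited, code=killed, status=6/ABRT
2. The operating system resource limits are already configured; see the screenshot of the operating system configuration in my first post. The configuration is in effect, but the maximum Opened File Count in tiflash-summary is very far from the configured operating system value, so adjusting resource limits cannot work around this problem.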
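To double-check which limit the running process actually received, the effective value can be read from `/proc/<pid>/limits`. It can differ from the system-wide configuration when a service is started by an init system. This is a rough sketch that assumes the usual Linux layout of that file. The helper names are made up for illustration.

```python
def parse_max_open_files(limits_text):
    """Return (soft, hard) for the 'Max open files' row; None means unlimited."""

    def to_limit(field):
        return None if field == "unlimited" else int(field)

    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns: "Max open files", soft limit, hard limit, units.
            fields = line.split()
            return to_limit(fields[3]), to_limit(fields[4])
    raise ValueError("no 'Max open files' row found")


def read_max_open_files(pid, proc_root="/proc"):
    """Read and parse the limits file of a running process."""
    with open(f"{proc_root}/{pid}/limits") as f:
        return parse_max_open_files(f.read())
```

If the soft limit returned for the TiFlash PID matches the configured value, the limit is indeed in effect for that process, which supports ruling out descriptor exhaustion as the trigger.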

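Aric · July 18, 2022, 03:14

Use port 9090 to check what the instantaneous value of this metric recorded in Prometheus looks like (it is also averaged);
Do you still have the logs from the time TiFlash restarted (tiflash.log, tiflash_error.log)? Please post them.

With the current information it is hard to analyze further…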
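For reference, the raw series can also be pulled through the Prometheus HTTP API (`/api/v1/query_range`) on port 9090 rather than read off a panel. The sketch below is an assumption-laden example: the base URL and the PromQL expression passed in are placeholders, and the real expression should be copied from the "Opened File Count" panel definition in Grafana. Only the API path and the response layout follow the Prometheus documentation.

```python
import json
import urllib.parse
import urllib.request


def build_query_range_url(base_url, promql, start, end, step):
    """Build a query_range URL; start/end are Unix timestamps or RFC 3339."""
    params = urllib.parse.urlencode(
        {"query": promql, "start": start, "end": end, "step": step}
    )
    return f"{base_url}/api/v1/query_range?{params}"


def peak_per_series(response):
    """Map each series' sorted label set to its maximum sample value."""
    peaks = {}
    for series in response["data"]["result"]:
        labels = tuple(sorted(series["metric"].items()))
        # Each sample is [timestamp, "value-as-string"].
        peaks[labels] = max(float(value) for _, value in series["values"])
    return peaks


def fetch_peaks(base_url, promql, start, end, step="15s"):
    """Query Prometheus and return the per-series peaks."""
    url = build_query_range_url(base_url, promql, start, end, step)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return peak_per_series(json.load(resp))
```

A small `step` keeps the samples close to what was actually scraped, so the reported peak may sit nearer the true maximum than a panel that averages over a window.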

Hacker_ojLJ8Ndr · July 18, 2022, 03:51

1. Prometheus record:

[image]

2. Screenshot of the first error in tiflash_error.log:

[image]

3. These are the logs of the problematic TiFlash node, exported from the dashboard:
logs-tiflash_192.168.14.23_3930.zip (1.0 MB)

Aric · July 18, 2022, 05:31

Could you confirm whether this is the TiFlash log? It looks like a TiKV log.

Hacker_ojLJ8Ndr · July 18, 2022, 06:02

Aric · July 18, 2022, 06:36

What ng_monitor exported is probably only tiflash_tikv.log. Please go to that directory and collect all 4 of these logs for the corresponding time period.

Hacker_ojLJ8Ndr · July 18, 2022, 06:53

Yes, what was exported is tiflash_tikv.log. The tiflash.log from that time is no longer available.

Aric · July 19, 2022, 02:32

Has this still been happening often since the restart?
Do you still have the complete tiflash_error.log?

Hacker_ojLJ8Ndr · July 19, 2022, 03:26

It happens occasionally. Here is the tiflash_error.log from the problem period:
tiflash_error.log (115.3 KB)

Aric · July 19, 2022, 05:30

OK. Does every restart succeed and bring the service back to normal?

Hacker_ojLJ8Ndr · July 19, 2022, 05:53

After each restart the handle count drops back down and the service runs normally.

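Aric · July 19, 2022, 06:45

OK,

Please confirm the operating system version;
Please confirm whether continuous profiling is enabled for TiFlash in this environment (you can see this in the dashboard), or whether any manual profiling was done;

It may be this issue, which is still being confirmed --> https://github.com/pingcap/tiflash/issues/5292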

Hacker_ojLJ8Ndr · July 19, 2022, 06:58

1. Operating system version: CentOS Linux release 7.9.2009 (Core)
2. Continuous profiling is enabled; no manual profiling was done.

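Aric · July 19, 2022, 07:20

Based on our assessment, this is most likely related to this issue --> https://github.com/pingcap/tiflash/issues/5292
Suggestions:

Disable continuous profiling and observe whether the problem recurs;
You can follow this issue; we will keep investigating internally and fix the problem;

If there is any new progress or symptom, feel free to post it here. Thanks for the feedback.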

Hacker_ojLJ8Ndr · July 19, 2022, 07:25

OK, thanks for the help~

system · October 31, 2022, 19:19

This topic was automatically closed 1 minute after the last reply. New replies are no longer allowed.