read: connection reset by peer

[TiDB environment] Production
[TiDB version] v6.5.2
[Reproduction path] Which operations led to the problem
[Problem encountered: symptoms and impact]
The application reports the following error:

Cause: java.sql.SQLException: rpc error: code = Unavailable desc = error reading from server: read tcp 172.16.89.80:53590->172.16.89.85:3930: read: connection reset by peer

; uncategorized SQLException; SQL state [HY000]; error code [1105]; rpc error: code = Unavailable desc = error reading from server: read tcp 172.16.89.80:53590->172.16.89.85:3930: read: connection reset by peer; nested exception is java.sql.SQLException: rpc error: code = Unavailable desc = error reading from server: read tcp 172.16.89.80:53590->172.16.89.85:3930: read: connection reset by peer
Nodes: 172.16.89.81 (TiDB Server), 172.16.89.85 (TiFlash)
[Attachments: screenshots/logs/monitoring]

TiDB Server (172.16.89.80) error log (these errors appear in large numbers):
[2023/10/12 08:40:00.388 +08:00] [ERROR] [ddl_tiflash_api.go:396] ["get tiflash sync progress failed"] [error="Get \"http://172.16.89.85:20292/tiflash/sync-status/21743\": dial tcp 172.16.89.85:20292: connect: connection refused"] [tableID=21743] [IsPartition=false]
[2023/10/12 08:40:00.389 +08:00] [ERROR] [tiflash_manager.go:93] ["Fail to get peer status from TiFlash."] [tableID=21743]
[2023/10/12 08:40:00.390 +08:00] [ERROR] [tiflash_manager.go:119] ["Fail to get peer count from TiFlash."] [tableID=21743]
[2023/10/12 08:40:00.390 +08:00] [ERROR] [ddl_tiflash_api.go:396] ["get tiflash sync progress failed"] [error="Get \"http://172.16.89.85:20292/tiflash/sync-status/21743\": dial tcp 172.16.89.85:20292: connect: connection refused"] [tableID=21743] [IsPartition=false]
[2023/10/12 08:40:00.391 +08:00] [ERROR] [tiflash_manager.go:93] ["Fail to get peer status from TiFlash."] [tableID=21743]
[2023/10/12 08:40:00.391 +08:00] [ERROR] [tiflash_manager.go:119] ["Fail to get peer count from TiFlash."] [tableID=21743]

TiFlash node (172.16.89.85) log (tiflash_error.log):
[2023/10/12 08:39:56.457 +08:00] [WARN] [CoprocessorHandler.cpp:143] ["RegionException: region 531389, message: NOT_FOUND"] [source=CoprocessorHandler] [thread_id=98]
[2023/10/12 08:39:56.457 +08:00] [WARN] [CoprocessorHandler.cpp:143] ["RegionException: region 531997, message: NOT_FOUND"] [source=CoprocessorHandler] [thread_id=84]
[2023/10/12 08:39:56.457 +08:00] [WARN] [CoprocessorHandler.cpp:143] ["RegionException: region 533871, message: NOT_FOUND"] [source=CoprocessorHandler] [thread_id=81]
[2023/10/12 08:39:56.458 +08:00] [WARN] [CoprocessorHandler.cpp:143] ["RegionException: region 533055, message: NOT_FOUND"] [source=CoprocessorHandler] [thread_id=83]
[2023/10/12 08:41:17.801 +08:00] [WARN] [ExchangeReceiver.cpp:210] ["MakeReader fail. retry time: 0"] [source="MPPquery:444877052576792581:22,task ExchangeReceiver_339 tunnel20+22"] [thread_id=341]
[2023/10/12 08:41:18.816 +08:00] [WARN] [ExchangeReceiver.cpp:210] ["MakeReader fail. retry time: 1"] [source="MPPquery:444877052576792581:22,task ExchangeReceiver_339 tunnel20+22"] [thread_id=341]
[2023/10/12 08:41:20.236 +08:00] [WARN] [ExchangeReceiver.cpp:210] ["MakeReader fail. retry time: 2"] [source="MPPquery:444877052576792581:22,task ExchangeReceiver_339 tunnel20+22"] [thread_id=341]
[2023/10/12 08:41:23.080 +08:00] [WARN] [ExchangeReceiver.cpp:210] ["MakeReader fail. retry time: 3"] [source="MPPquery:444877052576792581:22,task ExchangeReceiver_339 tunnel20+22"] [thread_id=341]
[2023/10/12 08:41:25.593 +08:00] [WARN] [ExchangeReceiver.cpp:210] ["MakeReader fail. retry time: 4"] [source="MPPquery:444877052576792581:22,task ExchangeReceiver_339 tunnel20+22"] [thread_id=341]
[2023/10/12 08:41:26.621 +08:00] [WARN] [ExchangeReceiver.cpp:210] ["MakeReader fail. retry time: 5"] [source="MPPquery:444877052576792581:22,task ExchangeReceiver_339 tunnel20+22"] [thread_id=341]
[2023/10/12 08:41:27.634 +08:00] [WARN] [ExchangeReceiver.cpp:210] ["MakeReader fail. retry time: 6"] [source="MPPquery:444877052576792581:22,task ExchangeReceiver_339 tunnel20+22"] [thread_id=341]
[2023/10/12 08:41:27.885 +08:00] [WARN] [MPPTaskManager.cpp:152] ["Begin to abort query: 444877052576792581, abort type: ONCANCELLATION, reason: Receive cancel request from TiDB"] [thread_id=97]
[2023/10/12 08:41:27.885 +08:00] [WARN] [MPPTaskManager.cpp:195] ["Remaining task in query 444877052576792581 are: MPPquery:444877052576792581:3,task MPPquery:444877052576792581:6,task MPPquery:444877052576792581:16,task MPPquery:444877052576792581:22,task MPPquery:444877052576792581:19,task MPPquery:444877052576792581:9,task MPPquery:444877052576792581:5,task MPPquery:444877052576792581:18,task MPPquery:444877052576792581:21,task MPPquery:444877052576792581:1,task MPPquery:444877052576792581:13,task "] [thread_id=97]
[2023/10/12 08:41:27.885 +08:00] [WARN] [MPPTask.cpp:471] ["Begin abort task: MPPquery:444877052576792581:3,task, abort type: ONCANCELLATION"] [source=MPPquery:444877052576792581:3,task] [thread_id=97]
[2023/10/12 08:41:27.885 +08:00] [WARN] [MPPTask.cpp:500] ["Finish abort task from running"] [source=MPPquery:444877052576792581:3,task] [thread_id=97]
[2023/10/12 08:41:27.885 +08:00] [WARN] [MPPTask.cpp:471] ["Begin abort task: MPPquery:444877052576792581:6,task, abort type: ONCANCELLATION"] [source=MPPquery:444877052576792581:6,task] [thread_id=97]
[2023/10/12 08:41:27.886 +08:00] [WARN] [TiRemoteBlockInputStream.h:136] ["remote reader meets error: Receiver state: ERROR, error message: Read error message from mpp packet: Receive cancel request from TiDB"] [source="TiRemote(ExchangeReceiver) ExchangeReceiver MPPquery:444877052576792581:22,task ExchangeReceiver_195"] [thread_id=385]
[2023/10/12 08:41:27.886 +08:00] [WARN] [MPPTask.cpp:500] ["Finish abort task from running"] [source=MPPquery:444877052576792581:6,task] [thread_id=97]
[2023/10/12 08:41:27.897 +08:00] [WARN] [MPPTask.cpp:471] ["Begin abort task: MPPquery:444877052576792581:16,task, abort type: ONCANCELLATION"] [source=MPPquery:444877052576792581:16,task] [thread_id=97]
[2023/10/12 08:41:27.897 +08:00] [WARN] [MPPTask.cpp:500] ["Finish abort task from running"] [source=MPPquery:444877052576792581:16,task] [thread_id=97]
[2023/10/12 08:41:27.897 +08:00] [WARN] [MPPTask.cpp:471] ["Begin abort task: MPPquery:444877052576792581:22,task, abort type: ONCANCELLATION"] [source=MPPquery:444877052576792581:22,task] [thread_id=97]
[2023/10/12 08:41:27.897 +08:00] [WARN] [TiRemoteBlockInputStream.h:136] ["remote reader meets error: Receiver state: ERROR, error message: Read error message from mpp packet: Receive cancel request from TiDB"] [source="TiRemote(ExchangeReceiver) ExchangeReceiver MPPquery:444877052576792581:22,task ExchangeReceiver_195"] [thread_id=393]
[2023/10/12 08:41:27.897 +08:00] [WARN] [TiRemoteBlockInputStream.h:136] ["remote reader meets error: Receiver state: CANCELED, error message: Read error message from mpp packet: Receive cancel request from TiDB"] [source="TiRemote(ExchangeReceiver) ExchangeReceiver MPPquery:444877052576792581:22,task ExchangeReceiver_285"] [thread_id=524]
[2023/10/12 08:41:27.898 +08:00] [WARN] [TiRemoteBlockInputStream.h:136] ["remote reader meets error: Receiver state: CANCELED, error message: Read error message from mpp packet: Receive cancel request from TiDB"] [source="TiRemote(ExchangeReceiver) ExchangeReceiver MPPquery:444877052576792581:22,task ExchangeReceiver_285"] [thread_id=533]
[2023/10/12 08:41:27.898 +08:00] [WARN] [TiRemoteBlockInputStream.h:136] ["remote reader meets error: Receiver state: ERROR, error message: Read error message from mpp packet: Receive cancel request from TiDB"] [source="TiRemote(ExchangeReceiver) ExchangeReceiver MPPquery:444877052576792581:22,task ExchangeReceiver_195"] [thread_id=382]
[2023/10/12 08:41:27.898 +08:00] [WARN] [TiRemoteBlockInputStream.h:136] ["remote reader meets error: Receiver state: ERROR, error message: Read error message from mpp packet: Receive cancel request from TiDB"] [source="TiRemote(ExchangeReceiver) ExchangeReceiver MPPquery:444877052576792581:22,task ExchangeReceiver_195"] [thread_id=1827]
[2023/10/12 08:41:27.898 +08:00] [WARN] [TiRemoteBlockInputStream.h:136] ["remote reader meets error: Receiver state: CANCELED, error message: Read error message from mpp packet: Receive cancel request from TiDB"] [source="TiRemote(ExchangeReceiver) ExchangeReceiver MPPquery:444877052576792581:22,task ExchangeReceiver_285"] [thread_id=515]
[2023/10/12 08:41:27.898 +08:00] [WARN] [TiRemoteBlockInputStream.h:136] ["remote reader meets error: Receiver state: ERROR, error message: Read error message from mpp packet: Receive cancel request from TiDB"] [source="TiRemote(ExchangeReceiver) ExchangeReceiver MPPquery:444877052576792581:22,task ExchangeReceiver_216"] [thread_id=497]
[2023/10/12 08:41:27.898 +08:00] [WARN] [TiRemoteBlockInputStream.h:136] ["remote reader meets error: Receiver state: CANCELED, error message: Read error message from mpp packet: Receive cancel request from TiDB"] [source="TiRemote(ExchangeReceiver) ExchangeReceiver MPPquery:444877052576792581:22,task ExchangeReceiver_285"] [thread_id=518]

Request Duration was high during the period when the errors occurred.

[Resource configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page

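Check whether the network connection to the TiFlash node is healthy, whether a firewall or some other factor is blocking communication between TiDB Server and TiFlash, and whether the TiFlash data files are corrupted.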



The TiFlash service went down.

The connection was interrupted; this is probably a network problem.

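The Uptime of both TiFlash nodes was reset to zero, which means the service restarted automatically. We are now investigating the cause of the restarts.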
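
A restart like this can also be confirmed from SQL by looking at the start time and uptime that each component reports. The query below is only a sketch, assuming the `information_schema.cluster_info` table is available in this TiDB version; a TiFlash row with a very recent `START_TIME` would indicate that the process has restarted.

```sql
-- List TiFlash instances together with their start time and uptime.
SELECT type, instance, version, start_time, uptime
FROM information_schema.cluster_info
WHERE type = 'tiflash';
```

Dropping the `WHERE` clause shows the same information for the TiDB, TiKV, and PD instances as well.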

Troubleshooting process
Symptoms:


TiFlash nodes restarted frequently

TiDB Server nodes restarted frequently


During the incident, a large number of SQL statements with high memory usage were running.
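
One possible way to find such statements is to query the statement summary tables. This is a sketch only: it assumes statement summary is enabled and that the offending statements still fall within the current summary window, and the `LIMIT 10` is an arbitrary choice. The memory columns report what TiDB Server tracked for each statement, so memory consumed inside TiFlash may not be reflected here.

```sql
-- Top statements by peak tracked memory across all TiDB instances.
SELECT instance, digest_text, exec_count, avg_mem, max_mem
FROM information_schema.cluster_statements_summary
ORDER BY max_mem DESC
LIMIT 10;
```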
Resolution:
Set the maximum memory usage of a single SQL statement to 10 GB:
SET global tidb_mem_quota_query = 10 << 30;
Cancel any SQL statement whose memory usage exceeds 10 GB:
set global tidb_mem_oom_action='CANCEL';

Original configuration:
SET tidb_mem_quota_query = 24 << 30;
set global tidb_mem_oom_action='LOG';
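
The quota values are expressed in bytes: `10 << 30` is 10 × 2^30 = 10,737,418,240 bytes, and `24 << 30` is 25,769,803,776 bytes. A short check that the new values are in effect might look like the following, assuming both are set as global system variables as in the statements above:

```sql
-- Evaluate the shift expressions to see the quotas in bytes.
SELECT 10 << 30 AS new_quota_bytes, 24 << 30 AS old_quota_bytes;

-- Confirm the global values currently in effect.
SHOW GLOBAL VARIABLES LIKE 'tidb_mem_quota_query';
SHOW GLOBAL VARIABLES LIKE 'tidb_mem_oom_action';
```

Connections that were already open may keep their previous session-level quota until they reconnect, so it is worth checking from a fresh session.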

System behavior after the configuration change:


After memory usage was limited, TiDB Server and TiFlash no longer restarted;


SQL statements are automatically canceled once their memory usage exceeds 10 GB


After the optimized SQL was deployed, the problem was resolved.

Your SQL really needs some serious optimization.

We use TiDB for reporting and analytics. It does indeed need proper optimization.
