TiFlash error: Error Code: 1105. rpc error: code = Unavailable desc = error reading from server: EOF 22.359 sec

Five tables, all with fairly large data volumes, each with TiFlash replicas. The query fails with: Error Code: 1105. rpc error: code = Unavailable desc = error reading from server: EOF 22.359 sec

CPU and memory utilization stay low while the query runs (servers: 32C / 128G).

TiFlash logs as seen in the Dashboard:
2023-03-20 09:36:54 (UTC+08:00)

TiFlash 192.168.0.105:3930

[kv.rs:671] ["KvService::batch_raft send response fail"] [err=RemoteStopped]

2023-03-20 09:36:54 (UTC+08:00)

TiFlash 192.168.0.105:3930

[raft_client.rs:562] ["connection aborted"] [addr=192.168.0.106:20170] [receiver_err="Some(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"Socket closed\", details: [] }))"] [sink_error="Some(RpcFinished(Some(RpcStatus { code: 14-UNAVAILABLE, message: \"Socket closed\", details: [] })))"] [store_id=152]

2023-03-20 09:36:54 (UTC+08:00)

TiFlash 192.168.0.105:3930

[raft_client.rs:858] ["connection abort"] [addr=192.168.0.106:20170] [store_id=152]

2023-03-20 09:36:54 (UTC+08:00)

TiFlash 192.168.0.104:3930

[kv.rs:671] ["KvService::batch_raft send response fail"] [err=RemoteStopped]

2023-03-20 09:36:54 (UTC+08:00)

TiFlash 192.168.0.104:3930

[raft_client.rs:562] ["connection aborted"] [addr=192.168.0.106:20170] [receiver_err="Some(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"Socket closed\", details: [] }))"] [sink_error="Some(RpcFinished(Some(RpcStatus { code: 14-UNAVAILABLE, message: \"Socket closed\", details: [] })))"] [store_id=152]

2023-03-20 09:36:54 (UTC+08:00)

TiFlash 192.168.0.104:3930

[raft_client.rs:858] ["connection abort"] [addr=192.168.0.106:20170] [store_id=152]

2023-03-20 09:36:59 (UTC+08:00)

TiFlash 192.168.0.105:3930

[raft_client.rs:821] ["wait connect timeout"] [addr=192.168.0.106:20170] [store_id=152]

2023-03-20 09:36:59 (UTC+08:00)

TiFlash 192.168.0.104:3930

[raft_client.rs:821] ["wait connect timeout"] [addr=192.168.0.106:20170] [store_id=152]

2023-03-20 09:37:04 (UTC+08:00)

TiFlash 192.168.0.105:3930

[raft_client.rs:821] ["wait connect timeout"] [addr=192.168.0.106:20170] [store_id=152]

2023-03-20 09:37:04 (UTC+08:00)

TiFlash 192.168.0.104:3930

[raft_client.rs:821] ["wait connect timeout"] [addr=192.168.0.106:20170] [store_id=152]
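Every "connection aborted" / "wait connect timeout" entry above points at the same peer, 192.168.0.106:20170, which suggests that one process went down rather than a cluster-wide network problem. A minimal sketch for checking whether the raft endpoints are even reachable over TCP (the addresses are taken from the log lines; the helper name `tcp_reachable` is mine, not part of any TiDB tooling):

```python
import socket

def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False

# Endpoints taken from the TiFlash log lines above.
for addr in ["192.168.0.104:20170", "192.168.0.105:20170", "192.168.0.106:20170"]:
    host, port = addr.rsplit(":", 1)
    print(addr, "reachable" if tcp_reachable(host, int(port)) else "DOWN")
```

If 192.168.0.106:20170 alone reports DOWN while the others answer, look at that node's tiflash logs and the OS logs (e.g. `dmesg`) for why the process exited.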

Please post your tiup cluster topology and status.

You've pasted a lot, but the relationship between these entries isn't clear, and we don't know what these IPs are (why not fill in the posting template the forum asks for?).

I've already tuned TiFlash following https://docs.pingcap.com/zh/tidb/stable/tune-tiflash-performance. Right now the servers' CPU and memory aren't being fully used, the query aborts mid-execution, and the TiFlash node just crashes.
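Low node-level CPU/memory utilization at the moment of failure does not rule out an out-of-memory kill: a multi-way hash join over several full table scans can spike TiFlash's memory very quickly. Worth checking `dmesg` for oom-killer entries and bounding query memory in tiflash.toml. A sketch of the relevant knobs (the values are purely illustrative for a 128G host, not recommendations; verify the option names against your version's TiFlash configuration docs):

```toml
[profiles.default]
# Cap memory used by a single query on TiFlash; 0 means unlimited.
max_memory_usage = 10737418240                 # 10 GiB per query (illustrative)
# Cap total memory across all queries on this TiFlash node; 0 means unlimited.
max_memory_usage_for_all_queries = 85899345920 # 80 GiB total (illustrative)
```

With limits set, an oversized query fails with a memory-limit error instead of taking the whole node down, which at least makes the root cause visible.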

Is your network bandwidth sufficient?

From the errors in the logs, network communication between several of the TiFlash nodes was interrupted.

What service is 192.168.0.106:20170?

TiFlash. I'll test version 6.1 tomorrow.

My query joins 5 tables, each with around 2 million rows, and much of it is full table scans. On both 6.1 and 6.5 the TiFlash node gets knocked over. Perhaps this kind of data and query pattern isn't well supported yet, or I haven't tuned it properly. I'm not going to keep fighting it; I'll switch to a data-governance and cleansing approach instead. Thanks, everyone.

Giving up already? From the context, the root cause is still unknown. You're giving up without knowing why?

With my data and query conditions, on both 6.1 and 6.5 the TiFlash node gets crashed; there also seems to be something about keepalive in the errors, and it keeps reporting that the TiFlash node cannot be connected, even though my hardware utilization is still low. As I see it there are three main possibilities: 1. many of my query conditions have no indexes, causing full table scans; 2. I haven't tuned deeply enough; 3. the current system simply can't support this workload yet. The project is on a tight schedule, and since this is open source without paid support I can't expect too much; going back and forth in the forum every day slows the project down, ha.

My hardware utilization is still low.

What's the configuration?

All nodes: 32C / 128G / SSD.

Post the execution plan; don't let such good hardware go to waste :heart_eyes:


id,estRows,task,"access object","operator info"
Sort_16,609983.41,root,smart_car_prod.qood_user.id:desc
└─Projection_18,609983.41,root,"smart_car_prod.qood_user.id, ifnull(timestampdiff(DAY, Column#239, 2023-03-25 16:25:43), 0)->Column#245, ifnull(Column#240, 0)->Column#246, Column#241, timestampdiff(DAY, smart_car_prod.qood_user.createtime, Column#239)->Column#247, timestampdiff(DAY, smart_car_prod.qood_user.createtime, 2023-03-25 16:25:43)->Column#248, timestampdiff(DAY, Column#242, 2023-03-25 16:25:43)->Column#249, Column#243, Column#244"
"  └─HashAgg_19",609983.41,root,"group by:Column#275, funcs:max(Column#267)->Column#239, funcs:sum(Column#268)->Column#240, funcs:count(Column#269)->Column#241, funcs:max(Column#270)->Column#242, funcs:count(Column#271)->Column#243, funcs:sum(Column#272)->Column#244, funcs:firstrow(Column#273)->smart_car_prod.qood_user.id, funcs:firstrow(Column#274)->smart_car_prod.qood_user.createtime"
"    └─Projection_178",45142377.95,root,"smart_car_prod.qood_order.updatetime, smart_car_prod.qood_order.orderprice, smart_car_prod.qood_order.id, smart_car_prod.login_info.logintime, case(and(eq(smart_car_prod.tb_activity_ticket.categoryid, 3), eq(smart_car_prod.tb_activity_ticket.state, 1)), smart_car_prod.tb_activity_ticket.id, )->Column#271, case(eq(smart_car_prod.qood_user_icoin_account_record.status, 1), smart_car_prod.qood_user_icoin_account_record.icoin, 0)->Column#272, smart_car_prod.qood_user.id, smart_car_prod.qood_user.createtime, smart_car_prod.qood_user.id"
"      └─HashJoin_33",45142377.95,root,"left outer join, equal:[eq(smart_car_prod.qood_user.id, smart_car_prod.login_info.qooduserid)]"
"        ├─TableReader_136(Build)",3794487.00,root,data:Selection_135
"        │ └─Selection_135",3794487.00,cop[tiflash],not(isnull(smart_car_prod.login_info.qooduserid))
"        │   └─TableFullScan_134",3794487.00,cop[tiflash],table:l,"keep order:false"
"        └─HashJoin_62(Probe)",7256870.74,root,"left outer join, equal:[eq(smart_car_prod.qood_user.id, smart_car_prod.qood_user_icoin_account_record.qooduserid)]"
"          ├─HashJoin_64(Build)",1684930.00,root,"left outer join, equal:[eq(smart_car_prod.qood_user.id, smart_car_prod.tb_activity_ticket.userid)]"
"          │ ├─Projection_65(Build)",609983.41,root,"smart_car_prod.qood_user.id, smart_car_prod.qood_user.createtime, smart_car_prod.qood_order.id, smart_car_prod.qood_order.orderprice, smart_car_prod.qood_order.updatetime"
"          │ │ └─IndexHashJoin_75",609983.41,root,"inner join, inner:TableReader_69, outer key:smart_car_prod.qood_order.qooduserid, inner key:smart_car_prod.qood_user.id, equal cond:eq(smart_car_prod.qood_order.qooduserid, smart_car_prod.qood_user.id)"
"          │ │   ├─TableReader_105(Build)",606489.02,root,data:Selection_104
"          │ │   │ └─Selection_104",606489.02,cop[tiflash],"eq(smart_car_prod.qood_order.orderstatus, 2), eq(smart_car_prod.qood_order.paystatus, 1), ne(smart_car_prod.qood_order.paymentproductid, ""beanrechargecard""), ne(smart_car_prod.qood_order.paymentproductid, ""icoinrechargecard"")"
"          │ │   │   └─TableFullScan_103",1970842.00,cop[tiflash],table:o,"keep order:false"
"          │ │   └─TableReader_69(Probe)",1.00,root,data:Selection_68
"          │ │     └─Selection_68",1.00,cop[tikv],"gt(smart_car_prod.qood_user.createtime, 2016-01-01 00:00:00.000000), lt(smart_car_prod.qood_user.createtime, 2023-04-01 00:00:00.000000)"
"          │ │       └─TableRangeScan_67",1.00,cop[tikv],table:a,"range: decided by [smart_car_prod.qood_order.qooduserid], keep order:false"
"          │ └─TableReader_126(Probe)",1684930.00,root,data:Selection_125
"          │   └─Selection_125",1684930.00,cop[tiflash],not(isnull(smart_car_prod.tb_activity_ticket.userid))
"          │     └─TableFullScan_124",1684930.00,cop[tiflash],table:t,"keep order:false"
"          └─TableReader_133(Probe)",3121656.00,root,data:TableFullScan_132
"            └─TableFullScan_132",3121656.00,cop[tiflash],table:i,"keep order:false"

This is from a different, lower-spec environment, though; the high-spec environment has had its OS reinstalled and hasn't been redeployed yet.

Most of the work, including the aggregation, still runs at the root (TiDB) level; very little is pushed down, which puts a lot of pressure on TiDB itself.
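That observation can be quantified directly from the plan text: count operators by their task column. A small sketch with naive parsing, keyed on the task field of the CSV-exported plan above (the function name and sample plan are mine, for illustration only):

```python
from collections import Counter

def task_breakdown(plan_csv: str) -> Counter:
    """Count plan operators by task type (root, cop[tiflash], cop[tikv], mpp[tiflash])."""
    counts = Counter()
    for line in plan_csv.splitlines():
        # Each operator row has the task as its own comma-delimited field.
        for task in ("cop[tiflash]", "cop[tikv]", "mpp[tiflash]", "root"):
            if f",{task}," in line:
                counts[task] += 1
                break
    return counts

# Abbreviated sample in the same shape as the exported plan above.
plan = '''Sort_16,609983.41,root,smart_car_prod.qood_user.id:desc
"  └─HashAgg_19",609983.41,root,"group by:..."
"    └─TableFullScan_134",3794487.00,cop[tiflash],table:l,"keep order:false"'''
print(task_breakdown(plan))
```

In the full plan above the joins and the HashAgg are all `root`, while only the scans and filters are `cop[tiflash]` and none are `mpp[tiflash]`, which is exactly the "little pushdown" pattern being described.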

Yes. Any good optimization suggestions?

Adjust the schema, optimize the indexes, and push computation down as much as possible, so the work runs in parallel and you get the speedup.

The data comes from an old system via DM, and changing the old system would be a big effort; that's why I said above that I'm giving up on this and switching to another approach, ha.