Queries fail after a TiFlash pod restart in a v5.0.1 environment

[TiDB version]:
5.0.1
[TiDB Operator version]:
1.1.4

After the tiflash pod restarted (the restart was triggered automatically by redeploying with changed parameters in tidb-cluster.yaml), the pod status is normal:


SQL query error:

tiflash log:

Cluster status:
/ # ./pd-ctl store
{
  "count": 5,
  "stores": [
    {
      "store": {
        "id": 100,
        "address": "rface-tidb-tiflash-0.rface-tidb-tiflash-peer.rface-infra.svc:3930",
        "labels": [
          {
            "key": "engine",
            "value": "tiflash"
          }
        ],
        "version": "v5.0.1",
        "peer_address": "rface-tidb-tiflash-0.rface-tidb-tiflash-peer.rface-infra.svc:20170",
        "status_address": "rface-tidb-tiflash-0.rface-tidb-tiflash-peer.rface-infra.svc:20292",
        "git_hash": "1821cf655bc90e1fab6e6154cfe994c19c75d377",
        "start_timestamp": 1622779091,
        "deploy_path": "/tiflash",
        "last_heartbeat": 1622798727267582255,
        "state_name": "Up"
      },
      "status": {
        "capacity": "918.7GiB",
        "available": "851.8GiB",
        "used_size": "66.91GiB",
        "leader_count": 0,
        "leader_weight": 1,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 2686,
        "region_weight": 1,
        "region_score": 378952.39499776746,
        "region_size": 285965,
        "start_ts": "2021-06-04T11:58:11+08:00",
        "last_heartbeat_ts": "2021-06-04T17:25:27.267582255+08:00",
        "uptime": "5h27m16.267582255s"
      }
    },
    {
      "store": {
        "id": 101,
        "address": "rface-tidb-tiflash-1.rface-tidb-tiflash-peer.rface-infra.svc:3930",
        "labels": [
          {
            "key": "engine",
            "value": "tiflash"
          }
        ],
        "version": "v5.0.1",
        "peer_address": "rface-tidb-tiflash-1.rface-tidb-tiflash-peer.rface-infra.svc:20170",
        "status_address": "rface-tidb-tiflash-1.rface-tidb-tiflash-peer.rface-infra.svc:20292",
        "git_hash": "1821cf655bc90e1fab6e6154cfe994c19c75d377",
        "start_timestamp": 1622779043,
        "deploy_path": "/tiflash",
        "last_heartbeat": 1622798727904168244,
        "state_name": "Up"
      },
      "status": {
        "capacity": "1.746TiB",
        "available": "1.672TiB",
        "used_size": "75.21GiB",
        "leader_count": 0,
        "leader_weight": 1,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 2991,
        "region_weight": 1,
        "region_score": 371965.0928143375,
        "region_size": 319582,
        "start_ts": "2021-06-04T11:57:23+08:00",
        "last_heartbeat_ts": "2021-06-04T17:25:27.904168244+08:00",
        "uptime": "5h28m4.904168244s"
      }
    },
    {
      "store": {
        "id": 1,
        "address": "rface-tidb-tikv-2.rface-tidb-tikv-peer.rface-infra.svc:20160",
        "version": "5.0.1",
        "status_address": "rface-tidb-tikv-2.rface-tidb-tikv-peer.rface-infra.svc:20180",
        "git_hash": "e26389a278116b2f61addfa9f15ca25ecf38bc80",
        "start_timestamp": 1622787941,
        "deploy_path": "/",
        "last_heartbeat": 1622798733650881784,
        "state_name": "Up"
      },
      "status": {
        "capacity": "434.5GiB",
        "available": "141.9GiB",
        "used_size": "191GiB",
        "leader_count": 2249,
        "leader_weight": 1,
        "leader_score": 2249,
        "leader_size": 242041,
        "region_count": 6763,
        "region_weight": 1,
        "region_score": 1679975.1890792924,
        "region_size": 704261,
        "start_ts": "2021-06-04T14:25:41+08:00",
        "last_heartbeat_ts": "2021-06-04T17:25:33.650881784+08:00",
        "uptime": "2h59m52.650881784s"
      }
    },
    {
      "store": {
        "id": 4,
        "address": "rface-tidb-tikv-0.rface-tidb-tikv-peer.rface-infra.svc:20160",
        "version": "5.0.1",
        "status_address": "rface-tidb-tikv-0.rface-tidb-tikv-peer.rface-infra.svc:20180",
        "git_hash": "e26389a278116b2f61addfa9f15ca25ecf38bc80",
        "start_timestamp": 1622787832,
        "deploy_path": "/",
        "last_heartbeat": 1622798727654279608,
        "state_name": "Up"
      },
      "status": {
        "capacity": "1.746TiB",
        "available": "1.22TiB",
        "used_size": "179.1GiB",
        "leader_count": 2256,
        "leader_weight": 1,
        "leader_score": 2256,
        "leader_size": 230685,
        "region_count": 6763,
        "region_weight": 1,
        "region_score": 844169.4701096346,
        "region_size": 704261,
        "start_ts": "2021-06-04T14:23:52+08:00",
        "last_heartbeat_ts": "2021-06-04T17:25:27.654279608+08:00",
        "uptime": "3h1m35.654279608s"
      }
    },
    {
      "store": {
        "id": 5,
        "address": "rface-tidb-tikv-1.rface-tidb-tikv-peer.rface-infra.svc:20160",
        "version": "5.0.1",
        "status_address": "rface-tidb-tikv-1.rface-tidb-tikv-peer.rface-infra.svc:20180",
        "git_hash": "e26389a278116b2f61addfa9f15ca25ecf38bc80",
        "start_timestamp": 1622787889,
        "deploy_path": "/",
        "last_heartbeat": 1622798732136948894,
        "state_name": "Up"
      },
      "status": {
        "capacity": "1.746TiB",
        "available": "1.22TiB",
        "used_size": "179.8GiB",
        "leader_count": 2258,
        "leader_weight": 1,
        "leader_score": 2258,
        "leader_size": 231535,
        "region_count": 6763,
        "region_weight": 1,
        "region_score": 844169.4701096346,
        "region_size": 704261,
        "start_ts": "2021-06-04T14:24:49+08:00",
        "last_heartbeat_ts": "2021-06-04T17:25:32.136948894+08:00",
        "uptime": "3h0m43.136948894s"
      }
    }
  ]
}
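
To compare when each store last started, the `pd-ctl store` JSON can be summarized with a short script. This is a minimal sketch, not part of pd-ctl: `summarize_stores` is a hypothetical helper name, the sample holds only two of the five stores with the fields the script reads, and treating a store without an `engine` label as TiKV is an assumption based on the listing above.

```python
import json
from datetime import datetime

# Trimmed sample: two stores from the listing, only the fields used below.
SAMPLE = """
{
  "count": 2,
  "stores": [
    {"store": {"id": 100,
               "labels": [{"key": "engine", "value": "tiflash"}],
               "state_name": "Up"},
     "status": {"start_ts": "2021-06-04T11:58:11+08:00",
                "uptime": "5h27m16.267582255s"}},
    {"store": {"id": 1, "state_name": "Up"},
     "status": {"start_ts": "2021-06-04T14:25:41+08:00",
                "uptime": "2h59m52.650881784s"}}
  ]
}
"""


def summarize_stores(pd_ctl_json):
    """Return one summary dict per store, ordered by start time."""
    data = json.loads(pd_ctl_json)
    rows = []
    for item in data["stores"]:
        store, status = item["store"], item["status"]
        labels = {l["key"]: l["value"] for l in store.get("labels", [])}
        rows.append({
            "id": store["id"],
            # Assumption: a store with no "engine" label is a TiKV store.
            "engine": labels.get("engine", "tikv"),
            "state": store["state_name"],
            "start_ts": datetime.fromisoformat(status["start_ts"]),
            "uptime": status["uptime"],
        })
    return sorted(rows, key=lambda r: r["start_ts"])


if __name__ == "__main__":
    for row in summarize_stores(SAMPLE):
        print(row["id"], row["engine"], row["state"],
              row["start_ts"].isoformat(), row["uptime"])
```

Sorting by `start_ts` makes it easy to see which components restarted and in what order, which helps line up restarts with the timestamps of the query errors.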
tidb log:
tiflash_tidb.log (2.2 MB)

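This looks like a TiFlash OOM. First check whether TiFlash has restarted. Also consider upgrading to … (the target version is not given). You can look at the logs under /var/dmsg/ inside the corresponding pod.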
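
One way to check for a restart or an OOM kill is to read the pod status that Kubernetes records. The sketch below assumes the output of `kubectl get pod <pod> -n <namespace> -o json` is available as a string and reads the standard `status.containerStatuses[].restartCount` and `lastState.terminated.reason` fields. `restart_report` is a hypothetical helper name, and the sample JSON is made up for illustration, not taken from this cluster. A pod recreated by a rolling update starts with a fresh restart count, so this only shows restarts that happened after the pod was recreated.

```python
import json

# Hypothetical, trimmed pod status used only to exercise the function.
SAMPLE_POD = """
{
  "status": {
    "containerStatuses": [
      {"name": "tiflash",
       "restartCount": 2,
       "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}},
      {"name": "serverlog",
       "restartCount": 0,
       "lastState": {}}
    ]
  }
}
"""


def restart_report(pod_json):
    """Summarize restarts per container from `kubectl get pod -o json`."""
    pod = json.loads(pod_json)
    report = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        # lastState is empty when the container has never been restarted.
        terminated = cs.get("lastState", {}).get("terminated") or {}
        reason = terminated.get("reason")
        report.append({
            "container": cs["name"],
            "restarts": cs.get("restartCount", 0),
            "last_reason": reason,
            "oom_killed": reason == "OOMKilled",
        })
    return report


if __name__ == "__main__":
    for entry in restart_report(SAMPLE_POD):
        print(entry)
```

If `oom_killed` is true for the TiFlash container, the memory limit in the cluster spec and the query workload are the next things to review.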

[2021/06/04 12:25:33.246 +08:00] [INFO] [region_cache.go:740] ["switch region tiflash peer to next due to send request fail"] [conn=30021] [current="region ID: 8010, meta: id:8010 start_key:\"t\\200\\000\\000\\000\\000\\000\\000H_r\\200\\000\\000\\0051\\355\\241\\027\" end_key:\"t\\200\\000\\000\\000\\000\\000\\000H_r\\200\\000\\000\\0051\\356\\361\\336\" region_epoch:<conf_ver:6 version:2005 > peers:<id:8011 store_id:1 > peers:<id:8012 store_id:5 > peers:<id:8013 store_id:4 > peers:<id:22241 store_id:101 role:Learner > , peer: id:22241 store_id:101 role:Learner , addr: rface-tidb-tiflash-1.rface-tidb-tiflash-peer.rface-infra.svc:3930, idx: 0, reqStoreType: TiFlashOnly, runStoreType: tiflash"] [needReload=true] [error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 192.168.96.99:3930: connect: connection refused\""]
[2021/06/04 12:25:33.246 +08:00] [INFO] [region_cache.go:740] ["switch region tiflash peer to next due to send request fail"] [conn=30021] [current="region ID: 8018, meta: id:8018 start_key:\"t\\200\\000\\000\\000\\000\\000\\000H_r\\200\\000\\000\\0051\\360K\\236\" end_key:\"t\\200\\000\\000\\000\\000\\000\\000H_r\\200\\000\\000\\0051\\361\\243l\" region_epoch:<conf_ver:6 version:2005 > peers:<id:8019 store_id:1 > peers:<id:8020 store_id:5 > peers:<id:8021 store_id:4 > peers:<id:22215 store_id:101 role:Learner > , peer: id:22215 store_id:101 role:Learner , addr: rface-tidb-tiflash-1.rface-tidb-tiflash-peer.rface-infra.svc:3930, idx: 0, reqStoreType: TiFlashOnly, runStoreType: tiflash"] [needReload=true] [error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 192.168.96.99:3930: connect: connection refused\""]
[2021/06/04 12:25:33.246 +08:00] [INFO] [region_cache.go:740] ["switch region tiflash peer to next due to send request fail"] [conn=30021] [current="region ID: 8026, meta: id:8026 start_key:\"t\\200\\000\\000\\000\\000\\000\\000H_r\\200\\000\\000\\0051\\362\\375t\" end_key:\"t\\200\\000\\000\\000\\000\\000\\000I\" region_epoch:<conf_ver:6 version:2008 > peers:<id:8027 store_id:1 > peers:<id:8028 store_id:5 > peers:<id:8029 store_id:4 > peers:<id:22181 store_id:101 role:Learner > , peer: id:22181 store_id:101 role:Learner , addr: rface-tidb-tiflash-1.rface-tidb-tiflash-peer.rface-infra.svc:3930, idx: 0, reqStoreType: TiFlashOnly, runStoreType: tiflash"] [needReload=true] [error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 192.168.96.99:3930: connect: connection refused\""]
[2021/06/04 12:25:33.338 +08:00] [ERROR] [mpp.go:224] ["mpp dispatch meet io error"] [error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 192.168.96.99:3930: connect: connection refused\""]
[2021/06/04 12:25:33.339 +08:00] [ERROR] [mpp.go:310] ["establish mpp connection meet error"] [error="rpc error: code = Canceled desc = context canceled"]
[2021/06/04 12:25:33.341 +08:00] [ERROR] [mpp.go:288] ["cancel task error: "] [error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 192.168.96.99:3930: connect: connection refused\""] [" for query id "=425402232348606470] [" on addr "=rface-tidb-tiflash-1.rface-tidb-tiflash-peer.rface-infra.svc:3930]
[2021/06/04 12:25:33.341 +08:00] [INFO] [conn.go:812] ["command dispatched failed"] [conn=30021] [connInfo="id:30021, addr:192.168.5.0:34688 status:10, collation:utf8_general_ci, user:root"] [command=Query] [status="inTxn:0, autocommit:1"] [sql="SELECT Count(1) AS c FROM   real_items WHERE  1 = 1        AND distroy is NULL        AND base = 1"] [txn_mode=PESSIMISTIC] [err="[tikv:9012]TiFlash server timeout\
github.com/pingcap/errors.AddStack\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174\
github.com/pingcap/errors.Trace\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/juju_adaptor.go:15\
github.com/pingcap/tidb/store/copr.(*mppIterator).Next\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/store/copr/mpp.go:426\
github.com/pingcap/tidb/distsql.(*selectResult).fetchResp\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/distsql/select_result.go:144\
github.com/pingcap/tidb/distsql.(*selectResult).Next\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/distsql/select_result.go:208\
github.com/pingcap/tidb/executor.(*MPPGather).Next\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/executor/mpp_gather.go:134\
github.com/pingcap/tidb/executor.Next\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/executor/executor.go:277\
github.com/pingcap/tidb/executor.(*HashAggExec).fetchChildData\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/executor/aggregate.go:730\
runtime.goexit\
\t/usr/local/go/src/runtime/asm_amd64.s:1357"]
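
With a 2.2 MB TiDB log, it helps to count the `connection refused` errors per target address and note when they start and stop. The following is a minimal sketch under the assumption that every log line starts with a `[YYYY/MM/DD HH:MM:SS.mmm +ZZ:ZZ]` timestamp, as in the excerpt above. `refused_by_target` is a hypothetical helper name, and the sample lines are copied from the excerpt.

```python
import re

TS_RE = re.compile(
    r"^\[(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3} [+-]\d{2}:\d{2})\]")
REFUSED_RE = re.compile(
    r"dial tcp ([0-9.]+:\d+): connect: connection refused")

# Three lines taken from the log excerpt (two refused, one unrelated).
SAMPLE = [
    r'[2021/06/04 12:25:33.338 +08:00] [ERROR] [mpp.go:224] ["mpp dispatch meet io error"] [error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 192.168.96.99:3930: connect: connection refused\""]',
    r'[2021/06/04 12:25:33.339 +08:00] [ERROR] [mpp.go:310] ["establish mpp connection meet error"] [error="rpc error: code = Canceled desc = context canceled"]',
    r'[2021/06/04 12:25:33.341 +08:00] [ERROR] [mpp.go:288] ["cancel task error: "] [error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 192.168.96.99:3930: connect: connection refused\""]',
]


def refused_by_target(lines):
    """Count 'connection refused' errors per ip:port, with first/last time."""
    stats = {}
    for line in lines:
        ts = TS_RE.match(line)
        hit = REFUSED_RE.search(line)
        if not (ts and hit):
            continue  # not a timestamped 'connection refused' line
        entry = stats.setdefault(
            hit.group(1),
            {"count": 0, "first": ts.group(1), "last": ts.group(1)})
        entry["count"] += 1
        entry["last"] = ts.group(1)
    return stats


if __name__ == "__main__":
    for target, entry in refused_by_target(SAMPLE).items():
        print(target, entry)
```

To run it on the real file, pass an open file object instead of `SAMPLE`, for example `refused_by_target(open("tiflash_tidb.log", encoding="utf-8"))`. The first and last timestamps can then be compared with the store start times reported by PD.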