在dashboard中查询sql日志超时报错

【 TiDB 使用环境】生产环境
【 TiDB 版本】v4.0.4
【复现路径】
在页面上查询"慢查询",“SQL语句分析"时超时报错: error.api.other: invalid connection
【遇到的问题:问题现象及影响】
在页面上查询"慢查询”,"SQL语句分析"后, 转圈很长时间, 然后超时报错: error.api.other: invalid connection


观察到点击按钮后tidb节点cpu使用率升高
image

观察到tidb的错误日志: 好像直接被重启了

goroutine 584 [running]:
github.com/pingcap/tidb/planner/core.GetStatsInfo(0x0, 0x0, 0x0)
        /home/jenkins/agent/workspace/tidb_v4.0.4/go/src/github.com/pingcap/tidb/planner/core/util.go:220 +0x45
github.com/pingcap/tidb/util/expensivequery.genLogFields(0xdfa123b9f, 0xc002dcd550, 0x0, 0xc128aba8edf1ef63, 0x2172bd68fa)
        /home/jenkins/agent/workspace/tidb_v4.0.4/go/src/github.com/pingcap/tidb/util/expensivequery/expensivequery.go:112 +0x377
github.com/pingcap/tidb/util/expensivequery.logExpensiveQuery(0xdfa123b9f, 0xc002dcd550)
        /home/jenkins/agent/workspace/tidb_v4.0.4/go/src/github.com/pingcap/tidb/util/expensivequery/expensivequery.go:168 +0x41
github.com/pingcap/tidb/util/expensivequery.(*Handle).Run(0xc0009d1240)
        /home/jenkins/agent/workspace/tidb_v4.0.4/go/src/github.com/pingcap/tidb/util/expensivequery/expensivequery.go:67 +0x1d9
created by main.createServer
        /home/jenkins/agent/workspace/tidb_v4.0.4/go/src/github.com/pingcap/tidb/tidb-server/main.go:641 +0x1ab
panic: interface conversion: interface is nil, not core.Plan

goroutine 566 [running]:
github.com/pingcap/tidb/planner/core.GetStatsInfo(0x0, 0x0, 0x9)
        /home/jenkins/agent/workspace/tidb_v4.0.4/go/src/github.com/pingcap/tidb/planner/core/util.go:220 +0x45
github.com/pingcap/tidb/util/expensivequery.genLogFields(0xdf9d21bff, 0xc0009ee790, 0x0, 0xc128ac027a19bd74, 0x41dbe15d9d)
        /home/jenkins/agent/workspace/tidb_v4.0.4/go/src/github.com/pingcap/tidb/util/expensivequery/expensivequery.go:112 +0x377
github.com/pingcap/tidb/util/expensivequery.logExpensiveQuery(0xdf9d21bff, 0xc0009ee790)
        /home/jenkins/agent/workspace/tidb_v4.0.4/go/src/github.com/pingcap/tidb/util/expensivequery/expensivequery.go:168 +0x41
github.com/pingcap/tidb/util/expensivequery.(*Handle).Run(0xc001296900)
        /home/jenkins/agent/workspace/tidb_v4.0.4/go/src/github.com/pingcap/tidb/util/expensivequery/expensivequery.go:67 +0x1d9
created by main.createServer
        /home/jenkins/agent/workspace/tidb_v4.0.4/go/src/github.com/pingcap/tidb/tidb-server/main.go:641 +0x1ab
panic: interface conversion: interface is nil, not core.Plan

goroutine 453 [running]:
github.com/pingcap/tidb/planner/core.GetStatsInfo(0x0, 0x0, 0x9)
        /home/jenkins/agent/workspace/tidb_v4.0.4/go/src/github.com/pingcap/tidb/planner/core/util.go:220 +0x45
github.com/pingcap/tidb/util/expensivequery.genLogFields(0xdfbbd0340, 0xc003c42bb0, 0x0, 0xc128acf8763801fc, 0xd38998e6c7)
        /home/jenkins/agent/workspace/tidb_v4.0.4/go/src/github.com/pingcap/tidb/util/expensivequery/expensivequery.go:112 +0x377
github.com/pingcap/tidb/util/expensivequery.logExpensiveQuery(0xdfbbd0340, 0xc003c42bb0)
        /home/jenkins/agent/workspace/tidb_v4.0.4/go/src/github.com/pingcap/tidb/util/expensivequery/expensivequery.go:168 +0x41
github.com/pingcap/tidb/util/expensivequery.(*Handle).Run(0xc0011d6e60)
        /home/jenkins/agent/workspace/tidb_v4.0.4/go/src/github.com/pingcap/tidb/util/expensivequery/expensivequery.go:67 +0x1d9
created by main.createServer
        /home/jenkins/agent/workspace/tidb_v4.0.4/go/src/github.com/pingcap/tidb/tidb-server/main.go:641 +0x1ab
[2023/07/27 17:44:20.362 +08:00] [INFO] [region_cache.go:828] ["switch region leader to specific leader due to kv return NotLeader"] [regionID=96549] [currIdx=0] [leaderStoreID=2]
[2023/07/27 17:44:20.362 +08:00] [INFO] [region_cache.go:828] ["switch region leader to specific leader due to kv return NotLeader"] [regionID=86413] [currIdx=0] [leaderStoreID=2]
[2023/07/27 17:44:20.364 +08:00] [INFO] [region_cache.go:828] ["switch region leader to specific leader due to kv return NotLeader"] [regionID=92513] [currIdx=2] [leaderStoreID=1]
[2023/07/27 17:44:31.945 +08:00] [INFO] [set.go:207] ["set session var"] [conn=472] [name=sql_select_limit] [val=50000]
[2023/07/27 17:44:32.824 +08:00] [INFO] [coprocessor.go:902] ["[TIME_COP_PROCESS] resp_time:30.91468386s txnStartTS:443141597919182873 region_id:0 store_addr:172.16.12.161:10080"] [conn=905]
[2023/07/27 17:44:32.824 +08:00] [ERROR] [select_result.go:242] ["invalid cop task execution summaries length"] [expected=2] [received=0]
[2023/07/27 17:44:39.323 +08:00] [INFO] [set.go:207] ["set session var"] [conn=472] [name=sql_select_limit] [val=18446744073709551615]
[2023/07/27 17:44:46.967 +08:00] [INFO] [set.go:207] ["set session var"] [conn=472] [name=sql_select_limit] [val=50000]
[2023/07/27 17:44:49.337 +08:00] [INFO] [set.go:207] ["set session var"] [conn=472] [name=sql_select_limit] [val=18446744073709551615]
[2023/07/27 17:45:17.141 +08:00] [INFO] [printer.go:33] ["Welcome to TiDB."] ["Release Version"=v4.0.4] [Edition=Community] ["Git Commit Hash"=c61fc7247e9f6bc773761946d5b5294d3f2699a5] ["Git Branch"=heads/refs/tags/v4.0.4] ["UTC Build Time"="2020-07-31 07:50:19"] [GoVersion=go1.13] ["Race Enabled"=false] ["Check Table Before Drop"=false] ["TiKV Min Version"=v3.0.0-60965b006877ca7234adaced7890d7b029ed1306]
[2023/07/27 17:45:17.142 +08:00] [INFO] [printer.go:47] ["loaded config"] [config="{\"host\":\"0.0.0.0\",\"advertise-address\":\"172.16.14.3\",\"port\":4000,\"cors\":\"\",\"store\":\"tikv\",\"path\":\"172.16.14.10:2379,172.16.12.162:2379,172.16.15.227:2379\",\"socket\":\"\",\"lease\":\"45s\",\"run-ddl\":true,\"split-table\":true,\"token-limit\":1000,\"oom-use-tmp-storage\":true,\"tmp-storage-path\":\"/alidata/tidb/tmp\",\"oom-action\":\"log\",\"mem-quota-query\":1073741824,\"tmp-storage-quota\":-1,\"enable-streaming\":false,\"enable-batch-dml\":false,\"lower-case-table-names\":2,\"server-version\":\"\",\"log\":{\"level\":\"info\",\"format\":\"text\",\"disable-timestamp\":null,\"enable-timestamp\":null,\"disable-error-stack\":null,\"enable-error-stack\":null,\"file\":{\"filename\":\"/alidata/tidb/deploy/tidb-4000/log/tidb.log\",\"max-size\":300,\"max-days\":0,\"max-backups\":0},\"enable-slow-log\":true,\"slow-query-file\":\"/alidata/tidb/deploy/tidb-4000/log/tidb_slow_query.log\",\"slow-threshold\":300,\"expensive-threshold\":10000,\"query-log-max-len\":4096,\"record-plan-in-slow-log\":1},\"security\":{\"skip-grant-table\":false,\"ssl-ca\":\"\",\"ssl-cert\":\"\",\"ssl-key\":\"\",\"require-secure-transport\":false,\"cluster-ssl-ca\":\"\",\"cluster-ssl-cert\":\"\",\"cluster-ssl-key\":\"\",\"cluster-verify-cn\":null},\"status\":{\"status-host\":\"0.0.0.0\",\"metrics-addr\":\"\",\"status-port\":10080,\"metrics-interval\":15,\"report-status\":true,\"record-db-qps\":false},\"performance\":{\"max-procs\":0,\"max-memory\":0,\"stats-lease\":\"3s\",\"stmt-count-limit\":5000,\"feedback-probability\":0.05,\"query-feedback-limit\":1024,\"pseudo-estimate-ratio\":0.8,\"force-priority\":\"NO_PRIORITY\",\"bind-info-lease\":\"3s\",\"txn-total-size-limit\":2097152000,\"tcp-keep-alive\":true,\"cross-join\":true,\"run-auto-analyze\":true,\"agg-push-down-join\":false,\"committer-concurrency\":16,\"max-txn-ttl\":600000},\"prepared-plan-cache\":{\"enabled\":false,\"capacity\":100,\"memory-guard-ratio\":0.1},\"opentracing\":{\"enable\":false,\"rpc-metrics\":false,\"sampler\":{\"type\":\"const\",\"param\":1,\"sampling-server-url\":\"\",\"max-operations\":0,\"sampling-refresh-interval\":0},\"reporter\":{\"queue-size\":0,\"buffer-flush-interval\":0,\"log-spans\":false,\"local-agent-host-port\":\"\"}},\"proxy-protocol\":{\"networks\":\"\",\"header-timeout\":5},\"tikv-client\":{\"grpc-connection-count\":4,\"grpc-keepalive-time\":10,\"grpc-keepalive-timeout\":3,\"commit-timeout\":\"41s\",\"max-batch-size\":128,\"overload-threshold\":200,\"max-batch-wait-time\":0,\"batch-wait-size\":8,\"enable-chunk-rpc\":true,\"region-cache-ttl\":600,\"store-limit\":0,\"store-liveness-timeout\":\"5s\",\"copr-cache\":{\"enable\":false,\"capacity-mb\":1000,\"admission-max-result-mb\":10,\"admission-min-process-ms\":5}},\"binlog\":{\"enable\":false,\"ignore-error\":false,\"write-timeout\":\"15s\",\"binlog-socket\":\"\",\"strategy\":\"range\"},\"compatible-kill-query\":false,\"plugin\":{\"dir\":\"\",\"load\":\"\"},\"pessimistic-txn\":{\"enable\":true,\"max-retry-count\":256},\"check-mb4-value-in-utf8\":true,\"max-index-length\":3072,\"alter-primary-key\":false,\"treat-old-version-utf8-as-utf8mb4\":true,\"enable-table-lock\":false,\"delay-clean-table-lock\":0,\"split-region-max-num\":1000,\"stmt-summary\":{\"enable\":true,\"enable-internal-query\":false,\"max-stmt-count\":200,\"max-sql-length\":4096,\"refresh-interval\":1800,\"history-size\":24},\"repair-mode\":false,\"repair-table-list\":[],\"isolation-read\":{\"engines\":[\"tikv\",\"tiflash\",\"tidb\"]},\"max-server-connections\":0,\"new_collations_enabled_on_first_bootstrap\":false,\"experimental\":{\"allow-expression-index\":false},\"enable-collect-execution-info\":true,\"skip-register-to-dashboard\":false,\"enable-telemetry\":true}"]
[2023/07/27 17:45:17.142 +08:00] [INFO] [main.go:341] ["disable Prometheus push client"]
[2023/07/27 17:45:17.142 +08:00] [INFO] [store.go:68] ["new store"] [path=tikv://172.16.14.10:2379,172.16.12.162:2379,172.16.15.227:2379]
[2023/07/27 17:45:17.143 +08:00] [INFO] [client.go:149] ["[pd] create pd client with endpoints"] [pd-address="[172.16.14.10:2379,172.16.12.162:2379,172.16.15.227:2379]"]
[2023/07/27 17:45:17.143 +08:00] [INFO] [systime_mon.go:25] ["start system time monitor"]
[2023/07/27 17:45:17.146 +08:00] [INFO] [base_client.go:226] ["[pd] update member urls"] [old-urls="[http://172.16.14.10:2379,http://172.16.12.162:2379,http://172.16.15.227:2379]"] [new-urls="[http://172.16.12.162:2379,http://172.16.14.10:2379,http://172.16.15.227:2379]"]
[2023/07/27 17:45:17.146 +08:00] [INFO] [base_client.go:242] ["[pd] switch leader"] [new-leader=http://172.16.14.10:2379] [old-leader=]
[2023/07/27 17:45:17.146 +08:00] [INFO] [base_client.go:92] ["[pd] init cluster id"] [cluster-id=7222620707440961173]
[2023/07/27 17:45:17.146 +08:00] [INFO] [store.go:74] ["new store with retry success"]
[2023/07/27 17:45:17.155 +08:00] [INFO] [tidb.go:71] ["new domain"] [store=tikv-7222620707440961173] ["ddl lease"=45s] ["stats lease"=3s]
[2023/07/27 17:45:17.159 +08:00] [INFO] [ddl.go:316] ["[ddl] start DDL"] [ID=7b0dd95d-4435-40a1-be17-164fd805d90d] [runWorker=true]
[2023/07/27 17:45:17.159 +08:00] [INFO] [manager.go:185] ["start campaign owner"] [ownerInfo="[ddl] /tidb/ddl/fg/owner"]
[2023/07/27 17:45:17.162 +08:00] [INFO] [ddl.go:305] ["[ddl] start delRangeManager OK"] ["is a emulator"=false]
[2023/07/27 17:45:17.163 +08:00] [INFO] [ddl_worker.go:130] ["[ddl] start DDL worker"] [worker="worker 1, tp general"]
[2023/07/27 17:45:17.163 +08:00] [INFO] [ddl_worker.go:130] ["[ddl] start DDL worker"] [worker="worker 2, tp add index"]
[2023/07/27 17:45:17.334 +08:00] [INFO] [domain.go:145] ["full load InfoSchema success"] [usedSchemaVersion=0] [neededSchemaVersion=1045] ["start time"=134.432038ms]
[2023/07/27 17:45:17.337 +08:00] [INFO] [domain.go:369] ["full load and reset schema validator"]
[2023/07/27 17:45:17.357 +08:00] [INFO] [manager.go:185] ["start campaign owner"] [ownerInfo="[bindinfo] /tidb/bindinfo/owner"]
[2023/07/27 17:45:17.393 +08:00] [INFO] [manager.go:185] ["start campaign owner"] [ownerInfo="[stats] /tidb/stats/owner"]
[2023/07/27 17:45:17.393 +08:00] [INFO] [manager.go:185] ["start campaign owner"] [ownerInfo="[telemetry] /tidb/telemetry/owner"]
[2023/07/27 17:45:17.399 +08:00] [INFO] [gc_worker.go:163] ["[gc worker] start"] [uuid=6265af414dc0010]
[2023/07/27 17:45:17.407 +08:00] [INFO] [server.go:235] ["server is running MySQL protocol"] [addr=0.0.0.0:4000]
[2023/07/27 17:45:17.407 +08:00] [INFO] [http_status.go:81] ["for status and metrics report"] ["listening on addr"=0.0.0.0:10080]
[2023/07/27 17:45:17.859 +08:00] [INFO] [coprocessor.go:902] ["[TIME_COP_PROCESS] resp_time:456.691919ms txnStartTS:443141617711054860 region_id:95789 store_addr:172.16.14.4:20160 kv_process_ms:223 scan_total_write:3355 scan_processed_write:2749 scan_total_data:3145 scan_processed_data:2492 scan_total_lock:1 scan_processed_lock:0"]
[2023/07/27 17:45:18.451 +08:00] [INFO] [domain.go:1070] ["init stats info time"] ["take time"=1.05769285s]
[2023/07/27 17:45:49.537 +08:00] [INFO] [server.go:388] ["new connection"] [conn=33] [remoteAddr=172.16.15.228:50160]
[2023/07/27 17:45:49.545 +08:00] [INFO] [set.go:207] ["set session var"] [conn=33] [name=character_set_results] [val=NULL]
[2023/07/27 17:45:52.538 +08:00] [INFO] [server.go:388] ["new connection"] [conn=36] [remoteAddr=172.16.14.13:54686]
[2023/07/27 17:45:52.544 +08:00] [INFO] [set.go:207] ["set session var"] [conn=36] [name=character_set_results] [val=NULL]

【资源配置】
【附件:截图/日志/监控】

看一下你的slow日志,文件很大吗?

集群版本好低啊。。。

不大也不多, 每个文件300M
image

因为用ansible部署的, 有一套涉及kafka的同步链路, 而且产品稳定, 开发团队比较保守, 所以倾向于不升级

另一套配置和版本一模一样的tidb集群就秒加载, 所以很疑惑

mv 出去几个试试,留一个

  1. 你这一个tidb节点就300M*17/1024=4.98G 将近5G
    如果2个tidb,估计每个节点慢日志大小应该差不多,估算得10G
    手动清理下时间比较久远的慢日志吧,留着没什么大用~

  2. 平时该优化得优化

刚试了下把所有的节点的log文件挪的都只剩一个很小的, 还是会同样的报错现象

所有的tidb节点都要清理下~

看看日志里有没有相关的报错,以及监控

是的, 我已经把三个tidb-server上的日志都清理了

现在情况依然是, 我点一次查询按钮, 就会有一个随机的tidb实例报错:

[2023/07/28 10:48:36.558 +08:00] [INFO] [region_cache.go:828] ["switch region leader to specific leader due to kv return NotLeader"] [regionID=95789] [currIdx=2] [leaderStoreID=3]
[2023/07/28 10:48:36.560 +08:00] [INFO] [region_cache.go:828] ["switch region leader to specific leader due to kv return NotLeader"] [regionID=96185] [currIdx=2] [leaderStoreID=3]
[2023/07/28 10:48:37.209 +08:00] [INFO] [region_cache.go:828] ["switch region leader to specific leader due to kv return NotLeader"] [regionID=93397] [currIdx=2] [leaderStoreID=3]
[2023/07/28 10:48:37.229 +08:00] [INFO] [region_cache.go:828] ["switch region leader to specific leader due to kv return NotLeader"] [regionID=54861] [currIdx=1] [leaderStoreID=2]
[2023/07/28 10:48:49.108 +08:00] [INFO] [server.go:388] ["new connection"] [conn=971] [remoteAddr=172.16.12.162:54160]
[2023/07/28 10:48:50.120 +08:00] [INFO] [coprocessor.go:902] ["[TIME_COP_PROCESS] resp_time:995.561644ms txnStartTS:443157716432846849 region_id:0 store_addr:172.16.15.225:10080"] [conn=971]
[2023/07/28 10:48:50.120 +08:00] [ERROR] [select_result.go:242] ["invalid cop task execution summaries length"] [expected=2] [received=0]
[2023/07/28 10:48:51.618 +08:00] [INFO] [region_cache.go:828] ["switch region leader to specific leader due to kv return NotLeader"] [regionID=400] [currIdx=2] [leaderStoreID=3]
[2023/07/28 10:48:55.276 +08:00] [INFO] [coprocessor.go:902] ["[TIME_COP_PROCESS] resp_time:6.151780699s txnStartTS:443157716432846849 region_id:0 store_addr:172.16.14.3:10080"] [conn=971]
[2023/07/28 10:48:55.276 +08:00] [ERROR] [select_result.go:242] ["invalid cop task execution summaries length"] [expected=2] [received=0]
[2023/07/28 10:49:12.717 +08:00] [INFO] [region_cache.go:828] ["switch region leader to specific leader due to kv return NotLeader"] [regionID=54861] [currIdx=2] [leaderStoreID=3]
[2023/07/28 10:50:04.501 +08:00] [INFO] [printer.go:33] ["Welcome to TiDB."] ["Release Version"=v4.0.4] [Edition=Community] ["Git Commit Hash"=c61fc7247e9f6bc773761946d5b5294d3f2699a5] ["Git Branch"=heads/refs/tags/v4.0.4] ["UTC Build Time"="2020-07-31 07:50:19"] [GoVersion=go1.13] ["Race Enabled"=false] ["Check Table Before Drop"=false] ["TiKV Min Version"=v3.0.0-60965b006877ca7234adaced7890d7b029ed1306]

然后该tidb-server重启, 另外两个tidb-server未发现这个错误并且不重启

每次都是随机一个tidb-server报错重启
所以上午已经轮流被重启好多次了
image

监控上看tidb-server所有节点cpu使用率暴增, 但有时候所有都暴增, 有时一两个, 就算结合重启现象, 也看不出这个规律, 没有其他告警和监控异常

感觉像是碰bug了 :smiling_imp:
有个类似的帖子,不过他的版本是5.X了,并且tidb也没有重启。。。
invalid cop task execution summaries length

感觉不太像…我尝试晚上业务低峰期重启一下pd组件, 然后把日志debug开一下

先列出如下监控截图
1、tidb 监控面板的 连接数
2、tidb 实例的cpu 、内存使用率

好的, 我晚上业务高峰期过了集中做几次实验后贴上来, 白天干扰信息有点多

重启pd后已恢复,所有现象都未再复现,这是之前出现问题时的问题点10:48分的监控