tidb 连接自动断开

为提高效率,提问时请提供以下信息,问题描述清晰可优先响应。

我们这边的监控,发现,tidb4.0集群 会自动在某个特定时间点 杀连接。


上图 可见: 只杀一部分,不会全部杀掉。当时,系统 整体正常,tidb server节点 没有oom

原来在老的tidb 2.0集群 没有这个问题,迁移到tidb4.0后,就会自动杀

想知道,tidb 4.0 有什么默认参数控制这个连接嘛?

  1. 麻烦上传各个 tidb.log 10:01 包含这个时间段 10分钟的日志
  2. 同时上传操作系统这段时间段的 message 日志

在 message 里找到了问题,tidb server 进程 退出 重启。 这个不是以往的OOM

Sep 20 09:50:01 prod-tidb4-e001 systemd: Started Session 16214 of user root.
Sep 20 10:00:01 prod-tidb4-e001 systemd: Started Session 16215 of user root.
Sep 20 10:01:01 prod-tidb4-e001 systemd: Started Session 16216 of user root.
Sep 20 10:01:01 prod-tidb4-e001 systemd: tidb-4000.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Sep 20 10:01:01 prod-tidb4-e001 systemd: Unit tidb-4000.service entered failed state.
Sep 20 10:01:01 prod-tidb4-e001 systemd: tidb-4000.service failed.
Sep 20 10:01:16 prod-tidb4-e001 systemd: tidb-4000.service holdoff time over, scheduling restart.
Sep 20 10:01:16 prod-tidb4-e001 systemd: Stopped tidb-4000 service.
Sep 20 10:01:16 prod-tidb4-e001 systemd: Started tidb-4000 service.
Sep 20 10:10:01 prod-tidb4-e001 systemd: Started Session 16217 of user root.
Sep 20 10:20:01 prod-tidb4-e001 systemd: Started Session 16218 of user root.

附上重启前后的log

[2020/09/23 10:01:02.285 +08:00] [INFO] [coprocessor.go:871] ["[TIME_COP_WAIT] resp_time:488.887264ms txnStartTS:419647036389589002 region_id:84481 store_addr:172.xx.xx.88:20160 kv_wait_ms:474"] [conn=290391]
[2020/09/23 10:01:02.286 +08:00] [INFO] [coprocessor.go:871] ["[TIME_COP_WAIT] resp_time:521.703821ms txnStartTS:419647036402696227 region_id:84481 store_addr:172.xx.xx.88:20160 kv_wait_ms:506"] [conn=321933]
[2020/09/23 10:01:02.289 +08:00] [INFO] [coprocessor.go:871] ["[TIME_COP_WAIT] resp_time:481.530203ms txnStartTS:419647036402696227 region_id:84481 store_addr:172.xx.xx.88:20160 kv_wait_ms:466"] [conn=321933]
[2020/09/23 10:01:02.290 +08:00] [INFO] [server.go:388] ["new connection"] [conn=322497] [remoteAddr=172.xx.xx.121:53370]
[2020/09/23 10:01:02.290 +08:00] [INFO] [set.go:207] ["set session var"] [conn=322497] [name=max_execution_time] [val=120000]
[2020/09/23 10:01:02.290 +08:00] [INFO] [coprocessor.go:871] ["[TIME_COP_WAIT] resp_time:478.790724ms txnStartTS:419647036402696227 region_id:84481 store_addr:172.xx.xx.88:20160 kv_wait_ms:463"] [conn=321933]
[2020/09/23 10:01:02.295 +08:00] [INFO] [coprocessor.go:871] ["[TIME_COP_WAIT] resp_time:499.70029ms txnStartTS:419647036402696227 region_id:84481 store_addr:172.xx.xx.88:20160 kv_wait_ms:477"] [conn=321933]
[2020/09/23 10:01:02.349 +08:00] [WARN] [expensivequery.go:167] [expensive_query] [cost_time=60.07848135s] [cop_time=237.113168195s] [process_time=104.791s] [wait_time=1685.867s] [backoff_time=0.064s] [request_count=50409] [total_keys=11099941] [process_keys=10882522] [num_cop_tasks=50409] [process_avg_time=0.002078815s] [process_p90_time=0.004s] [process_max_time=0.205s] [process_max_addr=172.xx.xx.90:20160] [wait_avg_time=0.033443769s] [wait_p90_time=0.052s] [wait_max_time=8.981s] [wait_max_addr=172.xx.xx.90:20160] [conn_id=290391] [user=tidb-dp-xx] [table_ids="[109]"] [index_names="[portfolio_order:idx_broker_account3id]"] [txn_start_ts=419647036389589002] [mem_max="79047257 Bytes (75.38533878326416 MB)"] [sql="\t\t\t\t\t\t\tselect count(distinct po.account3_id) as ucnt, coalesce(sum(pso.redeem_share_asset), 0) as amount\
\t\t\t\t\t\t\tfrom (\
\t\t\t\t\t\t\t\tselect * from ying99_potxx.portxx_xx\
\t\t\t\t\t\t\t\twhere po_code = ? and confirm_stage != 'P9' and broker = '0008' and redeem_share_asset is not null\
\t\t\t\t\t\t\t\t\tand expect_confirm_date between ? and ?\
\t\t\t\t\t\t\t) po join ying99_potxx.po_xxx pso \
\t\t\t\t\t\t\t\ton pso.order_id = po.order_id and pso.confirm_status != '9' and pso.redeem_share_asset is not null"]
[2020/09/23 10:01:17.635 +08:00] [INFO] [printer.go:42] ["Welcome to TiDB."] ["Release Version"=v4.0.1] [Edition=Community] ["Git Commit Hash"=689a6b6439ae7835947fcaccf329a3fc303986cb] ["Git Branch"=heads/v4.0.1] ["UTC Build Time"="2020-06-15 03:54:18"] [GoVersion=go1.13] ["Race Enabled"=false] ["Check Table Before Drop"=false] ["TiKV Min Version"=v3.0.0-60965b006877ca7234adaced7890d7b029ed1306]
[2020/09/23 10:01:17.636 +08:00] [INFO] [printer.go:56] ["loaded config"] [config="{\"host\":\"0.0.0.0\",\"advertise-address\":\"172.xx.xx.83\",\"port\":4000,\"cors\":\"\",\"store\":\"tikv\",\"path\":\"172.xx.xx.84:2379,172.18.240.86:2379,172.xx.xx.85:2379\",\"socket\":\"\",\"lease\":\"45s\",\"run-ddl\":true,\"split-table\":true,\"token-limit\":1000,\"oom-use-tmp-storage\":true,\"tmp-storage-path\":\"/tmp/1000_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage\",\"oom-action\":\"cancel\",\"mem-quota-query\":26843545600,\"tmp-storage-quota\":-1,\"enable-streaming\":false,\"enable-batch-dml\":false,\"lower-case-table-names\":2,\"server-version\":\"\",\"log\":{\"level\":\"info\",\"format\":\"text\",\"disable-timestamp\":null,\"enable-timestamp\":null,\"disable-error-stack\":null,\"enable-error-stack\":null,\"file\":{\"filename\":\"/data1/deploy/tidb/log/tidb.log\",\"max-size\":300,\"max-days\":0,\"max-backups\":0},\"enable-slow-log\":true,\"slow-query-file\":\"/data1/deploy/tidb/log/tidb_slow_query.log\",\"slow-threshold\":300,\"expensive-threshold\":10000,\"query-log-max-len\":4096,\"record-plan-in-slow-log\":1},\"security\":{\"skip-grant-table\":false,\"ssl-ca\":\"\",\"ssl-cert\":\"\",\"ssl-key\":\"\",\"require-secure-transport\":false,\"cluster-ssl-ca\":\"\",\"cluster-ssl-cert\":\"\",\"cluster-ssl-key\":\"\",\"cluster-verify-cn\":null},\"status\":{\"status-host\":\"0.0.0.0\",\"metrics-addr\":\"\",\"status-port\":10080,\"metrics-interval\":15,\"report-status\":true,\"record-db-qps\":false},\"performance\":{\"max-procs\":0,\"max-memory\":0,\"stats-lease\":\"3s\",\"stmt-count-limit\":5000,\"feedback-probability\":0.05,\"query-feedback-limit\":1024,\"pseudo-estimate-ratio\":0.8,\"force-priority\":\"NO_PRIORITY\",\"bind-info-lease\":\"3s\",\"txn-total-size-limit\":104857600,\"tcp-keep-alive\":true,\"cross-join\":true,\"run-auto-analyze\":true,\"agg-push-down-join\":false,\"committer-concurrency\":16,\"max-txn-ttl\":600000},\"prepared-plan-cache\":{\"enabled\":false,\"capacity\":100,\"memory-guard-ratio\":0.1},\"opentracing\":{\"enable\":false,\"rpc-metrics\":false,\"sampler\":{\"type\":\"const\",\"param\":1,\"sampling-server-url\":\"\",\"max-operations\":0,\"sampling-refresh-interval\":0},\"reporter\":{\"queue-size\":0,\"buffer-flush-interval\":0,\"log-spans\":false,\"local-agent-host-port\":\"\"}},\"proxy-protocol\":{\"networks\":\"\",\"header-timeout\":5},\"tikv-client\":{\"grpc-connection-count\":4,\"grpc-keepalive-time\":10,\"grpc-keepalive-timeout\":3,\"commit-timeout\":\"41s\",\"max-batch-size\":128,\"overload-threshold\":200,\"max-batch-wait-time\":0,\"batch-wait-size\":8,\"enable-chunk-rpc\":true,\"region-cache-ttl\":600,\"store-limit\":0,\"store-liveness-timeout\":\"120s\",\"copr-cache\":{\"enable\":false,\"capacity-mb\":1000,\"admission-max-result-mb\":10,\"admission-min-process-ms\":5}},\"binlog\":{\"enable\":false,\"ignore-error\":false,\"write-timeout\":\"15s\",\"binlog-socket\":\"\",\"strategy\":\"range\"},\"compatible-kill-query\":false,\"plugin\":{\"dir\":\"\",\"load\":\"\"},\"pessimistic-txn\":{\"enable\":true,\"max-retry-count\":256},\"check-mb4-value-in-utf8\":true,\"max-index-length\":12288,\"alter-primary-key\":false,\"treat-old-version-utf8-as-utf8mb4\":true,\"enable-table-lock\":false,\"delay-clean-table-lock\":0,\"split-region-max-num\":1000,\"stmt-summary\":{\"enable\":true,\"enable-internal-query\":false,\"max-stmt-count\":200,\"max-sql-length\":4096,\"refresh-interval\":1800,\"history-size\":24},\"repair-mode\":false,\"repair-table-list\":[],\"isolation-read\":{\"engines\":[\"tikv\",\"tiflash\",\"tidb\"]},\"max-server-connections\":0,\"new_collations_enabled_on_first_bootstrap\":false,\"experimental\":{\"allow-auto-random\":false,\"allow-expression-index\":false}}"]
[2020/09/23 10:01:17.636 +08:00] [INFO] [main.go:341] ["disable Prometheus push client"]
[2020/09/23 10:01:17.636 +08:00] [INFO] [store.go:68] ["new store"] [path=tikv://172.xx.xx.84:2379,172.xx.xx.86:2379,172.xx.xx.85:2379]
[2020/09/23 10:01:17.636 +08:00] [INFO] [client.go:149] ["[pd] create pd client with endpoints"] [pd-address="[172.xx.xx.84:2379,172.xx.xx.86:2379,172.xx.xx.85:2379]"]
[2020/09/23 10:01:17.636 +08:00] [INFO] [systime_mon.go:25] ["start system time monitor"]
[2020/09/23 10:01:17.638 +08:00] [INFO] [base_client.go:226] ["[pd] update member urls"] [old-urls="[http://172.xx.xx.84:2379,http://172.xx.xx.86:2379,http://172.xx.xx.85:2379]"] [new-urls="[http://172.xx.xx.84:2379,http://172.xx.xx.85:2379,http://172.xx.xx.86:2379]"]
[2020/09/23 10:01:17.638 +08:00] [INFO] [base_client.go:242] ["[pd] switch leader"] [new-leader=http://172.18.240.86:2379] [old-leader=]
[2020/09/23 10:01:17.638 +08:00] [INFO] [base_client.go:92] ["[pd] init cluster id"] [cluster-id=6839334709286030178]
[2020/09/23 10:01:17.641 +08:00] [INFO] [store.go:74] ["new store with retry success"]
[2020/09/23 10:01:17.661 +08:00] [INFO] [tidb.go:71] ["new domain"] [store=tikv-6839334709286030178] ["ddl lease"=45s] ["stats lease"=3s]
[2020/09/23 10:01:17.664 +08:00] [INFO] [ddl.go:321] ["[ddl] start DDL"] [ID=a636a635-3562-46ae-9565-41f7725586b2] [runWorker=true]
[2020/09/23 10:01:17.666 +08:00] [INFO] [ddl.go:309] ["[ddl] start delRangeManager OK"] ["is a emulator"=false]
[2020/09/23 10:01:17.666 +08:00] [INFO] [ddl_worker.go:130] ["[ddl] start DDL worker"] [worker="worker 1, tp general"]
[2020/09/23 10:01:17.667 +08:00] [INFO] [ddl_worker.go:130] ["[ddl] start DDL worker"] [worker="worker 2, tp add index"]
[2020/09/23 10:01:17.835 +08:00] [INFO] [domain.go:144] ["full load InfoSchema success"] [usedSchemaVersion=0] [neededSchemaVersion=5897] ["start time"=154.344752ms]
[2020/09/23 10:01:17.837 +08:00] [INFO] [domain.go:368] ["full load and reset schema validator"]
[2020/09/23 10:01:19.103 +08:00] [INFO] [coprocessor.go:871] ["[TIME_COP_WAIT] resp_time:711.800695ms txnStartTS:419647056338747403 region_id:85 store_addr:172.xx.xx.87:20160 kv_wait_ms:707"]

附上当时的tidb server节点的内存使用:

  1. tidb-server 的内存是多大? 从你反馈的监控看内存占用好像不是很多。
  2. cpu 和 IO 是否在这个时间段有异常?
  3. 从 2.0 如何升级到 4.0 的? 有修改过哪些参数吗? 总共有几个tidb-server,都会有这个问题吗?
  4. 请查看 tidb 目录下,其他日志,比如 err 日志是否有记录信息?

经过这边的排查,可以确定是某个账户的SQL导致的。

发现,这个账户 会发一个批量的SQL到tidb4.0 集群上,导致的tidb server节点重启。

单个SQL 执行时间超过60s,算子都能命中索引,跑在tikv节点上,join是在tidb server做的,占用78 mb的内存。

怀疑 SQL 太多,触发了关于内存的参数。 我们单机是 32GB ,设置的单条SQL上限 是25GB

具体问题还在排查中。

好的,有进展或者需要协助排查的方向再反馈,多谢。

在吗,我3.0.12版本也出现了你这种问题,也是某几条sql引起的,但不知道具体原因,你排查到了吗?