Panic: db connect err: dial tcp 10.0.0.5:4000: connect: connection timed out

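1. 10.0.0.5 is our load balancer IP. At the load balancer level, we have already lengthened the health check interval and enabled session persistence.
2. At the application level, we added a heartbeat that connects to 10.0.0.5:4000.
3. TiDB was deployed with Ansible, and the configuration has not been modified.

However, the application log frequently shows this error.

Could you point us in the right direction? Does TiDB need any adjustments?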

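  • Check the current connection count. If any tidb-server has more than 1000 connections, try adjusting the token-limit parameter.
  • In the TiDB monitoring, check the Failed Query OPM panel under Query Summary for anything abnormal.
  • Check the network connectivity.
  • Increase the timeout.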
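
For the last two suggestions, here is a minimal Go sketch of a heartbeat that bounds each connection attempt with an explicit timeout and retries with a doubling delay. The helper names retry and probe, the attempt count, and the delays are illustrative assumptions, not taken from the original application. It only tests TCP reachability of the load balancer address and does not speak the MySQL protocol.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

// retry calls op up to attempts times and doubles the delay after each failure.
// It returns nil on the first success, or the last error wrapped with context.
func retry(attempts int, baseDelay time.Duration, op func() error) error {
	if attempts < 1 {
		return errors.New("retry: attempts must be at least 1")
	}
	var lastErr error
	delay := baseDelay
	for i := 1; i <= attempts; i++ {
		if lastErr = op(); lastErr == nil {
			return nil
		}
		if i < attempts {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, lastErr)
}

// probe opens and immediately closes a TCP connection, bounded by timeout.
func probe(addr string, timeout time.Duration) error {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return err
	}
	return conn.Close()
}

func main() {
	addr := "10.0.0.5:4000" // load balancer address from the error message
	// Assumed values: 3 attempts, 3s dial timeout, 500ms initial delay.
	err := retry(3, 500*time.Millisecond, func() error {
		return probe(addr, 3*time.Second)
	})
	if err != nil {
		fmt.Println("heartbeat failed:", err)
		return
	}
	fmt.Println("heartbeat ok")
}
```

If the probe fails from the application host but succeeds when pointed directly at a tidb-server address, that would suggest the problem sits in the load balancer or the network path rather than in TiDB itself.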

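The only information we can still review after the fact is the Query Summary.

Does the screenshot above show that something went wrong with our TiKV? What should we adjust?

show full processlist; Does this list the sessions on the tidb-server we are currently connected to, or the sessions across all TiDB entry points?

Is there a plainer explanation of these monitoring metrics? What is each one for, and what does it mean when data shows up for them?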

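  • kv 1062 indicates a conflict. Search tidb.log for the keywords conflict and retry to find the related SQL.
  • For the connection count, see connection count under the server panel in the TiDB monitoring.
  • These metrics are described in the official documentation. Search for the keyword: monitoring (监控)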
[2019/08/17 17:46:05.256 +08:00] [WARN] [session.go:353] [sql] [conn=2481244] [label=general] [error="[try again later]: WriteConflict: txnStartTS=410526722306867203, conflictTS=410526722333081601, key={tableID=1221, handle=7} primary={tableID=1207, handle=17107}"] [errorVerbose="WriteConflict: txnStartTS=410526722306867203, conflictTS=410526722333081601, key={tableID=1221, handle=7} primary={tableID=1207, handle=17107}
github.com/pingcap/tidb/store/tikv.extractLockFromKeyErr
	/home/jenkins/workspace/release_tidb_2.1-ga/go/src/github.com/pingcap/tidb/store/tikv/snapshot.go:302
github.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter).prewriteSingleBatch
	/home/jenkins/workspace/release_tidb_2.1-ga/go/src/github.com/pingcap/tidb/store/tikv/2pc.go:415
github.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter).doActionOnBatches.func1
	/home/jenkins/workspace/release_tidb_2.1-ga/go/src/github.com/pingcap/tidb/store/tikv/2pc.go:329
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1337
[try again later]"] [txn="Txn{state=invalid}"]
[2019/08/17 16:08:04.772 +08:00] [WARN] [session.go:382] ["commit failed"] [conn=2449040] ["finished txn"="Txn{state=invalid}"] [error="[kv:1062]Duplicate entry 'dih5puew75-2-474c797713f37901928aafb6adbae0241d1750bd' for key 'idx_preid_type_name_hash'"] [errorVerbose="[kv:1062]Duplicate entry 'dih5puew75-2-474c797713f37901928aafb6adbae0241d1750bd' for key 'idx_preid_type_name_hash'
github.com/pingcap/errors.AddStack
	/home/jenkins/workspace/release_tidb_2.1-ga/go/pkg/mod/github.com/pingcap/errors@v0.11.1/errors.go:174
github.com/pingcap/errors.Trace
	/home/jenkins/workspace/release_tidb_2.1-ga/go/pkg/mod/github.com/pingcap/errors@v0.11.1/juju_adaptor.go:15
github.com/pingcap/tidb/kv.(*unionStore).CheckLazyConditionPairs
	/home/jenkins/workspace/release_tidb_2.1-ga/go/src/github.com/pingcap/tidb/kv/union_store.go:221
github.com/pingcap/tidb/store/tikv.(*tikvTxn).Commit
	/home/jenkins/workspace/release_tidb_2.1-ga/go/src/github.com/pingcap/tidb/store/tikv/txn.go:210
github.com/pingcap/tidb/session.(*TxnState).Commit
	/home/jenkins/workspace/release_tidb_2.1-ga/go/src/github.com/pingcap/tidb/session/txn.go:194
github.com/pingcap/tidb/session.(*session).doCommit
	/home/jenkins/workspace/release_tidb_2.1-ga/go/src/github.com/pingcap/tidb/session/session.go:324
github.com/pingcap/tidb/session.(*session).retry
	/home/jenkins/workspace/release_tidb_2.1-ga/go/src/github.com/pingcap/tidb/session/session.go:561
github.com/pingcap/tidb/session.(*session).doCommitWithRetry
	/home/jenkins/workspace/release_tidb_2.1-ga/go/src/github.com/pingcap/tidb/session/session.go:361
github.com/pingcap/tidb/session.(*session).CommitTxn
	/home/jenkins/workspace/release_tidb_2.1-ga/go/src/github.com/pingcap/tidb/session/session.go:399
github.com/pingcap/tidb/session.finishStmt
	/home/jenkins/workspace/release_tidb_2.1-ga/go/src/github.com/pingcap/tidb/session/tidb.go:163
github.com/pingcap/tidb/session.runStmt
	/home/jenkins/workspace/release_tidb_2.1-ga/go/src/github.com/pingcap/tidb/session/tidb.go:219
github.com/pingcap/tidb/session.(*session).executeStatement
	/home/jenkins/workspace/release_tidb_2.1-ga/go/src/github.com/pingcap/tidb/session/session.go:831
github.com/pingcap/tidb/session.(*session).execute
	/home/jenkins/workspace/release_tidb_2.1-ga/go/src/github.com/pingcap/tidb/session/session.go:901
github.com/pingcap/tidb/session.(*session).Execute
	/home/jenkins/workspace/release_tidb_2.1-ga/go/src/github.com/pingcap/tidb/session/session.go:850
github.com/pingcap/tidb/server.(*TiDBContext).Execute
	/home/jenkins/workspace/release_tidb_2.1-ga/go/src/github.com/pingcap/tidb/server/driver_tidb.go:242
github.com/pingcap/tidb/server.(*clientConn).handleQuery
	/home/jenkins/workspace/release_tidb_2.1-ga/go/src/github.com/pingcap/tidb/server/conn.go:933
github.com/pingcap/tidb/server.(*clientConn).dispatch
	/home/jenkins/workspace/release_tidb_2.1-ga/go/src/github.com/pingcap/tidb/server/conn.go:667
github.com/pingcap/tidb/server.(*clientConn).Run
	/home/jenkins/workspace/release_tidb_2.1-ga/go/src/github.com/pingcap/tidb/server/conn.go:504
github.com/pingcap/tidb/server.(*Server).onConn
	/home/jenkins/workspace/release_tidb_2.1-ga/go/src/github.com/pingcap/tidb/server/server.go:383
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1337"]

The logs are posted above. For this kind of error, our application outputs the following error log:

2019-08-19T07:03:07.03935469Z [2019-08-19 07:03:07] Error 1062: Duplicate entry 'e2kdojnozw-2-f3798f81c7b6fecad2cbfec741314f8a66c0eca3' for key 'idx_preid_type_name_hash'

Can this kind of error cause timeouts across the whole cluster?

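  • Use the TSO to determine which SQL statements are conflicting, then optimize the business logic accordingly, for example by checking whether the same row is being modified concurrently.
  • Conflicts generally do not cause connection timeouts, but they still need to be dealt with.
  • TiDB has no connection limit. However, token-limit: 1000 means a single tidb-server can only process requests from 1000 connections at the same time. If the connection count is far above that value, this parameter needs to be adjusted.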
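
To follow up on the first point, the WriteConflict warnings can be picked out of tidb.log and reduced to the fields that identify the conflicting row. The sketch below is written against the log format pasted in this thread only. The struct, the function name, and the regular expression are illustrative assumptions and may need adjusting for other TiDB versions.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// writeConflict holds the fields of one WriteConflict warning.
type writeConflict struct {
	Conn       uint64 // connection ID from [conn=...]
	TxnStartTS uint64 // txnStartTS as logged
	ConflictTS uint64 // conflictTS as logged
	TableID    int64  // tableID of the conflicting key
	Handle     int64  // row handle of the conflicting key
}

// conflictRe matches the log layout shown in this thread (an assumption
// about the format, not an official grammar).
var conflictRe = regexp.MustCompile(
	`\[conn=(\d+)\].*?WriteConflict: txnStartTS=(\d+), conflictTS=(\d+), key=\{tableID=(\d+), handle=(-?\d+)\}`)

// parseWriteConflict extracts the conflict fields from one log line.
// The second return value is false when the line is not a WriteConflict.
func parseWriteConflict(line string) (writeConflict, bool) {
	m := conflictRe.FindStringSubmatch(line)
	if m == nil {
		return writeConflict{}, false
	}
	conn, err1 := strconv.ParseUint(m[1], 10, 64)
	start, err2 := strconv.ParseUint(m[2], 10, 64)
	conflict, err3 := strconv.ParseUint(m[3], 10, 64)
	table, err4 := strconv.ParseInt(m[4], 10, 64)
	handle, err5 := strconv.ParseInt(m[5], 10, 64)
	for _, err := range []error{err1, err2, err3, err4, err5} {
		if err != nil {
			return writeConflict{}, false
		}
	}
	return writeConflict{
		Conn:       conn,
		TxnStartTS: start,
		ConflictTS: conflict,
		TableID:    table,
		Handle:     handle,
	}, true
}

func main() {
	// First part of a WARN line copied from the logs in this thread.
	line := `[2019/08/17 17:46:05.256 +08:00] [WARN] [session.go:353] [sql] [conn=2481244] [label=general] [error="[try again later]: WriteConflict: txnStartTS=410526722306867203, conflictTS=410526722333081601, key={tableID=1221, handle=7} primary={tableID=1207, handle=17107}"]`
	if wc, ok := parseWriteConflict(line); ok {
		fmt.Printf("%+v\n", wc)
	}
}
```

Grouping the parsed entries by TableID and Handle should show which rows are contended most often, which narrows down the statements worth reviewing in the application.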

How exactly do we read the TSO?

[2019/08/17 16:19:05.840 +08:00] [WARN] [session.go:353] [sql] [conn=2456551] [label=general] [error="[try again later]: WriteConflict: txnStartTS=410525354111795242, conflictTS=410525354085580808, key={tableID=1221, handle=1042} primary={tableID=1207, handle=3434}"] [errorVerbose="WriteConflict: txnStartTS=410525354111795242, conflictTS=410525354085580808, key={tableID=1221, handle=1042} primary={tableID=1207, handle=3434}
github.com/pingcap/tidb/store/tikv.extractLockFromKeyErr
	/home/jenkins/workspace/release_tidb_2.1-ga/go/src/github.com/pingcap/tidb/store/tikv/snapshot.go:302
github.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter).prewriteSingleBatch
	/home/jenkins/workspace/release_tidb_2.1-ga/go/src/github.com/pingcap/tidb/store/tikv/2pc.go:415
github.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter).doActionOnBatches.func1
	/home/jenkins/workspace/release_tidb_2.1-ga/go/src/github.com/pingcap/tidb/store/tikv/2pc.go:329
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1337
[try again later]"] [txn="Txn{state=invalid}"]

Which field in the log above is the TSO?
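
On this question: in a WriteConflict entry, txnStartTS and conflictTS are the TSO values. A TSO packs a physical timestamp in milliseconds into the high bits and a logical counter into the low 18 bits, so it can be converted back to wall-clock time. The following is a minimal sketch of that decoding. The function name decodeTSO is an illustrative choice, not a TiDB API.

```go
package main

import (
	"fmt"
	"time"
)

// logicalBits is the number of low bits a TSO reserves for the logical counter.
const logicalBits = 18

// decodeTSO splits a TSO into its physical part (milliseconds since the
// Unix epoch) and its logical counter.
func decodeTSO(ts uint64) (physicalMs int64, logical int64) {
	physicalMs = int64(ts >> logicalBits)
	logical = int64(ts & ((1 << logicalBits) - 1))
	return physicalMs, logical
}

func main() {
	// txnStartTS and conflictTS from the first WriteConflict log in this thread.
	for _, ts := range []uint64{410526722306867203, 410526722333081601} {
		ms, logical := decodeTSO(ts)
		t := time.UnixMilli(ms).UTC()
		fmt.Println(ts, t.Format("2006-01-02 15:04:05.000"), "UTC, logical", logical)
	}
}
```

For the first entry, txnStartTS decodes to 2019-08-17 09:46:05.050 UTC, which is 17:46:05.050 at +08:00 and lines up with the 17:46:05.256 timestamp on the log line itself. Searching tidb.log for the same timestamps, or for statements on the same connection around that time, is one way to locate the SQL involved.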

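Also, do you have the connection count monitoring now?

The connection count is very low.

OK, then look into the conflicting statements.

Suppose we open a transaction, insert a row into table A, update a row in table B, and then commit. When many such transactions are executed in batches under high concurrency, can that cause transaction retries?

Hello, when there are a large number of transaction conflicts and retries, can that cause timeouts?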

  • If the same row is being modified, there will be retries.
  • No, it will not.

Do these large numbers of retrying WARN logs affect overall performance?

  • Yes, and they need to be optimized away.