【 TiDB 使用环境】生产环境
【 TiDB 版本】6.5.0
【复现路径】做过哪些操作出现的问题
3db,3pd,6kv,3flash
集群配置
server_configs:
tidb:
graceful-wait-before-shutdown: 120
log.level: warn
performance.txn-total-size-limit: 1099511627776
pessimistic-txn.pessimistic-auto-commit: true
tikv:
log-backup.enable: false
log.level: warn
raft-engine.enable: true
raftstore.apply-max-batch-size: 2048
raftstore.apply-pool-size: 3
raftstore.store-io-pool-size: 3
raftstore.store-max-batch-size: 2048
raftstore.store-pool-size: 3
readpool.storage.use-unified-pool: true
readpool.unified.auto-adjust-pool-size: true
resolved-ts.enable: false
storage.enable-async-apply-prewrite: true
pd:
dashboard.enable-telemetry: false
schedule.low-space-ratio: 0.9
tidb_dashboard: {}
tiflash:
log.level: warn
profiles.default.max_memory_usage: 90194313216
profiles.default.max_memory_usage_for_all_queries: 90194313216
jdbc批量插入,偶发事务提交失败,出现时 tidb 无异常,tikv 单一节点出现内存使用增长,被 oom killer 杀掉
jdbc client 异常:
java.sql.BatchUpdateException: Error: KV error safe to retry Error(Txn(Error(Mvcc(Error(TxnLockNotFound { start_ts: TimeStamp(451852916744257558), commit_ts: TimeStamp(451852918133358611), key: [116, 128, 0, 0, 0, 0, 0, 1, 63, 95, 114, 128, 0, 0, 3, 94, 161, 247, 28] }))))) {tableID=319, handle=14472574748} [try again later]
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at com.mysql.cj.util.Util.handleNewInstance(Util.java:191)
at com.mysql.cj.util.Util.getInstance(Util.java:166)
at com.mysql.cj.util.Util.getInstance(Util.java:173)
at com.mysql.cj.jdbc.exceptions.SQLError.createBatchUpdateException(SQLError.java:224)
at com.mysql.cj.jdbc.ClientPreparedStatement.executeBatchedInserts(ClientPreparedStatement.java:772)
at com.mysql.cj.jdbc.ClientPreparedStatement.executeBatchInternal(ClientPreparedStatement.java:443)
at com.mysql.cj.jdbc.StatementImpl.executeBatch(StatementImpl.java:814)
at com.pyamc.spark.jdbc.MysqlWriteUtil$1.call(MysqlWriteUtil.java:99)
at org.apache.spark.sql.Dataset.$anonfun$foreachPartition$2(Dataset.scala:3072)
at org.apache.spark.sql.Dataset.$anonfun$foreachPartition$2$adapted(Dataset.scala:3072)
at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2(RDD.scala:1011)
at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2$adapted(RDD.scala:1011)
at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2268)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:136)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.sql.SQLException: Error: KV error safe to retry Error(Txn(Error(Mvcc(Error(TxnLockNotFound { start_ts: TimeStamp(451852916744257558), commit_ts: TimeStamp(451852918133358611), key: [116, 128, 0, 0, 0, 0, 0, 1, 63, 95, 114, 128, 0, 0, 3, 94, 161, 247, 28] }))))) {tableID=319, handle=14472574748} [try again later]
at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:129)
at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:97)
at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:122)
at com.mysql.cj.jdbc.ServerPreparedStatement.serverExecute(ServerPreparedStatement.java:634)
at com.mysql.cj.jdbc.ServerPreparedStatement.executeInternal(ServerPreparedStatement.java:414)
at com.mysql.cj.jdbc.ClientPreparedStatement.executeUpdateInternal(ClientPreparedStatement.java:1109)
at com.mysql.cj.jdbc.ClientPreparedStatement.executeUpdateInternal(ClientPreparedStatement.java:1057)
at com.mysql.cj.jdbc.ClientPreparedStatement.executeLargeUpdate(ClientPreparedStatement.java:1377)
at com.mysql.cj.jdbc.ClientPreparedStatement.executeBatchedInserts(ClientPreparedStatement.java:733)
... 16 more
tidb异常:
[2024/08/15 08:35:12.035 +08:00] [WARN] [session.go:998] ["can not retry txn"] [conn=7886513232037681181] [label=general] [error="[kv:8022]Error: KV error safe to retry Error(Txn(Error(Mvcc(Error(TxnLockNotFound { start_ts: TimeStamp(451852916822376456), commit_ts: TimeStamp(451852918133358604), key: [116, 128, 0, 0, 0, 0, 0, 1, 63, 95, 114, 128, 0, 0, 3, 94, 162, 22, 92] }))))) {tableID=319, handle=14472582748} [try again later]"] [IsBatchInsert=false] [IsPessimistic=true] [InRestrictedSQL=false] [tidb_retry_limit=10] [tidb_disable_txn_auto_retry=true]
[2024/08/15 08:35:12.035 +08:00] [WARN] [session.go:1014] ["commit failed"] [conn=7886513232037681181] ["finished txn"="Txn{state=invalid}"] [error="[kv:8022]Error: KV error safe to retry Error(Txn(Error(Mvcc(Error(TxnLockNotFound { start_ts: TimeStamp(451852916822376456), commit_ts: TimeStamp(451852918133358604), key: [116, 128, 0, 0, 0, 0, 0, 1, 63, 95, 114, 128, 0, 0, 3, 94, 162, 22, 92] }))))) {tableID=319, handle=14472582748} [try again later]"]
[2024/08/15 08:35:12.038 +08:00] [WARN] [2pc.go:1173] ["stop TxnHeartBeat"] [error="txn 451852916744257558 heartbeat fail, primary key = 74800000000000013f5f72800000035ea1f71c, err = txn 451852916744257558 not found"] [consecutiveFailure=1] [txnStartTS=451852916744257558]
[2024/08/15 08:35:12.069 +08:00] [WARN] [session.go:2218] ["run statement failed"] [conn=7886513232037681181] [schemaVersion=624787] [error="[kv:8022]Error: KV error safe to retry Error(Txn(Error(Mvcc(Error(TxnLockNotFound { start_ts: TimeStamp(451852916822376456), commit_ts: TimeStamp(451852918133358604), key: [116, 128, 0, 0, 0, 0, 0, 1, 63, 95, 114, 128, 0, 0, 3, 94, 162, 22, 92] }))))) {tableID=319, handle=14472582748} [try again later]"] [session="{\n \"currDBName\": \"dcdb\",\n \"id\": 7886513232037681181,\n \"lastInsertID\": 14472582748,\n \"preparedStmtCount\": 2,\n \"status\": 2,\n \"strictMode\": false,\n \"user\": {\n \"Username\": \"dbc\",\n \"Hostname\": \"10.120.33.79\",\n \"CurrentUser\": false,\n \"AuthUsername\": \"dbc\",\n \"AuthHostname\": \"%\"\n }\n}"]
[2024/08/15 08:35:12.232 +08:00] [WARN] [session.go:998] ["can not retry txn"] [conn=7886513232037681127] [label=general] [error="[kv:8022]Error: KV error safe to retry Error(Txn(Error(Mvcc(Error(TxnLockNotFound { start_ts: TimeStamp(451852916783054864), commit_ts: TimeStamp(451852918133358609), key: [116, 128, 0, 0, 0, 0, 0, 1, 63, 95, 114, 128, 0, 0, 3, 94, 162, 2, 212] }))))) {tableID=319, handle=14472577748} [try again later]"] [IsBatchInsert=false] [IsPessimistic=true] [InRestrictedSQL=false] [tidb_retry_limit=10] [tidb_disable_txn_auto_retry=true]
[2024/08/15 08:35:12.232 +08:00] [WARN] [session.go:1014] ["commit failed"] [conn=7886513232037681127] ["finished txn"="Txn{state=invalid}"] [error="[kv:8022]Error: KV error safe to retry Error(Txn(Error(Mvcc(Error(TxnLockNotFound { start_ts: TimeStamp(451852916783054864), commit_ts: TimeStamp(451852918133358609), key: [116, 128, 0, 0, 0, 0, 0, 1, 63, 95, 114, 128, 0, 0, 3, 94, 162, 2, 212] }))))) {tableID=319, handle=14472577748} [try again later]"]
[2024/08/15 08:35:12.259 +08:00] [WARN] [session.go:2218] ["run statement failed"] [conn=7886513232037681127] [schemaVersion=624787] [error="[kv:8022]Error: KV error safe to retry Error(Txn(Error(Mvcc(Error(TxnLockNotFound { start_ts: TimeStamp(451852916783054864), commit_ts: TimeStamp(451852918133358609), key: [116, 128, 0, 0, 0, 0, 0, 1, 63, 95, 114, 128, 0, 0, 3, 94, 162, 2, 212] }))))) {tableID=319, handle=14472577748} [try again later]"] [session="{\n \"currDBName\": \"dcdb\",\n \"id\": 7886513232037681127,\n \"lastInsertID\": 14472577748,\n \"preparedStmtCount\": 2,\n \"status\": 2,\n \"strictMode\": false,\n \"user\": {\n \"Username\": \"dbc\",\n \"Hostname\": \"10.120.33.79\",\n \"CurrentUser\": false,\n \"AuthUsername\": \"dbc\",\n \"AuthHostname\": \"%\"\n }\n}"]
[2024/08/15 08:35:12.268 +08:00] [WARN] [session.go:998] ["can not retry txn"] [conn=7886513232037681165] [label=general] [error="[kv:8022]Error: KV error safe to retry Error(Txn(Error(Mvcc(Error(TxnLockNotFound { start_ts: TimeStamp(451852916744257558), commit_ts: TimeStamp(451852918133358611), key: [116, 128, 0, 0, 0, 0, 0, 1, 63, 95, 114, 128, 0, 0, 3, 94, 161, 247, 28] }))))) {tableID=319, handle=14472574748} [try again later]"] [IsBatchInsert=false] [IsPessimistic=true] [InRestrictedSQL=false] [tidb_retry_limit=10] [tidb_disable_txn_auto_retry=true]
[2024/08/15 08:35:12.268 +08:00] [WARN] [session.go:1014] ["commit failed"] [conn=7886513232037681165] ["finished txn"="Txn{state=invalid}"] [error="[kv:8022]Error: KV error safe to retry Error(Txn(Error(Mvcc(Error(TxnLockNotFound { start_ts: TimeStamp(451852916744257558), commit_ts: TimeStamp(451852918133358611), key: [116, 128, 0, 0, 0, 0, 0, 1, 63, 95, 114, 128, 0, 0, 3, 94, 161, 247, 28] }))))) {tableID=319, handle=14472574748} [try again later]"]
[2024/08/15 08:35:12.296 +08:00] [WARN] [2pc.go:1173] ["stop TxnHeartBeat"] [error="txn 451852917150318593 heartbeat fail, primary key = 74800000000000013f5f72800000035ea231b4, err = txn 451852917150318593 not found"] [consecutiveFailure=1] [txnStartTS=451852917150318593]
[2024/08/15 08:35:12.299 +08:00] [WARN] [session.go:2218] ["run statement failed"] [conn=7886513232037681165] [schemaVersion=624787] [error="[kv:8022]Error: KV error safe to retry Error(Txn(Error(Mvcc(Error(TxnLockNotFound { start_ts: TimeStamp(451852916744257558), commit_ts: TimeStamp(451852918133358611), key: [116, 128, 0, 0, 0, 0, 0, 1, 63, 95, 114, 128, 0, 0, 3, 94, 161, 247, 28] }))))) {tableID=319, handle=14472574748} [try again later]"] [session="{\n \"currDBName\": \"dcdb\",\n \"id\": 7886513232037681165,\n \"lastInsertID\": 14472574748,\n \"preparedStmtCount\": 2,\n \"status\": 2,\n \"strictMode\": false,\n \"user\": {\n \"Username\": \"dbc\",\n \"Hostname\": \"10.120.33.79\",\n \"CurrentUser\": false,\n \"AuthUsername\": \"dbc\",\n \"AuthHostname\": \"%\"\n }\n}"]
[2024/08/15 08:35:12.393 +08:00] [WARN] [2pc.go:1173] ["stop TxnHeartBeat"] [error="txn 451852917281390593 heartbeat fail, primary key = 74800000000000013f5f72800000035ea23984, err = txn 451852917281390593 not found"] [consecutiveFailure=1] [txnStartTS=451852917281390593]
[2024/08/15 08:35:12.440 +08:00] [WARN] [2pc.go:1173] ["stop TxnHeartBeat"] [error="txn 451852917294497794 heartbeat fail, primary key = 74800000000000013f5f72800000035ea24154, err = txn 451852917294497794 not found"] [consecutiveFailure=1] [txnStartTS=451852917294497794]
[2024/08/15 08:35:12.441 +08:00] [WARN] [2pc.go:1173] ["stop TxnHeartBeat"] [error="txn 451852917307342849 heartbeat fail, primary key = 74800000000000013f5f72800000035ea2453c, err = txn 451852917307342849 not found"] [consecutiveFailure=1] [txnStartTS=451852917307342849]
tikv 异常:
[2024/08/15 08:36:51.295 +08:00] [WARN] [scheduler.rs:1088] ["[region 1904862] scheduler handle command: prewrite, ts: 451852949277900802"] [takes=1171]
[2024/08/15 08:36:51.302 +08:00] [WARN] [scheduler.rs:1088] ["[region 1904862] scheduler handle command: prewrite, ts: 451852949288910854"] [takes=1118]
[2024/08/15 08:36:51.312 +08:00] [WARN] [scheduler.rs:1088] ["[region 1904862] scheduler handle command: prewrite, ts: 451852949157838858"] [takes=1035]
[2024/08/15 08:37:41.034 +08:00] [WARN] [advance.rs:349] ["check leader rpc costs too long, to_store: 7"] [takes=4557]
[2024/08/15 08:37:41.036 +08:00] [ERROR] [raft_client.rs:562] ["connection aborted"] [addr=10.120.33.66:20160] [receiver_err="Some(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"Connection reset by peer\", details: [] }))"] [sink_error="Some(RpcFinished(Some(
RpcStatus { code: 14-UNAVAILABLE, message: \"Connection reset by peer\", details: [] })))"] [store_id=7]
[2024/08/15 08:37:41.037 +08:00] [ERROR] [raft_client.rs:858] ["connection abort"] [addr=10.120.33.66:20160] [store_id=7]
[2024/08/15 08:37:41.043 +08:00] [WARN] [endpoint.rs:1053] ["cdc resolve region leadership"] [takes=4613]
[2024/08/15 08:37:41.098 +08:00] [ERROR] [kv.rs:671] ["KvService::batch_raft send response fail"] [err=RemoteStopped]
[2024/08/15 08:37:45.652 +08:00] [WARN] [endpoint.rs:780] [error-response] [err="Region error (will back off and retry) message: \"stale command\""]
[2024/08/15 08:37:46.083 +08:00] [WARN] [endpoint.rs:780] [error-response] [err="Region error (will back off and retry) message: \"stale command\""]
[2024/08/15 08:37:46.089 +08:00] [ERROR] [peer.rs:5238] ["failed to send extra message"] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target="id: 308468477 store_id: 7"] [peer_id=168537260] [region_id=168537258] [type=MsgRegionWakeUp]
[2024/08/15 08:37:46.089 +08:00] [WARN] [endpoint.rs:780] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 168537258, leader may None\" not_leader { region_id: 168537258 }"]
[2024/08/15 08:37:46.091 +08:00] [ERROR] [peer.rs:5238] ["failed to send extra message"] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target="id: 308468477 store_id: 7"] [peer_id=168537260] [region_id=168537258] [type=MsgRegionWakeUp]
[2024/08/15 08:37:46.107 +08:00] [ERROR] [raft_client.rs:821] ["wait connect timeout"] [addr=10.120.33.66:20160] [store_id=7]
[2024/08/15 08:37:46.238 +08:00] [WARN] [endpoint.rs:780] [error-response] [err="Region error (will back off and retry) message: \"stale command\""]
[2024/08/15 08:37:46.239 +08:00] [WARN] [endpoint.rs:780] [error-response] [err="Region error (will back off and retry) message: \"stale command\""]
[2024/08/15 08:37:46.536 +08:00] [WARN] [endpoint.rs:780] [error-response] [err="Region error (will back off and retry) message: \"stale command\""]
[2024/08/15 08:37:47.222 +08:00] [WARN] [endpoint.rs:780] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 174151683, leader may None\" not_leader { region_id: 174151683 }"]
[2024/08/15 08:37:47.229 +08:00] [WARN] [endpoint.rs:780] [error-response] [err="Region error (will back off and retry) message: \"stale command\""]
[2024/08/15 08:37:47.753 +08:00] [WARN] [endpoint.rs:780] [error-response] [err="Region error (will back off and retry) message: \"stale command\""]
[2024/08/15 08:37:47.869 +08:00] [WARN] [endpoint.rs:780] [error-response] [err="Region error (will back off and retry) message: \"stale command\""]
[2024/08/15 08:37:47.985 +08:00] [ERROR] [peer.rs:5238] ["failed to send extra message"] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target="id: 311508881 store_id: 7"] [peer_id=311459882] [region_id=311459879] [type=MsgHibernateRequest]
[2024/08/15 08:37:48.033 +08:00] [ERROR] [peer.rs:5238] ["failed to send extra message"] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target="id: 311508645 store_id: 7"] [peer_id=311460055] [region_id=311460052] [type=MsgHibernateRequest]
[2024/08/15 08:37:48.039 +08:00] [ERROR] [peer.rs:5238] ["failed to send extra message"] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target="id: 311578565 store_id: 7"] [peer_id=311577794] [region_id=311569938] [type=MsgHibernateRequest]
[2024/08/15 08:37:48.043 +08:00] [ERROR] [peer.rs:5238] ["failed to send extra message"] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target="id: 311570487 store_id: 7"] [peer_id=311525941] [region_id=311525938] [type=MsgHibernateRequest]
[2024/08/15 08:37:48.258 +08:00] [ERROR] [peer.rs:5238] ["failed to send extra message"] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target="id: 311571179 store_id: 7"] [peer_id=311571178] [region_id=311571176] [type=MsgHibernateRequest]
[2024/08/15 08:37:48.267 +08:00] [ERROR] [peer.rs:5238] ["failed to send extra message"] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target="id: 311569993 store_id: 7"] [peer_id=311569994] [region_id=311569991] [type=MsgHibernateRequest]
[2024/08/15 08:37:48.303 +08:00] [ERROR] [peer.rs:5238] ["failed to send extra message"] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target="id: 311577819 store_id: 7"] [peer_id=311577820] [region_id=311577818] [type=MsgHibernateRequest]
[2024/08/15 08:37:48.572 +08:00] [WARN] [endpoint.rs:780] [error-response] [err="Region error (will back off and retry) message: \"stale command\""]
[2024/08/15 08:37:49.347 +08:00] [WARN] [endpoint.rs:780] [error-response] [err="Region error (will back off and retry) message: \"stale command\""]
[2024/08/15 08:37:49.377 +08:00] [WARN] [endpoint.rs:780] [error-response] [err="Region error (will back off and retry) message: \"stale command\""]
[2024/08/15 08:37:49.377 +08:00] [WARN] [endpoint.rs:780] [error-response] [err="Region error (will back off and retry) message: \"stale command\""]
[2024/08/15 08:37:49.377 +08:00] [WARN] [endpoint.rs:780] [error-response] [err="Region error (will back off and retry) message: \"stale command\""]
[2024/08/15 08:37:49.384 +08:00] [WARN] [endpoint.rs:780] [error-response] [err="Region error (will back off and retry) message: \"stale command\""]
[2024/08/15 08:37:49.453 +08:00] [WARN] [endpoint.rs:780] [error-response] [err="Region error (will back off and retry) message: \"stale command\""]
[2024/08/15 08:37:49.573 +08:00] [ERROR] [peer.rs:5238] ["failed to send extra message"] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target="id: 311517636 store_id: 7"] [peer_id=311517635] [region_id=311517633] [type=MsgHibernateRequest]
[2024/08/15 08:37:49.699 +08:00] [ERROR] [peer.rs:5238] ["failed to send extra message"] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target="id: 311510694 store_id: 7"] [peer_id=311459755] [region_id=311459752] [type=MsgHibernateRequest]
[2024/08/15 08:37:49.796 +08:00] [ERROR] [peer.rs:5238] ["failed to send extra message"] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target="id: 311508625 store_id: 7"] [peer_id=311459763] [region_id=311459760] [type=MsgHibernateRequest]
[2024/08/15 08:37:49.808 +08:00] [ERROR] [peer.rs:5238] ["failed to send extra message"] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target="id: 311508672 store_id: 7"] [peer_id=311459944] [region_id=311459941] [type=MsgHibernateRequest]
oom killer
[Thu Aug 15 08:37:35 2024] grpc-server-2 invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[Thu Aug 15 08:37:35 2024] grpc-server-2 cpuset=/ mems_allowed=0
tikv 基础信息
tikv grpc
raft entry
目前在排查:
grpc-memory-pool-quota
pessimistic-txn.pipelined
pessimistic-txn.in-memory