TiDB node reports errors

The business side reports that batch jobs frequently receive errors like this:

[INFO] 2025-04-15 09:47:11.528  - [taskAppId=TASK-96058-34550909-56355485]:[181] -  -> time="2025-04-15 09:47:11" level=info msg="[MONITOR] queue size: 29800, count: 1944900, read rate/s: 64830.00"
[INFO] 2025-04-15 09:47:19.456  - [taskAppId=TASK-96058-34550909-56355485]:[181] -  -> [mysql] 2025/04/15 09:47:19 packets.go:37: read tcp xxxx:51930->xxxx:4000: i/o timeout
[INFO] 2025-04-15 09:47:20.457  - [taskAppId=TASK-96058-34550909-56355485]:[181] -  -> time="2025-04-15 09:47:19" level=warning msg="sql exec fail, err: invalid connection"

TiDB also logs the following errors during the batch run:

2025-04-15 09:40:05 (UTC+08:00)TiDB xxx:4000[session.go:3899] ["CRUCIAL OPERATION"] [conn=1523893894] [schemaVersion=166448] [cur_db=dj_report] [sql="ALTER TABLE xxx TRUNCATE PARTITION p20250414"] [user=xxx@xxx]
2025-04-15 09:40:06 (UTC+08:00)TiDB xxx:4000[ddl_worker.go:1023] ["run DDL job"] [worker="worker 1, tp general"] [category=ddl] [jobID=158756] [conn=1523893894] [category=ddl] [job="ID:158756, Type:truncate partition, State:queueing, SchemaState:public, SchemaID:69, TableID:15286, RowCount:0, ArgLen:0, start time: 2025-04-15 09:40:05.978 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
2025-04-15 09:40:06 (UTC+08:00)TiDB xxx:4000[ddl_worker.go:610] ["finish DDL job"] [worker="worker 1, tp general"] [category=ddl] [jobID=158756] [conn=1523893894] [job="ID:158756, Type:truncate partition, State:synced, SchemaState:none, SchemaID:69, TableID:15286, RowCount:0, ArgLen:1, start time: 2025-04-15 09:40:05.978 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
2025-04-15 09:41:45 (UTC+08:00)TiDB xxx:4000[region_request.go:1754] ["throwing pseudo region error due to no replica available"] [conn=1523893894] [session_alias=] [req-ts=457357735933247576] [req-type=Prewrite] [region="{ region id: 29430011, ver: 281784, confVer: 75759 }"] [replica-read-type=leader] [stale-read=false] [request-sender="{rpcError:<nil>,replicaSelector: replicaSelector{selectorStateStr: tryFollower, cacheRegionIsValid: false, replicaStatus: [peer: 29430012, store: 11, isEpochStale: false, attempts: 1, attempts_time: 676.5µs, replica-epoch: 2, store-epoch: 2, store-state: resolved, store-liveness-state: reachable peer: 29430013, store: 6, isEpochStale: false, attempts: 0, attempts_time: 0s, replica-epoch: 3, store-epoch: 3, store-state: resolved, store-liveness-state: reachable peer: 29430014, store: 5169885, isEpochStale: false, attempts: 0, attempts_time: 0s, replica-epoch: 3, store-epoch: 3, store-state: resolved, store-liveness-state: reachable]}}"] [total-round-stats="{total-backoff: 380ms, total-backoff-times: 13}"] [current-round-stats="{time: 128.9ms, backoff: 128ms, timeout: 30s, req-max-exec-timeout: 20s, retry-times: 1}"]
2025-04-15 09:41:45 (UTC+08:00)TiDB xxx:4000[region_request.go:1754] ["throwing pseudo region error due to no replica available"] [conn=1523893894] [session_alias=] [req-ts=457357735933247576] [req-type=Prewrite] [region="{ region id: 29430011, ver: 281784, confVer: 75759 }"] [replica-read-type=leader] [stale-read=false] [request-sender="{rpcError:<nil>,replicaSelector: replicaSelector{selectorStateStr: tryFollower, cacheRegionIsValid: false, replicaStatus: [peer: 29430012, store: 11, isEpochStale: false, attempts: 1, attempts_time: 663.4µs, replica-epoch: 2, store-epoch: 2, store-state: resolved, store-liveness-state: reachable peer: 29430013, store: 6, isEpochStale: false, attempts: 0, attempts_time: 0s, replica-epoch: 3, store-epoch: 3, store-state: resolved, store-liveness-state: reachable peer: 29430014, store: 5169885, isEpochStale: false, attempts: 0, attempts_time: 0s, replica-epoch: 3, store-epoch: 3, store-state: resolved, store-liveness-state: reachable]}}"] [total-round-stats="{total-backoff: 764ms, total-backoff-times: 15}"] [current-round-stats="{time: 257.4ms, backoff: 256ms, timeout: 30s, req-max-exec-timeout: 20s, retry-times: 1}"]

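It repeatedly reports throwing pseudo region error due to no replica available

The Region status also looks normal:

I have read this post, and the business side has also upgraded the client to version >= 1.23.2

Why does this error occur, and is there a fix? Thanks.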

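Check the cluster status to see whether any node is in an abnormal state. Also, judging from the error message, it is possible that all replicas of some Region have been lost (for example, through disk damage or accidental data deletion).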
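
One way to run this check from a SQL client is sketched below. It assumes the information_schema.TIKV_STORE_STATUS and information_schema.TIKV_REGION_PEERS system tables are available in your TiDB version; the Region ID 29430011 is taken from the TiDB log above.

```sql
-- List the TiKV stores and their state; anything other than "Up" deserves a closer look.
SELECT STORE_ID, ADDRESS, STORE_STATE_NAME, LEADER_COUNT, REGION_COUNT
FROM information_schema.TIKV_STORE_STATUS;

-- Inspect the peers of the Region named in the log (region id 29430011).
SELECT REGION_ID, PEER_ID, STORE_ID, IS_LEADER, STATUS, DOWN_SECONDS
FROM information_schema.TIKV_REGION_PEERS
WHERE REGION_ID = 29430011;
```

If the second query shows the expected peers with one leader and no down time, lost replicas are an unlikely cause. The query may also return nothing if the Region has since been split or merged.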

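Start by checking whether the network is dropping packets.

Look at the monitoring: was TiKV under heavy load at the time, or were there hotspots?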
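
Besides the Grafana panels, hotspots can be listed from SQL. The query below is a minimal sketch that assumes the information_schema.TIDB_HOT_REGIONS table is available in your version. It only shows what is hot right now, so it is most useful while a batch job is running; for a past time window, the Key Visualizer in TiDB Dashboard is the more suitable tool.

```sql
-- Current hot Regions as reported by PD, busiest first.
SELECT DB_NAME, TABLE_NAME, INDEX_NAME, REGION_ID, TYPE, MAX_HOT_DEGREE, FLOW_BYTES
FROM information_schema.TIDB_HOT_REGIONS
ORDER BY FLOW_BYTES DESC
LIMIT 20;
```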

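It looks like plain overload: the Regions are busy splitting and so on, and cannot keep up.

There was a write hotspot at the time.

What about the network during the problem window? Pay particular attention to packet loss and latency.

For a write hotspot, check the disk pressure and whether CPU and other resources are running high.

Consider adding a shard to scatter the hotspot, and see whether that helps.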

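What do lower_value and upper_value refer to? Do they mean the table's primary key id?

Without an index name, they are the table's handle id, which is either the primary key or _tidb_rowid.
With an index name, the split applies to that index, and the values are the minimum and maximum of the index.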
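
The two forms can be sketched as follows. The table split_demo, its columns, and the index idx_created are hypothetical names used only for illustration.

```sql
-- Hypothetical table without a primary key, so rows are keyed by the implicit _tidb_rowid.
CREATE TABLE split_demo (
  id BIGINT NOT NULL,
  created DATE NOT NULL,
  KEY idx_created (created)
);

-- No index name: the bounds are handle values (_tidb_rowid here).
SPLIT TABLE split_demo BETWEEN (0) AND (1000000) REGIONS 16;

-- With an index name: the bounds are values of the indexed column(s).
SPLIT TABLE split_demo INDEX idx_created BETWEEN ('2025-04-01') AND ('2025-05-01') REGIONS 8;
```

The number of values inside each pair of parentheses has to match the number of columns in the key being split, which is what the "lower value count" error later in this thread is about.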

I created an empty table and tried to pre-split it, but got an error:
mysql> SPLIT TABLE t1 BETWEEN (0) AND (18446744073709551615) REGIONS 128;
ERROR 1105 (HY000): Split table region lower value count should be 2

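To avoid the write hotspot caused by _tidb_rowid, you can use the two table options SHARD_ROW_ID_BITS and PRE_SPLIT_REGIONS when creating the table.

SHARD_ROW_ID_BITS randomly scatters the row IDs generated for the _tidb_rowid column.

PRE_SPLIT_REGIONS pre-splits the table into Regions right after it is created.

For example: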

create table t (a int, b int) SHARD_ROW_ID_BITS = 4 PRE_SPLIT_REGIONS=3;
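
A slightly expanded sketch of the same idea, with the effect of each option spelled out in comments. The table name shard_demo is hypothetical. Both options only apply to tables whose rows are keyed by _tidb_rowid, and PRE_SPLIT_REGIONS must not exceed SHARD_ROW_ID_BITS.

```sql
-- Hypothetical demo table: no primary key, so TiDB assigns an implicit _tidb_rowid.
CREATE TABLE shard_demo (a INT, b INT)
  SHARD_ROW_ID_BITS = 4      -- 2^4 = 16 row-ID shards
  PRE_SPLIT_REGIONS = 3;     -- 2^3 = 8 Regions pre-split at creation

-- Check how many Regions the table has and which stores hold the leaders.
SHOW TABLE shard_demo REGIONS;
```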

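I tried changing it to 2 earlier and still got the same error. Also, for a clustered table with an auto-increment primary key, how do I scatter it?

The hotspot is causing heavy write pressure. Try scattering it with shard.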

(root@127.0.0.1) [test]>CREATE TABLE t1 (
    ->   id BIGINT PRIMARY KEY AUTO_INCREMENT,
    ->   name VARCHAR(32)
    -> );
Query OK, 0 rows affected (0.07 sec)

(root@127.0.0.1) [test]>SPLIT TABLE t1 BETWEEN (0) AND (18446744073709551615) REGIONS 12;
ERROR 1690 (22003): constant 18446744073709551615 overflows bigint
(root@127.0.0.1) [test]>SPLIT TABLE t1 BETWEEN (0) AND (1844674) REGIONS 12;
+--------------------+----------------------+
| TOTAL_SPLIT_REGION | SCATTER_FINISH_RATIO |
+--------------------+----------------------+
|                 11 |                    1 |
+--------------------+----------------------+
1 row in set (1.14 sec)

mysql> CREATE TABLE `t1` (
    ->   `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT COMMENT '',
    ->   `u` bigint(20) unsigned NOT NULL DEFAULT '0' COMMENT '',
    ->   `p` bigint(20) unsigned NOT NULL DEFAULT '0' COMMENT '',
    ->   `k` bigint(20) unsigned NOT NULL DEFAULT '0' COMMENT '',
    ->   `c` date NOT NULL DEFAULT '0001-01-01' COMMENT '',
    ->   PRIMARY KEY (`id`,`c`) /*T![clustered_index] CLUSTERED */,
    ->   KEY `idx_userid` (`u`,`c`),
    ->   KEY `idx_user_keyword` (`c`,`u`,`k`),
    ->   KEY `idx_user_plan` (`c`,`u`,`p`)
    -> ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin AUTO_INCREMENT=113838548138 COMMENT=''
    -> PARTITION BY RANGE COLUMNS(`c`)
    -> (PARTITION `p20250409` VALUES LESS THAN ('20250410'),
    ->  PARTITION `p20250410` VALUES LESS THAN ('20250411'),
    ->  PARTITION `p20250411` VALUES LESS THAN ('20250412'),
    ->  PARTITION `p20250412` VALUES LESS THAN ('20250413'),
    ->  PARTITION `p20250413` VALUES LESS THAN ('20250414'),
    ->  PARTITION `p20250414` VALUES LESS THAN ('20250415'),
    ->  PARTITION `p20250415` VALUES LESS THAN ('20250416'),
    ->  PARTITION `p20250416` VALUES LESS THAN ('20250417'),
    ->  PARTITION `p20250417` VALUES LESS THAN ('20250418'),
    ->  PARTITION `p20250418` VALUES LESS THAN ('20250419'),
    ->  PARTITION `p20250419` VALUES LESS THAN ('20250420'),
    ->  PARTITION `p20250420` VALUES LESS THAN ('20250421'),
    ->  PARTITION `p20250421` VALUES LESS THAN ('20250422'),
    ->  PARTITION `p20250422` VALUES LESS THAN ('20250423'),
    ->  PARTITION `p20250423` VALUES LESS THAN ('20250424'),
    ->  PARTITION `p20250424` VALUES LESS THAN ('20250425'),
    ->  PARTITION `p20250425` VALUES LESS THAN ('20250426'),
    ->  PARTITION `p20250426` VALUES LESS THAN ('20250427'),
    ->  PARTITION `p20250427` VALUES LESS THAN ('20250428'),
    ->  PARTITION `p20250428` VALUES LESS THAN ('20250429'),
    ->  PARTITION `p20250429` VALUES LESS THAN ('20250430'));
Query OK, 0 rows affected (0.16 sec)

mysql> SPLIT TABLE t1 BETWEEN (2) AND (9223372036854775807) REGIONS 64;
ERROR 1105 (HY000): Split table region lower value count should be 2
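
The error text suggests that each bound needs one value per column of the clustered primary key, which here is (id, c). The sketch below is written under that assumption and has not been run against this cluster; the date values are illustrative. On a partitioned table the same split is applied to every partition, so the total Region count grows accordingly.

```sql
-- Assumes t1 as defined above, with PRIMARY KEY (id, c) CLUSTERED.
-- Each bound lists a value for id and a value for c, in primary-key order.
SPLIT TABLE t1
  BETWEEN (0, '2025-04-09')
  AND (9223372036854775807, '2025-04-29')
  REGIONS 64;

-- To limit the split to selected partitions:
SPLIT PARTITION TABLE t1 PARTITION (p20250415)
  BETWEEN (0, '2025-04-15')
  AND (9223372036854775807, '2025-04-15')
  REGIONS 16;
```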

CREATE TABLE `t1` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT COMMENT '',
  `u` bigint(20) unsigned NOT NULL DEFAULT '0' COMMENT '',
  `p` bigint(20) unsigned NOT NULL DEFAULT '0' COMMENT '',
  `k` bigint(20) unsigned NOT NULL DEFAULT '0' COMMENT '',
  `c` date NOT NULL DEFAULT '0001-01-01' COMMENT '',
  PRIMARY KEY (`id`,`c`) NONCLUSTERED,
  KEY `idx_userid` (`u`,`c`),
  KEY `idx_user_keyword` (`c`,`u`,`k`),
  KEY `idx_user_plan` (`c`,`u`,`p`)
) SHARD_ROW_ID_BITS = 2 PRE_SPLIT_REGIONS=2 ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin AUTO_INCREMENT=113838548138 COMMENT='' 
PARTITION BY RANGE COLUMNS(`c`)
(PARTITION `p20250409` VALUES LESS THAN ('20250410'),
 PARTITION `p20250410` VALUES LESS THAN ('20250411'),
 PARTITION `p20250411` VALUES LESS THAN ('20250412'),
 PARTITION `p20250412` VALUES LESS THAN ('20250413'),
 PARTITION `p20250413` VALUES LESS THAN ('20250414'),
 PARTITION `p20250414` VALUES LESS THAN ('20250415'),
 PARTITION `p20250415` VALUES LESS THAN ('20250416'),
 PARTITION `p20250416` VALUES LESS THAN ('20250417'),
 PARTITION `p20250417` VALUES LESS THAN ('20250418'),
 PARTITION `p20250418` VALUES LESS THAN ('20250419'),
 PARTITION `p20250419` VALUES LESS THAN ('20250420'),
 PARTITION `p20250420` VALUES LESS THAN ('20250421'),
 PARTITION `p20250421` VALUES LESS THAN ('20250422'),
 PARTITION `p20250422` VALUES LESS THAN ('20250423'),
 PARTITION `p20250423` VALUES LESS THAN ('20250424'),
 PARTITION `p20250424` VALUES LESS THAN ('20250425'),
 PARTITION `p20250425` VALUES LESS THAN ('20250426'),
 PARTITION `p20250426` VALUES LESS THAN ('20250427'),
 PARTITION `p20250427` VALUES LESS THAN ('20250428'),
 PARTITION `p20250428` VALUES LESS THAN ('20250429'));

I have pre-split the table, but there is still a write hotspot. Does this mean the primary key has to change?
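
For context on the question: when an AUTO_INCREMENT column leads a clustered primary key, new rows always carry the largest id and land at the tail of the key range, so pre-split Regions at lower ids receive no writes. One option TiDB provides for this pattern is AUTO_RANDOM. The sketch below uses a hypothetical single-column table; whether it fits the composite key (id, c) and the partitioning used in this thread is an assumption that needs to be verified on your version, and the generated ids are no longer ordered.

```sql
-- Hypothetical table; AUTO_RANDOM replaces AUTO_INCREMENT on a clustered BIGINT primary key.
CREATE TABLE auto_random_demo (
  id BIGINT NOT NULL AUTO_RANDOM,
  name VARCHAR(32),
  PRIMARY KEY (id) CLUSTERED
);

-- Separate statements, so each insert runs in its own transaction.
INSERT INTO auto_random_demo (name) VALUES ('a');
INSERT INTO auto_random_demo (name) VALUES ('b');
INSERT INTO auto_random_demo (name) VALUES ('c');

-- Ids from different transactions typically differ in their high (shard) bits,
-- so they are not monotonically increasing.
SELECT id, name FROM auto_random_demo;
```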

Check the Region distribution of the partitions. If they are concentrated on one node, try scatter. https://github.com/pingcap/tidb/blob/master/docs/tidb_http_api.md
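
The distribution can be inspected from SQL before reaching for the HTTP API. A minimal sketch, assuming the table t1 and the partition p20250415 from the definitions above:

```sql
-- Regions of one partition of t1, including the store that holds each leader
-- and the recent write volume per Region.
SHOW TABLE t1 PARTITION (p20250415) REGIONS;
```

In the output, compare the LEADER_STORE_ID and WRITTEN_BYTES columns: many leaders on the same store, or one Region taking nearly all the writes, points to the concentration described above.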