Dual-cloud DR setup: after verifying data with the sync-diff-inspector tool, the REPLACE statements used to backfill data fail to execute

Version: 5.0.4
Architecture: dual-cloud disaster recovery


Problem: after validating the data with the sync-diff-inspector tool, executing the REPLACE statements fails.
[Overview] sync-diff-inspector was used to check whether the data of the two TiDB clusters is consistent.
It reported inconsistencies, and the REPLACE statements generated to fix them fail when executed against the target cluster.
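
The "ERROR ... at line NNNNN" messages below come from applying the fix SQL produced by sync-diff-inspector (a file of REPLACE statements) to the target cluster with the mysql client; a minimal sketch of how such a file is typically applied (file name and host are hypothetical):

# apply the generated REPLACE statements to the target cluster
mysql -h <target-tidb-host> -P 4000 -u root -p < fix-on-target.sql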

Unique keys:

  PRIMARY KEY (`I_ID`) /*T![clustered_index] CLUSTERED */,
  UNIQUE KEY `FSI` (`I_FILE_ID`),
ERROR 1105 (HY000) at line 40950: can not be duplicated row, due to old row not found. handle 2396698 not found
ERROR 1105 (HY000) at line 40951: can not be duplicated row, due to old row not found. handle 2396699 not found
ERROR 1105 (HY000) at line 40952: can not be duplicated row, due to old row not found. handle 2396700 not found
ERROR 1105 (HY000) at line 40953: can not be duplicated row, due to old row not found. handle 2396701 not found
ERROR 1105 (HY000) at line 40956: can not be duplicated row, due to old row not found. handle 2396704 not found
ERROR 1105 (HY000) at line 40957: can not be duplicated row, due to old row not found. handle 2396705 not found
ERROR 1105 (HY000) at line 40964: can not be duplicated row, due to old row not found. handle 2396712 not found

admin check table also fails:

mysql> admin check table f_file_info;
ERROR 8003 (HY000): f_file_info err:[admin:8223]index:<nil> != record:&admin.RecordData{Handle:54323, Values:[]types.Datum{types.Datum{k:0x5, decimal:0x0, length:0x0, i:0, collation:"utf8mb4_bin", b:[]uint8{0x65, 0x37, 0x34, 0x39, 0x35, 0x62, 0x33, 0x30, 0x34, 0x32, 0x66, 0x64, 0x39, 0x65, 0x38, 0x35, 0x65, 0x32, 0x61, 0x65, 0x37, 0x63, 0x36, 0x66, 0x66, 0x65, 0x31, 0x63, 0x31, 0x32, 0x34, 0x62, 0x66, 0x34, 0x65, 0x61, 0x36, 0x34, 0x35, 0x39}, x:interface {}(nil)}}}
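
Before repairing anything, one way to confirm that it is the index entries (rather than the row data) that are broken is to check the single index directly, or to compare a count through the unique index with a count that avoids it; a minimal sketch of such a check (a suggestion, not something that was run in this incident):

-- check only the FSI index against the row data
ADMIN CHECK INDEX f_file_info FSI;
-- count through the index vs. count that ignores it; a mismatch confirms index/data inconsistency
SELECT COUNT(*) FROM f_file_info USE INDEX (FSI);
SELECT COUNT(*) FROM f_file_info IGNORE INDEX (FSI);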

The detailed steps were as follows:

  1. Pre-split the test table and check which TiKV nodes its Regions are on
mysql> SPLIT TABLE sbtest1 BETWEEN (0) AND (100000) REGIONS 8;
+--------------------+----------------------+
| TOTAL_SPLIT_REGION | SCATTER_FINISH_RATIO |
+--------------------+----------------------+
|                  8 |                    1 |
+--------------------+----------------------+
1 row in set (3.54 sec)
mysql> show table sbtest1 regions;
+-----------+--------------+--------------+-----------+-----------------+------------------+------------+---------------+------------+----------------------+------------------+
| REGION_ID | START_KEY    | END_KEY      | LEADER_ID | LEADER_STORE_ID | PEERS            | SCATTERING | WRITTEN_BYTES | READ_BYTES | APPROXIMATE_SIZE(MB) | APPROXIMATE_KEYS |
+-----------+--------------+--------------+-----------+-----------------+------------------+------------+---------------+------------+----------------------+------------------+
|      1596 | t_85_r       | t_85_r_12500 |      1624 |               3 | 1626, 1624, 1625 |          0 |          1740 |          0 |                    1 |                0 |
|      1600 | t_85_r_12500 | t_85_r_25000 |      1602 |              14 | 1601, 1602, 1603 |          0 |          1503 |          0 |                    1 |                0 |
|      1604 | t_85_r_25000 | t_85_r_37500 |      1627 |              13 | 1628, 1629, 1627 |          0 |          4178 |          0 |                    1 |                0 |
|      1608 | t_85_r_37500 | t_85_r_50000 |      1611 |               2 | 1609, 1610, 1611 |          0 |          1659 |          0 |                    1 |                0 |
|      1612 | t_85_r_50000 | t_85_r_62500 |      1632 |               1 | 1632, 1630, 1631 |          0 |          1725 |          0 |                    1 |                0 |
|      1616 | t_85_r_62500 | t_85_r_75000 |      1618 |              14 | 1617, 1618, 1619 |          0 |            27 |          0 |                    1 |                0 |
|      1620 | t_85_r_75000 | t_85_r_87500 |      1635 |               1 | 1635, 1633, 1634 |          0 |          1725 |          0 |                    1 |                0 |
|         5 | t_85_r_87500 |              |       792 |               4 | 792, 1296, 1326  |          0 |          4258 |    2470034 |                    1 |                0 |
|      1592 | t_85_        | t_85_r       |      1593 |               4 | 1593, 1594, 1595 |          0 |          1491 |          0 |                    3 |            11363 |
+-----------+--------------+--------------+-----------+-----------------+------------------+------------+---------------+------------+----------------------+------------------+
9 rows in set (0.01 sec)
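
The Region-to-store mapping can also be pulled from information_schema instead of show table regions; a sketch, assuming the TiKV system tables available in this version:

-- list every Region of sbtest.sbtest1 with the address of the store holding each peer
SELECT p.region_id, p.store_id, s.address, p.is_leader
FROM   information_schema.tikv_region_status r
JOIN   information_schema.tikv_region_peers  p ON r.region_id = p.region_id
JOIN   information_schema.tikv_store_status  s ON p.store_id  = s.store_id
WHERE  r.db_name = 'sbtest' AND r.table_name = 'sbtest1';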

- Run on the PD node:
tiup ctl:v5.0.4 pd --pd=host1:2379 -i

» store --jq=".stores[].store | { id, address, state_name}"
{"id":8,"address":"host1:20161","state_name":"Up"}
{"id":9,"address":"host2:20161","state_name":"Up"}
{"id":7,"address":"host1:20160","state_name":"Up"}
{"id":2,"address":"host3:20160","state_name":"Up"}
{"id":10,"address":"host2:20160","state_name":"Up"}
{"id":1,"address":"host3:20161","state_name":"Up"}

» region 1512
      "store_id": 10,
      "store_id": 1,
      "store_id": 8,

» region --jq=".regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(8,1,10) then . else empty end) | length>=$total-length) }"
{"id":1135,"peer_stores":[9,8,1]}
{"id":1270,"peer_stores":[8,10,1]}
{"id":805,"peer_stores":[10,7,1]}
{"id":1127,"peer_stores":[8,10,2]}
{"id":890,"peer_stores":[10,7,1]}
{"id":175,"peer_stores":[10,7,1]}
{"id":707,"peer_stores":[8,10,1]}
{"id":1536,"peer_stores":[1,8,10]}
{"id":712,"peer_stores":[8,10,1]}
{"id":1023,"peer_stores":[7,1,10]}
{"id":1223,"peer_stores":[9,8,1]}
{"id":1512,"peer_stores":[1,8,10]}
{"id":941,"peer_stores":[10,1,8]}
{"id":874,"peer_stores":[8,10,1]}
{"id":1520,"peer_stores":[1,8,10]}
{"id":1528,"peer_stores":[1,8,10]}
{"id":1008,"peer_stores":[10,8,2]}
{"id":917,"peer_stores":[2,8,10]}
{"id":669,"peer_stores":[2,10,8]}
{"id":133,"peer_stores":[1,10,8]}
{"id":184,"peer_stores":[8,1,10]}
{"id":1131,"peer_stores":[1,10,8]}
{"id":129,"peer_stores":[8,1,9]}
{"id":988,"peer_stores":[2,10,8]}
{"id":1087,"peer_stores":[10,7,1]}
{"id":977,"peer_stores":[2,10,8]}
{"id":971,"peer_stores":[1,9,8]}
{"id":1278,"peer_stores":[7,1,10]}
{"id":796,"peer_stores":[2,8,10]}
{"id":1141,"peer_stores":[9,8,1]}
{"id":1097,"peer_stores":[10,7,1]}
{"id":900,"peer_stores":[10,1,8]}
{"id":1019,"peer_stores":[10,1,8]}
{"id":1169,"peer_stores":[8,10,2]}
{"id":1145,"peer_stores":[10,1,7]}
{"id":859,"peer_stores":[1,8,10]}
{"id":896,"peer_stores":[8,10,2]}
{"id":866,"peer_stores":[10,7,1]}
{"id":153,"peer_stores":[8,1,9]}
  2. Simulate the failure
# Clean up the TiKV deploy files
mv /data/tidb-deploy/tikv-20160 /data/tidb-deploy/tikv-20160_bak
# Kill the tikv-server process
ps -ef|grep 20160
kill 12374

# Move the data directory away; if disk space is tight, rm -rf the directory instead.
mv /data/tidb-data/tikv_data_20160 /data/tidb-data/tikv_data_20160_bak

tiup ctl:v5.0.4 pd --pd=host1:2379 -i

store --jq=".stores[].store | { id, address, state_name}"

config show | grep schedule-limit

config set leader-schedule-limit 0
config set region-schedule-limit 0
config set replica-schedule-limit 0
config set merge-schedule-limit 0
config set hot-region-schedule-limit 0

tiup cluster list
tiup cluster display tidb_expertise_5.0

  • Stop the TiDB nodes
tiup cluster stop tidb_xxx_5.0 -R=tidb -y
  • Stop the TiKV nodes
tiup cluster stop tidb_xxx_5.0 -R=tikv -y

Run unsafe-recover on all of the stopped, healthy TiKV nodes to restore the Regions; data loss is possible at this point.
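
tikv-ctl in local (--db) mode only works while the tikv-server on that host is stopped, so it is worth confirming on each host that no tikv-server process is left before running the commands below:

# make sure no tikv-server is still running on this host
ps -ef | grep tikv-server | grep -v grep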

# bj3-all-tidb-xxx-01:
tiup ctl:v5.0.4 tikv  --db /data/tidb-data/tikv_data_20160/db unsafe-recover remove-fail-stores -s 8,1,10 --all-regions
# bj3-all-tidb-xxx-02:
tiup ctl:v5.0.4 tikv  --db /data/tidb-data/tikv_data_20160/db unsafe-recover remove-fail-stores -s 8,1,10 --all-regions
# bj3-all-tidb-xxx-03
tiup ctl:v5.0.4 tikv  --db /data/tidb-data/tikv_data_20161/db unsafe-recover remove-fail-stores -s 8,1,10 --all-regions
  • Scale in the failed TiKV nodes, then restart PD and TiKV
tiup cluster scale-in tidb_expertise_5.0 -N=host1:20160,host2:20161,host3:20160 -y 
tiup cluster stop tidb_expertise_5.0 -R=pd
tiup cluster start tidb_expertise_5.0 -R=pd,tikv


tiup ctl:v5.0.4 pd --pd=host1:2379 -i
config set leader-schedule-limit 4
config set region-schedule-limit 2048
config set replica-schedule-limit 64
config set merge-schedule-limit 8
config set hot-region-schedule-limit 4

[tidb@bj3-all-tidb-expertise-01 ~]$ tiup cluster display tidb_expertise_5.0
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.10.2/tiup-cluster display tidb_expertise_5.0
Cluster type:       tidb
Cluster name:       tidb_expertise_5.0
Cluster version:    v5.0.4
Deploy user:        tidb
SSH type:           builtin
Dashboard URL:      http://bj3-all-tidb-xxx-01:2379/dashboard
Grafana URL:        http://bj3-all-tidb-xxx-01:3000
ID                   Role        Host           Ports        OS/Arch       Status        Data Dir                         Deploy Dir
--                   ----        ----           -----        -------       ------        --------                         ----------
bj3-all-tidb-xxx-02:8300   cdc         bj3-all-tidb-xxx-02  8300         linux/x86_64  Up            /data/tidb-data/cdc-8300         /data/tidb-deploy/cdc-8300
bj3-all-tidb-xxx-01:3000    grafana     bj3-all-tidb-xxx-01   3000         linux/x86_64  Up            -                                /data/tidb-deploy/grafana-3000
bj3-all-tidb-xxx-01:2379    pd          bj3-all-tidb-xxx-01   2379/2380    linux/x86_64  Up|L|UI       /data/tidb-data/pd_data          /data/tidb-deploy/pd-2379
bj3-all-tidb-xxx-01:9090    prometheus  bj3-all-tidb-xxx-01   9090         linux/x86_64  Up            /data/tidb-data/prometheus-9090  /data/tidb-deploy/prometheus-9090
bj3-all-tidb-xxx-02:4000   tidb        bj3-all-tidb-xxx-02  4000/10080   linux/x86_64  Up            -                                /data/tidb-deploy/tidb-4000
bj3-all-tidb-xxx-03:4000     tidb        bj3-all-tidb-xxx-03    4000/10080   linux/x86_64  Up            -                                /data/tidb-deploy/tidb-4000
bj3-all-tidb-xxx-02:20161  tikv        bj3-all-tidb-xxx-02  20161/20181  linux/x86_64  Disconnected  /data/tidb-data/tikv_data_20161  /data/tidb-deploy/tikv-20161
bj3-all-tidb-xxx-01:20161   tikv        bj3-all-tidb-xxx-01   20161/20181  linux/x86_64  Disconnected  /data/tidb-data/tikv_data_20161  /data/tidb-deploy/tikv-20161
bj3-all-tidb-xxx-03:20160    tikv        bj3-all-tidb-xxx-03    20160/20180  linux/x86_64  Disconnected  /data/tidb-data/tikv_data_20160  /data/tidb-deploy/tikv-20160
Total nodes: 9

Check the Regions on the failed TiKV instances

  • Regions whose majority of replicas were on the failed TiKV stores:
tiup ctl:v5.0.4 pd --pd=bj3-all-tidb-xxx-01:2379 -i
» region --jq=".regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(8,1,10) then . else empty end) | length>=$total-length) }"
{"id":1512,"peer_stores":[1,8,10]}
{"id":1528,"peer_stores":[1,8,10]}
{"id":1131,"peer_stores":[1,10,8]}
{"id":133,"peer_stores":[1,10,8]}
{"id":707,"peer_stores":[8,10,1]}
{"id":1536,"peer_stores":[1,8,10]}
{"id":1270,"peer_stores":[8,10,1]}
{"id":874,"peer_stores":[8,10,1]}
{"id":184,"peer_stores":[8,1,10]}
{"id":712,"peer_stores":[8,10,1]}
{"id":1019,"peer_stores":[10,1,8]}
{"id":859,"peer_stores":[1,8,10]}
{"id":941,"peer_stores":[10,1,8]}
{"id":1520,"peer_stores":[1,8,10]}
{"id":900,"peer_stores":[10,1,8]}
  • Regions without a leader (normally this query returns nothing):
» region --jq '.regions[]|select(has("leader")|not)|{id: .id,peer_stores: [.peers[].store_id]}'
  • Check which tables the Regions belong to (a scripted form follows the commands below).
    Result: all of the Regions belong to business tables, not system tables.
curl http://bj3-all-tidb-xxx-02:10080/regions/1590 |grep -E "table_name|db_name"
curl http://bj3-all-tidb-xxx-02:10080/regions/1211 |grep -E "table_name|db_name"
curl http://bj3-all-tidb-xxx-02:10080/regions/1630 |grep -E "table_name|db_name"
curl http://bj3-all-tidb-xxx-02:10080/regions/731 |grep -E "table_name|db_name"
curl http://bj3-all-tidb-xxx-02:10080/regions/1316 |grep -E "table_name|db_name"
curl http://bj3-all-tidb-xxx-02:10080/regions/1598 |grep -E "table_name|db_name"
curl http://bj3-all-tidb-xxx-02:10080/regions/905 |grep -E "table_name|db_name"
curl http://bj3-all-tidb-xxx-02:10080/regions/1574 |grep -E "table_name|db_name"
curl http://bj3-all-tidb-xxx-02:10080/regions/706 |grep -E "table_name|db_name"
curl http://bj3-all-tidb-xxx-02:10080/regions/715 |grep -E "table_name|db_name"
curl http://bj3-all-tidb-xxx-02:10080/regions/1582 |grep -E "table_name|db_name"
curl http://bj3-all-tidb-xxx-02:10080/regions/979 |grep -E "table_name|db_name"
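
The same check can be scripted over all of the affected Region IDs (the IDs are the ones queried above):

for r in 1590 1211 1630 731 1316 1598 905 1574 706 715 1582 979; do
  echo "== region $r"
  curl -s http://bj3-all-tidb-xxx-02:10080/regions/$r | grep -E "table_name|db_name"
done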
  • Re-create the failed Regions:
    This requires shutting down one of the healthy TiKV instances first, then running the commands below.
Stop the bj3-all-tidb-xxx-03:20160 TiKV node
tiup cluster stop tidb_expertise_5.0 -N=bj3-all-tidb-xxx-03:20160


# Run recreate-region on the stopped bj3-all-tidb-xxx-03:20160 TiKV node
tiup ctl:v5.0.4 tikv --db /data/tidb-data/tikv_data_20160/db recreate-region -p 'bj3-all-tidb-xxx-01:2379' -r 1590
tiup ctl:v5.0.4 tikv --db /data/tidb-data/tikv_data_20160/db recreate-region -p 'bj3-all-tidb-xxx-01:2379' -r 1211
tiup ctl:v5.0.4 tikv --db /data/tidb-data/tikv_data_20160/db recreate-region -p 'bj3-all-tidb-xxx-01:2379' -r 1630
tiup ctl:v5.0.4 tikv --db /data/tidb-data/tikv_data_20160/db recreate-region -p 'bj3-all-tidb-xxx-01:2379' -r 731
tiup ctl:v5.0.4 tikv --db /data/tidb-data/tikv_data_20160/db recreate-region -p 'bj3-all-tidb-xxx-01:2379' -r 1316
tiup ctl:v5.0.4 tikv --db /data/tidb-data/tikv_data_20160/db recreate-region -p 'bj3-all-tidb-xxx-01:2379' -r 1598
tiup ctl:v5.0.4 tikv --db /data/tidb-data/tikv_data_20160/db recreate-region -p 'bj3-all-tidb-xxx-01:2379' -r 905
tiup ctl:v5.0.4 tikv --db /data/tidb-data/tikv_data_20160/db recreate-region -p 'bj3-all-tidb-xxx-01:2379' -r 1574
tiup ctl:v5.0.4 tikv --db /data/tidb-data/tikv_data_20160/db recreate-region -p 'bj3-all-tidb-xxx-01:2379' -r 706
tiup ctl:v5.0.4 tikv --db /data/tidb-data/tikv_data_20160/db recreate-region -p 'bj3-all-tidb-xxx-01:2379' -r 715
tiup ctl:v5.0.4 tikv --db /data/tidb-data/tikv_data_20160/db recreate-region -p 'bj3-all-tidb-xxx-01:2379' -r 1582
tiup ctl:v5.0.4 tikv --db /data/tidb-data/tikv_data_20160/db recreate-region -p 'bj3-all-tidb-xxx-01:2379' -r 979

Clean up the removed TiKV nodes and restart the cluster

tiup cluster scale-in tidb_expertise_5.0 -N=bj3-all-tidb-xxx-01:20160,bj3-all-tidb-xxx-02:20161,bj3-all-tidb-xxx-03:20160 -y --force
tiup cluster restart tidb_expertise_5.0  -y
Starting component blackbox_exporter
	Starting instance host1
	Starting instance bj3-all-tidb-xxx-01
	Starting instance bj3-all-tidb-xxx-02
	Start bj3-all-tidb-xxx-02 success
	Start bj3-all-tidb-xxx-03 success
	Start bj3-all-tidb-xxx-01 success
Restarted cluster `tidb_expertise_5.0` successfully

After cleaning up the failed TiKV nodes, verify the data with sync-diff-inspector again and backfill the missing data
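
A sketch of the re-check, assuming the config file name (hypothetical) and the documented invocation of the tool:

# re-run the comparison; mismatched chunks are written out as REPLACE statements in the configured fix-SQL output
./sync_diff_inspector --config=./diff-config.toml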

While backfilling the data, we found:

set session sql_mode='';
mysql> REPLACE INTO `sbtest`.`sbtest1`(`I_ID`,xxxxxxx,`I_TASK_ID`,`D_CREATED_AT`,`D_UPDATED_AT`,`I_YOS_VERSION`) VALUES (3,2);
ERROR 1105 (HY000): can not be duplicated row, due to old row not found. handle 3 not found

tidb.log

[2022/09/08 16:26:12.608 +08:00] [ERROR] [replace.go:68] ["get old row failed when replace"] [handle=3] [toBeInsertedRow="(3, , 2, 0, 0, 0, 0, 0, , , , , , 1, 0, 2020-05-20 13:46:57, , 2)"]
[2022/09/08 16:26:12.608 +08:00] [INFO] [tidb.go:242] ["rollbackTxn for ddl/autocommit failed"]
[2022/09/08 16:26:12.608 +08:00] [WARN] [session.go:1458] ["run statement failed"] [conn=4825] [schemaVersion=48] [error="can not be duplicated row, due to old row not found
. handle 3 not found"] [session="{\
  \"currDBName\": \"\",\
  \"id\": 4825,\
  \"status\": 2,\
  \"strictMode\": true,\
  \"user\": {\
    \"Username\": \"root\",\
    \"Ho
stname\": \"127.0.0.1\",\
    \"CurrentUser\": false,\
    \"AuthUsername\": \"root\",\
    \"AuthHostname\": \"%\"\
  }\
}"]
, user:root"] [command=Query] [status="inTxn:0, autocommit:1"] [sql="REPLACE INTO `sbtest`.`sbtest1`(`I_ID`,xxxx,xxxx,xxx) VALUES (3,xxx,xxx)"] [txn_mode=PESSIMISTIC] [err="can not be duplicated row, due to old row not found. handle 3 not found\
github.com/pingcap/errors.NotFoundf\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/juju_adaptor.go:117\
github.com/pingcap/tidb/executor.(*ReplaceExec).removeRow\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/executor/replace.go:72\
github.com/pingcap/tidb/executor.(*ReplaceExec).removeIndexRow\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/executor/replace.go:185\
github.com/pingcap/tidb/executor.(*ReplaceExec).replaceRow\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/executor/replace.go:145\
github.com/pingcap/tidb/executor.(*ReplaceExec).exec\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/executor/replace.go:238\
github.com/pingcap/tidb/executor.insertRows\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/executor/insert_common.go:277\
github.com/pingcap/tidb/executor.(*ReplaceExec).Next\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/executor/replace.go:253\
github.com/pingcap/tidb/executor.Next\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/executor/executor.go:278\
github.com/pingcap/tidb/executor.(*ExecStmt).handleNoDelayExecutor\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/executor/adapter.go:537\
github.com/pingcap/tidb/executor.(*ExecStmt).handleNoDelay\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/executor/adapter.go:418\
github.com/pingcap/tidb/executor.(*ExecStmt).Exec\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/executor/adapter.go:368\
github.com/pingcap/tidb/session.runStmt\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/session/session.go:1560\
github.com/pingcap/tidb/session.(*session).ExecuteStmt\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/session/session.go:1455\
github.com/pingcap/tidb/server.(*TiDBContext).ExecuteStmt\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/server/driver_tidb.go:218\
github.com/pingcap/tidb/server.(*clientConn).handleStmt\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/server/conn.go:1694\
github.com/pingcap/tidb/server.(*clientConn).handleQuery\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/server/conn.go:1567\
github.com/pingcap/tidb/server.(*clientConn).dispatch\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/server/conn.go:1101\
github.com/pingcap/tidb/server.(*clientConn).Run\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/server/conn.go:859\
github.com/pingcap/tidb/server.(*Server).onConn\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/server/server.go:478\
runtime.goexit\
\t/usr/local/go/src/runtime/asm_amd64.s:1357"]

It looks like the unsafe recover left the index and the row data inconsistent; repairing the index with the ADMIN index statements, or rebuilding the index, should be enough to fix it.
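
A sketch of what that repair could look like; the error above points at dangling index entries (index keys whose handle no longer exists), which is what ADMIN CLEANUP INDEX targets. This is a suggestion, not what was actually run here:

ADMIN CLEANUP INDEX f_file_info FSI;   -- remove index entries whose corresponding row (handle) no longer exists
ADMIN RECOVER INDEX f_file_info FSI;   -- add back index entries missing for rows that do exist
ADMIN CHECK TABLE f_file_info;         -- re-verify data/index consistency afterwards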

After dropping the unique index, the data import was able to complete.

mysql> alter table f_file_info drop index I_FILE_ID;
Query OK, 0 rows affected (0.51 sec)
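
Dropping the index removes the dangling entries, but it also drops the uniqueness guarantee, so once the backfill is finished and verified the key presumably needs to be added back (hypothetical statement, using the key definition from the schema above):

ALTER TABLE f_file_info ADD UNIQUE KEY FSI (I_FILE_ID);
ADMIN CHECK TABLE f_file_info;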
