TiFlash is not syncing; errors in the logs

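qq24681430 · September 23, 2021, 02:42

To improve efficiency, please provide the following information. A clearly described problem gets resolved faster:
[TiDB environment]
[Overview] Scenario + problem summary
2 TiDB, 2 PD, 1 TiKV, 1 TiFlash (scaled out)
[Background] Operations performed

[Symptoms] Application and database symptoms
Following the error in the log, I tested port 10080 with telnet and it is reachable.
[Business impact]
[TiDB version]
v4.0.11
[Attachments]

Relevant logs and monitoring

TiUP Cluster Display output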
$ tiup cluster display test-cluster
Found cluster newer version:

The latest version: v1.5.6
Local installed version: v1.4.1
Update current component: tiup update cluster
Update all components: tiup update --all

Starting component cluster: /home/tidb/.tiup/components/cluster/v1.4.1/tiup-cluster display test-cluster
Cluster type: tidb
Cluster name: test-cluster
Cluster version: v4.0.11
SSH type: builtin
Dashboard URL: http://200.100.1.13:2379/dashboard
ID                  Role          Host          Ports                                OS/Arch       Status  Data Dir                                       Deploy Dir
--                  ----          ----          -----                                -------       ------  --------                                       ----------
200.100.1.13:9093   alertmanager  200.100.1.13  9093/9094                            linux/x86_64  Up      /tidbdata/deploy/data.alertmanager             /tidbdata/deploy
200.100.1.13:3000   grafana       200.100.1.13  3000                                 linux/x86_64  Up      -                                              /tidbdata/deploy
200.100.1.13:2379   pd            200.100.1.13  2379/2380                            linux/x86_64  Up|UI   /tidbdata/deploy/data.pd                       /tidbdata/deploy
200.100.1.17:2379   pd            200.100.1.17  2379/2380                            linux/x86_64  Up|L    /tidbdata/deploy/pd-2379/data                  /tidbdata/deploy/pd-2379
200.100.1.13:9090   prometheus    200.100.1.13  9090                                 linux/x86_64  Up      /tidbdata/deploy/prometheus2.0.0.data.metrics  /tidbdata/deploy
200.100.1.17:9090   prometheus    200.100.1.17  9090                                 linux/x86_64  Up      /tidbdata/deploy/prometheus-9090/data          /tidbdata/deploy/prometheus-9090
200.100.1.13:4000   tidb          200.100.1.13  4000/10080                           linux/x86_64  Up      -                                              /tidbdata/deploy
200.100.1.17:4000   tidb          200.100.1.17  4000/10080                           linux/x86_64  Up      -                                              /tidbdata/deploy/tidb-4000
200.100.1.17:19000  tiflash       200.100.1.17  19000/18123/13930/30170/10292/18234  linux/x86_64  Up      /tidbdata/deploy2/data                         /tidbdata/deploy2/tiflash-9000
200.100.1.13:20160  tikv          200.100.1.13  20160/20180                          linux/x86_64  Up      /tidbdata/deploy/data                          /tidbdata/deploy

TiUP Cluster Edit Config output
TiDB Overview monitoring

Logs of the relevant components (including 1 hour before and after the issue)

vi tiflash_error.log

2021.09.23 10:25:22.024024 [ 32 ] pingcap.pd: write tso failed
2021.09.23 10:25:22.024152 [ 32 ] pd/oracle: update ts error: Exception: write tso failed
2021.09.23 10:25:22.024325 [ 31 ] pingcap.pd: get member failed: 14: failed to connect to all addresses
2021.09.23 10:25:22.024376 [ 31 ] pingcap.pd: failed to get cluster id by :http://200.100.1.13:2379
2021.09.23 10:25:22.025497 [ 31 ] pingcap.pd: failed to get cluster id by :http://200.100.1.17:2379
2021.09.23 10:25:22.025569 [ 31 ] pingcap.pd: Exception: failed to update leader
2021.09.23 10:25:23.500883 [ 4 ] pingcap.pd: get safe point failed: 2: rpc error: code = Unavailable desc = not leader

vi tiflash_cluster_manager.log

2021-09-23 10:51:17,180 root: can not get tiflash replica info from tidb: [('200.100.1.13:10080', ReadTimeout(ReadTimeoutError("HTTPConnectionPool(host='200.100.1.13', port=10080): Read timed out. (read timeout=5)",),))]
Traceback (most recent call last):
  File "flash_cluster_manager.py", line 286, in main
  File "flash_cluster_manager.py", line 129, in init
  File "flash_cluster_manager.py", line 29, in wrap_func
  File "flash_cluster_manager.py", line 238, in table_update
  File "tidb_tools.py", line 42, in db_flash_replica
Exception: can not get tiflash replica info from tidb: [('200.100.1.13:10080', ReadTimeout(ReadTimeoutError("HTTPConnectionPool(host='200.100.1.13', port=10080): Read timed out. (read timeout=5)",),))]
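
The traceback above reports a read timeout (5 seconds), not a refused connection, so a successful telnet connect and a failing HTTP request can both be true at once. A minimal sketch for telling the two apart, using only the Python 3 standard library; `probe` and its return labels are hypothetical names chosen for illustration, not part of any TiDB tool:

```python
import http.client
import socket
import urllib.error
import urllib.request


def probe(host, port, path="/", timeout=5.0):
    """Classify an HTTP endpoint as 'unreachable', 'no-http-response', or 'ok'."""
    # Step 1: plain TCP connect, which is all that telnet verifies.
    try:
        socket.create_connection((host, port), timeout=timeout).close()
    except OSError:
        return "unreachable"

    # Step 2: a real HTTP GET, with the same kind of read timeout as in the log.
    url = f"http://{host}:{port}{path}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
    except urllib.error.HTTPError:
        # The server answered, even if with an error status code.
        return "ok"
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connected, but no usable HTTP reply arrived (read timeout, reset, ...).
        return "no-http-response"
    return "ok"
```

Run from the TiFlash host, a call such as `probe("200.100.1.13", 10080, "/tiflash/replica")` (the path is the one suggested later in this thread) returning `no-http-response` would match the log: the port accepts connections, but the request is not answered in time.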

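spc_monkey · September 23, 2021, 04:07

Is the port unreachable? I suggest searching AskTUG for related posts first; there is one that walks through troubleshooting the whole TiFlash sync process.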

qq24681430 · September 23, 2021, 05:57

The port is fine when tested with telnet. I went through the checks in that post and found no problem.

spc_monkey · September 23, 2021, 06:01

qq24681430 · September 23, 2021, 09:03

The article you sent does not seem related to the errors in the logs. I have checked nearly everything it lists and found no problem.

spc_monkey · September 23, 2021, 09:11

But the errors you posted above indicate that TiFlash cannot reach those ports.

qq24681430 · September 24, 2021, 02:29

The problem seems to be here:

vi tiflash_error.log

2021.09.23 11:38:50.635654 [ 21 ] pingcap.tikv: region {32425,3,8} find error: region 32425 is missing
2021.09.23 12:06:04.482850 [ 25 ] pingcap.tikv: region {32425,6,8} find error: region 32425 is missing
2021.09.23 12:53:55.844680 [ 24 ] pingcap.tikv: region {32425,9,8} find error: region 32425 is missing
2021.09.23 13:00:15.229413 [ 27 ] pingcap.tikv: region {32425,12,8} find error: region 32425 is missing
2021.09.23 13:00:15.260998 [ 27 ] pingcap.tikv: region {312644,3,20} find error: region 312644 is missing
2021.09.23 13:06:07.296700 [ 28 ] pingcap.tikv: region {312644,6,20} find error: region 312644 is missing
2021.09.23 13:11:55.929209 [ 31 ] pingcap.tikv: region {32425,15,8} find error: region 32425 is missing
2021.09.23 13:11:55.957615 [ 31 ] pingcap.tikv: region {312644,9,20} find error: region 312644 is missing
2021.09.23 13:38:54.269852 [ 35 ] pingcap.tikv: region {312644,12,20} find error: region 312644 is missing
2021.09.23 13:45:13.917874 [ 28 ] pingcap.tikv: region {312644,15,20} find error: region 312644 is missing
2021.09.23 14:00:12.747666 [ 34 ] pingcap.tikv: region {32425,18,8} find error: region 32425 is missing
2021.09.23 14:00:12.826069 [ 34 ] pingcap.tikv: region {312644,18,20} find error: region 312644 is missing
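
Only two region IDs, 32425 and 312644, recur in the excerpt above. A small sketch for pulling that summary out of a longer `tiflash_error.log`; the regular expression is written against the line format shown here, and the function name is hypothetical. The meaning of the second and third numbers inside the braces is not stated in the thread, so the sketch matches them without interpreting them:

```python
import re
from collections import defaultdict

# Matches lines like:
# 2021.09.23 11:38:50.635654 [ 21 ] pingcap.tikv: region {32425,3,8} find error: region 32425 is missing
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}\.\d{2}\.\d{2} \d{2}:\d{2}:\d{2}\.\d+) .*"
    r"region \{(?P<id>\d+),\d+,\d+\} find error: region \d+ is missing"
)


def summarize_missing_regions(lines):
    """Map region id -> (occurrences, first timestamp, last timestamp)."""
    seen = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            seen[int(match.group("id"))].append(match.group("ts"))
    return {rid: (len(ts), ts[0], ts[-1]) for rid, ts in seen.items()}


sample = [
    "2021.09.23 11:38:50.635654 [ 21 ] pingcap.tikv: region {32425,3,8} find error: region 32425 is missing",
    "2021.09.23 13:00:15.260998 [ 27 ] pingcap.tikv: region {312644,3,20} find error: region 312644 is missing",
    "2021.09.23 14:00:12.747666 [ 34 ] pingcap.tikv: region {32425,18,8} find error: region 32425 is missing",
]
for region_id, info in summarize_missing_regions(sample).items():
    print(region_id, info)
```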

spc_monkey · September 24, 2021, 02:35

Check which table this region belongs to, set that table's TiFlash replica count to 0, and then re-sync it.
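
The reset suggested above can be sketched as a helper that only builds the SQL text to run in a MySQL client connected to TiDB. The schema and table names in the example are placeholders, and the helper name is hypothetical; it does no escaping, so it is meant for hand-checked names only:

```python
def reset_tiflash_replica_sql(schema, table, replicas=1):
    """Return the statements that drop and re-add a table's TiFlash replica."""
    name = f"`{schema}`.`{table}`"
    return [
        # Remove the TiFlash replica of the table.
        f"ALTER TABLE {name} SET TIFLASH REPLICA 0;",
        # Request the replica again, which restarts replication for the table.
        f"ALTER TABLE {name} SET TIFLASH REPLICA {int(replicas)};",
        # Watch AVAILABLE and PROGRESS to see whether replication advances.
        "SELECT TABLE_SCHEMA, TABLE_NAME, REPLICA_COUNT, AVAILABLE, PROGRESS "
        "FROM information_schema.tiflash_replica "
        f"WHERE TABLE_SCHEMA = '{schema}' AND TABLE_NAME = '{table}';",
    ]


# Placeholder names for illustration only.
for statement in reset_tiflash_replica_sql("test", "t"):
    print(statement)
```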

qq24681430 · September 26, 2021, 02:10

I tried that and it did not work!
The region in the error is neither a table nor an index. (screenshot)

spc_monkey · September 26, 2021, 04:01

One question: was your TiFlash working normally before this?

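qq24681430 · September 26, 2021, 04:49

TiFlash was working normally before. Later a RAID controller problem caused some data loss, so I used unsafe recover to restore the data, then scaled in TiFlash and scaled it out again onto a new server.
The TiDB, PD, TiKV, and TiFlash instances on the new server were all added by scale-out.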

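spc_monkey · September 26, 2021, 06:27

How many tables in the cluster currently have a TiFlash replica enabled?
1. You could set the replica count of every table that has a TiFlash replica to 0, then enable them again.
2. Is this a production cluster or a test cluster? You can use the pd-ctl region command to look at this region's information. From what I can see, the region is abnormal, and you need to decide whether it can be recreated (this creates an empty region). The command is: tikv-ctl --db /path/to/tikv-data/db recreate-region --pd -r <region_id>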

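qq24681430 · September 26, 2021, 06:32

1. Because the sync showed no progress, I reduced the tables with a TiFlash replica enabled to three.
2. I tried setting all of them to 0 and enabling them again; it had no effect.
3. It is a production cluster. The information for one of the regions is as follows: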
[tidb@crm-dc-13 v4.0.11]$ ./pd-ctl region 312644
{
  "id": 312644,
  "start_key": "",
  "end_key": "6D44427300000000FF00FA000000000000FF006844423A343500FF0000FC0000000000FA",
  "epoch": {
    "conf_ver": 227,
    "version": 20
  },
  "peers": [
    {
      "id": 480207,
      "store_id": 1
    },
    {
      "id": 480250,
      "store_id": 390017
    }
  ],
  "leader": {
    "id": 480207,
    "store_id": 1
  },
  "written_bytes": 78536,
  "read_bytes": 2086779,
  "written_keys": 72,
  "read_keys": 33124,
  "approximate_size": 48,
  "approximate_keys": 18068
}
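
The hex `end_key` above can be decoded to see which part of the keyspace region 312644 covers. A minimal sketch, assuming the key uses the memcomparable byte encoding (8-byte groups, each followed by a marker byte equal to 0xFF minus the number of padding bytes), and assuming the decoded bytes follow TiDB's metadata key layout of `m`, an encoded key, an 8-byte type flag, and an encoded field. Both the layout interpretation and the function name are assumptions, not something confirmed in this thread:

```python
GROUP = 8      # payload bytes per encoded group
MARKER = 0xFF  # marker of a full group; lower values signal padding


def decode_mem_bytes(buf):
    """Decode one memcomparable byte string; return (data, bytes_consumed)."""
    out = bytearray()
    pos = 0
    while True:
        chunk = buf[pos:pos + GROUP + 1]
        if len(chunk) < GROUP + 1:
            raise ValueError("truncated group")
        pad = MARKER - chunk[GROUP]
        if pad > GROUP:
            raise ValueError("invalid marker byte")
        pos += GROUP + 1
        out += chunk[:GROUP - pad]
        if pad:  # a padded group is always the last one
            if any(chunk[GROUP - pad:GROUP]):
                raise ValueError("non-zero padding")
            return bytes(out), pos


end_key = "6D44427300000000FF00FA000000000000FF006844423A343500FF0000FC0000000000FA"
raw, _ = decode_mem_bytes(bytes.fromhex(end_key))
print(raw)

# Assumed inner layout: b"m" + encoded key + 8-byte big-endian flag + encoded field.
key, used = decode_mem_bytes(raw[1:])
flag = int.from_bytes(raw[1 + used:1 + used + 8], "big")
field, _ = decode_mem_bytes(raw[1 + used + 8:])
print(key, chr(flag), field)
```

Under those assumptions the decoded key starts with `m`, and the inner parts are the key `DBs`, the flag `h`, and the field `DB:45`. That would place the region between the empty start key and a metadata entry, which is consistent with the earlier observation that the region maps to neither a table nor an index.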

spc_monkey · September 26, 2021, 06:40

Could you provide all of the TiFlash log files?

qq24681430 · September 26, 2021, 06:46

tilog.tar.gz (24.5 MB)

spc_monkey · September 26, 2021, 06:52

Are you sure the ports between TiFlash and tidb-server/pd-server are OK? (The logs show obvious errors.)

qq24681430 · September 26, 2021, 07:02

[root@localhost tiflash-9000]# telnet 200.100.1.13 2379
Trying 200.100.1.13...
Connected to 200.100.1.13.
Escape character is '^]'.
[root@localhost tiflash-9000]# telnet 200.100.1.13 4000
Trying 200.100.1.13...
Connected to 200.100.1.13.
Escape character is '^]'.
W
(��.tt|~K.>YmJ{mysql_native_password
[root@localhost tiflash-9000]# telnet 200.100.1.13 10080
Trying 200.100.1.13...
Connected to 200.100.1.13.
Escape character is '^]'.

^C
Connection closed by foreign host.
[root@localhost tiflash-9000]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq portid 7cd30ae19bc8 state UP qlen 1000
link/ether 7c:d3:0a:e1:9b:c8 brd ff:ff:ff:ff:ff:ff
inet 200.100.1.167/16 brd 200.100.255.255 scope global eno1
valid_lft forever preferred_lft forever
inet6 fe80::23ce:5321:e14c:ccae/64 scope link
valid_lft forever preferred_lft forever

spc_monkey · September 26, 2021, 07:07

Could you run this command on the TiFlash node and see whether it works:
curl http://:/tiflash/replica

qq24681430 · September 26, 2021, 07:10

[root@localhost tiflash-9000]# curl http://200.100.1.167:8123/tiflash/replica
There is no handle /tiflash/replica

Use / or /ping for health checks.
Or /replicas_status for more sophisticated health checks.

Send queries from your program with POST method or GET /?query=...

Use clickhouse-client:

For interactive data analysis:
clickhouse-client

For batch query processing:
clickhouse-client --query='SELECT 1' > result
clickhouse-client < query > result
[root@localhost tiflash-9000]#
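
The reply above came from the HTTP server on port 8123 of the TiFlash node itself, so it says little about TiDB. The cluster manager log earlier in the thread shows the replica information being requested from TiDB at 200.100.1.13:10080. A minimal sketch, assuming the host and port missing from the suggested command were meant to be a TiDB status address and that the endpoint returns JSON; both function names are hypothetical:

```python
import json
import urllib.request


def tiflash_replica_url(tidb_host, status_port=10080):
    """Build the replica-info URL on a TiDB status port (assumed target)."""
    return f"http://{tidb_host}:{status_port}/tiflash/replica"


def fetch_replica_info(tidb_host, status_port=10080, timeout=5.0):
    """Fetch and parse the replica info; raises on timeout or a non-JSON body."""
    url = tiflash_replica_url(tidb_host, status_port)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


print(tiflash_replica_url("200.100.1.13"))
```

Calling `fetch_replica_info("200.100.1.13")` from the TiFlash host would repeat the request that timed out in `tiflash_cluster_manager.log`, which makes it a closer test than a telnet connect.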

spc_monkey · September 26, 2021, 07:22

Let me ask some other people.