TiFlash is not syncing, and the logs show errors

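To improve efficiency, please provide the following information; a clearly described problem gets resolved faster:
【TiDB environment】
【Overview】Scenario + problem summary
2 TiDB, 2 PD, 1 TiKV, 1 TiFlash (scaled out)
【Background】Operations performed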
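【Symptoms】Business and database symptoms
Following the error in the log, I tested port 10080 with telnet and it connects fine.
【Business impact】
【TiDB version】
v4.0.11
【Attachments】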

  1. TiUP Cluster Display information
    $ tiup cluster display test-cluster
    Found cluster newer version:

    The latest version: v1.5.6
    Local installed version: v1.4.1
    Update current component: tiup update cluster
    Update all components: tiup update --all

Starting component cluster: /home/tidb/.tiup/components/cluster/v1.4.1/tiup-cluster display test-cluster
Cluster type: tidb
Cluster name: test-cluster
Cluster version: v4.0.11
SSH type: builtin
Dashboard URL: http://200.100.1.13:2379/dashboard
ID                  Role          Host          Ports                                OS/Arch       Status  Data Dir                                       Deploy Dir
--                  ----          ----          -----                                -------       ------  --------                                       ----------
200.100.1.13:9093   alertmanager  200.100.1.13  9093/9094                            linux/x86_64  Up      /tidbdata/deploy/data.alertmanager             /tidbdata/deploy
200.100.1.13:3000   grafana       200.100.1.13  3000                                 linux/x86_64  Up      -                                              /tidbdata/deploy
200.100.1.13:2379   pd            200.100.1.13  2379/2380                            linux/x86_64  Up|UI   /tidbdata/deploy/data.pd                       /tidbdata/deploy
200.100.1.17:2379   pd            200.100.1.17  2379/2380                            linux/x86_64  Up|L    /tidbdata/deploy/pd-2379/data                  /tidbdata/deploy/pd-2379
200.100.1.13:9090   prometheus    200.100.1.13  9090                                 linux/x86_64  Up      /tidbdata/deploy/prometheus2.0.0.data.metrics  /tidbdata/deploy
200.100.1.17:9090   prometheus    200.100.1.17  9090                                 linux/x86_64  Up      /tidbdata/deploy/prometheus-9090/data          /tidbdata/deploy/prometheus-9090
200.100.1.13:4000   tidb          200.100.1.13  4000/10080                           linux/x86_64  Up      -                                              /tidbdata/deploy
200.100.1.17:4000   tidb          200.100.1.17  4000/10080                           linux/x86_64  Up      -                                              /tidbdata/deploy/tidb-4000
200.100.1.17:19000  tiflash       200.100.1.17  19000/18123/13930/30170/10292/18234  linux/x86_64  Up      /tidbdata/deploy2/data                         /tidbdata/deploy2/tiflash-9000
200.100.1.13:20160  tikv          200.100.1.13  20160/20180                          linux/x86_64  Up      /tidbdata/deploy/data                          /tidbdata/deploy

  2. TiUP Cluster Edit Config information

  3. TiDB Overview monitoring

  • Logs of the relevant components (covering one hour before and after the problem)

vi tiflash_error.log

2021.09.23 10:25:22.024024 [ 32 ] pingcap.pd: write tso failed
2021.09.23 10:25:22.024152 [ 32 ] pd/oracle: update ts error: Exception: write tso failed
2021.09.23 10:25:22.024325 [ 31 ] pingcap.pd: get member failed: 14: failed to connect to all addresses
2021.09.23 10:25:22.024376 [ 31 ] pingcap.pd: failed to get cluster id by :http://200.100.1.13:2379
2021.09.23 10:25:22.025497 [ 31 ] pingcap.pd: failed to get cluster id by :http://200.100.1.17:2379
2021.09.23 10:25:22.025569 [ 31 ] pingcap.pd: Exception: failed to update leader
2021.09.23 10:25:23.500883 [ 4 ] pingcap.pd: get safe point failed: 2: rpc error: code = Unavailable desc = not leader

vi tiflash_cluster_manager.log

2021-09-23 10:51:17,180 root: can not get tiflash replica info from tidb: [('200.100.1.13:10080', ReadTimeout(ReadTimeoutError("HTTPConnectionPool(host='200.100.1.13', port=10080): Read timed out. (read timeout=5)",),))]
Traceback (most recent call last):
  File "flash_cluster_manager.py", line 286, in main
  File "flash_cluster_manager.py", line 129, in init
  File "flash_cluster_manager.py", line 29, in wrap_func
  File "flash_cluster_manager.py", line 238, in table_update
  File "tidb_tools.py", line 42, in db_flash_replica
Exception: can not get tiflash replica info from tidb: [('200.100.1.13:10080', ReadTimeout(ReadTimeoutError("HTTPConnectionPool(host='200.100.1.13', port=10080): Read timed out. (read timeout=5)",),))]
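
The error above is a read timeout, not a connection failure, so a successful telnet to port 10080 does not rule it out: the TCP connection can open while the HTTP response never arrives within the 5 seconds. A minimal Python sketch for telling the two cases apart is shown here. The helper name `probe_http` and its return labels are made up for illustration; the 5-second default only mirrors the `read timeout=5` in the log.

```python
import socket
import urllib.error
import urllib.request

# Ignore any HTTP proxy set in the environment so cluster-internal
# addresses are contacted directly.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe_http(url, timeout=5.0):
    """Classify one HTTP GET as 'ok', 'read_timeout', or 'connect_failed'.

    Hypothetical helper, not part of any TiDB tooling.
    """
    try:
        with _OPENER.open(url, timeout=timeout) as resp:
            resp.read()
        return "ok"
    except urllib.error.HTTPError:
        # The server replied (with a non-2xx status), so it is reachable
        # and responsive at the HTTP level.
        return "ok"
    except socket.timeout:
        # TCP connected and the request was sent, but no response arrived
        # in time. This is the case a plain telnet test cannot detect.
        return "read_timeout"
    except urllib.error.URLError:
        # Connection refused, unreachable host, or connect timeout.
        return "connect_failed"
```

Run from the TiFlash node, `probe_http("http://200.100.1.13:10080/tiflash/replica")` would exercise the same address the cluster manager log complains about; a `read_timeout` result would be consistent with telnet succeeding while the log still reports errors.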

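Is the port unreachable? I suggest searching AskTUG for related posts first; there is one that walks through troubleshooting the whole TiFlash sync process.

The ports are fine via telnet. I went through the checks in that post and found no problems.

The post you sent doesn't seem related to the errors in my logs. I have checked basically everything it lists and found nothing wrong.

But the errors you posted above indicate that TiFlash cannot reach these ports.

The problem seems to be here: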

vi tiflash_error.log

2021.09.23 11:38:50.635654 [ 21 ] pingcap.tikv: region {32425,3,8} find error: region 32425 is missing
2021.09.23 12:06:04.482850 [ 25 ] pingcap.tikv: region {32425,6,8} find error: region 32425 is missing
2021.09.23 12:53:55.844680 [ 24 ] pingcap.tikv: region {32425,9,8} find error: region 32425 is missing
2021.09.23 13:00:15.229413 [ 27 ] pingcap.tikv: region {32425,12,8} find error: region 32425 is missing
2021.09.23 13:00:15.260998 [ 27 ] pingcap.tikv: region {312644,3,20} find error: region 312644 is missing
2021.09.23 13:06:07.296700 [ 28 ] pingcap.tikv: region {312644,6,20} find error: region 312644 is missing
2021.09.23 13:11:55.929209 [ 31 ] pingcap.tikv: region {32425,15,8} find error: region 32425 is missing
2021.09.23 13:11:55.957615 [ 31 ] pingcap.tikv: region {312644,9,20} find error: region 312644 is missing
2021.09.23 13:38:54.269852 [ 35 ] pingcap.tikv: region {312644,12,20} find error: region 312644 is missing
2021.09.23 13:45:13.917874 [ 28 ] pingcap.tikv: region {312644,15,20} find error: region 312644 is missing
2021.09.23 14:00:12.747666 [ 34 ] pingcap.tikv: region {32425,18,8} find error: region 32425 is missing
2021.09.23 14:00:12.826069 [ 34 ] pingcap.tikv: region {312644,18,20} find error: region 312644 is missing
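
To see at a glance which regions keep being reported, the log lines above can be tallied by region ID. This is only a sketch: the regular expression is written against the line format shown in this excerpt, and `count_missing_regions` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches lines such as:
#   pingcap.tikv: region {32425,3,8} find error: region 32425 is missing
MISSING_RE = re.compile(
    r"region \{(\d+),(\d+),(\d+)\} find error: region \d+ is missing"
)


def count_missing_regions(lines):
    """Return a Counter mapping region ID to number of 'is missing' lines."""
    counts = Counter()
    for line in lines:
        match = MISSING_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Three lines copied from the excerpt above.
SAMPLE = [
    "2021.09.23 11:38:50.635654 [ 21 ] pingcap.tikv: region {32425,3,8} find error: region 32425 is missing",
    "2021.09.23 12:06:04.482850 [ 25 ] pingcap.tikv: region {32425,6,8} find error: region 32425 is missing",
    "2021.09.23 13:00:15.260998 [ 27 ] pingcap.tikv: region {312644,3,20} find error: region 312644 is missing",
]

if __name__ == "__main__":
    for region_id, hits in sorted(count_missing_regions(SAMPLE).items()):
        print(region_id, hits)
```

In the excerpt only two region IDs appear, 32425 and 312644, which narrows the investigation to those two regions.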

Check which table this region belongs to, set that table's TiFlash replica count to 0, and then sync it again.

Tried that, it doesn't work!
The region in the error is neither a table nor an index.

One question: was your TiFlash working normally before?

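TiFlash was working normally before. Later a RAID card problem lost part of the data, so I ran unsafe recover to restore the data, then scaled in TiFlash and scaled it out onto a new server.
The TiDB, PD, TiKV, and TiFlash instances on the new server were all newly scaled out.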

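How many tables in the cluster currently have a TiFlash replica enabled?
1. For every table that has a TiFlash replica enabled, you could set the replica count to 0 and then enable it again.
2. Is this a production cluster or a test cluster? You can use the pd-ctl region command to look at this region's information. From what I can see the region is abnormal, and you need to decide whether it can be recreated (this creates an empty region). The command is: tikv-ctl --db /path/to/tikv-data/db recreate-region --pd -r <region_id>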

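1. Because the sync made no progress, I reduced the tables with a TiFlash replica enabled to three.
2. I tested setting them all to 0 and enabling them again; it had no effect.
3. It is a production cluster. The information for one of the regions is: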
[tidb@crm-dc-13 v4.0.11]$ ./pd-ctl region 312644
{
  "id": 312644,
  "start_key": "",
  "end_key": "6D44427300000000FF00FA000000000000FF006844423A343500FF0000FC0000000000FA",
  "epoch": {
    "conf_ver": 227,
    "version": 20
  },
  "peers": [
    {
      "id": 480207,
      "store_id": 1
    },
    {
      "id": 480250,
      "store_id": 390017
    }
  ],
  "leader": {
    "id": 480207,
    "store_id": 1
  },
  "written_bytes": 78536,
  "read_bytes": 2086779,
  "written_keys": 72,
  "read_keys": 33124,
  "approximate_size": 48,
  "approximate_keys": 18068
}
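
The `end_key` above can be decoded to check the earlier observation that this region is neither table nor index data. The sketch below assumes the key uses TiKV's memcomparable bytes encoding (groups of 8 data bytes, each followed by a marker byte equal to 0xFF minus the number of padding bytes); `decode_memcomparable` is a hypothetical helper written for this post, not a TiKV API.

```python
GROUP_SIZE = 8
MARKER = 0xFF

# end_key copied from the pd-ctl output above.
END_KEY_HEX = (
    "6D44427300000000FF00FA000000000000FF"
    "006844423A343500FF0000FC0000000000FA"
)


def decode_memcomparable(encoded):
    """Strip the 8-byte-group padding and markers from an encoded key."""
    out = bytearray()
    step = GROUP_SIZE + 1
    for start in range(0, len(encoded), step):
        chunk = encoded[start:start + step]
        if len(chunk) < step:
            raise ValueError("truncated group in encoded key")
        pad = MARKER - chunk[GROUP_SIZE]
        if pad > GROUP_SIZE:
            raise ValueError("invalid marker byte in encoded key")
        out += chunk[:GROUP_SIZE - pad]
        if pad:
            # A marker below 0xFF means this was the last group.
            break
    return bytes(out)


if __name__ == "__main__":
    raw = decode_memcomparable(bytes.fromhex(END_KEY_HEX))
    print(raw)
```

Under that assumption the decoded key begins with `mDBs`. TiDB row and index keys start with `t`, while keys starting with `m` belong to metadata, so region 312644, whose `start_key` is empty, ends inside the metadata range rather than in any table's key range.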

Could you provide all of the TiFlash logs?

tilog.tar.gz (24.5 MB)

Are you sure the ports between TiFlash and tidb-server/pd-server are OK? (The logs show obvious errors.)

[root@localhost tiflash-9000]# telnet 200.100.1.13 2379
Trying 200.100.1.13...
Connected to 200.100.1.13.
Escape character is '^]'.
[root@localhost tiflash-9000]# telnet 200.100.1.13 4000
Trying 200.100.1.13...
Connected to 200.100.1.13.
Escape character is '^]'.
W
(��.tt|~K.>YmJ{mysql_native_password
[root@localhost tiflash-9000]# telnet 200.100.1.13 10080
Trying 200.100.1.13...
Connected to 200.100.1.13.
Escape character is '^]'.

^C
Connection closed by foreign host.
[root@localhost tiflash-9000]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq portid 7cd30ae19bc8 state UP qlen 1000
link/ether 7c:d3:0a:e1:9b:c8 brd ff:ff:ff:ff:ff:ff
inet 200.100.1.167/16 brd 200.100.255.255 scope global eno1
valid_lft forever preferred_lft forever
inet6 fe80::23ce:5321:e14c:ccae/64 scope link
valid_lft forever preferred_lft forever

Could you run this command on the TiFlash node and check whether it works:
curl http://:/tiflash/replica

[root@localhost tiflash-9000]# curl http://200.100.1.167:8123/tiflash/replica
There is no handle /tiflash/replica

Use / or /ping for health checks.
Or /replicas_status for more sophisticated health checks.

Send queries from your program with POST method or GET /?query=...

Use clickhouse-client:

For interactive data analysis:
clickhouse-client

For batch query processing:
clickhouse-client --query='SELECT 1' > result
clickhouse-client < query > result
[root@localhost tiflash-9000]#

Let me ask someone else :rofl::rofl: