生产环境 tiup 升级v4.0.2 到v4.0.4 tiflash 等待超时 timed out waiting for port 9000 to be started after 2m0s

yourchanges · 2020 年8 月 20 日 05:56

tiup v1.0.9

tidb v4.0.2 (从3.x升级上来的, 也是tiup upgrade升级, 升级过程顺利)

v4.0.2 升级失败日志:
tiup-cluster-debug-2020-08-20-01-35-19.log (769.0 KB)

tiflash 日志没有见明显错误, 进程一直在, 就是没端口监听

tiflash.log (277.0 KB) tiflash_error.log (2.9 KB)

执行过一次upgrade 后, 在tidb dashboard里 tiflash 已经是离线状态了.

yilong · 2020 年8 月 20 日 06:32

debug 日志：
2020-08-20T01:35:19.560+0800 DEBUG TaskFinish {“task”: “UpgradeCluster”, “error”: “\t192.168.254.12 failed to restart: timed out waiting for port 9000 to be started after 2m0s: timed out waiting for port 9000 to be started after 2m0s”, “errorVerbose”: "timed out waiting for port 9000 to be started after
tiflahs 日志：
2020.08.20 01:33:26.561529 [ 1 ] DeltaMergeStore[db_45.t_3062]: Release DeltaMerge Store start [db_45.t_3062]
2020.08.20 01:33:26.561537 [ 1 ] DeltaMergeStore[db_45.t_3062]: Release DeltaMerge Store end [db_45.t_3062]
2020.08.20 01:33:28.130962 [ 1 ] Application: Let tiflash proxy shutdown
2020.08.20 01:33:28.130994 [ 1 ] Application: Wait for tiflash proxy thread to join
tiflash_error日志的时间是问题发生前的
2020.08.20 00:31:38.310447 [ 23 ] pingcap.tikv: region {6,5,2} find error: peer is not leader for region 6, leader may Some(id: 47 store_id: 4)
2020.08.20 00:31:59.289970 [ 13 ] pingcap.tikv: region {6,5,2} find error: peer is not leader for region 6, leader may Some(id: 7 store_id: 1)
稍等我们再分析下

birdstorm · 2020 年8 月 20 日 06:46

请问能否提供一下部署时的 topo.yaml 文件？并且把 tiflash 目录下的 .log 及 .toml 文件打包一下发一下吧，我们分析一下。

yourchanges · 2020 年8 月 20 日 07:37

tiflash.tar.gz (3.1 MB)

日志重复的处理了下, 不然太大

cat tiflash.log | grep -v “Available space is only” | grep -v “PageStorage” > tiflash2.log

下面两种去掉了

PageStorage: db_45.t_49.meta GC exit within 0.00 sec. PageFiles from [1,0,Formal] to [3,0,Legacy], size: 3, compact legacy archive files: 0, remove data files: 0, gc apply: 0 puts and 0 refs and 0 deletes and 0 upserts

HandleComputeFsStats: Available space is only 19.4508% of capacity size

birdstorm · 2020 年8 月 20 日 08:08

先强制杀掉 tiflash 进程吧，然后重新通过 tiup 拉起来，应该就能恢复了。这个问题是因为 v4.0.3 及 v4.0.4 存在一个死锁等待的问题，已修复但还未发版。

yourchanges · 2020 年8 月 20 日 17:17

强行杀掉一样的还是启动失败

yourchanges · 2020 年8 月 21 日 01:13

另外 tiflash 的二进制有1.1GB 大小? 也太奇怪了

[root@data9 tiflash-9000]# ls -lh bin/tiflash/ -lh
总用量 1.1G
drwxr-xr-x 10 tidb tidb  4.0K 7月  28 19:14 flash_cluster_manager
-rwxr-xr-x  1 tidb tidb   54M 7月  28 19:32 libtiflash_proxy.so
-rw-r--r--  1 tidb tidb  986K 4月  16 15:23 PingCAP Community Software Agreement(Chinese Version).pdf
-rw-r--r--  1 tidb tidb  155K 4月  16 15:23 PingCAP Community Software Agreement(English Version).pdf
-rwxr-xr-x  1 tidb tidb 1018M 7月  28 19:32 tiflash

有绕过tiflash 升级tidb集群的方法么,现在生产还是4.0.2

lonng · 2020 年8 月 21 日 02:15

这几天就会发布新版本 v4.0.5 会修复 tiflash 这个问题，请问您这边方便等到到时候再升级到最新版吗？

solotzg-PingCAP · 2020 年8 月 21 日 02:19

4.0.4版本，初始化阶段抛出异常没有做好善后处理就会导致该死锁等待问题，麻烦把log级别调至debug再重启一次，我们排查下引发异常的原因

yourchanges · 2020 年8 月 21 日 08:46

只有凌晨敢搞, 白天是生产, 怕一个不小心

QBin · 2020 年8 月 22 日 08:19

好的。麻烦拿到调整后的日志再上传一下。

yourchanges · 2020 年9 月 1 日 01:21

tiflash.tar.gz (50.2 KB)

这个是开启debug 后的日志,其他之前的日志都清理了, 现在这个集群, 除了tidb tikiv pd 都手工重启升级到v4.0.4了, 但是升级完成后, tiflash 还是有问题

JaySon-Huang · 2020 年9 月 1 日 03:40

从日志看是在从磁盘加载一个 db 的表 schema 信息时出现问题了。
请问可以打包上传一下这个文件夹的文件吗？/data/tidb-data/tiflash-9000/metadata/db_4628
可以通过站内私信或者发到这个邮箱 huangjunshen@pingcap.com 发送给我。

另外一个方法是升级到昨天发布的 v4.0.5 版本，这样我们可以通过错误日志观察到错误的具体原因

yourchanges · 2020 年9 月 1 日 06:16

db_4628.tar.gz (2.9 KB)

已经发4.0.5了? 我去升级下

不过我看到已经有好几个报 4.0.4 升级到4.0.5 有问题了.

JaySon-Huang · 2020 年9 月 1 日 07:18

应该是 t_5504.sql 这个文件所对应的表，主键的名字有过修改（从msgid 修改为 msg_id）。在 v4.0.4 版本之前，TiFlash 侧没有正确地处理主键名字修改的 DDL 操作，而 v4.0.4 版本之后，在启动时会检查出这个异常而报错。

修复的方式需要您手动更改 t_5504.sql 文件，把其中 ENGINE = DeltaMerge(msgid, ... 修改为 ENGINE = DeltaMerge(msg_id, ...，然后再启动 TiFlash 进程看是否能顺利启动。

注意：如果想在修改前备份 schema 的文件，请把备份的文件放到 TiFlash 的 metadata/ 文件夹外，否则会造成启动异常

yourchanges · 2020 年9 月 1 日 12:04

直接升级到v4.0.5 也需要手工修改后才能升级么?

yourchanges · 2020 年9 月 1 日 13:11

新建的v4.0.4 集群和老的v4.0.4 集群通过tiup 升级到 v4.0.5 一样的失败了。还是tiflash 超时

yourchanges · 2020 年9 月 1 日 13:36

新的v4.0.4 集群 tiflash 每隔一会就打一个错误日志

[2020/09/01 21:32:36.156 +08:00] [INFO] [<unknown>] ["SchemaBuilder: Altering table jiuxun_huishou(45).t_im_base_info(317)"] [thread_id=13]
[2020/09/01 21:32:36.156 +08:00] [INFO] [<unknown>] ["SchemaBuilder: No schema change detected for table jiuxun_huishou(45).t_im_base_info(317), not altering"] [thread_id=13]
[2020/09/01 21:32:36.156 +08:00] [INFO] [<unknown>] ["SchemaBuilder: Altering table jiuxun_huishou(45).t_push_base_info(332)"] [thread_id=13]
[2020/09/01 21:32:36.156 +08:00] [ERROR] [<unknown>] ["SchemaSyncService: DB::SchemaSyncService::SchemaSyncService(DB::Context&)::<lambda()>: Sync schemas failed by DB::Exception: Storage engine DeltaMerge doesn't support lossy data type modify from Nullable(Int32) to Int8:  table name: t_332, table id: 332 \
 stack : 0. bin/tiflash/tiflash(StackTrace::StackTrace()+0x15) [0x3381595]\
1. bin/tiflash/tiflash(DB::Exception::Exception(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)+0x1e) [0x337823e]\
2. bin/tiflash/tiflash(DB::StorageDeltaMerge::alterImpl(DB::AlterCommands const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::optional<std::reference_wrapper<TiDB::TableInfo const> >, DB::Context const&)+0x4a9) [0x6736af9]\
3. bin/tiflash/tiflash(DB::StorageDeltaMerge::alterFromTiDB(DB::AlterCommands const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, TiDB::TableInfo const&, DB::SchemaNameMapper const&, DB::Context const&)+0xa4) [0x6738bd4]\
4. bin/tiflash/tiflash(DB::SchemaBuilder<DB::SchemaGetter, DB::SchemaNameMapper>::applyAlterPhysicalTable(std::shared_ptr<TiDB::DBInfo>, std::shared_ptr<TiDB::TableInfo>, std::shared_ptr<DB::IManageableStorage>)+0x372) [0x6d0baf2]\
5. bin/tiflash/tiflash(DB::SchemaBuilder<DB::SchemaGetter, DB::SchemaNameMapper>::applyAlterLogicalTable(std::shared_ptr<TiDB::DBInfo>, std::shared_ptr<TiDB::TableInfo>, std::shared_ptr<DB::IManageableStorage>)+0x6c) [0x6d0c17c]\
6. bin/tiflash/tiflash(DB::SchemaBuilder<DB::SchemaGetter, DB::SchemaNameMapper>::syncAllSchema()+0xa47) [0x6d0d6d7]\
7. bin/tiflash/tiflash(DB::TiDBSchemaSyncer<false>::syncSchemas(DB::Context&)+0x487) [0x69da287]\
8. bin/tiflash/tiflash(DB::SchemaSyncService::syncSchemas()+0x2e) [0x69ca37e]\
9. bin/tiflash/tiflash() [0x69cc1a4]\
10. bin/tiflash/tiflash(DB::BackgroundProcessingPool::threadFunction()+0x856) [0x685d446]\
11. bin/tiflash/tiflash() [0x7ca2cfe]\
12. /lib64/libpthread.so.0(+0x7dd4) [0x7f83351a4dd4]\
13. /lib64/libc.so.6(clone+0x6c) [0x7f8334bcc02c]\
"] [thread_id=13]
[2020/09/01 21:32:36.156 +08:00] [INFO] [<unknown>] ["SchemaSyncer: start to sync schemas. current version is: 0 and try to sync schema version to: 531"] [thread_id=11]
[2020/09/01 21:32:36.156 +08:00] [INFO] [<unknown>] ["SchemaBuilder: Syncing all schemas."] [thread_id=11]
[2020/09/01 21:32:36.156 +08:00] [INFO] [<unknown>] ["SchemaBuilder: Altering table mysql(3).tables_priv(11)"] [thread_id=11]
[2020/09/01 21:32:36.156 +08:00] [INFO] [<unknown>] ["SchemaBuilder: No schema change detected for table mysql(3).tables_priv(11), not altering"] [thread_id=11]
[2020/09/01 21:32:36.156 +08:00] [INFO] [<unknown>] ["SchemaBuilder: Altering table mysql(3).columns_priv(13)"] [thread_id=11]

然后

[tidb@data11 ~]$ tiup cluster upgrade jiuji-tidb-cluster-v2 v4.0.5

报错：

Restarting component tiflash

Restarting instance 192.168.254.12

retry error: operation timed out after 2m0s

192.168.254.12 failed to restart: timed out waiting for port 8000 to be started after 2m0s

Error: 192.168.254.12 failed to restart: timed out waiting for port 8000 to be started after 2m0s: timed out waiting for port 8000 to be started after 2m0s

Verbose debug logs has been written to **/home/tidb/logs/tiup-cluster-debug-2020-09-01-21-21-06.log** .

Error: run `/home/tidb/.tiup/components/cluster/v1.1.0/tiup-cluster` (wd:/home/tidb/.tiup/data/S9LHkgo) failed: exit status 1

10分钟后从pd dashboard 看 tiflash 已经启动了变成 v4.0.5 其他组件还是v4.0.4

yourchanges · 2020 年9 月 1 日 13:41

手工执行

tiup cluster restart jiuji-tidb-cluster-v2 -R tikv

tiup cluster restart jiuji-tidb-cluster-v2 -R tidb

tiup cluster restart jiuji-tidb-cluster-v2 -R pd

从pd dashboard 看全部是v4.0.5 了不知道这种算不算升级成功，这个只是新集群，老集群的修改sql 有用，升级到v4.0.4 了只有等凌晨看看能不能升级到v4.0.5

但是命令行显示不对, 还是v4.0.4 不是pd dashboard的 v4.0.5 两个信息不一样

[tidb@data11 ~]$ tiup cluster display jiuji-tidb-cluster-v2
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.1.1/tiup-cluster display jiuji-tidb-cluster-v2
tidb Cluster: jiuji-tidb-cluster-v2
tidb Version: v4.0.4
ID                   Role          Host            Ports                            OS/Arch       Status  Data Dir                              Deploy Dir
--                   ----          ----            -----                            -------       ------  --------                              ----------
192.168.254.12:9095  alertmanager  192.168.254.12  9095/9096                        linux/x86_64  Up      /data/tidb-data-v2/alertmanager-9095  /data/tidb-deploy-v2/alertmanager-9095
192.168.254.12:3030  grafana       192.168.254.12  3030

JaySon-Huang · 2020 年9 月 1 日 14:07

这一层的问题是另外一个已知的问题。
这个错误的原因是在 TiDB 侧对 jiuxun_huishou.t_push_base_info 执行了不支持的 DDL 变更，将 int 类型的列有损变更为 tinyint。这个问题安全的解决方法为在 TiDB 侧将 jiuxun_huishou.t_push_base_info 进行 truncate table 操作，然后 TiFlash 会再同步信息时恢复。要看业务上是否能够接受。

对于再上一层查明的主键问题，目前仅能通过手动修改文件处理，4.0.5 版本不能解决这个问题。

另外一个解决这两种问题的办法为：参考官方文档，把所有 TiFlash 节点先进行缩容，然后再重新扩容。
但假如线上再进行有损变更的 DDL，还是会再出现异常。对这两个问题的兼容性修复，目前正在进行中，预计下一个小版本中提高兼容性。
https://docs.pingcap.com/zh/tidb/stable/scale-tidb-using-tiup#缩容-tiflash-节点

另外，对于有损的列类型变更功能在实现中，会于 5.0 版本推出。对您造成麻烦十分抱歉。