TiDB Binlog keeps reporting tidb_TiDB_binlog_error_total, and Drainer goes down (tidb_Drainer_server_is_down)

tidb126 · April 7, 2021, 00:44

alert: tidb_TiDB_binlog_error_total
expr: increase(tidb_server_critical_error_total[3m])
  > 0
How do I find the cause of this alert? This time it is the only alert firing, on just one of the TiDB instances, and Drainer has no problems.
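For readers unfamiliar with the rule, here is a minimal Python sketch of what `increase(tidb_server_critical_error_total[3m]) > 0` checks: whether the counter grew at all inside the 3-minute window. It is a simplification, since Prometheus also extrapolates to the window boundaries, and the function names are hypothetical.

```python
def counter_increase(samples):
    """Sum the growth of a counter across consecutive samples in a window.

    A drop between two samples is treated as a counter reset (for example a
    process restart), so the value after the reset counts as new growth.
    Unlike Prometheus, this sketch does no extrapolation.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


def alert_fires(samples):
    """Mimic the rule: fire when the counter increased within the window."""
    return counter_increase(samples) > 0


# Hypothetical scrapes of tidb_server_critical_error_total over 3 minutes.
quiet_window = [5, 5, 5, 5]
noisy_window = [5, 5, 6, 6]
```

With these sample windows, `alert_fires(quiet_window)` is false and `alert_fires(noisy_window)` is true, so a single new critical error on one TiDB instance is enough to trigger the alert.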

小王同学 · April 7, 2021, 04:27

Please see https://docs.pingcap.com/zh/tidb/stable/information-schema-inspection-result#critical-error-诊断规则

Check the TiDB and Pump logs.

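TiDB uses two-phase commit, a commonly used algorithm for distributed transactions, which consists of a prepare phase and a commit phase:

Prepare phase: TiDB writes the data to TiKV and, in parallel, sends a prewrite binlog to Pump. If either the write to TiKV or the write of the prewrite binlog to Pump fails, the transaction is considered failed and a rollback binlog is sent. This keeps the data in TiKV consistent with the data replicated downstream.
Commit phase: TiDB first sends the commit to TiKV, then asynchronously sends the commit binlog to Pump.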
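The flow described above can be sketched as a small simulation. This is only an illustration, not TiDB's actual code: `FakeTiKV`, `FakePump`, and `run_transaction` are hypothetical names, and the two prepare steps run one after the other here instead of in parallel.

```python
from dataclasses import dataclass, field


@dataclass
class FakePump:
    """Hypothetical stand-in for a Pump node; records the binlogs it accepts."""
    fail_prewrite: bool = False
    received: list = field(default_factory=list)

    def write(self, binlog_type, start_ts):
        if binlog_type == "Prewrite" and self.fail_prewrite:
            raise RuntimeError("write binlog to pump failed")
        self.received.append((binlog_type, start_ts))


@dataclass
class FakeTiKV:
    """Hypothetical stand-in for TiKV; records the operations it accepts."""
    fail_prewrite: bool = False
    log: list = field(default_factory=list)

    def prewrite(self, start_ts):
        if self.fail_prewrite:
            raise RuntimeError("prewrite to TiKV failed")
        self.log.append(("prewrite", start_ts))

    def commit(self, start_ts):
        self.log.append(("commit", start_ts))

    def rollback(self, start_ts):
        self.log.append(("rollback", start_ts))


def run_transaction(tikv, pump, start_ts):
    # Prepare phase: the TiKV write and the prewrite binlog must both succeed.
    errors = []
    for step in (lambda: tikv.prewrite(start_ts),
                 lambda: pump.write("Prewrite", start_ts)):
        try:
            step()
        except RuntimeError as exc:
            errors.append(exc)
    if errors:
        # Either failure aborts the transaction and emits a rollback binlog.
        tikv.rollback(start_ts)
        pump.write("Rollback", start_ts)
        return "rolled back"
    # Commit phase: TiKV first, then the commit binlog (asynchronous in TiDB).
    tikv.commit(start_ts)
    pump.write("Commit", start_ts)
    return "committed"
```

Calling `run_transaction(FakeTiKV(), FakePump(fail_prewrite=True), 1)` takes the rollback path even though the TiKV prewrite succeeded, which is the situation a failed prewrite binlog write would produce.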

tidb126 · April 7, 2021, 06:29

This is the only error in the Pump log; there is nothing else:
[2021/04/07 08:37:52.361 +08:00] [ERROR] [storage.go:444] ["GetMvccByEncodedKey failed"] ["start ts"=424084993120665606] [RegionError="message:\"region 102314 is missing\" region_not_found:<region_id:102314 > "]

TiDB log:
[2021/04/07 08:38:46.786 +08:00] [WARN] [client.go:295] ["[pumps client] write binlog to pump failed"] [NodeID=x.x.x.x:8250] ["binlog type"=Prewrite] ["start ts"=424084969147596833] ["commit ts"=0] [length=2295560656] [error="rpc error: code = ResourceExhausted desc = trying to send message larger than max (2295560672 vs. 2147483647)"]
[2021/04/07 08:38:46.796 +08:00] [ERROR] [binloginfo.go:253] ["write binlog failed"] [binlog_type=Prewrite] [binlog_start_ts=424084969147596833] [binlog_commit_ts=0] [error="rpc error: code = ResourceExhausted desc = trying to send message larger than max (2295560672 vs. 2147483647)"]
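To make the numbers in that gRPC error easier to read, here is a small Python sketch that pulls the two sizes out of the message and compares them. The regular expression and the `parse_oversize` helper are illustrative assumptions, not part of any TiDB tooling.

```python
import re

# Matches the size pair in "larger than max (<sent> vs. <limit>)".
GRPC_SIZE_RE = re.compile(r"larger than max \((\d+) vs\. (\d+)\)")


def parse_oversize(log_line):
    """Return sent size, limit, and excess in bytes, or None if no match."""
    match = GRPC_SIZE_RE.search(log_line)
    if match is None:
        return None
    sent, limit = int(match.group(1)), int(match.group(2))
    return {"sent": sent, "limit": limit, "excess": sent - limit}


line = ("rpc error: code = ResourceExhausted desc = trying to send message "
        "larger than max (2295560672 vs. 2147483647)")
info = parse_oversize(line)

# The limit in the log equals 2**31 - 1.
limit_is_int32_max = info["limit"] == 2**31 - 1
sent_gib = info["sent"] / 2**30  # size of the rejected message in GiB
```

Here the rejected prewrite binlog is a little over 2 GiB and exceeds the limit by 148077025 bytes, which suggests (as an inference, not something the log states) that one very large transaction produced it.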

小王同学 · April 7, 2021, 07:12

If you search AskTUG for this error, you can find related threads.
Change this Pump parameter:
-max-message-size
max message size tidb produce into pump (default 2147483647)

tidb126 · April 7, 2021, 07:16

Is the TiDB Binlog error caused by max-message-size, so that TiDB cannot write to Pump?

tidb126 · April 7, 2021, 07:24

How do I change max-message-size? The cluster was deployed with TiUP.

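小王同学 · April 7, 2021, 09:28

To adjust the maximum size of messages that TiDB sends to Pump, set the "max-message-size" parameter on the Pump command line. At present this parameter cannot be changed through the configuration file. The machine where Pump is deployed has a run_pump.sh startup script; add the parameter there.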

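tidb126 · April 8, 2021, 06:39

What conditions can trigger the TiDB_binlog_error_total alert? This time the TiDB log does not contain "trying to send message larger than max (2295560672 vs. 2147483647)".
There are no errors in the Pump log either. How should I troubleshoot this?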

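yilong · April 9, 2021, 10:06

Could you send the TiDB log so we can see whether there are other warnings?
Please also check whether Pump and Drainer logged any warnings during this period. Thanks.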

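tidb126 · April 19, 2021, 02:57

No other alerts fired.
Only the CPU alert on the physical machine fired, very briefly; it recovered within 1 minute.
The backup log shows no errors: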

[table=report_gov_hall_movie_box_film_cinema]
[2021/04/19 02:43:04.561 +08:00] [INFO] [client.go:206] ["save backup meta"] [path=local:///data/backup/br/2021041901/fulldata_backup] [size=22866092]
[2021/04/19 02:43:04.623 +08:00] [INFO] [ddl.go:394] ["[ddl] DDL closed"] [ID=09878349-bd81-4efd-8709-5f8f7a95cf23] ["take time"=4.591127ms]
[2021/04/19 02:43:04.634 +08:00] [INFO] [ddl.go:303] ["[ddl] stop DDL"] [ID=09878349-bd81-4efd-8709-5f8f7a95cf23]
[2021/04/19 02:43:04.637 +08:00] [INFO] [domain.go:452] ["infoSyncerKeeper exited."]
[2021/04/19 02:43:04.638 +08:00] [INFO] [domain.go:622] ["domain closed"] ["take time"=23.432629ms]
[2021/04/19 02:43:04.639 +08:00] [INFO] [collector.go:61] ["Full backup Success summary: total backup ranges: 3081, total success: 3081, total failed: 0, total take(Full backup time): 1h5m16.810326505s, total take(real time): 1h9m55.501744862s, total kv: 12388142180, total size(MB): 2199469.77, avg speed(MB/s): 561.55"] ["backup checksum"=4m29.729510119s] ["backup fast checksum"=5.834638234s] ["backup total regions"=32896] [BackupTS=424350106176716801] [Size=217297962629]

yilong · April 19, 2021, 12:33

Do the times of the binlog alert and the CPU alert match?

tidb126 · April 20, 2021, 00:38

There is no binlog alert now. Binlog is in an abnormal state: it has not been restored and is still down.

yilong · April 20, 2021, 02:08

Do you no longer need it for now? You could try restoring it again.

tidb126 · April 20, 2021, 02:09

I have restored it several times and it still goes down. We no longer use it, and nothing is being replicated downstream.

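yilong · April 20, 2021, 02:12

All right. If you only use binlog to back up to files, I think you could switch to BR for backups. The Binlog replication feature will probably also be replaced by TiCDC. Thanks.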