PITR: truncating logs up to a future time makes consistent recovery to a specified point in time impossible

【Problem scenario】
Under normal circumstances:
1. Start a PITR log backup task.
2. Take a full backup of the cluster.
3. Truncate logs up to a time N days in the future (i.e. truncate --until 'now() + N days', which is later than the time of the next full backup).
4. Take another full backup of the cluster.
5. Use PITR to restore the cluster to a consistent point in time (the current time is still before now() + N days); the restore fails with an error.
【Expected behavior】
When truncate --until 'now() + N days' is issued, the v1_stream_trancate_safepoint.txt file should record the current time, or the time of the last log that has actually been archived.
【Workaround】
Simply avoid passing a future time when truncating; see the example below.
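
For instance, a truncate that never goes past the current moment could look like the following sketch (same storage path and flags as in the reproduction below; only the --until value changes):

# Truncate only up to the current wall-clock time instead of a future one
./br log truncate --until "$(date "+%Y-%m-%d %H:%M:%S%z")" -s "local:///data/backup/log"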

Reproduction steps:

# Start br log archiving
[tidb@localhost backup]$ ./br log start -u 192.168.31.201:2379 -s /data/backup/log --task-name=pitr
Detail BR log in /tmp/br.log.2023-11-03T22.59.39+0800 
[2023/11/03 22:59:39.704 +08:00] [INFO] [collector.go:77] ["log start"] [streamTaskInfo="{taskName=pitr,startTs=445388837262983175,endTS=999999999999999999,tableFilter=*.*}"] [pausing=false] [rangeCount=2]
[2023/11/03 22:59:43.242 +08:00] [INFO] [collector.go:77] ["log start success summary"] [total-ranges=0] [ranges-succeed=0] [ranges-failed=0] [backup-checksum=10.426828ms] [total-take=3.60572121s]

# Take the first full backup of the database so that br log takes effect.
[tidb@localhost backup]$ ./br backup full -u 192.168.31.201:2379 -s /data/backup/data
Detail BR log in /tmp/br.log.2023-11-03T23.00.06+0800 
Full Backup <-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------> 100.00%
Checksum <----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------> 100.00%
[2023/11/03 23:01:23.691 +08:00] [INFO] [collector.go:77] ["Full Backup success summary"] [total-ranges=228] [ranges-succeed=228] [ranges-failed=0] [backup-checksum=5.81226722s] [backup-fast-checksum=4.925917ms] [backup-total-ranges=162] [total-take=1m16.807251374s] [backup-data-size(after-compressed)=5.011GB] [Size=5011473467] [BackupTS=445388844668551189] [total-kv=105338332] [total-kv-size=18.8GB] [average-speed=244.8MB/s]
[tidb@localhost backup]$ ./br log  status --task-name=pitr -u 192.168.31.201:2379
Detail BR log in /tmp/br.log.2023-11-03T23.01.44+0800 
● Total 1 Tasks.
> #1 <
              name: pitr
            status: ● NORMAL
             start: 2023-11-03 22:59:39.647 +0800
               end: 2090-11-18 22:07:45.624 +0800
           storage: local:///data/backup/log
       speed(est.): 0.00 ops/s
checkpoint[global]: 2023-11-03 22:59:39.647 +0800; gap=2m6s

# In an emergency scenario I want to clear all logs (for convenience I simply wrote a future time); the plan is to re-initialize with a fresh full backup next time and let logs keep archiving automatically.
[tidb@localhost backup]$ ./br log truncate --until "`date -d "+24 hours" "+%Y-%m-%d %H:%M:%S%z"`" -s "local:///data/backup/log"
Detail BR log in /tmp/br.log.2023-11-03T23.06.03+0800 
Reading Metadata... DONE { take = 0s }
We are going to remove 3 files, until 2023-11-04 23:06:03.0000.
Sure? (y/N) y
Clearing Data Files and Metadata  ::  DONE { take = 1ms, kv-count = 303791, kv-size = 13222545(13.22MB) }
[tidb@localhost backup]$ rm -rf /data/backup/data/*

# Take another full backup of the database
[tidb@localhost backup]$ ./br backup full -u 192.168.31.201:2379 -s /data/backup/data
Detail BR log in /tmp/br.log.2023-11-03T23.06.32+0800 
Full Backup <-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------> 100.00%
Checksum <----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------> 100.00%
[2023/11/03 23:07:48.966 +08:00] [INFO] [collector.go:77] ["Full Backup success summary"] [total-ranges=230] [ranges-succeed=230] [ranges-failed=0] [backup-checksum=5.858122395s] [backup-fast-checksum=4.970793ms] [backup-total-ranges=164] [total-take=1m16.051382355s] [backup-data-size(after-compressed)=5.026GB] [Size=5025814751] [BackupTS=445388945868980233] [total-kv=105642026] [total-kv-size=18.84GB] [average-speed=247.7MB/s]
[tidb@localhost backup]$ ./br log  status --task-name=pitr -u 192.168.31.201:2379
Detail BR log in /tmp/br.log.2023-11-03T23.08.35+0800 
● Total 1 Tasks.
> #1 <
              name: pitr
            status: ● NORMAL
             start: 2023-11-03 22:59:39.647 +0800
               end: 2090-11-18 22:07:45.624 +0800
           storage: local:///data/backup/log
       speed(est.): 119925.14 ops/s
checkpoint[global]: 2023-11-03 23:06:55.546 +0800; gap=1m41s

# At this point, assume the cluster has a problem and we need an emergency restore to the failure point in time.
[tidb@localhost backup]$ ./br log stop -u 192.168.31.201:2379 -s /data/backup/log --task-name=pitr
Detail BR log in /tmp/br.log.2023-11-03T23.10.32+0800 
[2023/11/03 23:10:32.676 +08:00] [INFO] [collector.go:77] ["log stop"] [streamTaskInfo="{taskName=pitr,startTs=445388837262983175,endTS=999999999999999999,tableFilter=*.*}"]
[2023/11/03 23:10:36.245 +08:00] [INFO] [collector.go:77] ["log stop success summary"] [total-ranges=0] [ranges-succeed=0] [ranges-failed=0] [total-take=3.610060354s]

# br restore point can no longer restore to a consistent point in time, because the last-truncate timestamp recorded in v1_stream_trancate_safepoint.txt comes from the time manually passed in the --until parameter.
[tidb@localhost backup]$ ./br restore point -u 192.168.31.201:2379 --storage='/data/backup/log' --full-backup-storage='/data/backup/data' --restored-ts='2023-11-03 23:06:55.546 +0800'
Detail BR log in /tmp/br.log.2023-11-03T23.13.42+0800 
Error: it has gap between full backup ts:445388945868980233(2023-11-03 23:06:33.946 +0800 CST) and log backup ts:445411586998272000(2023-11-04 23:06:03 +0800 CST). : [BR:Common:ErrInvalidArgument]invalid argument
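
The gap can be confirmed by decoding the TSO recorded in the safepoint file. A sketch only, assuming the usual TiDB TSO layout (physical Unix milliseconds shifted left by 18 bits, plus a logical counter in the low bits):

# Read the recorded truncate safepoint and decode its physical part back to wall-clock time
tso=$(cat /data/backup/log/v1_stream_trancate_safepoint.txt)
date -d "@$(( (tso >> 18) / 1000 ))" "+%Y-%m-%d %H:%M:%S%z"   # here this should print the future --until time, 2023-11-04 23:06:03+0800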

# This timestamp needs to be changed back to the current time by hand. However, generating the TSO for 2023-11-03 22:45:00.00 +0800 with SELECT conv( concat( bin( unix_timestamp('2023-11-03 22:45:00.00 +0800') * 1000 ), '000000000000000001' ), 2, 10 ); and writing it into log/v1_stream_trancate_safepoint.txt does not work.

mysql> SELECT conv( concat( bin( unix_timestamp('2023-11-03 22:46:55.546 +0800') * 1000 ), '000000000000000001' ), 2, 10 ) as tso;
+--------------------+
| tso                |
+--------------------+
| 445388636958490625 |
+--------------------+
1 row in set, 1 warning (0.00 sec)

mysql> select TIDB_PARSE_TSO(445388636958490625);
+------------------------------------+
| TIDB_PARSE_TSO(445388636958490625) |
+------------------------------------+
| 2023-11-03 22:46:55.546000         |
+------------------------------------+
1 row in set (0.00 sec)


[tidb@localhost backup]$ echo "445388636958490625" >log/v1_stream_trancate_safepoint.txt 

[tidb@localhost backup]$ ./br restore point -u 192.168.31.201:2379 --storage='/data/backup/log' --full-backup-storage='/data/backup/data' --restored-ts='2023-11-03 23:06:55.546 +0800'
Detail BR log in /tmp/br.log.2023-11-03T23.41.00+0800 
Error: [BR:Common:ErrInvalidMetaFile]invalid metafile: failed to parse the truncate safepoint

Suggestion: a command of the form ./br log truncate --until "future time" should truncate at most up to the current time, and that "current time" should preferably be taken from the time of the last log that has actually been archived. If the real wall-clock time were used instead, then because br log archiving lags behind, logs covering the real current time would still exist, yet since safepoint.txt records the current time those late-arriving logs could no longer be used for recovery. Likewise, for truncate --until "specific time", that specific time should be validated against the archived logs; otherwise, when br archiving is extremely slow, logs before the specified time can no longer be restored even though br later archives them successfully, leaving log files that exist but cannot be used for recovery.
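
Until BR clamps the value itself, one way to approximate the suggested behavior is to feed the archived checkpoint reported by br log status back into truncate. This is only a rough sketch that assumes the "checkpoint[global]:" line format shown above; the datetime format accepted by --until may need adjusting:

# Use the archived checkpoint as the upper bound instead of a hand-written time
ckpt=$(./br log status --task-name=pitr -u 192.168.31.201:2379 | grep 'checkpoint\[global\]' | sed 's/.*: \(.*\); gap=.*/\1/')
./br log truncate --until "$ckpt" -s "local:///data/backup/log"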

Also, for the current situation: suppose the truncate_safepoint time has mistakenly been written as a future time and I now want to put a past time into it, how do I convert that time into the TSO stored in this file? (Re-running log truncate --until "past time" does not change the file; it always keeps the largest time that has ever been used.)
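
For reference, the same relationship can be used in the other direction to compose a TSO from a past time: take the Unix timestamp in milliseconds and shift it left by 18 bits (the physical part). A sketch only, using GNU date, leaving the logical part at 0, and assuming the shell timezone is +0800; as the attempt above shows, BR may still refuse to parse a hand-written safepoint file:

# Compose a TSO for a past wall-clock time: physical milliseconds << 18
ts_ms=$(date -d "2023-11-03 22:46:55.546" +%s%3N)
echo $(( ts_ms << 18 ))   # 445388636958490624, one logical tick below the SQL result above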

In the future, metadata such as the starting ts for PITR restore and the ts up to which logs can be deleted will be stored in system tables, so this kind of failure will no longer occur. This partially overlaps with the requirements in https://internal.pingcap.net/jira/browse/FRM-1099 and is already being planned.
