【需求涉及的问题场景】
正常情况下:
1、开启PITR的log备份任务。
2、对集群进行全备。
3、指定清理N天后时间的log日志(即truncate --until ‘now() + N天’ 大于下次备份时间)。
4、对集群进行全备。
5、利用pitr对集群进行一致性时间点恢复(当前时间还未到now() + N天),报错失败。
【期望的需求行为】
发起truncate --until ‘now() + N天’ ,在v1_stream_trancate_safepoint.txt文件中记录当前时间或者日志中最后一次归档过来的时间。
【需求可替代方案】
清理时间不写未来时间即可。
复现步骤:
#开启br log日志归档
[tidb@localhost backup]$ ./br log start -u 192.168.31.201:2379 -s /data/backup/log --task-name=pitr
Detail BR log in /tmp/br.log.2023-11-03T22.59.39+0800
[2023/11/03 22:59:39.704 +08:00] [INFO] [collector.go:77] ["log start"] [streamTaskInfo="{taskName=pitr,startTs=445388837262983175,endTS=999999999999999999,tableFilter=*.*}"] [pausing=false] [rangeCount=2]
[2023/11/03 22:59:43.242 +08:00] [INFO] [collector.go:77] ["log start success summary"] [total-ranges=0] [ranges-succeed=0] [ranges-failed=0] [backup-checksum=10.426828ms] [total-take=3.60572121s]
#第一次对数据库做全备,让br log生效。
[tidb@localhost backup]$ ./br backup full -u 192.168.31.201:2379 -s /data/backup/data
Detail BR log in /tmp/br.log.2023-11-03T23.00.06+0800
Full Backup <-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------> 100.00%
Checksum <----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------> 100.00%
[2023/11/03 23:01:23.691 +08:00] [INFO] [collector.go:77] ["Full Backup success summary"] [total-ranges=228] [ranges-succeed=228] [ranges-failed=0] [backup-checksum=5.81226722s] [backup-fast-checksum=4.925917ms] [backup-total-ranges=162] [total-take=1m16.807251374s] [backup-data-size(after-compressed)=5.011GB] [Size=5011473467] [BackupTS=445388844668551189] [total-kv=105338332] [total-kv-size=18.8GB] [average-speed=244.8MB/s]
[tidb@localhost backup]$ ./br log status --task-name=pitr -u 192.168.31.201:2379
Detail BR log in /tmp/br.log.2023-11-03T23.01.44+0800
● Total 1 Tasks.
> #1 <
name: pitr
status: ● NORMAL
start: 2023-11-03 22:59:39.647 +0800
end: 2090-11-18 22:07:45.624 +0800
storage: local:///data/backup/log
speed(est.): 0.00 ops/s
checkpoint[global]: 2023-11-03 22:59:39.647 +0800; gap=2m6s
#在应急场景打算清理所有日志(但是我为了方便直接写了一个未来时间),我想等下次重新初始化一次备份然后自动归档日志。
[tidb@localhost backup]$ ./br log truncate --until "`date -d "+24 hours" "+%Y-%m-%d %H:%M:%S%z"`" -s "local:///data/backup/log"
Detail BR log in /tmp/br.log.2023-11-03T23.06.03+0800
Reading Metadata... DONE { take = 0s }
We are going to remove 3 files, until 2023-11-04 23:06:03.0000.
Sure? (y/N) y
Clearing Data Files and Metadata :: DONE { take = 1ms, kv-count = 303791, kv-size = 13222545(13.22MB) }
[tidb@localhost backup]$ rm -rf /data/backup/data/*
#再一次做数据库全量备份
[tidb@localhost backup]$ ./br backup full -u 192.168.31.201:2379 -s /data/backup/data
Detail BR log in /tmp/br.log.2023-11-03T23.06.32+0800
Full Backup <-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------> 100.00%
Checksum <----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------> 100.00%
[2023/11/03 23:07:48.966 +08:00] [INFO] [collector.go:77] ["Full Backup success summary"] [total-ranges=230] [ranges-succeed=230] [ranges-failed=0] [backup-checksum=5.858122395s] [backup-fast-checksum=4.970793ms] [backup-total-ranges=164] [total-take=1m16.051382355s] [backup-data-size(after-compressed)=5.026GB] [Size=5025814751] [BackupTS=445388945868980233] [total-kv=105642026] [total-kv-size=18.84GB] [average-speed=247.7MB/s]
[tidb@localhost backup]$ ./br log status --task-name=pitr -u 192.168.31.201:2379
Detail BR log in /tmp/br.log.2023-11-03T23.08.35+0800
● Total 1 Tasks.
> #1 <
name: pitr
status: ● NORMAL
start: 2023-11-03 22:59:39.647 +0800
end: 2090-11-18 22:07:45.624 +0800
storage: local:///data/backup/log
speed(est.): 119925.14 ops/s
checkpoint[global]: 2023-11-03 23:06:55.546 +0800; gap=1m41s
#在这里假设集群存在问题,需要应急恢复到故障时间点。
[tidb@localhost backup]$ ./br log stop -u 192.168.31.201:2379 -s /data/backup/log --task-name=pitr
Detail BR log in /tmp/br.log.2023-11-03T23.10.32+0800
[2023/11/03 23:10:32.676 +08:00] [INFO] [collector.go:77] ["log stop"] [streamTaskInfo="{taskName=pitr,startTs=445388837262983175,endTS=999999999999999999,tableFilter=*.*}"]
[2023/11/03 23:10:36.245 +08:00] [INFO] [collector.go:77] ["log stop success summary"] [total-ranges=0] [ranges-succeed=0] [ranges-failed=0] [total-take=3.610060354s]
#发现无法利用br restore point恢复到一致性时间点,原因是v1_stream_trancate_safepoint.txt记录的上一次清理的时间点是来自于人工传入的--until 参数里的时间点。
[tidb@localhost backup]$ ./br restore point -u 192.168.31.201:2379 --storage='/data/backup/log' --full-backup-storage='/data/backup/data' --restored-ts='2023-11-03 23:06:55.546 +0800'
Detail BR log in /tmp/br.log.2023-11-03T23.13.42+0800
Error: it has gap between full backup ts:445388945868980233(2023-11-03 23:06:33.946 +0800 CST) and log backup ts:445411586998272000(2023-11-04 23:06:03 +0800 CST). : [BR:Common:ErrInvalidArgument]invalid argument
#需要手工将这个时间点进行修改成当前时间点。但通过SELECT conv( concat( bin( unix_timestamp('2023-11-03 22:45:00.00 +0800') * 1000 ), '000000000000000001' ), 2, 10 );方式生成2023-11-03 22:45:00.00 +0800对应的TSO写入到log/v1_stream_trancate_safepoint.txt 不能用。
mysql> SELECT conv( concat( bin( unix_timestamp('2023-11-03 22:46:55.546 +0800') * 1000 ), '000000000000000001' ), 2, 10 ) as tso;
+--------------------+
| tso |
+--------------------+
| 445388636958490625 |
+--------------------+
1 row in set, 1 warning (0.00 sec)
mysql> select TIDB_PARSE_TSO(445388636958490625);
+------------------------------------+
| TIDB_PARSE_TSO(445388636958490625) |
+------------------------------------+
| 2023-11-03 22:46:55.546000 |
+------------------------------------+
1 row in set (0.00 sec)
[tidb@localhost backup]$ echo "445388636958490625" >log/v1_stream_trancate_safepoint.txt
[tidb@localhost backup]$ ./br restore point -u 192.168.31.201:2379 --storage='/data/backup/log' --full-backup-storage='/data/backup/data' --restored-ts='2023-11-03 23:06:55.546 +0800'
Detail BR log in /tmp/br.log.2023-11-03T23.41.00+0800
Error: [BR:Common:ErrInvalidMetaFile]invalid metafile: failed to parse the truncate safepoint
建议:./br log truncate --until "未来时间"这种形式最大只能清理到当前时间,这个当前时间希望取自于归档过来的最后一份日志时间(因为如果是真正的当前时间,那么因为br log归档的慢实际上真正的当前时间的日志还是会存在的,但因为safepoint.txt 是当前时间会导致迟缓归档过来的日志也不能用于恢复)。同样的这个truncate --until “特定时间”,这个特定时间希望根据日志判断,避免br归档极其缓慢的情况下虽然不能恢复特定时间以前的br log了但是因为br后面还是归档成功了存在br log 导致有br log但是不能恢复的情况。
另外,针对目前场景,假设错误的将truncate_safepoint时间写成了未来时间,那么想填写一个过去时间,怎么转换成这里面的TSO呢?(我是通过重新log truncate --until "过去时间“的方式不能改变这个文件,它总是记录曾经执行过的最大时间)