DDL hangs after PD reports ["store does not have enough disk space"]

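Problem:
When the disk ran short of space, a DDL statement hung and could not be killed or cancelled. After disk space was freed, it was still stuck (still could not be killed or cancelled), as shown in the screenshot:

In addition, many temporary directories appeared under /tmp: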

drwx------ 3 admin admin 4.0K 2022-04-14 10:16:22 tidb-unistore-temp1166280911
drwx------ 3 admin admin 4.0K 2022-04-15 18:19:33 tidb-unistore-temp1189425191
drwx------ 3 admin admin 4.0K 2022-04-15 18:19:33 tidb-unistore-temp1251260923
drwx------ 3 admin admin 4.0K 2022-04-19 16:39:42 tidb-unistore-temp1341930054
drwx------ 3 admin admin 4.0K 2022-04-09 23:03:44 tidb-unistore-temp1410710097
drwx------ 3 admin admin 4.0K 2022-04-19 16:39:43 tidb-unistore-temp1634880386
drwx------ 3 admin admin 4.0K 2022-04-15 09:43:32 tidb-unistore-temp1644846419
drwx------ 3 admin admin 4.0K 2022-04-08 17:41:31 tidb-unistore-temp1653001776
drwx------ 3 admin admin 4.0K 2022-04-17 20:13:20 tidb-unistore-temp1702893701
drwx------ 3 admin admin 4.0K 2022-04-18 13:16:13 tidb-unistore-temp1711347467
drwx------ 3 admin admin 4.0K 2022-04-09 23:03:37 tidb-unistore-temp1738662889
drwx------ 3 admin admin 4.0K 2022-04-15 09:43:32 tidb-unistore-temp1815947099
drwx------ 3 admin admin 4.0K 2022-04-09 23:03:33 tidb-unistore-temp1821776732
drwx------ 3 admin admin 4.0K 2022-04-10 21:56:34 tidb-unistore-temp2039369652
drwx------ 3 admin admin 4.0K 2022-04-16 01:09:46 tidb-unistore-temp2057544414
drwx------ 3 admin admin 4.0K 2022-04-16 16:44:31 tidb-unistore-temp2079985504
TiDB version: 5.2.2

The PD log is as follows:

EndKey:{}"] [old-version=19081] [new-version=19082]
[2022/04/20 14:15:48.705 +08:00] [INFO] [cluster_worker.go:219] ["region batch split, generate new regions"] [region-id=2] [origin="id:65001 start_key:\"74800000000000E5FF4F00000000000000F8\" end_key:\"74800000000000E5FF5100000000000000F8\" region_epoch:<conf_ver:1 version:19082 > peers:<id:65002 store_id:1 >"] [total=1]
[2022/04/20 14:15:49.283 +08:00] [WARN] [cluster.go:536] ["store does not have enough disk space"] [store-id=1] [capacity=211242639360] [available=11704561664]
[2022/04/20 14:15:59.283 +08:00] [WARN] [cluster.go:536] ["store does not have enough disk space"] [store-id=1] [capacity=211242639360] [available=11704274944]
[2022/04/20 14:16:09.284 +08:00] [WARN] [cluster.go:536] ["store does not have enough disk space"] [store-id=1] [capacity=211242639360] [available=11704086528]
[2022/04/20 14:16:19.284 +08:00] [WARN] [cluster.go:536] ["store does not have enough disk space"] [store-id=1] [capacity=211242639360] [available=11703832576]
[2022/04/20 14:16:29.285 +08:00] [WARN] [cluster.go:536] ["store does not have enough disk space"] [store-id=1] [capacity=211242639360] [available=11703549952]
[2022/04/20 14:16:39.286 +08:00] [WARN] [cluster.go:536] ["store does not have enough disk space"] [store-id=1] [capacity=211242639360] [available=11703394304]
[2022/04/20 14:16:49.286 +08:00] [WARN] [cluster.go:536] ["store does not have enough disk space"] [store-id=1] [capacity=211242639360] [available=11703128064]
[2022/04/20 14:16:59.287 +08:00] [WARN] [cluster.go:536] ["store does not have enough disk space"] [store-id=1] [capacity=211242639360] [available=11702849536]
[2022/04/20 14:17:09.288 +08:00] [WARN] [cluster.go:536] ["store does not have enough disk space"] [store-id=1] [capacity=211242639360] [available=11702673408]
[2022/04/20 14:17:19.288 +08:00] [WARN] [cluster.go:536] ["store does not have enough disk space"] [store-id=1] [capacity=211242639360] [available=11702423552]
[2022/04/20 14:17:29.289 +08:00] [WARN] [cluster.go:536] ["store does not have enough disk space"] [store-id=1] [capacity=211242639360] [available=11702145024]
[2022/04/20 14:17:39.290 +08:00] [WARN] [cluster.go:536] ["store does not have enough disk space"] [store-id=1] [capacity=211242639360] [available=11701977088]
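
For reference, the capacity and available fields in these warnings are in bytes, so the store has only about 5.5% of its disk free. Below is a minimal sketch for pulling that ratio out of a log line. It assumes PD's `low-space-ratio` is at its default of 0.8 and uses a simplified rule (free fraction below 1 - ratio); the real PD check may take other factors into account. The function names are hypothetical and not part of any TiDB tool.

```python
import re

# Matches bracketed numeric key=value fields such as [capacity=211242639360].
FIELD_RE = re.compile(r"\[([\w-]+)=(\d+)\]")

# Assumption: PD's low-space-ratio left at its default value.
LOW_SPACE_RATIO = 0.8


def available_ratio(log_line: str) -> float:
    """Return available/capacity for one 'not enough disk space' log line."""
    fields = {key: int(value) for key, value in FIELD_RE.findall(log_line)}
    return fields["available"] / fields["capacity"]


def is_low_space(log_line: str, low_space_ratio: float = LOW_SPACE_RATIO) -> bool:
    """Simplified check: the free fraction is below 1 - low_space_ratio."""
    return available_ratio(log_line) < 1 - low_space_ratio


line = (
    '[2022/04/20 14:15:49.283 +08:00] [WARN] [cluster.go:536] '
    '["store does not have enough disk space"] [store-id=1] '
    '[capacity=211242639360] [available=11704561664]'
)

if __name__ == "__main__":
    print(f"available ratio: {available_ratio(line):.4f}")
    print(f"low space: {is_low_space(line)}")
```

Under these assumptions the warning keeps being logged until the free fraction rises back above 20%, so the later lines are useful for checking whether the cleanup actually freed enough space.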

Use this to cancel the DDL:
https://docs.pingcap.com/zh/tidb/dev/sql-statement-admin-cancel-ddl

There is no job_id at all, so this command can't be used.

Doesn't admin show ddl jobs show that SQL statement?

See my screenshot: there is no running job_id.

Then the DDL has probably already finished. How did you determine that it is stuck?

No, it hasn't. This is an empty table, and you can see in processlist that the statement has been running for more than 6,700 seconds.

Kill 94805 first; it has been running for far too long.

It can't be killed. After running kill tidb 94805 the SQL was still there; it only went away after a restart.

Could it be this issue: the statement has actually been killed, but a stale entry is still displayed?
https://github.com/pingcap/tidb/pull/29212

Judging by the status in admin show ddl jobs, the DDL has definitely completed; see the explanation in the official documentation:


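A column change on an empty table finishing within seconds is also as expected.

I think the problem is in processlist. If the statement can't be killed, I suggest restarting the TiDB node.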