DDL is stuck: a CREATE DATABASE sits in the queue and never completes

To help us respond faster, please provide the following information when asking a question; clearly described issues get priority.

  • [TiDB version]: v4.0.4
  • [Issue description]: Stuck on a DDL; there is currently only one job in the queue.

From the TiDB log, the database creation never completes:

[2020/09/21 11:38:54.505 +00:00] [INFO] [session.go:2130] ["CRUCIAL OPERATION"] [conn=719] [schemaVersion=258] [cur_db=] [sql="create database tt"] [user=root@10.204.11.89]
[2020/09/21 11:38:54.515 +00:00] [INFO] [ddl_worker.go:261] ["[ddl] add DDL jobs"] ["batch count"=1] [jobs="ID:226, Type:create schema, State:none, SchemaState:none, SchemaID:225, TableID:0, RowCount:0, ArgLen:1, start time: 2020-09-21 11:38:54.487 +0000 UTC, Err:<nil>, ErrCount:0, SnapshotVersion:0; "]
[2020/09/21 11:38:54.515 +00:00] [INFO] [ddl.go:477] ["[ddl] start DDL job"] [job="ID:226, Type:create schema, State:none, SchemaState:none, SchemaID:225, TableID:0, RowCount:0, ArgLen:1, start time: 2020-09-21 11:38:54.487 +0000 UTC, Err:<nil>, ErrCount:0, SnapshotVersion:0"] [query="create database tt"]
[2020/09/21 11:39:54.551 +00:00] [WARN] [expensivequery.go:168] [expensive_query] [cost_time=60.045631501s] [conn_id=719] [user=root] [txn_start_ts=0] [mem_max="0 Bytes (0 Bytes)"] [sql="create database tt"]
[2020/09/21 11:40:01.016 +00:00] [INFO] [client_batch.go:633] ["recycle idle connection"] [target=basic1-tikv-1.basic1-tikv-peer.test-namespace1.svc:20160]

At the moment there is only one job in the system:

MySQL [(none)]> admin show ddl jobs
    -> ;
+--------+------------+------------+---------------+--------------+-----------+----------+-----------+---------------------+---------------------+-----------+
| JOB_ID | DB_NAME    | TABLE_NAME | JOB_TYPE      | SCHEMA_STATE | SCHEMA_ID | TABLE_ID | ROW_COUNT | START_TIME          | END_TIME            | STATE     |
+--------+------------+------------+---------------+--------------+-----------+----------+-----------+---------------------+---------------------+-----------+
|    226 | tt         |            | create schema | none         |       225 |        0 |         0 | 2020-09-21 11:38:54 | NULL                | none      |
|    224 | tt         |            | create schema | none         |       223 |        0 |         0 | 2020-09-21 11:24:15 | 2020-09-21 11:26:27 | cancelled |
|    222 | tpcc       |            | create schema | none         |       221 |        0 |         0 | 2020-09-21 11:21:43 | 2020-09-21 11:22:17 | cancelled |
|    220 | warehouses |            | create schema | none         |       219 |        0 |         0 | 2020-09-21 11:16:14 | 2020-09-21 11:21:13 | cancelled |
|    218 | tpcc       |            | create schema | none         |       217 |        0 |         0 | 2020-09-21 11:13:23 | 2020-09-21 11:21:11 | cancelled |
|    216 | tpcc       |            | create schema | none         |       215 |        0 |         0 | 2020-09-21 11:12:18 | 2020-09-21 11:20:49 | cancelled |
|    214 | tpcc       |            | create schema | none         |       213 |        0 |         0 | 2020-09-21 10:15:04 | 2020-09-21 11:20:41 | cancelled |
|    212 | tpcc       |            | create schema | none         |       211 |        0 |         0 | 2020-09-21 10:12:32 | 2020-09-21 11:20:34 | cancelled |
|    210 | tpcc       |            | create schema | none         |       209 |        0 |         0 | 2020-09-21 10:12:14 | 2020-09-21 11:20:12 | cancelled |
|    208 | sbtest     | sbtest9    | add index     | public       |       111 |      173 |  10000000 | 2020-09-21 05:57:57 | 2020-09-21 07:06:57 | synced    |
|    207 | sbtest     | sbtest6    | add index     | public       |       111 |      191 |  10000000 | 2020-09-21 05:57:57 | 2020-09-21 07:02:38 | synced    |
+--------+------------+------------+---------------+--------------+-----------+----------+-----------+---------------------+---------------------+-----------+
11 rows in set (0.02 sec)
MySQL [(none)]> admin show ddl \G;
*************************** 1. row ***************************
   SCHEMA_VER: 258
     OWNER_ID: 766dd74f-bf39-4677-ae50-834c1c03845c
OWNER_ADDRESS: basic1-tidb-1.basic1-tidb-peer.test-namespace1.svc:4000
 RUNNING_JOBS: ID:226, Type:create schema, State:none, SchemaState:none, SchemaID:225, TableID:0, RowCount:0, ArgLen:0, start time: 2020-09-21 11:38:54.487 +0000 UTC, Err:<nil>, ErrCount:0, SnapshotVersion:0
      SELF_ID: 766dd74f-bf39-4677-ae50-834c1c03845c
        QUERY: create database tt
1 row in set (0.00 sec)

Checking the DDL jobs, there is only this one create job, and I can't tell where it is stuck:

MySQL [(none)]> admin show ddl jobs 10000 WHERE STATE not in ("cancelled","synced") ;
+--------+---------+------------+---------------+--------------+-----------+----------+-----------+---------------------+----------+-------+
| JOB_ID | DB_NAME | TABLE_NAME | JOB_TYPE      | SCHEMA_STATE | SCHEMA_ID | TABLE_ID | ROW_COUNT | START_TIME          | END_TIME | STATE |
+--------+---------+------------+---------------+--------------+-----------+----------+-----------+---------------------+----------+-------+
|    226 | tt      |            | create schema | none         |       225 |        0 |         0 | 2020-09-21 11:38:54 | NULL     | none  |
+--------+---------+------------+---------------+--------------+-----------+----------+-----------+---------------------+----------+-------+
1 row in set (0.04 sec)

This problem was caused by taking all the Pumps offline with the pump offline command and setting the replica count to 0 so the operator would remove the Pump pods; once they were gone, binlog data could no longer be written.

Did you turn off enable-binlog on the TiDB servers afterwards?

  1. If you are sure you want to take Pump offline, set binlog to false; in tiup this is written as binlog.enable: false. After the change, reload tidb (see the YAML sketch after this list).

  2. What do you mean by setting the replica count to 0? What exact command did you run?
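
For reference, a minimal sketch of what step 1 could look like for a tiup-managed cluster; the cluster name mycluster and the exact topology layout are assumptions for illustration, not taken from this thread:

server_configs:
  tidb:
    # Stop tidb-server from writing binlog to Pump; edit this via:
    #   tiup cluster edit-config mycluster
    binlog.enable: false
# Then roll the change out to the tidb nodes only:
#   tiup cluster reload mycluster -R tidb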

Bringing Pump back fixed it.

:+1:

This cluster is managed by tidb-operator on Kubernetes. I took Pump offline, and after it was automatically shut down the problem appeared.

Later I will take Pump fully offline again and set replicas to 0 to see what happens.

OK. If you run into problems, feel free to report back.

I've tested several rounds: Pump cannot be taken offline; as soon as it goes down, everything gets stuck.

I added this section as well, but it still has no effect; the configuration on the TiDB side is not picked up.

Before removing the Pump nodes, you must first run `kubectl edit tc ${cluster_name} -n ${namespace}` and **set** `spec.tidb.binlogEnabled` to `false`; wait for the tidb pods to finish restarting and updating before removing the Pump nodes.

Removing the Pump nodes directly leaves TiDB with no Pump to write to, which makes it unusable.
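
For clarity, a minimal sketch of that edit; only the relevant field of the TidbCluster spec is shown and everything else is omitted:

spec:
  tidb:
    # With this set to false, the operator restarts the tidb pods with binlog
    # disabled; only after that is it safe to remove the Pump nodes.
    binlogEnabled: false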

Neither adding it under config nor setting spec.tidb.binlogEnabled solves it; putting it into config does not automatically restart tidb.

I also tried adding it under config, but tidb does not pick the configuration up.

One more question while we're at it: how do I change the tidb configuration and then restart it?

Is this related to the stuck DDL? If not, please open a new topic in the private cloud / public cloud category so the k8s team can take a look. Thanks.

It's all the same problem: nothing can be written.

After changing the tidb configuration, to restart it please follow the steps in this doc:

https://docs.pingcap.com/zh/tidb-in-kubernetes/stable/restart-a-tidb-cluster/
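
In short, that doc triggers a graceful rolling restart by changing an annotation value on the component's spec. A minimal sketch for the tidb component, assuming a tidb-operator version that supports the restartedAt annotation; the timestamp value itself is arbitrary, it is the change of the value that triggers the restart:

spec:
  tidb:
    annotations:
      # Any change to this value makes the operator perform a graceful rolling
      # restart of the tidb pods.
      tidb.pingcap.com/restartedAt: "2020-09-28T10:00"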

I configured it as above and found a problem:

  pd:
    baseImage: harbor.fcbox.com/tidb/pingcap/pd
    replicas: 3
    storageClassName: pd-storage
    configUpdateStrategy: RollingUpdate
    enableDashboardInternalProxy: true
    requests:
      storage: "50Gi"
    config: {}
    annotations:
      tidb.pingcap.com/restartedAt: "202009271800"

Is the plan for the pd pods to restart but it did not take effect, or is there some other problem?

1. tidb-cluster.yaml

apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: push-tidb
spec:
  version: v4.0.6
  timezone: Asia/Shanghai
  pvReclaimPolicy: Retain
  enableDynamicConfiguration: true
  discovery: {}
  pd:
    baseImage: harbor.fcbox.com/tidb/pingcap/pd
    replicas: 3
    storageClassName: pd-storage
    configUpdateStrategy: RollingUpdate
    enableDashboardInternalProxy: true
    requests:
      storage: "50Gi"
    config: {}
    annotations:
      tidb.pingcap.com/restartedAt: "202009271800"
  tikv:
    baseImage: harbor.fcbox.com/tidb/pingcap/tikv
    replicas: 3
    storageClassName: kv-storage
    requests:
      storage: "50Gi"
      cpu: 8
      memory: "45GB"
    limit:
      cpu: 8
      memory: "45GB"
    config:
      storage:
        block-cache:
          capacity: "32GB"
      readpool:
        storage:
          high-concurrency: 8
          normal-concurrency: 8
          low-concurrency: 8
  pump:
    baseImage: harbor.fcbox.com/tidb/pingcap/tidb-binlog
    version: v4.0.6
    replicas: 3
    storageClassName: pump-storage
    requests:
      storage: 10Gi
    schedulerName: default-scheduler
    config:
      addr: 0.0.0.0:8250
      gc: 7
      heartbeat-interval: 2
  tidb:
    baseImage: harbor.fcbox.com/tidb/pingcap/tidb
    replicas: 3
    slowLogTailer:
      image: harbor.fcbox.com/tidb/busybox:1.26.2
    storageClassName: tidb-storage
    binlogEnabled: false
    annotations:
      tidb.pingcap.com/restartedAt: "202009271800"
    requests:
      storage: "1Gi"
    service:
      type: ClusterIP
    config:
      binlog:
        enable: true

2. I need the pd restart to take effect; I modified it according to the doc, but it did not take effect.

Following the configuration from the official docs, it errors out right away... a syntax error again.

To confirm: which version of tidb-operator are you currently using?

1.14

Right now I'm stuck on the problem of getting the rolling update to happen after modifying an existing cluster.

OK, understood.

Also, regarding disabling the tidb server's binlog parameter mentioned above: has it been disabled successfully now, and can you restart a group of pods, such as pd, via restartedAt?