TIKV_REGION_STATUS can only be queried through the first PD node

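Bug report
The PD cluster has 3 nodes. When the test data center goes down, one node is lost. Queries against data tables still work normally, but when checking the Region distribution, running the SQL select * from TIKV_REGION_STATUS; fails with an error.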

When this table is queried, is the request only ever dispatched to the first PD node?
[TiDB version]
v5.4.0
[Impact of the bug]
Once the first PD node is lost, select * from TIKV_REGION_STATUS; can no longer return any data.

Before you ran this query, did a PD leader already exist, or was an election still in progress?

Was it still syncing metadata, or in the middle of an election?

The problem is real. It reproduces whether or not the node that went down is the Leader.

  1. I tested this and the problem can be reproduced. The following error stack is printed.
    [2022/06/24 10:27:47.896 +08:00] [INFO] [conn.go:1115] ["command dispatched failed"] [conn=7] [connInfo="id:7, addr:172.xxx.xx.136:55676 status:10, collation:utf8_general_ci, user:root"] [command=Query] [status="inTxn:0, autocommit:1"] [sql="select * from TIKV_REGION_STATUS"] [txn_mode=PESSIMISTIC] [err="Get "http://172.xxx.xx.162:18279/pd/api/v1/regions": dial tcp 172.xxx.xx.162:18279: connect: connection refused\ngithub.com/pingcap/errors.AddStack\n\t/nfs/cache/mod/github.com/pingcap/errors@v0.11.5-0.20211224045212-9687c2b0f87c/errors.go:174\ngithub.com/pingcap/errors.Trace\n\t/nfs/cache/mod/github.com/pingcap/errors@v0.11.5-0.20211224045212-9687c2b0f87c/juju_adaptor.go:15\ngithub.com/pingcap/tidb/store/helper.(*Helper).requestPD\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/store/helper/helper.go:813\ngithub.com/pingcap/tidb/store/helper.(*Helper).GetRegionsInfo\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/store/helper/helper.go:771\ngithub.com/pingcap/tidb/executor.(*memtableRetriever).setDataForTiKVRegionStatus\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/infoschema_reader.go:1449\ngithub.com/pingcap/tidb/executor.(*memtableRetriever).retrieve\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/infoschema_reader.go:141\ngithub.com/pingcap/tidb/executor.(*MemTableReaderExec).Next\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/memtable_reader.go:118\ngithub.com/pingcap/tidb/executor.Next\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/executor.go:286\ngithub.com/pingcap/tidb/executor.(*recordSet).Next\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/adapter.go:149\ngithub.com/pingcap/tidb/server.(*tidbResultSet).Next\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/driver_tidb.go:312\ngithub.com/pingcap/tidb/server.(*clientConn).writeChunks\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/conn.go:2165\ngithub.com/pingcap/tidb/server.(*clientConn).writeResultset\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/conn.go:2116\ngithub.com/pingcap/tidb/server.(*clientConn).handleStmt\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/conn.go:1994\ngithub.com/pingcap/tidb/server.(*clientConn).handleQuery\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/conn.go:1841\ngithub.com/pingcap/tidb/server.(*clientConn).dispatch\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/conn.go:1336\ngithub.com/pingcap/tidb/server.(*clientConn).Run\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/conn.go:1091\ngithub.com/pingcap/tidb/server.(*Server).onConn\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/server.go:548\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1371"]
  2. At first glance, this is the code that ends up contacting the PD node that is down.
    for _, host := range pdHosts {
        req, err = http.NewRequest(method, util.InternalHTTPSchema()+"://"+host+uri, body)
        if err != nil {
            // Try to request from another PD node when some nodes may down.
            if strings.Contains(err.Error(), "connection refused") {
                continue
            }
            return errors.Trace(err)
        }
    }
    if err != nil {
        return err
    }
    start := time.Now()
    resp, err := util.InternalHTTPClient().Do(req)
    if err != nil {
        return errors.Trace(err)
    }
  3. Filed an issue: https://github.com/pingcap/tidb/issues/35708

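Also, if you do run into this, the workaround is:
Run tiup ctl:v5.4.0 pd -u xxxxxxx -i and use the pd control commands to look up the member information of the node that is down.
Then use member delete id or member delete name to remove the down node. After that, the table can be queried again.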
https://docs.pingcap.com/zh/tidb/stable/pd-control#pd-control-使用说明


Yes, removing the failed PD node resolves it.