Manually configuring TLS between components fails

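【TiDB environment】Production
【TiDB version】6.3.0
【Problem encountered】After manually configuring TLS certificates and reloading, TiKV and TiDB fail to start
【Steps to reproduce】Create the certificates manually, then run tiup cluster edit-config
【Symptoms and impact】tiup cluster reload fails, and the status of some TiKV and TiDB nodes changes to disconnected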

【Attachments】:

  • Relevant logs
    Log from tikv

[2022/11/02 01:41:38.893 +00:00] [INFO] [util.rs:551] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "failed to connect to all addresses", details: }))"] [endpoints=10.250.87.38:2378]
[2022/11/02 01:41:38.893 +00:00] [INFO] [util.rs:589] ["connecting to PD endpoint"] [endpoints=10.250.87.122:2378]
[2022/11/02 01:41:38.895 +00:00] [INFO] [] ["subchannel 0x7faa87d35000 {address=ipv4:10.250.87.122:2378, args=grpc.client_channel_factory=0x7faa87c9c1f8, grpc.default_authority=10.250.87.122:2378, grpc.http2_scheme=https, grpc.internal.channel_credentials=0x7faa875c7c60, grpc.internal.security_connector=0x7faa7fe719b0…

Log from PD

[2022/11/03 02:21:45.797 +00:00] [INFO] [server.go:1406] ["start to watch pd leader"] [pd-leader="name:"pd-10.250.87.38-2378" member_id:16326710579290257846 peer_urls:"http://10.250.87.38:2380" client_urls:"http://10.250.87.38:2378" "]
[2022/11/03 02:21:45.799 +00:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {10.250.87.38:2378 0 }. Err :connection error: desc = "transport: authentication handshake failed: EOF". Reconnecting..."]
[2022/11/03 02:21:45.833 +00:00] [WARN] [leadership.go:194] ["required revision has been compacted, use the compact revision"] [required-revision=305224] [compact-revision=713535]
[2022/11/03 02:21:46.800 +00:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {10.250.87.38:2378 0 }. Err :connection error: desc = "transport: authentication handshake failed: EOF". Reconnecting..."]
[2022/11/03 02:21:48.461 +00:00] [WARN] [stream.go:436] ["lost TCP streaming connection with remote peer"] [stream-reader-type="stream MsgApp v2"] [local-member-id=f5476fe9c527f5b9] [remote-peer-id=3310c14e027b4f49] [error=EOF]
[2022/11/03 02:21:48.463 +00:00] [WARN] [stream.go:436] ["lost TCP streaming connection with remote peer"] [stream-reader-type="stream Message"] [local-member-id=f5476fe9c527f5b9] [remote-peer-id=3310c14e027b4f49] [error=EOF]
[2022/11/03 02:21:48.464 +00:00] [WARN] [peer_status.go:68] ["peer became inactive (message send to peer failed)"] [peer-id=3310c14e027b4f49] [error="failed to dial 3310c14e027b4f49 on stream Message (peer 3310c14e027b4f49 failed to find local node f5476fe9c527f5b9)"]
[2022/11/03 02:21:48.566 +00:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {10.250.87.38:2378 0 }. Err :connection error: desc = "transport: authentication handshake failed: EOF". Reconnecting..."]

  • Configuration file

server_configs:
  tidb:
    binlog.enable: false
    binlog.ignore-error: false
    log.file.max-days: 7
    log.slow-threshold: 300
    mem-quota-query: 524288000
    oom-action: cancel
    performance.txn-total-size-limit: 10730418240
    security.cluster-ssl-ca: /vdb/tidb-certs/idtrca.cer
    security.cluster-ssl-cert: /vdb/tidb-certs/server-cert.pem
    security.cluster-ssl-key: /vdb/tidb-certs/server-key.pem
    security.ssl-ca: /vdb/tidb-certs/idtrca.cer
    security.ssl-cert: /vdb/tidb-certs/server-cert.pem
    security.ssl-key: /vdb/tidb-certs/server-key.pem
    tikv-client.copr-cache.enable: false
  tikv:
    readpool.coprocessor.use-unified-pool: true
    readpool.storage.use-unified-pool: false
    security.ca-path: /vdb/tidb-certs/idtrca.cer
    security.cert-path: /vdb/tidb-certs/server-cert.pem
    security.key-path: /vdb/tidb-certs/server-key.pem
  pd:
    auto-compaction-retention: 5m
    log.file.max-days: 7
    quota-backend-bytes: 17179869184
    schedule.leader-schedule-limit: 4
    schedule.region-schedule-limit: 2048
    schedule.replica-schedule-limit: 64
    security.cacert-path: /vdb/tidb-certs/idtrca.cer
    security.cert-path: /vdb/tidb-certs/server-cert.pem
    security.key-path: /vdb/tidb-certs/server-key.pem

https://docs.pingcap.com/zh/tidb/stable/check-before-deployment#手动配置-ssh-互信及-sudo-免密码

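Take a look at these two items.

Doesn't this already tell you the cause...

Please post your PD configuration file.

Your log shows an authentication failure.

I used edit-config through tiup. This is the PD configuration: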

pd:
  auto-compaction-retention: 5m
  log.file.max-days: 7
  quota-backend-bytes: 17179869184
  schedule.leader-schedule-limit: 4
  schedule.region-schedule-limit: 2048
  schedule.replica-schedule-limit: 64
  security.cacert-path: /vdb/tidb-certs/idtrca.cer
  security.cert-path: /vdb/tidb-certs/server-cert.pem
  security.key-path: /vdb/tidb-certs/server-key.pem

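Is there a way to see the specific reason the authentication failed?
The URLs PD uses are still http, so it looks like TLS was never actually enabled, yet "authentication handshake failed" is a TLS message. How can I track down the specific cause of this kind of authentication failure?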
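One way to narrow this down is to check whether the PD client port answers a TLS handshake at all. The following is a minimal sketch using only the Python standard library. The function name speaks_tls is hypothetical, and certificate verification is deliberately turned off because the only question here is whether the port speaks TLS, not whether the certificate is trusted. A port that still serves plain http would be expected to fail the handshake, which would be consistent with the EOF errors in the PD log above.

```python
import socket
import ssl
from typing import Tuple


def speaks_tls(host: str, port: int, timeout: float = 3.0) -> Tuple[bool, str]:
    """Attempt a TLS handshake and return (ok, detail).

    ok is True when the handshake completes; detail is then the negotiated
    protocol version. Otherwise detail describes the error that was raised.
    """
    ctx = ssl.create_default_context()
    # Assumption: we only want to know whether TLS is spoken at all,
    # so hostname and certificate checks are disabled for this probe.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with ctx.wrap_socket(raw, server_hostname=host) as tls:
                return True, tls.version() or "unknown"
    except OSError as exc:  # ssl.SSLError is a subclass of OSError
        return False, f"{type(exc).__name__}: {exc}"


# Example call against the endpoint from the logs (not executed here):
# print(speaks_tls("10.250.87.38", 2378))
```

If the probe fails on a PD port while the other components are already configured with certificates, that suggests a mismatch between plain http and TLS rather than a problem with the certificate contents.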

What if you change http:// to https:// in the configuration of PD, TiKV, TiDB, and the other components?

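Did you modify the TLS configuration of PD?
This is a pd/etcd limitation. Before TLS is configured, the peers in pd/etcd are registered in http format. That peer configuration is already persisted in etcd and cannot be modified once TLS is set, so the etcd cluster cannot be established. You should see PD crashing repeatedly.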
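To confirm that the persisted member URLs are still plain http, the "start to watch pd leader" line in the PD log can be scanned for peer_urls and client_urls. This is a small sketch; the function name plaintext_member_urls and the regular expression are assumptions based on the log line quoted in this thread, not a PD API.

```python
import re
from typing import List

# Matches peer_urls:"scheme://rest" and client_urls:"scheme://rest".
# The optional backslash covers logs where the inner quotes are escaped.
URL_FIELD = re.compile(r'(peer_urls|client_urls):\\?"(\w+)://([^"\\]+)')


def plaintext_member_urls(log_line: str) -> List[str]:
    """Return 'field=url' entries for member URLs that still use http://."""
    found = []
    for field, scheme, rest in URL_FIELD.findall(log_line):
        if scheme == "http":
            found.append(f"{field}={scheme}://{rest}")
    return found


# Fragment of the PD log line from this thread.
line = 'peer_urls:"http://10.250.87.38:2380" client_urls:"http://10.250.87.38:2378"'
print(plaintext_member_urls(line))
```

Any entry returned here means that member was registered before TLS was set, which matches the limitation described above.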

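Recovery steps:

  1. Remove the TLS configuration, and use tiup to force scale-in the PD nodes down to 1
  2. Modify the PD startup parameters in scripts/run.sh to add --force-new-cluster
  3. Remove --force-new-cluster and restart PD
  4. At this point the cluster is back to normal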

How to configure TLS

  1. Use the tiup cluster tls command to configure it

It is simply a problem with the passwordless SSH configuration.
