Upgrade to v4.0.9 - TiDB node fails to start [10080: use of closed network connection] [mux: listener closed] [http: Server closed]

Hi consultants,

Version: v4.0.8
Node roles:
172.31.13.101 TiDB/PD
172.31.13.102 TiKV
172.31.13.103 TiKV
172.31.13.104 TiKV
172.31.13.105 Grafana, Prometheus, Alertmanager

While upgrading to v4.0.9, the upgrade failed because the following parameters were not recognized:
config file conf/tidb.toml contained unknown configuration options: performance.memory-usage-alarm-ratio, performance.server-memory-quota

So we ran tiup cluster edit-config tidbcluster
to remove those two parameters first, and then restarted the tidb nodes:
tiup cluster restart tidbcluster -R tidb

The TiDB role then failed to start with the following errors.
Error messages in /data/tidb-deploy/tidb-4000/log/tidb.log:

[2020/12/24 15:23:45.286 +08:00] [ERROR] [http_status.go:354] ["start status/rpc server error"] [error="accept tcp [::]:10080: use of closed network connection"]
[2020/12/24 15:23:45.286 +08:00] [ERROR] [http_status.go:344] ["grpc server error"] [error="mux: listener closed"]
[2020/12/24 15:23:45.286 +08:00] [ERROR] [http_status.go:349] ["http server error"] [error="http: Server closed"]

Error messages in /data/tidb-deploy/tidb-4000/log/tidb_stderr.log:

Contents of tidb.toml:

[log]
slow-query-file = "tidb-slow-overwrited.log"
slow-threshold = 300
[log.file]
max-days = 7

[tikv-client]
[tikv-client.copr-cache]
admission-max-result-mb = 10
admission-min-process-ms = 5
capacity-mb = 1000

Hi consultants,

Later we removed the TiDB node and tried to add a TiDB node back via scale-out,
but got the following error:
Error: executor.ssh.execute_failed: Failed to execute command over SSH for 'tidb@172.31.13.101:22' {ssh_stderr: , ssh_stdout: [2020/12/24 17:34:25.670 +08:00] [FATAL] [terror.go:257] ["unexpected error"] [error="toml: cannot load TOML value of type int64 into a Go float"] [stack="github.com/pingcap/parser/terror.MustNil\n\t/home/jenkins/agent/workspace/tidb_v4.0.8/go/pkg/mod/github.com/pingcap/parser@v0.0.0-20201022083903-fbe80b0c40bb/terror/terror.go:257\ngithub.com/pingcap/tidb/config.InitializeConfig\n\t/home/jenkins/agent/workspace/tidb_v4.0.8/go/src/github.com/pingcap/tidb/config/config.go:759\nmain.main\n\t/home/jenkins/agent/workspace/tidb_v4.0.8/go/src/github.com/pingcap/tidb/tidb-server/main.go:165\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203"] [stack="github.com/pingcap/parser/terror.MustNil\n\t/home/jenkins/agent/workspace/tidb_v4.0.8/go/pkg/mod/github.com/pingcap/parser@v0.0.0-20201022083903-fbe80b0c40bb/terror/terror.go:257\ngithub.com/pingcap/tidb/config.InitializeConfig\n\t/home/jenkins/agent/workspace/tidb_v4.0.8/go/src/github.com/pingcap/tidb/config/config.go:759\nmain.main\n\t/home/jenkins/agent/workspace/tidb_v4.0.8/go/src/github.com/pingcap/tidb/tidb-server/main.go:165\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203"]
, ssh_command: export LANG=C; PATH=$PATH:/usr/bin:/usr/sbin /data/tidb-deploy/tidb-4000/bin/tidb-server --config-check --config=/data/tidb-deploy/tidb-4000/conf/tidb.toml }, cause: Process exited with status 1: check config failed

Contents of tidb-scale-out.yaml:


tidb_servers:
- host: 172.31.13.101
  ssh_port: 22
  port: 4000
  status_port: 10080
  deploy_dir: "/data/tidb-deploy/tidb-4000"
  log_dir: "/data/tidb-deploy/tidb-4000/log"
  #numa_node: "0,1"
  #The following configs are used to overwrite the `server_configs.tidb` values.
  config:
    log.slow-threshold: 300
    log.slow-query-file: "tidb-slow-overwrited.log"

From the error message, it looks like the tidb.toml on that server
failed the config check.
You can log in to that tidb server and inspect tidb.toml.
Also, the error message shows TiDB v4.0.8. Is this the same issue as the v4.0.9 upgrade?
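
For reference, the same check that TiUP runs can be reproduced manually on the TiDB host; this is the command visible in the ssh_command part of the scale-out error above:

/data/tidb-deploy/tidb-4000/bin/tidb-server --config-check --config=/data/tidb-deploy/tidb-4000/conf/tidb.toml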

Hi 北京大爺,

[Since this environment is migrating real production data, could you please help us check whether it can be repaired? Thank you!]
Node roles:
172.31.13.101 TiDB/PD
172.31.13.102 TiKV
172.31.13.103 TiKV
172.31.13.104 TiKV
172.31.13.105 TiFlash, Grafana, Prometheus, Alertmanager
172.31.13.106 TiSpark

Our deployment architecture is shown in the diagram below:


The TiDB version originally in use was v4.0.8.

Earlier, because of OOM caused by queries, we had tried two ways of adding the following parameters:
performance.memory-usage-alarm-ratio: 0.8
performance.server-memory-quota: 34359738368

  1. tiup cluster edit-config tidbcluster

  2. Editing tidb.toml directly

We then ran tiup cluster reload tidbcluster -R tidb to load the settings.
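
For reference, a sketch of how those two settings would sit in each place (the exact surrounding content of the edit-config output may differ):

Via tiup cluster edit-config tidbcluster:

server_configs:
  tidb:
    performance.memory-usage-alarm-ratio: 0.8
    performance.server-memory-quota: 34359738368

Via tidb.toml:

[performance]
memory-usage-alarm-ratio = 0.8
server-memory-quota = 34359738368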

Yesterday, after seeing that v4.0.9 had been released, we started the upgrade:
tiup cluster upgrade tidbcluster v4.0.9

During execution, the following error was reported:
config file conf/tidb.toml contained unknown configuration options: performance.memory-usage-alarm-ratio, performance.server-memory-quota

So we removed these two parameters via edit-config and restarted the tidb role nodes,
after which TiDB could no longer start. The log messages are as follows.
Error messages in /data/tidb-deploy/tidb-4000/log/tidb.log:

[2020/12/24 15:23:45.286 +08:00] [ERROR] [http_status.go:354] ["start status/rpc server error"] [error="accept tcp [::]:10080: use of closed network connection"]
[2020/12/24 15:23:45.286 +08:00] [ERROR] [http_status.go:344] ["grpc server error"] [error="mux: listener closed"]
[2020/12/24 15:23:45.286 +08:00] [ERROR] [http_status.go:349] ["http server error"] [error="http: Server closed"]

In the end we could not find another way out, so we removed the tidb role node, hoping to add a clean TiDB node back into the cluster via scale-out, but then hit the following error:
Error: executor.ssh.execute_failed: Failed to execute command over SSH for 'tidb@172.31.13.101:22' {ssh_stderr: , ssh_stdout: [2020/12/24 17:34:25.670 +08:00] [FATAL] [terror.go:257] ["unexpected error"] [error="toml: cannot load TOML value of type int64 into a Go float"] [stack="github.com/pingcap/parser/terror.MustNil
\t/home/jenkins/agent/workspace/tidb_v4.0.8/go/pkg/mod/github.com/pingcap/parser@v0.0.0-20201022083903-fbe80b0c40bb/terror/terror.go:257
github.com/pingcap/tidb/config.InitializeConfig
\t/home/jenkins/agent/workspace/tidb_v4.0.8/go/src/github.com/pingcap/tidb/config/config.go:759
main.main
\t/home/jenkins/agent/workspace/tidb_v4.0.8/go/src/github.com/pingcap/tidb/tidb-server/main.go:165
runtime.main
\t/usr/local/go/src/runtime/proc.go:203
"] [stack="github.com/pingcap/parser/terror.MustNil
\t/home/jenkins/agent/workspace/tidb_v4.0.8/go/pkg/mod/github.com/pingcap/parser@v0.0.0-20201022083903-fbe80b0c40bb/terror/terror.go:257
github.com/pingcap/tidb/config.InitializeConfig
\t/home/jenkins/agent/workspace/tidb_v4.0.8/go/src/github.com/pingcap/tidb/config/config.go:759
main.main
\t/home/jenkins/agent/workspace/tidb_v4.0.8/go/src/github.com/pingcap/tidb/tidb-server/main.go:165
runtime.main
\t/usr/local/go/src/runtime/proc.go:203
"]
, ssh_command: export LANG=C; PATH=$PATH:/usr/bin:/usr/sbin /data/tidb-deploy/tidb-4000/bin/tidb-server --config-check --config=/data/tidb-deploy/tidb-4000/conf/tidb.toml }, cause: Process exited with status 1: check config failed

Contents of tidb-scale-out.yaml:

tidb_servers:
- host: 172.31.13.101
  ssh_port: 22
  port: 4000
  status_port: 10080
  deploy_dir: "/data/tidb-deploy/tidb-4000"
  log_dir: "/data/tidb-deploy/tidb-4000/log"
  #numa_node: "0,1"
  #The following configs are used to overwrite the `server_configs.tidb` values.
  config:
    log.slow-threshold: 300
    log.slow-query-file: "tidb-slow-overwrited.log"

Contents of tidb.toml:
We really cannot tell which parameter has the type-conversion problem.

[log]
slow-query-file = "tidb-slow-overwrited.log"
slow-threshold = 300
[log.file]
max-days = 7

[tikv-client]
[tikv-client.copr-cache]
admission-max-result-mb = 10
admission-min-process-ms = 5
capacity-mb = 1000

@北京大爷
Hi consultants,

We have since found the problem: these two parameters need to be floats, but the tidb.toml generated from the TiUP scale-out xxx.yaml writes their
default values as int, which causes [error="toml: cannot load TOML value of type int64 into a Go float"].

Could you please check whether this can be fixed on your side?

Parameter descriptions from the official documentation:

[log]
slow-query-file = "tidb-slow-overwrited.log"
slow-threshold = 300
[log.file]
max-days = 7

[tikv-client]
[tikv-client.copr-cache]
admission-max-result-mb = 10  => 10.0
admission-min-process-ms = 5
capacity-mb = 1000 => 1000.0
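
In other words, the corrected fragment of tidb.toml should look roughly like this (same values, just written as floats):

[tikv-client.copr-cache]
admission-max-result-mb = 10.0
admission-min-process-ms = 5
capacity-mb = 1000.0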

Solution:
In tidb-scale-out.yaml, specify these two parameter values as floats.
The ports also had to be changed to 4001 & 10081 first,
otherwise the following error is thrown:


tidb_servers:
- host: 172.31.13.101
  ssh_port: 22
  port: 4001
  status_port: 10081
  deploy_dir: "/data/tidb-deploy/tidb-4001"
  log_dir: "/data/tidb-deploy/tidb-4001/log"
  #numa_node: "0,1"
  #The following configs are used to overwrite the `server_configs.tidb` values.
  config:
    log.slow-threshold: 300
    log.slow-query-file: "tidb-slow-overwrited.log"
    tikv-client.copr-cache.admission-max-result-mb: 10.0
    tikv-client.copr-cache.capacity-mb: 1000.0

tiup cluster scale-out tidbcluster tidb-scale-out.yaml
This successfully scaled out the TiDB node (ports 4001 and 10081).

To keep the original port 4000, we had to write another scale-out.yaml using port 4000 and remove the TiDB node on port 4001.
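
A sketch of that second pass, reusing the original port-4000 paths from this topology and then removing the temporary 4001 node; the file name is just an example, and this is one possible sequence:

# tidb-scale-out-4000.yaml
tidb_servers:
- host: 172.31.13.101
  ssh_port: 22
  port: 4000
  status_port: 10080
  deploy_dir: "/data/tidb-deploy/tidb-4000"
  log_dir: "/data/tidb-deploy/tidb-4000/log"
  config:
    log.slow-threshold: 300
    log.slow-query-file: "tidb-slow-overwrited.log"
    tikv-client.copr-cache.admission-max-result-mb: 10.0
    tikv-client.copr-cache.capacity-mb: 1000.0

tiup cluster scale-out tidbcluster tidb-scale-out-4000.yaml
tiup cluster scale-in tidbcluster --node 172.31.13.101:4001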

Finally, we ran the upgrade command again:
tiup cluster upgrade tidbcluster v4.0.9
Screenshot of the successful run:

Regarding port-conflict detection:
TiUP checks the IP and port against the metadata it stores. If a deployment with the same IP and port already exists in the metadata, subsequent scale-out operations are blocked.
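
A quick way to see which IP:port combinations are already registered in the metadata before scaling out is the standard display command:

tiup cluster display tidbcluster
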
As for the float issue, I will double-check it on my side.
Thanks for your feedback.


Could you also tell us which TiUP version you are using?


Our TiUP version is v1.3.0.

The documentation issue has been reported; please follow it here: https://github.com/pingcap/docs-cn/pull/5182

