Ansible rolling upgrade from v3.0.0 to v3.0.7 fails

  • [TiDB version]: Release Version: v3.0.0-rc.1-359-gd977edf8a
  • [Problem description]: The Ansible rolling upgrade from v3.0.0 to v3.0.7 failed with this error:
    [192.168.3.5]: Ansible FAILED! => playbook: rolling_update.yml; TASK: wait until the TiKV port is up; message: {"changed": false, "elapsed": 300, "msg": "the TiKV port 20160 is not up"}

The attachments contain the cluster configuration and logs:
ansible.log (157.7 KB), fail.log (186 bytes), inventory.ini (1.9 KB)

To add: on the TiKV machine that failed to start, the last part of tikv.log reads:

[2019/12/14 00:28:36.615 +00:00] [INFO] [raft.rs:723] ["[region 25041] 25043 became follower at term 6"]
[2019/12/14 00:28:36.615 +00:00] [INFO] [raft.rs:295] ["[region 25041] 25043 newRaft [peers: [25042, 25043, 25044], term: 6, commit: 30, applied: 30, last_index: 30, last_term: 6]"]
[2019/12/14 00:28:36.615 +00:00] [INFO] [store.rs:802] ["start store"] [takes=239.624368ms] [merge_count=0] [applying_count=0] [tombstone_count=2448] [region_count=6016] [store_id=5]
[2019/12/14 00:28:36.617 +00:00] [INFO] [store.rs:854] ["cleans up garbage data"] [takes=1.659998ms] [garbage_range_count=3569] [store_id=5]
[2019/12/14 00:28:36.659 +00:00] [INFO] [mod.rs:334] ["starting working thread"] [worker=split-check]
[2019/12/14 00:28:36.660 +00:00] [INFO] [mod.rs:334] ["starting working thread"] [worker=snapshot-worker]
[2019/12/14 00:28:36.660 +00:00] [INFO] [mod.rs:334] ["starting working thread"] [worker=raft-gc-worker]
[2019/12/14 00:28:36.660 +00:00] [INFO] [mod.rs:334] ["starting working thread"] [worker=compact-worker]
[2019/12/14 00:28:36.660 +00:00] [INFO] [future.rs:131] ["starting working thread"] [worker=pd-worker]
[2019/12/14 00:28:36.660 +00:00] [INFO] [mod.rs:334] ["starting working thread"] [worker=consistency-check]
[2019/12/14 00:28:36.660 +00:00] [INFO] [mod.rs:334] ["starting working thread"] [worker=cleanup-sst]
[2019/12/14 00:28:36.660 +00:00] [WARN] [store.rs:1119] ["set thread priority for raftstore failed"] [error="Os { code: 13, kind: PermissionDenied, message: "Permission denied" }"]
[2019/12/14 00:28:36.660 +00:00] [INFO] [node.rs:161] ["put store to PD"] [store="id: 5 address: "192.168.3.5:20160" version: "3.0.7""]
[2019/12/14 00:28:36.661 +00:00] [ERROR] [util.rs:327] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: Unknown, details: Some("version should compatible with version 4.0.0-alpha, got 3.0.7") }))"]
[2019/12/14 00:28:36.662 +00:00] [ERROR] [util.rs:327] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: Unknown, details: Some("version should compatible with version 4.0.0-alpha, got 3.0.7") }))"]
[2019/12/14 00:28:36.662 +00:00] [ERROR] [util.rs:327] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: Unknown, details: Some("version should compatible with version 4.0.0-alpha, got 3.0.7") }))"]
[2019/12/14 00:28:36.662 +00:00] [ERROR] [util.rs:327] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: Unknown, details: Some("version should compatible with version 4.0.0-alpha, got 3.0.7") }))"]
[2019/12/14 00:28:36.663 +00:00] [ERROR] [util.rs:327] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: Unknown, details: Some("version should compatible with version 4.0.0-alpha, got 3.0.7") }))"]
[2019/12/14 00:28:36.663 +00:00] [ERROR] [util.rs:327] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: Unknown, details: Some("version should compatible with version 4.0.0-alpha, got 3.0.7") }))"]
[2019/12/14 00:28:36.663 +00:00] [ERROR] [util.rs:327] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: Unknown, details: Some("version should compatible with version 4.0.0-alpha, got 3.0.7") }))"]
[2019/12/14 00:28:36.664 +00:00] [ERROR] [util.rs:327] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: Unknown, details: Some("version should compatible with version 4.0.0-alpha, got 3.0.7") }))"]
[2019/12/14 00:28:36.664 +00:00] [ERROR] [util.rs:327] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: Unknown, details: Some("version should compatible with version 4.0.0-alpha, got 3.0.7") }))"]
[2019/12/14 00:28:36.664 +00:00] [ERROR] [util.rs:327] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: Unknown, details: Some("version should compatible with version 4.0.0-alpha, got 3.0.7") }))"]
[2019/12/14 00:28:36.664 +00:00] [FATAL] [server.rs:264] ["failed to start node: Other("[src/pd/util.rs:335]: fail to request")"]
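
The repeated "version should compatible with version ..., got ..." errors are the key signal in this log. As a rough sketch, a small shell helper can collapse them into one line per distinct version pair. The function name version_mismatch and the log path in the example are assumptions for illustration, not part of TiKV or tidb-ansible.

```shell
# Summarize PD version-check failures found in a tikv.log read from stdin.
# Prints one "cluster=<version PD expects> store=<version TiKV reported>"
# line per distinct pair.
version_mismatch() {
  sed -n 's/.*version should compatible with version \([^,]*\), got \([0-9A-Za-z.-]*\).*/cluster=\1 store=\2/p' |
    sort -u
}

# Example (the log path is an assumption; adjust it to your deploy_dir):
#   version_mismatch < /data1/deploy/log/tikv.log
```

For the error lines quoted here, this would report a cluster version of 4.0.0-alpha against a store version of 3.0.7, which is the mismatch the rest of the thread investigates.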

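  • Judging from the TiKV log, the version looks abnormal. My initial assessment is that the wrong version was deployed.
    • 1. Run tidb-server, pd-server, and tikv-server under tidb-ansible/resources/bin with -V and post the output, to check whether they are all 3.0.7.
    • 2. Confirm whether the previous version was 3.0.0 or the master (4.0) version. Downgrades are not supported.
      1. Use pd-ctl (see https://pingcap.com/docs-cn/stable/reference/tools/pd-control/) to run config show all and check the output. It contains a version field, which is the cluster version recorded by PD. Check what its value is.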
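
Check 1 above can be sketched as a small script. This is only an illustration: the helper names release_version and check_binaries are hypothetical, and the parsing assumes the two -V output shapes that appear in this thread ("Release Version: ..." and the one-line "TiKV 4.0.0-alpha").

```shell
# Print the release version found in `<binary> -V` output read from stdin.
# Handles "Release Version: v3.0.7" and the one-line "TiKV 4.0.0-alpha" form.
release_version() {
  awk '
    /Release Version:/ { v = $NF; sub(/^v/, "", v); print v; found = 1; exit }
    /^TiKV [0-9]/      { print $2; found = 1; exit }
    END { if (!found) exit 1 }
  '
}

# Compare the three binaries in a directory against an expected version.
# Returns non-zero if any binary is missing or reports a different version.
check_binaries() {
  expected=$1
  dir=$2
  status=0
  for bin in tidb-server pd-server tikv-server; do
    actual=$("$dir/$bin" -V 2>/dev/null | release_version) || actual="unknown"
    if [ "$actual" = "$expected" ]; then
      echo "$bin: $actual"
    else
      echo "$bin: $actual (expected $expected)"
      status=1
    fi
  done
  return "$status"
}

# Example (run on the control machine; the path is relative to tidb-ansible):
#   check_binaries 3.0.7 resources/bin
```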

[tidb@i-txi8hajw bin]$ ./tidb-server -V
Release Version: v3.0.7
Git Commit Hash: 84e4386c7a77d4b8df5db7f2303fb7fd3370eb9a
Git Branch: HEAD
UTC Build Time: 2019-12-04 10:08:24
GoVersion: go version go1.13 linux/amd64
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false

[tidb@i-txi8hajw bin]$ ./pd-server -V
Release Version: v3.0.7
Git Commit Hash: 7a5909ed3bae74d0c6c728ca931f240233aca03a
Git Branch: HEAD
UTC Build Time: 2019-12-04 10:06:16

[tidb@i-txi8hajw bin]$ ./tikv-server -V
TiKV
Release Version: 3.0.7
Git Commit Hash: ac6f02648a8c6ccb7ccafca20287e1b27007e4a0
Git Commit Branch: HEAD
UTC Build Time: 2019-12-04 10:06:20
Rust Version: rustc 1.37.0-nightly (0e4a56b4b 2019-06-13)

Version before the upgrade (output of select tidb_version();):

Release Version: v3.0.0-rc.1-359-gd977edf8a
Git Commit Hash: d977edf8a39ccd1972d1fe1b17855b63b934a09d
Git Branch: master
UTC Build Time: 2019-07-19 11:50:55
GoVersion: go version go1.12 linux/amd64
Race Enabled: false
TiKV Min Version: 2.1.0-alpha.1-ff3dd160846b7d1aed9079c389fc188f7f5ea13e
Check Table Before Drop: false

[tidb@i-txi8hajw bin]$ ./pd-ctl -i -u http://192.168.3.3:2379
» config show all
{
  "client-urls": "http://192.168.3.3:2379",
  "peer-urls": "http://192.168.3.3:2380",
  "advertise-client-urls": "http://192.168.3.3:2379",
  "advertise-peer-urls": "http://192.168.3.3:2380",
  "name": "pd_i-q13x5jo1",
  "data-dir": "/data1/deploy/data.pd",
  "force-new-cluster": false,
  "enable-grpc-gateway": true,
  "initial-cluster": "pd_i-7e6juaxj=http://192.168.3.2:2380,pd_i-q13x5jo1=http://192.168.3.3:2380,pd_i-espt2wt6=http://192.168.3.4:2380",
  "initial-cluster-state": "new",
  "join": "",
  "lease": 3,
  "log": {
    "level": "info",
    "format": "text",
    "disable-timestamp": false,
    "file": {
      "filename": "/data1/deploy/log/pd.log",
      "log-rotate": true,
      "max-size": 300,
      "max-days": 0,
      "max-backups": 0
    },
    "development": false,
    "disable-caller": false,
    "disable-stacktrace": false,
    "disable-error-verbose": true,
    "sampling": null
  },
  "log-file": "",
  "log-level": "",
  "tso-save-interval": "3s",
  "metric": {
    "job": "pd_i-q13x5jo1",
    "address": "",
    "interval": "15s"
  },
  "schedule": {
    "max-snapshot-count": 3,
    "max-pending-peer-count": 16,
    "max-merge-region-size": 20,
    "max-merge-region-keys": 200000,
    "split-merge-interval": "1h0m0s",
    "enable-one-way-merge": "false",
    "patrol-region-interval": "100ms",
    "max-store-down-time": "30m0s",
    "leader-schedule-limit": 4,
    "region-schedule-limit": 4,
    "replica-schedule-limit": 8,
    "merge-schedule-limit": 8,
    "hot-region-schedule-limit": 4,
    "hot-region-cache-hits-threshold": 3,
    "store-balance-rate": 15,
    "tolerant-size-ratio": 5,
    "low-space-ratio": 0.8,
    "high-space-ratio": 0.6,
    "scheduler-max-waiting-operator": 3,
    "disable-raft-learner": "false",
    "disable-remove-down-replica": "false",
    "disable-replace-offline-replica": "false",
    "disable-make-up-replica": "false",
    "disable-remove-extra-replica": "false",
    "disable-location-replacement": "false",
    "disable-namespace-relocation": "false",
    "schedulers-v2": [
      {
        "type": "balance-region",
        "args": null,
        "disable": false
      },
      {
        "type": "balance-leader",
        "args": null,
        "disable": false
      },
      {
        "type": "hot-region",
        "args": null,
        "disable": false
      },
      {
        "type": "label",
        "args": null,
        "disable": false
      }
    ]
  },
  "replication": {
    "max-replicas": 3,
    "location-labels": "",
    "strictly-match-label": "false"
  },
  "namespace": {},
  "pd-server": {
    "use-region-storage": "false"
  },
  "cluster-version": "4.0.0-alpha",
  "quota-backend-bytes": "0B",
  "auto-compaction-mode": "periodic",
  "auto-compaction-retention-v2": "1h",
  "TickInterval": "500ms",
  "ElectionInterval": "3s",
  "PreVote": true,
  "security": {
    "cacert-path": "",
    "cert-path": "",
    "key-path": ""
  },
  "label-property": {},
  "WarningMsgs": null,
  "namespace-classifier": "table",
  "LeaderPriorityCheckInterval": "1m0s"
}
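
Only one field in this output matters for the diagnosis: cluster-version. A minimal sketch for pulling it out without reading the whole dump follows. The helper name cluster_version is hypothetical, and the example assumes pd-ctl is run in single-command mode (without -i) so that its output can be piped.

```shell
# Print the cluster-version value from `config show all` JSON read from stdin.
# Uses sed rather than a JSON parser so it has no extra dependencies.
cluster_version() {
  sed -n 's/.*"cluster-version"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' |
    head -n 1
}

# Example (assumes single-command mode; PD address taken from this thread):
#   ./pd-ctl -u http://192.168.3.3:2379 config show all | cluster_version
```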

Regarding "cluster-version": "4.0.0-alpha": did you upgrade the binaries by running local_prepare.yml in your existing tidb-ansible directory, or did you download the 3.0.7 tidb-ansible and run local_prepare.yml there?

Hi, both the earlier upgrade to 3.0.0 and this upgrade from 3.0.0 to the latest 3.0.7 were rolling upgrades done with tidb-ansible.

The commands for this upgrade were:

ansible-playbook local_prepare.yml
ansible-playbook rolling_update.yml

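  • "cluster-version": "4.0.0-alpha": the version recorded in PD is 4.0.0-alpha, which is very strange.
  • Which playbook did you use for the upgrade? Was it rolling_update.yml?
  • If you can confirm that the underlying TiKV and PD were previously on 3.0 (the rolling upgrade backs up the old binaries, so you can check them), you can use pd-ctl to change cluster-version to 3.0.7 and then run the rolling upgrade again.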
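
The last suggestion is conditional: it applies only when the backed-up binaries really were 3.0. A hedged sketch of that guard is shown here. The helper is_downgrade is hypothetical, it relies on sort -V (available in GNU coreutils), and its ordering of pre-release suffixes is only approximate, so treat it as a sanity check on the major and minor numbers.

```shell
# Return success if moving from $1 (version currently running) to $2 (target)
# would be a downgrade according to `sort -V` ordering.
is_downgrade() {
  running=$1
  target=$2
  [ "$running" != "$target" ] &&
    [ "$(printf '%s\n%s\n' "$running" "$target" | sort -V | head -n 1)" = "$target" ]
}

# Example (versions and PD address are placeholders; pd-ctl is assumed to be
# run in single-command mode):
#   if is_downgrade "$backup_version" 3.0.7; then
#     echo "backed-up binaries are newer than 3.0.7; do not lower cluster-version"
#   else
#     ./pd-ctl -u http://192.168.3.3:2379 config set cluster-version 3.0.7
#   fi
```

The point of the guard is that lowering cluster-version is only reasonable when the binaries that were actually running are not newer than the target, since the reply above states that downgrades are not supported.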

Please follow the official documentation strictly for the whole upgrade process, and complete the pre-upgrade checks.
https://pingcap.com/docs-cn/stable/how-to/upgrade/from-previous-version/

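One thing worth mentioning: before this upgrade to 3.0.7, when I upgraded to 3.0.0, my inventory.ini had tidb_version = latest. Could that earlier upgrade be what caused "cluster-version": "4.0.0-alpha"?

This upgrade to 3.0.7 followed the documentation and used the rolling_update.yml playbook. After updating the machine IPs in inventory.ini, I ran only three commands in total: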

git clone -b v3.0.7 https://github.com/pingcap/tidb-ansible.git
ansible-playbook local_prepare.yml
ansible-playbook rolling_update.yml

For this upgrade to v3.0.7, inventory.ini has tidb_version = v3.0.7, which is the default. I did not change it.
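
Since the question turns on whether tidb_version was pinned or set to latest, a quick pre-flight check of inventory.ini before running local_prepare.yml may help. This is only a sketch: the helper names tidb_version_of and is_pinned are hypothetical, and treating any value that does not look like vX.Y.Z as unpinned is an assumption of this example, not a tidb-ansible rule.

```shell
# Print the tidb_version value from an inventory file (first match only).
tidb_version_of() {
  sed -n 's/^[[:space:]]*tidb_version[[:space:]]*=[[:space:]]*\([^[:space:]]*\).*/\1/p' "$1" |
    head -n 1
}

# Succeed only if tidb_version looks like a pinned release tag such as v3.0.7.
# Values like "latest", or a missing setting, are reported as not pinned.
is_pinned() {
  case "$(tidb_version_of "$1")" in
    v[0-9]*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example:
#   is_pinned inventory.ini || echo "tidb_version is not pinned to a release"
```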

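First confirm the exact versions of the PD and TiKV processes, and check the cluster's historical version information, which is in the deploy_dir/backup directory.

The backed-up TiKV binary on the failed machine (192.168.3.5):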

[tidb@i-akuf62dk backup]$ ./tikv-server.32659.2019-12-13@23:44:57~ -V
TiKV 4.0.0-alpha

On the failed machine (192.168.3.5), the version of the package pushed from the control machine:

[tidb@i-akuf62dk bin]$ ./tikv-server -V
TiKV 
Release Version:   3.0.7
Git Commit Hash:   ac6f02648a8c6ccb7ccafca20287e1b27007e4a0
Git Commit Branch: HEAD
UTC Build Time:    2019-12-04 10:06:20
Rust Version:      rustc 1.37.0-nightly (0e4a56b4b 2019-06-13)

The current version on the remaining two TiKV machines, which have not been upgraded:

[tidb@i-8zua5tea bin]$ ./tikv-server -V
TiKV 4.0.0-alpha

There are three PD machines. On one of them, the backed-up version is:

[tidb@i-7e6juaxj backup]$ ./pd-server.15390.2019-12-13@23:44:05~ -V
Release Version: v4.0.0-alpha-9-gd6b53789
Git Commit Hash: d6b53789f9b54494ec4f9a0011114ecac4c20cfa
Git Branch: master
UTC Build Time:  2019-07-15 09:03:22

The latest version pushed by this upgrade:

[tidb@i-7e6juaxj bin]$ ./pd-server -V
Release Version: v3.0.7
Git Commit Hash: 7a5909ed3bae74d0c6c728ca931f240233aca03a
Git Branch: HEAD
UTC Build Time:  2019-12-04 10:06:16

Version information for tidb, pd, and tikv under tidb-ansible/resources/bin from the previous upgrade, which I backed up before this upgrade to v3.0.7:

[tidb@i-txi8hajw bin]$ ./tikv-server -V
TiKV 4.0.0-alpha
[tidb@i-txi8hajw bin]$ ./tidb-server -V
Release Version: v3.0.0-rc.1-359-gd977edf8a
Git Commit Hash: d977edf8a39ccd1972d1fe1b17855b63b934a09d
Git Branch: master
UTC Build Time: 2019-07-19 11:50:55
GoVersion: go version go1.12 linux/amd64
Race Enabled: false
TiKV Min Version: 2.1.0-alpha.1-ff3dd160846b7d1aed9079c389fc188f7f5ea13e
Check Table Before Drop: false
[tidb@i-txi8hajw bin]$ ./pd-server -V
Release Version: v4.0.0-alpha-9-gd6b53789
Git Commit Hash: d6b53789f9b54494ec4f9a0011114ecac4c20cfa
Git Branch: master
UTC Build Time:  2019-07-15 09:03:22

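From this, was it caused by tidb_version = latest in inventory.ini during the upgrade to 3.0.0, which left PD, TiKV, and TiDB on inconsistent versions?

What should I do now? Going from v4.0.0-alpha to v3.0.7 counts as a downgrade.

Also, how did the PD and TiKV packages downloaded during the earlier upgrade to v3.0.0 end up being 4.0?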

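You did an upgrade before and ended up on 4.0, right?

The historical binaries look like 4.0 packages.

Please confirm whether you performed an upgrade before. It currently looks like the cluster was upgraded to 4.0 earlier. I recommend backing up the data and redeploying 3.0.7.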

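Hi, the earlier upgrade was to 3.0.0. I did not expect the PD and TiKV packages downloaded by tidb-ansible to be 4.0 while the downloaded TiDB package was 3.0.0. I think this may be a bug in the upgrade: the downloaded versions are inconsistent.

Another awkward problem is that I cannot back up the data successfully either. The TiDB node has 8 GB of memory, and possibly because of that, a full backup with mydumper keeps failing with Lost connection to MySQL server during query. In this situation, can I scale out by adding a TiDB node with a larger configuration? I am really afraid the production environment will not come back up.

To add: attached is the tidb-ansible log from the upgrade to 3.0.0 in September. Could you help check why the downloaded tidb-server was 3.0.0 while the downloaded tikv-server and pd-server were 4.0? ansible.log (314.3 KB)

I ran the rolling upgrade again with the tidb-ansible from the previous upgrade to 3.0.0, and that restored the service successfully. However, the data backup still fails, with two kinds of errors: data: Lost connection to MySQL server during query and data: MySQL server has gone away. The backup command is: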

./bin/mydumper -h 192.168.3.9 -P 4000 -u root -t 32 -F 64 -l 7200 -B db_name --skip-tz-utc -o ./var/prod

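It specifies -l 7200. My TiDB machine has 8 cores and 8 GB of memory. In this situation, do I need to scale out with a higher-spec TiDB node?

After re-running the rolling upgrade with the old tidb-ansible, the output of ./pd-ctl -i -u http://192.168.3.3:2379 » config show all is: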

{
  "client-urls": "http://192.168.3.2:2379",
  "peer-urls": "http://192.168.3.2:2380",
  "advertise-client-urls": "http://192.168.3.2:2379",
  "advertise-peer-urls": "http://192.168.3.2:2380",
  "name": "pd_i-7e6juaxj",
  "data-dir": "/data1/deploy/data.pd",
  "force-new-cluster": false,
  "enable-grpc-gateway": true,
  "initial-cluster": "pd_i-7e6juaxj=http://192.168.3.2:2380,pd_i-q13x5jo1=http://192.168.3.3:2380,pd_i-espt2wt6=http://192.168.3.4:2380",
  "initial-cluster-state": "new",
  "join": "",
  "lease": 3,
  "log": {
    "level": "info",
    "format": "text",
    "disable-timestamp": false,
    "file": {
      "filename": "/data1/deploy/log/pd.log",
      "log-rotate": true,
      "max-size": 300,
      "max-days": 0,
      "max-backups": 0
    },
    "development": false,
    "disable-caller": false,
    "disable-stacktrace": false,
    "disable-error-verbose": true,
    "sampling": null
  },
  "log-file": "",
  "log-level": "",
  "tso-save-interval": "3s",
  "metric": {
    "job": "pd_i-7e6juaxj",
    "address": "",
    "interval": "15s"
  },
  "schedule": {
    "max-snapshot-count": 3,
    "max-pending-peer-count": 16,
    "max-merge-region-size": 20,
    "max-merge-region-keys": 200000,
    "split-merge-interval": "1h0m0s",
    "enable-one-way-merge": "false",
    "patrol-region-interval": "100ms",
    "max-store-down-time": "30m0s",
    "leader-schedule-limit": 4,
    "region-schedule-limit": 4,
    "replica-schedule-limit": 8,
    "merge-schedule-limit": 8,
    "hot-region-schedule-limit": 4,
    "hot-region-cache-hits-threshold": 3,
    "store-balance-rate": 15,
    "tolerant-size-ratio": 5,
    "low-space-ratio": 0.8,
    "high-space-ratio": 0.6,
    "scheduler-max-waiting-operator": 3,
    "disable-raft-learner": "false",
    "disable-remove-down-replica": "false",
    "disable-replace-offline-replica": "false",
    "disable-make-up-replica": "false",
    "disable-remove-extra-replica": "false",
    "disable-location-replacement": "false",
    "disable-namespace-relocation": "false",
    "schedulers-v2": [
      {
        "type": "balance-region",
        "args": null,
        "disable": false
      },
      {
        "type": "balance-leader",
        "args": null,
        "disable": false
      },
      {
        "type": "hot-region",
        "args": null,
        "disable": false
      },
      {
        "type": "label",
        "args": null,
        "disable": false
      }
    ]
  },
  "replication": {
    "max-replicas": 3,
    "location-labels": "",
    "strictly-match-label": "false"
  },
  "namespace": {},
  "pd-server": {
    "use-region-storage": "false"
  },
  "cluster-version": "4.0.0-alpha",
  "quota-backend-bytes": "0 B",
  "auto-compaction-mode": "periodic",
  "auto-compaction-retention-v2": "1h",
  "TickInterval": "500ms",
  "ElectionInterval": "3s",
  "PreVote": true,
  "security": {
    "cacert-path": "",
    "cert-path": "",
    "key-path": ""
  },
  "label-property": {},
  "WarningMsgs": null,
  "namespace-classifier": "table",
  "LeaderPriorityCheckInterval": "1m0s"
}