Region unavailable after rebalancing onto a newly added TiKV node and changing the replica count

To help us respond efficiently, please provide the following information when asking; clearly described problems get priority.

  • [TiDB version]: v3.1 beta
  • [Problem description]:
  1. A power failure earlier took out one TiKV, and that TiKV's regions have already been migrated away. See https://asktug.com/t/topic/35012 for details.

  2. The broken TiKV's node was then wiped and re-added to the cluster as a new TiKV node, but its region balancing is very slow.
    The corresponding configuration:

{
  "replication": {
    "location-labels": "",
    "max-replicas": 2,
    "strictly-match-label": "false"
  },
  "schedule": {
    "disable-location-replacement": "false",
    "disable-make-up-replica": "false",
    "disable-namespace-relocation": "false",
    "disable-raft-learner": "false",
    "disable-remove-down-replica": "false",
    "disable-remove-extra-replica": "false",
    "disable-replace-offline-replica": "false",
    "enable-one-way-merge": "false",
    "high-space-ratio": 0.6,
    "hot-region-cache-hits-threshold": 3,
    "hot-region-schedule-limit": 4,
    "leader-schedule-limit": 16,
    "low-space-ratio": 0.8,
    "max-merge-region-keys": 200000,
    "max-merge-region-size": 20,
    "max-pending-peer-count": 32,
    "max-snapshot-count": 32,
    "max-store-down-time": "30m0s",
    "merge-schedule-limit": 16,
    "patrol-region-interval": "100ms",
    "region-schedule-limit": 32,
    "replica-schedule-limit": 8,
    "scheduler-max-waiting-operator": 3,
    "schedulers-v2": [
      {
        "args": null,
        "disable": false,
        "type": "balance-region"
      },
      {
        "args": null,
        "disable": false,
        "type": "balance-leader"
      },
      {
        "args": null,
        "disable": false,
        "type": "hot-region"
      },
      {
        "args": null,
        "disable": false,
        "type": "label"
      },
      {
        "args": [
          "1"
        ],
        "disable": false,
        "type": "evict-leader"
      },
      {
        "args": [
          "10"
        ],
        "disable": false,
        "type": "evict-leader"
      },
      {
        "args": [
          "127001"
        ],
        "disable": false,
        "type": "evict-leader"
      },
      {
        "args": [
          "151120"
        ],
        "disable": false,
        "type": "evict-leader"
      },
      {
        "args": [
          "4"
        ],
        "disable": false,
        "type": "evict-leader"
      }
    ],
    "split-merge-interval": "1h0m0s",
    "store-balance-rate": 15,
    "tolerant-size-ratio": 5
  }
}
  3. On the staff's advice, I started raising the replica count from 2 to 3, but found that the current disk space could not accommodate it and the change was taking too long, so I set the replica count back to 2 via pd-ctl.

  4. Balancing has now been running for more than two days, and each TiKV's store size has basically stabilized. Yet in actual use, region is unavailable errors still occur frequently.
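A minimal sketch of the replica-count changes described in step 3, as they would be issued through pd-ctl (the PD endpoint is taken from later in this thread):

./pd-ctl -u http://10.12.5.114:2379 -d config set max-replicas 3
./pd-ctl -u http://10.12.5.114:2379 -d config set max-replicas 2

With max-replicas back at 2 and disable-remove-extra-replica left at false (as in the config above), PD should gradually remove any third replicas that were already created.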

The newly added TiKV node is 10.12.5.233; its monitoring information is attached:
screencapture-10-12-5-232-3000-d-eDbRZpnWk-test-cluster-overview-2020-06-07-12_47_32.pdf (592.3 KB)

Output of info_garhering.py:
script_info.txt (5.4 KB)

Now, when restarting TiDB, the tidb-server component hits the following problem:

[2020/06/07 13:20:36.542 +08:00] [INFO] [ddl_worker.go:114] ["[ddl] DDL worker closed"] [worker="worker 6, tp add index"] ["take time"=7.804µs]
[2020/06/07 13:20:36.543 +08:00] [INFO] [ddl_worker.go:114] ["[ddl] DDL worker closed"] [worker="worker 5, tp general"] ["take time"=5.455µs]
[2020/06/07 13:20:36.543 +08:00] [INFO] [manager.go:292] ["revoke session"] ["owner info"="[ddl] /tidb/ddl/fg/owner ownerManager 5afb9db1-c586-40e2-be07-8ac3ad4a0866"] []
[2020/06/07 13:20:36.543 +08:00] [INFO] [session_pool.go:85] ["[ddl] closing sessionPool"]
[2020/06/07 13:20:36.543 +08:00] [INFO] [delete_range.go:123] ["[ddl] closing delRange"]
[2020/06/07 13:20:36.543 +08:00] [INFO] [ddl.go:494] ["[ddl] DDL closed"] [ID=5afb9db1-c586-40e2-be07-8ac3ad4a0866] ["take time"=2.062006ms]
[2020/06/07 13:20:36.543 +08:00] [INFO] [ddl.go:405] ["[ddl] stop DDL"] [ID=5afb9db1-c586-40e2-be07-8ac3ad4a0866]
[2020/06/07 13:20:36.545 +08:00] [INFO] [domain.go:554] ["domain closed"] ["take time"=4.503117ms]
[2020/06/07 13:20:36.545 +08:00] [ERROR] [tidb.go:83] ["[ddl] init domain failed"] [error="[tikv:9005]Region is unavailable"] [errorVerbose="[tikv:9005]Region is unavailable\
github.com/pingcap/errors.AddStack
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/pkg/mod/github.com/pingcap/errors@v0.11.4/errors.go:174
github.com/pingcap/errors.Trace
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/pkg/mod/github.com/pingcap/errors@v0.11.4/juju_adaptor.go:15
github.com/pingcap/tidb/store/tikv.(*RegionRequestSender).onRegionError
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/store/tikv/region_request.go:281
github.com/pingcap/tidb/store/tikv.(*RegionRequestSender).SendReqCtx
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/store/tikv/region_request.go:146
github.com/pingcap/tidb/store/tikv.(*RegionRequestSender).SendReq
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/store/tikv/region_request.go:74
github.com/pingcap/tidb/store/tikv.(*tikvSnapshot).get
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/store/tikv/snapshot.go:324
github.com/pingcap/tidb/store/tikv.(*tikvSnapshot).Get
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/store/tikv/snapshot.go:282
github.com/pingcap/tidb/structure.(*TxStructure).Get
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/structure/string.go:35
github.com/pingcap/tidb/structure.(*TxStructure).GetInt64
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/structure/string.go:44
github.com/pingcap/tidb/meta.(*Meta).GetSchemaVersion
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/meta/meta.go:173
github.com/pingcap/tidb/domain.(*Domain).loadInfoSchema
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/domain/domain.go:88
github.com/pingcap/tidb/domain.(*Domain).Reload
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/domain/domain.go:341
github.com/pingcap/tidb/domain.(*Domain).Init
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/domain/domain.go:642
github.com/pingcap/tidb/session.(*domainMap).Get.func1
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/session/tidb.go:79
github.com/pingcap/tidb/util.RunWithRetry
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/util/misc.go:52
github.com/pingcap/tidb/session.(*domainMap).Get
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/session/tidb.go:71
github.com/pingcap/tidb/session.createSession
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/session/session.go:1590
github.com/pingcap/tidb/session.BootstrapSession
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/session/session.go:1499
main.createStoreAndDomain
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/tidb-server/main.go:210
main.main
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/tidb-server/main.go:172
runtime.main
\t/usr/local/go/src/runtime/proc.go:203
runtime.goexit
\t/usr/local/go/src/runtime/asm_amd64.s:1357"] [stack="github.com/pingcap/tidb/session.(*domainMap).Get.func1
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/session/tidb.go:83
github.com/pingcap/tidb/util.RunWithRetry
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/util/misc.go:52
github.com/pingcap/tidb/session.(*domainMap).Get
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/session/tidb.go:71
github.com/pingcap/tidb/session.createSession
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/session/session.go:1590
github.com/pingcap/tidb/session.BootstrapSession
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/session/session.go:1499
main.createStoreAndDomain
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/tidb-server/main.go:210
main.main
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/tidb-server/main.go:172
runtime.main
\t/usr/local/go/src/runtime/proc.go:203"]

Following the earlier approach, I checked the broken regions on 221; the store_id of 221 is 407938.
I used the following command:

./pd-ctl -u http://10.12.5.114:2379 -d region --jq='.regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(407938) then . else empty end) | length>=$total-length)}'

This returned 9029 regions; see the attached 407938_error.txt (396.0 KB). What should I do?
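For context, that jq filter keeps every region in which at least half of the peers sit on store 407938, i.e. regions whose Raft majority may be lost. Any region id from the output can then be inspected individually; a sketch, using a region id that appears later in this thread:

./pd-ctl -u http://10.12.5.114:2379 -d region 7077359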

  1. Please post the output of the pd-ctl store, member, and health commands (the exact invocations are sketched after the steps below).

  2. Please also upload the detail-tikv and PD monitoring information. You can capture a full-length screenshot as follows, thanks.

(1) Install this extension in Chrome: https://chrome.google.com/webstore/detail/full-page-screen-capture/fdpohaocaechififmbbbbbknoalclacl

(2) With the mouse focused on the Dashboard, press ? to display all shortcuts; press d and then E to expand the panels of all rows, then wait a while for the page to finish loading.

(3) Use the full-page-screen-capture extension to take and save the screenshot.
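For reference, a minimal sketch of the three requested pd-ctl invocations (PD endpoint assumed from earlier in this thread):

./pd-ctl -u http://10.12.5.114:2379 -d store
./pd-ctl -u http://10.12.5.114:2379 -d member
./pd-ctl -u http://10.12.5.114:2379 -d health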

store:

{
  "count": 12,
  "stores": [
    {
      "store": {
        "id": 407938,
        "address": "10.12.5.221:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Disconnected"
      },
      "status": {
        "capacity": "884.9GiB",
        "available": "293.5GiB",
        "leader_count": 735,
        "leader_weight": 1,
        "leader_score": 140684,
        "leader_size": 140684,
        "region_count": 8852,
        "region_weight": 1,
        "region_score": 367318047.60698414,
        "region_size": 1377771,
        "start_ts": "2020-06-07T11:36:05Z",
        "last_heartbeat_ts": "2020-06-07T11:36:15.573863356Z",
        "uptime": "10.573863356s"
      }
    },
    {
      "store": {
        "id": 407940,
        "address": "10.12.5.220:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "865.2GiB",
        "available": "287.2GiB",
        "leader_count": 5825,
        "leader_weight": 1,
        "leader_score": 845873,
        "leader_size": 845873,
        "region_count": 8918,
        "region_weight": 1,
        "region_score": 366119896.3546481,
        "region_size": 1328065,
        "start_ts": "2020-06-07T07:05:48Z",
        "last_heartbeat_ts": "2020-06-07T11:36:32.308909258Z",
        "uptime": "4h30m44.308909258s"
      }
    },
    {
      "store": {
        "id": 2026701,
        "address": "10.12.5.227:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Down"
      },
      "status": {
        "capacity": "884.9GiB",
        "available": "297.5GiB",
        "leader_count": 5318,
        "leader_weight": 1,
        "leader_score": 780601,
        "leader_size": 780601,
        "region_count": 10669,
        "region_weight": 1,
        "region_score": 343343504.05443144,
        "region_size": 1601398,
        "sending_snap_count": 1,
        "start_ts": "2020-06-07T07:05:51Z",
        "last_heartbeat_ts": "2020-06-07T07:20:48.27630685Z",
        "uptime": "14m57.27630685s"
      }
    },
    {
      "store": {
        "id": 6506924,
        "address": "10.12.5.229:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "1007GiB",
        "available": "334.3GiB",
        "leader_count": 5649,
        "leader_weight": 1,
        "leader_score": 846732,
        "leader_size": 846732,
        "region_count": 11552,
        "region_weight": 1,
        "region_score": 365979651.44272137,
        "region_size": 1729517,
        "start_ts": "2020-06-07T07:05:55Z",
        "last_heartbeat_ts": "2020-06-07T11:36:30.543043762Z",
        "uptime": "4h30m35.543043762s"
      }
    },
    {
      "store": {
        "id": 6506925,
        "address": "10.12.5.228:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "1007GiB",
        "available": "334.6GiB",
        "leader_count": 6580,
        "leader_weight": 1,
        "leader_score": 846371,
        "leader_size": 846371,
        "region_count": 13086,
        "region_weight": 1,
        "region_score": 364543475.21180296,
        "region_size": 1739572,
        "start_ts": "2020-06-07T07:05:53Z",
        "last_heartbeat_ts": "2020-06-07T11:36:39.106026849Z",
        "uptime": "4h30m46.106026849s"
      }
    },
    {
      "store": {
        "id": 10968962,
        "address": "10.12.5.233:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "884.9GiB",
        "available": "501.6GiB",
        "leader_count": 6313,
        "leader_weight": 1,
        "leader_score": 846027,
        "leader_size": 846027,
        "region_count": 6408,
        "region_weight": 1,
        "region_score": 872758,
        "region_size": 872758,
        "start_ts": "2020-06-07T07:05:48Z",
        "last_heartbeat_ts": "2020-06-07T11:36:31.222617339Z",
        "uptime": "4h30m43.222617339s"
      }
    },
    {
      "store": {
        "id": 484920,
        "address": "10.12.5.223:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "884.9GiB",
        "available": "293.6GiB",
        "leader_count": 5719,
        "leader_weight": 1,
        "leader_score": 846561,
        "leader_size": 846561,
        "region_count": 11279,
        "region_weight": 1,
        "region_score": 366938610.56206703,
        "region_size": 1607267,
        "start_ts": "2020-06-07T07:05:48Z",
        "last_heartbeat_ts": "2020-06-07T11:36:31.57952138Z",
        "uptime": "4h30m43.57952138s"
      }
    },
    {
      "store": {
        "id": 1597655,
        "address": "10.12.5.226:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Down"
      },
      "status": {
        "capacity": "835.7GiB",
        "available": "279.4GiB",
        "leader_count": 4571,
        "leader_weight": 1,
        "leader_score": 769420,
        "leader_size": 769420,
        "region_count": 7677,
        "region_weight": 1,
        "region_score": 353273588.98531103,
        "region_size": 1266073,
        "sending_snap_count": 1,
        "start_ts": "2020-06-07T07:05:49Z",
        "last_heartbeat_ts": "2020-06-07T07:19:38.72401233Z",
        "uptime": "13m49.72401233s"
      }
    },
    {
      "store": {
        "id": 335855,
        "address": "10.12.5.230:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "884.9GiB",
        "available": "294.7GiB",
        "leader_count": 5973,
        "leader_weight": 1,
        "leader_score": 846715,
        "leader_size": 846715,
        "region_count": 11765,
        "region_weight": 1,
        "region_score": 360616506.3073888,
        "region_size": 1576701,
        "start_ts": "2020-06-07T07:05:47Z",
        "last_heartbeat_ts": "2020-06-07T11:36:38.170372516Z",
        "uptime": "4h30m51.170372516s"
      }
    },
    {
      "store": {
        "id": 640552,
        "address": "10.12.5.224:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "884.9GiB",
        "available": "293.7GiB",
        "leader_count": 5790,
        "leader_weight": 1,
        "leader_score": 846659,
        "leader_size": 846659,
        "region_count": 11609,
        "region_weight": 1,
        "region_score": 366700615.17977715,
        "region_size": 1642474,
        "start_ts": "2020-06-07T07:05:48Z",
        "last_heartbeat_ts": "2020-06-07T11:36:37.366948552Z",
        "uptime": "4h30m49.366948552s"
      }
    },
    {
      "store": {
        "id": 665678,
        "address": "127.0.0.1:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Down"
      },
      "status": {
        "leader_weight": 1,
        "region_weight": 1,
        "start_ts": "1970-01-01T00:00:00Z"
      }
    },
    {
      "store": {
        "id": 6506926,
        "address": "10.12.5.231:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "1007GiB",
        "available": "334.1GiB",
        "leader_count": 6962,
        "leader_weight": 1,
        "leader_score": 846558,
        "leader_size": 846558,
        "region_count": 14537,
        "region_weight": 1,
        "region_score": 366919277.34399176,
        "region_size": 1717122,
        "start_ts": "2020-06-07T07:05:52Z",
        "last_heartbeat_ts": "2020-06-07T11:36:32.596293842Z",
        "uptime": "4h30m40.596293842s"
      }
    }
  ]
}

member:

{
  "header": {
    "cluster_id": 6807312755917103041
  },
  "members": [
    {
      "name": "pd_pd3",
      "member_id": 2579653654541892389,
      "peer_urls": [
        "http://10.12.5.115:2380"
      ],
      "client_urls": [
        "http://10.12.5.115:2379"
      ]
    },
    {
      "name": "pd_pd2",
      "member_id": 3717199249823848643,
      "peer_urls": [
        "http://10.12.5.114:2380"
      ],
      "client_urls": [
        "http://10.12.5.114:2379"
      ]
    },
    {
      "name": "pd_pd1",
      "member_id": 4691481983733508901,
      "peer_urls": [
        "http://10.12.5.113:2380"
      ],
      "client_urls": [
        "http://10.12.5.113:2379"
      ]
    }
  ],
  "leader": {
    "name": "pd_pd2",
    "member_id": 3717199249823848643,
    "peer_urls": [
      "http://10.12.5.114:2380"
    ],
    "client_urls": [
      "http://10.12.5.114:2379"
    ]
  },
  "etcd_leader": {
    "name": "pd_pd2",
    "member_id": 3717199249823848643,
    "peer_urls": [
      "http://10.12.5.114:2380"
    ],
    "client_urls": [
      "http://10.12.5.114:2379"
    ]
  }
}

health:

[
  {
    "name": "pd_pd3",
    "member_id": 2579653654541892389,
    "client_urls": [
      "http://10.12.5.115:2379"
    ],
    "health": true
  },
  {
    "name": "pd_pd2",
    "member_id": 3717199249823848643,
    "client_urls": [
      "http://10.12.5.114:2379"
    ],
    "health": true
  },
  {
    "name": "pd_pd1",
    "member_id": 4691481983733508901,
    "client_urls": [
      "http://10.12.5.113:2379"
    ],
    "health": true
  }
]

[Problem analysis]

  1. Checked the store and config show all output: 11 stores — 8 up, 2 disconnected, 1 down; the replica count is 3; sync-log is false.
  2. One of the disconnected stores keeps restarting, and its log repeatedly reports errors indicating that a region on it is corrupted.
  3. The other two stores' logs show no obvious errors; after restarting those two TiKVs, one came up and one was still disconnected.
  4. After shutting down the TiKV with the obvious errors, the remaining disconnected TiKV also went up.
  5. Set the corrupted region on the disconnected node to tombstone, then start the node.

[Solution]

store_221 is down, store_227 is a healthy node; region id: 7077359
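Before touching the stores, the peer layout of that region can be double-checked from PD (a sketch; endpoint as before):

./pd-ctl -u http://10.12.5.114:2379 -d region 7077359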

  1. Stop the 221 and 227 instances.

  2. On the down store_221, mark the region as tombstone.

Run on 221:
tikv-ctl --db /path/to/tikv/db tombstone -p pdip:pdport -r 7077359

  3. On the healthy store 227, delete the failed peer (store 221's id is 407938, per the earlier query):

tikv-ctl --db /path/to/tikv-data/db unsafe-recover remove-fail-stores -s 407938 -r 7077359

  4. Start 221 and 227.
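Once both instances are back up, re-running the earlier jq filter is a quick way to confirm that no region still has a majority of its peers on the failed store; an empty result means the failed peers are gone (same endpoint and store id as above):

./pd-ctl -u http://10.12.5.114:2379 -d region --jq='.regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(407938) then . else empty end) | length>=$total-length)}'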

Everything now works normally. I will finish raising the replica count and upgrading the system when time permits. Thanks to the staff for the patient help.

:handshake:
