tiup scale-in of a TiKV node: the node stays in Offline state

Version:

v4.0.0-rc.2

Operation performed:

tiup cluster scale-in <cluster-name> --node 10.0.1.4:9000

Problem:

1. The TiKV node being removed stays in Offline state, even though there is no more workload on the cluster.

2. When new workload comes in, the offline TiKV node still receives replicated data, and offline-peer-region-count on the PD monitoring dashboard keeps growing.

Hi,

You can troubleshoot by following the referenced post. Also, 4.0 GA has been released, so upgrading to the GA version is recommended.

Thanks for the reply.

1. It was the TiKV disk space dropping below 20% that prevented the regions from being cleaned up.

2. offline-peer-region did decrease, but in the end 10 offline-peer-regions remain and cannot be cleaned up; extra-peer-region and learner-peer-region are also at 10.

The TiKV node being removed is still in Offline state; offline-peer-region in the PD monitoring is still 10, and extra and learner are the same.

Hi,

So, do the TiKV nodes in the cluster now have more than 20% free disk space?

Check the offline node's removal progress with pd-ctl store. You can use transfer region/leader to speed up the migration.
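For reference, a minimal sketch of those checks, assuming pd-ctl is reached through tiup; <pd-ip> and <store-id> are placeholders, and newer tiup releases may need a version suffix such as tiup ctl:v4.0.0 pd:

tiup ctl pd -u http://<pd-ip>:2379          // open an interactive pd-ctl session
>> store                                    // list all stores and their state_name (Up / Offline / Tombstone)
>> store <store-id>                         // removal progress: leader_count and region_count should keep dropping
>> operator show                            // scheduling operators currently queued or running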

1. Yes, it is above 20%.

2. operator show has no operators migrating regions off the offline node. offline-peer-region-count still shows 9 regions that are not being migrated, and learner-peer-region-count and extra-peer-region-count are also 9.

Thanks for the feedback.

Try these two commands to transfer them quickly:

>> operator add transfer-leader 1 2                     // schedule the leader of Region 1 to store 2
>> operator add transfer-region 1 2 3 4                 // schedule Region 1 to stores 2, 3, and 4
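To pick the right region IDs for those commands, the regions that still have a peer on the offline store can be listed first; a sketch, with <store-id> as a placeholder:

>> region check offline-peer                // regions that still keep a peer on a store being taken offline
>> region store <store-id>                  // alternatively: every region with a peer on the given store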

1. region check offline-peer

» region check offline-peer
{
  "count": 10,
  "regions": [
    {
      "id": 31889,
      "start_key": "7480000000000005FF2F5F728000000006FF7F364E0000000000FA",
      "end_key": "7480000000000005FF2F5F728000000006FF86888C0000000000FA",
      "epoch": {
        "conf_ver": 9,
        "version": 920
      },
      "peers": [
        {
          "id": 31890,
          "store_id": 1
        },
        {
          "id": 31891,
          "store_id": 4
        },
        {
          "id": 123995,
          "store_id": 116845
        },
        {
          "id": 175128,
          "store_id": 5,
          "is_learner": true
        }
      ],
      "leader": {
        "id": 31890,
        "store_id": 1
      },
      "written_bytes": 0,
      "read_bytes": 0,
      "written_keys": 0,
      "read_keys": 0,
      "approximate_size": 94,
      "approximate_keys": 454572
    },
    {
      "id": 288028,
      "start_key": "7480000000000005FF4D5F698000000000FF000006013132332EFF3233352EFF313731FF2E33310000FD03C8FF00000004E5593B00FE",
      "end_key": "7480000000000005FF4D5F698000000000FF000006013132332EFF3233352EFF31392EFF3233330000FD03F8FF00000009D24D8200FE",
      "epoch": {
        "conf_ver": 9,
        "version": 881
      },
      "peers": [
        {
          "id": 288029,
          "store_id": 1
        },
        {
          "id": 288030,
          "store_id": 5
        },
        {
          "id": 288031,
          "store_id": 116845
        },
        {
          "id": 288032,
          "store_id": 4,
          "is_learner": true
        }
      ],
      "leader": {
        "id": 288029,
        "store_id": 1
      },
      "written_bytes": 0,
      "read_bytes": 0,
      "written_keys": 0,
      "read_keys": 0,
      "approximate_size": 69,
      "approximate_keys": 957290
    },
    {
      "id": 125113,
      "start_key": "7480000000000005FF375F698000000000FF0000020155655A46FF4A525273FF755346FF4135493943FF7162FF37636C664163FF42FF52445369484B44FFFF0000000000000000FFF7038000000007B2FF6C5B000000000000F9",
      "end_key": "7480000000000005FF375F698000000000FF0000020155677A52FF75313455FF705969FF6141515570FF4F53FF334B4E627675FF44FF6D766C66363630FFFF0000000000000000FFF703800000003FE1FF18A0000000000000F9",
      "epoch": {
        "conf_ver": 15,
        "version": 821
      },
      "peers": [
        {
          "id": 125114,
          "store_id": 116845
        },
        {
          "id": 125115,
          "store_id": 1
        },
        {
          "id": 125116,
          "store_id": 5
        },
        {
          "id": 125117,
          "store_id": 4,
          "is_learner": true
        }
      ],
      "leader": {
        "id": 125116,
        "store_id": 5
      },
      "written_bytes": 0,
      "read_bytes": 0,
      "written_keys": 0,
      "read_keys": 0,
      "approximate_size": 96,
      "approximate_keys": 900118
    },
    {
      "id": 126977,
      "start_key": "7480000000000005FF4D5F698000000000FF000006013132332EFF3233352EFF31392EFF3233330000FD03F8FF00000009D24D8200FE",
      "end_key": "7480000000000005FF4D5F698000000000FF000006013132332EFF3233352EFF313939FF2E34370000FD03F4FF000000011C577600FE",
      "epoch": {
        "conf_ver": 9,
        "version": 881
      },
      "peers": [
        {
          "id": 126978,
          "store_id": 1
        },
        {
          "id": 126979,
          "store_id": 5
        },
        {
          "id": 126980,
          "store_id": 116845
        },
        {
          "id": 126981,
          "store_id": 4,
          "is_learner": true
        }
      ],
      "leader": {
        "id": 126978,
        "store_id": 1
      },
      "written_bytes": 0,
      "read_bytes": 0,
      "written_keys": 0,
      "read_keys": 0,
      "approximate_size": 36,
      "approximate_keys": 500935
    },
    {
      "id": 52577,
      "start_key": "7480000000000005FF375F698000000000FF000006013132332EFF3233352EFF323132FF2E34390000FD0380FF000000279194A000FE",
      "end_key": "7480000000000005FF375F698000000000FF000006013132332EFF3233352EFF323136FF2E32343100FE0380FF000000009FDA7B00FE",
      "epoch": {
        "conf_ver": 9,
        "version": 820
      },
      "peers": [
        {
          "id": 52579,
          "store_id": 4
        },
        {
          "id": 52580,
          "store_id": 5
        },
        {
          "id": 118131,
          "store_id": 116845
        },
        {
          "id": 142658,
          "store_id": 1,
          "is_learner": true
        }
      ],
      "leader": {
        "id": 52580,
        "store_id": 5
      },
      "written_bytes": 0,
      "read_bytes": 0,
      "written_keys": 0,
      "read_keys": 0,
      "approximate_size": 40,
      "approximate_keys": 564361
    },
    {
      "id": 127722,
      "start_key": "7480000000000005FF4D5F698000000000FF0000010132353037FF31333734FF633336FF3730653765FF3731FF393265613235FF63FF61663263376430FFFF0000000000000000FFF703B0000000124BFF624E000000000000F9",
      "end_key": "7480000000000005FF4D5F698000000000FF0000010132353037FF31333734FF633336FF3730653765FF3731FF393265613235FF63FF61663263376430FFFF0000000000000000FFF703B00000001422FF0163000000000000F9",
      "epoch": {
        "conf_ver": 39,
        "version": 888
      },
      "peers": [
        {
          "id": 127723,
          "store_id": 116845
        },
        {
          "id": 127724,
          "store_id": 1
        },
        {
          "id": 127725,
          "store_id": 4
        },
        {
          "id": 127726,
          "store_id": 5,
          "is_learner": true
        }
      ],
      "leader": {
        "id": 127725,
        "store_id": 4
      },
      "written_bytes": 0,
      "read_bytes": 0,
      "written_keys": 0,
      "read_keys": 0,
      "approximate_size": 96,
      "approximate_keys": 898800
    },
    {
      "id": 131043,
      "start_key": "7480000000000005FF4D5F698000000000FF0000010132353037FF31333734FF633336FF3730653765FF3731FF393265613235FF63FF61663263376430FFFF0000000000000000FFF703D800000016C9FF03C3000000000000F9",
      "end_key": "7480000000000005FF4D5F698000000000FF0000010132353037FF31333734FF633336FF3730653765FF3731FF393265613235FF63FF61663263376430FFFF0000000000000000FFF703D8000000188FFF99E7000000000000F9",
      "epoch": {
        "conf_ver": 90,
        "version": 891
      },
      "peers": [
        {
          "id": 131044,
          "store_id": 116845
        },
        {
          "id": 131045,
          "store_id": 5
        },
        {
          "id": 131046,
          "store_id": 4
        },
        {
          "id": 131047,
          "store_id": 1,
          "is_learner": true
        }
      ],
      "leader": {
        "id": 131046,
        "store_id": 4
      },
      "written_bytes": 0,
      "read_bytes": 0,
      "written_keys": 0,
      "read_keys": 0,
      "approximate_size": 96,
      "approximate_keys": 898800
    },
    {
      "id": 125123,
      "start_key": "7480000000000005FF375F698000000000FF0000060133362E35FF362E3139FF382E31FF3136000000FC0380FF0000000BB1D06D00FE",
      "end_key": "7480000000000005FF375F698000000000FF0000060133362E35FF362E3230FF392E37FF3000000000FB0380FF0000002C9F64AB00FE",
      "epoch": {
        "conf_ver": 12,
        "version": 819
      },
      "peers": [
        {
          "id": 125124,
          "store_id": 1
        },
        {
          "id": 125125,
          "store_id": 116845
        },
        {
          "id": 125126,
          "store_id": 5
        },
        {
          "id": 125127,
          "store_id": 4,
          "is_learner": true
        }
      ],
      "leader": {
        "id": 125124,
        "store_id": 1
      },
      "written_bytes": 0,
      "read_bytes": 0,
      "written_keys": 0,
      "read_keys": 0,
      "approximate_size": 70,
      "approximate_keys": 970181
    },
    {
      "id": 127387,
      "start_key": "7480000000000005FF4D5F72E400000013FF1FD2490000000000FA",
      "end_key": "7480000000000005FF4D5F72E400000013FFED84F80000000000FA",
      "epoch": {
        "conf_ver": 42,
        "version": 900
      },
      "peers": [
        {
          "id": 127388,
          "store_id": 116845
        },
        {
          "id": 127389,
          "store_id": 5
        },
        {
          "id": 127390,
          "store_id": 4
        },
        {
          "id": 142497,
          "store_id": 1,
          "is_learner": true
        }
      ],
      "leader": {
        "id": 127390,
        "store_id": 4
      },
      "written_bytes": 0,
      "read_bytes": 0,
      "written_keys": 0,
      "read_keys": 0,
      "approximate_size": 113,
      "approximate_keys": 459702
    },
    {
      "id": 283457,
      "start_key": "7480000000000005FF375F698000000000FF0000020144496758FF416B3144FF376271FF5650575A31FF7437FF447475717854FF4CFF6F6E704151765AFFFF0000000000000000FFF703800000001706FF53EC000000000000F9",
      "end_key": "7480000000000005FF375F698000000000FF00000201444C3668FF71447466FF477166FF586A6D5663FF7375FF66556F357548FF63FF78584E41496158FFFF0000000000000000FFF703800000005044FF66D7000000000000F9",
      "epoch": {
        "conf_ver": 15,
        "version": 819
      },
      "peers": [
        {
          "id": 283458,
          "store_id": 116845
        },
        {
          "id": 283459,
          "store_id": 5
        },
        {
          "id": 283460,
          "store_id": 4
        },
        {
          "id": 283461,
          "store_id": 1,
          "is_learner": true
        }
      ],
      "leader": {
        "id": 283460,
        "store_id": 4
      },
      "written_bytes": 0,
      "read_bytes": 0,
      "written_keys": 0,
      "read_keys": 0,
      "approximate_size": 99,
      "approximate_keys": 928008
    }
  ]
}

2. store 116845

» store 116845
{
  "store": {
    "id": 116845,
    "address": "10.59.111.10:20160",
    "state": 1,
    "version": "4.0.0-rc.2",
    "status_address": "10.59.111.10:20180",
    "git_hash": "2fdb2804bf8ffaab4b18c4996970e19906296497",
    "start_timestamp": 1591858871,
    "deploy_path": "/data/tidb_deploy/tikv-20160/bin",
    "last_heartbeat": 1591869043196335312,
    "state_name": "Offline"
  },
  "status": {
    "capacity": "200GiB",
    "available": "180.8GiB",
    "used_size": "344.8MiB",
    "leader_count": 0,
    "leader_weight": 1,
    "leader_score": 0,
    "leader_size": 0,
    "region_count": 10,
    "region_weight": 1,
    "region_score": 809,
    "region_size": 809,
    "start_ts": "2020-06-11T15:01:11+08:00",
    "last_heartbeat_ts": "2020-06-11T17:50:43.196335312+08:00",
    "uptime": "2h49m32.196335312s"
  }
}

3. Questions

1. The node I want to take offline is store 116845, but I find that 116845 no longer has any leader regions.

2. Is it because learner regions exist that the replica migration fails, leaving the TiKV node stuck in Offline state?

3. How should this be resolved?

The leaders have already been migrated away; that is expected.

The node is still in Offline state because its regions have not finished migrating.

If the remaining available TiKV nodes have plenty of space and low CPU and memory load, consider increasing replica-schedule-limit and region-schedule-limit to add more scheduling; these parameters control the number of concurrent replica scheduling tasks.
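A sketch of those adjustments in pd-ctl; the values below are only examples, and the store limit line is optional and depends on the pd-ctl version:

>> config show                              // current scheduling limits
>> config set replica-schedule-limit 64     // max concurrent replica scheduling tasks (example value)
>> config set region-schedule-limit 2048    // max concurrent region scheduling tasks (example value)
>> store limit all 30                       // optionally raise the per-store add/remove-peer rate (example value)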

@户口舟亢

I have already tried that; it didn't help. operator show does not show any scheduling information at all.


1. For manual migration, because placement rules are enabled, they have to be disabled first before the command can run; but since TiFlash nodes exist, placement rules cannot be disabled. Is there no way out?

2. operator add transfer-region

» operator add transfer-region 126977 1 4 5
Failed! [500] "transfer region is not supported when placement rules enabled"
»

3. config placement-rules disable

» config placement-rules disable
Failed to set config: [400] "cannot disable placement rules with TiFlash nodes"

»

I will report the operator issue internally.

For the remaining 10 regions, you can adjust the parameters and keep waiting for them to transfer to other nodes,

or

manually shut down the offline TiKV instance. Its state in pd-ctl will change to Disconnected, and if the node stays unreachable for 30 minutes, Raft will automatically replenish the replicas on other nodes; once that completes, the node will become Tombstone.
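A rough sketch of that manual path; the cluster name, addresses, and store ID are placeholders, and the 30-minute window corresponds to PD's max-store-down-time setting (30m by default):

tiup cluster stop <cluster-name> -N <offline-tikv-ip>:20160     // stop only the offline TiKV instance
tiup ctl pd -u http://<pd-ip>:2379                              // then watch it from pd-ctl
>> store <store-id>                         // after max-store-down-time the replicas are re-created elsewhere; wait for state_name = Tombstone
>> store remove-tombstone                   // finally clear the tombstone record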

+1, probably the same problem here. TiDB version v4.0.0.

In my case, though, I destroyed the TiKV node forcibly, and tikv-server on that node never came back up, so the node went to "Down".

What I did was a scale-in followed by a scale-out.

After the scale-in, the TiKV node shows as Offline in pd-ctl. I can see operators trying to move regions, but the same few regions keep coming up over and over, so apparently they cannot be moved.

I will wait half an hour and see what happens…

An update, and it is a bit awkward now.

tiup display shows that the destroyed TiKV node has disappeared, and its deployment directory has also been deleted by tiup.

But in pd-ctl the store ID is still there, the state is still Offline, and region_count has not changed. Meanwhile, operators are still trying to move the corresponding regions.

The data is already gone… this should be a bug, right?

Do I now need to manually delete the corresponding store ID and operators?

@户口舟亢, I will give the second option a try.

Hi,

That is how the normal removal process works. I am not sure what "already tried" refers to; what the thread covers is only adjusting the parameters and continuing to wait, or taking the node down manually. If the manual removal fails, please describe it in detail.

1. One more thing: why does the learner never become a follower, and why isn't it automatically deleted if it cannot become a follower?

As for the first option, it didn't help because the learner regions are still there.

If you mean the peers with is_learner: true, you can think of them as peers that do not take part in leader election or voting. Since they are on the offline node, they will not become followers unless the store's state is set back to Up through the API. Taking a node offline is driven by scheduling, and the ways to do it and to speed it up have been mentioned above.

It is not on the offline node. It was probably created earlier to receive a replica from the offline TiKV, but it has remained in the learner state ever since.

116845 is the node being taken offline, and the peer on store 1 is the one stuck in learner state:

"peers": [
        {
          "id": 127388,
          "store_id": 116845
        },
        {
          "id": 127389,
          "store_id": 5
        },
        {
          "id": 127390,
          "store_id": 4
        },
        {
          "id": 142497,
          "store_id": 1,
          "is_learner": true
        }
      ],
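A possible manual workaround for a stalled learner left over from an interrupted replica move is to remove that learner peer so the replica checker can retry. This is only a sketch, using region 127387 and store 1 from the output above; it is not confirmed here whether this operator is accepted while placement rules are enabled, so try it on a single region first and verify the result:

>> operator add remove-peer 127387 1        // drop the stalled learner peer 142497 on store 1; the checker should then re-add a replica
>> region 127387                            // re-check the region's peers afterwards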