TiDB service fails to start because a single TiKV node has run out of storage space

To help us respond efficiently, please provide the following information when asking a question; clearly described problems are prioritized.

  • [TiDB version]: "3.1.0-beta.1"
  • [Problem description]:
    The error log on one of the TiKV nodes is as follows:

[2020/08/07 06:13:13.939 +00:00] [FATAL] [server.rs:149] ["failed to create raft engine: RocksDb IO error: No space left on deviceWhile appending to file: /home/tidb/deploy/data/raft/623970.sst: No space left on device"]

I manually checked the remaining space on that TiKV node; about 20 GB are still free.

Running ansible start.yml does not bring the TiDB service up.
The corresponding ansible.log is as follows:

2020-08-08 05:58:18,094 p=14357 u=tidb |  [10.12.5.221]: Ansible FAILED! => playbook: start.yml; TASK: wait until the TiKV port is up; message: {"changed": false, "elapsed": 300, "msg": "the TiKV port 20160 is not up"}

2020-08-08 05:58:18,094 p=14357 u=tidb |  [10.12.5.114]: Ansible FAILED! => playbook: start.yml; TASK: wait until the TiDB port is up; message: {"changed": false, "elapsed": 300, "msg": "the TiDB port 4000 is not up"}

2020-08-08 05:58:18,094 p=14357 u=tidb |  [10.12.5.115]: Ansible FAILED! => playbook: start.yml; TASK: wait until the TiDB port is up; message: {"changed": false, "elapsed": 300, "msg": "the TiDB port 4000 is not up"}

The log of PD node 10.12.5.114 is as follows:

[2020/08/08 05:58:04.228 +00:00] [ERROR] [grpclog.go:75] ["transport: Got too many pings from the client, closing the connection."]

[2020/08/08 05:58:04.228 +00:00] [ERROR] [grpclog.go:75] ["transport: loopyWriter.run returning. Err: transport: Connection closing"]

[2020/08/08 05:58:04.228 +00:00] [WARN] [grpclog.go:60] ["transport: http2Server.HandleStreams failed to read frame: read tcp 10.12.5.114:2379->10.12.5.114:39158: use of closed network connection"]

[2020/08/08 06:00:16.729 +00:00] [ERROR] [grpclog.go:75] ["transport: Got too many pings from the client, closing the connection."]

[2020/08/08 06:00:16.729 +00:00] [ERROR] [grpclog.go:75] ["transport: loopyWriter.run returning. Err: transport: Connection closing"]

[2020/08/08 06:00:16.729 +00:00] [WARN] [grpclog.go:60] ["transport: http2Server.HandleStreams failed to read frame: read tcp 10.12.5.114:2379->10.12.5.114:41068: use of closed network connection"]

[2020/08/08 06:00:22.973 +00:00] [ERROR] [grpclog.go:75] ["transport: Got too many pings from the client, closing the connection."]

[2020/08/08 06:00:22.973 +00:00] [ERROR] [grpclog.go:75] ["transport: loopyWriter.run returning. Err: transport: Connection closing"]

[2020/08/08 06:00:22.973 +00:00] [WARN] [grpclog.go:60] ["transport: http2Server.HandleStreams failed to read frame: read tcp 10.12.5.114:2379->10.12.5.115:33486: use of closed network connection"]

[2020/08/08 06:01:11.845 +00:00] [INFO] [index.go:189] ["compact tree index"] [revision=4748868]

[2020/08/08 06:01:11.859 +00:00] [INFO] [kvstore_compaction.go:55] ["finished scheduled compaction"] [compact-revision=4748868] [took=13.466169ms]

[2020/08/08 06:02:37.983 +00:00] [ERROR] [grpclog.go:75] ["transport: Got too many pings from the client, closing the connection."]

[2020/08/08 06:02:37.983 +00:00] [ERROR] [grpclog.go:75] ["transport: loopyWriter.run returning. Err: transport: Connection closing"]

[2020/08/08 06:02:37.983 +00:00] [WARN] [grpclog.go:60] ["transport: http2Server.HandleStreams failed to read frame: read tcp 10.12.5.114:2379->10.12.5.114:42938: use of closed network connection"]

Please provide the following:

  1. The output of df -h on the TiKV node that fails to start
  2. The tidb.log file of the TiDB node that fails to start
  3. The Overview -> PD monitoring panels
  1. Output of df -h on the TiKV node that fails to start

    tidb@nine:~$ df -h
    Filesystem Size Used Avail Use% Mounted on
    udev 7.9G 0 7.9G 0% /dev
    tmpfs 1.6G 1.2M 1.6G 1% /run
    /dev/sda2 885G 818G 23G 98% /
    tmpfs 7.9G 0 7.9G 0% /dev/shm
    tmpfs 5.0M 0 5.0M 0% /run/lock
    tmpfs 7.9G 0 7.9G 0% /sys/fs/cgroup
    /dev/loop1 97M 97M 0 100% /snap/core/9436
    /dev/loop0 97M 97M 0 100% /snap/core/9665
    tmpfs 1.6G 0 1.6G 0% /run/user/1001

  2. tidb.log of the TiDB node that fails to start

[2020/08/08 19:47:52.183 +08:00] [WARN] [client_batch.go:223] ["init create streaming fail"] [target=10.12.5.221:20160] [error="context deadline exceeded"]

[2020/08/08 19:47:52.183 +08:00] [INFO] [region_cache.go:937] ["mark store's regions need be refill"] [store=10.12.5.221:20160]

[2020/08/08 19:47:52.183 +08:00] [INFO] [region_cache.go:430] ["switch region peer to next due to send request fail"] [current="region ID: 73, meta: id:73 start_key:\"t\\200\\000\\000\\000\\000\\000\\000!\" end_key:\"t\\200\\000\\000\\000\\000\\000\\000#\" region_epoch:<conf_ver:114 version:17 > peers:<id:15427350 store_id:484920 > peers:<id:17093307 store_id:407938 > , peer: id:17093307 store_id:407938 , addr: 10.12.5.221:20160, idx: 1"] [needReload=false] [error="context deadline exceeded"] [errorVerbose="context deadline exceeded\

github.com/pingcap/errors.AddStack
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/pkg/mod/github.com/pingcap/errors@v0.11.4/errors.go:174
github.com/pingcap/errors.Trace
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/pkg/mod/github.com/pingcap/errors@v0.11.4/juju_adaptor.go:15
github.com/pingcap/tidb/store/tikv.sendBatchRequest
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/store/tikv/client_batch.go:585
github.com/pingcap/tidb/store/tikv.(*rpcClient).SendRequest
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/store/tikv/client.go:287
github.com/pingcap/tidb/store/tikv.(*RegionRequestSender).sendReqToRegion
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/store/tikv/region_request.go:169
github.com/pingcap/tidb/store/tikv.(*RegionRequestSender).SendReqCtx
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/store/tikv/region_request.go:133
github.com/pingcap/tidb/store/tikv.(*copIteratorWorker).handleTaskOnce
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/store/tikv/coprocessor.go:696
github.com/pingcap/tidb/store/tikv.(*copIteratorWorker).handleTask
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/store/tikv/coprocessor.go:653
github.com/pingcap/tidb/store/tikv.(*copIteratorWorker).run
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/store/tikv/coprocessor.go:481
runtime.goexit
\t/usr/local/go/src/runtime/asm_amd64.s:1357"]

[2020/08/08 19:47:53.739 +08:00] [INFO] [region_cache.go:564] ["switch region peer to next due to NotLeader with NULL leader"] [currIdx=0] [regionID=73]

[2020/08/08 19:47:54.239 +08:00] [INFO] [region_cache.go:324] ["invalidate current region, because others failed on same store"] [region=73] [store=10.12.5.221:20160]

[2020/08/08 19:47:54.239 +08:00] [INFO] [coprocessor.go:754] ["[TIME_COP_PROCESS] resp_time:7.557614143s txnStartTS:418614386337513478 region_id:73 store_addr:10.12.5.223:20160 backoff_ms:39244 backoff_types:[regionMiss,tikvRPC,regionMiss,regionMiss,regionMiss,tikvRPC,regionMiss,regionMiss,regionMiss,tikvRPC,regionMiss,regionMiss,regionMiss,tikvRPC,regionMiss,regionMiss,regionMiss,tikvRPC,regionMiss,regionMiss,regionMiss,tikvRPC,regionMiss,regionMiss,regionMiss,tikvRPC,regionMiss,regionMiss,regionMiss,tikvRPC,regionMiss,regionMiss,regionMiss,tikvRPC,regionMiss,regionMiss,regionMiss,tikvRPC,regionMiss,regionMiss,regionMiss,tikvRPC,regionMiss,regionMiss,regionMiss,tikvRPC,regionMiss,regionMiss,regionMiss,tikvRPC,regionMiss,regionMiss,regionMiss,tikvRPC,regionMiss,regionMiss,regionMiss,tikvRPC,regionMiss,regionMiss,regionMiss,tikvRPC,regionMiss]"]

[2020/08/08 19:47:54.741 +08:00] [INFO] [region_cache.go:564] ["switch region peer to next due to NotLeader with NULL leader"] [currIdx=0] [regionID=73]

[2020/08/08 19:48:00.241 +08:00] [WARN] [client_batch.go:223] ["init create streaming fail"] [target=10.12.5.221:20160] [error="context deadline exceeded"]

[2020/08/08 19:48:00.241 +08:00] [INFO] [region_cache.go:937] ["mark store's regions need be refill"] [store=10.12.5.221:20160]

[2020/08/08 19:48:00.241 +08:00] [INFO] [region_cache.go:430] ["switch region peer to next due to send request fail"] [current="region ID: 73, meta: id:73 start_key:\"t\\200\\000\\000\\000\\000\\000\\000!\" end_key:\"t\\200\\000\\000\\000\\000\\000\\000#\" region_epoch:<conf_ver:114 version:17 > peers:<id:15427350 store_id:484920 > peers:<id:17093307 store_id:407938 > , peer: id:17093307 store_id:407938 , addr: 10.12.5.221:20160, idx: 1"] [needReload=false] [error="context deadline exceeded"] [errorVerbose="context deadline exceeded\

github.com/pingcap/errors.AddStack
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/pkg/mod/github.com/pingcap/errors@v0.11.4/errors.go:174
github.com/pingcap/errors.Trace
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/pkg/mod/github.com/pingcap/errors@v0.11.4/juju_adaptor.go:15
github.com/pingcap/tidb/store/tikv.sendBatchRequest
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/store/tikv/client_batch.go:585
github.com/pingcap/tidb/store/tikv.(*rpcClient).SendRequest
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/store/tikv/client.go:287
github.com/pingcap/tidb/store/tikv.(*RegionRequestSender).sendReqToRegion
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/store/tikv/region_request.go:169
github.com/pingcap/tidb/store/tikv.(*RegionRequestSender).SendReqCtx
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/store/tikv/region_request.go:133
github.com/pingcap/tidb/store/tikv.(*copIteratorWorker).handleTaskOnce
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/store/tikv/coprocessor.go:696
github.com/pingcap/tidb/store/tikv.(*copIteratorWorker).handleTask
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/store/tikv/coprocessor.go:653
github.com/pingcap/tidb/store/tikv.(*copIteratorWorker).run
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/store/tikv/coprocessor.go:481
runtime.goexit
\t/usr/local/go/src/runtime/asm_amd64.s:1357"]

[2020/08/08 19:48:00.242 +08:00] [WARN] [backoff.go:304] ["tikvRPC backoffer.maxSleep 40000ms is exceeded, errors:\

epoch_not_match:<> at 2020-08-08T19:47:54.240007952+08:00
not leader: region_id:73 , ctx: region ID: 73, meta: id:73 start_key:"t\200\000\000\000\000\000\000!" end_key:"t\200\000\000\000\000\000\000#" region_epoch:<conf_ver:114 version:17 > peers:<id:15427350 store_id:484920 > peers:<id:17093307 store_id:407938 > , peer: id:15427350 store_id:484920 , addr: 10.12.5.223:20160, idx: 0 at 2020-08-08T19:47:54.741462916+08:00
send tikv request error: context deadline exceeded, ctx: region ID: 73, meta: id:73 start_key:"t\200\000\000\000\000\000\000!" end_key:"t\200\000\000\000\000\000\000#" region_epoch:<conf_ver:114 version:17 > peers:<id:15427350 store_id:484920 > peers:<id:17093307 store_id:407938 > , peer: id:17093307 store_id:407938 , addr: 10.12.5.221:20160, idx: 1, try next peer later at 2020-08-08T19:48:00.24215013+08:00"]

2020/08/08 19:48:00.242 terror.go:342: [fatal] [tikv:9005]Region is unavailable

github.com/pingcap/errors.AddStack

/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/pkg/mod/github.com/pingcap/errors@v0.11.4/errors.go:174

github.com/pingcap/errors.Trace

/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/pkg/mod/github.com/pingcap/errors@v0.11.4/juju_adaptor.go:15

github.com/pingcap/tidb/store/tikv.(*RegionRequestSender).onSendFail

/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/store/tikv/region_request.go:229

github.com/pingcap/tidb/store/tikv.(*RegionRequestSender).sendReqToRegion

/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/store/tikv/region_request.go:172

github.com/pingcap/tidb/store/tikv.(*RegionRequestSender).SendReqCtx

/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/store/tikv/region_request.go:133

github.com/pingcap/tidb/store/tikv.(*copIteratorWorker).handleTaskOnce

/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/store/tikv/coprocessor.go:696

github.com/pingcap/tidb/store/tikv.(*copIteratorWorker).handleTask

/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/store/tikv/coprocessor.go:653

github.com/pingcap/tidb/store/tikv.(*copIteratorWorker).run

/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/store/tikv/coprocessor.go:481

runtime.goexit

/usr/local/go/src/runtime/asm_amd64.s:1357
  3. Overview → PD monitoring
    Since the services cannot start, this information cannot be retrieved from Grafana.
    The following was obtained with pd-ctl instead:
tidb@three:~/tidb-ansible/resources/bin$ ./pd-ctl -u "http://10.12.5.113:2379" member
{
  "header": {
    "cluster_id": 6807312755917103041
  },
  "members": [
    {
      "name": "pd_pd3",
      "member_id": 2579653654541892389,
      "peer_urls": [
        "http://10.12.5.115:2380"
      ],
      "client_urls": [
        "http://10.12.5.115:2379"
      ]
    },
    {
      "name": "pd_pd2",
      "member_id": 3717199249823848643,
      "peer_urls": [
        "http://10.12.5.114:2380"
      ],
      "client_urls": [
        "http://10.12.5.114:2379"
      ]
    },
    {
      "name": "pd_pd1",
      "member_id": 4691481983733508901,
      "peer_urls": [
        "http://10.12.5.113:2380"
      ],
      "client_urls": [
        "http://10.12.5.113:2379"
      ]
    }
  ],
  "leader": {
    "name": "pd_pd1",
    "member_id": 4691481983733508901,
    "peer_urls": [
      "http://10.12.5.113:2380"
    ],
    "client_urls": [
      "http://10.12.5.113:2379"
    ]
  },
  "etcd_leader": {
    "name": "pd_pd1",
    "member_id": 4691481983733508901,
    "peer_urls": [
      "http://10.12.5.113:2380"
    ],
    "client_urls": [
      "http://10.12.5.113:2379"
    ]
  }
}
tidb@three:~/tidb-ansible/resources/bin$ ./pd-ctl -u "http://10.12.5.113:2379" store
{
  "count": 12,
  "stores": [
    {
      "store": {
        "id": 2026701,
        "address": "10.12.5.227:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "884.9GiB",
        "available": "215GiB",
        "leader_count": 8392,
        "leader_weight": 1,
        "leader_score": 937689,
        "leader_size": 937689,
        "region_count": 17443,
        "region_weight": 1,
        "region_score": 843405288.182425,
        "region_size": 1967586,
        "start_ts": "2020-08-07T15:03:27Z",
        "last_heartbeat_ts": "2020-08-08T11:51:59.300228709Z",
        "uptime": "20h48m32.300228709s"
      }
    },
    {
      "store": {
        "id": 6506925,
        "address": "10.12.5.228:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "1007GiB",
        "available": "244.7GiB",
        "leader_count": 7887,
        "leader_weight": 1,
        "leader_score": 938577,
        "leader_size": 938577,
        "region_count": 17611,
        "region_weight": 1,
        "region_score": 843185057.8181653,
        "region_size": 2013469,
        "start_ts": "2020-08-07T15:03:43Z",
        "last_heartbeat_ts": "2020-08-08T11:51:55.484376468Z",
        "uptime": "20h48m12.484376468s"
      }
    },
    {
      "store": {
        "id": 10968962,
        "address": "10.12.5.233:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "884.9GiB",
        "available": "215.2GiB",
        "leader_count": 8278,
        "leader_weight": 1,
        "leader_score": 937689,
        "leader_size": 937689,
        "region_count": 15276,
        "region_weight": 1,
        "region_score": 841956941.3070436,
        "region_size": 1711320,
        "start_ts": "2020-08-07T15:02:56Z",
        "last_heartbeat_ts": "2020-08-08T11:51:55.634135924Z",
        "uptime": "20h48m59.634135924s"
      }
    },
    {
      "store": {
        "id": 665678,
        "address": "127.0.0.1:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Down"
      },
      "status": {
        "leader_weight": 1,
        "region_weight": 1,
        "start_ts": "1970-01-01T00:00:00Z"
      }
    },
    {
      "store": {
        "id": 640552,
        "address": "10.12.5.224:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "884.9GiB",
        "available": "215.2GiB",
        "leader_count": 9180,
        "leader_weight": 1,
        "leader_score": 938467,
        "leader_size": 938467,
        "region_count": 19311,
        "region_weight": 1,
        "region_score": 841943870.6920581,
        "region_size": 1956940,
        "start_ts": "2020-08-07T15:03:23Z",
        "last_heartbeat_ts": "2020-08-08T11:51:54.455860075Z",
        "uptime": "20h48m31.455860075s"
      }
    },
    {
      "store": {
        "id": 6506924,
        "address": "10.12.5.229:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "1007GiB",
        "available": "244.6GiB",
        "leader_count": 9255,
        "leader_weight": 1,
        "leader_score": 938570,
        "leader_size": 938570,
        "region_count": 19089,
        "region_weight": 1,
        "region_score": 843320363.6070805,
        "region_size": 2004782,
        "start_ts": "2020-08-07T15:03:43Z",
        "last_heartbeat_ts": "2020-08-08T11:51:56.195575776Z",
        "uptime": "20h48m13.195575776s"
      }
    },
    {
      "store": {
        "id": 335855,
        "address": "10.12.5.230:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "884.9GiB",
        "available": "215.2GiB",
        "leader_count": 8001,
        "leader_weight": 1,
        "leader_score": 938633,
        "leader_size": 938633,
        "region_count": 17089,
        "region_weight": 1,
        "region_score": 842085247.7981482,
        "region_size": 2001698,
        "start_ts": "2020-08-07T15:02:07Z",
        "last_heartbeat_ts": "2020-08-08T11:51:55.988218953Z",
        "uptime": "20h49m48.988218953s"
      }
    },
    {
      "store": {
        "id": 407938,
        "address": "10.12.5.221:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Down"
      },
      "status": {
        "leader_weight": 1,
        "region_count": 1346,
        "region_weight": 1,
        "region_score": 16,
        "region_size": 16,
        "start_ts": "1970-01-01T00:00:00Z"
      }
    },
    {
      "store": {
        "id": 407940,
        "address": "10.12.5.220:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "865.2GiB",
        "available": "210.5GiB",
        "leader_count": 9032,
        "leader_weight": 1,
        "leader_score": 937662,
        "leader_size": 937662,
        "region_count": 18397,
        "region_weight": 1,
        "region_score": 841403337.9166226,
        "region_size": 1910555,
        "start_ts": "2020-08-07T15:01:43Z",
        "last_heartbeat_ts": "2020-08-08T11:51:54.024402611Z",
        "uptime": "20h50m11.024402611s"
      }
    },
    {
      "store": {
        "id": 484920,
        "address": "10.12.5.223:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "884.9GiB",
        "available": "215.2GiB",
        "leader_count": 9083,
        "leader_weight": 1,
        "leader_score": 937628,
        "leader_size": 937628,
        "region_count": 18707,
        "region_weight": 1,
        "region_score": 841935338.1643195,
        "region_size": 1936273,
        "start_ts": "2020-08-07T15:03:12Z",
        "last_heartbeat_ts": "2020-08-08T11:51:54.713647796Z",
        "uptime": "20h48m42.713647796s"
      }
    },
    {
      "store": {
        "id": 6506926,
        "address": "10.12.5.231:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "1007GiB",
        "available": "245GiB",
        "leader_count": 7659,
        "leader_weight": 1,
        "leader_score": 938559,
        "leader_size": 938559,
        "region_count": 13842,
        "region_weight": 1,
        "region_score": 841210712.7905145,
        "region_size": 1632447,
        "start_ts": "2020-08-07T15:01:54Z",
        "last_heartbeat_ts": "2020-08-08T11:51:53.469129485Z",
        "uptime": "20h49m59.469129485s"
      }
    },
    {
      "store": {
        "id": 1597655,
        "address": "10.12.5.226:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "835.7GiB",
        "available": "203GiB",
        "leader_count": 9212,
        "leader_weight": 1,
        "leader_score": 938610,
        "leader_size": 938610,
        "region_count": 16542,
        "region_weight": 1,
        "region_score": 843726572.5291586,
        "region_size": 1629098,
        "start_ts": "2020-08-07T15:03:37Z",
        "last_heartbeat_ts": "2020-08-08T11:51:58.167808167Z",
        "uptime": "20h48m21.167808167s"
      }
    }
  ]
}
tidb@three:~/tidb-ansible/resources/bin$ ./pd-ctl -u "http://10.12.5.113:2379" health
[
  {
    "name": "pd_pd3",
    "member_id": 2579653654541892389,
    "client_urls": [
      "http://10.12.5.115:2379"
    ],
    "health": true
  },
  {
    "name": "pd_pd2",
    "member_id": 3717199249823848643,
    "client_urls": [
      "http://10.12.5.114:2379"
    ],
    "health": true
  },
  {
    "name": "pd_pd1",
    "member_id": 4691481983733508901,
    "client_urls": [
      "http://10.12.5.113:2379"
    ],
    "health": true
  }
]

Could you also run df -i and check the inode usage?
If the inodes are exhausted, that can also produce "no space left" errors.
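
For reference, a minimal way to check both conditions on the filesystem that holds the TiKV data directory (the /home/tidb/deploy path is taken from the error log above and is an assumption about your layout):

df -h /home/tidb/deploy/data    # block usage on the TiKV data filesystem
df -i /home/tidb/deploy/data    # inode usage; 100% IUse% also causes ENOSPC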

It looks like inode usage is normal?

tidb@nine:~$ df -i
Filesystem       Inodes  IUsed    IFree IUse% Mounted on
udev            2045237    426  2044811    1% /dev
tmpfs           2053115    663  2052452    1% /run
/dev/sda2      58982400 230889 58751511    1% /
tmpfs           2053115      1  2053114    1% /dev/shm
tmpfs           2053115      4  2053111    1% /run/lock
tmpfs           2053115     18  2053097    1% /sys/fs/cgroup
/dev/loop1        12780  12780        0  100% /snap/core/9436
/dev/loop0        12862  12862        0  100% /snap/core/9665
tmpfs           2053115     11  2053104    1% /run/user/1001


tidb@nine:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            7.9G     0  7.9G   0% /dev
tmpfs           1.6G  1.2M  1.6G   1% /run
/dev/sda2       885G  818G   23G  98% /
tmpfs           7.9G     0  7.9G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           7.9G     0  7.9G   0% /sys/fs/cgroup
/dev/loop1       97M   97M     0 100% /snap/core/9436
/dev/loop0       97M   97M     0 100% /snap/core/9665
tmpfs           1.6G     0  1.6G   0% /run/user/1001
  1. What are these two entries? Could you try moving some files to another directory?

  2. What operating system version are you running?

  1. My TiKV nodes run on virtual machines. I checked several other TiKV node VMs and they all show /dev/loopX at 100% usage, so I don't think /dev/loopX affects the TiKV nodes.

  2. Ubuntu 18.04.4.

    tidb@nine:~$ uname -a

    Linux nine 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 10:07:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

    tidb@nine:~$ sudo lsb_release -a

    No LSB modules are available.

    Distributor ID: Ubuntu

    Description: Ubuntu 18.04.4 LTS

    Release: 18.04

    Codename: bionic

Could you try restarting the nodes? We have seen a similar case before where the disk was filled up by logs; after the logs were cleaned up manually, the PD and TiKV nodes had to be restarted to recover. Would it be convenient for you to try a restart?
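
A minimal sketch of the suggested sequence, assuming the standard tidb-ansible layout on the control machine; the log path on the TiKV host comes from the deploy directory in the error log, and the rotated-log file pattern is an assumption:

# On the ansible control machine: stop the whole cluster first
cd ~/tidb-ansible
ansible-playbook stop.yml

# On the affected TiKV host (10.12.5.221): free some space, e.g. remove
# rotated TiKV logs while keeping the live tikv.log
ssh tidb@10.12.5.221 'rm /home/tidb/deploy/log/tikv.log.2020-*'

# Back on the control machine: start PD, TiKV, then TiDB again
ansible-playbook start.yml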

By restart, do you mean restarting PD and TiKV manually with scripts? I already tried ansible-playbook stop.yml and start.yml, without success.

  1. Does stop complete normally?
  2. When running start, does the ansible log show that PD started normally?
  3. If it is convenient to restart once more, please send the ansible.log from the log directory on the Ansible control machine. Thanks.
  1. stop completes normally.

  2. According to ansible.log, PD started successfully; the relevant PD startup entries are in the output below.

  3. The ansible.log is as follows:

tidb@three:~/tidb-ansible$ ansible-playbook start.yml 

PLAY [check config locally] *****************************************************************

TASK [check_config_static : Ensure only one monitoring host exists] *************************

TASK [check_config_static : Ensure monitored_servers exists] ********************************

TASK [check_config_static : Ensure TiDB host exists] ****************************************

TASK [check_config_static : Ensure PD host exists] ******************************************

TASK [check_config_static : Ensure TiKV host exists] ****************************************

TASK [check_config_static : Check ansible_user variable] ************************************

TASK [check_config_static : Ensure timezone variable is set] ********************************

TASK [check_config_static : Close old SSH control master processes] *************************
ok: [localhost]

TASK [check_config_static : Check ansible version] ******************************************

TASK [check_config_static : Check if jmespath is installed] *********************************
changed: [localhost]

TASK [check_config_static : Check if jinja2 is installed] ***********************************
changed: [localhost]

TASK [check_config_static : Preflight check - Fail when jmespath or jinja2 doesn't exist] ***

TASK [check_config_static : Get jmespath info] **********************************************
changed: [localhost]

TASK [check_config_static : Get jmespath version] *******************************************
ok: [localhost]

TASK [check_config_static : Get jinja2 info] ************************************************
changed: [localhost]

TASK [check_config_static : Get jinja2 version] *********************************************
ok: [localhost]

TASK [check_config_static : Preflight check - Fail when the versions of jmespath and jinja2 doesn't meet the requirements] ***

PLAY [gather all facts, and check dest] *****************************************************

TASK [check_config_dynamic : Set enable_binlog variable] ************************************

TASK [check_config_dynamic : Set deploy_dir if not set] *************************************

TASK [check_config_dynamic : environment check (deploy dir)] ********************************
ok: [10.12.5.113]
ok: [10.12.5.114]
ok: [10.12.5.115]
ok: [10.12.5.232]
ok: [10.12.5.230]
ok: [10.12.5.226]
ok: [10.12.5.220]
ok: [10.12.5.221]
ok: [10.12.5.223]
ok: [10.12.5.224]
ok: [10.12.5.227]
ok: [10.12.5.229]
ok: [10.12.5.228]
ok: [10.12.5.231]
ok: [10.12.5.233]

TASK [check_config_dynamic : Preflight check - Does deploy dir have appropriate permission] ***

TASK [check_config_dynamic : environment check (supervise)] *********************************

TASK [check_config_dynamic : config skip variables (default)] *******************************
ok: [10.12.5.232]
ok: [10.12.5.114]
ok: [10.12.5.115]
ok: [10.12.5.113]
ok: [10.12.5.230]
ok: [10.12.5.221]
ok: [10.12.5.220]
ok: [10.12.5.223]
ok: [10.12.5.226]
ok: [10.12.5.224]
ok: [10.12.5.227]
ok: [10.12.5.228]
ok: [10.12.5.229]
ok: [10.12.5.231]
ok: [10.12.5.233]

TASK [check_config_dynamic : config skip variables] *****************************************
ok: [10.12.5.232]
ok: [10.12.5.113]
ok: [10.12.5.114]
ok: [10.12.5.115]
ok: [10.12.5.230]
ok: [10.12.5.220]
ok: [10.12.5.221]
ok: [10.12.5.223]
ok: [10.12.5.226]
ok: [10.12.5.224]
ok: [10.12.5.227]
ok: [10.12.5.228]
ok: [10.12.5.229]
ok: [10.12.5.231]
ok: [10.12.5.233]

TASK [check_config_dynamic : config skip variables] *****************************************

PLAY [monitored_servers] ********************************************************************

TASK [start node_exporter by supervise] *****************************************************

TASK [start node_exporter by systemd] *******************************************************
changed: [10.12.5.113]
changed: [10.12.5.115]
changed: [10.12.5.114]
changed: [10.12.5.230]
changed: [10.12.5.232]
changed: [10.12.5.220]
changed: [10.12.5.223]
changed: [10.12.5.221]
changed: [10.12.5.226]
changed: [10.12.5.224]
changed: [10.12.5.227]
changed: [10.12.5.228]
changed: [10.12.5.229]
changed: [10.12.5.231]
changed: [10.12.5.233]

TASK [wait until the node_exporter port is up] **********************************************
ok: [10.12.5.230]
ok: [10.12.5.113]
ok: [10.12.5.114]
ok: [10.12.5.115]
ok: [10.12.5.232]
ok: [10.12.5.220]
ok: [10.12.5.221]
ok: [10.12.5.223]
ok: [10.12.5.226]
ok: [10.12.5.224]
ok: [10.12.5.227]
ok: [10.12.5.228]
ok: [10.12.5.233]
ok: [10.12.5.229]
ok: [10.12.5.231]

TASK [wait until the node_exporter metrics page is available] *******************************
ok: [10.12.5.113]
ok: [10.12.5.232]
ok: [10.12.5.230]
ok: [10.12.5.115]
ok: [10.12.5.114]
ok: [10.12.5.221]
ok: [10.12.5.220]
ok: [10.12.5.223]
ok: [10.12.5.224]
ok: [10.12.5.226]
ok: [10.12.5.227]
ok: [10.12.5.228]
ok: [10.12.5.229]
ok: [10.12.5.231]
ok: [10.12.5.233]

PLAY [monitored_servers] ********************************************************************

TASK [start blackbox_exporter by supervise] *************************************************

TASK [start blackbox_exporter by systemd] ***************************************************
changed: [10.12.5.113]
changed: [10.12.5.114]
changed: [10.12.5.115]
changed: [10.12.5.232]
changed: [10.12.5.230]
changed: [10.12.5.226]
changed: [10.12.5.221]
changed: [10.12.5.220]
changed: [10.12.5.223]
changed: [10.12.5.224]
changed: [10.12.5.227]
changed: [10.12.5.229]
changed: [10.12.5.228]
changed: [10.12.5.231]
changed: [10.12.5.233]

TASK [wait until the blackbox_exporter port is up] ******************************************
ok: [10.12.5.232]
ok: [10.12.5.113]
ok: [10.12.5.114]
ok: [10.12.5.115]
ok: [10.12.5.230]
ok: [10.12.5.220]
ok: [10.12.5.221]
ok: [10.12.5.223]
ok: [10.12.5.226]
ok: [10.12.5.224]
ok: [10.12.5.228]
ok: [10.12.5.227]
ok: [10.12.5.229]
ok: [10.12.5.233]
ok: [10.12.5.231]

TASK [wait until the blackbox_exporter metrics page is available] ***************************
ok: [10.12.5.232]
ok: [10.12.5.113]
ok: [10.12.5.114]
ok: [10.12.5.115]
ok: [10.12.5.230]
ok: [10.12.5.220]
ok: [10.12.5.223]
ok: [10.12.5.226]
ok: [10.12.5.221]
ok: [10.12.5.224]
ok: [10.12.5.227]
ok: [10.12.5.228]
ok: [10.12.5.229]
ok: [10.12.5.231]
ok: [10.12.5.233]

PLAY [alertmanager_servers] *****************************************************************

TASK [start alertmanager by supervise] ******************************************************

TASK [start alertmanager by systemd] ********************************************************
changed: [10.12.5.232]

TASK [wait until the alertmanager port is up] ***********************************************
ok: [10.12.5.232]

PLAY [monitoring_servers] *******************************************************************

TASK [start pushgateway by supervise] *******************************************************

TASK [start pushgateway by systemd] *********************************************************
changed: [10.12.5.232] => (item=pushgateway-9091.service)

TASK [wait until the pushgateway port is up] ************************************************
ok: [10.12.5.232]

TASK [wait until the pushgateway metrics page is available] *********************************
ok: [10.12.5.232]

PLAY [monitoring_servers] *******************************************************************

TASK [start prometheus by supervise] ********************************************************

TASK [start prometheus by systemd] **********************************************************
changed: [10.12.5.232] => (item=prometheus-9090.service)

TASK [wait until the prometheus port is up] *************************************************
ok: [10.12.5.232]

TASK [wait until the prometheus metrics page is available] **********************************
ok: [10.12.5.232]

PLAY [kafka_exporter_servers] ***************************************************************
skipping: no hosts matched

PLAY [pd_servers] ***************************************************************************

TASK [start PD by supervise] ****************************************************************

TASK [start PD by systemd] ******************************************************************
changed: [10.12.5.113]
changed: [10.12.5.115]
changed: [10.12.5.114]

TASK [wait until the PD port is up] *********************************************************
ok: [10.12.5.113]
ok: [10.12.5.114]
ok: [10.12.5.115]

TASK [wait until the PD health page is available] *******************************************
ok: [10.12.5.113]
ok: [10.12.5.115]
ok: [10.12.5.114]

TASK [wait until the PD health page is available when enable_tls] ***************************

PLAY [tikv_servers] *************************************************************************

TASK [start TiKV by supervise] **************************************************************

TASK [start TiKV by systemd] ****************************************************************
changed: [10.12.5.220]
changed: [10.12.5.224]
changed: [10.12.5.230]
changed: [10.12.5.223]
changed: [10.12.5.221]
changed: [10.12.5.226]
changed: [10.12.5.227]
changed: [10.12.5.228]
changed: [10.12.5.229]
changed: [10.12.5.231]
changed: [10.12.5.233]

TASK [wait until the TiKV port is up] *******************************************************
ok: [10.12.5.220]
ok: [10.12.5.230]
ok: [10.12.5.224]
ok: [10.12.5.228]
ok: [10.12.5.229]
ok: [10.12.5.231]
ok: [10.12.5.233]
ok: [10.12.5.223]
ok: [10.12.5.227]
ok: [10.12.5.226]
fatal: [10.12.5.221]: FAILED! => {"changed": false, "elapsed": 300, "msg": "the TiKV port 20160 is not up"}

TASK [wait until the TiKV status page is available] *****************************************
ok: [10.12.5.220]
ok: [10.12.5.230]
ok: [10.12.5.224]
ok: [10.12.5.223]
ok: [10.12.5.226]
ok: [10.12.5.227]
ok: [10.12.5.228]
ok: [10.12.5.229]
ok: [10.12.5.233]
ok: [10.12.5.231]

TASK [wait until the TiKV status page is available when enable_tls] *************************

TASK [command] ******************************************************************************
ok: [10.12.5.226]
ok: [10.12.5.230]
ok: [10.12.5.224]
ok: [10.12.5.220]
ok: [10.12.5.223]
ok: [10.12.5.227]
ok: [10.12.5.228]
ok: [10.12.5.231]
ok: [10.12.5.229]
ok: [10.12.5.233]

TASK [display new tikv pid] *****************************************************************
ok: [10.12.5.230] => {
    "msg": "tikv binary or docker pid: 21119"
}
ok: [10.12.5.220] => {
    "msg": "tikv binary or docker pid: 15214"
}
ok: [10.12.5.223] => {
    "msg": "tikv binary or docker pid: 19281"
}
ok: [10.12.5.224] => {
    "msg": "tikv binary or docker pid: 20678"
}
ok: [10.12.5.226] => {
    "msg": "tikv binary or docker pid: 22735"
}
ok: [10.12.5.227] => {
    "msg": "tikv binary or docker pid: 28460"
}
ok: [10.12.5.228] => {
    "msg": "tikv binary or docker pid: 5812"
}
ok: [10.12.5.229] => {
    "msg": "tikv binary or docker pid: 11679"
}
ok: [10.12.5.231] => {
    "msg": "tikv binary or docker pid: 5377"
}
ok: [10.12.5.233] => {
    "msg": "tikv binary or docker pid: 32437"
}

PLAY [pd_servers[0]] ************************************************************************

TASK [wait for region replication complete] *************************************************
ok: [10.12.5.113]

PLAY [pd_servers[0]] ************************************************************************

TASK [wait for region replication complete with tls enabled] ********************************

PLAY [pump_servers] *************************************************************************
skipping: no hosts matched

PLAY [tidb_servers] *************************************************************************

TASK [start TiDB by supervise] **************************************************************

TASK [start TiDB by systemd] ****************************************************************
changed: [10.12.5.114]
changed: [10.12.5.115]

TASK [wait until the TiDB port is up] *******************************************************
fatal: [10.12.5.115]: FAILED! => {"changed": false, "elapsed": 300, "msg": "the TiDB port 4000 is not up"}
fatal: [10.12.5.114]: FAILED! => {"changed": false, "elapsed": 300, "msg": "the TiDB port 4000 is not up"}
	to retry, use: --limit @/home/tidb/tidb-ansible/retry_files/start.retry

PLAY RECAP **********************************************************************************
10.12.5.113                : ok=13   changed=3    unreachable=0    failed=0   
10.12.5.114                : ok=13   changed=4    unreachable=0    failed=1   
10.12.5.115                : ok=13   changed=4    unreachable=0    failed=1   
10.12.5.220                : ok=14   changed=3    unreachable=0    failed=0   
10.12.5.221                : ok=10   changed=3    unreachable=0    failed=1   
10.12.5.223                : ok=14   changed=3    unreachable=0    failed=0   
10.12.5.224                : ok=14   changed=3    unreachable=0    failed=0   
10.12.5.226                : ok=14   changed=3    unreachable=0    failed=0   
10.12.5.227                : ok=14   changed=3    unreachable=0    failed=0   
10.12.5.228                : ok=14   changed=3    unreachable=0    failed=0   
10.12.5.229                : ok=14   changed=3    unreachable=0    failed=0   
10.12.5.230                : ok=14   changed=3    unreachable=0    failed=0   
10.12.5.231                : ok=14   changed=3    unreachable=0    failed=0   
10.12.5.232                : ok=17   changed=5    unreachable=0    failed=0   
10.12.5.233                : ok=14   changed=3    unreachable=0    failed=0   
localhost                  : ok=7    changed=4    unreachable=0    failed=0   


ERROR MESSAGE SUMMARY ***********************************************************************
[10.12.5.221]: Ansible FAILED! => playbook: start.yml; TASK: wait until the TiKV port is up; message: {"changed": false, "elapsed": 300, "msg": "the TiKV port 20160 is not up"}

[10.12.5.115]: Ansible FAILED! => playbook: start.yml; TASK: wait until the TiDB port is up; message: {"changed": false, "elapsed": 300, "msg": "the TiDB port 4000 is not up"}

[10.12.5.114]: Ansible FAILED! => playbook: start.yml; TASK: wait until the TiDB port is up; message: {"changed": false, "elapsed": 300, "msg": "the TiDB port 4000 is not up"}

Ask for help:
Contact us: support@pingcap.com
It seems that you encounter some problems. You can send an email to the above email address, attached with the tidb-ansible/inventory.ini and tidb-ansible/log/ansible.log files and the error message, or new issue on https://github.com/pingcap/tidb-ansible/issues. We'll try our best to help you deploy a TiDB cluster. Thanks. :-)

Please upload this TiKV node's logs for the period from startup to failure. Thanks.

221_tikv_1.log (4.3 MB)
221_tikv_2.log (3.9 MB)
221_tikv_3.log (4.8 MB)
221_tikv_4.log (4.9 MB)
221_tikv_5.log (2.0 MB)

  1. The error in the logs is in 221_tikv_5, reported after the restart at 2020/08/07 05:51:16. What happened at that time?
  2. Please also check the total size of this TiKV node's data/db/*.log and data/raft/*.log files with the du command (see the sketch after this list).
  3. Please also send the RocksDB logs (for both kvdb and raftdb), located under the db and raft directories respectively; uploading the logs that cover the last failed startup is enough. Thanks.
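
A sketch of the checks being requested, assuming the deploy directory from the error log (/home/tidb/deploy); in RocksDB the numbered *.log files are the write-ahead logs, while LOG and LOG.old.* are the info logs:

# Total size of the RocksDB WAL files in the kv DB and the raft DB
du -ch /home/tidb/deploy/data/db/*.log | tail -n 1
du -ch /home/tidb/deploy/data/raft/*.log | tail -n 1

# RocksDB info logs (kvdb and raftdb) to collect
ls -lh /home/tidb/deploy/data/db/LOG* /home/tidb/deploy/data/raft/LOG*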
  1. Basically I just kept re-running stop.yml and start.yml. When I noticed that 221 had run out of space (No space), I deleted the tikv.log.2020-0x-0x files under /deploy/log/ and kept tikv.log. Nothing else was done.

  2. The LOG information is as follows:

data/db

Total size: 642 G

Breakdown of the LOG files:

9.2M LOG
1.1G LOG.old.1592399505343559
1.1G LOG.old.1593770635022500
1.1G LOG.old.1595231880868536
171M LOG.old.1595571692493200
312M LOG.old.1595996184898344
28M LOG.old.1596092482201018
201M LOG.old.1596509877332629
134M LOG.old.1596812479051183
28M LOG.old.1597046353101839

data/raft

Total size: 2.3 G

Breakdown of the LOG files:

3.2M LOG
128K LOG.old.1591509246573231
128K LOG.old.1591509995026446
128K LOG.old.1591513544508620
202M LOG.old.1595571673220541
24M LOG.old.1595996181587802
5.4M LOG.old.1596092475248244
29M LOG.old.1596509867861459
16M LOG.old.1596812476202831
8.9M LOG.old.1597046352073550

Under the raft directory:
raft_log_from_LOG.old.1596812476202831 (924.0 KB)

Under the db directory:

db_log_from_LOG.old.1596812479051183 (2.7 MB)

Thanks. For the size statistics we need the *.log files, not the LOG.* files; could you count them again? Thanks.

du -sh *.log

tidb@daslab:~/deploy/data/raft$ du -sh *.log

141M 519560.log


tidb@daslab:~/deploy/data/db$ du -sh *.log

141M 7232190.log

The raft_log_from_LOG.old.1596812476202831 you sent is not the complete file. Could you send a complete one? Just send the LOG file directly.

This log does not seem to come from the crashed instance? It shows that RocksDB only shut down at 14:59 on August 7, but the TiKV log (221_tikv_5.log) shows this instance had already been restarting repeatedly since at least 05:00 on August 7.

Hello, only those two LOG files are needed. Thanks.