TiFlash keeps restarting on k8s

[Versions] OS: 4.19.90-17.ky10.aarch64, k8s: v1.24.9, operator: 1.4.0, TiDB: 6.1.2
dyrnq/local-volume-provisioner:v2.5.0 (pod time zone not adjusted)

Configuration:

  tiflash:
    baseImage: 10.xxxx/zongbu-sre/tiflash-arm64:v6.1.2
    replicas: 3
    limits:
      cpu: 12000m
      memory: 16Gi
    imagePullPolicy: IfNotPresent
    storageClaims:
      - resources:
          requests:
            storage: 500Gi
        storageClassName: tiflash-storage

Status:

tidb-test-cluster-tiflash-0                    4/4     Running   26 (6m28s ago)   3h36m
tidb-test-cluster-tiflash-1                    4/4     Running   27 (5m59s ago)   3h36m
tidb-test-cluster-tiflash-2                    4/4     Running   25 (11m ago)     3h36m

Logs:

previous.txt (620.3 KB)

current.log (620.3 KB)

I don't see any exit-related messages in the logs.
Try running `kubectl describe` on the pod and see why?
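To surface the restart cause quickly, the useful bits of `kubectl describe pod` are the `Last State` block and the Events; `kubectl logs --previous` shows the crashed container's output. A minimal sketch (run against a captured sample here so the pipeline is self-contained; in the cluster you would pipe the live command instead):

```shell
# In the cluster, the live commands would be e.g.:
#   kubectl describe pod tidb-test-cluster-tiflash-2 | grep -A3 'Last State'
#   kubectl logs tidb-test-cluster-tiflash-2 -c tiflash --previous
# Sample of what `kubectl describe` prints for a crashed container:
describe_sample='    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137'

# Pull out just the termination reason.
printf '%s\n' "$describe_sample" | grep -A2 'Last State' | awk '/Reason:/ {print $2}'
# -> OOMKilled
```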

Not much information there either. After posting, it has been stable for about 20 minutes:

tidb-test-cluster-tiflash-0                    4/4     Running   26 (16m ago)   3h46m
tidb-test-cluster-tiflash-1                    4/4     Running   27 (16m ago)   3h46m
tidb-test-cluster-tiflash-2                    4/4     Running   25 (21m ago)   3h46m
Status:       Running
IP:           172.16.228.157
IPs:
  IP:           172.16.228.157
Controlled By:  StatefulSet/tidb-test-cluster-tiflash
Init Containers:
  init:
    Container ID:  containerd://3067f4400a71ad11516856dccdb2730aee11acf59209fb410e7c1a30f975c937
    Image:         10.172.49.246/zongbu-sre/alpine-arm64:3.17.0
    Image ID:      10.172.49.246/zongbu-sre/alpine-arm64@sha256:af06af3514c44a964d3b905b498cf6493db8f1cde7c10e078213a89c87308ba0
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      set -ex;ordinal=`echo ${POD_NAME} | awk -F- '{print $NF}'`;sed s/POD_NUM/${ordinal}/g /etc/tiflash/config_templ.toml > /data0/config.toml;sed s/POD_NUM/${ordinal}/g /etc/tiflash/proxy_templ.toml > /data0/proxy.toml
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 30 Jan 2023 11:26:57 +0800
      Finished:     Mon, 30 Jan 2023 11:26:57 +0800
    Ready:          True
    Restart Count:  0
    Environment:
      POD_NAME:  tidb-test-cluster-tiflash-2 (v1:metadata.name)
    Mounts:
      /data0 from data0 (rw)
      /etc/tiflash from config (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ccks8 (ro)
Containers:
  tiflash:
    Container ID:  containerd://4f01a3705a5e355f152addc929e5d71b8608212dedc2a60845c8166ab24358a7
    Image:         10.172.49.246/zongbu-sre/tiflash-arm64:v6.1.2
    Image ID:      10.172.49.246/zongbu-sre/tiflash-arm64@sha256:96f39d55b339c1b9e61f09fe8c8d6e0ef69add8557a0cf77a4b340d561f8c0aa
    Ports:         3930/TCP, 20170/TCP, 9000/TCP, 8123/TCP, 9009/TCP, 8234/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP
    Command:
      /bin/sh
      -c
      /tiflash/tiflash server --config-file /data0/config.toml
    State:          Running
      Started:      Mon, 30 Jan 2023 14:57:11 +0800
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 30 Jan 2023 14:51:26 +0800
      Finished:     Mon, 30 Jan 2023 14:52:03 +0800
    Ready:          True
    Restart Count:  25
    Limits:
      cpu:     12
      memory:  16Gi
    Requests:
      cpu:     12
      memory:  16Gi
    Environment:
      NAMESPACE:              default (v1:metadata.namespace)
      CLUSTER_NAME:           tidb-test-cluster
      HEADLESS_SERVICE_NAME:  tidb-test-cluster-tiflash-peer
      CAPACITY:               0
      TZ:                     UTC
    Mounts:
      /data0 from data0 (rw)
      /etc/podinfo from annotations (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ccks8 (ro)
  serverlog:
    Container ID:  containerd://3b48f767a9a2fcb893ce14902a2f822d2dc287cc2760fed926f9b4c202347eb6
    Image:         10.172.49.246/zongbu-sre/alpine-arm64:3.17.0
    Image ID:      10.172.49.246/zongbu-sre/alpine-arm64@sha256:af06af3514c44a964d3b905b498cf6493db8f1cde7c10e078213a89c87308ba0
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      touch /data0/logs/server.log; tail -n0 -F /data0/logs/server.log;
    State:          Running
      Started:      Mon, 30 Jan 2023 11:27:07 +0800
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /data0 from data0 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ccks8 (ro)
  errorlog:
    Container ID:  containerd://072de3619af7b33017fae83e57253b56b464eb6b1d8e45d4fc95838bcd624cbb
    Image:         10.172.49.246/zongbu-sre/alpine-arm64:3.17.0
    Image ID:      10.172.49.246/zongbu-sre/alpine-arm64@sha256:af06af3514c44a964d3b905b498cf6493db8f1cde7c10e078213a89c87308ba0
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      touch /data0/logs/error.log; tail -n0 -F /data0/logs/error.log;
    State:          Running
      Started:      Mon, 30 Jan 2023 11:27:07 +0800
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /data0 from data0 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ccks8 (ro)
  clusterlog:
    Container ID:  containerd://3c9fa1dc607d73c78ab72bd380c5bf6121cde586875c4963b41e9d4293579685
    Image:         10.172.49.246/zongbu-sre/alpine-arm64:3.17.0
    Image ID:      10.172.49.246/zongbu-sre/alpine-arm64@sha256:af06af3514c44a964d3b905b498cf6493db8f1cde7c10e078213a89c87308ba0
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      touch /data0/logs/flash_cluster_manager.log; tail -n0 -F /data0/logs/flash_cluster_manager.log;
    State:          Running
      Started:      Mon, 30 Jan 2023 11:27:07 +0800
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /data0 from data0 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ccks8 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  data0:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  data0-tidb-test-cluster-tiflash-2
    ReadOnly:   false
  annotations:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.annotations -> annotations
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      tidb-test-cluster-tiflash-3336363
    Optional:  false
  kube-api-access-ccks8:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason   Age                   From     Message
  ----     ------   ----                  ----     -------
  Normal   Created  84m (x14 over 3h45m)  kubelet  Created container tiflash
  Normal   Started  84m (x14 over 3h45m)  kubelet  Started container tiflash
  Warning  BackOff  20m (x307 over 150m)  kubelet  Back-off restarting failed container
  Normal   Pulled   15m (x25 over 3h9m)   kubelet  Container image "10.172.49.246/zongbu-sre/tiflash-arm64:v6.1.2" already present on machine

After it stabilized I reran the TPC-H test and saw it restart due to OOM. The earliest restarts were probably also OOM, but I don't understand why it stayed in CrashLoopBackOff for so long in between. The pattern is: a long stretch of CrashLoopBackOff, then Running for a few minutes, then another crash loop.

tidb-test-cluster-tiflash-2                    3/4     OOMKilled   25 (24m ago)   3h49m
tidb-test-cluster-tiflash-2                    4/4     Running     26 (2s ago)    3h49m
tidb-test-cluster-tiflash-0                    3/4     OOMKilled   26 (19m ago)   3h50m
tidb-test-cluster-tiflash-0                    4/4     Running     27 (1s ago)    3h50m
tidb-test-cluster-tiflash-1                    3/4     OOMKilled   27 (22m ago)   3h53m

It was OOMKilled; try increasing the memory limit.

You'd need the TiFlash logs for this; the logs you posted don't show anything relevant. You could first try raising the memory limit, or cap TiFlash's own memory usage.
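As a sketch of both options (field names per TiDB Operator v1.x and TiFlash v6.1 docs; the values here are illustrative, not a sizing recommendation), the TidbCluster fragment could look like:

```yaml
spec:
  tiflash:
    limits:
      cpu: 12000m
      memory: 32Gi        # raised from 16Gi (illustrative)
    config:
      config: |
        [profiles.default]
        # Cap TiFlash's memory accounting so a heavy query is aborted by
        # TiFlash itself instead of the whole pod being OOMKilled (0 = no limit).
        max_memory_usage = 10000000000
        max_memory_usage_for_all_queries = 12000000000
```

Aborting an oversized query is usually preferable to an OOMKill, since the kernel kill takes down the whole server and triggers the restart backoff seen above.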

After I increased the memory there were indeed no more OOMs and everything is normal. I still don't know why it kept crash-looping before, though.

I have a 2 TB TiFlash that never came back up after a failure and restart; during startup the logs hit a fatal error. It's still unresolved. As a workaround I scaled out an additional TiFlash node.

I'm not very familiar with the TiFlash code, but if yours keeps crash-looping, check the logs for clues.