h5n1 (H5n1) #1
Versions: OS 4.19.90-17.ky10.aarch64, Kubernetes v1.24.9, TiDB Operator 1.4.0, TiDB 6.1.2
dyrnq/local-volume-provisioner:v2.5.0 (pod time zone not adjusted)
Configuration:
tiflash:
  baseImage: 10.xxxx/zongbu-sre/tiflash-arm64:v6.1.2
  replicas: 3
  limits:
    cpu: 12000m
    memory: 16Gi
  imagePullPolicy: IfNotPresent
  storageClaims:
  - resources:
      requests:
        storage: 500Gi
    storageClassName: tiflash-storage
Status:
tidb-test-cluster-tiflash-0 4/4 Running 26 (6m28s ago) 3h36m
tidb-test-cluster-tiflash-1 4/4 Running 27 (5m59s ago) 3h36m
tidb-test-cluster-tiflash-2 4/4 Running 25 (11m ago) 3h36m
Logs:
previous.txt (620.3 KB)
current.log (620.3 KB)
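For reference, logs like these can be pulled from the TiFlash container and its previous, crashed instance with something along these lines (pod and container names taken from the status above):

```
kubectl logs tidb-test-cluster-tiflash-2 -c tiflash --previous > previous.txt
kubectl logs tidb-test-cluster-tiflash-2 -c tiflash > current.log
```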
There are no exit-related messages in the logs. Could you run `kubectl describe` on the pod and see why it keeps restarting?
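For example, something like this; the Last State, Exit Code, and Events sections usually show why a container was restarted:

```
kubectl describe pod tidb-test-cluster-tiflash-0 -n default
```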
h5n1 (H5n1) #3
Not much useful information there either. After posting, the pods stayed stable for about 20 minutes:
tidb-test-cluster-tiflash-0 4/4 Running 26 (16m ago) 3h46m
tidb-test-cluster-tiflash-1 4/4 Running 27 (16m ago) 3h46m
tidb-test-cluster-tiflash-2 4/4 Running 25 (21m ago) 3h46m
Status:         Running
IP:             172.16.228.157
IPs:
  IP:  172.16.228.157
Controlled By:  StatefulSet/tidb-test-cluster-tiflash
Init Containers:
  init:
    Container ID:  containerd://3067f4400a71ad11516856dccdb2730aee11acf59209fb410e7c1a30f975c937
    Image:         10.172.49.246/zongbu-sre/alpine-arm64:3.17.0
    Image ID:      10.172.49.246/zongbu-sre/alpine-arm64@sha256:af06af3514c44a964d3b905b498cf6493db8f1cde7c10e078213a89c87308ba0
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      set -ex;ordinal=`echo ${POD_NAME} | awk -F- '{print $NF}'`;sed s/POD_NUM/${ordinal}/g /etc/tiflash/config_templ.toml > /data0/config.toml;sed s/POD_NUM/${ordinal}/g /etc/tiflash/proxy_templ.toml > /data0/proxy.toml
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 30 Jan 2023 11:26:57 +0800
      Finished:     Mon, 30 Jan 2023 11:26:57 +0800
    Ready:          True
    Restart Count:  0
    Environment:
      POD_NAME:  tidb-test-cluster-tiflash-2 (v1:metadata.name)
    Mounts:
      /data0 from data0 (rw)
      /etc/tiflash from config (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ccks8 (ro)
Containers:
  tiflash:
    Container ID:  containerd://4f01a3705a5e355f152addc929e5d71b8608212dedc2a60845c8166ab24358a7
    Image:         10.172.49.246/zongbu-sre/tiflash-arm64:v6.1.2
    Image ID:      10.172.49.246/zongbu-sre/tiflash-arm64@sha256:96f39d55b339c1b9e61f09fe8c8d6e0ef69add8557a0cf77a4b340d561f8c0aa
    Ports:         3930/TCP, 20170/TCP, 9000/TCP, 8123/TCP, 9009/TCP, 8234/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP
    Command:
      /bin/sh
      -c
      /tiflash/tiflash server --config-file /data0/config.toml
    State:          Running
      Started:      Mon, 30 Jan 2023 14:57:11 +0800
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 30 Jan 2023 14:51:26 +0800
      Finished:     Mon, 30 Jan 2023 14:52:03 +0800
    Ready:          True
    Restart Count:  25
    Limits:
      cpu:     12
      memory:  16Gi
    Requests:
      cpu:     12
      memory:  16Gi
    Environment:
      NAMESPACE:              default (v1:metadata.namespace)
      CLUSTER_NAME:           tidb-test-cluster
      HEADLESS_SERVICE_NAME:  tidb-test-cluster-tiflash-peer
      CAPACITY:               0
      TZ:                     UTC
    Mounts:
      /data0 from data0 (rw)
      /etc/podinfo from annotations (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ccks8 (ro)
  serverlog:
    Container ID:  containerd://3b48f767a9a2fcb893ce14902a2f822d2dc287cc2760fed926f9b4c202347eb6
    Image:         10.172.49.246/zongbu-sre/alpine-arm64:3.17.0
    Image ID:      10.172.49.246/zongbu-sre/alpine-arm64@sha256:af06af3514c44a964d3b905b498cf6493db8f1cde7c10e078213a89c87308ba0
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      touch /data0/logs/server.log; tail -n0 -F /data0/logs/server.log;
    State:          Running
      Started:      Mon, 30 Jan 2023 11:27:07 +0800
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /data0 from data0 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ccks8 (ro)
  errorlog:
    Container ID:  containerd://072de3619af7b33017fae83e57253b56b464eb6b1d8e45d4fc95838bcd624cbb
    Image:         10.172.49.246/zongbu-sre/alpine-arm64:3.17.0
    Image ID:      10.172.49.246/zongbu-sre/alpine-arm64@sha256:af06af3514c44a964d3b905b498cf6493db8f1cde7c10e078213a89c87308ba0
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      touch /data0/logs/error.log; tail -n0 -F /data0/logs/error.log;
    State:          Running
      Started:      Mon, 30 Jan 2023 11:27:07 +0800
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /data0 from data0 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ccks8 (ro)
  clusterlog:
    Container ID:  containerd://3c9fa1dc607d73c78ab72bd380c5bf6121cde586875c4963b41e9d4293579685
    Image:         10.172.49.246/zongbu-sre/alpine-arm64:3.17.0
    Image ID:      10.172.49.246/zongbu-sre/alpine-arm64@sha256:af06af3514c44a964d3b905b498cf6493db8f1cde7c10e078213a89c87308ba0
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      touch /data0/logs/flash_cluster_manager.log; tail -n0 -F /data0/logs/flash_cluster_manager.log;
    State:          Running
      Started:      Mon, 30 Jan 2023 11:27:07 +0800
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /data0 from data0 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ccks8 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  data0:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  data0-tidb-test-cluster-tiflash-2
    ReadOnly:   false
  annotations:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.annotations -> annotations
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      tidb-test-cluster-tiflash-3336363
    Optional:  false
  kube-api-access-ccks8:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason   Age                   From     Message
  ----     ------   ----                  ----     -------
  Normal   Created  84m (x14 over 3h45m)  kubelet  Created container tiflash
  Normal   Started  84m (x14 over 3h45m)  kubelet  Started container tiflash
  Warning  BackOff  20m (x307 over 150m)  kubelet  Back-off restarting failed container
  Normal   Pulled   15m (x25 over 3h9m)   kubelet  Container image "10.172.49.246/zongbu-sre/tiflash-arm64:v6.1.2" already present on machine
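The `Last State: Terminated / Reason: OOMKilled / Exit Code: 137` above is the key signal. As a shortcut, the same field can be pulled for all three pods at once; a sketch, assuming the standard TiDB Operator component label:

```
kubectl get pods -l app.kubernetes.io/component=tiflash \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[?(@.name=="tiflash")].lastState.terminated.reason}{"\n"}{end}'
```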
h5n1 (H5n1) #4
After things stabilized I reran the TPC-H test and saw the pods restart because of OOM again, so the original restarts were presumably OOM as well. What I don't understand is why the pods then sat in CrashLoopBackOff for such a long time in between: the pattern was a long stretch of CrashLoopBackOff, then Running for a few minutes, then another crash loop.
tidb-test-cluster-tiflash-2 3/4 OOMKilled 25 (24m ago) 3h49m
tidb-test-cluster-tiflash-2 4/4 Running 26 (2s ago) 3h49m
tidb-test-cluster-tiflash-0 3/4 OOMKilled 26 (19m ago) 3h50m
tidb-test-cluster-tiflash-0 4/4 Running 27 (1s ago) 3h50m
tidb-test-cluster-tiflash-1 3/4 OOMKilled 27 (22m ago) 3h53m
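The long CrashLoopBackOff stretches are ordinary kubelet behavior: the restart backoff doubles from 10s up to a 5-minute cap and only resets after the container has run cleanly for 10 minutes, so a pod that keeps OOMing spends most of its time in BackOff. To confirm the kills come from the kernel's cgroup OOM killer (exit code 137 = SIGKILL), the node's kernel log can be checked; a sketch, run on the node hosting the pod:

```
dmesg -T | grep -i -E "out of memory|oom-kill|killed process"
```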
You'd need the TiFlash logs for this; there's nothing relevant in what you posted. You could try increasing the memory limit first, or capping TiFlash's own memory usage; see the sketch below.
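A sketch of both options against the TidbCluster CR, assuming TiDB Operator 1.4's raw-TOML passthrough under spec.tiflash.config.config; the 24Gi figure is illustrative, and the TOML key should be double-checked against the TiFlash v6.1 docs before applying:

```yaml
tiflash:
  replicas: 3
  requests:
    memory: 24Gi            # illustrative: raised from 16Gi
  limits:
    cpu: 12000m
    memory: 24Gi
  config:
    config: |
      [profiles.default]
      # cap memory used by all queries; values in (0, 1] are a
      # ratio of available RAM, 0 means unlimited
      max_memory_usage_for_all_queries = 0.8
```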
h5n1 (H5n1) #7
After I increased the memory there were indeed no more OOMs and everything runs normally. I still don't know what caused the earlier continuous crash loop, though.
I have a 2 TB TiFlash that never came back up after a crash: during startup the log always reaches a fatal point. It's still unresolved; as a workaround I scaled out an additional TiFlash node.
I don't know the TiFlash code well. If yours keeps crash-looping, check the logs for clues; see the sketch below.
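Since the serverlog/errorlog sidecars in the describe output above tail TiFlash's own log files, the fatal point should be visible without exec'ing into the pod, e.g.:

```
kubectl logs tidb-test-cluster-tiflash-2 -c errorlog --tail=500
kubectl logs tidb-test-cluster-tiflash-2 -c serverlog --tail=2000 | grep -i -E "fatal|exception"
```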
Closed #9
This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.