Scaling out TiFlash on k8s fails

【TiDB environment】Test environment
【TiDB version】v6.0.0-dmr
【Problem encountered】
Scaling out TiFlash on k8s fails
【Reproduction steps】
Run:

kubectl edit tc basic -n tidb-cluster

Edit the configuration and add the following section:

  tiflash:
    baseImage: pingcap/tiflash
    maxFailoverCount: 3
    replicas: 3
    storageClaims:
    - resources:
        requests:
          storage: 10Gi

The result after the change is as follows:


【Symptoms and impact】
After scaling out, the pod never comes up. The log shows this error:

Poco::Exception. Code: 1000, e.code() = 0, e.displayText() = Exception: Cannot set max size of core file to 1073741824, e.what() = Exception

That is the only line. Screenshots of the other status information follow:





Is there a more detailed error log?


Please provide more information.

I'll ask Wei (韦神) to take a look for you.


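It is deployed on k8s, and the log really has only this one line, which worries me too. Do you want the control logs? I did not see any related control logs, so I did not post them. Or am I looking in the wrong place? Are there particular logs I should post?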


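This happens because Docker has no ulimit configured. You need to raise the limit in the service file to at least the value shown in the error message. Follow the documentation below.
Reference:
https://docs.pingcap.com/zh/tidb-in-kubernetes/stable/prerequisites#docker-服务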


I will test this right away and post the result shortly.


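I tested it following the documentation. The state is the same as described above, and the problem is not solved.
The values after the change are as follows.
After reloading the docker service:


The value in the configuration file, set to the value from the error message as suggested:

The relevant limits on the host: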

serverlog container log in the pod:
touch: /data0/logs/server.log: No such file or directory

2022/5/13 19:03:59 tail: can't open '/data0/logs/server.log': No such file or directory
errorlog container log in the pod:
touch: /data0/logs/error.log: No such file or directory

2022/5/13 19:03:59 tail: can't open '/data0/logs/error.log': No such file or directory
clusterlog container log in the pod:
touch: /data0/logs/flash_cluster_manager.log: No such file or directory

2022/5/13 19:04:00 tail: can't open '/data0/logs/flash_cluster_manager.log': No such file or directory
init container log in the pod:

+ echo basic-tiflash-2

2022/5/13 19:03:58 + awk -F- {print $NF}

2022/5/13 19:03:58 + ordinal=2

2022/5/13 19:03:58 + sed s/POD_NUM/2/g /etc/tiflash/config_templ.toml

2022/5/13 19:03:58 + sed s/POD_NUM/2/g /etc/tiflash/proxy_templ.toml
Storage mount status:


tidb-controller-manager log:
container (1).log (23.1 KB)
ConfigMap:
basic-tiflash-3664353.yaml (2.4 KB)
Addendum: I later verified that the setting "root soft stack 10240" is risky. It can easily keep the test environment's virtual machine from booting.


After reading the source code, I solved this by setting the limit in Docker's systemd service.
Following the error message into the source, I found:

        struct rlimit rlim;
        if (getrlimit(RLIMIT_CORE, &rlim))
            throw Poco::Exception("Cannot getrlimit");
        /// 1 GiB by default. If more - it writes to disk too long.
        rlim.rlim_cur = config().getUInt64("core_dump.size_limit", 1024 * 1024 * 1024);

        if (setrlimit(RLIMIT_CORE, &rlim))
        {
            std::string message = "Cannot set max size of core file to " + std::to_string(rlim.rlim_cur);
#if !defined(ADDRESS_SANITIZER) && !defined(THREAD_SANITIZER) && !defined(MEMORY_SANITIZER) && !defined(SANITIZER)
            throw Poco::Exception(message);
#else
            /// It doesn't work under address/thread sanitizer. http://lists.llvm.org/pipermail/llvm-bugs/2013-April/027880.html
            std::cerr << message << std::endl;
#endif
        }

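Now it makes sense: the failure happens when setting the core dump file size limit, because the requested value is too large. The code only changes the soft limit (rlim_cur), and setrlimit rejects a soft limit that is higher than the hard limit (rlim_max).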
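The same failure mode can be reproduced outside TiFlash with the shell's ulimit builtin. This is only a minimal sketch, assuming bash on Linux; the helper name try_core_limit is hypothetical, and bash counts ulimit -c in 1024-byte blocks, so 1048576 blocks corresponds to the 1 GiB default in the source above.

```shell
#!/usr/bin/env bash
# Sketch: show that a soft core limit above the hard limit is rejected,
# which is the condition under which the setrlimit() call in TiFlash fails.

# try_core_limit HARD WANT (hypothetical helper, values in ulimit -c blocks)
# Sets both limits to HARD, then tries to set only the soft limit to WANT.
try_core_limit() {
    local hard_blocks=$1 want_blocks=$2
    # Run in a subshell so the lowered limits do not stick to the caller.
    (
        ulimit -c "$hard_blocks" 2>/dev/null || exit 2   # soft and hard := HARD
        ulimit -Sc "$want_blocks" 2>/dev/null            # soft := WANT, hard unchanged
    )
}

# With a hard limit of 0, asking for 1 GiB (1048576 blocks) must fail.
if try_core_limit 0 1048576; then
    echo "soft limit accepted"
else
    echo "soft limit rejected: it exceeds the hard limit"
fi
```

Lowering the hard limit to 0 is always permitted, which is why the demo uses it; inside the TiFlash container the hard limit comes from whatever the container runtime was started with.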
First, check the setting on the host:

[root@host ~]# ulimit -c
unlimited

The host is unlimited, so the limit inside the pod is probably what is wrong. Looking through the Dockerfile:

FROM hub.pingcap.net/tiflash/centos:7.9.2009-amd64

COPY misc /misc

RUN sh /misc/bake_llvm_base_amd64.sh

ENV PATH="/opt/cmake/bin:/usr/local/bin/:${PATH}:/usr/local/go/bin:/root/.cargo/bin" \
    LIBRARY_PATH="/usr/local/lib/x86_64-unknown-linux-gnu/:${LIBRARY_PATH}" \
    LD_LIBRARY_PATH="/usr/local/lib/x86_64-unknown-linux-gnu/:${LD_LIBRARY_PATH}" \
    CPLUS_INCLUDE_PATH="/usr/local/include/x86_64-unknown-linux-gnu/c++/v1/:${CPLUS_INCLUDE_PATH}" \
    CC=clang \
    CXX=clang++ \
    LD=ld.lld

USER root
WORKDIR /root/
ENV HOME /root/
ENV TZ Asia/Shanghai
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone

And the script it runs, /misc/bake_llvm_base_amd64.sh:

function bake_llvm_base_amd64() {
    export PATH="/opt/cmake/bin:/usr/local/bin/:${PATH}"
    export LIBRARY_PATH="/usr/local/lib/x86_64-unknown-linux-gnu/:${LIBRARY_PATH:+LIBRARY_PATH:}"
    export LD_LIBRARY_PATH="/usr/local/lib/x86_64-unknown-linux-gnu/:${LD_LIBRARY_PATH:+LD_LIBRARY_PATH:}"
    export CPLUS_INCLUDE_PATH="/usr/local/include/x86_64-unknown-linux-gnu/c++/v1/:${CPLUS_INCLUDE_PATH:+CPLUS_INCLUDE_PATH:}"
    SCRIPTPATH=$(cd $(dirname "$0"); pwd -P)

    # Basic Environment
    source $SCRIPTPATH/prepare_basic.sh
    prepare_basic
    
    # CMake
    source $SCRIPTPATH/install_cmake.sh
    install_cmake "3.22.1" "x86_64"

    # LLVM
    source $SCRIPTPATH/bootstrap_llvm.sh
    bootstrap_llvm "13.0.0"
    export CC=clang
    export CXX=clang++
    export LD=ld.lld

    # Go
    source $SCRIPTPATH/install_go.sh
    install_go "1.17" "amd64"
    export PATH="$PATH:/usr/local/go/bin"

    # Rust
    source $SCRIPTPATH/install_rust.sh
    install_rust 
    source $HOME/.cargo/env

    # ccache
    source $SCRIPTPATH/install_ccache.sh
    install_ccache "4.5.1"
}

bake_llvm_base_amd64

Nothing in there sets ulimit. To be able to get into the pod and take a look, I first guessed and lowered core_dump.size_limit:
kubectl edit tc basic -n tidb-cluster
with the following setting:

  tiflash:
    baseImage: uhub.service.ucloud.cn/pingcap/tiflash
    config:
      config: |
        core_dump.size_limit = 1024
    maxFailoverCount: 3
    replicas: 3
    storageClaims:
    - resources:
        requests:
          storage: 10Gi

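That had no effect, so I edited the configuration file in the ConfigMap directly:
I added core_dump.size_limit = 1024 to config_templ.toml and restarted the pod, after which TiFlash started normally.
Entering the pod to check: [image]
This should be the reason it could not start.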
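One way the limit inside the pod might be inspected is to read /proc/1/limits from a running container. The sketch below is an assumption-laden illustration, not what the screenshot showed: core_limits is a hypothetical helper, the pod name basic-tiflash-2 is taken from the init log, and the container name tiflash is assumed.

```shell
#!/usr/bin/env bash
# Sketch: print the soft and hard core-file limits from a /proc/<pid>/limits table.

core_limits() {
    # Reads the table on stdin. The relevant row has the form:
    #   Max core file size   <soft>   <hard>   bytes
    awk '/^Max core file size/ { print "soft=" $5, "hard=" $6 }'
}

# Local demo: limits of the current shell (only where /proc is available).
if [ -r /proc/self/limits ]; then
    core_limits < /proc/self/limits
fi

# In the cluster, the same parser could be fed from a running container
# (pod name from the init log; container name "tiflash" is an assumption):
#   kubectl exec -n tidb-cluster basic-tiflash-2 -c tiflash -- cat /proc/1/limits | core_limits
```

The values in /proc/<pid>/limits are in bytes, so a hard value below 1073741824 would be consistent with the error message.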
In /etc/systemd/system/docker.service.d/, add a file named
limit-core.conf

[Service]
LimitCORE=infinity

systemctl daemon-reload
systemctl restart docker.service
After that I removed the core_dump.size_limit = 1024 setting, and the problem was solved.
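For repeating the fix on several nodes, the manual steps could be scripted roughly as follows. This is a sketch under assumptions: write_core_dropin is a hypothetical helper, and the demo writes into a temporary directory instead of the real systemd path so that it can be run without root.

```shell
#!/usr/bin/env bash
# Sketch: write the limit-core.conf drop-in into a target directory.

write_core_dropin() {
    # Default target is the Docker drop-in directory used in this thread.
    local dir=${1:-/etc/systemd/system/docker.service.d}
    mkdir -p "$dir" || return 1
    printf '[Service]\nLimitCORE=infinity\n' > "$dir/limit-core.conf"
}

# Harmless demo: write into a scratch directory and show the result.
scratch=$(mktemp -d)
write_core_dropin "$scratch"
cat "$scratch/limit-core.conf"
rm -rf "$scratch"

# On a real node, as root, the sequence would be something like:
#   write_core_dropin
#   systemctl daemon-reload
#   systemctl restart docker.service
#   systemctl show docker.service -p LimitCORE   # inspect the effective value
```

Restarting docker.service may restart the containers on that node, so it is probably best done one node at a time.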
