tikv节点中断,不能提供服务

为提高效率,提问时请提供以下信息,问题描述清晰可优先响应。

tikv服务出现了中断,伴有大量的leader drop。同时还有很多leader选举的日志出现
从下面的日志能看出重建链接可以成功,但是不能通信。
麻烦看下是什么情况

[2020/04/22 03:23:00.161 +08:00] [WARN] [raft_client.rs:118] [“batch_raft RPC finished fail”] [err=“RpcFinis
hed(Some(RpcStatus { status: Unavailable, details: Some("OS Error") }))”]
[2020/04/22 03:23:00.161 +08:00] [WARN] [raft_client.rs:132] [“batch_raft/raft RPC finally fail”] [err=“RpcF
inished(Some(RpcStatus { status: Unavailable, details: Some("OS Error") }))”] [to_addr=10.9.99.96:20160]
[2020/04/22 03:23:00.161 +08:00] [INFO] [store.rs:1946] [“broadcasting unreachable”] [unreachable_store_id=5
] [store_id=59697]
[2020/04/22 03:23:00.167 +08:00] [WARN] [raft_client.rs:118] [“batch_raft RPC finished fail”] [err=“RpcFinis
hed(Some(RpcStatus { status: Unavailable, details: Some("OS Error") }))”]
[2020/04/22 03:23:00.167 +08:00] [WARN] [raft_client.rs:132] [“batch_raft/raft RPC finally fail”] [err=“RpcF
inished(Some(RpcStatus { status: Unavailable, details: Some("OS Error") }))”] [to_addr=10.9.16.130:20160]
[2020/04/22 03:23:00.180 +08:00] [WARN] [raft_client.rs:207] [“send to 10.9.99.96:20160 fail, the gRPC conne
ction could be broken”]
[2020/04/22 03:23:00.194 +08:00] [ERROR] [transport.rs:318] [“send raft msg err”] [err=“Other("[src/server/
raft_client.rs:216]: RaftClient send fail")”]
[2020/04/22 03:23:00.194 +08:00] [WARN] [raft_client.rs:207] [“send to 10.9.16.130:20160 fail, the gRPC conn
ection could be broken”]
[2020/04/22 03:23:00.202 +08:00] [INFO] [store.rs:1946] [“broadcasting unreachable”] [unreachable_store_id=2
429] [store_id=59697]
[2020/04/22 03:23:00.209 +08:00] [ERROR] [transport.rs:318] [“send raft msg err”] [err=“Other("[src/server/
raft_client.rs:216]: RaftClient send fail")”]
[2020/04/22 03:23:00.217 +08:00] [INFO] [transport.rs:299] [“resolve store address ok”] [addr=10.9.99.96:201
60] [store_id=5]
[2020/04/22 03:23:00.217 +08:00] [INFO] [raft_client.rs:50] [“server: new connection with tikv endpoint”] [a
ddr=10.9.99.96:20160]
[2020/04/22 03:23:00.219 +08:00] [INFO] [transport.rs:299] [“resolve store address ok”] [addr=10.9.16.130:20
160] [store_id=2429]
[2020/04/22 03:23:00.219 +08:00] [INFO] [raft_client.rs:50] [“server: new connection with tikv endpoint”] [a
ddr=10.9.16.130:20160]
[2020/04/22 03:23:00.719 +08:00] [WARN] [endpoint.rs:454] [error-response] [err=“region message: "peer is n
ot leader for region 135761, leader may Some(id: 135763 store_id: 2429)" not_leader { region_id: 135761 lea
der { id: 135763 store_id: 2429 } }”]
[2020/04/22 03:23:01.779 +08:00] [WARN] [endpoint.rs:454] [error-response] [err=“region message: "peer is n
ot leader for region 164706, leader may Some(id: 164708 store_id: 2429)" not_leader { region_id: 164706 lea
der { id: 164708 store_id: 2429 } }”]
[2020/04/22 03:23:03.226 +08:00] [INFO] [subchannel.cc:878] [“Connect failed: {"created":"@1587496983.223
140374","description":"Failed to connect to remote host: OS Error","errno":110,"file":"/rust/regis
try/src/github.com-1ecc6299db9ec823/grpcio-sys-0.4.7/grpc/src/core/lib/iomgr/tcp_client_posix.cc","file_li
ne":212,"os_error":"Connection timed out","syscall":"getsockopt(SO_ERROR)","target_address":"ipv
4:10.9.16.130:20160"}”]
[2020/04/22 03:23:03.226 +08:00] [INFO] [subchannel.cc:878] [“Connect failed: {"created":"@1587496983.223
120970","description":"Failed to connect to remote host: OS Error","errno":110,"file":"/rust/regis
try/src/github.com-1ecc6299db9ec823/grpcio-sys-0.4.7/grpc/src/core/lib/iomgr/tcp_client_posix.cc","file_li
ne":212,"os_error":"Connection timed out","syscall":"getsockopt(SO_ERROR)","target_address":"ipv
4:10.9.99.96:20160"}”]
[2020/04/22 03:23:03.226 +08:00] [INFO] [subchannel.cc:758] [“Subchannel 0x7fa874396600: Retry immediately”]
[2020/04/22 03:23:03.226 +08:00] [INFO] [subchannel.cc:758] [“Subchannel 0x7fa874396a00: Retry immediately”]
[2020/04/22 03:23:03.226 +08:00] [INFO] [subchannel.cc:719] [“Failed to connect to channel, retrying”]
[2020/04/22 03:23:03.226 +08:00] [INFO] [subchannel.cc:719] [“Failed to connect to channel, retrying”]
[2020/04/22 03:23:03.226 +08:00] [WARN] [raft_client.rs:118] [“batch_raft RPC finished fail”] [err=“RpcFinis
hed(Some(RpcStatus { status: Unavailable, details: Some("Connect Failed") }))”]
[2020/04/22 03:23:03.226 +08:00] [WARN] [raft_client.rs:132] [“batch_raft/raft RPC finally fail”] [err=“RpcF
inished(Some(RpcStatus { status: Unavailable, details: Some("Connect Failed") }))”] [to_addr=10.9.16.130:2
0160]
[2020/04/22 03:23:03.226 +08:00] [WARN] [raft_client.rs:118] [“batch_raft RPC finished fail”] [err=“RpcFinis
hed(Some(RpcStatus { status: Unavailable, details: Some("Connect Failed") }))”]
[2020/04/22 03:23:03.226 +08:00] [WARN] [raft_client.rs:132] [“batch_raft/raft RPC finally fail”] [err=“RpcF
inished(Some(RpcStatus { status: Unavailable, details: Some("Connect Failed") }))”] [to_addr=10.9.99.96:20
160]
[2020/04/22 03:23:03.235 +08:00] [WARN] [raft_client.rs:207] [“send to 10.9.99.96:20160 fail, the gRPC conne
ction could be broken”]
[2020/04/22 03:23:03.239 +08:00] [ERROR] [transport.rs:318] [“send raft msg err”] [err=“Other("[src/server/
raft_client.rs:216]: RaftClient send fail")”]
[2020/04/22 03:23:03.239 +08:00] [WARN] [raft_client.rs:207] [“send to 10.9.16.130:20160 fail, the gRPC conn
ection could be broken”]
[2020/04/22 03:23:03.244 +08:00] [ERROR] [transport.rs:318] [“send raft msg err”] [err=“Other("[src/server/
raft_client.rs:216]: RaftClient send fail")”]
[2020/04/22 03:23:03.244 +08:00] [INFO] [transport.rs:299] [“resolve store address ok”] [addr=10.9.16.130:20
160] [store_id=2429]
[2020/04/22 03:23:03.244 +08:00] [INFO] [raft_client.rs:50] [“server: new connection with tikv endpoint”] [a
ddr=10.9.16.130:20160]
[2020/04/22 03:23:03.244 +08:00] [INFO] [transport.rs:299] [“resolve store address ok”] [addr=10.9.99.96:201
60] [store_id=5]
[2020/04/22 03:23:03.244 +08:00] [INFO] [raft_client.rs:50] [“server: new connection with tikv endpoint”] [a
ddr=10.9.99.96:20160]

  1. 从 TiKV 日志里面看需要先排查一下 TiKV 节点的网络是不是有抖动、瞬断的问题

  2. 排查一下 TiKV 服务是否重启,demsg log 里面有没有服务 oom 的报错。

  3. 另外建议发一下 TiKV-details、PD、TiDB 的监控截图,你可以通过 Chrome 浏览器的 full pages screen 插件截取 TiKV 出现异常的 前后 2小时间的监控,我们来做进一步分析。

dmesg看了下没有OOM报错的日志。tikv日志在时间点附近没看到有网络相关的log,有关键字么,我查一下。

截图太大放在了云盘

链接:https://pan.baidu.com/s/1IuDRe1NsKjts__zqnZQQow 密码:eus5

您好:

      从上面的日志看,有一些os的异常, 能否麻烦查看 操作系统日志/var/log/message ,不同操作系统可能日志目录不同,看下对应的message在问题发生阶段,是否有报错,多谢

没有看到tidb报错的日志,不过有kafka报的链接异常

[2020/04/22 03:23:00.219 +08:00] [INFO] [raft_client.rs:50] [“server: new connection with tikv endpoint”] [a ddr=10.9.16.130:20160] [2020/04/22 03:23:03.244 +08:00] [INFO] [raft_client.rs:50] [“server: new connection with tikv endpoint”] [a ddr=10.9.99.96:20160]

上面这样的日志是代表重联成功了么

目前3台tikv中有一台出现了网络抖动,但是集群整体的查询时间达到了十几秒,对我们的服务来说基本是不可用的状态,我想问下这种情况是否正常呢,貌似达不到高可用的状态

您好:

    1. 请把出问题时间段的/var/log/message 操作系统日志 和 tikv.log 上传
    2. 请帮忙获取over-view,tidb, detail-tikv 的监控信息.  时间选择出问题时间段

(1)、chrome 安装这个插件https://chrome.google.com/webstore/detail/full-page-screen-capture/fdpohaocaechififmbbbbbknoalclacl

(2)、鼠标焦点置于 Dashboard 上,按 ?可显示所有快捷键,先按 d 再按 E 可将所有 Rows 的 Panels 打开,需等待一段时间待页面加载完成。

(3)、使用这个 full-page-screen-capture 插件进行截屏保存

截图太大老是上传失败,放在了云盘

链接:百度网盘 请输入提取码 密码:eus5

messages

tikv.log

[2020/04/22 03:18:45.349 +08:00] [INFO] [gc_worker.rs:904] [“gc_worker: start auto gc”] [safe_point=416152543523504128]

[2020/04/22 03:19:05.848 +08:00] [INFO] [gc_worker.rs:944] [“gc_worker: finished auto gc”] [processed_regions=7601]

[2020/04/22 03:23:00.161 +08:00] [WARN] [raft_client.rs:118] [“batch_raft RPC finished fail”] [err=“RpcFinished(Some(RpcStatus { status: Unavailable, details: Some("OS Error") }))”]

[2020/04/22 03:23:00.161 +08:00] [WARN] [raft_client.rs:132] [“batch_raft/raft RPC finally fail”] [err=“RpcFinished(Some(RpcStatus { status: Unavailable, details: Some("OS Error") }))”] [to_addr=10.9.99.96:20160]

[2020/04/22 03:23:00.161 +08:00] [INFO] [store.rs:1946] [“broadcasting unreachable”] [unreachable_store_id=5] [store_id=59697]

[2020/04/22 03:23:00.167 +08:00] [WARN] [raft_client.rs:118] [“batch_raft RPC finished fail”] [err=“RpcFinished(Some(RpcStatus { status: Unavailable, details: Some("OS Error") }))”]

[2020/04/22 03:23:00.167 +08:00] [WARN] [raft_client.rs:132] [“batch_raft/raft RPC finally fail”] [err=“RpcFinished(Some(RpcStatus { status: Unavailable, details: Some("OS Error") }))”] [to_addr=10.9.16.130:20160]

[2020/04/22 03:23:00.180 +08:00] [WARN] [raft_client.rs:207] [“send to 10.9.99.96:20160 fail, the gRPC connection could be broken”]

[2020/04/22 03:23:00.194 +08:00] [ERROR] [transport.rs:318] [“send raft msg err”] [err=“Other("[src/server/raft_client.rs:216]: RaftClient send fail")”]

[2020/04/22 03:23:00.194 +08:00] [WARN] [raft_client.rs:207] [“send to 10.9.16.130:20160 fail, the gRPCconnection could be broken”]

[2020/04/22 03:23:00.202 +08:00] [INFO] [store.rs:1946] [“broadcasting unreachable”] [unreachable_store_id=2429] [store_id=59697]

[2020/04/22 03:23:00.209 +08:00] [ERROR] [transport.rs:318] [“send raft msg err”] [err=“Other("[src/server/raft_client.rs:216]: RaftClient send fail")”]

[2020/04/22 03:23:00.217 +08:00] [INFO] [transport.rs:299] [“resolve store address ok”] [addr=10.9.99.96:20160] [store_id=5]

[2020/04/22 03:23:00.217 +08:00] [INFO] [raft_client.rs:50] [“server: new connection with tikv endpoint”] [addr=10.9.99.96:20160]

[2020/04/22 03:23:00.219 +08:00] [INFO] [transport.rs:299] [“resolve store address ok”] [addr=10.9.16.130:20160] [store_id=2429]

[2020/04/22 03:23:00.219 +08:00] [INFO] [raft_client.rs:50] [“server: new connection with tikv endpoint”] [addr=10.9.16.130:20160]

[2020/04/22 03:23:00.719 +08:00] [WARN] [endpoint.rs:454] [error-response] [err=“region message: "peeris not leader for region 135761, leader may Some(id: 135763 store_id: 2429)" not_leader { region_id: 135761 leader { id: 135763 store_id: 2429 } }”]

[2020/04/22 03:23:01.779 +08:00] [WARN] [endpoint.rs:454] [error-response] [err=“region message: "peeris not leader for region 164706, leader may Some(id: 164708 store_id: 2429)" not_leader { region_id: 164706 leader { id: 164708 store_id: 2429 } }”]

[2020/04/22 03:23:03.226 +08:00] [INFO] [subchannel.cc:878] [“Connect failed: {"created":"@1587496983.223140374","description":"Failed to connect to remote host: OS Error","errno":110,"file":"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.4.7/grpc/src/core/lib/iomgr/tcp_client_posix.cc","file_line":212,"os_error":"Connection timed out","syscall":"getsockopt(SO_ERROR)","target_address":"ipv4:10.9.16.130:20160"}”]
[2020/04/22 03:23:03.226 +08:00] [INFO] [subchannel.cc:878] [“Connect failed: {"created":"@1587496983.223120970","description":"Failed to connect to remote host: OS Error","errno":110,"file":"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.4.7/grpc/src/core/lib/iomgr/tcp_client_posix.cc","file_line":212,"os_error":"Connection timed out","syscall":"getsockopt(SO_ERROR)","target
_address":"ipv4:10.9.99.96:20160"}”]

[2020/04/22 03:23:03.226 +08:00] [INFO] [subchannel.cc:758] [“Subchannel 0x7fa874396600: Retry immediately”]

[2020/04/22 03:23:03.226 +08:00] [INFO] [subchannel.cc:758] [“Subchannel 0x7fa874396a00: Retry immediately”]

[2020/04/22 03:23:03.226 +08:00] [INFO] [subchannel.cc:719] [“Failed to connect to channel, retrying”]

[2020/04/22 03:23:03.226 +08:00] [INFO] [subchannel.cc:719] [“Failed to connect to channel, retrying”]

[2020/04/22 03:23:03.226 +08:00] [WARN] [raft_client.rs:118] [“batch_raft RPC finished fail”] [err=“RpcFinished(Some(RpcStatus { status: Unavailable, details: Some("Connect Failed") }))”]

[2020/04/22 03:23:03.226 +08:00] [WARN] [raft_client.rs:132] [“batch_raft/raft RPC finally fail”] [err=“RpcFinished(Some(RpcStatus { status: Unavailable, details: Some("Connect Failed") }))”] [to_addr=10.9.16.130:20160]

只有一台节点有这个问题,但是问题发生时基本不能提供服务了

  1. 看监控3个tikv实例,问题发生的是145实例吗? 从日志看访问报错,似乎是OS有些问题,但是您没有上传完整的OS日志,如果不方便,请检查/var/log/message日志,类似kafka断开前,是否还有其他报错。 另外,这个kafka是和tikv部署在同个服务器吗?

  2. tikv实例出问题后,开始有大量的调度,这个时候,如果在此有问题实例上的leader,业务访问时会有问题,所以大概有一段时间感觉业务有很多业务受影响。

  1. 问题发生的是145。messages只有kafka的报错,没有其他的,和tikv是一台机器上。

  2. 3台机器如果仅有一台有抖动就不能提供服务的话,对高可用性来说就是没达到吧。有什么办法规避么

  1. 麻烦反馈下over-view ,145实例的disk-performance,node-exporter监控,注意是问题发生时的监控,我看下当时的内存和IO等信息。 可以看到kafka当时报了IO time out,应该是这个服务器当时出了问题
  2. 不是有一台有抖动就不能提供服务,是这个有问题的服务也要有一个leader转移的时间,这个时间这部分leader的信息没法提供服务。
  3. 请问您这个kafka和tidb有关系吗?是给binlog工具使用的吗?如果没有,建议可以分开部署

最近几天已经多次发生了,麻烦看下

另两个太大,还是放网盘了

链接:百度网盘 请输入提取码 密码:r4vr

刚在96的节点也发生了一次,现象一样,网络超时,但是网络链接都是好的

  1. 麻烦还是把tikv.log 和 /var/log/message 上传下吧. 或者反馈下dmesg的结果,多谢
  2. 初步看和kafka这里有关系 https://elasticsearch.cn/question/7876 ,请问这个是binlog使用的kafka吗?

kafka主要是收集系统日志用的,是和kafka有冲突吗

messages日志 messages (102.1 KB)

tikv.log.2020-04-22-09_41_23.zip (1.6 MB)

你好,

你的问题已收到,正在分析,请稍等

您好: 从日志看,message日志kafka报错后,tikv日志开始打印错误。能否尝试先关闭kafka或者移动到其他服务器,看看是否还会有这个问题,多谢

有点疑问,我们不是最近才部署kafka,服务一直都在,是什么原因触发了tikv的错误呢,谢谢