PD and TiDB deployed on one node are both down: asking for help

变又未变 · May 14, 2021, 10:01


[TiDB version]
v4.0.10
[Problem description]
A 3-node cluster. One node hosts both PD and TiDB, and both components on that node are now down:

Partial PD log. The error looks like a permissions problem; what is going on here?
[2021/04/16 12:04:57.483 +08:00] [ERROR] [heartbeat_streams.go:122] ["send keepalive message fail"] [target-store-id=5] [error=EOF]
[2021/04/16 12:04:57.483 +08:00] [ERROR] [heartbeat_streams.go:122] ["send keepalive message fail"] [target-store-id=4] [error=EOF]
[2021/04/16 12:12:19.060 +08:00] [INFO] [grpc_service.go:815] ["update service GC safe point"] [service-id=gc_worker] [expire-at=9223372036854775807] [safepoint=424292054210707456]
[2021/04/16 12:13:59.694 +08:00] [INFO] [grpc_service.go:760] ["updated gc safe point"] [safe-point=424292054210707456]
[2021/04/16 12:14:57.483 +08:00] [ERROR] [heartbeat_streams.go:122] ["send keepalive message fail"] [target-store-id=5] [error=EOF]
[2021/04/16 12:14:57.483 +08:00] [ERROR] [heartbeat_streams.go:122] ["send keepalive message fail"] [target-store-id=4] [error=EOF]
[2021/04/16 12:22:19.045 +08:00] [INFO] [grpc_service.go:815] ["update service GC safe point"] [service-id=gc_worker] [expire-at=9223372036854775807] [safepoint=424292211497107456]
[2021/04/16 12:23:59.560 +08:00] [INFO] [grpc_service.go:760] ["updated gc safe point"] [safe-point=424292211497107456]
[2021/04/16 12:24:57.482 +08:00] [ERROR] [heartbeat_streams.go:122] ["send keepalive message fail"] [target-store-id=4] [error=EOF]
[2021/04/16 12:24:57.483 +08:00] [ERROR] [heartbeat_streams.go:122] ["send keepalive message fail"] [target-store-id=5] [error=EOF]
[2021/04/16 12:32:19.045 +08:00] [INFO] [grpc_service.go:815] ["update service GC safe point"] [service-id=gc_worker] [expire-at=9223372036854775807] [safepoint=424292368783507456]
[2021/04/16 12:33:59.585 +08:00] [INFO] [grpc_service.go:760] ["updated gc safe point"] [safe-point=424292368783507456]
[2021/04/16 12:34:57.483 +08:00] [ERROR] [heartbeat_streams.go:122] ["send keepalive message fail"] [target-store-id=5] [error=EOF]
[2021/04/16 12:34:57.483 +08:00] [ERROR] [heartbeat_streams.go:122] ["send keepalive message fail"] [target-store-id=4] [error=EOF]
[2021/04/16 12:35:05.115 +08:00] [FATAL] [server.go:834] ["failed to purge wal file"] [error="open /xuegangtidb/tidb-data/pd-14279/member/wal: permission denied"] [stack="go.etcd.io/etcd/etcdserver.(*EtcdServer).purgeFile
\t/home/jenkins/agent/workspace/build_pd_multi_branch_v4.0.10/go/pkg/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20191023171146-3cf2f69b5738/etcdserver/server.go:834
go.etcd.io/etcd/etcdserver.(*EtcdServer).goAttach.func1
\t/home/jenkins/agent/workspace/build_pd_multi_branch_v4.0.10/go/pkg/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20191023171146-3cf2f69b5738/etcdserver/server.go:2632"]
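
The FATAL entry names the directory that PD could not open. A minimal Python sketch for inspecting it, assuming a Linux host and that the script is run as the same OS user that starts PD; the helper name describe_access is made up for this illustration:

```python
import os
import pwd
import stat


def describe_access(path: str) -> dict:
    """Report owner, mode and effective access of `path` for the current user."""
    st = os.stat(path)
    try:
        owner = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        owner = str(st.st_uid)  # uid that has no passwd entry
    return {
        "path": path,
        "owner": owner,
        "mode": stat.filemode(st.st_mode),
        # Listing and opening files in a directory needs read + execute;
        # creating or deleting files inside it needs write as well.
        "can_read": os.access(path, os.R_OK),
        "can_write": os.access(path, os.W_OK),
        "can_enter": os.access(path, os.X_OK),
    }


if __name__ == "__main__":
    # Path copied from the FATAL log entry above.
    wal_dir = "/xuegangtidb/tidb-data/pd-14279/member/wal"
    try:
        print(describe_access(wal_dir))
    except OSError as exc:
        print(f"cannot stat {wal_dir}: {exc}")
```

If any of the three access flags is False for the PD user, or the stat call itself fails, that would be consistent with the "permission denied" error in the log.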

Partial TiDB log:
[2021/05/07 19:25:08.815 +08:00] [INFO] [coprocessor.go:1034] ["[TIME_COP_WAIT] resp_time:1.741656251s txnStartTS:424774652932390913 region_id:48 store_addr:10.254.10.70:43641 kv_wait_ms:1741"]
[2021/05/07 20:44:26.753 +08:00] [WARN] [pd.go:109] ["get timestamp too slow"] ["cost time"=427.826537ms]
[2021/05/08 16:13:00.928 +08:00] [WARN] [pd.go:109] ["get timestamp too slow"] ["cost time"=35.480687ms]
[2021/05/08 16:13:00.928 +08:00] [WARN] [pd.go:109] ["get timestamp too slow"] ["cost time"=35.312391ms]
[2021/05/09 01:00:43.418 +08:00] [WARN] [pd.go:109] ["get timestamp too slow"] ["cost time"=40.442237ms]
[2021/05/10 12:35:57.125 +08:00] [WARN] [memory_usage_alarm.go:141] ["tidb-server has the risk of OOM. Running SQLs and heap profile will be recorded in record path"] ["is server-memory-quota set"=false] ["system memory total"=269803999232] ["system memory usage"=215949737984] ["tidb-server memory usage"=31217496] [memory-usage-alarm-ratio=0.8] ["record path"="/tmp/1012_tidb/MC4wLjAuMDo0NjQzLzAuMC4wLjA6MTQ1ODM=/tmp-storage/record"]
[2021/05/10 12:38:11.845 +08:00] [WARN] [pd.go:109] ["get timestamp too slow"] ["cost time"=151.92602ms]
[2021/05/10 12:38:11.844 +08:00] [WARN] [pd.go:109] ["get timestamp too slow"] ["cost time"=151.63318ms]
[2021/05/10 12:38:14.374 +08:00] [WARN] [pd.go:109] ["get timestamp too slow"] ["cost time"=124.528886ms]
[2021/05/10 12:38:14.374 +08:00] [WARN] [pd.go:109] ["get timestamp too slow"] ["cost time"=124.529017ms]
[2021/05/10 12:56:01.068 +08:00] [ERROR] [systime_mon.go:33] ["system time jump backward"] [last=1620622561477910258]
[2021/05/10 12:57:42.776 +08:00] [WARN] [pd.go:109] ["get timestamp too slow"] ["cost time"=34.463747ms]
[2021/05/10 12:57:42.776 +08:00] [WARN] [pd.go:109] ["get timestamp too slow"] ["cost time"=34.155725ms]
[2021/05/10 12:58:05.777 +08:00] [WARN] [pd.go:109] ["get timestamp too slow"] ["cost time"=35.386784ms]
[2021/05/10 12:58:05.777 +08:00] [WARN] [pd.go:109] ["get timestamp too slow"] ["cost time"=35.386854ms]
[2021/05/10 12:58:12.778 +08:00] [WARN] [pd.go:109] ["get timestamp too slow"] ["cost time"=36.021575ms]
[2021/05/10 12:58:12.778 +08:00] [WARN] [pd.go:109] ["get timestamp too slow"] ["cost time"=36.054149ms]
[2021/05/10 13:02:52.220 +08:00] [WARN] [pd.go:109] ["get timestamp too slow"] ["cost time"=41.049585ms]
[2021/05/10 13:05:06.227 +08:00] [WARN] [pd.go:109] ["get timestamp too slow"] ["cost time"=54.878538ms]
[2021/05/10 13:06:50.210 +08:00] [WARN] [pd.go:109] ["get timestamp too slow"] ["cost time"=37.409902ms]
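
Most of the entries above are "get timestamp too slow" warnings. To see how slow the timestamp requests got, the "cost time" values can be pulled out of the log. A rough Python sketch, assuming durations are printed with an s, ms, or µs/us suffix as in the excerpt; the function name slow_tso_costs_ms is hypothetical:

```python
import re

# Matches the duration in entries such as: ["cost time"=427.826537ms]
# \W+ tolerates either straight or typographic quotes around the key.
COST_RE = re.compile(r"cost time\W+([0-9.]+)(ms|µs|us|s)\b")
TO_MS = {"s": 1000.0, "ms": 1.0, "us": 0.001, "µs": 0.001}


def slow_tso_costs_ms(lines):
    """Return the 'cost time' of each 'get timestamp too slow' entry, in milliseconds."""
    costs = []
    for line in lines:
        if "get timestamp too slow" not in line:
            continue
        match = COST_RE.search(line)
        if match:
            costs.append(float(match.group(1)) * TO_MS[match.group(2)])
    return costs


if __name__ == "__main__":
    # Two entries copied from the TiDB log excerpt above.
    sample = [
        '[2021/05/07 20:44:26.753 +08:00] [WARN] [pd.go:109] '
        '["get timestamp too slow"] ["cost time"=427.826537ms]',
        '[2021/05/08 16:13:00.928 +08:00] [WARN] [pd.go:109] '
        '["get timestamp too slow"] ["cost time"=35.480687ms]',
    ]
    costs = slow_tso_costs_ms(sample)
    print(f"entries={len(costs)} max_ms={max(costs):.3f}")
```

Feeding it the whole tidb.log instead of the two sample lines would show whether the slow requests cluster around particular times.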


Lucien · May 16, 2021, 02:45

Check whether the directory permissions on the PD node are correct; they look wrong.

The TiDB Server message most likely points to insufficient memory. Check whether a query was too large, filled up memory, and triggered an OOM.
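
The memory_usage_alarm entry in the question carries the numbers needed to sanity-check this. The Python sketch below copies those fields from the log and compares them. Reading them this way assumes that, with server-memory-quota unset, the alarm is evaluated against system-wide memory usage; that assumption should be confirmed against the TiDB documentation for v4.0.10. The helper names field and summarize are illustrative only:

```python
import re

# Fields copied from the memory_usage_alarm.go entry in the question.
ALARM_FIELDS = (
    '["is server-memory-quota set"=false] '
    '["system memory total"=269803999232] '
    '["system memory usage"=215949737984] '
    '["tidb-server memory usage"=31217496] '
    '[memory-usage-alarm-ratio=0.8]'
)


def field(name: str, text: str) -> float:
    """Extract the numeric value logged for `name`."""
    match = re.search(re.escape(name) + r"\W+([0-9.]+)", text)
    if match is None:
        raise ValueError(f"field not found: {name}")
    return float(match.group(1))


def summarize(text: str) -> dict:
    total = field("system memory total", text)
    used = field("system memory usage", text)
    tidb = field("tidb-server memory usage", text)
    ratio = field("memory-usage-alarm-ratio", text)
    return {
        "system_used_fraction": used / total,
        "tidb_server_mib": tidb / 2**20,
        "alarm_ratio": ratio,
        "system_over_ratio": used / total >= ratio,
    }


if __name__ == "__main__":
    for key, value in summarize(ALARM_FIELDS).items():
        print(key, value)
```

With the values from the log, system memory usage comes out at about 80% of total, just over the 0.8 ratio, while tidb-server itself reports about 30 MiB. If the assumption above holds, other processes on the node would be worth checking as well as large queries.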

TiDB and PD can be co-located on one node, but make sure the node has enough resources for both; at the moment it may not.

变又未变 · May 17, 2021, 01:11

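1. Deployment directory: I originally deployed as the user iass_user. I checked, and the corresponding directories on node 26 are all owned by iass_user. I can simply use chown to change the owner on node 27 to iass_user, right?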

2. Where exactly should I look for this? I searched the tidb log and did not find the keyword "out of memory" either.
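
Regarding item 1, before running chown recursively it can help to list which entries actually have the wrong owner. A dry-run sketch in Python, assuming a Linux host; the function name entries_not_owned_by is hypothetical, the user name comes from the post, and the directory is the PD data path from the log rather than a confirmed deployment path:

```python
import os
import pwd


def entries_not_owned_by(root: str, expected_uid: int):
    """Return (path, uid) for `root` and everything under it not owned by `expected_uid`."""
    paths = [root]
    # os.walk skips directories it cannot read, so run this as root
    # (or as a user that can traverse the whole tree) for a complete list.
    for dirpath, dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, name) for name in dirnames + filenames)
    mismatches = []
    for path in paths:
        uid = os.lstat(path).st_uid  # lstat: do not follow symlinks
        if uid != expected_uid:
            mismatches.append((path, uid))
    return mismatches


if __name__ == "__main__":
    root = "/xuegangtidb/tidb-data/pd-14279"  # PD data path seen in the FATAL log
    try:
        uid = pwd.getpwnam("iass_user").pw_uid  # deployment user named in the post
        for path, owner_uid in entries_not_owned_by(root, uid):
            print(f"{path} is owned by uid {owner_uid}")
    except (KeyError, OSError) as exc:
        print(f"skipped: {exc}")
```

This only reports mismatches; the actual fix is still the chown described above.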

变又未变 · May 17, 2021, 01:23

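I fixed the permissions and everything has started up now. It looks like someone had changed the permissions on this directory. Thanks for your help; I will ask again if anything else comes up.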

Lucien · May 17, 2021, 06:18

OK.

system · October 31, 2022, 19:07

This topic was automatically closed 1 minute after the last reply. New replies are no longer allowed.