TiDB has no business traffic at all, yet TiCDC suddenly got stuck

TiCDC version: 5.3.0

The last few TiCDC log lines:

[2022/03/10 14:09:18.435 +08:00] [INFO] [statistics.go:154] ["sink replication status"] [name=MQ] [changefeed=payment-task] [capture=10.59.110.32:8300] [count=0] [qps=0] [ddl=0]
[2022/03/10 14:11:09.347 +08:00] [INFO] [statistics.go:154] ["sink replication status"] [name=MQ] [changefeed=verify-task] [capture=10.59.110.32:8300] [count=34] [qps=0] [ddl=0]
[2022/03/10 14:19:18.934 +08:00] [INFO] [statistics.go:154] ["sink replication status"] [name=MQ] [changefeed=payment-task] [capture=10.59.110.32:8300] [count=0] [qps=0] [ddl=0]
[2022/03/10 14:21:09.634 +08:00] [INFO] [statistics.go:154] ["sink replication status"] [name=MQ] [changefeed=verify-task] [capture=10.59.110.32:8300] [count=39] [qps=0] [ddl=0]
[2022/03/10 14:29:20.124 +08:00] [INFO] [statistics.go:154] ["sink replication status"] [name=MQ] [changefeed=payment-task] [capture=10.59.110.32:8300] [count=3] [qps=0] [ddl=0]
[2022/03/10 14:31:09.753 +08:00] [INFO] [statistics.go:154] ["sink replication status"] [name=MQ] [changefeed=verify-task] [capture=10.59.110.32:8300] [count=22] [qps=0] [ddl=0]
[2022/03/10 14:39:20.434 +08:00] [INFO] [statistics.go:154] ["sink replication status"] [name=MQ] [changefeed=payment-task] [capture=10.59.110.32:8300] [count=1] [qps=0] [ddl=0]
[2022/03/10 14:41:09.975 +08:00] [INFO] [statistics.go:154] ["sink replication status"] [name=MQ] [changefeed=verify-task] [capture=10.59.110.32:8300] [count=25] [qps=0] [ddl=0]
[2022/03/10 14:49:20.436 +08:00] [INFO] [statistics.go:154] ["sink replication status"] [name=MQ] [changefeed=payment-task] [capture=10.59.110.32:8300] [count=0] [qps=0] [ddl=0]
[2022/03/10 14:51:10.034 +08:00] [INFO] [statistics.go:154] ["sink replication status"] [name=MQ] [changefeed=verify-task] [capture=10.59.110.32:8300] [count=15] [qps=0] [ddl=0]
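
For anyone reading these status lines in bulk, here is a minimal Python sketch that groups the `count` values by changefeed. The function name `summarize_sink_status` and the regular expression are my own illustration, not part of TiCDC; it only assumes the bracketed `[key=value]` layout visible in the lines above.

```python
import re
from collections import defaultdict

# Matches one [key=value] field of the bracketed log format shown above.
# Timestamp, level, and source fields contain no "=" and are skipped.
FIELD_RE = re.compile(r"\[(\w[\w-]*)=([^\]]*)\]")


def summarize_sink_status(lines):
    """Return {changefeed: [count, ...]} for "sink replication status" lines."""
    counts = defaultdict(list)
    for line in lines:
        if "sink replication status" not in line:
            continue
        fields = dict(FIELD_RE.findall(line))
        if "changefeed" in fields and "count" in fields:
            counts[fields["changefeed"]].append(int(fields["count"]))
    return dict(counts)


# Two lines copied from the log excerpt above.
sample = [
    '[2022/03/10 14:49:20.436 +08:00] [INFO] [statistics.go:154] ["sink replication status"] [name=MQ] [changefeed=payment-task] [capture=10.59.110.32:8300] [count=0] [qps=0] [ddl=0]',
    '[2022/03/10 14:51:10.034 +08:00] [INFO] [statistics.go:154] ["sink replication status"] [name=MQ] [changefeed=verify-task] [capture=10.59.110.32:8300] [count=15] [qps=0] [ddl=0]',
]
summary = summarize_sink_status(sample)
print(summary)
```

Note that these lines come from capture 10.59.110.32:8300, a processor node, so on their own they say nothing about whether the owner is alive.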

TiCDC monitoring: the owner seems to be gone
ticdc.json (2.7 MB)


image

But the capture list viewed through tiup shows everything as normal:

[
  {
    "id": "15df6a51-0fb6-4d15-b4cf-16c9badcb377",
    "is-owner": true,
    "address": "10.59.110.133:8300"
  },
  {
    "id": "2a177a94-af2c-4915-a6cb-1d51985971b4",
    "is-owner": false,
    "address": "10.59.110.207:8300"
  },
  {
    "id": "7129d057-2cd3-4b91-98bb-4dc9a745c796",
    "is-owner": false,
    "address": "10.59.110.32:8300"
  }
]
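
A small Python sketch for checking a capture list like the one above for exactly one owner. The helper name `find_owners` is hypothetical; the only assumption is the JSON shape shown here (`id`, `is-owner`, `address`). Keep in mind that this only reads what the listing reports. It cannot tell whether the owner is actually making progress, which appears to be the gap in this thread.

```python
import json


def find_owners(capture_list_json):
    """Return the addresses of all captures flagged as owner."""
    captures = json.loads(capture_list_json)
    return [c["address"] for c in captures if c.get("is-owner")]


# The capture list pasted above.
capture_list = """
[
  {"id": "15df6a51-0fb6-4d15-b4cf-16c9badcb377", "is-owner": true,  "address": "10.59.110.133:8300"},
  {"id": "2a177a94-af2c-4915-a6cb-1d51985971b4", "is-owner": false, "address": "10.59.110.207:8300"},
  {"id": "7129d057-2cd3-4b91-98bb-4dc9a745c796", "is-owner": false, "address": "10.59.110.32:8300"}
]
"""

owners = find_owners(capture_list)
if len(owners) == 1:
    print("listed owner:", owners[0])
else:
    print("unexpected number of listed owners:", owners)
```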

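The processor memory suddenly started climbing

Please post the logs from the previous owner. According to the monitoring there is currently no owner, so everything is waiting.

It looks normal when checked through tiup

Please post the owner's logs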

cdc.log (157.3 KB)

The owner's log only goes up to March 8

Here are the logs from the other two nodes:
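
To confirm how far a log actually goes, a rough Python sketch that pulls the last timestamp out of a list of log lines. The name `last_log_time` is made up for illustration, and the timestamp layout is assumed from the log lines quoted in this thread; continuation lines without a leading timestamp (stack traces) are ignored.

```python
import re
from datetime import datetime

# Leading timestamp as it appears in the logs in this thread,
# e.g. [2022/03/10 14:51:10.034 +08:00]
TS_RE = re.compile(r"^\[(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3} [+-]\d{2}:\d{2})\]")


def last_log_time(lines):
    """Return the timestamp of the last timestamped line, or None."""
    last = None
    for line in lines:
        m = TS_RE.match(line)
        if m:
            last = datetime.strptime(m.group(1), "%Y/%m/%d %H:%M:%S.%f %z")
    return last


# Sample lines taken from the log excerpt earlier in the thread.
sample = [
    '[2022/03/10 14:49:20.436 +08:00] [INFO] [statistics.go:154] ["sink replication status"] [name=MQ] [changefeed=payment-task] [capture=10.59.110.32:8300] [count=0] [qps=0] [ddl=0]',
    '[2022/03/10 14:51:10.034 +08:00] [INFO] [statistics.go:154] ["sink replication status"] [name=MQ] [changefeed=verify-task] [capture=10.59.110.32:8300] [count=15] [qps=0] [ddl=0]',
    "\truntime/asm_amd64.s:1371",  # continuation line, no timestamp
]
latest = last_log_time(sample)
print(latest.isoformat())
```

Comparing this value with the current time on each node gives a quick idea of which process has stopped writing logs.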
cdc3.log (102.1 KB) cdc2.log (425.8 KB)

Were there any other operations on March 4 or March 8?

No, it was just replicating normally

Nothing wrong on March 4
image

Nothing wrong on March 8 either
image

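Try restarting the owner node, then watch the logs to see whether an owner gets elected

After the restart, it went back to normal

What was the problem?
Why did tiup show the capture owner as normal?

There were no error messages during that period either

Is this a bug?

Please check whether node 133 (10.59.110.133) has a cdc_err.log

There were some error messages during the restart: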

[2022/03/10 15:52:26.201 +08:00] [INFO] [helper.go:63] ["got signal to exit"] [signal=terminated]
[2022/03/10 15:52:26.201 +08:00] [ERROR] [client.go:750] ["[pd] fetch pending tso requests error"] [dc-location=global] [error="[PD:client:ErrClientGetTSO]context canceled: context canceled"]
[2022/03/10 15:52:26.201 +08:00] [INFO] [client.go:669] ["[pd] exit tso dispatcher"] [dc-location=global]
[2022/03/10 15:52:26.201 +08:00] [INFO] [capture.go:254] ["run owner exited"] [error="[CDC:ErrPDEtcdAPIError]context canceled: context canceled"] [errorVerbose="[CDC:ErrPDEtcdAPIError]context canceled: context canceled\
github.com/pingcap/errors.AddStack\
\tgithub.com/pingcap/errors@v0.11.5-0.20210513014640-40f9a1999b3b/errors.go:174\
github.com/pingcap/errors.(*Error).GenWithStackByCause\
\tgithub.com/pingcap/errors@v0.11.5-0.20210513014640-40f9a1999b3b/normalize.go:302\
github.com/pingcap/ticdc/pkg/errors.WrapError\
\tgithub.com/pingcap/ticdc/pkg/errors/helper.go:30\
github.com/pingcap/ticdc/cdc/capture.(*Capture).runEtcdWorker\
\tgithub.com/pingcap/ticdc/cdc/capture/capture.go:287\
github.com/pingcap/ticdc/cdc/capture.(*Capture).campaignOwner\
\tgithub.com/pingcap/ticdc/cdc/capture/capture.go:252\
github.com/pingcap/ticdc/cdc/capture.(*Capture).run.func2\
\tgithub.com/pingcap/ticdc/cdc/capture/capture.go:177\
runtime.goexit\
\truntime/asm_amd64.s:1371"]
[2022/03/10 15:52:26.202 +08:00] [INFO] [capture.go:178] ["the owner routine has exited"] [error="resign owner failed, capture: 15df6a51-0fb6-4d15-b4cf-16c9badcb377: [CDC:ErrCaptureResignOwner]context canceled: context canceled"] [errorVerbose="[CDC:ErrCaptureResignOwner]context canceled: context canceled\
github.com/pingcap/errors.AddStack\
\tgithub.com/pingcap/errors@v0.11.5-0.20210513014640-40f9a1999b3b/errors.go:174\
github.com/pingcap/errors.(*Error).GenWithStackByCause\
\tgithub.com/pingcap/errors@v0.11.5-0.20210513014640-40f9a1999b3b/normalize.go:302\
github.com/pingcap/ticdc/pkg/errors.WrapError\
\tgithub.com/pingcap/ticdc/pkg/errors/helper.go:30\
github.com/pingcap/ticdc/cdc/capture.(*Capture).resign\
\tgithub.com/pingcap/ticdc/cdc/capture/capture.go:327\
github.com/pingcap/ticdc/cdc/capture.(*Capture).campaignOwner\
\tgithub.com/pingcap/ticdc/cdc/capture/capture.go:256\
github.com/pingcap/ticdc/cdc/capture.(*Capture).run.func2\
\tgithub.com/pingcap/ticdc/cdc/capture/capture.go:177\
runtime.goexit\
\truntime/asm_amd64.s:1371\
resign owner failed, capture: 15df6a51-0fb6-4d15-b4cf-16c9badcb377"]
[2022/03/10 15:52:26.202 +08:00] [INFO] [capture.go:189] ["the processor routine has exited"] [error="[CDC:ErrPDEtcdAPIError]context canceled: context canceled"] [errorVerbose="[CDC:ErrPDEtcdAPIError]context canceled: context canceled\
github.com/pingcap/errors.AddStack\
\tgithub.com/pingcap/errors@v0.11.5-0.20210513014640-40f9a1999b3b/errors.go:174\
github.com/pingcap/errors.(*Error).GenWithStackByCause\
\tgithub.com/pingcap/errors@v0.11.5-0.20210513014640-40f9a1999b3b/normalize.go:302\
github.com/pingcap/ticdc/pkg/errors.WrapError\
\tgithub.com/pingcap/ticdc/pkg/errors/helper.go:30\
github.com/pingcap/ticdc/cdc/capture.(*Capture).runEtcdWorker\
\tgithub.com/pingcap/ticdc/cdc/capture/capture.go:287\
github.com/pingcap/ticdc/cdc/capture.(*Capture).run.func3\
\tgithub.com/pingcap/ticdc/cdc/capture/capture.go:188\
runtime.goexit\
\truntime/asm_amd64.s:1371"]
[2022/03/10 15:52:26.202 +08:00] [WARN] [client.go:1162] ["failed to receive from stream"] [addr=10.59.105.50:20161] [storeID=2] [error="rpc error: code = Unavailable desc = transport is closing"]
[2022/03/10 15:52:26.216 +08:00] [INFO] [capture.go:142] ["capture recovered"] [capture-id=15df6a51-0fb6-4d15-b4cf-16c9badcb377]
[2022/03/10 15:52:26.216 +08:00] [INFO] [capture.go:119] ["the capture routine has exited"]
[2022/03/10 15:52:26.216 +08:00] [ERROR] [client.go:750] ["[pd] fetch pending tso requests error"] [dc-location=global] [error="[PD:client:ErrClientGetTSO]context canceled: context canceled"]
[2022/03/10 15:52:26.216 +08:00] [INFO] [client.go:669] ["[pd] exit tso dispatcher"] [dc-location=global]
[2022/03/10 15:52:26.217 +08:00] [INFO] [server.go:135] ["cdc server exits successfully"]
[2022/03/10 15:52:27.143 +08:00] [INFO] [helper.go:51] ["init log"] [file=/data/cdc/8300/log/cdc.log] [level=info]
[2022/03/10 15:52:27.144 +08:00] [INFO] [version.go:47] ["Welcome to Change Data Capture (CDC)"] [release-version=v5.3.0] [git-hash=20626babf21fc381d4364646c40dd84598533d66] [git-branch=heads/refs/tags/v5.3.0] [utc-build-time="2021-11-22 10:37:02"] [go-version="go version go1.16.4 linux/amd64"] [failpoint-build=false]
[2022/03/10 15:52:27.144 +08:00] [INFO] [server.go:67] ["creating CDC server"] [pd-addrs="[http://10.59.105.60:2379,http://10.59.105.61:2379,http://10.59.105.62:2379]"] [config="{\"addr\":\"0.0.0.0:8300\",\"advertise-addr\":\"10.59.110.133:8300\",\"log-file\":\"/data/cdc/8300/log/cdc.log\",\"log-level\":\"info\",\"log\":{\"file\":{\"max-size\":300,\"max-days\":0,\"max-backups\":0}},\"data-dir\":\"/data/cdc/8300/store\",\"gc-ttl\":86400,\"tz\":\"System\",\"capture-session-ttl\":10,\"owner-flush-interval\":200000000,\"processor-flush-interval\":100000000,\"sorter\":{\"num-concurrent-worker\":4,\"chunk-size-limit\":134217728,\"max-memory-percentage\":30,\"max-memory-consumption\":17179869184,\"num-workerpool-goroutine\":16,\"sort-dir\":\"/tmp/sorter\"},\"security\":{\"ca-path\":\"\",\"cert-path\":\"\",\"key-path\":\"\",\"cert-allowed-cn\":null},\"per-table-memory-quota\":10485760,\"kv-client\":{\"worker-concurrent\":8,\"worker-pool-size\":0,\"region-scan-limit\":40}}"]
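
The config dump in the last line prints raw integers, which are hard to read at a glance. Below is a minimal Python sketch that converts a few of them. The units are my assumption: `gc-ttl` and `capture-session-ttl` in seconds, the two flush intervals as Go durations in nanoseconds, and the memory fields in bytes. The values themselves are copied from the log line above.

```python
# Values copied from the "creating CDC server" config dump above.
config = {
    "gc-ttl": 86400,
    "capture-session-ttl": 10,
    "owner-flush-interval": 200000000,
    "processor-flush-interval": 100000000,
    "max-memory-consumption": 17179869184,
    "chunk-size-limit": 134217728,
    "per-table-memory-quota": 10485760,
}


def ns_to_ms(ns):
    """Nanoseconds to milliseconds (assumes a Go time.Duration value)."""
    return ns / 1_000_000


def bytes_to_mib(n):
    """Bytes to MiB."""
    return n / 2**20


readable = {
    "gc-ttl (hours)": config["gc-ttl"] / 3600,
    "capture-session-ttl (s)": config["capture-session-ttl"],
    "owner-flush-interval (ms)": ns_to_ms(config["owner-flush-interval"]),
    "processor-flush-interval (ms)": ns_to_ms(config["processor-flush-interval"]),
    "max-memory-consumption (GiB)": config["max-memory-consumption"] / 2**30,
    "chunk-size-limit (MiB)": bytes_to_mib(config["chunk-size-limit"]),
    "per-table-memory-quota (MiB)": bytes_to_mib(config["per-table-memory-quota"]),
}

for name, value in readable.items():
    print(f"{name}: {value:g}")
```

Under those assumptions the capture session TTL is 10 seconds, so a capture that truly lost its session would normally drop out of the capture list shortly afterwards.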

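cdc_stderr.log: does this log file exist?

Yes, but it only goes up to March 6

I'll just post a screenshot, there are only a few error log lines:

Please upload the PD monitoring and the PD logs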

TiDB_2022-03-10T08_13_56.419Z.json (6.8 MB)
TiKV-Details_2022-03-10T08_12_11.368Z.json (27.4 MB)
PD_2022-03-10T08_11_25.329Z.json (7.7 MB)

Monitoring for PD, TiKV, and TiDB covering the last 3 hours

PD logs