TiCDC log reports "retrying of unary invoker failed" errors and replication generates a large number of log files

To improve efficiency, please provide the following information; a clearly described problem can be resolved faster:

【TiDB Version】TiDB 4.0.9

【Problem Description】TiCDC replication is producing the error logs below and data is not being replicated. Is this caused by insufficient memory, and how can it be resolved?
We also found that ALTER TABLE statements were not replicated by TiCDC to the downstream MySQL database. What could cause this?

{"level":"warn","ts":"2021-04-07T07:23:44.353+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-53f3be3f-1cb5-4d1c-95da-e9b95a897ca5/192.168.0.100:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
{"level":"warn","ts":"2021-04-07T07:23:30.893+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-53f3be3f-1cb5-4d1c-95da-e9b95a897ca5/192.168.0.100:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
fatal error: runtime: out of memory

runtime stack:
runtime.throw(0x28cd1f3, 0x16)
runtime/panic.go:774 +0x72
runtime.sysMap(0xc6fc000000, 0x4000000, 0x49be838)
runtime/mem_linux.go:169 +0xc5
runtime.(*mheap).sysAlloc(0x49a3ec0, 0x2000, 0x41d912, 0x7f2014aa9008)
runtime/malloc.go:701 +0x1cd
runtime.(*mheap).grow(0x49a3ec0, 0x1, 0xffffffff)
runtime/mheap.go:1252 +0x42
runtime.(*mheap).allocSpanLocked(0x49a3ec0, 0x1, 0x49be848, 0xc000077820)
runtime/mheap.go:1163 +0x291
runtime.(*mheap).alloc_m(0x49a3ec0, 0x1, 0xc001180007, 0x45b9fa)
runtime/mheap.go:1015 +0xc2
runtime.(*mheap).alloc.func1()
runtime/mheap.go:1086 +0x4c
runtime.systemstack(0x0)
runtime/asm_amd64.s:370 +0x66
runtime.mstart()
runtime/proc.go:1146

goroutine 599 [running]:
runtime.systemstack_switch()
runtime/asm_amd64.s:330 fp=0xc16e53e828 sp=0xc16e53e820 pc=0x45d790
runtime.(*mheap).alloc(0x49a3ec0, 0x1, 0x10007, 0x0)
runtime/mheap.go:1085 +0x8a fp=0xc16e53e878 sp=0xc16e53e828 pc=0x425f0a
runtime.(*mcentral).grow(0x49a4420, 0x0)
runtime/mcentral.go:255 +0x7b fp=0xc16e53e8b8 sp=0xc16e53e878 pc=0x417f6b
runtime.(*mcentral).cacheSpan(0x49a4420, 0x0)
runtime/mcentral.go:106 +0x2fe fp=0xc16e53e918 sp=0xc16e53e8b8 pc=0x417a8e
runtime.(*mcache).refill(0x7f2014ab2b18, 0x107)
runtime/mcache.go:138 +0x85 fp=0xc16e53e938 sp=0xc16e53e918 pc=0x417535
runtime.(*mcache).nextFree(0x7f2014ab2b18, 0x7, 0x1c, 0xc6fbff4bd0, 0x20)
runtime/malloc.go:854 +0x87 fp=0xc16e53e970 sp=0xc16e53e938 pc=0x40be67
runtime.mallocgc(0x20, 0x0, 0x0, 0xc6fbff4cf0)
runtime/malloc.go:1022 +0x793 fp=0xc16e53ea10 sp=0xc16e53e970 pc=0x40c7a3
runtime.slicebytetostring(0x0, 0xc6fbffbfe0, 0x1c, 0x20, 0x0, 0x0)
runtime/string.go:102 +0x9f fp=0xc16e53ea40 sp=0xc16e53ea10 pc=0x449caf
github.com/pingcap/ticdc/cdc/sink.genTxnKeys(0xc15f0d0680, 0x40c514, 0xc6fbff9330, 0x10)
github.com/pingcap/ticdc@/cdc/sink/causality.go:81 +0x172 fp=0xc16e53ec18 sp=0xc16e53ea40 pc=0x1c46f12
github.com/pingcap/ticdc/cdc/sink.(*mysqlSink).dispatchAndExecTxns.func2(0xc15f0d0680)
github.com/pingcap/ticdc@/cdc/sink/mysql.go:632 +0x7f fp=0xc16e53eca8 sp=0xc16e53ec18 pc=0x1c5c2af
github.com/pingcap/ticdc/cdc/sink.(*mysqlSink).dispatchAndExecTxns.func3(0xc15f0d0680)
github.com/pingcap/ticdc@/cdc/sink/mysql.go:648 +0x6b fp=0xc16e53ed00 sp=0xc16e53eca8 pc=0x1c5c49b
github.com/pingcap/ticdc/cdc/sink.(*txnsHeap).iter(0xc181551aa0, 0xc16e53ed88)
github.com/pingcap/ticdc@/cdc/sink/txns_heap.go:76 +0xc3 fp=0xc16e53ed58 sp=0xc16e53ed00 pc=0x1c5aba3
github.com/pingcap/ticdc/cdc/sink.(*mysqlSink).dispatchAndExecTxns(0xc0007200a0, 0x2e8c1e0, 0xc000b48580, 0xc181bd2900)
github.com/pingcap/ticdc@/cdc/sink/mysql.go:646 +0x1df fp=0xc16e53ef18 sp=0xc16e53ed58 pc=0x1c50aef
github.com/pingcap/ticdc/cdc/sink.(*mysqlSink).flushRowChangedEvents(0xc0007200a0, 0x2e8c1e0, 0xc000b48580)
github.com/pingcap/ticdc@/cdc/sink/mysql.go:159 +0x1a8 fp=0xc16e53efc8 sp=0xc16e53ef18 pc=0x1c4bd18
runtime.goexit()
runtime/asm_amd64.s:1357 +0x1 fp=0xc16e53efd0 sp=0xc16e53efc8 pc=0x45f6e1
created by github.com/pingcap/ticdc/cdc/sink.newMySQLSink
github.com/pingcap/ticdc@/cdc/sink/mysql.go:579 +0xab9

goroutine 1 [semacquire, 16612 minutes]:
sync.runtime_Semacquire(0xc000889390)
runtime/sema.go:56 +0x42
sync.(*WaitGroup).Wait(0xc000889388)
sync/waitgroup.go:130 +0x64
golang.org/x/sync/errgroup.(*Group).Wait(0xc000889380, 0xc00034ddc0, 0x2e8c1e0)
golang.org/x/sync@v0.0.0-20200625203802-6e8e738ad208/errgroup/errgroup.go:40 +0x31
github.com/pingcap/ticdc/cdc.(*Server).run(0xc0004573f0, 0x2e8c2a0, 0xc000843a70, 0x0, 0x0)
github.com/pingcap/ticdc@/cdc/server.go:341 +0x3b2
github.com/pingcap/ticdc/cdc.(*Server).Run(0xc0004573f0, 0x2e8c1e0, 0xc0003ec0c0, 0x0, 0x0)
github.com/pingcap/ticdc@/cdc/server.go:255 +0x5db
github.com/pingcap/ticdc/cmd.runEServer(0x49407a0, 0xc00039fa00, 0x0, 0x8, 0x0, 0x0)
github.com/pingcap/ticdc@/cmd/server.go:118 +0x389
github.com/spf13/cobra.(*Command).execute(0x49407a0, 0xc00039f980, 0x8, 0x8, 0x49407a0, 0xc00039f980)
github.com/spf13/cobra@v1.0.0/command.go:842 +0x460
github.com/spf13/cobra.(*Command).ExecuteC(0x4940500, 0x40580f, 0xc0000ca058, 0x0)
github.com/spf13/cobra@v1.0.0/command.go:950 +0x349
github.com/spf13/cobra.(*Command).Execute(...)
github.com/spf13/cobra@v1.0.0/command.go:887
github.com/pingcap/ticdc/cmd.Execute()
github.com/pingcap/ticdc@/cmd/root.go:32 +0x61
main.main()
github.com/pingcap/ticdc@/main.go:22 +0x20

goroutine 6 [syscall, 16612 minutes]:
os/signal.signal_recv(0x0)
runtime/sigqueue.go:147 +0x9c
os/signal.loop()
os/signal/signal_unix.go:23 +0x22
created by os/signal.init.0
os/signal/signal_unix.go:29 +0x41

goroutine 135 [chan receive, 16612 minutes]:
github.com/klauspost/compress/zstd.(*blockDec).startDecoder(0xc0001556c0)
github.com/klauspost/compress@v1.11.1/zstd/blockdec.go:215 +0x16c
created by github.com/klauspost/compress/zstd.newBlockDec
github.com/klauspost/compress@v1.11.1/zstd/blockdec.go:118 +0x166

  1. From the log, this is an out-of-memory error. What is the current configuration? Please check the memory monitoring.
  2. What does the replication task look like? Are the replicated tables or the data volume very large? If so, consider splitting the task; a rough example follows this list.
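
For example, a single large changefeed can be split by table with filter rules and created as several smaller changefeeds. This is only a minimal sketch: the PD address is taken from the log above, while the sink URI, changefeed ID, and the `shop.orders*` table pattern are placeholders to replace with your own.

    # Hypothetical filter config: replicate only the shop.orders* tables in this changefeed.
    cat > changefeed-orders.toml <<'EOF'
    [filter]
    rules = ['shop.orders*']
    EOF

    # Create one changefeed per table group instead of a single large one.
    cdc cli changefeed create \
      --pd=http://192.168.0.100:2379 \
      --sink-uri="mysql://user:password@downstream-mysql:3306/" \
      --changefeed-id="orders-feed" \
      --config=changefeed-orders.toml

Repeat with different filter rules for the remaining table groups; whether this actually lowers peak memory still needs to be confirmed against the TiCDC Grafana panels.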

Server configuration:
Instance type: c5.4xlarge
CPU: 16+ cores
Memory: 32 GB+
Disk: 1 × 500 GB SSD
Network: 2 × 10 Gigabit NICs

Here is how the services are deployed in our cluster.

Our CDC server is a mixed deployment; the same machine also runs PD and TiDB services. Is there any problem with this configuration and deployment?

If the TiCDC replication service has relatively high resource usage, would switching to the TiDB Binlog replication service be better?

  1. TiCDC needs about 64 GB of memory in its resource configuration.

  2. Would you consider the 5.0 GA version? If the OOM is caused by this situation (replicating too much incremental change data), an SSD disk can be used for spilling data to disk; a sketch follows the link below.

https://docs.pingcap.com/zh/tidb/stable/release-5.0.0#ticdc-稳定性提升缓解同步过多增量变更数据的-oom-问题
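
As a rough illustration of that release note, recent TiCDC versions can let the sorting stage spill to local disk instead of holding all incremental data in memory. The `--sort-engine` and `--sort-dir` flags below exist in recent 4.0.x and 5.0 releases, but their defaults and exact behavior vary by version, so treat this as a sketch and check the docs for the version you run; the sort directory path is a placeholder.

    # Sketch: create the changefeed with the unified sorter so sorting can
    # spill to an SSD-backed directory instead of staying purely in memory.
    cdc cli changefeed create \
      --pd=http://192.168.0.100:2379 \
      --sink-uri="mysql://user:password@downstream-mysql:3306/" \
      --sort-engine="unified" \
      --sort-dir="/ssd/ticdc-sort"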

Can we upgrade our TiDB v4.0.9 directly to 5.0? Could errors during the upgrade interrupt the business? We would like to try it; I just checked and 5.0 was released only yesterday.

  1. During the PD upgrade the leader switches, so there is a brief unavailability of roughly a few seconds to tens of seconds. During the TiKV upgrade leaders also switch, which may affect the business at that moment; as long as the application retries, there is no problem. It is best to do this during off-peak hours or at night.
  2. 5.0 GA has just been released, so you could try it on a test cluster first; a sample upgrade flow is sketched below.
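
If you do go ahead, the usual TiUP flow looks roughly like the following. The cluster name `prod-cluster` is a placeholder; rehearsing the same steps on the test cluster first is exactly the point of item 2 above.

    # Update TiUP itself and the cluster component first.
    tiup update --self && tiup update cluster

    # Confirm the cluster is healthy before upgrading.
    tiup cluster display prod-cluster

    # Rolling upgrade to v5.0.0; leaders are transferred node by node,
    # so expect the brief leader switches described above.
    tiup cluster upgrade prod-cluster v5.0.0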