A few questions about the TiCDC unified sorter

1. Environment

TiDB version: v5.1.0
172.29.238.238:8302   cdc   /data/tidb/data/cdc-8302         /data/tidb/deploy/cdc-8302
192.168.149.155:8302  cdc   /data/tidb/data/cdc-8302         /data/tidb/deploy/cdc-8302
192.168.149.156:8302  cdc   /data/tidb/data/cdc-8302         /data/tidb/deploy/cdc-8302

2. Use case
Data from a TiDB cluster is replicated through TiCDC to a downstream TiDB cluster, to keep clusters in two clouds in sync.

[
  {
    "id": "tidboom-task",
    "summary": {
      "state": "normal",
      "tso": 426446810058588161,
      "checkpoint": "2021-07-20 15:18:00.265",
      "error": null
    }
  }
]
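
The `tso` and `checkpoint` fields in the list output above describe the same point in time. As a minimal sketch, assuming the usual TiDB TSO layout (physical milliseconds in the high bits, an 18-bit logical counter in the low bits) and a UTC+8 display timezone, the conversion looks like this. The function names are illustrative and are not part of any TiCDC tool.

```python
from datetime import datetime, timedelta, timezone

LOGICAL_BITS = 18  # assumption: the low 18 bits of a TSO hold the logical counter


def decode_tso(tso):
    """Split a TSO into (physical milliseconds since the Unix epoch, logical counter)."""
    return tso >> LOGICAL_BITS, tso & ((1 << LOGICAL_BITS) - 1)


def tso_to_local_time(tso, utc_offset_hours=8):
    """Format the physical part of a TSO as a wall-clock time with millisecond precision."""
    physical_ms, _ = decode_tso(tso)
    tz = timezone(timedelta(hours=utc_offset_hours))
    # Integer arithmetic avoids float rounding of the millisecond part.
    moment = datetime.fromtimestamp(physical_ms // 1000, tz) + timedelta(
        milliseconds=physical_ms % 1000
    )
    return moment.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]


if __name__ == "__main__":
    # The tso value reported by the changefeed list above.
    print(tso_to_local_time(426446810058588161))  # 2021-07-20 15:18:00.265
```

The printed value matches the `checkpoint` string in the output, which is a quick way to check how far a changefeed has advanced.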
------------- query details
{
  "info": {
    "sink-uri": "mysql://root:xxxxxxxxxx@172.29.238.xxx:4920/?worker-count=32\u0026max-txn-row=5000",
    "opts": {
      "_changefeed_id": "cli-verify"
    },
    "create-time": "2021-07-15T14:27:26.679806641+08:00",
    "start-ts": 426332768610549761,
    "target-ts": 0,
    "admin-job-type": 0,
    "sort-engine": "unified",
    "config": {
      "case-sensitive": true,
      "enable-old-value": true,
      "force-replicate": false,
      "check-gc-safe-point": true,
      "filter": {
        "rules": [
          "*.*"
        ],
        "ignore-txn-start-ts": null
      },
      "mounter": {
        "worker-num": 16
      },
      "sink": {
        "dispatchers": null,
        "protocol": "default"
      },
      "cyclic-replication": {
        "enable": false,
        "replica-id": 0,
        "filter-replica-ids": null,
        "id-buckets": 0,
        "sync-ddl": false
      },
      "scheduler": {
        "type": "table-number",
        "polling-time": -1
      }
    },
    "state": "normal",
    "history": null,
    "error": null,
    "sync-point-enabled": false,
    "sync-point-interval": 600000000000,
    "creator-version": "v5.1.0"
  },

3. Problem description
While stress-testing the TiDB cluster with sysbench, the following messages appeared in the TiCDC log:

[2021/07/20 15:31:04.037 +08:00] [INFO] [backend_pool.go:161] ["Temporary file removed"] [file=/data/tidb/data/cdc-8302/tmp/sorter/sort-366-1296.tmp]
[2021/07/20 15:31:04.037 +08:00] [INFO] [backend_pool.go:161] ["Temporary file removed"] [file=/data/tidb/data/cdc-8302/tmp/sorter/sort-366-1299.tmp]

CDC memory usage at that time was as shown in the screenshot below:

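Question 1: Why do these temporary-file messages appear in the log?
Question 2: If temporary files are being used, why do they not appear in the location I specified in my configuration file?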

The configuration is as follows:

cdc:
    sorter.sort-dir: /data/tidb/cdc_sort

-- The cdc conf file contains the following
[root@db-redis-149-156 cdc-8302]# more conf/cdc.toml
[sorter]
sort-dir = "/data/tidb/cdc_sort"
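
To make question 2 concrete, here is a small sketch that pulls the `file=` field out of the two log lines quoted above and compares its directory with the configured `sort-dir`. It only restates what the log and config already show; the helper names are made up for illustration, and it says nothing about why TiCDC chose that path.

```python
import re
from pathlib import PurePosixPath

# The two log lines quoted in the problem description.
LOG_LINES = [
    '[2021/07/20 15:31:04.037 +08:00] [INFO] [backend_pool.go:161] ["Temporary file removed"] '
    "[file=/data/tidb/data/cdc-8302/tmp/sorter/sort-366-1296.tmp]",
    '[2021/07/20 15:31:04.037 +08:00] [INFO] [backend_pool.go:161] ["Temporary file removed"] '
    "[file=/data/tidb/data/cdc-8302/tmp/sorter/sort-366-1299.tmp]",
]

FILE_FIELD = re.compile(r"\[file=([^\]]+)\]")


def temp_file_dirs(lines):
    """Return the set of directories that hold the files named in [file=...] fields."""
    dirs = set()
    for line in lines:
        match = FILE_FIELD.search(line)
        if match:
            dirs.add(str(PurePosixPath(match.group(1)).parent))
    return dirs


def is_under(path, base):
    """True if path equals base or lies somewhere below it."""
    candidate = PurePosixPath(path)
    return PurePosixPath(base) in (candidate, *candidate.parents)


if __name__ == "__main__":
    configured_sort_dir = "/data/tidb/cdc_sort"  # value from conf/cdc.toml above
    for directory in sorted(temp_file_dirs(LOG_LINES)):
        print(directory, "under sort-dir:", is_under(directory, configured_sort_dir))
```

For these lines the temporary files sit under the deployment's data directory (`/data/tidb/data/cdc-8302/tmp/sorter`), not under `/data/tidb/cdc_sort`.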

4. The TiDB and TiCDC monitoring dashboards are attached:
tidb-oooom-TiCDC_2021-07-20T07_37_22.474Z.json (3.7 MB)
tidb-oooom-Overview_2021-07-20T07_46_54.647Z.json (1.1 MB)


Please also provide the CDC logs (they need to cover the time the task was started).


cdc-156.log (1.1 MB) cdc-155.log (1.3 MB)

These two CDC nodes were scaled out this afternoon, so their logs are fairly complete.


1. The sort-dir question: see https://docs.pingcap.com/zh/tidb/stable/ticdc-overview#sort-dir-及-data-dir-配置项的兼容性说明
2. The temporary files: this still needs to be confirmed.

I ran into this problem again today:

[
  {
    "id": "tidboom-task",
    "summary": {
      "state": "normal",
      "tso": 426470011965276162,
      "checkpoint": "2021-07-21 15:53:08.515",
      "error": {
        "addr": "192.168.149.156:8302",
        "code": "CDC:ErrProcessorUnknown",
        "message": "[CDC:ErrMySQLTxnError]Error 1105: runtime error: index out of range [7] with length 7"
      }
    }
  }
]
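
Note that the summary above reports `"state": "normal"` while also carrying an error. A minimal sketch for spotting that combination in saved list output follows; the function name is hypothetical and the input is simply the JSON shown above.

```python
import json

# The changefeed list output quoted above, pasted verbatim.
LIST_OUTPUT = """
[
  {
    "id": "tidboom-task",
    "summary": {
      "state": "normal",
      "tso": 426470011965276162,
      "checkpoint": "2021-07-21 15:53:08.515",
      "error": {
        "addr": "192.168.149.156:8302",
        "code": "CDC:ErrProcessorUnknown",
        "message": "[CDC:ErrMySQLTxnError]Error 1105: runtime error: index out of range [7] with length 7"
      }
    }
  }
]
"""


def changefeeds_with_errors(list_output):
    """Return (id, state, error code, reporting address) for every changefeed whose summary has an error."""
    results = []
    for feed in json.loads(list_output):
        summary = feed["summary"]
        error = summary.get("error")
        if error:  # "error": null parses to None and is skipped
            results.append((feed["id"], summary["state"], error["code"], error["addr"]))
    return results


if __name__ == "__main__":
    for feed_id, state, code, addr in changefeeds_with_errors(LIST_OUTPUT):
        print(f"{feed_id}: state={state} but error {code} reported by {addr}")
```

Checking the `error` field rather than `state` alone catches cases like this one, where the task still looks healthy at first glance.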

The log contains the following:

[2021/07/21 17:29:01.591 +08:00] [WARN] [mysql.go:903] ["failed to rollback txn"] [error="Error 1105: runtime error: index out of range [7] with length 7"]
[2021/07/21 17:29:01.591 +08:00] [WARN] [mysql.go:882] ["execute DMLs with error, retry later"] [error="[CDC:ErrMySQLTxnError]Error 1105: runtime error: index out of range [7] with length 7"] [errorVerbose="[CDC:ErrMySQLTxnError]Error 1105: runtime error: index out of range [7] with length 7\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByCause\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/normalize.go:279\ngithub.com/pingcap/ticdc/pkg/errors.WrapError\n\tgithub.com/pingcap/ticdc/pkg/errors/helper.go:28\ngithub.com/pingcap/ticdc/cdc/sink.(*mysqlSink).execDMLWithMaxRetries.func2.3\n\tgithub.com/pingcap/ticdc/cdc/sink/mysql.go:905\ngithub.com/pingcap/ticdc/cdc/sink.(*Statistics).RecordBatchExecution\n\tgithub.com/pingcap/ticdc/cdc/sink/statistics.go:99\ngithub.com/pingcap/ticdc/cdc/sink.(*mysqlSink).execDMLWithMaxRetries.func2\n\tgithub.com/pingcap/ticdc/cdc/sink/mysql.go:893\ngithub.com/pingcap/ticdc/pkg/retry.Run.func1\n\tgithub.com/pingcap/ticdc/pkg/retry/retry.go:32\ngithub.com/cenkalti/backoff.RetryNotify\n\tgithub.com/cenkalti/backoff@v2.2.1+incompatible/retry.go:37\ngithub.com/cenkalti/backoff.Retry\n\tgithub.com/cenkalti/backoff@v2.2.1+incompatible/retry.go:24\ngithub.com/pingcap/ticdc/pkg/retry.Run\n\tgithub.com/pingcap/ticdc/pkg/retry/retry.go:31\ngithub.com/pingcap/ticdc/cdc/sink.(*mysqlSink).execDMLWithMaxRetries\n\tgithub.com/pingcap/ticdc/cdc/sink/mysql.go:885\ngithub.com/pingcap/ticdc/cdc/sink.(*mysqlSink).execDMLs\n\tgithub.com/pingcap/ticdc/cdc/sink/mysql.go:1044\ngithub.com/pingcap/ticdc/cdc/sink.(*mysqlSinkWorker).run.func3\n\tgithub.com/pingcap/ticdc/cdc/sink/mysql.go:799\ngithub.com/pingcap/ticdc/cdc/sink.(*mysqlSinkWorker).run\n\tgithub.com/pingcap/ticdc/cdc/sink/mysql.go:820\ngithub.com/pingcap/ticdc/cdc/sink.(*mysqlSink).createSinkWorkers.func1\n\tgithub.com/pingcap/ticdc/cdc/sink/mysql.go:640\nruntime.goex
it\n\truntime/asm_amd64.s:1371"]

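1. Data is cached on disk when a memory limit is exceeded (there are two indicators: the memory allocation limit, 16G by default, and the percentage of available memory on the current system), or when slow downstream writes trigger flow control. To see whether the disk cache is being used, check the TiCDC monitoring item Unified Sorter - Unified Sorter on disk data size.
Note that the percentage refers to the whole system, that is, the memory used by all processes on the host and not by cdc alone. If only cdc's own memory usage were considered, then in a mixed deployment, or with other large applications competing for memory, an OOM would be quite likely and could even interrupt the user's other services, so this setting is fairly aggressive. If needed, consider raising sorter/max-memory-percentage in the config file to a higher percentage, for example 90.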
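
The conditions described above can be sketched roughly as follows. This is not TiCDC's actual code: the class, field and function names are invented for illustration, and it assumes the percentage setting acts as an upper bound on system-wide memory usage (which is why raising it to 90 would make spilling less likely). No default is assumed for the percentage.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class SorterLimits:
    # Hypothetical container for the two indicators mentioned in the reply.
    max_memory_percentage: int              # system-wide threshold, e.g. 90
    max_memory_consumption: int = 16 * GIB  # "16G by default" per the reply


def should_spill_to_disk(sorter_bytes, system_used_pct, flow_control_active, limits):
    """Return True if any of the described conditions for disk caching holds.

    sorter_bytes: memory currently held by the sorter (cdc only)
    system_used_pct: memory used by ALL processes on the host, in percent
    flow_control_active: True when slow downstream writes have triggered flow control
    """
    over_allocation = sorter_bytes > limits.max_memory_consumption
    over_system_pct = system_used_pct > limits.max_memory_percentage
    return over_allocation or over_system_pct or flow_control_active


if __name__ == "__main__":
    limits = SorterLimits(max_memory_percentage=90)
    # cdc itself uses only 2 GiB, but other processes push the host to 95%.
    print(should_spill_to_disk(2 * GIB, 95, False, limits))
```

The example call shows the point made in the reply: cdc can start writing temporary files even when its own memory use is small, because the percentage check looks at the whole host.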


For the 1105 error above, it would be best to provide the stderr file (from tidb-server).

I just reproduced the index problem.
I ran a column modification on the upstream:
alter table sbtest1 modify column pad varchar(90); (changing the column from varchar(100) to varchar(90))

The tidb-server log output is as follows:

[2021/07/29 17:43:50.585 +08:00] [ERROR] [conn.go:801] ["connection running loop panic"] [conn=1771] [lastSQL="DELETE FROM `oom`.`sbtest1` WHERE `id` = 280594 LIMIT 1;"] [err="runtime error: index out of range [7] with length 7"] [stack="goroutine 3234053 [running]:\ngithub.com/pingcap/tidb/server.(*clientConn).Run.func1(0x3bea358, 0xc002028030, 0xc0017ce700)\n\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/server/conn.go:799 +0xf5\npanic(0x352b260, 0xc000d57938)\n\t/usr/local/go/src/runtime/panic.go:965 +0x1b9\ngithub.com/pingcap/tidb/executor.(*ExecStmt).Exec.func1(0xc0004c7ee0, 0xc002350b10, 0xc002350af0)\n\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/executor/adapter.go:332 +0x4d4\npanic(0x352b260, 0xc000d57938)\n\t/usr/local/go/src/runtime/panic.go:965 +0x1b9\nencoding/binary.littleEndian.Uint64(...)\n\t/usr/local/go/src/encoding/binary/binary.go:77\ngithub.com/pingcap/tidb/util/rowcodec.decodeInt(0xc001847279, 0x7, 0xc7, 0x44812)\n\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/util/rowcodec/common.go:119 +0x94\ngithub.com/pingcap/tidb/util/rowcodec.(*ChunkDecoder).decodeColToChunk(0xc001cc0180, 0x1, 0xc0014dbcf8, 0xc001847279, 0x7, 0xc7, 0xc001564b90, 0x0, 0x0)\n\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/util/rowcodec/decoder.go:275 +0x113\ngithub.com/pingcap/tidb/util/rowcodec.(*ChunkDecoder).DecodeToChunk(0xc001cc0180, 0xc001847260, 0xd6, 0xe0, 0x3c165c8, 0xc001700d38, 0xc001564b90, 0x8, 0x8)\n\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/util/rowcodec/decoder.go:216 +0x2e5\ngithub.com/pingcap/tidb/executor.DecodeRowValToChunk(0x3c29b90, 0xc0011e5400, 0xc001564a00, 0xc001c3ef00, 0x3c165c8, 0xc001700d38, 0xc001847260, 0xd6, 0xe0, 0xc001564b90, 
...)\n\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/executor/point_get.go:495 +0x8e\ngithub.com/pingcap/tidb/executor.(*PointGetExecutor).Next(0xc001d09380, 0x3bea358, 0xc0021efec0, 0xc001564b90, 0xc001700e60, 0xc001700e50)\n\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/executor/point_get.go:285 +0x553\ngithub.com/pingcap/tidb/executor.Next(0x3bea358, 0xc0021efec0, 0x3bee718, 0xc001d09380, 0xc001564b90, 0x0, 0x0)\n\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/executor/executor.go:286 +0x2de\ngithub.com/pingcap/tidb/executor.(*DeleteExec).deleteSingleTableByChunk(0xc000c9b4a0, 0x3bea358, 0xc0021efec0, 0x0, 0x29205895130200)\n\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/executor/delete.go:94 +0x5ca\ngithub.com/pingcap/tidb/executor.(*DeleteExec).Next(0xc000c9b4a0, 0x3bea358, 0xc0021efec0, 0xc001564b40, 0x0, 0xc002350720)\n\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/executor/delete.go:50 +0x92\ngithub.com/pingcap/tidb/executor.Next(0x3bea358, 0xc0021efec0, 0x3bee118, 0xc000c9b4a0, 0xc001564b40, 0x0, 0x0)\n\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/executor/executor.go:286 +0x2de\ngithub.com/pingcap/tidb/executor.(*ExecStmt).handleNoDelayExecutor(0xc0004c7ee0, 0x3bea358, 0xc0021efec0, 0x3bee118, 0xc000c9b4a0, 0x0, 0x0, 0x0, 0x0)\n\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/executor/adapter.go:589 +0x2e7\ngithub.com/pingcap/tidb/executor.(*ExecStmt).handleNoDelay(0xc0004c7ee0, 0x3bea358, 0xc0021efec0, 0x3bee118, 0xc000c9b4a0, 0x5782800, 0x3bea301, 0x0, 0x0, 0x0, ...)\n\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/executor/adapter.go:470 
+0x1e5\ngithub.com/pingcap/tidb/executor.(*ExecStmt).Exec(0xc0004c7ee0, 0x3bea358, 0xc0021efec0, 0x0, 0x0, 0x0, 0x0)\n\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/executor/adapter.go:419 +0x707\ngithub.com/pingcap/tidb/session.runStmt(0x3bea358, 0xc0021efbf0, 0xc0011e5400, 0x3c00298, 0xc0004c7ee0, 0x0, 0x0, 0x0, 0x0)\n\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/session/ses"]

The TiCDC log is as follows:

[ERROR] [mysql.go:1045] ["execute DMLs failed"] [err="[CDC:ErrMySQLTxnError]sql: database is closed"]
[2021/07/29 17:44:12.031 +08:00] [INFO] [mysql.go:645] ["mysql sink receives redundant error"] [error="[CDC:ErrMySQLTxnError]sql: database is closed"] [errorVerbose="[CDC:ErrMySQLTxnError]sql: database is closed\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByCause\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/normalize.go:279\ngithub.com/pingcap/ticdc/pkg/errors.WrapError\n\tgithub.com/pingcap/ticdc/pkg/errors/helper.go:28\ngithub.com/pingcap/ticdc/cdc/sink.(*mysqlSink).execDMLWithMaxRetries.func2.3\n\tgithub.com/pingcap/ticdc/cdc/sink/mysql.go:896\ngithub.com/pingcap/ticdc/cdc/sink.(*Statistics).RecordBatchExecution\n\tgithub.com/pingcap/ticdc/cdc/sink/statistics.go:99\ngithub.com/pingcap/ticdc/cdc/sink.(*mysqlSink).execDMLWithMaxRetries.func2\n\tgithub.com/pingcap/ticdc/cdc/sink/mysql.go:893\ngithub.com/pingcap/ticdc/pkg/retry.Run.func1\n\tgithub.com/pingcap/ticdc/pkg/retry/retry.go:32\ngithub.com/cenkalti/backoff.RetryNotify\n\tgithub.com/cenkalti/backoff@v2.2.1+incompatible/retry.go:37\ngithub.com/cenkalti/backoff.Retry\n\tgithub.com/cenkalti/backoff@v2.2.1+incompatible/retry.go:24\ngithub.com/pingcap/ticdc/pkg/retry.Run\n\tgithub.com/pingcap/ticdc/pkg/retry/retry.go:31\ngithub.com/pingcap/ticdc/cdc/sink.(*mysqlSink).execDMLWithMaxRetries\n\tgithub.com/pingcap/ticdc/cdc/sink/mysql.go:885\ngithub.com/pingcap/ticdc/cdc/sink.(*mysqlSink).execDMLs\n\tgithub.com/pingcap/ticdc/cdc/sink/mysql.go:1044\ngithub.com/pingcap/ticdc/cdc/sink.(*mysqlSinkWorker).run.func3\n\tgithub.com/pingcap/ticdc/cdc/sink/mysql.go:799\ngithub.com/pingcap/ticdc/cdc/sink.(*mysqlSinkWorker).run\n\tgithub.com/pingcap/ticdc/cdc/sink/mysql.go:837\ngithub.com/pingcap/ticdc/cdc/sink.(*mysqlSink).createSinkWorkers.func1\n\tgithub.com/pingcap/ticdc/cdc/sink/mysql.go:640\nruntime.goexit\n\truntime/asm_amd64.s:1371"]

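Is the information above enough to tell where the problem is? This environment is newly built, and the data deletions I tested ran without problems. For the detailed topology, see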


Deleting data caused no problems, but after the lossy column change I ran into the problem above.

OK, I will report this.