invalid connection 问题

为提高效率,提问时请提供以下信息,问题描述清晰可优先响应。

  • 【TiDB 版本】:4.0 DM 1.0.6
  • 【问题描述】:task任务莫名暂停,请帮忙看看为何会提示无效连接。谢谢。
    query-status 结果
    “Message”: “flush checkpoint (mysql-bin|000001.003372, 212976727)(flushed (mysql-bin|000001.003372, 212976727)): execute statement failed: begin: invalid connection”,
    “RawCause”: “invalid connection”

task状态
image

dm-worker日志
[2020/07/29 03:20:42.831 +08:00] [INFO] [subtask.go:266] [“unit process returned”] [subtask=client_building_visited] [unit=Sync] [stage=Paused] [status=“{"totalEvents":6674748,"totalTps":3530,"recentTps":514,"masterBinlog":"(mysql-bin.003372, 156496925)","masterBinlogGtid":"56b5a383-aa4f-11ea-a59e-fa2c42ff9900:1-1324259,56fd8730-aa4f-11ea-a7c1-fa3283cd2000:1","syncerBinlog":"(mysql-bin|000001.003361, 145357969)"}”]
[2020/07/29 03:20:42.831 +08:00] [ERROR] [subtask.go:285] [“unit process error”] [subtask=client_building_visited] [unit=Sync] [“error information”=“{"msg":"current pos (mysql-bin|000001.003361, 208344628): flush checkpoint (mysql-bin|000001.003361, 145357969)(flushed (mysql-bin|000001.003361, 145357969)): execute statement failed: begin: invalid connection","error":{"ErrCode":10006,"ErrClass":1,"ErrLevel":3,"Message":"current pos (mysql-bin|000001.003361, 208344628): flush checkpoint (mysql-bin|000001.003361, 145357969)(flushed (mysql-bin|000001.003361, 145357969)): execute statement failed: begin: invalid connection","RawCause":"invalid connection"}}”]
[2020/07/29 03:20:42.840 +08:00] [WARN] [db.go:285] [“execute transaction”] [task=client_building_relation] [unit=“binlog replication”] [query=“[UPDATE hdb_broker.client_building_relation SET r_build_id = ?, id = ?, broker_id = ?, client_id = ?, building_id = ?, client_status = ?, client_status_old = ?, is_locked = ?, locked_at = ?, unlocked_at = ?, is_visited = ?, is_signed = ?, protect_time = ?, protect_time_old = ?, creater = ?, create_time = ?, create_time_old = ?, remark = ?, city_id = ?, is_remind = ?, mingyuan_status = ?, exhibition_room_id = ?, client_building_id = ?, visite_company = ?, visite_company_id = ?, visite_place = ?, visite_place_id = ?, client_name = ?, is_id_card = ?, client_id_card = ?, activate_sign = ?, activate_sign_old = ?, bespeak_time = ?, source_id = ?, recommend_source = ?, platform = ?, health_intention = ?, employee_mark = ?, is_black = ?, org_id = ?, org_name = ?, user_type = ?, parent_user_type = ?, update_time = ?, update_time_old = ?, is_special = ?, owner_flag = ?, is_h5_delete = ?, client_phone = ?, is_tuoke = …”] [argument=“[[12765639 d2d8d5eb7edd432e98e958 ec941b0b9764450581cf0dd7d9 0f57f58b2a5d2ba0762f f692a285bb2a495dfdea -20 -20 1 1 1 2020-06-25 2020-06-25 付 2020-05-26 17:23:36 2020-05-26 17:23:36 李 1 08569x 常规通道 0 其他 员工 2020-07-29 02:12:34 0 0 15285538917 0 0 0 0 1276390991506579459 1276366996690313225 0 2020-06-26 00:42:42 58.16.228.1 d2d8d5eb7edd432e98e9562291e744f8] [127656451502 edcbc0eb69884 4160e67a16724d8e9667530a7085a9a5 605f3dded62a3932e28ff8f29f 4cbfe40729d8c201f2aa01356c6 -60 -60 1 1 1 2020-01-19 2020-01-19 罗 2019-12-19 09:48:37 2019-12-19 09:48:37 成雯 1 134976 >常\ufffd\ufffd…”] [“cost time”=1.622802032s]

你好

辛苦将 dm-worker log 提供一个更加完整的版本。可以通过附件进行上传

dm-work.log.gz (2.6 MB)

已上传,还请看看,谢谢。

query-status task-name 看下结果

如下:

 query-status  client_building_relation
{
    "result": true,
    "msg": "",
    "workers": [
        {
            "result": true,
            "worker": "xxx:8263",
            "msg": "",
            "subTaskStatus": [
                {
                    "name": "client_building_relation",
                    "stage": "Paused",
                    "unit": "Sync",
                    "result": {
                        "isCanceled": false,
                        "errors": [
                            {
                                "Type": "UnknownError",
                                "msg": "",
                                "error": {
                                    "ErrCode": 10005,
                                    "ErrClass": 1,
                                    "ErrScope": 2,
                                    "ErrLevel": 3,
                                    "Message": "current pos (mysql-bin|000001.003377, 207955272): query statement failed: SHOW COLUMNS FROM `hdb_broker`.`client_building_relation`: invalid connection",
                                    "RawCause": "invalid connection"
                                }
                            }
                        ],
                        "detail": null
                    },
                    "unresolvedDDLLockID": "",
                    "sync": {
                        "totalEvents": "7195154",
                        "totalTps": "119919",
                        "recentTps": "0",
                        "masterBinlog": "(mysql-bin.003377, 208848574)",
                        "masterBinlogGtid": "56b5a383-aa4f-11ea-a59e-fa2c42ff9900:1-1326077,56fd8730-aa4f-11ea-a7c1-fa3283cd2000:1",
                        "syncerBinlog": "(mysql-bin|000001.003377, 207954387)",
                        "syncerBinlogGtid": "",
                        "blockingDDLs": [
                        ],
                        "unresolvedGroups": [
                        ],
                        "synced": false
                    }
                }
            ],
            "relayStatus": {
                "masterBinlog": "(mysql-bin.003377, 208848574)",
                "masterBinlogGtid": "56b5a383-aa4f-11ea-a59e-fa2c42ff9900:1-1326077,56fd8730-aa4f-11ea-a7c1-fa3283cd2000:1",
                "relaySubDir": "56fd8730-aa4f-11ea-a7c1-fa3283cd2000.000001",
                "relayBinlog": "(mysql-bin.003377, 208848574)",
                "relayBinlogGtid": "56b5a383-aa4f-11ea-a59e-fa2c42ff9900:1-775208",
                "relayCatchUpMaster": true,
                "stage": "Running",
                "result": null
            },
            "sourceID": "mysql-replica-07"
        },
        {
            "result": true,
            "worker": "xxx:8262",
            "msg": "",
            "subTaskStatus": [
                {
                    "name": "client_building_relation",
                    "stage": "Running",
                    "unit": "Sync",
                    "result": null,
                    "unresolvedDDLLockID": "",
                    "sync": {
                        "totalEvents": "8573905",
                        "totalTps": "285796",
                        "recentTps": "0",
                        "masterBinlog": "(mysql-bin.003832, 17576102)",
                        "masterBinlogGtid": "56e32b6c-aa4f-11ea-a507-fa8817860200:1,57136ecc-aa4f-11ea-b12b-fad8cb202900:1-1568631",
                        "syncerBinlog": "(mysql-bin|000001.003832, 17572386)",
                        "syncerBinlogGtid": "",
                        "blockingDDLs": [
                        ],
                        "unresolvedGroups": [
                        ],
                        "synced": false
                    }
                }
            ],
            "relayStatus": {
                "masterBinlog": "(mysql-bin.003832, 17576102)",
                "masterBinlogGtid": "56e32b6c-aa4f-11ea-a507-fa8817860200:1,57136ecc-aa4f-11ea-b12b-fad8cb202900:1-1568631",
                "relaySubDir": "56e32b6c-aa4f-11ea-a507-fa8817860200.000001",
                "relayBinlog": "(mysql-bin.003832, 17576102)",
                "relayBinlogGtid": "57136ecc-aa4f-11ea-b12b-fad8cb202900:1-1006962",
                "relayCatchUpMaster": true,
                "stage": "Running",
                "result": null
            },
            "sourceID": "mysql-replica-06"
        }
    ]
}

现在执行的报错是这种

                "stage": "Paused",
                "unit": "Sync",
                "result": {
                    "isCanceled": false,
                    "errors": [
                        {
                            "Type": "UnknownError",
                            "msg": "",
                            "error": {
                                "ErrCode": 10006,
                                "ErrClass": 1,
                                "ErrScope": 0,
                                "ErrLevel": 3,
                                "Message": "flush checkpoint (mysql-bin|000001.003377, 209659560)(flushed (mysql-bin|000001.003377, 209659560)): execute statement failed: begin: invalid connection",
                                "RawCause": "invalid connection"

辛苦看下 stop-task,start-task 是否可以解决。这边

这边看是执行 SHOW COLUMNS FROM hdb_broker.client_building_relation 出现无效链接。

还一直如此
image

上面 log tar gz 中看到有重复键值的错误,新上传下本次报错的 query-status 和 dm-worker log 吧。

[2020/07/29 10:06:45.254 +08:00] [ERROR] [db.go:295] [“execute statements failed after retry”] [task=client_building_visited] [unit=“binlog replication”] [queries=“[INSERT INTO dm_meta.client_building_visited_syncer_checkpoint (id, cp_schema, cp_table, binlog_name, binlog_pos, is_global) VALUES(?,?,?,?,?,?) ON DUPLICATE KEY UPDATE binlog_name=?, binlog_pos=?]”] [arguments=“[[mysql-replica-07 mysql-bin|000001.003373 400543443 true mysql-bin|000001.003373 400543443]]”] [error=“[code=10006:class=database:scope=not-set:level=high] execute statement failed: begin: invalid connection”]

目前的话,是client_building_relation、client_building_visited2个任务都是一样的现象dm-work.log (279.0 KB)

query-status client_building_relation
{
“result”: true,
“msg”: “”,
“workers”: [
{
“result”: true,
“worker”: “xxx:8263”,
“msg”: “”,
“subTaskStatus”: [
{
“name”: “client_building_relation”,
“stage”: “Paused”,
“unit”: “Sync”,
“result”: {
“isCanceled”: false,
“errors”: [
{
“Type”: “UnknownError”,
“msg”: “”,
“error”: {
“ErrCode”: 10006,
“ErrClass”: 1,
“ErrScope”: 0,
“ErrLevel”: 3,
“Message”: “flush checkpoint (mysql-bin|000001.003383, 245703543)(flushed (mysql-bin|000001.003383, 245703543)): execute statement failed: begin: invalid connection”,
“RawCause”: “invalid connection”
}
}
],
“detail”: null
},
“unresolvedDDLLockID”: “”,
“sync”: {
“totalEvents”: “2230073”,
“totalTps”: “37167”,
“recentTps”: “0”,
“masterBinlog”: “(mysql-bin.003383, 245835909)”,
“masterBinlogGtid”: “56b5a383-aa4f-11ea-a59e-fa2c42ff9900:1-1327747,56fd8730-aa4f-11ea-a7c1-fa3283cd2000:1”,
“syncerBinlog”: “(mysql-bin|000001.003383, 245703543)”,
“syncerBinlogGtid”: “”,
“blockingDDLs”: [
],
“unresolvedGroups”: [
],
“synced”: false
}
}
],
“relayStatus”: {
“masterBinlog”: “(mysql-bin.003383, 245835909)”,
“masterBinlogGtid”: “56b5a383-aa4f-11ea-a59e-fa2c42ff9900:1-1327747,56fd8730-aa4f-11ea-a7c1-fa3283cd2000:1”,
“relaySubDir”: “56fd8730-aa4f-11ea-a7c1-fa3283cd2000.000001”,
“relayBinlog”: “(mysql-bin.003383, 245835909)”,
“relayBinlogGtid”: “56b5a383-aa4f-11ea-a59e-fa2c42ff9900:1-775208”,
“relayCatchUpMaster”: true,
“stage”: “Running”,
“result”: null
},
“sourceID”: “mysql-replica-07”
},
{
“result”: true,
“worker”: “xxx:8262”,
“msg”: “”,
“subTaskStatus”: [
{
“name”: “client_building_relation”,
“stage”: “Paused”,
“unit”: “Sync”,
“result”: {
“isCanceled”: false,
“errors”: [
{
“Type”: “UnknownError”,
“msg”: “”,
“error”: {
“ErrCode”: 10006,
“ErrClass”: 1,
“ErrScope”: 0,
“ErrLevel”: 3,
“Message”: “flush checkpoint (mysql-bin|000001.003837, 328497462)(flushed (mysql-bin|000001.003837, 328497462)): execute statement failed: begin: invalid connection”,
“RawCause”: “invalid connection”
}
}
],
“detail”: null
},
“unresolvedDDLLockID”: “”,
“sync”: {
“totalEvents”: “2022205”,
“totalTps”: “22468”,
“recentTps”: “0”,
“masterBinlog”: “(mysql-bin.003837, 328603655)”,
“masterBinlogGtid”: “56e32b6c-aa4f-11ea-a507-fa8817860200:1,57136ecc-aa4f-11ea-b12b-fad8cb202900:1-1570294”,
“syncerBinlog”: “(mysql-bin|000001.003837, 328497462)”,
“syncerBinlogGtid”: “”,
“blockingDDLs”: [
],
“unresolvedGroups”: [
],
“synced”: false
}
}
],
“relayStatus”: {
“masterBinlog”: “(mysql-bin.003837, 328603655)”,
“masterBinlogGtid”: “56e32b6c-aa4f-11ea-a507-fa8817860200:1,57136ecc-aa4f-11ea-b12b-fad8cb202900:1-1570294”,
“relaySubDir”: “56e32b6c-aa4f-11ea-a507-fa8817860200.000001”,
“relayBinlog”: “(mysql-bin.003837, 328603655)”,
“relayBinlogGtid”: “57136ecc-aa4f-11ea-b12b-fad8cb202900:1-1006962”,
“relayCatchUpMaster”: true,
“stage”: “Running”,
“result”: null
},
“sourceID”: “mysql-replica-06”
}
]
}

为啥感觉是网路问题呢。在文档的 FAQ 中对于 invalid connection 的定义也是类似的,

辛苦再检查下 tidb-server 是否有发生重启 并且 tikv-server 是否有 tikv is busy 的 log 呢?

https://pingcap.com/docs-cn/tidb-data-migration/stable/error-handling/

tidb没有重启过;也搜了tikv最近半小时所有级别的日志,也没有 busy 关键字。


image

invalid connection 的问题属于可以自动恢复的错误,看日志数据也在同步

搜关键字 “dispatch auto resume task” 可以看到 client_building_relation 和 client_building_visited,平均 5 分钟左右触发一次该错误,是不是下游有什么超时设置,导致5分钟断开一次连接?:thinking:建议沿路排查一下设置。包括

  1. TiDB wait timeout 和 interactive_timeout
  2. Linux 服务器内核参数中 tcp 超时时间的设置
  3. 链路上是否有 HAProxy 之类的代理设置
  1. TiDB wait timeout 和 interactive_timeout
    – 在系统变量中并没有看到TiDB wait timeout ,只看到这个tidb_wait_split_region_timeout感觉比较接近。
    interactive_timeout | 28800
    tidb_wait_split_region_timeout | 300

  2. Linux 服务器内核参数中 tcp 超时时间的设置
    – net.ipv4.tcp_syn_retries = 6 #timeout是127秒

  3. 链路上是否有 HAProxy 之类的代理设置
    – 应该是没有这类代理的

各位大哥,还能抢救吗?:sob:

show variables like ‘interactive_timeout’ ;

show variables like ‘wait_timeout’ ;

这个需要一个确定的答复。

| interactive_timeout | 28800 |
| wait_timeout | 28800 |

链路的话,我们task任务里面的host ip ,用的不是tidb的ip,而是一个负载均衡的ip。是否和这个有关?

可以说明下负载均衡插件是什么。

负载均衡的 time out 尝试调整的大一点,或者直连 tidb

用的是云lvx。 我先用直连的方法在重试,先观察半小时,晚点反馈情况

好的,这边有实践,LB 的连接参数问题导致连接波动,这个根据需要可以调整下