tidb-server 在大并发下重启的优化思路

为提高效率,提问时请提供以下信息,问题描述清晰可优先响应。

  • 【TiDB 版本】:4.0.1
  • 【问题描述】:tidb-server大并发下重启

查看日志是执行了一条sql后重启的:
[2020/07/10 15:28:35.218 +08:00] [ERROR] [conn.go:728] [“command dispatched failed”] [conn=1173] [connInfo=“id:1173, addr:10.1.2.114:52602 status:10, collation:utf8_general_ci, user:guorou”] [command=Query] [status=“inTxn:0, autocommit:1”] [sql="\ SELECT \ temp.*,\ dense_rank () over (order by temp.num desc) as rank_no\ FROM\ (SELECT student_id,SUM(num) num ,u.username ,u. avatar_oss_id avatar_url\ FROM\ (SELECT student_id, SUM(num) num \ FROM coin_log \ WHERE clazz_plan_id = ‘75800’\ GROUP BY student_id\ UNION ALL\ SELECT student_id, SUM(-num) num \ FROM \ coin_consume_log \ WHERE clazz_plan_id = ‘75800’ AND type NOT IN (‘26’,‘42’)\ GROUP BY student_id\ ) as a\ LEFT JOIN users u ON u.id =a.student_id\ GROUP BY student_id\ ORDER BY num DESC LIMIT 0,10) temp\ “] [txn_mode=PESSIMISTIC] [err=”[tikv:9007]Execution terminated due to exceeding the deadline\ngithub.com/pingcap/errors.AddStack\ \t/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20190809092503-95897b64e011/errors.go:174\ github.com/pingcap/errors.Trace\ \t/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20190809092503-95897b64e011/juju_adaptor.go:15\ github.com/pingcap/tidb/executor.(*indexWorker).extractTaskHandles\ \t/home/jenkins/agent/workspace/build-linux-amd64-4.0-githash/go/src/github.com/pingcap/tidb/executor/distsql.go:713\ngithub.com/pingcap/tidb/executor.(*indexWorker).fetchHandles\ \t/home/jenkins/agent/workspace/build-linux-amd64-4.0-githash/go/src/github.com/pingcap/tidb/executor/distsql.go:670\ngithub.com/pingcap/tidb/executor.(*IndexLookUpExecutor).startIndexWorker.func1\ \t/home/jenkins/agent/workspace/build-linux-amd64-4.0-githash/go/src/github.com/pingcap/tidb/executor/distsql.go:485\ runtime.goexit\ \t/usr/local/go/src/runtime/asm_amd64.s:1357"]
[2020/07/10 15:29:51.707 +08:00] [INFO] [printer.go:42] [“Welcome to TiDB.”] [“Release Version”=v4.0.1] [Edition=Community] [“Git Commit Hash”=689a6b6439ae7835947fcaccf329a3fc303986cb] [“Git Branch”=heads/v4.0.1] [“UTC Build Time”=“2020-06-15 03:54:18”] [GoVersion=go1.13] [“Race Enabled”=false] [“Check Table Before Drop”=false] [“TiKV Min Version”=v3.0.0-60965b006877ca7234adaced7890d7b029ed1306]

若提问为性能优化、故障排查类问题,请下载脚本运行。终端输出打印结果,请务必全选并复制粘贴上传。

  1. 麻烦 看下 tidb_stderr.log 日志中是否有信息
  2. grep -i memory /var/log/messages 看下是否有 OOM 的情况

1.没有

2.内存溢出了
Jul 10 15:29:35 VM_8_2_centos kernel: [] out_of_memory+0x4b6/0x4f0
Jul 10 15:29:35 VM_8_2_centos kernel: Out of memory: Kill process 10511 (tidb-server) score 978 or sacrifice child
这个有办法避免吗?

  1. 可以设置 mem-quota-query 和 oom-action 的参数,当 SQL 内存使用达到 mem-quota-query 之后,采取 oom-action 参数设置的操作,oom-action 设置为 log 的话,会将 SQL 记录在日志中并继续执行,oom-action 设置为 cancel 的话,会中断 SQL 的执行
  2. 4.0 之后可以通过设置 oom-use-tmp-storage 和 tmp-storage-path 参数,当 SQL 达到 mem-quota-query 之后,会将某些算子的内存落到临时磁盘上

这个是单条sql,如果高并发场景刚好没达到阈值,多条的sql也有可能让tidb-server oom吧?是否通过总内存/连接数来设mem-quota-query可以达到不会oom的效果?

这个目前只能通过控制连接数的方式来控制,因为对于单个连接内的 SQL 是串行处理的,所以控制连接数据之后,只要 连接数*mem-quota-query 不会超过内存使用的话,应该就不会导致 tidb-server oom 了

tidb 没有连接数的限制,可以但是有并发数量的限制,并发数量可以通过 token-limit 参数进行控制,默认是 1000
image

我们的应用是弹性伸缩的,无法估算精确的连接数。有没有其他的方法阻止oom?

  1. 扩展 tidb,将连接分散到不同的tidb,降低内存使用
  2. 参考上面的 token-limit,限制参数的大小,减少同时并发的连接。超过参数的连接会等待。

第2点刚试过了没用,我设置了
mem-quota-query: 56636760
oom-use-tmp-storage: true
prepared-plan-cache.enabled: true
tikv-client.copr-cache.enable: true
tikv-client.grpc-connection-count: 64
tmp-storage-path: /tmp/tidb_tmp-storage
token-limit: 1000
但是高并发的时候还是会重启。

第1不能解决问题,因为肯定会存在重启之后我们才发现要加机器,如果突然来了流量tidb-server已经重启了,加了机器也不清楚能否抗住下一次的流量。所以想找一种比较靠谱的保护机制来防止tidb重启,当收到警告后我们可以迅速增加机器分流。

  1. token-limit: 1000 默认是 1000, 可以减少到几百试试
  2. 能否反馈下 问题开始到 OOM这段时间的 tidb.log 日志 和 slow log 日志,看下有没有哪个sql占用的内存较多,多谢。

有。

SELECT
temp.*,
dense_rank () over (
ORDER BY
temp.num DESC
) AS rank_no
FROM
(
SELECT
student_id,
SUM(num) num,
u.username,
u.avatar_oss_id avatar_url
FROM
(
SELECT
student_id,
SUM(num) num
FROM
coin_log
WHERE
clazz_plan_id = ‘75800’
GROUP BY
student_id
UNION
ALL
SELECT
student_id,
SUM(- num) num
FROM
coin_consume_log
WHERE
clazz_plan_id = ‘75800’
AND TYPE NOT IN (‘26’, ‘42’)
GROUP BY
student_id
) AS a
LEFT JOIN users u ON u.id = a.student_id
GROUP BY
student_id
ORDER BY
num DESC
LIMIT
0, 10
) temp;

执行计划为:

Projection_23 10.00 10 root time:378.037715ms, loops:2, Concurrency:OFF Column#27, Column#70, course.users.username, course.users.avatar_oss_id, Column#72 3.21875 KB N/A
└─Window_24 10.00 10 root time:378.033621ms, loops:2 dense_rank()->Column#72 over(order by Column#70 desc) N/A N/A
└─TopN_27 10.00 10 root time:378.003844ms, loops:2 Column#70:desc, offset:0, count:10 120.953125 KB N/A
└─HashAgg_39 34236.82 31710 root time:376.049611ms, loops:33, PartialConcurrency:4, FinalConcurrency:4 group by:Column#27, funcs:sum(Column#28)->Column#70, funcs:firstrow(Column#27)->Column#27, funcs:firstrow(course.users.username)->course.users.username, funcs:firstrow(course.users.avatar_oss_id)->course.users.avatar_oss_id 903.625 KB N/A
└─IndexMergeJoin_49 34236.82 31710 root time:356.886807ms, loops:33, Concurrency:OFF left outer join, inner:Projection_47, outer key:Column#27, inner key:course.users.id 1.5016155242919922 MB N/A
├─Union_54(Build) 34236.82 31710 root time:217.283145ms, loops:33 N/A N/A
│ ├─Projection_55 34235.82 31710 root time:217.181236ms, loops:33, Concurrency:4 course.coin_log.student_id, Column#26 299.515625 KB N/A
│ │ └─HashAgg_63 34235.82 31710 root time:217.370708ms, loops:33, PartialConcurrency:4, FinalConcurrency:4 group by:course.coin_log.student_id, funcs:sum(Column#76)->Column#26, funcs:firstrow(course.coin_log.student_id)->course.coin_log.student_id 9.128448486328125 MB N/A
│ │ └─TableReader_64 34235.82 198006 root time:189.389741ms, loops:9, rpc num: 8, rpc max:196.945704ms, min:102.275265ms, avg:138.294936ms, p80:193.111173ms, p95:196.945704ms, proc keys max:93308, p95:93308 data:HashAgg_56 3.402571678161621 MB N/A
│ │ └─HashAgg_56 34235.82 198006 cop[tikv] proc max:118ms, min:73ms, p80:110ms, p95:118ms, iters:729, tasks:8 group by:course.coin_log.student_id, funcs:sum(course.coin_log.num)->Column#76 N/A N/A
│ │ └─Selection_62 681304.50 694172 cop[tikv] proc max:95ms, min:51ms, p80:91ms, p95:95ms, iters:729, tasks:8 eq(course.coin_log.clazz_plan_id, 75800) N/A N/A
│ │ └─TableFullScan_61 745308.00 742274 cop[tikv] table:coin_log proc max:94ms, min:50ms, p80:86ms, p95:94ms, iters:729, tasks:8 keep order:false N/A N/A
│ └─Projection_75 1.00 0 root time:946.458µs, loops:1, Concurrency:OFF course.coin_consume_log.student_id, Column#13 1.7578125 KB N/A
│ └─HashAgg_78 1.00 0 root time:938.3µs, loops:1, PartialConcurrency:4, FinalConcurrency:4 group by:Column#86, funcs:sum(Column#84)->Column#13, funcs:firstrow(Column#85)->course.coin_consume_log.student_id 16.96875 KB N/A
│ └─Projection_91 0.33 0 root time:880.378µs, loops:1, Concurrency:OFF cast(unaryminus(course.coin_consume_log.num), decimal(65,0) BINARY)->Column#84, course.coin_consume_log.student_id, course.coin_consume_log.student_id 1.453125 KB N/A
│ └─TableReader_87 0.33 0 root time:873.115µs, loops:1, rpc num: 8, rpc max:1.944394ms, min:1.55688ms, avg:1.764651ms, p80:1.910211ms, p95:1.944394ms, proc keys max:128, p95:128 data:Selection_86 208 Bytes N/A
│ └─Selection_86 0.33 0 cop[tikv] proc max:1ms, min:0s, p80:0s, p95:1ms, iters:23, tasks:8 eq(course.coin_consume_log.clazz_plan_id, 75800), not(in(course.coin_consume_log.type, 26, 42)) N/A N/A
│ └─TableFullScan_85 871.00 871 cop[tikv] table:coin_consume_log proc max:1ms, min:0s, p80:0s, p95:1ms, iters:23, tasks:8 keep order:false, stats:pseudo N/A N/A
└─Projection_47(Probe) 1.00 31710 root time:394.182102ms, loops:39, Concurrency:OFF course.users.id, course.users.username, course.users.avatar_oss_id 82.171875 KB N/A
└─IndexLookUp_46 1.00 31710 root time:394.097703ms, loops:39, rpc num: 358, rpc max:14.432859ms, min:393.881µs, avg:2.133749ms, p80:2.287549ms, p95:6.928328ms, proc keys max:1951, p95:593 110.171875 KB N/A
├─IndexRangeScan_44(Build) 1.00 31710 cop[tikv] table:u, index:PRIMARY(id) proc max:9ms, min:0s, p80:1ms, p95:4ms, iters:587, tasks:358 range: decided by [eq(course.users.id, Column#27)], keep order:true N/A N/A
└─TableRowIDScan_45(Probe) 1.00 31710 cop[tikv] table:u proc max:12ms, min:0s, p80:2ms, p95:6ms, iters:679, tasks:357 keep order:false N/A N/A
  1. 这个sql看起来估算只有9M的消耗, 请问您的tidb-server内存有多少?
  2. 麻烦您上传日志我看下,多谢。

每台tidb-server 32g内存,总共四台。
https://test.dianbo.zycourse.com/tidb/logtidb.log

看起来有一个sql有问题,导致一直在执行。这里的 subquery 应该只能返回一行,但是返回了多行,麻烦检查下sql多谢。

[2020/07/13 23:28:18.515 +08:00] [ERROR] [conn.go:728] [“command dispatched failed”] [conn=50571] [connInfo=“id:50571, addr:10.6.0.48:58806 status:10, collation:latin1_swedish_ci, user:liaojianqiu”] [command=Query] [status=“inTxn:0, autocommit:1”] [sql=“SELECT\ user.id AS id,\ user.username AS username,\ user.phone AS phone,\ user.role AS role,\ user.verify_code AS verify_code,\ user.verify_time AS verify_time,\ user.is_verified AS is_verified,\ user.avatar_oss_id AS avatar_oss_id,\ user.access_token AS access_token,\ user.is_staff AS is_staff,\ user.status AS status,\ user.school_id AS school_id,\ user.school_name AS school_name,\ user.score AS score,\ user.grade AS grade,\ user.sex AS sex,\ user.real_name AS real_name,\ user.create_time AS create_time,\ user.is_destroy AS is_destroy,\ user.destroy_time AS destroy_time,\ user.register_activity AS register_activity,\ user.register_source AS register_source,\ user.ppt_id AS ppt_id,\ user.member_id AS member_id,\ user.secret AS secret,\ user.hash AS hash,\ (select official_a.id from users_official_account official_a where user.id = official_a.user_id AND official_a.status = 1) AS official_a__id,\ (select official_a.wx_open_id from users_official_account official_a where user.id = official_a.user_id AND official_a.status = 1) AS official_a__wx_open_id,\ (select official_a.wx_username from users_official_account official_a where user.id = official_a.user_id AND official_a.status = 1) AS official_a__wx_username,\ (select official_a.wx_avatar from users_official_account official_a where user.id = official_a.user_id AND official_a.status = 1) AS official_a__wx_avatar,\ (select official_a.wx_phone from users_official_account official_a where user.id = official_a.user_id AND official_a.status = 1) AS official_a__wx_phone\ FROM\ users user \ WHERE\ 1 = 1\ AND user.role = 3\ AND user.member_id = ‘11100060923’;”] [txn_mode=OPTIMISTIC] [err="subquery returns more than 1 row

感谢帮忙这么仔细的查看,这条是我测试的sql。不是引起重启的原因。白天我特意重现过这个场景(使用我发的那条sql压测):


在这时间附近范围内的应该才是重启原因的sql。

  1. 能否麻烦您重新压测一次,当内存开始上涨时,采集火焰图信息 , 可以多采集几次,多谢
    https://docs.pingcap.com/zh/tidb/stable/dashboard-profiling#tidb-dashboard-实例性能分析页面

我看日志大多数是被oom killer kill掉的。这时候tidb实例已经重启了,很难采集到上涨时候的。出现了如下报错:


我们测试集群已经回收机器了。虽然有报错但是在压力不断的情况下我收集了三次:
profiling_pack_1 (2).tar.gz (34.9 KB) profiling_pack_2 (1).tar.gz (32 字节) profiling_pack_3.tar.gz (35.0 KB) profiling_pack_2 (1).tar.gz (32 字节)
目前还是存在并发重启的情况。

抱歉,这个采集到的,没看出来占用的内存信息,能否麻烦您参考这个帖子,采集内存的profile文件,多谢

可以在大并发,内存上涨的过程中采集。

好的,目前测试环境还留有两台tidb,我抓一下火焰图。目前就体现出的问题是一致的,sql也是有用到group by。他那个问题现在修复了吗?

profile:
heap.profile (55.7 KB)

  1. 查看这个profile感觉不大一样,在采集时内存消耗了多少?

  1. 感觉还是需要一个您这边压测时随时观察的profile,如果可以有一个快要达到OOM时的profile更好,多谢。