python脚本请求tidb偶尔会报error totally whack

为提高效率,提问时请提供以下信息,问题描述清晰可优先响应。

  • 【TiDB 版本】:
#./tidb-server -V
Release Version: v3.0.5
Git Commit Hash: c9000abdc216b6a02efbcc578af8be1f98ba280d
Git Branch: HEAD
UTC Build Time: 2019-10-25 03:25:32
GoVersion: go version go1.13 linux/amd64
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false
  • 【问题描述】: python脚本请求tidb偶尔会报: (-1, ‘error totally whack’)
root 09:07:  [(none)]> show variables like 'show_compatibility_56';
+-----------------------+-------+
| Variable_name         | Value |
+-----------------------+-------+
| show_compatibility_56 | ON    |
+-----------------------+-------+
1 row in set (0.00 sec)

info_collecting.py脚本收集信息如下: {‘tidb_log_dir’: ‘{{ deploy_dir }}/log’, ‘dummy’: None, ‘tidb_port’: 4000, ‘tidb_status_port’: 10080, ‘tidb_cert_dir’: ‘{{ deploy_dir }}/conf/ssl’}

系统信息
+-------------------------------------------------------------+----------------------------+
|                             Host                            |          Release           |
+-------------------------------------------------------------+----------------------------+
|                   iZ2ze9crs29jye4j71gs2rZ                   | 3.10.0-957.21.3.el7.x86_64 |
|                   iZ2ze9crs29jye4j71gs2qZ                   | 3.10.0-957.21.3.el7.x86_64 |
|                   iZ2ze3bugticyn9t6l2qa7Z                   | 3.10.0-1062.1.2.el7.x86_64 |
|                   iZ2ze3bugticyn9t6l2qa9Z                   | 3.10.0-1062.1.2.el7.x86_64 |
|                   iZ2ze3bugticyn9t6l2qaaZ                   | 3.10.0-1062.1.2.el7.x86_64 |
|                   iZ2ze3bugticyn9t6l2qa8Z                   | 3.10.0-1062.1.2.el7.x86_64 |
| aliyun-x-x-101-88.dba-member-mysql.intra.x.com | 3.10.0-957.5.1.el7.x86_64  |
| aliyun-x-x-101-89.dba-member-mysql.intra.x.com | 3.10.0-957.5.1.el7.x86_64  |
| aliyun-x-x-101-90.dba-member-mysql.intra.x.com | 3.10.0-957.5.1.el7.x86_64  |
| aliyun-x-x-101-91.dba-member-mysql.intra.x.com | 3.10.0-957.5.1.el7.x86_64  |
| aliyun-x-x-101-92.dba-member-mysql.intra.x.com | 3.10.0-957.5.1.el7.x86_64  |
| aliyun-x-x-101-93.dba-member-mysql.intra.x.com | 3.10.0-957.5.1.el7.x86_64  |
| aliyun-x-x-101-94.dba-member-mysql.intra.x.com | 3.10.0-957.5.1.el7.x86_64  |
| aliyun-x-x-101-95.dba-member-mysql.intra.x.com | 3.10.0-957.5.1.el7.x86_64  |
| aliyun-x-x-101-96.dba-member-mysql.intra.x.com | 3.10.0-957.5.1.el7.x86_64  |
| aliyun-x-x-101-98.dba-member-mysql.intra.x.com | 3.10.0-957.5.1.el7.x86_64  |
|                   iZ2ze2d8u10y3fuv5c6w0yZ                   | 3.10.0-957.5.1.el7.x86_64  |
|                   iZ2zefyawbj0bgay9j0hzrZ                   | 3.10.0-1062.1.2.el7.x86_64 |
|                   iZ2zefyawbj0bgay9j0hzqZ                   | 3.10.0-1062.1.2.el7.x86_64 |
|                   iZ2zefyawbj0bgay9j0hztZ                   | 3.10.0-1062.1.2.el7.x86_64 |
|                   iZ2zefyawbj0bgay9j0hzsZ                   | 3.10.0-1062.1.2.el7.x86_64 |
|                   iZ2ze7w3f21ixhxa2c0jc8Z                   | 3.10.0-1062.1.2.el7.x86_64 |
|                   iZ2ze2fbqxna6nok8sw6jhZ                   | 3.10.0-957.5.1.el7.x86_64  |
|                   iZ2ze7w3f21ixhxa2c0jc7Z                   | 3.10.0-1062.1.2.el7.x86_64 |
|                   iZ2zei12ill6tpugwlbldsZ                   | 3.10.0-957.21.3.el7.x86_64 |
|                   iZ2zei12ill6tpugwlbldtZ                   | 3.10.0-957.21.3.el7.x86_64 |
+-------------------------------------------------------------+----------------------------+
TiDB 集群信息
+--------------------+--------------+------+----+------+
|    TiDB_version    | Clu_replicas | TiDB | PD | TiKV |
+--------------------+--------------+------+----+------+
| 5.7.25-TiDB-v3.0.5 |      3       |  6   | 3  |  17  |
+--------------------+--------------+------+----+------+
集群节点信息
+-------------+-------------+
|   Node_IP   | Server_info |
+-------------+-------------+
|  instance_8 |     tikv    |
|  instance_9 |     tikv    |
|  instance_0 |     tidb    |
| instance_10 |     tikv    |
| instance_11 |     tikv    |
| instance_12 |     tikv    |
|  instance_1 |     tikv    |
| instance_13 |     tikv    |
| instance_14 |     tikv    |
| instance_15 |     tikv    |
| instance_16 |     tikv    |
| instance_17 |     tikv    |
| instance_18 |     tikv    |
|  instance_4 |   tidb+pd   |
|  instance_5 |     tikv    |
| instance_19 |     tikv    |
| instance_20 |     tikv    |
|  instance_2 |   tidb+pd   |
| instance_21 |     tikv    |
| instance_22 |     tikv    |
|  instance_3 |   pd+tidb   |
|  instance_6 |     tidb    |
|  instance_7 |     tidb    |
+-------------+-------------+
容量 & region 数量
+---------------------+-----------------+--------------+
| Storage_capacity_GB | Storage_uesd_GB | Region_count |
+---------------------+-----------------+--------------+
|       24538.08      |     6531.16     |    349642    |
+---------------------+-----------------+--------------+
QPS
+---------+----------------+-----------------+
| Clu_QPS | Duration_99_MS | Duration_999_MS |
+---------+----------------+-----------------+
| 1574.29 |      7.91      |       7.97      |
+---------+----------------+-----------------+
热点 region 信息
+---------------+----------+-----------+
|     Store     | Hot_read | Hot_write |
+---------------+----------+-----------+
| store-1087487 |    0     |     0     |
| store-1005501 |    0     |     0     |
|  store-128620 |    0     |     0     |
| store-2367029 |    0     |     1     |
| store-1108426 |    0     |     0     |
| store-1087486 |    0     |     0     |
| store-2367032 |    0     |     0     |
| store-2367033 |    0     |     1     |
| store-2367030 |    0     |     0     |
| store-2367031 |    0     |     0     |
| store-2367036 |    1     |     0     |
| store-2367034 |    0     |     0     |
| store-2367035 |    0     |     0     |
|    store-1    |    0     |     0     |
| store-1132575 |    0     |     0     |
| store-1132576 |    0     |     0     |
|    store-5    |    0     |     1     |
|    store-4    |    0     |     0     |
| store-1108427 |    0     |     0     |
+---------------+----------+-----------+
磁盘延迟信息
+--------+-------------+-------------+--------------+
| Device |   Instance  | Read_lat_MS | Write_lat_MS |
+--------+-------------+-------------+--------------+
|  md0   |  instance_8 |     0.00    |     0.00     |
|  md0   |  instance_9 |     0.00    |     0.00     |
|  vda   | instance_17 |     nan     |     1.36     |
|  vda   | instance_18 |     nan     |     2.21     |
|  vda   | instance_13 |     nan     |     0.00     |
|  vda   | instance_14 |     nan     |     0.00     |
|  vda   | instance_15 |     nan     |     0.00     |
|  vda   | instance_16 |     nan     |     0.00     |
|  vda   | instance_10 |     nan     |     0.50     |
|  vda   | instance_11 |     nan     |     0.50     |
|  vda   |  instance_5 |     nan     |     7.29     |
|  vda   |  instance_4 |     nan     |     1.50     |
|  vda   |  instance_3 |     nan     |     0.25     |
|  vda   |  instance_2 |     nan     |     1.11     |
|  vda   |  instance_1 |     nan     |     0.50     |
|  vda   |  instance_0 |     nan     |    11.69     |
|  vda   | instance_22 |     nan     |     0.33     |
|  vda   | instance_21 |     nan     |     0.75     |
|  vda   | instance_20 |     nan     |     0.90     |
|  vda   | instance_19 |     nan     |     0.83     |
|  vda   |  instance_6 |     nan     |     0.50     |
|  vda   | instance_12 |     nan     |     nan      |
|  vda   |  instance_7 |     nan     |     0.67     |
|  vda   |  instance_8 |     nan     |     0.67     |
|  vda   |  instance_9 |     nan     |    19.00     |
|  vdb   | instance_17 |     1.57    |     0.49     |
|  vdb   | instance_18 |     1.59    |     0.47     |
|  vdb   | instance_13 |     0.46    |     0.03     |
|  vdb   | instance_14 |     0.19    |     0.03     |
|  vdb   | instance_15 |     0.17    |     0.03     |
|  vdb   | instance_16 |     0.20    |     0.03     |
|  vdb   | instance_10 |     0.28    |     0.05     |
|  vdb   | instance_11 |     1.03    |     0.07     |
|  vdb   |  instance_5 |     0.72    |     0.04     |
|  vdb   |  instance_4 |     nan     |     1.27     |
|  vdb   |  instance_3 |     nan     |     1.70     |
|  vdb   |  instance_2 |     nan     |     1.12     |
|  vdb   |  instance_1 |     0.15    |     0.06     |
|  vdb   |  instance_0 |     nan     |     0.73     |
|  vdb   | instance_22 |     0.17    |     0.03     |
|  vdb   | instance_21 |     0.37    |     0.04     |
|  vdb   | instance_20 |     0.17    |     0.04     |
|  vdb   | instance_19 |     0.19    |     0.03     |
|  vdb   |  instance_6 |     nan     |     0.78     |
|  vdb   | instance_12 |     0.20    |     0.03     |
|  vdb   |  instance_7 |     nan     |     0.67     |
|  vdb   |  instance_8 |     0.54    |     0.05     |
|  vdb   |  instance_9 |     0.21    |     0.04     |
|  vdc   |  instance_8 |     0.57    |     0.05     |
|  vdc   |  instance_9 |     0.24    |     0.04     |
+--------+-------------+-------------+--------------+

怀疑是GC导致的,请问有什么好解决方法吗?

[2020/01/14 06:33:51.268 +08:00] [WARN] [conn.go:682] ["dispatch error"] [conn=4594366] [connInfo="id:4594366, addr:10.100.2.91:32533 status:1, collation:latin1_swedish_ci, user:dmp_rw"] [command=Query] [status="inTxn:1, autocommit:0"] 
[sql="\
                        select count(1) from vip_service where user_uuid = 'xxxxxxx' and vip_begin_time < '2019-12-01' and vip_end_time > '2019-11-30' and created_at < '2019-11-30'\
                    "]
 [err="[tikv:9006]GC life time is shorter than transaction duration, transaction starts at 2020-01-14 06:10:01.751 +0800 CST, GC safe point is 2020-01-14 06:23:14.851 +0800 CST\
github.com/pingcap/errors.AddStack\
\tgithub.com/pingcap/errors@v0.11.4/errors.go:174\
github.com/pingcap/parser/terror.(*Error).GenWithStackByArgs\
\tgithub.com/pingcap/parser@v0.0.0-20191018040038-555b97093a2a/terror/terror.go:238\
github.com/pingcap/tidb/store/tikv.(*tikvStore).CheckVisibility\
\tgithub.com/pingcap/tidb@/store/tikv/kv.go:171\
github.com/pingcap/tidb/store/tikv.(*copIterator).Next\
\tgithub.com/pingcap/tidb@/store/tikv/coprocessor.go:623\
github.com/pingcap/tidb/distsql.(*selectResult).fetch\
\tgithub.com/pingcap/tidb@/distsql/select_result.go:103\
runtime.goexit\
\truntime/asm_amd64.s:1357"]

gc时间已经调整到比较长的时间:

| tikv_gc_run_interval     | 5h                                                                                                                                           | GC run interval, at least 10m, in Go format.                                                |
| tikv_gc_life_time        | 10m                                                                                                                                          | All versions within life time will not be collected by GC, at least 10m, in Go format.      |

1、目前还是建议调大 GC 的时间,避免出现这个问题

2、从业务侧进行调整避免跨 GC 的访问

再调大gc,也是有可能gc的吧,一个事务正好跨gc时间了的话,是吗? 业务侧如何进行避免?这个没法避免吧,现在是每5小时gc一次,业务没法确认什么时候要gc。

1、gc 相关的内容建议看下下述连接了解下:

https://pingcap.com/docs-cn/stable/reference/garbage-collection/overview/

2、上述报错是事务访问了被 gc 的数据,按照你上面的配置,每 10 分钟 gc 一次,并且保留的数据是 5h,出现这个报错是说你那里访问了上次 gc 的 safepoint 点前的数据,但是该数据已经被 gc 掉了。

3、如果想避免这个问题,可以评估下事务执行的最长时间,调大 gc 的保留时间,比如调大 life time > 5h,或者调大 interval time,能够在一定程度上减少这个问题出现的概率

4、业务侧如果可以调整,建议优化下逻辑,减少出现上述问题的现象的概率