【BUG】Unstable values when the max() function and json_extract() are used together

To help us respond efficiently, please provide the following information when asking a question; clearly described issues can be prioritized.

  • 【TiDB version】: 3.0.3
  • 【Problem description】: Event-tracking data is stored as JSON, and I need the maximum of an int value inside it. The statement is as follows:

select date,hour,lib,remark,distinct_id,
  JSON_EXTRACT(all_json, '$."properties"."video_id"') as video_id,
  max(JSON_EXTRACT(all_json, '$."properties"."video_played_seconds"')) as max_played_seconds,
  max(created_at) as max_created_at
from tvcbook
where date='2020-03-05' and hour=16
  and JSON_EXTRACT(all_json, '$."properties"."action"')=214
GROUP BY date,hour,lib,remark,distinct_id,video_id
order by date,hour,lib,remark,distinct_id,video_id

The max(created_at) result is stable: repeated runs return the same value. The value extracted with json_extract is not: the returned value is neither the maximum nor stable, and it differs from run to run. Suspecting that nesting the two functions was the problem, I tried another form, wrapping the extraction in a subquery and taking the max in the outer query:

select date,hour,lib,remark,distinct_id,video_id,
  max(played_seconds) as max_played_seconds,
  max(created_at) as max_created_at
from (
  select created_at,date,hour,lib,remark,distinct_id,
    JSON_EXTRACT(all_json, '$."properties"."video_id"') as video_id,
    JSON_EXTRACT(all_json, '$."properties"."video_played_seconds"') as played_seconds
  from tvcbook
  where date='2020-03-05' and hour=16
    and JSON_EXTRACT(all_json, '$."properties"."action"')=214
) t1
GROUP BY date,hour,lib,remark,distinct_id,video_id
order by date,hour,lib,remark,distinct_id,video_id

The result is the same as with the first statement: the value of max(json_extract()) is unstable.

Looking for a solution.
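One pattern that may be worth trying in the meantime (a hedged sketch, not a confirmed fix; the DECIMAL(20,6) precision is an assumption, not taken from the original post) is to cast the extracted value to a numeric type before aggregating, so that MAX compares numbers rather than raw JSON values:

-- Sketch only: cast the JSON value to a number before MAX.
-- If the direct cast from JSON misbehaves on this version, wrapping the
-- extraction in JSON_UNQUOTE() before the CAST is another option to try.
select date,hour,lib,remark,distinct_id,
  JSON_EXTRACT(all_json, '$."properties"."video_id"') as video_id,
  max(CAST(JSON_EXTRACT(all_json, '$."properties"."video_played_seconds"') AS DECIMAL(20,6))) as max_played_seconds,
  max(created_at) as max_created_at
from tvcbook
where date='2020-03-05' and hour=16
  and JSON_EXTRACT(all_json, '$."properties"."action"')=214
GROUP BY date,hour,lib,remark,distinct_id,video_id
order by date,hour,lib,remark,distinct_id,video_id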


Hi, please upload the complete CREATE TABLE statement and some test data so that we can try to reproduce the issue first. Thanks.

Attachments: 样例数据.part001.rar (4 MB), 样例数据.part002.rar (3.6 MB)

CREATE TABLE tvcbook (
track_id bigint(17) DEFAULT NULL,
distinct_id varchar(64) DEFAULT NULL,
lib varchar(255) DEFAULT NULL,
event varchar(255) DEFAULT NULL,
type varchar(255) DEFAULT NULL,
all_json json DEFAULT NULL,
host varchar(255) DEFAULT NULL,
user_agent varchar(2048) DEFAULT NULL,
ua_platform varchar(1024) DEFAULT NULL,
ua_browser varchar(1024) DEFAULT NULL,
ua_version varchar(1024) DEFAULT NULL,
ua_language varchar(1024) DEFAULT NULL,
connection varchar(255) DEFAULT NULL,
pragma varchar(255) DEFAULT NULL,
cache_control varchar(255) DEFAULT NULL,
accept varchar(255) DEFAULT NULL,
accept_encoding varchar(255) DEFAULT NULL,
accept_language varchar(255) DEFAULT NULL,
ip varchar(512) DEFAULT NULL,
ip_city json DEFAULT NULL,
ip_asn json DEFAULT NULL,
url text DEFAULT NULL,
referrer varchar(2048) DEFAULT NULL,
remark varchar(255) DEFAULT NULL,
created_at int(11) DEFAULT NULL,
date date DEFAULT NULL,
hour int(2) DEFAULT NULL,
KEY date (date),
KEY distinct_id (distinct_id),
KEY event (event),
KEY date_hour (date,hour),
KEY event_date (event,date),
KEY event_remark_date (event,remark,date)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin;

Above are the CREATE TABLE statement and the sample data.
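For anyone who wants to try reproducing without downloading the attachments, here is a minimal sketch with hypothetical rows; only the JSON structure follows the report, the values and timestamps are invented:

-- Hypothetical sample rows (invented values, structure as described above).
INSERT INTO tvcbook (track_id, distinct_id, lib, remark, all_json, created_at, date, hour) VALUES
  (1, 'u1', 'js', 'web', '{"properties": {"action": 214, "video_id": 1001, "video_played_seconds": 3.25}}', 1583395200, '2020-03-05', 16),
  (2, 'u1', 'js', 'web', '{"properties": {"action": 214, "video_id": 1001, "video_played_seconds": 17.8}}', 1583395260, '2020-03-05', 16),
  (3, 'u1', 'js', 'web', '{"properties": {"action": 214, "video_id": 1001, "video_played_seconds": 9.5}}', 1583395320, '2020-03-05', 16);

-- The expected max_played_seconds is 17.8 on every run; the report above is
-- that the value returned by TiDB 3.0.3 varies between runs on the real data.
select date,hour,lib,remark,distinct_id,
  JSON_EXTRACT(all_json, '$."properties"."video_id"') as video_id,
  max(JSON_EXTRACT(all_json, '$."properties"."video_played_seconds"')) as max_played_seconds,
  max(created_at) as max_created_at
from tvcbook
where date='2020-03-05' and hour=16
  and JSON_EXTRACT(all_json, '$."properties"."action"')=214
GROUP BY date,hour,lib,remark,distinct_id,video_id;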

My main use case is to use the latest timestamp and the maximum played seconds to find the rows we no longer need, mark them, and delete them in order to reduce the database size, but this bug makes that impossible. Our full statement is as follows:
select d1.*, if(d2.max_created_at is null, 'del', 'max') as statu from
(select track_id, created_at, date, hour, lib, remark, distinct_id,
    JSON_EXTRACT(all_json, '$."properties"."video_id"') as video_id,
    JSON_EXTRACT(all_json, '$."properties"."video_played_seconds"') as played_seconds
  from tvcbook
  where date='2020-03-05' and hour=16
    and JSON_EXTRACT(all_json, '$."properties"."action"')=214) d1
left join
(select date, hour, lib, remark, distinct_id,
    JSON_EXTRACT(all_json, '$."properties"."video_id"') as video_id,
    max(JSON_EXTRACT(all_json, '$."properties"."video_played_seconds"')) as max_played_seconds,
    max(created_at) as max_created_at
  from tvcbook
  where date='2020-03-05' and hour=16
    and JSON_EXTRACT(all_json, '$."properties"."action"')=214
  GROUP BY date, hour, lib, remark, distinct_id, video_id) d2
on d1.date=d2.date and d1.hour=d2.hour and d1.lib=d2.lib and d1.remark=d2.remark and d1.distinct_id=d2.distinct_id and cast(d1.played_seconds as signed)=cast(d2.max_played_seconds as signed) and d1.created_at=d2.max_created_at
order by d1.date,d1.hour,d1.lib,d1.remark,d1.distinct_id,d1.video_id,d1.created_at

An update on my testing: if I multiply the floating-point field whose maximum I need by 100000, i.e. drop the fractional part and scale everything up to integers, the returned value is fairly stable.

If the fractional values are used as-is, MySQL also returns a wrong maximum, usually the first non-zero value, but it does so very consistently. TiDB's result is unstable, and it can even return a value that does not exist in the data.
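A minimal sketch of the scaling approach described above (the 1,000,000 multiplier and the CAST to SIGNED mirror the queries later in this thread; treat it as an illustration, not a confirmed fix):

-- Sketch only: scale the float up to an integer before aggregating,
-- so MAX compares integers instead of floating-point JSON values.
select date,hour,lib,remark,distinct_id,
  JSON_EXTRACT(all_json, '$."properties"."video_id"') as video_id,
  max(CAST(JSON_EXTRACT(all_json, '$."properties"."video_played_seconds"') * 1000000 AS SIGNED)) as max_played_seconds_scaled,
  max(created_at) as max_created_at
from tvcbook
where date='2020-03-05' and hour=16
  and JSON_EXTRACT(all_json, '$."properties"."action"')=214
GROUP BY date,hour,lib,remark,distinct_id,video_id;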

Could you consider storing this field as a separate column instead of inside the JSON?

No. This is an event-tracking system; we can't create one table per event type, because we need compatibility and extensibility. In any case, I have already worked around this bug.

OK. Just to understand the workaround: did you rewrite the SQL, or handle it on the application side?

I rewrote the SQL. The final version is very complicated, but it works:

delete tvcbook from tvcbook join (
  select r1.track_id as track_id, r1.date as date, r1.hour as hour, r1.lib as lib, r1.remark as remark,
         r1.distinct_id as distinct_id, r1.video_id as video_id, r1.played_seconds as played_seconds,
         r1.created_at as created_at, if(r2.maxsta is not null, 1, 0) as need
  from (
    select track_id, date, hour, lib, remark, distinct_id, video_id, played_seconds, created_at, max(statu) as sta
    from (
      select d1.*,
        case when d2.max_created_at=d1.created_at and d2.max_played_seconds=d1.played_seconds then 4
             when d2.max_played_seconds=d1.played_seconds and d2.max_created_at-d1.created_at<60 then 3
             when d2.max_played_seconds=d1.played_seconds then 2
             when d2.max_created_at=d1.created_at then 1
             else 0 end as statu
      from (
        select track_id, created_at, date, hour, lib, remark, distinct_id,
          JSON_EXTRACT(all_json, '$."properties"."video_id"') as video_id,
          JSON_EXTRACT(all_json, '$."properties"."video_played_seconds"')*1000000 as played_seconds
        from tvcbook
        where date='2020-03-05' and hour=16 and JSON_EXTRACT(all_json, '$."properties"."action"')=214
      ) d1
      left join (
        select date, hour, lib, remark, distinct_id,
          JSON_EXTRACT(all_json, '$."properties"."video_id"') as video_id,
          max(JSON_EXTRACT(all_json, '$."properties"."video_played_seconds"')*1000000) as max_played_seconds,
          max(created_at) as max_created_at
        from tvcbook
        where date='2020-03-05' and hour=16 and JSON_EXTRACT(all_json, '$."properties"."action"')=214
        GROUP BY date, hour, lib, remark, distinct_id, video_id
      ) d2
      on d1.date=d2.date and d1.hour=d2.hour and d1.lib=d2.lib and d1.remark=d2.remark and d1.distinct_id=d2.distinct_id
    ) tt
    GROUP BY track_id, date, hour, lib, remark, distinct_id, video_id, played_seconds, created_at
  ) r1
  left join (
    select date, hour, lib, remark, distinct_id, video_id, max(sta) as maxsta
    from (
      select date, hour, lib, remark, distinct_id, video_id, played_seconds, created_at, max(statu) as sta
      from (
        select d1.*,
          case when d2.max_created_at=d1.created_at and d2.max_played_seconds=d1.played_seconds then 4
               when d2.max_played_seconds=d1.played_seconds and d2.max_created_at-d1.created_at<60 then 3
               when d2.max_played_seconds=d1.played_seconds then 2
               when d2.max_created_at=d1.created_at then 1
               else 0 end as statu
        from (
          select track_id, created_at, date, hour, lib, remark, distinct_id,
            JSON_EXTRACT(all_json, '$."properties"."video_id"') as video_id,
            JSON_EXTRACT(all_json, '$."properties"."video_played_seconds"')*1000000 as played_seconds
          from tvcbook
          where date='2020-03-05' and hour=16 and JSON_EXTRACT(all_json, '$."properties"."action"')=214
        ) d1
        left join (
          select date, hour, lib, remark, distinct_id,
            JSON_EXTRACT(all_json, '$."properties"."video_id"') as video_id,
            max(JSON_EXTRACT(all_json, '$."properties"."video_played_seconds"')*1000000) as max_played_seconds,
            max(created_at) as max_created_at
          from tvcbook
          where date='2020-03-05' and hour=16 and JSON_EXTRACT(all_json, '$."properties"."action"')=214
          GROUP BY date, hour, lib, remark, distinct_id, video_id
        ) d2
        on d1.date=d2.date and d1.hour=d2.hour and d1.lib=d2.lib and d1.remark=d2.remark and d1.distinct_id=d2.distinct_id
      ) tt
      GROUP BY date, hour, lib, remark, distinct_id, video_id, played_seconds, created_at
    ) r1
    GROUP BY date, hour, lib, remark, distinct_id, video_id
    order by date, hour, lib, remark, distinct_id, video_id
  ) r2
  on r1.date=r2.date and r1.hour=r2.hour and r1.lib=r2.lib and r1.remark=r2.remark and r1.distinct_id=r2.distinct_id
     and r1.video_id=r2.video_id and r1.sta=r2.maxsta
  having need = 0
) list
where tvcbook.track_id=list.track_id and tvcbook.date=list.date and tvcbook.hour=list.hour and tvcbook.lib=list.lib
  and tvcbook.remark=list.remark and tvcbook.distinct_id=list.distinct_id

You are also welcome to take our application for testing. We built our event-tracking warehouse on TiDB, so it counts as a production use case:
https://github.com/white-shiro-bai/ghost_sa

👍👍