ai时代,tidb也推出了自己的ai函数。这个世界进步太快,我都跟不上了
mysql> CREATE TABLE vec_tbl(content_id INT, vec VECTOR);
Query OK, 0 rows affected (1.24 sec)
mysql> CREATE TABLE vec_ref_id(id INT, content VARCHAR(512));
Query OK, 0 rows affected (0.86 sec)
insert
….
TOPK
mysql> SELECT content_id, vec_cosine_distance(vec, ‘[8.7, 5.7, 7.7, 9.8, 1.4]’) as distance
→ FROM vec_tbl ORDER BY distance LIMIT 3;
±-----------±------------------------+
| content_id | distance |
±-----------±------------------------+
| 1 | 0.000018636164854313186 |
| 4 | 0.046720443178103865 |
| 5 | 0.17025852918553686 |
±-----------±------------------------+
3 rows in set (0.26 sec)
JOIN
mysql> SELECT vri.id, vri.content, VEC_Cosine_Distance(vt.vec, ‘[1,2,3,4,5]’) AS distance
→ FROM vec_tbl vt
→ JOIN vec_ref_id vri ON vt.content_id = vri.id
→ ORDER BY distance;
±-----±------------±--------------------+
| id | content | distance |
±-----±------------±--------------------+
| 5 | content 5 | 0.04972827044697814 |
| 3 | content 3 | 0.06925255631855243 |
| 4 | content 4 | 0.12677426832319694 |
| 2 | content 2 | 0.18577333207371582 |
| 1 | hello world | 0.2564100235012563 |
±-----±------------±--------------------+
5 rows in set (0.29 sec)
有了向量 做电影推荐 图书推荐就不需要写代码了。直接一个sql搞定。
KNN(K Nearest Neighbors)
精确从 N 维的向量中找到最相近的 K 个向量
时间复杂度是 O(NK)
ANN(Spproximate Nearest Neighbors)
近似结果,所以有召回率的问题
K 个近似最优的结果,映射到 SQL 算⼦就是
TopN
期待查询时效需要接近 TP(< 100ms)
Hash-based
LSH:多个 Hash 增加碰撞,让相似的聚集在⼀
起;⽆法适合⾼维数据
Tree-based:
KD-Tree,Annoy
Graph-based:
HNSW,DiskANN
Cluster-base
mysql> CREATE TABLE vec_tbl(content_id INT, vec VECTOR);
Query OK, 0 rows affected (1.24 sec)
mysql> CREATE TABLE vec_ref_id(id INT, content VARCHAR(512));
Query OK, 0 rows affected (0.86 sec)
insert
….
TOPK
mysql> SELECT content_id, vec_cosine_distance(vec, ‘[8.7, 5.7, 7.7, 9.8, 1.4]’) as distance
→ FROM vec_tbl ORDER BY distance LIMIT 3;
±-----------±------------------------+
| content_id | distance |
±-----------±------------------------+
| 1 | 0.000018636164854313186 |
| 4 | 0.046720443178103865 |
| 5 | 0.17025852918553686 |
±-----------±------------------------+
3 rows in set (0.26 sec)
JOIN
mysql> SELECT vri.id, vri.content, VEC_Cosine_Distance(vt.vec, ‘[1,2,3,4,5]’) AS distance
→ FROM vec_tbl vt
→ JOIN vec_ref_id vri ON vt.content_id = vri.id
→ ORDER BY distance;
±-----±------------±--------------------+
| id | content | distance |
±-----±------------±--------------------+
| 5 | content 5 | 0.04972827044697814 |
| 3 | content 3 | 0.06925255631855243 |
| 4 | content 4 | 0.12677426832319694 |
| 2 | content 2 | 0.18577333207371582 |
| 1 | hello world | 0.2564100235012563 |
±-----±------------±--------------------+
数据库本身是基于统计的需要。在没有机器学习前。
那时候我在做下载网站。 我们一般会在左边页面显示最近下载top10 历史下载top10 分类下载top10 这就是sql写的。有了向量函数我们可以显示喜欢该软件的也下载了xxx。然后按照啤酒尿不湿原理做向量推荐。
import numpy as np
# create user preferences
user_pref = np.array([5, 1, 3])
# create a random movie matrix of 10,000 movies
movies = np.random.randint(5,size=(3,10000))+1
# Note that the randint will make random integers from 0-4
# so I added a 1 at the end to increase the scale from 1-5
我们使用numpy模块的array函数创建矩阵,user_pref和movies是我们得到的数据。我们使用numpy模块的shape方法检查矩阵的维度,如下所示:
import time
for num_movies in (10000, 100000, 1000000, 10000000, 100000000):
movies = np.random.randint(5,size=(3, num_movies))+1
now = time.time()
np.dot(user_pref, movies)
print (time.time() - now), "seconds to run", num_movies, "movies"
输出结果:
0.000160932540894 seconds to run 10000 movies
0.00121188163757 seconds to run 100000 movies
0.0105860233307 seconds to run 1000000 movies
0.096577167511 seconds to run 10000000 movies
4.16197991371 seconds to run 100000000 movies
也就是说之前我们写推荐代码需要python,numpy 算出来排序在写入tidb 做缓存。现在一个sql就搞定了
SELECT vri.id, vri.content, VEC_Cosine_Distance(vt.vec, ‘[1,2,3,4,5]’) AS distance
FROM vec_tbl vt JOIN vec_ref_id vri ON vt.content_id = vri.id ORDER BY distance;