Vector Similarity#
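Vectors (also called "embeddings") represent an AI model's impression (or understanding) of a piece of unstructured data such as text, images, audio, or video. Vector Similarity Search (VSS) is the process of finding vectors in a vector database that are similar to a given query vector. Popular VSS use cases include recommendation systems, image and video search, document retrieval, and question answering.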
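To make "similar" concrete, here is a minimal sketch of cosine distance, the quantity behind the COSINE distance metric used when the index is created later in this document. The helper name `cosine_distance` is hypothetical and the code uses only the standard library; Redis computes this internally, so the sketch is for intuition rather than something you need to run against the server.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Vectors pointing the same way have distance 0; orthogonal vectors have distance 1.
print(cosine_distance([1.0, 0.0], [2.0, 0.0]))
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
```

A smaller distance means the two vectors are more similar, which is why the queries in this document sort results by ascending score.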
Index Creation#
Before doing vector search, first define the schema and create an index.
[1]:
import redis
from redis.commands.search.field import TagField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query

r = redis.Redis(host="localhost", port=6379)

INDEX_NAME = "index"  # Vector Index Name
DOC_PREFIX = "doc:"   # RediSearch Key Prefix for the Index

def create_index(vector_dimensions: int):
    try:
        # check to see if index exists
        r.ft(INDEX_NAME).info()
        print("Index already exists!")
    except redis.exceptions.ResponseError:
        # FT.INFO raises an error when the index does not exist, so create it
        # schema
        schema = (
            TagField("tag"),                       # Tag Field Name
            VectorField("vector",                  # Vector Field Name
                "FLAT", {                          # Vector Index Type: FLAT or HNSW
                    "TYPE": "FLOAT32",             # FLOAT32 or FLOAT64
                    "DIM": vector_dimensions,      # Number of Vector Dimensions
                    "DISTANCE_METRIC": "COSINE",   # Vector Search Distance Metric
                }
            ),
        )

        # index Definition
        definition = IndexDefinition(prefix=[DOC_PREFIX], index_type=IndexType.HASH)

        # create Index
        r.ft(INDEX_NAME).create_index(fields=schema, definition=definition)
We'll start by working with vectors that have 1536 dimensions.
[2]:
# define vector dimensions
VECTOR_DIMENSIONS = 1536
# create the index
create_index(vector_dimensions=VECTOR_DIMENSIONS)
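Adding Vectors to Redis#
Next, we add vectors (dummy data) to Redis using hset. The search index listens to keyspace notifications and will include any written HASH objects whose key is prefixed by DOC_PREFIX.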
[ ]:
%pip install numpy
[3]:
import numpy as np
[4]:
# instantiate a redis pipeline
pipe = r.pipeline()

# define some dummy data
objects = [
    {"name": "a", "tag": "foo"},
    {"name": "b", "tag": "foo"},
    {"name": "c", "tag": "bar"},
]

# write data
for obj in objects:
    # define key
    key = f"doc:{obj['name']}"
    # create a random "dummy" vector
    obj["vector"] = np.random.rand(VECTOR_DIMENSIONS).astype(np.float32).tobytes()
    # HSET
    pipe.hset(key, mapping=obj)

res = pipe.execute()
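The vector is written to the hash as raw bytes, so a FLOAT32 vector with DIM dimensions occupies DIM * 4 bytes. The following sketch shows that layout with the standard library only. The helper names `to_float32_blob` and `from_float32_blob` are hypothetical, and the explicit little-endian format is an assumption that matches what NumPy's `tobytes()` produces on common little-endian machines.

```python
import struct

def to_float32_blob(values):
    """Pack floats as little-endian 32-bit values (4 bytes per dimension)."""
    return struct.pack(f"<{len(values)}f", *values)

def from_float32_blob(blob):
    """Unpack a blob produced by to_float32_blob back into a list of floats."""
    count = len(blob) // 4
    return list(struct.unpack(f"<{count}f", blob))

blob = to_float32_blob([0.5, -1.25, 2.0])
print(len(blob))                # 12 bytes for 3 dimensions
print(from_float32_blob(blob))  # [0.5, -1.25, 2.0]
```

Checking that `len(blob)` equals the index's DIM times 4 before calling hset is a cheap way to catch a dimension mismatch early.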
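Searching#
You can run VSS queries with the .ft(...).search(...) query command. To use a VSS query, you must specify the option .dialect(2).
Redis supports two types of vector queries: KNN and Range. Hybrid queries can work in both settings and combine elements of traditional search and VSS.
KNN Queries#
KNN queries find the top-K most similar vectors for a given query vector.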
[5]:
query = (
    Query("*=>[KNN 2 @vector $vec as score]")
    .sort_by("score")
    .return_fields("id", "score")
    .paging(0, 2)
    .dialect(2)
)

query_params = {
    "vec": np.random.rand(VECTOR_DIMENSIONS).astype(np.float32).tobytes()
}
r.ft(INDEX_NAME).search(query, query_params).docs
[5]:
[Document {'id': 'doc:b', 'payload': None, 'score': '0.2376562953'},
Document {'id': 'doc:c', 'payload': None, 'score': '0.240063905716'}]
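For intuition about what a KNN query returns, the sketch below does the same thing by brute force in plain Python: score every stored vector by cosine distance and keep the K closest. The names `cosine_distance` and `knn` and the tiny two-dimensional vectors are illustrative assumptions, not part of redis-py, and this is not a description of how Redis implements the search internally.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def knn(query, docs, k):
    """Return the k (key, distance) pairs closest to query, nearest first."""
    scored = sorted((cosine_distance(query, vec), key) for key, vec in docs.items())
    return [(key, dist) for dist, key in scored[:k]]

docs = {
    "doc:a": [1.0, 0.0],
    "doc:b": [0.0, 1.0],
    "doc:c": [1.0, 1.0],
}
top = knn([1.0, 0.1], docs, k=2)
print([key for key, _ in top])  # ['doc:a', 'doc:c']
```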
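Range Queries#
Range queries filter results by the distance between a vector field in Redis and a query vector, keeping only those within a predefined threshold (radius).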
[6]:
query = (
    Query("@vector:[VECTOR_RANGE $radius $vec]=>{$YIELD_DISTANCE_AS: score}")
    .sort_by("score")
    .return_fields("id", "score")
    .paging(0, 3)
    .dialect(2)
)

# Find all vectors within 0.8 of the query vector
query_params = {
    "radius": 0.8,
    "vec": np.random.rand(VECTOR_DIMENSIONS).astype(np.float32).tobytes()
}
r.ft(INDEX_NAME).search(query, query_params).docs
[6]:
[Document {'id': 'doc:a', 'payload': None, 'score': '0.243115246296'},
Document {'id': 'doc:c', 'payload': None, 'score': '0.24981123209'},
Document {'id': 'doc:b', 'payload': None, 'score': '0.251443207264'}]
See additional range query examples in this Jupyter notebook.
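Unlike KNN, a range query does not fix the number of results; it returns every vector whose distance falls inside the radius. The brute-force sketch below illustrates that behavior in plain Python. The helper names and sample vectors are hypothetical, and treating the boundary as inclusive (`<=`) is an assumption of this sketch.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def within_radius(query, docs, radius):
    """Return (key, distance) pairs with distance <= radius, nearest first."""
    scored = sorted((cosine_distance(query, vec), key) for key, vec in docs.items())
    return [(key, dist) for dist, key in scored if dist <= radius]

docs = {
    "doc:a": [1.0, 0.0],
    "doc:b": [0.0, 1.0],
    "doc:c": [1.0, 1.0],
}
print([key for key, _ in within_radius([1.0, 0.1], docs, radius=0.5)])    # ['doc:a', 'doc:c']
print([key for key, _ in within_radius([1.0, 0.1], docs, radius=0.001)])  # []
```

A very small radius can legitimately return no documents, whereas a KNN query returns up to K results as long as enough documents are indexed.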
Hybrid Queries#
Hybrid queries combine traditional filters (numeric, tag, text) and VSS in a single Redis command.
[7]:
query = (
    Query("(@tag:{ foo })=>[KNN 2 @vector $vec as score]")
    .sort_by("score")
    .return_fields("id", "tag", "score")
    .paging(0, 2)
    .dialect(2)
)

query_params = {
    "vec": np.random.rand(VECTOR_DIMENSIONS).astype(np.float32).tobytes()
}
r.ft(INDEX_NAME).search(query, query_params).docs
[7]:
[Document {'id': 'doc:b', 'payload': None, 'score': '0.24422544241', 'tag': 'foo'},
Document {'id': 'doc:a', 'payload': None, 'score': '0.259926855564', 'tag': 'foo'}]
See additional hybrid query examples in this Jupyter notebook.
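The plain KNN query and the hybrid query in this document differ only in the filter expression placed before `=>`. As a small sketch of that relationship, the hypothetical helper `build_knn_query` below assembles both query strings; it is not part of redis-py, and its parameter names are assumptions chosen for illustration.

```python
def build_knn_query(k, filter_expr="*", vector_field="vector",
                    param="vec", score_alias="score"):
    """Build a KNN query string, optionally restricted by a filter expression."""
    # "*" matches every document; any other filter is wrapped in parentheses.
    prefix = "*" if filter_expr == "*" else f"({filter_expr})"
    return f"{prefix}=>[KNN {k} @{vector_field} ${param} as {score_alias}]"

print(build_knn_query(2))
# *=>[KNN 2 @vector $vec as score]
print(build_knn_query(2, filter_expr="@tag:{ foo }"))
# (@tag:{ foo })=>[KNN 2 @vector $vec as score]
```

The resulting string can be passed to `Query(...)` exactly like the hand-written strings in the cells above.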
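Vector Creation and Storage Examples#
The examples above use dummy data as vectors. In practice, however, most use cases rely on production-grade AI models to create embeddings. Below, we take some sample text data, pass it to the OpenAI and Cohere APIs respectively, and write the resulting embeddings to Redis.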
[8]:
texts = [
    "Today is a really great day!",
    "The dog next door barks really loudly.",
    "My cat escaped and got out before I could close the door.",
    "It's supposed to rain and thunder tomorrow."
]
OpenAI Embeddings#
Before working with OpenAI embeddings, we clean up the existing search index and create a new one.
[9]:
# delete index
r.ft(INDEX_NAME).dropindex(delete_documents=True)
# make a new one
create_index(vector_dimensions=VECTOR_DIMENSIONS)
[ ]:
%pip install openai
[10]:
import openai
# set your OpenAI API key - get one at https://platform.openai.com
openai.api_key = "YOUR OPENAI API KEY"
[11]:
# Create Embeddings with OpenAI text-embedding-ada-002
# https://openai.com/blog/new-and-improved-embedding-model
response = openai.Embedding.create(input=texts, engine="text-embedding-ada-002")
embeddings = np.array([item["embedding"] for item in response["data"]], dtype=np.float32)

# Write to Redis
pipe = r.pipeline()
for i, embedding in enumerate(embeddings):
    pipe.hset(f"doc:{i}", mapping={
        "vector": embedding.tobytes(),
        "content": texts[i],
        "tag": "openai"
    })
res = pipe.execute()
[12]:
embeddings
[12]:
array([[ 0.00509819, 0.0010873 , -0.00228475, ..., -0.00457579,
0.01329307, -0.03167175],
[-0.00357223, -0.00550784, -0.01314328, ..., -0.02915693,
0.01470436, -0.01367203],
[-0.01284631, 0.0034875 , -0.01719686, ..., -0.01537451,
0.01953256, -0.05048691],
[-0.01145045, -0.00785481, 0.00206323, ..., -0.02070181,
-0.01629098, -0.00300795]], dtype=float32)
Search with OpenAI Embeddings#
Now that we have created embeddings with OpenAI, we can also run a search to find documents relevant to some input text.
[13]:
text = "animals"
# create query embedding
response = openai.Embedding.create(input=[text], engine="text-embedding-ada-002")
query_embedding = np.array([r["embedding"] for r in response["data"]], dtype=np.float32)[0]
query_embedding
[13]:
array([ 0.00062901, -0.0070723 , -0.00148926, ..., -0.01904645,
-0.00436092, -0.01117944], dtype=float32)
[14]:
# query for similar documents that have the openai tag
query = (
    Query("(@tag:{ openai })=>[KNN 2 @vector $vec as score]")
    .sort_by("score")
    .return_fields("content", "tag", "score")
    .paging(0, 2)
    .dialect(2)
)

query_params = {"vec": query_embedding.tobytes()}
r.ft(INDEX_NAME).search(query, query_params).docs
# the two pieces of content related to animals are returned
[14]:
[Document {'id': 'doc:1', 'payload': None, 'score': '0.214349985123', 'content': 'The dog next door barks really loudly.', 'tag': 'openai'},
Document {'id': 'doc:2', 'payload': None, 'score': '0.237052619457', 'content': 'My cat escaped and got out before I could close the door.', 'tag': 'openai'}]
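The score field in these results comes back as a string and, because the index uses the COSINE metric, it holds a distance rather than a similarity. The sketch below converts the two scores shown above into cosine similarities. The helper name `distance_to_similarity` and the dictionary of raw scores are illustrative assumptions; in practice you would read each document's score from the search result.

```python
def distance_to_similarity(score):
    """Convert a cosine distance (given as a string or float) to cosine similarity."""
    return 1.0 - float(score)

# Scores copied from the search output above, keyed by document id.
raw_scores = {"doc:1": "0.214349985123", "doc:2": "0.237052619457"}

for key, score in raw_scores.items():
    print(key, round(distance_to_similarity(score), 4))
# doc:1 0.7857
# doc:2 0.7629
```

Sorting by ascending distance, as the query does with sort_by("score"), is therefore the same as sorting by descending similarity.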
Cohere Embeddings#
Before working with Cohere embeddings, we clean up the existing search index and create a new one.
[15]:
# delete index
r.ft(INDEX_NAME).dropindex(delete_documents=True)
# make a new one for cohere embeddings (1024 dimensions)
VECTOR_DIMENSIONS = 1024
create_index(vector_dimensions=VECTOR_DIMENSIONS)
[ ]:
%pip install cohere
[16]:
import cohere
co = cohere.Client("YOUR COHERE API KEY")
[17]:
# Create Embeddings with Cohere
# https://docs.cohere.ai/docs/embeddings
response = co.embed(texts=texts, model="small")
embeddings = np.array(response.embeddings, dtype=np.float32)

# Write to Redis
for i, embedding in enumerate(embeddings):
    r.hset(f"doc:{i}", mapping={
        "vector": embedding.tobytes(),
        "content": texts[i],
        "tag": "cohere"
    })
[18]:
embeddings
[18]:
array([[-0.3034668 , -0.71533203, -0.2836914 , ..., 0.81152344,
1.0253906 , -0.8095703 ],
[-0.02560425, -1.4912109 , 0.24267578, ..., -0.89746094,
0.15625 , -3.203125 ],
[ 0.10125732, 0.7246094 , -0.29516602, ..., -1.9638672 ,
1.6630859 , -0.23291016],
[-2.09375 , 0.8588867 , -0.23352051, ..., -0.01541138,
0.17053223, -3.4042969 ]], dtype=float32)
Search with Cohere Embeddings#
Now that we have created embeddings with Cohere, we can also run a search to find documents relevant to some input text.
[19]:
text = "animals"
# create query embedding
response = co.embed(texts=[text], model="small")
query_embedding = np.array(response.embeddings[0], dtype=np.float32)
query_embedding
[19]:
array([-0.49682617, 1.7070312 , 0.3466797 , ..., 0.58984375,
0.1060791 , -2.9023438 ], dtype=float32)
[20]:
# query for similar documents that have the cohere tag
query = (
    Query("(@tag:{ cohere })=>[KNN 2 @vector $vec as score]")
    .sort_by("score")
    .return_fields("content", "tag", "score")
    .paging(0, 2)
    .dialect(2)
)

query_params = {"vec": query_embedding.tobytes()}
r.ft(INDEX_NAME).search(query, query_params).docs
# the two pieces of content related to animals are returned
[20]:
[Document {'id': 'doc:1', 'payload': None, 'score': '0.658673524857', 'content': 'The dog next door barks really loudly.', 'tag': 'cohere'},
Document {'id': 'doc:2', 'payload': None, 'score': '0.662699103355', 'content': 'My cat escaped and got out before I could close the door.', 'tag': 'cohere'}]
Find more example apps, tutorials, and projects using Redis Vector Similarity Search in this GitHub organization.