Integration with Milvus

Milvus is an open-source vector database designed to handle large-scale vector data with high efficiency and scalability. It supports various indexing methods, including HNSW, and provides APIs for multiple programming languages, including Python through the PyMilvus client.

In this response, I'll guide you through using Milvus as a vector store, with code examples. We'll cover:

  1. Introduction to Milvus

  2. Setting Up Milvus

  3. Generating Embeddings

  4. Connecting to Milvus

  5. Creating a Collection

  6. Inserting Vectors

  7. Creating Indexes

  8. Performing Searches

  9. Complete Example Code

  10. Considerations and Best Practices

  11. Conclusion


1. Introduction to Milvus

Milvus is designed for storing and searching vector embeddings efficiently. It is suitable for applications like similarity search, recommendation systems, and more.

Key Features:

  • High Performance: Supports billion-scale vector data with millisecond-level latency.

  • Flexible Indexing: Provides multiple indexing algorithms, including HNSW, IVF, and more.

  • Easy Integration: Offers SDKs for Python, Java, Go, etc.

  • Scalability: Can be deployed standalone or in a distributed cluster.


2. Setting Up Milvus

a. Install Milvus

You can run Milvus using Docker. Note that Milvus standalone also depends on etcd and MinIO, so for anything beyond a quick experiment, prefer the official Docker Compose file or install script from the Milvus documentation. A 2.3+ release is used here because the COSINE metric in the examples below requires it.

# Pull a Milvus standalone Docker image (2.3+ is needed for the COSINE metric)
docker pull milvusdb/milvus:v2.3.3

# Start Milvus standalone
docker run -d --name milvus-standalone -p 19530:19530 milvusdb/milvus:v2.3.3

b. Install PyMilvus

Install the Python SDK to interact with Milvus. Keep the client's minor version in line with the server (2.3.x here).

pip install "pymilvus==2.3.3"

3. Generating Embeddings

We'll use OpenAI's API to generate embeddings for sample text data.

a. Sample Dataset

documents = [
    {
        "id": "1",
        "content": "The quick brown fox jumps over the lazy dog.",
        "metadata": {"category": "animal"}
    },
    {
        "id": "2",
        "content": "An apple a day keeps the doctor away.",
        "metadata": {"category": "health"}
    },
    {
        "id": "3",
        "content": "To be or not to be, that is the question.",
        "metadata": {"category": "literature"}
    }
]

b. Generate Embeddings Using OpenAI

import openai
import os

# Set your OpenAI API key
openai.api_key = os.getenv("OPENAI_API_KEY")

def generate_embedding(text):
    # Uses the pre-1.0 OpenAI Python SDK (openai<1.0); `model` replaces the
    # deprecated `engine` argument.
    response = openai.Embedding.create(
        input=text,
        model='text-embedding-ada-002'
    )
    return response['data'][0]['embedding']

Generate embeddings for each document:

for doc in documents:
    doc['embedding'] = generate_embedding(doc['content'])

4. Connecting to Milvus

from pymilvus import connections

# Connect to Milvus
connections.connect("default", host="127.0.0.1", port="19530")

5. Creating a Collection

In Milvus, data is stored in collections.

a. Define Collection Schema

from pymilvus import FieldSchema, CollectionSchema, DataType, Collection

# Define fields
fields = [
    FieldSchema(name="id", dtype=DataType.VARCHAR, is_primary=True, max_length=64),
    FieldSchema(name="content", dtype=DataType.VARCHAR, max_length=500),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=len(documents[0]['embedding'])),
    FieldSchema(name="category", dtype=DataType.VARCHAR, max_length=100),
]

# Create schema
schema = CollectionSchema(fields, description="Document collection")

b. Create the Collection

collection_name = "documents_collection"

from pymilvus import utility

# Drop the collection if it exists
if utility.has_collection(collection_name):
    utility.drop_collection(collection_name)

# Create the collection
collection = Collection(name=collection_name, schema=schema)

6. Inserting Vectors

a. Prepare Data

Milvus expects data in the form of lists corresponding to each field.

# Separate data into fields
ids = [doc['id'] for doc in documents]
contents = [doc['content'] for doc in documents]
embeddings = [doc['embedding'] for doc in documents]
categories = [doc['metadata']['category'] for doc in documents]

# Prepare data list
data = [ids, contents, embeddings, categories]

b. Insert Data into Milvus

# Insert data
insert_result = collection.insert(data)

c. Load the Collection

A collection must be loaded into memory before it can be searched. In Milvus 2.x the load only succeeds once an index exists on the vector field, so run this step after creating the index in the next section.

# Load the collection into memory (requires an index on the vector field)
collection.load()

7. Creating Indexes

Milvus supports various indexing methods, including HNSW.

a. Create an HNSW Index

index_params = {
    "metric_type": "COSINE",
    "index_type": "HNSW",
    "params": {"M": 8, "efConstruction": 64},
}

collection.create_index(field_name="embedding", index_params=index_params)

  • metric_type: "COSINE" selects cosine similarity (supported as a native metric in Milvus 2.3+).

  • index_type: "HNSW" specifies the HNSW indexing algorithm.

  • params: Algorithm-specific parameters.

b. Release the Collection (Optional)

When you no longer need to search a collection, you can release it from memory to free resources.

# Release the collection from memory (optional)
# collection.release()

8. Performing Searches

a. Prepare Query Embedding

query_text = "What is the meaning of life?"
query_embedding = generate_embedding(query_text)

b. Perform the Search

# Define search parameters
search_params = {
    "metric_type": "COSINE",
    "params": {"ef": 64},
}

# Perform the search
results = collection.search(
    data=[query_embedding],
    anns_field="embedding",
    param=search_params,
    limit=2,
    expr=None,
    output_fields=["id", "content", "category"]
)

  • data: List of query embeddings.

  • anns_field: The field containing embeddings.

  • param: Search parameters.

  • limit: Number of nearest neighbors to retrieve.

  • expr: Optional filter expression.

  • output_fields: Fields to include in the results.
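As an aside, the expr parameter accepts a boolean filter over scalar fields, which lets you combine vector search with metadata filtering. Below is a minimal sketch; the search_in_category helper is hypothetical (not part of PyMilvus), and it assumes a loaded Collection with the schema defined earlier:

```python
def search_in_category(collection, query_embedding, category, limit=2):
    """Vector search restricted to documents whose scalar `category` matches.

    `collection` is assumed to be a loaded pymilvus Collection; the filter
    uses Milvus boolean expression syntax on the VARCHAR field `category`.
    """
    return collection.search(
        data=[query_embedding],
        anns_field="embedding",
        param={"metric_type": "COSINE", "params": {"ef": 64}},
        limit=limit,
        expr=f'category == "{category}"',  # boolean filter on a scalar field
        output_fields=["id", "content", "category"],
    )
```

For example, search_in_category(collection, query_embedding, "health") would only consider vectors whose category field equals "health".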

c. Display Results

# Display search results
for result in results[0]:
    print(f"ID: {result.entity.get('id')}")
    print(f"Content: {result.entity.get('content')}")
    print(f"Category: {result.entity.get('category')}")
    print(f"Score (cosine similarity): {result.distance:.4f}")
    print("--------")

9. Complete Example Code

Here's the full code combining all the steps:

import openai
import os
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection

# Set up OpenAI API key
openai.api_key = os.getenv("OPENAI_API_KEY")

# Sample dataset
documents = [
    {
        "id": "1",
        "content": "The quick brown fox jumps over the lazy dog.",
        "metadata": {"category": "animal"}
    },
    {
        "id": "2",
        "content": "An apple a day keeps the doctor away.",
        "metadata": {"category": "health"}
    },
    {
        "id": "3",
        "content": "To be or not to be, that is the question.",
        "metadata": {"category": "literature"}
    }
]

# Function to generate embeddings
def generate_embedding(text):
    # Uses the pre-1.0 OpenAI Python SDK (openai<1.0); `model` replaces the
    # deprecated `engine` argument.
    response = openai.Embedding.create(
        input=text,
        model='text-embedding-ada-002'
    )
    return response['data'][0]['embedding']

# Generate embeddings for each document
for doc in documents:
    doc['embedding'] = generate_embedding(doc['content'])

# Connect to Milvus
connections.connect("default", host="127.0.0.1", port="19530")

# Define fields
fields = [
    FieldSchema(name="id", dtype=DataType.VARCHAR, is_primary=True, max_length=64),
    FieldSchema(name="content", dtype=DataType.VARCHAR, max_length=500),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=len(documents[0]['embedding'])),
    FieldSchema(name="category", dtype=DataType.VARCHAR, max_length=100),
]

# Create schema
schema = CollectionSchema(fields, description="Document collection")

# Create the collection (drop any previous copy first)
from pymilvus import utility

collection_name = "documents_collection"
if utility.has_collection(collection_name):
    utility.drop_collection(collection_name)
collection = Collection(name=collection_name, schema=schema)

# Prepare data
ids = [doc['id'] for doc in documents]
contents = [doc['content'] for doc in documents]
embeddings = [doc['embedding'] for doc in documents]
categories = [doc['metadata']['category'] for doc in documents]
data = [ids, contents, embeddings, categories]

# Insert data
insert_result = collection.insert(data)

# Create index
index_params = {
    "metric_type": "COSINE",
    "index_type": "HNSW",
    "params": {"M": 8, "efConstruction": 64},
}
collection.create_index(field_name="embedding", index_params=index_params)

# Load the collection
collection.load()

# Prepare query
query_text = "What is the meaning of life?"
query_embedding = generate_embedding(query_text)

# Search parameters
search_params = {
    "metric_type": "COSINE",
    "params": {"ef": 64},
}

# Perform search
results = collection.search(
    data=[query_embedding],
    anns_field="embedding",
    param=search_params,
    limit=2,
    expr=None,
    output_fields=["id", "content", "category"]
)

# Display results
for result in results[0]:
    print(f"ID: {result.entity.get('id')}")
    print(f"Content: {result.entity.get('content')}")
    print(f"Category: {result.entity.get('category')}")
    print(f"Score (cosine similarity): {result.distance:.4f}")
    print("--------")

Sample Output (illustrative; with the COSINE metric, higher scores mean more similar, and actual values depend on the embedding model):

ID: 3
Content: To be or not to be, that is the question.
Category: literature
Score (cosine similarity): 0.7421
--------
ID: 2
Content: An apple a day keeps the doctor away.
Category: health
Score (cosine similarity): 0.7288
--------

10. Considerations and Best Practices

a. Indexing Parameters

  • M: Controls the number of bi-directional links created for each new element during index construction. Higher values improve search accuracy but increase memory usage.

  • efConstruction: Controls the construction time and index accuracy. Higher values lead to better accuracy but slower index building.

b. Search Parameters

  • ef: The size of the dynamic list for the query. Higher values improve recall but increase search time.
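To make these trade-offs concrete, here is a sketch contrasting two illustrative presets; the specific values are examples for discussion, not tuned recommendations:

```python
# Two illustrative HNSW configurations; tune against your own data.

# Favors build speed and low memory usage; lower recall.
fast_index = {
    "metric_type": "COSINE",
    "index_type": "HNSW",
    "params": {"M": 8, "efConstruction": 64},
}
fast_search = {"metric_type": "COSINE", "params": {"ef": 32}}

# Favors recall; slower to build and query, and uses more memory.
accurate_index = {
    "metric_type": "COSINE",
    "index_type": "HNSW",
    "params": {"M": 32, "efConstruction": 256},
}
accurate_search = {"metric_type": "COSINE", "params": {"ef": 128}}

# Note: Milvus requires ef to be at least the search limit (top-k).
```

Pass the index dict to collection.create_index and the search dict to collection.search, as in the earlier sections, and benchmark recall and latency on a held-out query set before settling on values.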

c. Data Normalization

  • With the COSINE metric (Milvus 2.3+), you don't need to normalize vectors; Milvus computes the similarity internally. On versions without COSINE, normalize your embeddings and use the IP (inner product) metric instead.
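Where a native COSINE metric is unavailable, a common workaround is to L2-normalize embeddings and search with the IP (inner product) metric, since the inner product of unit vectors equals their cosine similarity. A minimal sketch in plain Python:

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so that IP(a, b) == cosine(a, b)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)  # leave the zero vector unchanged
    return [x / norm for x in vec]

# The inner product of unit vectors equals their cosine similarity.
a = l2_normalize([3.0, 4.0])   # -> [0.6, 0.8]
b = l2_normalize([4.0, 3.0])   # -> [0.8, 0.6]
similarity = sum(x * y for x, y in zip(a, b))  # 0.96 for these two vectors
```

If you take this route, remember to normalize both the stored embeddings and every query embedding.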

d. Error Handling

  • Include error handling in production code to manage exceptions and ensure robustness.
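One pattern worth sketching is a small retry wrapper around calls that can fail transiently (network blips, server restarts). The helper below is generic and hypothetical: it catches a caller-supplied exception type, and with PyMilvus you would pass its MilvusException so unrelated errors still surface immediately:

```python
import time

def with_retries(operation, retries=3, delay=0.5, exc_type=Exception):
    """Run operation(), retrying on exc_type with a simple linear backoff.

    With PyMilvus, pass exc_type=MilvusException (importable from pymilvus)
    so that only Milvus errors trigger a retry.
    """
    for attempt in range(1, retries + 1):
        try:
            return operation()
        except exc_type:
            if attempt == retries:
                raise  # out of attempts: re-raise the last error
            time.sleep(delay * attempt)  # wait longer after each failure

# Example: an operation that fails once, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] == 1:
        raise TimeoutError("transient failure")
    return "ok"

result = with_retries(flaky, retries=3, delay=0.0, exc_type=TimeoutError)
# result == "ok" after one retry
```

You might wrap connections.connect, collection.insert, or collection.search this way, keeping the retry policy in one place.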

e. Scaling

  • For large datasets, consider deploying Milvus in a distributed mode using Milvus Operator or Helm charts.

f. Security

  • Secure your Milvus deployment, especially when exposing it over a network. Use authentication and secure communication channels where appropriate.

11. Conclusion

By using Milvus as a vector database, you can efficiently store and retrieve embeddings for applications like similarity search and recommendation systems. The approach involves:

  • Generating Embeddings: Convert your data into vector embeddings using a model like OpenAI's embeddings.

  • Connecting to Milvus: Use PyMilvus to interact with your Milvus instance.

  • Creating a Collection: Define the schema and create a collection to store your data.

  • Inserting Vectors: Insert your embeddings and associated data into the collection.

  • Creating Indexes: Build an index (e.g., HNSW) to enable efficient search.

  • Performing Searches: Query the database using embeddings to find similar items.

Milvus abstracts much of the complexity involved in managing vector data, allowing you to focus on building your application's logic.


Additional Resources