
Deploy a text embedding model with an endpoint
Text embeddings are rich numerical representations of text that power many modern natural language processing (NLP) applications. This tutorial shows you how to run and interact with an embeddings endpoint using MAX Serve. Specifically, we'll use the `all-mpnet-base-v2` model, a transformer that excels at capturing semantic relationships in text.
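Before diving in, here's a minimal sketch of the core idea: each piece of text becomes a vector, and semantically similar texts have vectors that point in similar directions, which we can measure with cosine similarity. The four-dimensional vectors below are made up for illustration; real `all-mpnet-base-v2` embeddings have 768 dimensions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" (made up): similar meanings get similar vectors.
password_reset = np.array([0.9, 0.1, 0.3, 0.0])
change_password = np.array([0.8, 0.2, 0.4, 0.1])
billing_cycle = np.array([0.1, 0.9, 0.0, 0.5])

print(cosine_similarity(password_reset, change_password))  # high (~0.98)
print(cosine_similarity(password_reset, billing_cycle))    # lower (~0.18)
```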
In this tutorial, you'll learn how to:

- Set up a local embeddings server using the `all-mpnet-base-v2` model
- Build a smart knowledge base system using semantic similarity
- Implement document clustering and topic-based organization
- Create robust search functionality using embeddings
Local setup
In this section, you will set up and run the `all-mpnet-base-v2` model locally using MAX Serve.
Start the embeddings server
Use the `magic` CLI tool to start the embeddings server locally:

1. If you don't have the `magic` CLI yet, you can install it on macOS and Ubuntu Linux with this command:

   ```bash
   curl -ssL https://magic.modular.com/ | bash
   ```

   Then run the `source` command that's printed in your terminal.

2. Use `magic` to install our `max-pipelines` CLI tool:

   ```bash
   magic global install max-pipelines
   ```

3. Start a local endpoint for `all-mpnet-base-v2`:

   ```bash
   max-pipelines serve --model-path=sentence-transformers/all-mpnet-base-v2
   ```

   This creates a server running the `all-mpnet-base-v2` embeddings model on `http://localhost:8000/v1/embeddings`, an OpenAI-compatible endpoint (see the sketch after this list).

   The endpoint is ready when you see this message printed in your terminal:

   ```
   Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit)
   ```
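Because the endpoint is OpenAI-compatible, you can also query it with the official `openai` Python client instead of raw HTTP. This is a minimal sketch assuming you have the `openai` package installed; the API key can be any placeholder string if the local server doesn't require authentication:

```python
from openai import OpenAI

# Point the client at the local MAX Serve endpoint; the key is an unused placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.embeddings.create(
    model="sentence-transformers/all-mpnet-base-v2",
    input="Run an embedding model with MAX Serve!",
)
print(len(response.data[0].embedding))  # 768 dimensions for all-mpnet-base-v2
```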
Send a curl request to the endpoint

Let's send a curl request to see what kind of response we get back. With the server running in your first terminal, run the following command in a second terminal:

```bash
curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Run an embedding model with MAX Serve!",
    "model": "sentence-transformers/all-mpnet-base-v2"
  }'
```

The following is the expected output, shortened for brevity:

```
{"data":[{"index":0,"embedding":[-0.06595132499933243,0.005941616836935282,0.021467769518494606,0.23037832975387573,
```

The response is a numerical representation of the input text that can be used for semantic comparisons.
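To see those numbers doing something useful, here's a short sketch that requests embeddings for two sentences from the running endpoint and compares them with scikit-learn's `cosine_similarity`. This assumes the server from the previous step is still running and that `scikit-learn` and `requests` are available:

```python
import requests
from sklearn.metrics.pairwise import cosine_similarity

# Request embeddings for two sentences from the local endpoint.
response = requests.post(
    "http://localhost:8000/v1/embeddings",
    headers={"Content-Type": "application/json"},
    json={
        "input": ["How do I reset my password?", "Steps to change your password"],
        "model": "sentence-transformers/all-mpnet-base-v2",
    },
    timeout=10,
)
response.raise_for_status()
vectors = [item["embedding"] for item in response.json()["data"]]

# Semantically similar sentences score close to 1.0; unrelated ones score near 0.
print(cosine_similarity([vectors[0]], [vectors[1]])[0][0])
```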
Now that the endpoint is active and responsive, let's create an application that uses the embedding model and retrieves information.
Build a knowledge base system

Now, let's build a smart knowledge base using the `all-mpnet-base-v2` model. You'll create a system that can match user queries to relevant documentation and automatically organize content into topics.
1. Install dependencies

Let's create a new Python project using `magic` to manage our packages.

1. In a second terminal, run the following command:

   ```bash
   magic init embeddings --format pyproject && cd embeddings
   ```

2. Add three new libraries to the project:

   ```bash
   magic add numpy scikit-learn requests
   ```

   NumPy and scikit-learn provide the numerical tools for measuring sentence similarity and clustering documents, while the `requests` library handles API communication with the embeddings endpoint.
2. Implement the knowledge base system

Now we will create a smart knowledge base system that can:

- Process and store documents with their semantic embeddings
- Search for relevant documents using natural language queries
- Automatically organize content into topics using clustering
- Suggest relevant topics based on user queries

The system uses embeddings from the `all-mpnet-base-v2` model to understand the meaning of text, enabling semantic search and intelligent document organization.
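The clustering piece can feel abstract, so here is a tiny, self-contained sketch of the idea in isolation: scikit-learn's `KMeans` groups nearby embedding vectors into topics. The 2-D vectors are made up for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D "embeddings" with two obvious groups (real embeddings are 768-D).
embeddings = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])

kmeans = KMeans(n_clusters=2, random_state=42, n_init=10).fit(embeddings)
print(kmeans.labels_)  # e.g., [1 1 0 0]: each document assigned to a topic cluster
```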
1. Create a new Python file called `kb_system.py` in the `src/embeddings` directory and add the following:

   ```python
   import logging
   from functools import lru_cache
   from typing import Dict, List, Optional, Tuple

   import numpy as np
   import requests
   from sklearn.cluster import KMeans
   from sklearn.metrics.pairwise import cosine_similarity

   logging.basicConfig(level=logging.INFO)
   logger = logging.getLogger(__name__)


   class SmartKnowledgeBase:
       def __init__(self, endpoint: str = "http://localhost:8000/v1/embeddings"):
           self.endpoint = endpoint
           self.documents: List[str] = []
           self.doc_titles: List[str] = []
           self.embeddings: Optional[np.ndarray] = None
           self.clusters: Dict[int, List[int]] = {}

       def _get_embedding(self, texts: List[str], max_retries: int = 3) -> np.ndarray:
           """Get embeddings with retry logic."""
           for attempt in range(max_retries):
               try:
                   response = requests.post(
                       self.endpoint,
                       headers={"Content-Type": "application/json"},
                       json={
                           "input": texts,
                           "model": "sentence-transformers/all-mpnet-base-v2",
                       },
                       timeout=5,
                   ).json()
                   return np.array([item["embedding"] for item in response["data"]])
               except Exception as e:
                   if attempt == max_retries - 1:
                       raise Exception(
                           f"Failed to get embeddings after {max_retries} attempts: {e}"
                       )
                   logger.warning(f"Attempt {attempt + 1} failed, retrying...")

       @lru_cache(maxsize=1000)
       def _get_embedding_cached(self, text: str) -> np.ndarray:
           """Cached version for single text embedding."""
           return self._get_embedding([text])[0]

       def add_document(self, title: str, content: str):
           """Add a single document with title."""
           self.doc_titles.append(title)
           self.documents.append(content)

           # Update embeddings
           if len(self.documents) == 1:
               self.embeddings = self._get_embedding([content])
           else:
               self.embeddings = np.vstack(
                   [self.embeddings, self._get_embedding([content])]
               )

           # Recluster if we have enough documents
           if len(self.documents) >= 3:
               self._cluster_documents()

       def _cluster_documents(self, n_clusters: Optional[int] = None):
           """Cluster documents into topics."""
           if n_clusters is None:
               n_clusters = max(2, len(self.documents) // 5)
           n_clusters = min(n_clusters, len(self.documents))

           kmeans = KMeans(n_clusters=n_clusters, random_state=42).fit(self.embeddings)
           self.clusters = {}
           for i in range(n_clusters):
               self.clusters[i] = np.where(kmeans.labels_ == i)[0].tolist()

       def search(self, query: str, top_k: int = 3) -> List[Tuple[str, str, float]]:
           """Find documents most similar to the query."""
           query_embedding = self._get_embedding_cached(query)
           similarities = cosine_similarity([query_embedding], self.embeddings)[0]
           top_indices = np.argsort(similarities)[-top_k:][::-1]
           return [
               (self.doc_titles[i], self.documents[i], similarities[i])
               for i in top_indices
           ]

       def get_topic_documents(self, topic_id: int) -> List[Tuple[str, str]]:
           """Get all documents in a topic cluster."""
           return [
               (self.doc_titles[i], self.documents[i])
               for i in self.clusters.get(topic_id, [])
           ]

       def suggest_topics(self, query: str, top_k: int = 2) -> List[Tuple[int, float]]:
           """Rank topic clusters by their best match to the query."""
           query_embedding = self._get_embedding_cached(query)
           topic_similarities = []
           for topic_id, doc_indices in self.clusters.items():
               topic_embeddings = self.embeddings[doc_indices]
               similarity = cosine_similarity([query_embedding], topic_embeddings).max()
               topic_similarities.append((topic_id, similarity))
           return sorted(topic_similarities, key=lambda x: x[1], reverse=True)[:top_k]


   # Example usage
   if __name__ == "__main__":
       # Initialize knowledge base
       kb = SmartKnowledgeBase()

       # Add technical documentation
       kb.add_document(
           "Password Reset Guide",
           "To reset your password: 1. Click 'Forgot Password' 2. Enter your email "
           "3. Follow the reset link 4. Create a new password meeting security requirements",
       )
       kb.add_document(
           "Account Security",
           "Secure your account by enabling 2FA, using a strong password, and regularly "
           "monitoring account activity. Enable login notifications for suspicious activity.",
       )
       kb.add_document(
           "Billing Overview",
           "Your billing cycle starts on the 1st of each month. View charges, update "
           "payment methods, and download invoices from the Billing Dashboard.",
       )
       kb.add_document(
           "Payment Methods",
           "We accept credit cards, PayPal, and bank transfers. Update payment methods "
           "in Billing Settings. New payment methods are verified with a $1 hold.",
       )
       kb.add_document(
           "Installation Guide",
           "Install by downloading the appropriate package for your OS. Run with admin "
           "privileges. Follow prompts to select installation directory and components.",
       )
       kb.add_document(
           "System Requirements",
           "Minimum: 8GB RAM, 2GB storage, Windows 10/macOS 11+. Recommended: 16GB RAM, "
           "4GB storage, SSD, modern multi-core processor for optimal performance.",
       )

       # Example 1: Search for password-related help
       print("\nSearching for password help:")
       results = kb.search("How do I change my password?")
       for title, content, score in results:
           print(f"\nTitle: {title}")
           print(f"Relevance: {score:.2f}")
           print(f"Content: {content[:100]}...")

       # Example 2: Get topic suggestions
       print("\nGetting topics for billing query:")
       query = "Where can I update my credit card?"
       topics = kb.suggest_topics(query)
       for topic_id, relevance in topics:
           print(f"\nTopic {topic_id} (Relevance: {relevance:.2f}):")
           for title, content in kb.get_topic_documents(topic_id):
               print(f"- {title}: {content[:50]}...")

       # Example 3: Get all documents in a topic
       print("\nAll documents in Topic 0:")
       for title, content in kb.get_topic_documents(0):
           print(f"\nTitle: {title}")
           print(f"Content: {content[:100]}...")
   ```

   The `SmartKnowledgeBase` class implements an intelligent document retrieval and organization system using embeddings. You can add documents (`kb.add_document()`), search based on the user's question (`kb.search()`), and retrieve ranked, relevant results.
2. Run the script. With the server running in your first terminal, run the following command in the second terminal:

   ```bash
   magic run python -m embeddings.kb_system
   ```

   On your first run, this might take longer. The following is the expected output, shortened for brevity:

   ```
   Title: Password Reset Guide
   Relevance: 0.61
   Content: To reset your password: 1. Click 'Forgot Password' 2. Enter your email 3. Follow the reset link 4. C...
   ```
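From here, you can point the knowledge base at your own content. The following is a minimal, hypothetical sketch (the document titles, contents, and query are made up) of how you might reuse `SmartKnowledgeBase` from another script in the same project:

```python
from embeddings.kb_system import SmartKnowledgeBase

# Build a small knowledge base from your own documents (examples are made up).
kb = SmartKnowledgeBase()
kb.add_document("Shipping Policy", "Orders ship within 2 business days via standard carriers.")
kb.add_document("Returns", "Items can be returned within 30 days of delivery for a refund.")
kb.add_document("Warranty", "All hardware includes a 1-year limited warranty against defects.")

# Semantic search: the query shares no keywords with "Returns" but should still match it.
for title, content, score in kb.search("Can I send my order back?", top_k=1):
    print(f"{title} ({score:.2f}): {content[:60]}...")
```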
Conclusion

In this tutorial, you learned how to:

- Set up and test a local embeddings server using the `all-mpnet-base-v2` model
- Build a smart knowledge base system that can process and retrieve documents based on semantic similarity
- Implement document clustering and topic-based organization
- Create robust search functionality using embeddings