Server integration overview
This is a preview of server integration with the Modular Inference Engine. It is not publicly available yet.
If you’re interested, please sign up for early access.
We know how important it is to have an end-to-end serving solution, with scaling and monitoring, as part of your MLOps infrastructure for production AI deployment. That’s why we packaged the Modular Inference Engine as a drop-in backend for popular model inference servers, giving you instant performance gains for your existing PyTorch and TensorFlow inference workloads.
The Modular Inference Engine’s superior performance significantly lowers your model serving latency, increases serving throughput, and reduces operational compute costs for your production AI workloads.
We provide full compatibility with existing inference servers:
- Server-side TensorFlow and PyTorch model configurations work as-is; you simply change the backend name.
- Existing client code that sends inference requests works as-is.
No extra development effort is required to leverage the higher inference performance of our unified engine.
Example configuration
If you’re using NVIDIA’s Triton Inference Server or TensorFlow Serving, you can drop the Modular Inference Engine backend library into your existing inference server image and then change the backend name in your TensorFlow or PyTorch model configuration files. That’s it!
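For reference, Triton loads each model from a model repository directory that holds the model file and its config.pbtxt. The sketch below is illustrative only: the repository name, the model name bert-large (taken from the client example further down), and the version directory are assumptions, while the SavedModel filename comes from the configuration that follows.

models/
└── bert-large/
    ├── config.pbtxt
    └── 1/
        └── bert-base.savedmodel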
For example, below is a model configuration file for the Triton Inference Server using the Modular Inference Engine as the compute backend. The only difference, compared to a TensorFlow or PyTorch backend configuration, is the backend name:
bert-config.pbtxt
: "bert-base.savedmodel"
default_model_filename: "modular"
backend
input {name: "attention_mask"
data_type: TYPE_INT32
dims: [-1, -1]
}
input {name: "input_ids"
data_type: TYPE_INT32
dims: [-1, -1]
}
input {name: "token_type_ids"
data_type: TYPE_INT32
dims: [-1, -1]
}
output {name: "end_logits"
data_type: TYPE_FP32
dims: [-1, -1]
}
output {name: "start_logits"
data_type: TYPE_FP32
dims: [-1, -1]
}
instance_group {kind: KIND_CPU
}
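Before sending inference requests, you can confirm that the server has loaded the model using Triton’s standard gRPC client API. The following is a minimal sketch, assuming the server is reachable at localhost:8001 and the model is named "bert-large" (the name used by the client example below); nothing here is specific to the Modular backend.

import tritonclient.grpc as grpcclient

# Connect to the Triton server's gRPC endpoint (the address is an assumption).
client = grpcclient.InferenceServerClient(url="localhost:8001")

# Confirm the server is up and the BERT model is loaded and ready.
assert client.is_server_live(), "Triton server is not live"
assert client.is_model_ready("bert-large"), "Model 'bert-large' is not ready"

# Print the model's input/output metadata as reported by the server.
print(client.get_model_metadata("bert-large"))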
Example client request
For client programs that send requests to the server, you don’t need to change the code at all.
Here is some client code that sends BERT Q&A inference requests to the Triton Inference Server configuration above, which uses the Modular Inference Engine as its compute backend. You’d never know it’s using Modular by looking at this code, because the code is completely unaffected:
bert-client.py
from transformers import AutoTokenizer
import numpy as np
import tritonclient.grpc as grpcclient
def answer_question(
    triton_client, question, context, timeout=None
):
    tokenizer = AutoTokenizer.from_pretrained(
        "bert-large-uncased-whole-word-masking-finetuned-squad"
    )
    # Convert the inputs to bert tokens.
    inputs = tokenizer(
        question, context, add_special_tokens=True, return_tensors="tf"
    )
    sequence_length = inputs["input_ids"].shape[1]

    # Set up the gRPC input tensors.
    grpc_tensors = [
        grpcclient.InferInput("attention_mask", (1, sequence_length), "INT32"),
        grpcclient.InferInput("input_ids", (1, sequence_length), "INT32"),
        grpcclient.InferInput("token_type_ids", (1, sequence_length), "INT32"),
    ]
    # Tokenized input tensors -> triton.
    grpc_tensors[0].set_data_from_numpy(inputs["attention_mask"].numpy())
    grpc_tensors[1].set_data_from_numpy(inputs["input_ids"].numpy())
    grpc_tensors[2].set_data_from_numpy(inputs["token_type_ids"].numpy())

    # Get the result from the server.
    result = triton_client.infer("bert-large", grpc_tensors, timeout=timeout)

    # Slice back to `sequence_length`.
    server_start = result.as_numpy("start_logits")[:, :sequence_length]
    server_end = result.as_numpy("end_logits")[:, :sequence_length]

    # Use numpy to get the predicted start and end position from the
    # output scores.
    predicted_start = np.argmax(server_start, axis=1)[0]
    predicted_end = np.argmax(server_end, axis=1)[0] + 1

    # The answer is expressed in terms of positions in the input, so we need
    # this to be able to map back to the answer text.
    input_ids = inputs["input_ids"].numpy()[0]

    # Use the above positions to find the answer tokens in the input.
    answer_tokens = tokenizer.convert_ids_to_tokens(
        input_ids[predicted_start:predicted_end]
    )
    # Convert it into a human-readable string.
    answer = tokenizer.convert_tokens_to_string(answer_tokens)
    return answer


def main(context_filename, host, port):
    with open(context_filename) as f:
        context = f.read()
    print("Context:\n", context)

    # Open the triton server connection.
    url = f"{host}:{port}"
    triton_client = grpcclient.InferenceServerClient(url=url)

    while True:
        question = input("> ")
        output = answer_question(triton_client, question, context)
        print(output)

    # Close server connection.
    triton_client.close()


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(prog="bert-cli")
    parser.add_argument("-c", "--context", required=True, help="Context file")
    parser.add_argument(
        "-s", "--server", required=True, help="Inference server host"
    )
    parser.add_argument(
        "-p",
        "--port",
        required=False,
        default="8001",
        help="Inference server port",
    )
    args = parser.parse_args()
    main(args.context, args.server, args.port)
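To try the client, you would point it at a running Triton instance. For example, assuming the server is on localhost with Triton’s default gRPC port 8001 and a plain-text context file named context.txt (both placeholder values):

python3 bert-client.py --context context.txt --server localhost --port 8001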
The client code above is no different from code you would use with a Triton Inference Server instance running any other backend, so you can simply update the model config and be done.
Also check out the Triton serving demo from our launch keynote video, which shows how your inference latency drops when switching from the default TensorFlow backend to the Modular backend.