
Cost is typically a pain point when experimenting with state-of-the-art (SoTA) AI models, even for basic testing against the API. Most of the time, if you're building a proof of concept, you'll find you don't need the Ferrari to produce relatively strong results. In a nutshell, "reasoning" and small language models can typically get you far, but what if you could route requests based on prompt complexity to minimize your costs? I've covered this concept with an open-source approach in a previous blog post here. This pattern is now catching on with cloud service providers, with Amazon Web Services and Microsoft Azure adding the capability.
Understanding Routing
As an analogy, I like to think of navigation: you have options for how to reach a destination, whether that's a shortcut, the "optimal" path, or another route entirely. In the same vein, when a user queries your model, you likely don't want to spend a lot of horsepower on something that can be achieved at lower cost. Instead of giving an application or workflow access to a reasoning model that can astronomically rack up your costs, you can route queries to a smaller model based on their complexity.
Azure Foundry has this feature in public preview, and the router currently leverages the following models to serve routed requests. The router service itself is versioned: if you select "Auto-update", it will update when new versions are available. The underlying models can also change with these updates, which can affect both performance and cost.
Underlying models (version):
- GPT-4.1 (2025-04-14)
- GPT-4.1-mini (2025-04-14)
- GPT-4.1-nano (2025-04-14)
- o4-mini (2025-04-16)
Limitations
As of this writing (June 1, 2025), some of the limitations outlined in the documentation are the following.
- Model router doesn’t process audio input
- The context window listed on the models page is the limit of the smallest underlying model. Other underlying models support larger context windows, which means an API call with a larger context will succeed only if the prompt happens to be routed to the right model; otherwise, the call will fail.
- To work around the context window limit, it's suggested that you summarize the prompt before passing it to the model, truncate it to the most relevant parts, or use document embeddings and have the chat model retrieve relevant sections (see the sketch after this list).
- Model router accepts image inputs for vision-enabled chats (all the underlying models can accept image input), but the routing decision is based on the text input only.
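To illustrate the truncation workaround above, here's a minimal sketch of trimming an oversized prompt before handing it to the router. The truncate_prompt helper and its character budget are hypothetical, illustrative choices on my part, not part of the model router itself.

# Hypothetical helper (not part of the model router) to keep a prompt within
# the smallest underlying model's context window before sending it.
# The character budget below is an illustrative assumption, not official guidance.
def truncate_prompt(prompt: str, max_chars: int = 40000) -> str:
    """Naively keep the most recent portion of an oversized prompt."""
    if len(prompt) <= max_chars:
        return prompt
    # Keep the tail of the prompt, which usually carries the latest context.
    return prompt[-max_chars:]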
Quick Start
To demonstrate this concept, I've already put together example code hosted in this repository.
Requirements for this are as follows.
- Azure Foundry (Hub/Project)
- Model Router API Deployed
- Python 3.9 (or above)
If you haven’t deployed the model router, navigate to your Azure Foundry portal (https://ai.azure.com).
Under Model Deployments -> Deploy Model -> Base Model, this will bring up the catalog.

Once the model is consumable via a serverless endpoint, we can populate the files we’ll need.
I’m running this locally; in production, secrets like these would typically be stored in Azure Key Vault or a similar service. Bonus points if you use Entra ID tokens to eliminate API key usage entirely (see the sketch after the environment variables below).
# Env variables
AZURE_ENDPOINT=<endpoint>
AZURE_OPENAI_API_KEY=<key>
DEPLOYMENT_NAME=<deployment-name>
API_VERSION=<version>
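As a rough sketch of the Entra ID approach mentioned above (assuming the azure-identity package is installed and the identity running the code has the "Cognitive Services OpenAI User" role on the resource), you could swap the API key for a token provider:

# Sketch: authenticate with Entra ID instead of an API key.
# Assumes `pip install azure-identity` and an identity granted the
# "Cognitive Services OpenAI User" role on the Azure OpenAI resource.
import os
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI

token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
)

client = AzureOpenAI(
    api_version=os.environ["API_VERSION"],
    azure_endpoint=os.environ["AZURE_ENDPOINT"],
    azure_ad_token_provider=token_provider,  # no AZURE_OPENAI_API_KEY needed
)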
Now we have our main.py. I’ve abstracted a few concepts, but it essentially does two things: it defines a class called ModelRouterAgent, and it uses that class to create two agents (agent1 and agent2) with different system prompts.
import os
from openai import AzureOpenAI
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

endpoint = os.environ.get('AZURE_ENDPOINT')
api_key = os.environ.get('AZURE_OPENAI_API_KEY')
deployment_name = os.environ.get('DEPLOYMENT_NAME')
api_version = os.environ.get('API_VERSION')

# Define a class for the Azure OpenAI client
class ModelRouterAgent:
    def __init__(self, system_prompt):
        self.endpoint = endpoint
        self.api_key = api_key
        self.deployment_name = deployment_name
        self.api_version = api_version
        self.client = AzureOpenAI(
            api_version=self.api_version,
            azure_endpoint=self.endpoint,
            api_key=self.api_key,
        )
        self.system_prompt = system_prompt

    # Define a method to send a message to the model
    def run(self, user_prompt):
        response = self.client.chat.completions.create(
            messages=[
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": user_prompt},
            ],
            max_tokens=8192,
            temperature=0.7,
            top_p=0.95,
            frequency_penalty=0.0,
            presence_penalty=0.0,
            model=self.deployment_name,
        )
        output = response.choices[0].message.content
        # Print the underlying model the router selected for this request
        print("Model chosen by the router:", response.model)
        return output, response.model
    # Close the underlying HTTP client when finished
    def close(self):
        self.client.close()
To surface which model was chosen to the end user, I’ve added a print statement that accesses response.model. Note that if you prefer a streaming response, this value is abstracted away and isn’t available in the same way.
Putting this all together, I’m going to run through two system prompts: the first agent takes the user prompt, and its output is then fed to a second agent with a modified system prompt.
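The orchestration looks roughly like the sketch below. The system prompts and user prompt here are placeholders of my own, not the exact ones from the repository.

# Rough sketch of the two-agent flow; prompts are illustrative placeholders,
# not the exact ones used in the repository.
agent1 = ModelRouterAgent("You are a Kubernetes security reviewer. Analyze the input for issues.")
agent2 = ModelRouterAgent("Take the prior findings and propose what should be flagged, as YAML.")

user_prompt = "Review this Kubernetes manifest for security issues: ..."

# Agent 1 handles the original prompt; the router picks the underlying model.
output1, model1 = agent1.run(user_prompt)

# Agent 2 receives Agent 1's output as its user prompt.
output2, model2 = agent2.run(output1)

print(f"Agent 1 used {model1}, Agent 2 used {model2}")

agent1.close()
agent2.close()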

The results will show us which model the router chose for Agent 1 and Agent 2.

After you’ve completed this, you can see the router chose gpt-4.1-mini. While this isn’t the full response, it also proposes in YAML what should be flagged based on the finding and suggests using OPA Gatekeeper.
Summary
Model router routes requests based on their text input, determining the optimal model for the complexity of each request. Given that token costs and context windows can be make-or-break depending on the use case, it’s worth considering as a way to lower costs across your usage. In some of my testing I’ve also seen it choose o4-mini, which is still relatively inexpensive compared to reasoning models such as o3. I could see this evolving to where you select which models you want included in the router (that’s on my dream list).