RouteLLM: Unlocking Cost-Effective LLM Routing

Introduction

Costs associated with using closed-source large language models can add up quickly for complex tasks, because API usage is priced per token. RouteLLM is an open-source project that provides a method for deciding, based on the query a user sends, which LLM should handle it. This routing substantially lowers the costs traditionally associated with sending every request to one large model, since many requests can be served just as well by a cheaper model of similar capability. In short, think of planning a route for a road trip: there are numerous ways to reach the destination, but which one is both effective and cost efficient?

Getting Started

I’ve been exploring ways to chain tasks together for various use cases, such as research and analyzing the outputs of various articles and queries, and the token usage adds up. If I ran everything through Claude Sonnet, GPT-4, or another hosted model with a large input context window, those costs could add up even for small projects. For simplicity, you can use the following code as an example; it is the project’s quick start with slight modifications.

Requirements

  • OpenAI API key – access to GPT models
  • Groq API key – access to Llama models
  • RouteLLM
  • Python 3.9+ installed

For starters, we need to install the RouteLLM package and its supporting libraries:

python -m venv routellm
routellm\Scripts\activate       # on Windows
source routellm/bin/activate    # on macOS/Linux
pip install "routellm[serve,eval]"
pip install groq
pip install python-dotenv
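
Because the script below loads credentials with python-dotenv, create a .env file in the project directory to hold your keys. The values here are placeholders, not real keys:

OPENAI_API_KEY=sk-...
GROQ_API_KEY=gsk_...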

Once these are installed in our virtual environment, we can use the following code.

import os

from dotenv import load_dotenv
from routellm.controller import Controller

# Load environment variables from a .env file; RouteLLM (via LiteLLM)
# reads OPENAI_API_KEY and GROQ_API_KEY from the environment.
load_dotenv()

# Create a controller: the MF (matrix factorization) router picks between
# the strong and weak model for each query.
client = Controller(
    routers=["mf"],
    strong_model="openai/gpt-4o-mini-2024-07-18",
    weak_model="groq/llama3-8b-8192",
)

# Dynamic query that is inserted into the router
query = input("Enter the desired query: ")

# Get a completion; "router-mf-0.11593" tells RouteLLM to use the MF
# router with a cost threshold of 0.11593.
chat_completion = client.chat.completions.create(
    model="router-mf-0.11593",
    messages=[
        {
            "role": "user",
            "content": query,
        }
    ],
)

# The response follows the OpenAI format: attribute access is the
# reliable way to pull out the content and the model that served it.
message_content = chat_completion.choices[0].message.content
model_selected = chat_completion.model

print(f"Model selected: {model_selected}")
print(message_content)

Now, a few items to break down from this code. The Controller initializes the models through the strong_model and weak_model arguments; notice that these are denoted a little differently than in each provider’s own reference, using a provider prefix (openai/, groq/). Since the weak model is served from Groq, I’ve chosen llama3-8b-8192, paired with gpt-4o-mini-2024-07-18 as the strong model. The controller holds this configuration, and the chat.completions.create call references "router-mf-0.11593": mf names the matrix factorization router, and 0.11593 is the cost threshold that controls how queries are split between the strong and weak models. Finally, the query is a simple input() whose value gets passed to the router.
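
As an aside on where 0.11593 comes from: RouteLLM ships a threshold calibration tool that maps the percentage of queries you want handled by the strong model onto a threshold value. Per the project’s README, a calibration run along these lines yields this exact number (the flags may differ slightly across versions):

python -m routellm.calibrate_threshold --routers mf --strong-model-pct 0.116 --config config.example.yaml
# For 11.6% strong model calls for mf, threshold = 0.11593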

Notice that the query sent is related to generative AI security, and the response shows that the selected model is the Groq-hosted Llama 3 (see the illustrative run below). This saves costs because, for all practical purposes, the Groq API is effectively free at the time of writing. The response is thorough and sufficient for this specific ask.
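
An abridged run looks roughly like the following; the query and response text here are placeholders for illustration, not the exact output:

Enter the desired query: What are the main security considerations for generative AI applications?
Model selected: groq/llama3-8b-8192
Generative AI applications face several security considerations, including
prompt injection, training data poisoning, sensitive data leakage, ...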

So now I’ve asked it the following prompt to see how it responds and which model it selects.

Identify potential security threats in a generative AI model's training data. What types of malicious data might be present, and how can they impact model performance and security?

The response, shown below, is thorough, and the router again chooses Llama 3.

Expanding the Uses

Where I could see this being expanded in a project is in a chat system, such as a front-end user experience where each query dynamically selects a model based on difficulty. Another area to consider RouteLLM is when working with agents: perhaps an agent doesn’t need the higher-end LLM, and routing can serve it effectively at lower cost, as sketched below. Given the speed of Groq’s responses, it’s hard to compete on user experience because the replies are blazing fast, though for more complex tasks with larger inputs that can be factored in as well. This is an evolving landscape of tools and frameworks to help end users adopt generative AI, but this one stands out as a cost-effective long-term strategy.
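
As a minimal sketch of that idea, reusing the Controller and threshold from earlier (the helper name and example queries here are hypothetical), a small function can route a batch of queries and tally which model actually served each one, making the cost split easy to monitor:

from collections import Counter

def route_and_tally(client, queries, model="router-mf-0.11593"):
    """Route each query through RouteLLM and count which model served it."""
    tally = Counter()
    for q in queries:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": q}],
        )
        # response.model reports the underlying model the router chose,
        # e.g. groq/llama3-8b-8192 or openai/gpt-4o-mini-2024-07-18
        tally[response.model] += 1
    return tally

# Example usage with the client defined earlier:
# print(route_and_tally(client, ["What is an LLM?", "Design a threat model for a RAG pipeline."]))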