Introduction
Microsoft recently open-sourced its Phi-3.5 Mixture of Experts (MoE) model in the Azure AI Studio catalog, provided as a Model-as-a-Service that you can run on Azure; you can also use Hugging Face to access the model. If you’ve been following the constant upstream model releases, the first question is likely about the Mixture of Experts approach, since most Phi-3 models are favored as SLMs for their lower cost and quick responses. In my minimal testing of this new model, I’ve been tracking specific responses to a mix of assistant instructions and queries, with some surprising results.
Mixture of Experts
Every query you send to a Large Language Model carries a computational cost that varies with the input tokens and the complexity of the task. Mixture of Experts is an efficiency and scalability pattern that routes different parts of the input to specialized sub-models (“experts”). Rather than activating all parameters for every input, this method activates only the selected experts, which reduces cost while still maintaining strong output quality. There is much more to this method, reaching into complex topics that won’t be covered in this blog post, but at a high level it distributes each query efficiently based on which experts it routes to.
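To make the routing idea concrete, here is a minimal, self-contained sketch in Python (using numpy, with made-up sizes; a conceptual illustration only, not the Phi-3.5 implementation). A small gating network scores all experts, but only the top-k selected experts actually run for a given token.

import numpy as np

rng = np.random.default_rng(0)
NUM_EXPERTS = 8   # hypothetical expert count
TOP_K = 2         # experts activated per token
DIM = 16          # hypothetical hidden dimension

# Each "expert" is a small feed-forward weight matrix; the router scores them.
experts = [rng.standard_normal((DIM, DIM)) * 0.1 for _ in range(NUM_EXPERTS)]
router = rng.standard_normal((DIM, NUM_EXPERTS)) * 0.1

def moe_layer(token):
    logits = token @ router                # score every expert
    top = np.argsort(logits)[-TOP_K:]      # keep only the top-k experts
    weights = np.exp(logits[top])
    weights /= weights.sum()               # softmax over the selected experts
    # Only TOP_K of NUM_EXPERTS experts do any work for this token.
    return sum(w * (token @ experts[i]) for w, i in zip(weights, top))

print(moe_layer(rng.standard_normal(DIM)).shape)   # -> (16,)

Because only two of the eight experts run per token, the compute per query is a fraction of a comparable dense model, which is the cost advantage the pattern is after.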
Deployment
- Azure AI Studio (or run locally if you have the compute)
- Python (other languages are also supported)
- Azure AI Inference SDK (if using Azure AI Studio)
For starters, we’ll need to consume the API, either through dedicated compute or serverless inferencing. For a quick method, I’m opting for serverless.
To deploy the model from the catalog, open Azure AI Studio (assuming you have a project/hub); this shows the interface.
Select Base model, which lists the currently hosted models, and search for Phi-3.5.
The description will also inform you about the training data and the knowledge cutoff date of October 2023.
Once you hit Select, you’ll be prompted for the deployment method, as shown below.
Serverless API with Azure AI Content Safety is the method I went with. Note that if you self-host, you’ll likely have to consume Azure AI Content Safety separately to ensure you guard against inadvertent content.
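As a rough illustration of what consuming it separately could look like, here is a minimal sketch using the azure-ai-contentsafety Python package (1.x); the environment variable names and the text being screened are placeholders for your own Content Safety resource.

import os
from azure.core.credentials import AzureKeyCredential
from azure.ai.contentsafety import ContentSafetyClient
from azure.ai.contentsafety.models import AnalyzeTextOptions

# Placeholder environment variables for a separately provisioned Content Safety resource.
safety_client = ContentSafetyClient(
    endpoint=os.getenv("CONTENT_SAFETY_ENDPOINT"),
    credential=AzureKeyCredential(os.getenv("CONTENT_SAFETY_KEY"))
)

# Screen a piece of model output before returning it to the user.
result = safety_client.analyze_text(AnalyzeTextOptions(text="model output to screen"))
for analysis in result.categories_analysis:
    print(analysis.category, analysis.severity)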
Once the deployment finishes, you’ll need two items to send requests via the SDK: the endpoint URL and the API key.
Since we are using an API key and endpoint URL, the deployment page also has a Consume tab that lists both.
If you want to see code examples in other languages, this tab includes Python, C#, and JSON.
First, we’ll need the following packages installed, assuming you want to use the SDK; if you just want to test, the playground also offers direct access.
pip install azure-ai-inference
pip install python-dotenv
Ensure you populate a .env file as the means to retrieve the endpoint and API key that the ChatCompletionsClient requires.
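For example, the .env file can look like the following (both values are placeholders; copy the real ones from your deployment’s Consume tab):

AZURE_ENDPOINT=https://<your-deployment-name>.<region>.models.ai.azure.com
AZURE_API_KEY=<your-api-key>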
import os
from azure.core.credentials import AzureKeyCredential
from azure.ai.inference import ChatCompletionsClient
from dotenv import load_dotenv
# Load the environment variables from the .env file
load_dotenv()
# Get the API key from the environment variables
api_key = os.getenv("AZURE_API_KEY")
endpoint = os.getenv("AZURE_ENDPOINT")
if not api_key or not endpoint:
    raise Exception("An API key and endpoint must be provided to invoke the deployment")

client = ChatCompletionsClient(
    endpoint=endpoint,
    credential=AzureKeyCredential(api_key)
)
model_info = client.get_model_info()
print("Model name:", model_info.model_name)
print("Model type:", model_info.model_type)
print("Model provider name:", model_info.model_provider_name)
# Build the chat payload: the system message sets the advisor persona,
# and the user message carries the actual question.
payload = {
    "messages": [
        {
            "role": "system",
            "content": "You're a senior security advisor that assists enterprises specializing in Generative AI Security; your knowledge is wide but also references known research such as MITRE ATLAS, the OWASP Top 10 for LLMs, and NIST AI 100-1 (the AI RMF)."
        },
        {
            "role": "user",
            "content": "Give me examples of Generative AI Security Strategies include a brief description and overview"
        }
    ],
    "max_tokens": 2048,
    "temperature": 0.8,
    "top_p": 0.1,
    "presence_penalty": 0,
    "frequency_penalty": 0
}
response = client.complete(payload)
print("Response:", response.choices[0].message.content)
print("Model:", response.model)
print("Usage:")
print(" Prompt tokens:", response.usage.prompt_tokens)
print(" Total tokens:", response.usage.total_tokens)
print(" Completion tokens:", response.usage.completion_tokens)
I had to test through the user interface, as the SDK was giving me a new error on payloads.
From that simple prompt, with the system message directing what the model knows and specializes in, it produced the following response.
In the Generative AI security strategies portion of the response, we can see the importance of Data Governance discussed, along with more detail on Model Security.
Summary
This post gave a brief overview of using the Phi-3.5 models as an alternative, and of how Mixture of Experts yields high-accuracy responses. As with any model, the direction and context you provide in a stateful respect, such as assistant or system instructions, lead to more meaningful responses. Also consider using a RAG pattern to enrich responses with real-time sources. Mixture of Experts, while still in its early stages, shows a path forward for smaller models: lowering cost without losing accuracy or experience.