
DeepSeek-V3 is an open-source large language model built on a 671-billion-parameter Mixture-of-Experts (MoE) architecture, with only 37 billion parameters activated per token. It uses Multi-Head Latent Attention (MLA), which compresses the attention keys and values into a low-dimensional latent representation to make inference efficient, along with an auxiliary-loss-free load-balancing strategy that keeps tokens evenly distributed across the experts during training. I’ve covered these concepts in a previous blog, so to be brief: with some tinkering on parameters and temperature, this model produces very concise responses. Recently, Azure AI Foundry made this model consumable via a serverless API at relatively low cost. In this post I’ll walk through the integration step by step and use the LangChain integration to demonstrate ways to extend the LLM.
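To make the “only a fraction of parameters active per token” idea concrete, here is a deliberately tiny top-k routing sketch. It is illustrative only and not DeepSeek-V3’s actual router, expert count, or MLA mechanism.
# Toy illustration of Mixture-of-Experts routing: every token is scored against all experts,
# but only the top-k experts actually process it, so only a fraction of parameters is active.
# Illustrative only -- not DeepSeek-V3's real routing or load-balancing implementation.
def route_token(scores: list[float], k: int = 2) -> list[int]:
    """Return the indices of the k highest-scoring experts for one token."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

expert_scores = [0.1, 2.3, -0.4, 1.7, 0.0, 3.1, -1.2, 0.6]  # gate scores for 8 experts
print(route_token(expert_scores))  # -> [5, 1]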
Getting Started
If you want to follow along and replicate this, you’ll need the following.
- Subscription (Azure)
- Azure Foundry Hub + Project
- Python 3.11 or later
Azure AI Foundry lets you consume third-party (3P) models via a Model Catalog. Assuming your hub and project are set up, you’ll find this on the left-hand blade under Models + Endpoints.

We can see DeepSeek-V3 at the top of “What’s New?”, where we can select Check out model.
You can also get there via Models + Endpoints by selecting Deploy Base Model -> searching “DeepSeek”.

You’ll see a terms and conditions page that explains the usage terms for this specific model along with some pricing information.

We hit Agree and Proceed, which starts the provisioning process in the portal and gives us a unique name and API key to start consuming the model.
Pricing as it stands today, referencing this link:
| Model | Input Pricing (per 1K tokens) | Output Pricing (per 1K tokens) |
| --- | --- | --- |
| DeepSeek-V3 Global | $0.00114 | $0.00456 |
| DeepSeek-V3 Region | $0.00125 | $0.005 |
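As a quick sanity check of what a single call costs at the Global rates above, the arithmetic is just tokens divided by 1,000 times the rate; the token counts below are made-up illustrative values, not measured usage.
# Rough cost of one call at the Global rates (token counts are illustrative assumptions)
input_tokens, output_tokens = 1500, 600
cost = (input_tokens / 1000) * 0.00114 + (output_tokens / 1000) * 0.00456
print(f"${cost:.4f}")  # roughly $0.0044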
Now that pricing is out of the way: after we hit Agree and Proceed, we’re prompted for the Model Deployment Name (this is what you’ll call it and can be any unique name) and the deployment type; at this point I only have Global Standard as an option.

After this is deployed you’ll see the <endpoint> URL that we’ll need for our environment variables, along with a note that the deployment name is what you’ll use when calling the endpoint.
We move over to our IDE and create our folder and virtual environment.
mkdir deepseek
cd deepseek
python3 -m venv deepseek
cd deepseek/Scripts/
./Activate.ps1  # I'm on Windows
Our requirements.txt should look like the following.
#requirements.txt
azure-ai-inference==1.0.0b9
azure-identity==1.20.0
langchain-core==0.3.43
langchain-azure-ai==0.1.2
langchain-community==0.3.19
python-dotenv==1.0.1
arxiv==2.1.3
PyMuPDF==1.25.4
In our notebook, the first code block installs everything we need to get the environment running.
%pip install -r requirements.txt
Next we set our environment variables using python-dotenv, which reads key-value pairs from a local .env file.
#.env.example
AZURE_AI_ENDPOINT=<url>
AZURE_AI_KEY=<key>
DEPLOYMENT_NAME=<deployment_name>
REGION=<region>
Now we can get to the base of our code which will start with initialization of our client and other components.
import os
from azure.core.credentials import AzureKeyCredential
from langchain_azure_ai.chat_models import AzureAIChatCompletionsModel
from dotenv import load_dotenv
from langchain_core.prompts import PromptTemplate
from langchain_community.retrievers import ArxivRetriever
# Load environment variables
load_dotenv()
# Get the environment variables
endpoint = os.getenv("AZURE_AI_ENDPOINT")  # Azure AI endpoint environment variable; this is the Completions endpoint
model_name = os.getenv("DEPLOYMENT_NAME")  # Deployment name environment variable (what we named the model deployment)
key = os.getenv("AZURE_AI_KEY")  # Azure AI key environment variable
region = os.getenv("REGION")  # Region the model is deployed in
wrapper_key = AzureKeyCredential(key)  # Wrap the key in an AzureKeyCredential
# Initialize the Azure AI Chat Completions Model
model = AzureAIChatCompletionsModel(
    endpoint=endpoint,
    credential=wrapper_key,
    model_name=model_name,
    max_tokens=2048
)
# Initialize the Arxiv Retriever
arxiv_r = ArxivRetriever(
    load_max_docs=3,  # Maximum number of papers to load
    get_full_documents=True  # Retrieve the full documents
)
# Prompt template
prompt_template = PromptTemplate(
    input_variables=["query", "relevant_info"],
    template="""
You are an expert in Security Research across a variety of topics, specifically AI Security. Use the following information from arXiv to answer the user's question. If there is not sufficient information, say 'I need more information to answer this question'.
Question: {query}
Relevant Information:
{relevant_info}
Answer:
"""
)
This adopts a pattern inspired by LazaUK on GitHub, with alterations. The prompt template defines what our model is directed to do before it receives a query, and it declares two input variables: “query” and “relevant_info”.
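If you want to inspect exactly what the model receives, you can render the template locally; the values below are placeholders of my own, purely for illustration.
# Render the prompt locally to see the final text sent to the model
# (both values are placeholder strings, not part of the original notebook)
print(prompt_template.format(
    query="What are common attack surfaces in LLM applications?",
    relevant_info="Title: Example Paper\nAbstract: ..."
))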
Chaining the model
Chaining is the simple step of piping our Prompt Template into the Model, shown below.
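The screenshot isn’t reproduced here, but the chain itself is a single line using LangChain’s pipe (LCEL) syntax, the same pattern used for the financial example later in this post.
# Compose the prompt template and the model into a single runnable chain
chain = prompt_template | model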

Once this runs, we can inspect chain to see its components along with our model definition.
Next, this portion of the code does two things: it invokes the arXiv retriever and loops over the results to collect each paper’s title and page content, then passes those results to our chain via the invoke method as ‘relevant_info’.
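Note that the original query string isn’t shown above; you’ll need one defined for the block below to run. Here’s a placeholder of my own (not the query from the original run).
# Placeholder query for illustration -- substitute your own research question
query = "What are the latest adversarial attack techniques against large language models?"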
# Analyze the research papers
try:
    papers = arxiv_r.invoke(query)  # Papers retrieved from the Arxiv Retriever
    relevant_info = "\n".join([
        f"Title: {paper.metadata.get('title', 'No title')}\nAbstract: {paper.page_content}"
        for paper in papers
    ])  # Relevant information from the papers
    response = chain.invoke({"query": query, "relevant_info": relevant_info})  # Response from the model
    print("Response", response.content)
    print("-----" * 20)
    # Usage stats from our query
    print("Usage of query")
    print("\tPrompt Tokens:", response.usage_metadata["input_tokens"])
    print("\tCompletion Tokens:", response.usage_metadata["output_tokens"])
    print("\tTotal Tokens:", response.usage_metadata["total_tokens"])
    print("-----" * 20)
    # Cost rates per 1K tokens (in USD)
    INPUT_COST_PER_1K = 0.00114
    OUTPUT_COST_PER_1K = 0.00456
    # Estimated cost of the query
    input_cost = (response.usage_metadata["input_tokens"] / 1000) * INPUT_COST_PER_1K
    output_cost = (response.usage_metadata["output_tokens"] / 1000) * OUTPUT_COST_PER_1K
    total_cost = input_cost + output_cost
    # Print cost with proper formatting
    print("Estimated Cost")
    print(f"\tInput cost ({INPUT_COST_PER_1K:.5f}/1K tokens): ${input_cost:.5f}")
    print(f"\tOutput cost ({OUTPUT_COST_PER_1K:.5f}/1K tokens): ${output_cost:.5f}")
    print(f"\tTotal cost: ${total_cost:.5f}")
    print("-----" * 20)
except Exception as e:
    print(f"Error processing query: {e}")
    import traceback
    traceback.print_exc()

We get the usage of the query along with a calculation of the cost of running it, which is relatively cheap given that we’ve added context and the model is reasoning over it.
So now I’m changing the query to probe “Summarize the existing challenges in AI Security” and running it through the chain again.
The results shift to more qualitative areas such as Opacity of AI Systems, Compliance and Regulatory Frameworks, and others.

Now we can also expand this to something more practical. I’m going to rework the prompt template to reflect a financial advisor, with relevant background information sent to the LLM to produce a financial projection.
financial_template = PromptTemplate(
    input_variables=["background_info"],
    template="""
You are an experienced Financial Advisor that helps hedge fund clients and individual clients with personal investment decisions. Use the following background information to answer the user's inquiry on the best financial path, along with advice on a diverse portfolio. Use principled investing strategies to provide the best advice.
{background_info}
Provide your advice on the best financial path for the client, advice on a diverse portfolio, and your reasoning; include a specific best case scenario and worst case scenario on a 10 year basis.
"""
)
chain = financial_template | model
We then put together “background_info”; this represents your use case and can be dynamic. For simplicity, I want to see the reasoning behind investment decisions along with a recommended direction.
background_info = """
Client Age: 24
Client Income: 250,000
Client Savings: 100,000
Client Debt: 50,000
Client Goals: Aggressive Growth with a time horizon of 10 years, a risk tolerance of 7/10, and minimizing long-term tax exposure with a Roth IRA/Backdoor Roth IRA
Client Investment Knowledge: Intermediate
"""
Then we put it all together with the same invocation, passing just background_info.
# Put it all together
response = chain.invoke(background_info)
print("Response", response.content)
print("-----" * 20)
print("Usage of query")
print("\tPrompt Tokens:", response.usage_metadata["input_tokens"])
print("\tCompletion Tokens:", response.usage_metadata["output_tokens"])
print("\tTotal Tokens:", response.usage_metadata["total_tokens"])
print("-----" * 20)
# Cost rates per 1K Tokens (in USD)
INPUT_COST_PER_1K = 0.00114
OUTPUT_COST_PER_1K = 0.00456
# Estimated cost of the query
input_cost = (response.usage_metadata["input_tokens"] / 1000) * INPUT_COST_PER_1K
output_cost = (response.usage_metadata["output_tokens"] / 1000) * OUTPUT_COST_PER_1K
total_cost = input_cost + output_cost
# Print cost with proper formatting
print("Estimated Cost")
print(f"\tInput cost ({INPUT_COST_PER_1K:.5f}/1K tokens): ${input_cost:.5f}")
print(f"\tOutput cost ({OUTPUT_COST_PER_1K:.5f}/1K tokens): ${output_cost:.5f}")
print(f"\tTotal cost: ${total_cost:.5f}")
print("-----" * 20)

The best part about this output: as instructed, I get a best case scenario and a worst case scenario, each with an example given the context.
Best Case Scenario on a 10 Year Basis
- Assumptions: Average annual return of 8-10% from a diversified equity-heavy portfolio, consistent contributions, and favorable market conditions
- Outcome: The client’s $100,000 initial savings, combined with annual contributions, could grow to $400,000-$500,000+. Debt is paid off, and the client has a robust portfolio with significant tax-free growth from the Roth IRA.
- Example: Starting with $100,000 and adding $30,000 annually, at a 9% return, the portfolio could reach ~$550,000 in 10 years.
Worst Case Scenario on a 10 Year Basis
- Assumptions: Market downturn, recession, or prolonged bear market with average annual returns of 0-2%.
- Outcome: The portfolio grows minimally, potentially reaching $150,000-$200,000 despite consistent contributions. However, the client’s time horizon allows for recovery, and the diversified portfolio mitigates losses.
- Example: Starting with $100,000 and adding $30,000 annually, at a 1% return, the portfolio could reach ~$200,000 in 10 years.
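If you want to sanity-check compound-growth figures like these yourself, a standard future-value calculation is enough. The sketch below assumes year-end contributions of $30,000 and a constant annual return, which are my own assumptions rather than anything stated in the model’s output.
# Future value of an initial lump sum plus equal year-end contributions at a constant annual return
def future_value(initial: float, annual_contribution: float, rate: float, years: int) -> float:
    fv_lump = initial * (1 + rate) ** years  # growth of the starting balance
    fv_contributions = annual_contribution * (((1 + rate) ** years - 1) / rate)  # growth of the yearly contributions
    return fv_lump + fv_contributions

print(f"${future_value(100_000, 30_000, 0.09, 10):,.0f}")  # optimistic-return scenario
print(f"${future_value(100_000, 30_000, 0.01, 10):,.0f}")  # pessimistic-return scenario
The exact numbers depend heavily on the contribution amount, timing, and taxes you assume, so they won’t match the model’s rounded figures exactly, which is itself a useful check on any projection the model gives you.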
Summary
DeepSeek-V3 is a relatively powerful model that, with minimal cost and some tuning such as adding context, produces some pretty robust outputs. In my experience, getting the most from reasoning models is a combination of prompting and context, whether in the form of RAG or existing knowledge infused into the initial query. Consider DeepSeek for some of your heavier reasoning workloads: the costs are practical, and since the serverless option is simple to use, it can be a good fit. This notebook is hosted on my GitHub repository, linked here.