Introduction
Most often, LLMs behind a front-end UI have memory constraints because the conversation is initiated through the ChatCompletionsClient, which is stateless by nature: the exchange is limited to the current session, and the model's responses are limited to its pre-existing knowledge plus whatever context is provided in that session. Over time the consumer experience has been upgraded with some form of persistent memory that can recall previous conversations, but many freemium LLM services still don't offer this feature, which can result in lackluster responses because the model simply doesn't know your context.
Existing Data Strategies for Efficient Retrieval
When working with data retrieval for LLM applications, especially when scaling, implementing a robust data strategy is crucial. Vector databases, commonly used to store embeddings for efficient similarity search, form the backbone of many retrieval systems. Popular vector databases like Pinecone, Weaviate, and FAISS offer diverse functionalities, from managed solutions with built-in scalability to open-source options that require custom configuration. The choice of vector database depends largely on the access patterns—whether you need fast, low-latency lookups, real-time data updates, or bulk retrievals. Additionally, to streamline integration, many applications use a retrieval-augmented generation (RAG) approach, combining embeddings and prompt engineering to return contextually relevant data directly to the LLM. As data retrieval needs increase, using strategies like batching requests, caching frequent queries, and indexing high-priority documents can help maintain performance and optimize costs.
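To make the caching idea above concrete, here is a minimal sketch of memoizing frequent queries in front of a retriever. The embed and search_vector_db functions are hypothetical placeholders for whatever embedding model and vector database client you end up using; the point is only that repeated identical queries skip the embedding call and the database round trip.
from functools import lru_cache

def embed(text: str) -> list[float]:
    # placeholder: call your embedding model here
    raise NotImplementedError

def search_vector_db(vector: list[float]) -> list[str]:
    # placeholder: similarity search against your vector database
    raise NotImplementedError

@lru_cache(maxsize=256)
def cached_retrieve(query: str) -> tuple[str, ...]:
    # identical queries are served from the in-memory cache
    return tuple(search_vector_db(embed(query)))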
Enter LlamaIndex
This challenge is what led me to LlamaIndex—a robust data framework specifically designed for working with LLMs.
For today's post I'm going to walk through the LlamaIndex quick start and illustrate some of the code blocks used to leverage it.
I'm using the document below, saved in the ./data folder as 'owasptop10.txt' (a quick snippet after the list shows one way to create that file).
1. **Prompt Injection** - Manipulating input prompts to alter LLM behavior unexpectedly.
2. **Insecure Output Handling** - Vulnerability occurs when an LLM output is accepted without scrutiny, exposing backend systems.
3. **Training Data Poisoning** - Introducing malicious data to compromise model integrity.
4. **Model Denial of Service** - Attacks cause resource-heavy operations on LLMs, leading to service degradation or high costs.
5. **Supply Chain Vulnerabilities** - Risks within LLM supply chains and dependencies.
6. **Sensitive Information Disclosure** - LLMs may inadvertently reveal confidential data in their responses, leading to unauthorized data access, privacy violations, and security breaches.
7. **Insecure Plugin Design** - LLM plugins can have insecure inputs and insufficient access control.
8. **Excessive Agency** - LLM-based systems may undertake actions leading to unintended consequences.
9. **Overreliance** - System or people overly depending on LLMs without oversight may face misinformation, miscommunication, legal issues, and security vulnerabilities due to incorrect or inappropriate content generated by LLMs.
10. **Model Theft** - This involves unauthorized access, copying, or exfiltration of proprietary LLM models.
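If you want to reproduce this setup, an entirely optional way to write that file from the notebook is something like the following, where owasp_text is just the list above pasted into a string.
import os

owasp_text = """1. Prompt Injection - Manipulating input prompts to alter LLM behavior unexpectedly.
(paste the rest of the list above here)
"""

os.makedirs("data", exist_ok=True)            # create ./data if it doesn't exist
with open("data/owasptop10.txt", "w") as f:   # write the list out as plain text
    f.write(owasp_text)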
For starters I'm using a Jupyter notebook; you can use your own method if you want everything in one place, or stick with Jupyter for the control it gives you over each cell.
For installation, it's noted that a few packages are needed:
%pip install llama-index-core
%pip install llama-index-llms-openai
%pip install llama-index
Since we are using a paid API, this requires an OpenAI API key. The way you load it can differ, but I'm using dotenv.
import os
from dotenv import load_dotenv
import logging
# load our environment variables
load_dotenv()
# Grab our values
openai_api_key = os.getenv("OPENAI_API_KEY")
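One optional addition on my part: fail fast if the key wasn't found, so the later OpenAI calls don't fail with a less obvious error.
# Optional sanity check (my own addition, not from the quick start)
if not openai_api_key:
    raise ValueError("OPENAI_API_KEY was not found; check your .env file")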
Once this runs smoothly, we can move on to loading our documents. For simplicity you can use the layout shown below.
Directory Structure
├── llama-index.ipynb
└── data
└── owasptop10.txt
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)
This imports VectorStoreIndex and SimpleDirectoryReader, loads every file in the data folder, and builds a vector index from those documents.
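If you want more control over how the documents are split before they're embedded, LlamaIndex exposes a global Settings object; the chunk values below are purely illustrative and worth tuning for your own documents.
from llama_index.core import Settings

Settings.chunk_size = 512     # illustrative value, tune for your documents
Settings.chunk_overlap = 50   # overlap between chunks, also illustrative
index = VectorStoreIndex.from_documents(documents)  # rebuild with the new settings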
With the index built, we can query our data by running the following code block.
query_engine = index.as_query_engine() #Initialize the query engine
response = query_engine.query("What is prompt injection?") #Query the index
print(response)
You can also change the question, for example to "What is model theft?".
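You can also tune how much context is retrieved per question; for example, as_query_engine accepts a similarity_top_k argument (the value of 3 here is just an example).
# Retrieve a few more chunks of context per query (3 is an arbitrary example)
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("What is model theft?")
print(response)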
To see what the queries and events are doing under the hood, the documentation points to the code block shown below.
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
I've set the logging level to INFO; if you require more verbose output you can change this to DEBUG. It's important to note that if we want to persist the index, we can either accept the default location (./storage) or explicitly set PERSIST_DIR as shown in the code block. Both examples are shown to help you decide which approach you want to use.
# Example 1
index.storage_context.persist()
# Example 2
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage,
)
# check if storage already exists
PERSIST_DIR = "./storage"
if not os.path.exists(PERSIST_DIR):
    # load the documents and create the index
    documents = SimpleDirectoryReader("data").load_data()
    index = VectorStoreIndex.from_documents(documents)
    # store it for later
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    # load the existing index
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)
# Either way we can now query the index
query_engine = index.as_query_engine()
logging.info("Query engine ready to go!")
response = query_engine.query("What is excessive agency?")
logging.info(f'Query executed. Response: {response}')
print(response)
We can now see the flow of operations: the logging statements mark where we create the query engine, where we query it, and when the response comes back.
After running this code, remember the PERSIST_DIR value: it creates the ./storage directory, and the files written there show how our documents have been vectorized and stored.
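If you're curious what was actually written, a quick listing of the persist directory shows the files LlamaIndex created to hold the document store, index metadata, and vectors.
# Peek at what persist() wrote to disk
for name in os.listdir(PERSIST_DIR):
    print(name)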
Summary
LlamaIndex simplifies data indexing through its framework and various integrations. I've built similar RAG components with other tools, but this makes it fairly turnkey. You could go further with LlamaIndex as they've already started working on LlamaAgents, which I've tested out with some simple workflows; I'll cover this in another post. Keep exploring frameworks that help you streamline development, but also keep in mind that the more you abstract these methods without understanding what you're using, the more you risk missing the underlying concepts. For many, embeddings are still a foggy area with plenty of misnomers about how they fit into RAG patterns for development; if that's you, consider checking the resources section. Microsoft Ignite is next week in North America and I'll likely have more to cover once that event and KubeCon conclude. Stay tuned and continue learning.
Resources
https://docs.llamaindex.ai/en/stable/getting_started/installation
https://docs.llamaindex.ai/en/stable/getting_started/concepts