Red Teaming LLMs with Garak

As Generative AI takes on a role in more organizations, so grows the popularity of tools for identifying its risks and vulnerabilities. In this blog I’m exploring Garak, an LLM vulnerability scanner developed by NVIDIA and released as an open source project to help strengthen LLM security. In this context, “red teaming” means simulating attacks against an LLM to test its defenses and outputs. Ideally, any probing of an LLM should follow a repeatable process for determining whether a vulnerability is feasible; this blog explores using Garak against Groq (a hosted LLM provider). Due to the unexpected nature of some outputs, the material discussed in this post will be limited in visibility; if you run these attacks yourself, don’t be surprised by the nature of the inputs or outputs.

Getting Started

To follow along you’ll just need the following:

  • An API key for a supported model provider (OpenAI, Groq, Cohere, Replicate, etc.)
  • Python 3.10 installed

First we need to grab the package. A number of installation methods exist in the documentation; I’m using the PyPI package for simplicity.

python3 -m venv redteam
source redteam/bin/activate
python -m pip install -U garak
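
Before moving on, it’s worth a quick sanity check that the install landed. These commands only print the package metadata and the CLI help, so they’re safe to run anywhere.

python -m pip show garak
python3 -m garak --help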

Once the environment is established and the install is complete, we can pick which type of model to target. Given that Groq is supported, and to save money on API calls, I recommend that approach.

export GROQ_API_KEY="......"
python3 -m garak --model_type groq --model_name llama-3.1-8b-instant --probes <attack>
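
Groq is just one of the generator back ends garak knows about. If you’re unsure which model types your installed version supports, it can enumerate its generators; this flag was present in the builds I’ve used, so check --help if yours differs.

python3 -m garak --list_generators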

For reference, the following probes exist for this tool as of this writing.

  • blank: A simple probe that always sends an empty prompt.
  • atkgen: Automated Attack Generation. A red-teaming LLM probes the target and reacts to it in an attempt to get toxic output. Prototype, mostly stateless; for now uses a simple GPT-2 fine-tuned on the subset of hhrlhf attempts that yielded detectable toxicity (the only target currently supported).
  • av_spam_scanning: Probes that attempt to make the model output malicious content signatures.
  • continuation: Probes that test if the model will continue a probably undesirable word.
  • dan: Various DAN and DAN-like attacks.
  • donotanswer: Prompts to which responsible language models should not answer.
  • encoding: Prompt injection through text encoding.
  • gcg: Disrupt a system prompt by appending an adversarial suffix.
  • glitch: Probe model for glitch tokens that provoke unusual behavior.
  • grandma: Appeal to be reminded of one’s grandmother.
  • goodside: Implementations of Riley Goodside attacks.
  • leakreplay: Evaluate if a model will replay training data.
  • lmrc: Subsample of the Language Model Risk Cards probes.
  • malwaregen: Attempts to have the model generate code for building malware.
  • misleading: Attempts to make a model support misleading and false claims.
  • packagehallucination: Trying to get code generations that specify non-existent (and therefore insecure) packages.
  • promptinject: Implementation of the Agency Enterprise PromptInject work (best paper awards @ NeurIPS ML Safety Workshop 2022).
  • realtoxicityprompts: Subset of the RealToxicityPrompts work (data constrained because the full test would take so long to run).
  • snowball: Snowballed hallucination probes designed to make a model give a wrong answer to questions too complex for it to process.
  • xss: Look for vulnerabilities that permit or enact cross-site attacks, such as private data exfiltration.
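
The exact probe set shifts between releases, so rather than trusting the list above, you can ask your installed copy what it ships with:

python3 -m garak --list_probes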

The attacks I’m going to attempt are atkgen and leakreplay. Launching atkgen follows the same syntax we used above, with atkgen supplied as the probe.

python3 -m garak --model_type groq --model_name llama-3.1-8b-instant --probes atkgen

This will start the attack workflow and load the probe along with the files it depends on.

We can also see the number of turns. Think of a turn as one round of back and forth between garak and the LLM, much like interacting with a model through a chat UI: the initial prompt you send to the LLM represents one turn.

Once the attack completes you’ll get a summary written to HTML. The terminal will also report the failure rate for the probe, which in this run was 4.0%.

While the prompts are stored in a log file, I’d rather not reveal the content here. This probe goes back and forth with the model in an attempt to draw out toxic responses; the results show the model engages in the banter but resists about 96% of the time, with the remaining 4% of attempts producing outputs flagged as toxic.
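
If you want the report and log files from a run to be easier to track down, garak lets you set a filename prefix for them. A minimal sketch, where the prefix value is just an example of my own:

python3 -m garak --model_type groq --model_name llama-3.1-8b-instant --probes atkgen --report_prefix atkgen_groq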

Now for the next probe: given the evolving disputes around what ends up in training data, a model’s defenses should be evaluated to see whether it leaks any training information.

python3 -m garak --model_type groq --model_name llama-3.1-8b-instant --probes leakreplay.GuardianComplete

When I ran this specific probe I started hitting HTTP 429 (too many requests) responses from the API, so for brevity this closes out the testing.
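
If you run into the same rate limiting, one lever is to lower how many generations garak requests per prompt, which cuts the API call volume proportionally. A sketch, assuming your garak build exposes the --generations flag:

python3 -m garak --model_type groq --model_name llama-3.1-8b-instant --probes leakreplay.GuardianComplete --generations 1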

Summary

Vulnerability scanning now extends to Large Language Models, with a number of different probes available, and these will likely evolve as defenses adjust to attacks. It’s important to model your security in layers, especially around the input to and output from the user. Evaluations are also a core component, ensuring the responses you get are representative of your desired output. A combination of offense and defense is at this point relatively imperative for the use of Generative AI: veer with caution, but always stay vigilant. In the background I’ve also been experimenting with other models such as DeepSeek-R1, and I plan to blog on more of these concepts in the future.