Industry Trends

LLM Safety and Security: How to Select the Best LLM via Red Teaming

Selection factors include LLM performance, safety, and use case.
November 22, 2024

LLM and Safety Overview

Large language models (LLMs) serve as the fundamental building blocks of AI applications, playing a pivotal role in powering various AI functionalities through their remarkable ability to comprehend and generate human-like text. These models encompass a wide range of applications, including natural language understanding, content generation, conversational agents, information retrieval, translation and language services, personalization, learning and tutoring, and data analysis.

 

By harnessing the potential of LLMs, developers can create more intuitive, responsive, and versatile AI applications that significantly enhance productivity and improve user interactions.

 

However, the current landscape of LLMs presents a significant challenge in terms of safety and performance. Moreover, the availability of objective, third-party testing evaluations on such model information remains scarce and difficult to obtain.

 

Learnings from the Field: Multiple LLMs to Start With

Hundreds of conversations with AI leaders, CIOs, CTOs, model risk and governance teams, and other business executives revealed that they are initially deploying multiple LLMs across their AI stack, prioritizing specific use cases and exposure. When selecting an LLM, they emphasized three key factors: performance, risk, and cost. Naturally, they aim for an LLM that excels in all three areas, but often, compromises must be made in at least one domain.

 

Real-world Example: Insurance Company Selection Process for LLMs

Let’s look at an insurance company who wants an AI-powered Chatbot that provides safe, compliant, and reliable answers to their customers.  

 

In this example, their LLM selection process involved examining an open source LLM (Llama-3.2-3B-Instruct) and a proprietary LLM (gpt-4o-mini).

 

Open Source LLM Investigation: Llama-3.2-3B-Instruct

When the insurance company looked up the open source LLM on our leaderboard, here’s what they saw (also refer to figure 1 below):

  • Overall ranking: 39 (out of a total of 116 LLMs) – not bad.  
  • Weakness: Vulnerable to implicit biases and malware.
  • Strengths: Impressively low toxicity, even when dealing with challenging prompts.

 

The pros and cons for this open source LLM (Llama-3.2-3B-Instruct) include:

Responsive Table Fix
Pros Cons
Control: As an open-source LLM, you own it, you can fine-tune it, and you can host it inside your own environment. Infrastructure: You must manage the infrastructure, which takes resources and money.
Updates: You also own all the updates, so they will align with your future needs. Updates: No vendor updates are provided, so the technology may get old, given how fast AI is moving.

Figure 1: Safety and security information via Enkrypt AI’s Leaderboard for open source LLM: Llama-3.2-3B-Instruct.

Proprietary LLM Investigation: gpt-4o-mini

When the insurance company looked up the proprietary LLM on our leaderboard, here’s what they saw (also refer to figure 2 below):

  • Overall ranking: 72 (out of a total of 116 LLMs) – rather dismal compared to all other choices.  
  • Weakness: Insanely high bias, particularly in implicit sentence structures. It's like bias on turbo mode.
  • Strengths: Remarkably low toxicity, keeping the chat clean and respectful under pressure.

 

The pros and cons for this proprietary LLM (gpt-4o-mini) include: 

Pros Cons
Infrastructure: You don’t have to manage the infrastructure, so you save resources and money. Less Control: You can’t own proprietary LLMs. And you can’t host them inside your own environment.
Updates:The technology is automatically updated, which is critical, given the speed of AI development. Updates: You don’t control vendor updates, so they may not be aligned with your future needs.

Figure 2: Safety and security information via Enkrypt AI’s Leaderboard for propriety LLM: gpt-40-mini.

After looking at each LLM in detail, the insurance company can make a side-by-side comparison of the 2 LLMs for further consideration. Both models performed well with toxicity, but Llama-3.2-3B-Instruct outperformed gpt-4o-mini for both Bias and Jailbreak. The only category that Gpt-4o-mini did better in was Malware. See figure 3 below.

 

Figure 3: Comparing the gpt-40-mini model with the Llama-3.2-3B-Instruct model. The later tested better for overall safety and security.

 

Results – Which LLM was Chosen?

Based on the safety and security results seen from the Leaderboard, and the fact that our insurance customer preferred an LLM where they have more control of their AI stack, they went with Llama-3.2-3B-Instruct.

 

How Enkrypt AI Measures LLM Safety and Security

Enkrypt AI’s LLM Safety and Security Leaderboard was created by conducting automated Red Teaming testing on the most popular LLMs (over 116 and counting). We provide these results in real-time and for free, as you can see in the link above.

 

Such a comprehensive evaluation empowers your team to align LLM performance and safety with expectations, ensuring results are accurate, reliable, and socially responsible.

Be confident in your LLM’s performance, safety, and security with our automatic Red Teaming technology. Assess Safety v Performance
Don’t let your model give customers false information. Get a performance v risk score.
Assess Safety v Risk Scores
Evaluate your model in real-world scenarios. Protect your business from risk.
Compare & Select the Best LLM
Select the optimal LLM for your application by comparing several LLMs at once.

We capture the right metrics for your GenAI applications by categorizing risk in these 4 major categories and numerous subcategories as you see below.

Responsive Table
Jailbreaking
(security)
Malware
(security)
Bias
(safety)
Toxicity
(safety)
Sub-categories include Criminal, hate speech, self-harm, substances, sexual, guns Sub-categories include Top Level, Sub-functions, Evasion, Payload, EICAR, GTUBE, GTphish Sub-categories include race, gender, religion, health Sub-categories include threat, insult, severe, profanity, sexual, flirtation, identity attack

LLM Testing Methodology

You can replace months of manual Red Testing along with hours of auditors reviewing regulations with Enkrypt AI’s Red Teaming solution. The testing methodology can be explained in 4 easy steps and takes only 4 hours.  

1 2 3 4
Automated Testing
We start with 300+ attack categories to test each LLM for the risks listed in the table above.
We have our own SaaS application to test each LLM.
Automated, Customized Testing
Based on initial results, the platform creates additional tests to further assess risk.
The prompts are also customized to policies to go deeper into industry compliance regulations.
Performance Testing
With performance testing, we look for Hallucinations as well as accuracy.
Overall Rating and Risk Scores
Based on all the tests performed, we create an overall risk score.

Conclusion

The significance of automated and continuous red teaming cannot be overstated when it comes to LLM selection. It serves as a first line of defense against hidden vulnerabilities, a crucial check on complacency, and a roadmap for ongoing improvement. Red teaming goes beyond merely identifying flaws; it cultivates a security-first mindset that should be ingrained in every phase of AI development and deployment.

 

For organizations, the message is clear: prioritize AI security now, before a high-profile breach or regulatory pressure forces action. Incorporate strong security practices, including thorough red teaming, into your AI development workflows from the very beginning. Invest in the tools, talent, and processes needed to stay ahead of evolving threats.

 

We are proud to play a role in advancing AI with responsible data that underpins ethical training, testing, mitigation, and monitoring practices.

Erin Swanson