Introducing Safety Aligned DeepSeek R1 Model by Enkrypt AI

Published on

January 31, 2025

DeepSeek R1 has made waves in the AI industry, delivering high performance at a fraction of the training cost compared to existing LLMs. This marks a significant leap forward, especially for organizations struggling to justify the ROI of AI adoption. While the model excels in performance benchmarks, our red teaming uncovered critical security vulnerabilities that make it unsuitable for various use cases.

‍

In our latest breakthrough, we leveraged SAGE [1], our state-of-the-art safety alignment data generation technique, to strengthen a distilled version model’s defenses against prompt injection and the generation of toxic or harmful content. Overall risk of the model was reduced by 47%. More details are available in the results section.

‍

These advancements ensure that AI models like DeepSeek R1 can be both high-performing and safe for real-world deployment. The safety aligned deepseek-llama8b-model is available on Huggingface [2] for the community.

‍

How We Did It?

‍

Using Enkrypt AI Red Teaming, we identified vulnerabilities in the model and established baseline risk scores. These insights were then leveraged to generate a targeted safety alignment dataset—a crucial step in training the LLM to “say no” to unsafe or unethical queries. Our alignment data generation algorithm SAGE [1] is a taxonomy-driven synthetic data generation process that produces 51K in-depth prompts across 1,500+ harmfulness categories, enabling robust LLM safety training while maintaining benchmark performance. More advanced readers can read our research paper on SAGE [1] - our technique for safety alignment data generation.

‍

The Results

‍

Comparing AI risk before and after Enkrypt AI Safety Alignment_deepseek_r1model — Comparing AI risk before and after Enkrypt AI Safety Alignment

‍

Enkrypt Aligned DeepSeek-R1-Distill-Llama-8B showed a substantial decrease in risk after alignment. Toxicity of the model reduced by 57% while insecure code generation risk reduced by 77%. Risk of producing Harmful information like self harm, criminal planning or hate speech reduced by 99%. Risk of producing CBRN reduced by 69%. Overall Risk as defined by NIST framework decreased by 47%.

‍

The alignment process also led to a slight increase in performance where the MMLU pro score of the model increased from 44.71 to 46.43.

‍

To contribute to the AI community, we’ve shared the aligned DeepSeek R1 model on Hugging Face[2], ensuring accessible safety improvements for researchers and developers.

‍

Comparison of Aligned Model with other LLMs

‍

Comparing AI risk for Enkrypt aligned Model with other large language models.

‍

In our DeepSeek R1 red teaming report, we compared the model with gpt-4o, o1 and claude-3-opus[3]. The alignment performed on DeepSeek-R1-Distill-Llama-8B has increased the rank of DeepSeek R1 on our Safety Leaderboard from 69 to 12 which makes it safer than gpt-4o, o1-mini and claude-3-haiku. The Overall Risk of the Aligned model is almost similar to o1 with just a difference of 1%. Check our Safety Leaderboard for how aligned DeepSeek model compares against others [5].

‍

For real world usage of the aligned DeepSeek R1 model, it can be paired up with Enkrypt AI Guardrails which can detect and block 99% of attacks, delivering one of the industry's best combinations of performance, cost efficiency, and safety. We are continuously working to make the model even safer by reducing bias and censorship.

‍

A Callout to Model Providers

‍

At Enkrypt AI, we’ve successfully reduced AI safety risks by up to 70% while preserving model performance. We invite other model providers to collaborate with us in aligning AI for safer deployment. If you’re interested in fortifying your models against security vulnerabilities and bias, let’s talk.

‍