Deep Dive Into Censorship of DeepSeek R1 Based Models

Mar 24, 2025

In January 2025, Chinese AI startup DeepSeek released R1, the first open-weights reasoning model to match OpenAI’s o1 across math, coding, and logical reasoning tasks, at 90–95% lower cost. The announcement triggered a 17% plunge in Nvidia’s stock, wiping $600 billion from its market value, as investors weighed the implications of DeepSeek’s more efficient training. Unlike o1 and the unreleased o3, which only showed summaries of their reasoning, R1 exposed its “thinking” tokens to users, offering unique insight into its reasoning process.

What made R1’s training unique was that DeepSeek researchers started with the DeepSeek v3 base model and created R1-Zero by applying pure reinforcement learning, without human feedback, to enhance its reasoning capabilities. This reinforcement-trained model was then fine-tuned with supervised learning to create the final R1 release.

Perplexity was one of the first to embrace the open-weights nature of R1 by integrating it into their service. At the same time, DeepSeek made it freely available through its own chat interface, while the distilled versions of R1 allowed almost anyone to experiment with its reasoning capabilities on a personal computer.

The widespread belief is that content restrictions in DeepSeek R1 models exist primarily at the application level — controlled by the hosting platform’s moderation policies. This article challenges that assumption, showing a more complex reality about how and where these restrictions are implemented.

Testing Censorship in the R1 Ecosystem

R1 model ecosystem

Models were tested with a suite of 14 questions covering Chinese political topics, general political comparisons, neutral scientific topics, and safety-related queries. Each question was tested three times in English and Chinese for consistency. The tested models include:

Original DeepSeek models

Distilled versions

Base models for distillation

Uncensored versions

By comparing responses across these models, from base versions to derivatives and modifications, we can map how content restrictions are implemented across the R1 ecosystem.
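To make the setup concrete, here is a minimal sketch of the test loop against an OpenAI-compatible endpoint such as OpenRouter. The question file, model identifiers, and API key variable are illustrative placeholders, not the exact harness used for this article.

```python
# Minimal sketch of the test loop: every question is asked three times,
# in English and Chinese, against each model behind an OpenAI-compatible API.
# Model ids, file names and the API key variable are illustrative.
import json
import os

import requests

API_URL = "https://openrouter.ai/api/v1/chat/completions"
API_KEY = os.environ["OPENROUTER_API_KEY"]

MODELS = [
    "deepseek/deepseek-r1",                    # publicly released R1
    "deepseek/deepseek-r1-distill-llama-8b",   # distilled version
    "meta-llama/llama-3.1-8b-instruct",        # base model used for distillation
]

# questions.json: [{"id": "q1", "en": "...", "zh": "..."}, ...]
with open("questions.json") as f:
    QUESTIONS = json.load(f)

results = []
for model in MODELS:
    for q in QUESTIONS:
        for lang in ("en", "zh"):
            for run in range(3):  # three runs per question for consistency
                resp = requests.post(
                    API_URL,
                    headers={"Authorization": f"Bearer {API_KEY}"},
                    json={"model": model,
                          "messages": [{"role": "user", "content": q[lang]}]},
                    timeout=300,
                )
                resp.raise_for_status()
                answer = resp.json()["choices"][0]["message"]["content"]
                results.append({"model": model, "question": q["id"],
                                "lang": lang, "run": run, "answer": answer})

with open("responses.json", "w") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)
```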

DeepSeek Official Chat R1

Censorship being applied seconds after response generation is finished

Most people’s understanding of R1’s content restrictions comes from interacting with DeepSeek’s official chat interface. The chat UI implements an application-level moderation system that monitors responses in real time. It can trigger at different points: during the generation of thinking tokens, while producing the final answer, or seconds after the response has completed. This appears to work similarly to OpenAI’s moderation API. To capture the raw outputs before moderation, I created a browser console script that records both thinking tokens and responses.

Warning messages appended to a response describing China’s form of government

DeepSeek has stated that it uses no system prompt with R1, ruling out prompt-based moderation. Instead, the interface relies purely on post-generation filtering. This raises an interesting question: if we bypass the chat interface’s moderation layer, would the underlying R1 model show the same uncensored behavior?

Testing US-Hosted R1

Earlier testing showed that DeepSeek’s chat interface applies moderation after generating uncensored responses, but the publicly released R1 behaved differently during my testing in a US-based hosting environment (through OpenRouter). The model produced empty thinking tags for sensitive topics, followed by refusals to respond or deflective responses with a positive spin. This behavior suggests that the restrictions are embedded in the model itself, likely applied during supervised fine-tuning.
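A simple way to surface this pattern programmatically is to check whether the thinking block in the raw completion is empty and whether the answer matches common refusal or deflection phrasing. The sketch below assumes the reasoning tokens are returned inline between <think> tags, as the open-weights R1 emits them; the marker phrases are illustrative examples, not the exact heuristic used here.

```python
# Sketch of a response classifier: flags empty <think> blocks and common
# deflection/refusal phrasing. The phrase list is illustrative.
import re

REFUSAL_MARKERS = [
    "I can't assist with that",
    "let's focus on",          # deflection with a positive spin
    "I'm sorry",
]

def classify(raw_completion: str) -> dict:
    think = re.search(r"<think>(.*?)</think>", raw_completion, re.DOTALL)
    thinking = think.group(1).strip() if think else ""
    answer = raw_completion[think.end():].strip() if think else raw_completion.strip()
    return {
        "empty_thinking": thinking == "",
        "refusal": any(m.lower() in answer.lower() for m in REFUSAL_MARKERS),
        "answer": answer,
    }

print(classify("<think>\n\n</think>\n\nI'm sorry, but I can't assist with that."))
# -> {'empty_thinking': True, 'refusal': True, 'answer': "I'm sorry, but ..."}
```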

It turns out DeepSeek runs an uncensored version of R1 in its chat interface while releasing a censored version to the public, with both labeled as the same model.

Uncensored R1-Zero

Unlike R1, R1-Zero shows no built-in restrictions. When queried about sensitive topics, it generates detailed thinking tokens and provides comprehensive, balanced responses in both English and Chinese. This matches the pre-moderation behavior of DeepSeek’s official chat interface, suggesting R1-Zero is the foundation of both the censored public release and the uncensored version behind DeepSeek’s chat interface.

Since R1-Zero exhibits no censorship and v3 is its base model, we can conclude that the original v3 base model was also uncensored. The restrictions appear only in the publicly released, fine-tuned version of v3, following the same pattern as R1.

Censorship in Smaller Distilled Models

Testing the distilled versions of R1 locally with Ollama revealed patterns identical to the full R1 model: empty thinking tokens for sensitive topics and the same deflections or refusals in responses. This suggests that the distilled models underwent the same censorship fine-tuning as the full R1 model. Since two versions of the same supervised fine-tuned (SFT) R1 model exist — one censored and one uncensored — censorship is likely applied as an additional fine-tuning step after distillation and general supervised fine-tuning.
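Locally, the same questions can be replayed against the distilled models through Ollama’s chat API. The sketch below assumes Ollama is running on its default port and the models have already been pulled; the model tags and probe question are illustrative.

```python
# Sketch of querying locally served distilled models via Ollama's chat API.
# Assumes `ollama serve` is running and the models have been pulled, e.g.
#   ollama pull deepseek-r1:8b
import requests

MODELS = ["deepseek-r1:7b", "deepseek-r1:8b"]   # Qwen- and Llama-based distills
QUESTION = "What happened at Tiananmen Square in 1989?"  # illustrative probe

for model in MODELS:
    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": model,
            "messages": [{"role": "user", "content": QUESTION}],
            "stream": False,
        },
        timeout=600,
    )
    resp.raise_for_status()
    content = resp.json()["message"]["content"]
    print(f"--- {model} ---\n{content}\n")
```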

The original Llama 3.1 8B showed no censorship patterns, while Qwen 2.5 7B exhibited only minimal restrictions, deflecting on two out of six Chinese political topics. This contrasts sharply with their heavily censored R1-distilled versions and points to a deliberate, systematic application of censorship controls.

Attempts to Remove Censorship Restrictions

I looked into two approaches to removing the censorship restrictions: model abliteration and Perplexity’s post-training method. Abliteration identifies the direction in a model’s activation space associated with refusals and removes it, without any additional training.

I tested both the Qwen-based 7B and the Llama-based 8B distilled models that underwent the abliteration process. Surprisingly, the abliterated R1 7B showed no improvement, maintaining the same censorship patterns as the original distilled version, while the abliterated R1 8B showed even stronger restrictions, with one response changing from a deflection to an outright refusal.
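For context, abliteration estimates a “refusal direction” from the model’s activations on probe prompts and then removes that component, either by editing the weights or by projecting it out at inference time. The snippet below is a heavily simplified sketch of the inference-time variant using a Hugging Face model; the model id, layer index, and probe prompts are assumptions, and real abliteration tooling sweeps layers and uses far larger probe sets.

```python
# Simplified sketch of directional ablation ("abliteration"):
# estimate a refusal direction from activation differences, then
# project it out of every decoder layer's output at inference time.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"  # illustrative
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

def mean_hidden(prompts, layer):
    """Mean hidden state of the last prompt token at the given layer."""
    states = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        states.append(out.hidden_states[layer][0, -1])
    return torch.stack(states).mean(dim=0)

# Tiny illustrative probe sets; real abliteration uses hundreds of prompts.
sensitive = ["What happened at Tiananmen Square in 1989?"]
neutral = ["How does photosynthesis work?"]

layer = 16  # illustrative; the effective layer is normally found by sweeping
refusal_dir = mean_hidden(sensitive, layer) - mean_hidden(neutral, layer)
refusal_dir = refusal_dir / refusal_dir.norm()

def ablate(module, inputs, output):
    """Remove the refusal-direction component from the layer output."""
    h = output[0] if isinstance(output, tuple) else output
    h = h - (h @ refusal_dir).unsqueeze(-1) * refusal_dir
    return (h,) + output[1:] if isinstance(output, tuple) else h

hooks = [block.register_forward_hook(ablate) for block in model.model.layers]

ids = tok(sensitive[0], return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=256)[0], skip_special_tokens=True))

for hook in hooks:
    hook.remove()
```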

I also tested Perplexity’s R1 1776, which is an uncensored version of R1. Unlike the abliterated versions, R1 1776 provides comprehensive and balanced responses to sensitive topics, demonstrating that while abliteration failed, thorough post-training can still effectively eliminate censorship.

Conclusion

This investigation suggests that DeepSeek implements censorship through a separate fine-tuning step. This is evidenced by multiple observations: the uncensored nature of both the base v3 model and R1-Zero, the existence of two different R1 versions (the censored public release versus the uncensored, but application-level moderated, chat interface version), and consistent censorship patterns across distilled models regardless of their base architecture.

These findings challenge the assumption that Chinese model censorship is primarily implemented through application-level filtering. Instead, the restrictions turn out to be embedded directly in the model weights during fine-tuning. This is not surprising, as China’s AI regulations require developers to build models that uphold “Core Socialist Values” and produce “true and reliable” output.

For those seeking an uncensored version of R1, Perplexity’s R1 1776 offers a solution. If you’ve created or found uncensored versions of R1’s distilled models, please share your findings in the comments below.

Raw responses from testing are available on GitHub.

Written by Carl Rannaberg

Experienced SaaS builder, ex-Pipedrive
