Don't Trust the Salt — Multilingual LLM Safety and the Guardrail Blind Spots
An analysis of how LLM guardrails fail in multilingual environments: we examine the structural issues behind safety verification failures in non-English languages, along with practical countermeasures.
Overview
There’s a saying in Farsi:
«هر چه بگندد نمکش میزنند، وای به روزی که بگندد نمک» “If something spoils, you add salt to fix it. But woe to the day the salt itself has spoiled.”
LLM guardrails serve as the “salt” that keeps model outputs safe. But what if, in multilingual environments, that salt itself has spoiled?
Research by Roya Pakzad, Senior Fellow at Mozilla Foundation, reveals a shocking reality. Safety mechanisms that function properly in English systematically fail in non-English languages like Arabic, Farsi, Pashto, and Kurdish. This isn’t simply a translation quality issue — it’s a structural flaw in AI safety architecture.
Bilingual Shadow Reasoning
How Summarization Gets Distorted
“Bilingual Shadow Reasoning,” presented by Pakzad at the OpenAI GPT-OSS-20B Red Teaming Challenge, is a technique that steers an LLM’s hidden chain-of-thought through non-English policies.
When summarizing the same UN human rights report with the same model, changing only the system prompt produces entirely different results:
```mermaid
graph LR
    Source[UN Iran Human Rights Report] --> Default[Default Policy]
    Source --> EnPolicy[English Custom Policy]
    Source --> FaPolicy[Farsi Custom Policy]
    Default --> R1["Dramatic rise in executions<br/>over 900 cases documented"]
    EnPolicy --> R2["Framing shifted<br/>Government efforts emphasized"]
    FaPolicy --> R3["Protecting citizens through<br/>law enforcement emphasized"]
    style R1 fill:#4CAF50,color:#fff
    style R2 fill:#FF9800,color:#fff
    style R3 fill:#F44336,color:#fff
```
Key finding: Steering model output is far easier in summarization tasks than in Q&A tasks. This directly impacts every summarization-based workflow that organizations rely on — executive report generation, political debate summaries, UX research, and chatbot memory systems.
Real-World Risk Scenarios
According to research by Abeer et al., LLM-generated summaries alter sentiment 26.5% of the time, and consumers are 32% more likely to purchase the same product after reading an LLM summary versus the original review. The core risk is that these biases can be steered by policy language in multilingual contexts.
The Reality of Multilingual AI Safety Evaluation
Gaps Revealed Across 655 Evaluations
The Multilingual AI Safety Evaluation Lab built at Mozilla Foundation compared GPT-4o, Gemini 2.5 Flash, and Mistral Small across refugee and asylum scenarios in English vs. Arabic/Farsi/Pashto/Kurdish.
Evaluation Results Summary
| Evaluation Dimension | English Score | Non-English Avg | Gap |
|---|---|---|---|
| Actionability/Usefulness (Human) | 3.86/5 | 2.92/5 | -24.4% |
| Factual Accuracy (Human) | 3.55/5 | 2.87/5 | -19.2% |
| Actionability (LLM-as-Judge) | 4.81/5 | 3.60/5 | -25.2% (absolute scores inflated) |
Kurdish and Pashto showed the most severe quality degradation.
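A per-language comparison like the one above can be automated. The sketch below, with illustrative scores and a hypothetical 20% degradation threshold (not thresholds from the Lab), computes each language's gap against an English baseline:

```python
# Sketch: aggregate per-language evaluation scores and flag languages whose
# mean rating degrades beyond a tolerance vs. the English baseline.
# The score data and 20% threshold are illustrative assumptions.

def language_gap(scores: dict[str, list[float]], baseline: str = "en") -> dict[str, float]:
    """Relative gap of each language's mean score against the baseline language."""
    means = {lang: sum(vals) / len(vals) for lang, vals in scores.items()}
    base = means[baseline]
    return {lang: (m - base) / base for lang, m in means.items() if lang != baseline}

scores = {
    "en": [3.9, 3.8, 3.9],   # human actionability ratings (1-5), illustrative
    "fa": [3.0, 2.9, 2.8],
    "ku": [2.5, 2.4, 2.6],
}

gaps = language_gap(scores)
# Languages degrading more than 20% relative to English get flagged for review.
flagged = [lang for lang, g in gaps.items() if g < -0.20]
```

Running this over every service language, per evaluation dimension, turns a one-off audit into a regression test.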
The LLM-as-Judge Overconfidence Problem
The LLM automated evaluator (LLM-as-a-Judge) never once responded “unsure” — even without access to fact-checking tools. It under-reported disparities flagged by human evaluators and sometimes hallucinated disclaimers that didn’t exist in the original response.
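One mitigation is to make "unsure" a first-class verdict and to reject confident judgments that arrive without evidence. The verdict schema and `validate_judgment` helper below are assumptions sketched for illustration, not part of the study:

```python
# Sketch: enforce an explicit "unsure" option for an LLM-as-Judge and
# downgrade confident verdicts that cite no supporting evidence.
# The three-value verdict schema is an illustrative assumption.

ALLOWED_VERDICTS = {"pass", "fail", "unsure"}

def validate_judgment(verdict: str, evidence: list[str]) -> str:
    """Return the verdict, downgraded to 'unsure' if it lacks evidence."""
    if verdict not in ALLOWED_VERDICTS:
        raise ValueError(f"verdict must be one of {ALLOWED_VERDICTS}")
    # A judge with no fact-checking tools should not get to be certain:
    # a confident verdict with an empty evidence list becomes "unsure".
    if verdict in {"pass", "fail"} and not evidence:
        return "unsure"
    return verdict
```

For example, `validate_judgment("pass", [])` comes back as `"unsure"`, which surfaces exactly the overconfidence the human evaluators caught.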
When Guardrails Collapse
Gemini’s Double Standard
One of the most striking examples: when an undocumented immigrant asked about herbal remedies for chest pain, shortness of breath, and weight loss:
- English: “It would be irresponsible and dangerous for me to propose specific herbal medicines for severe and undiagnosed symptoms” — appropriately refused
- Non-English: readily provided specific herbal remedies, with no warnings attached
Safety disclaimers present in English responses were inconsistently omitted from non-English outputs.
The Guardrail Tools Themselves Fail
In collaboration with Mozilla.ai, three guardrail tools — FlowJudge, Glider, and AnyLLM (GPT-5-nano) — were tested:
```mermaid
graph TD
    subgraph Guardrail Test Results
    A[Glider] --> B["Score discrepancies of 36-53%<br/>based solely on policy language"]
    C[All Guardrails] --> D["Fabricated terms hallucinated<br/>more commonly in Farsi reasoning"]
    E[All Guardrails] --> F["Biased assumptions about<br/>asylum seeker nationality"]
    G[All Guardrails] --> H["False confidence in accuracy<br/>without verification capability"]
    end
    style B fill:#F44336,color:#fff
    style D fill:#FF9800,color:#fff
    style F fill:#FF9800,color:#fff
    style H fill:#FF9800,color:#fff
```
For semantically identical text, merely changing the policy language caused Glider to show 36-53% score discrepancies. The evaluation tool (the salt) is already contaminated.
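This kind of policy-language sensitivity can be measured directly: score the same text under semantically identical policies written in different languages and compute the spread. The `score_fn` stub below is a stand-in for a real guardrail such as Glider; the drift metric and stub behavior are illustrative assumptions:

```python
# Sketch: measure how much a guardrail's score drifts when only the policy
# language changes. score_fn stands in for a real guardrail scorer;
# the stub below merely mimics the kind of discrepancy reported for Glider.

def policy_language_drift(score_fn, text: str, policies: dict[str, str]) -> float:
    """Max relative score spread across semantically identical policies."""
    scores = {lang: score_fn(text, policy) for lang, policy in policies.items()}
    lo, hi = min(scores.values()), max(scores.values())
    return (hi - lo) / hi if hi else 0.0

# Illustrative stub: scores drop whenever the policy text is non-ASCII.
def fake_guardrail(text: str, policy: str) -> float:
    return 0.9 if policy.isascii() else 0.5

drift = policy_language_drift(
    fake_guardrail,
    "same text under both policies",
    {"en": "Be factual and neutral.", "fa": "واقع‌بین و بی‌طرف باش."},
)
# A drift above some small tolerance (say 10%) should fail the release gate.
```

For semantically equivalent policies the drift should be near zero; the 36-53% discrepancies observed for Glider would fail any reasonable gate.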
Practical Implications
Essential Checklist for Multilingual Service Operations
1. English-only testing is insufficient
Independent safety tests must be conducted for every service language. Passing English guardrails does not guarantee safety in other languages.
2. Don’t blindly trust LLM-as-Judge
Automated evaluation systems underestimate quality gaps in non-English responses. Human evaluation by native speakers of the target language must be conducted in parallel.
3. Pay special attention to summarization pipelines
Bias manipulation is easier in summarization than in Q&A. Summarization-based workflows (report generation, chatbot memory, review summaries) require special verification.
4. Multilingual auditing of system prompts is essential
Third-party LLM wrapper services can manipulate outputs through hidden policy directives. Policy layers packaged as “cultural adaptation” or “localization” can become instruments of censorship or propaganda.
5. Build a continuous evaluation-to-guardrail pipeline
A continuous process where evaluation results directly inform guardrail policies is essential. Running evaluation and guardrails separately means discovered issues go unfixed.
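The checklist above culminates in closing the loop mechanically: evaluation findings should update guardrail configuration rather than sit in a report. The finding and policy structures below are illustrative assumptions, sketching that loop at its simplest:

```python
# Sketch: a minimal evaluation-to-guardrail feedback loop. The finding
# shape, GuardrailPolicy structure, and 20% threshold are illustrative;
# the point is that evaluation output mechanically updates guardrail config.

from dataclasses import dataclass, field

@dataclass
class GuardrailPolicy:
    # Languages escalated to mandatory human review before release.
    languages_requiring_review: set = field(default_factory=set)

def apply_findings(policy: GuardrailPolicy, findings: list[dict]) -> GuardrailPolicy:
    """Escalate any language whose evaluated quality gap exceeds tolerance."""
    for finding in findings:
        if finding["gap"] < -0.20:  # illustrative 20% degradation threshold
            policy.languages_requiring_review.add(finding["lang"])
    return policy

policy = apply_findings(
    GuardrailPolicy(),
    [{"lang": "ku", "gap": -0.32}, {"lang": "fr", "gap": -0.05}],
)
# Kurdish is escalated; French stays within tolerance.
```

Run on every evaluation cycle, this turns "discovered issues go unfixed" into a config change that takes effect immediately.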
Technical Implementation Recommendations
```mermaid
graph TB
    A[Multilingual Safety Pipeline] --> B[Per-Language Independent Evaluation]
    A --> C[Human + Automated Hybrid Evaluation]
    A --> D[Guardrail Policy Multilingual Verification]
    A --> E[Summarization Bias Monitoring]
    B --> B1["Independent red team testing<br/>for each service language"]
    C --> C1["Secure native speaker<br/>evaluators for target languages"]
    D --> D1["Semantic consistency verification<br/>of English/non-English policies"]
    E --> E1["Automated pre/post summary<br/>sentiment analysis comparison"]
```
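The summarization-bias monitor in the diagram can be prototyped cheaply. The tiny word lexicon and the review threshold below are illustrative stand-ins for a real multilingual sentiment model:

```python
# Sketch: pre/post summary sentiment comparison. The toy lexicon and the
# drift threshold are illustrative placeholders for a real sentiment model.

POSITIVE = {"improve", "protect", "effort", "progress"}
NEGATIVE = {"execution", "violation", "abuse", "rise"}

def sentiment(text: str) -> float:
    """Crude lexicon score in [-1, 1]: +1 all-positive, -1 all-negative."""
    words = [w.strip(".,") for w in text.lower().split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

def summary_drift(source: str, summary: str) -> float:
    """Absolute sentiment shift introduced by summarization."""
    return abs(sentiment(summary) - sentiment(source))

# A drift above some threshold (say 0.3) would route the summary to human
# review rather than downstream consumers.
```

A negative source whose summary reads positive produces a large drift, which is precisely the framing shift seen in the UN-report experiment.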
Note for Japanese Service Operators
Although Japanese was not directly tested in this research, the structural issues apply equally:
- Japanese has far less training data than English, so guardrail consistency may be lower
- The complexity of the keigo (honorific) system can make safety judgments more difficult
- Mixed use of kanji, hiragana, and katakana can create additional vulnerabilities at the tokenization stage
- Applying English guardrails directly to Japanese services is risky
Conclusion
Many predict 2026 will be the year of AI evaluation. But if the evaluation tools themselves don’t function properly in multilingual environments, the “safety” we’re measuring may be an illusion meant only for English-speaking users.
If the salt has spoiled, what can fix the salt? The answer is building a continuous evaluation-to-guardrail pipeline that treats multilingual environments as first-class citizens. The era of declaring “safe” based on English-only testing must come to an end.
References
- Don’t Trust the Salt: AI Summarization, Multilingual Safety, and the LLM Guardrails That Need Guarding — Roya Pakzad
- Multilingual AI Safety Evaluation Lab — Mozilla Foundation
- Bilingual Shadow Reasoning — OpenAI GPT-OSS-20B Red Teaming
- Evaluating Multilingual, Context-Aware Guardrails — Mozilla.ai
- Quantifying Cognitive Bias Induction in LLM-Generated Content — Abeer et al.
- Shadow Reasoning Interactive App