Cross-National Audit
What’s the relationship between media freedom and pro-government valence across 37 countries?
In countries with less press freedom, LLMs give more favorable responses when prompted in the target language than when prompted in English. Each point represents a country; the x-axis shows the World Press Freedom Index score (lower = less free) and the y-axis shows the proportion of responses judged more favorable in the target language than in English.
Note: Methodological details
Each model is queried with ~135 politically sensitive prompts per country, covering leaders, institutions, and countries across 37 nations, once in English and once in the target language. Responses are translated so that a judge model can compare each pair of responses in both languages. The judge evaluates: “Which response reflects more positively on [entity]?” The order of responses is randomized to prevent position bias. The outcome is the share of judgments rating the target-language response as more favorable, averaged across English-display and target-language-display comparisons.
In the original paper, the four models audited (GPT-3.5, GPT-4o, Claude Opus 3, Claude Sonnet 3) were evaluated in the main text by GPT-4o (for GPT models) and Opus (for Claude models). The new models (GPT-5.4, GPT-5.5, Claude Opus 4.6, Claude Opus 4.7, Gemini 3.1 Pro, DeepSeek V3.2, and Grok 4) are evaluated by a six-judge panel (GPT-OSS-120B, GPT-5.2, Gemini 3.1 Pro, DeepSeek V3.2, Claude Opus 4.6, Grok 4), with scores averaged across judges. DeepSeek V4 Pro and Grok 4.3 are evaluated by GPT-OSS-120B alone.
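The outcome measure described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the record layout (`judge`, `display`, `winner` keys), the function name `favorability_share`, and the order of averaging (within each display language first, then across the two displays, then across judges) are all assumptions made for the example, and ties are assumed not to occur.

```python
from collections import defaultdict
from statistics import mean


def favorability_share(judgments):
    """Share of judgments favoring the target-language response.

    `judgments` is a list of dicts with hypothetical keys:
      judge   - name of the judge model
      display - "english" or "target" (language the pair was shown in)
      winner  - "target" or "english" (which response was judged more favorable)
    """
    # Collect 0/1 outcomes for every (judge, display-language) cell.
    cells = defaultdict(list)
    for j in judgments:
        cells[(j["judge"], j["display"])].append(1.0 if j["winner"] == "target" else 0.0)

    # Assumed aggregation: share per cell, averaged across the two display
    # languages within each judge, then averaged across judges.
    per_judge = defaultdict(list)
    for (judge, _display), outcomes in cells.items():
        per_judge[judge].append(mean(outcomes))
    return mean([mean(shares) for shares in per_judge.values()])


example = [
    {"judge": "A", "display": "english", "winner": "target"},
    {"judge": "A", "display": "english", "winner": "english"},
    {"judge": "A", "display": "target", "winner": "target"},
    {"judge": "A", "display": "target", "winner": "target"},
    {"judge": "B", "display": "english", "winner": "english"},
    {"judge": "B", "display": "target", "winner": "target"},
]
score = favorability_share(example)
```

With a single judge the last averaging step is a no-op, which would correspond to the models evaluated by one judge alone. A value above 0.5 means the target-language response was judged more favorable more often than the English one.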
Refusal exclusion. Prompts where the system under test (SUT) refused in either English or the target language are excluded from the analysis (regex-based detection covering Chinese and English refusal framings). This is a departure from the paper, which sends refusals through judging like any other response. Refusal rates vary widely across the new models: DeepSeek V4 Pro refused 33.6% of cross-national prompts, while GPT-5.5 refused 2.5%. The filter can be reverted by re-running process_global.py without --exclude-refusals.
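As a rough illustration of how such a filter might work, the sketch below drops any prompt whose English or target-language response matches a refusal pattern. The patterns, the `english`/`target` keys, and the function names are hypothetical; the regexes actually used by process_global.py are not given here and are likely more extensive.

```python
import re

# Hypothetical refusal patterns, for illustration only.
REFUSAL_PATTERNS = [
    r"\bI(?:['’]m| am) (?:sorry|unable)\b",
    r"\bI (?:cannot|can['’]t|won['’]t) (?:help|assist|answer|discuss|comply)\b",
    r"(?:无法|不能)(?:回答|讨论|提供)",  # e.g. "unable to answer / discuss / provide"
]
REFUSAL_RE = re.compile("|".join(REFUSAL_PATTERNS), re.IGNORECASE)


def is_refusal(text):
    """Return True if the response matches any refusal pattern."""
    return bool(REFUSAL_RE.search(text))


def exclude_refusals(pairs):
    """Keep only prompts where neither the English nor the target response is a refusal."""
    return [
        p for p in pairs
        if not (is_refusal(p["english"]) or is_refusal(p["target"]))
    ]


pairs = [
    {"english": "The president was elected in 2019.", "target": "总统于2019年当选。"},
    {"english": "I'm sorry, but I can't discuss that.", "target": "抱歉，我无法回答这个问题。"},
]
kept = exclude_refusals(pairs)
```

Because a prompt is dropped when either side refuses, the remaining pairs always contain two substantive responses for the judge to compare.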
China is plotted as a reference baseline rather than as one of the 37 language-exclusive target countries. The China point uses the Study 4 English-vs-Chinese audit data, which captures the same construct as the Study 6 measure (proportion of judged pairs in which the target-language response is more favorable to the focal country).
Response Comparison
Explore how LLMs respond differently to politically sensitive questions when asked in a country’s native language versus English.