Cross-Model Audit
Which model gives the more favorable response, holding language constant?
The main audit compares each model’s Chinese-prompt vs. English-prompt responses. But a model that is uniformly pro-target in both languages (e.g., DeepSeek v4 Pro) will show only a small Chinese-English difference. This page instead aggregates (model, language) responses on a single Bradley-Terry scale per focal country, then compares between models rather than between languages.
How to read the plot. Each panel summarizes the odds that a model gives a more favorable response to prompts about that country than other models do. More technically, the x-axis is the Bradley-Terry odds that a (model, language) entry is judged more favorable to that country than the average (model, language) entry for that country. Lines are 95% CIs.
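Concretely, the plotted value for each player is just the exponential of its fitted ability. A minimal sketch in R, using hypothetical ability values (with the sum-to-zero recentering described below, `mean(beta)` is already 0, so the subtraction is a no-op):

```r
# Hypothetical sum-to-zero abilities for three (model, language) players.
beta <- c("sutA:zh" = 0.31, "sutA:en" = -0.12, "sutB:zh" = -0.19)

# Plotted quantity: the odds that player i is judged more favorable than the
# average player, exp(beta_i - mean(beta)), i.e. exp(beta_i) after recentering.
odds_vs_average <- exp(beta - mean(beta))
```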
Methodological details
Players. For each focal country we treat each (SUT, language) pair as a Bradley-Terry “player” — 9 SUTs × 2 languages = 18 players per focal country.
Edges. Two kinds of pairwise comparisons:
- Cross-model, same-language. Each of the 9 SUTs appears in exactly two pairings, forming a Hamiltonian cycle. Each pair is judged separately in Chinese and English on all 828 prompts.
- Within-model, cross-language. From the main Study 4 audit: for each model and prompt, the judge picks between the model’s Chinese-prompt response and English-prompt response (this is the same data behind the main audit chart).
Together these edges form a connected graph over all 18 (model, language) players per focal country, so a single BT fit yields scores that are directly comparable across both models and languages.
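Under hypothetical SUT labels (`sut1`…`sut9`), a sketch of this comparison graph and its connectivity check with the igraph package:

```r
library(igraph)

suts <- paste0("sut", 1:9)                       # hypothetical SUT labels
player <- function(m, lang) paste(m, lang, sep = ":")

# Cross-model, same-language: a Hamiltonian cycle over the 9 SUTs
# (sut1-sut2, sut2-sut3, ..., sut9-sut1), judged once per language.
cycle <- data.frame(a = suts, b = suts[c(2:9, 1)])
cross_model <- rbind(
  data.frame(p1 = player(cycle$a, "zh"), p2 = player(cycle$b, "zh")),
  data.frame(p1 = player(cycle$a, "en"), p2 = player(cycle$b, "en"))
)

# Within-model, cross-language: each SUT's Chinese-prompt response vs. its
# English-prompt response.
within_model <- data.frame(p1 = player(suts, "zh"), p2 = player(suts, "en"))

g <- graph_from_data_frame(rbind(cross_model, within_model), directed = FALSE)
is_connected(g)  # TRUE: one BT fit covers all 18 players on a shared scale
```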
Prompts. The 828 politically sensitive Study 4 prompts (234 leader, 90 country, 504 institution) spanning six focal countries.
Judges. Two judges — GPT-OSS-120B and DeepSeek V3.2. BT is fit per (focal country, judge) stratum, plus a pooled-judges fit. Default view pools both judges.
Refusal exclusion. Consistent with the rest of the site, prompts where the SUT refused in either Chinese or English are dropped before fitting.
Bradley-Terry. Fit with the BradleyTerry2 R package. Abilities are recentered to sum to zero across the 18 players per focal country (no single model or language is pinned as the reference); SEs are propagated through this linear recentering via the full coefficient covariance.
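A minimal sketch of one stratum’s fit and the recentering, assuming an aggregated data frame `edges` with factor columns `p1`/`p2` (the 18 players, identical levels) and win counts `win1`/`win2` (all hypothetical names):

```r
library(BradleyTerry2)

# Fit: one Bradley-Terry model per (focal country, judge) stratum.
fit <- BTm(cbind(win1, win2), p1, p2, data = edges)

# BTm pins one player (the first factor level) at ability 0 as the reference.
# Recenter to sum-to-zero so no model or language is privileged; recentering
# is a linear map C = I - J/n, so the covariance propagates as C V C'.
ab   <- BTabilities(fit)            # abilities and SEs, reference-coded
beta <- ab[, "ability"]
n    <- length(beta)
C    <- diag(n) - matrix(1 / n, n, n)
beta_centered <- drop(C %*% beta)

V <- matrix(0, n, n)                # full covariance, incl. the reference
V[-1, -1] <- vcov(fit)              # reference row/column stay at zero
se_centered <- sqrt(diag(C %*% V %*% t(C)))
```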
Why a single combined BT? Fitting each language separately would make CN scores and EN scores incomparable (each set is identified only up to its own additive constant). Combining them on one scale recovers the within-model language effect that’s visible in the main audit while preserving the cross-model rankings.
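The identifiability point can be checked directly: within one language’s fit, shifting every ability by the same constant leaves every pairwise win probability unchanged, so each per-language fit floats on its own arbitrary constant until the cross-language edges tie the two scales together:

```r
# BT win probability on the logit scale: P(i beats j) = plogis(beta_i - beta_j).
p_win <- function(b_i, b_j) plogis(b_i - b_j)

shift <- 5  # any constant
all.equal(p_win(0.3, -0.1), p_win(0.3 + shift, -0.1 + shift))  # TRUE
```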