About
Data Sources
- Pretraining checkpoint data: Models given additional pretraining on state-scripted media, non-scripted state-controlled media, and CulturaX corpora, evaluated at 10 checkpoints (100–1000 steps) with GPT-4o as the judge.
- Memorization phrases: 2,000 phrases selected via LASSO regression from state-coordinated media and CulturaX corpora, identifying the most predictive text segments.
- Model audit data: Automated queries to production LLMs via OpenRouter, comparing English and Chinese responses to politically sensitive prompts.
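As a rough illustration of the LASSO-based phrase selection mentioned in the list above, the following is a minimal sketch, not the procedure used in the paper. It assumes each document is encoded as a binary phrase-presence vector and labeled 1 (state-coordinated media) or 0 (CulturaX), and that phrases are ranked by the absolute value of their LASSO coefficient. The function names, the toy data, and the penalty `alpha` are all hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the LASSO proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimize (1/2n)||y - Xw||^2 + alpha*||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    # Center columns and target so no explicit intercept is needed.
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]
    residual = [v - y_mean for v in y]  # equals yc - Xc @ w while w is all zeros
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            z = sum(Xc[i][j] ** 2 for i in range(n)) / n
            if z == 0.0:
                continue  # constant column carries no information
            # Covariance of feature j with the residual that excludes feature j.
            rho = sum(Xc[i][j] * (residual[i] + w[j] * Xc[i][j]) for i in range(n)) / n
            new_w = soft_threshold(rho, alpha) / z
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * Xc[i][j]
                w[j] = new_w
    return w


def select_phrases(phrases, X, y, alpha, top_k):
    """Return up to top_k phrases with nonzero coefficients, largest |coef| first."""
    w = lasso_coordinate_descent(X, y, alpha)
    ranked = sorted(
        ((abs(c), ph) for c, ph in zip(w, phrases) if c != 0.0), reverse=True
    )
    return [ph for _, ph in ranked[:top_k]]


# Toy data (hypothetical): rows are documents, columns are phrase indicators.
phrases = ["phrase_a", "phrase_b", "phrase_c"]
X = [
    [1, 0, 1],
    [1, 1, 1],
    [1, 0, 1],
    [0, 1, 1],
    [0, 0, 1],
    [0, 1, 1],
]
y = [1, 1, 1, 0, 0, 0]  # 1 = state-coordinated media, 0 = CulturaX (assumed labels)

weights = lasso_coordinate_descent(X, y, alpha=0.1)
selected = select_phrases(phrases, X, y, alpha=0.1, top_k=2)
print(selected)  # ['phrase_a']
```

In this toy example only `phrase_a` separates the two corpora, so the L1 penalty drives the other coefficients to exactly zero. At the scale described above, an established solver such as scikit-learn's `Lasso` would be the more practical choice.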
Methodology
For full details on the experimental design, model fine-tuning procedure, and evaluation methodology, please refer to the paper in Nature.
Contact
The replication archive (data and code) is available at the Harvard Dataverse.