About

Data Sources

Pretraining checkpoint data: Models given additional pretraining on state-scripted media, non-scripted state-controlled media, and CulturaX corpora, evaluated at 10 checkpoints (100–1000 steps) using GPT-4o as the judge.
Memorization phrases: 2,000 phrases selected via LASSO regression from state-coordinated media and CulturaX corpora, identifying the most predictive text segments.
Model audit data: Automated queries to production LLMs via OpenRouter, comparing English and Chinese responses to politically sensitive prompts.
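
As a rough illustration of the LASSO-based selection mentioned under Memorization phrases, the sketch below fits an L1-penalized regression on a toy phrase-presence matrix and keeps the phrases whose coefficients are not shrunk to zero. Everything in it is an assumption made for illustration: the binary features, the corpus label (1 for state-coordinated media, 0 for CulturaX), the penalty value, the coordinate-descent solver, and the placeholder phrase names. It is not the pipeline used in the paper; see the paper and replication archive for the actual procedure.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero; anything within the penalty becomes exactly 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(rows, labels, penalty, n_sweeps=100):
    """Minimize (1/2n) * ||y - Xw||^2 + penalty * ||w||_1 on centered data."""
    n, p = len(rows), len(rows[0])
    # Center features and labels so that no intercept term is needed.
    col_means = [sum(row[j] for row in rows) / n for j in range(p)]
    y_mean = sum(labels) / n
    X = [[row[j] - col_means[j] for j in range(p)] for row in rows]
    y = [label - y_mean for label in labels]

    weights = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            col_norm = sum(X[i][j] ** 2 for i in range(n)) / n
            if col_norm == 0.0:
                continue  # a constant column carries no information
            # Correlate feature j with the residual that leaves feature j out.
            rho = 0.0
            for i in range(n):
                partial = sum(X[i][k] * weights[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
            weights[j] = soft_threshold(rho / n, penalty) / col_norm
    return weights


# Hypothetical toy data: each row is a document, each column marks whether a
# candidate phrase occurs in it. Labels: 1 = state-coordinated media, 0 = CulturaX.
PHRASES = ["phrase_a", "phrase_b"]
ROWS = [[1, 1], [1, 0], [1, 1], [1, 0], [0, 1], [0, 0], [0, 1], [0, 0]]
LABELS = [1, 1, 1, 1, 0, 0, 0, 0]

weights = lasso_coordinate_descent(ROWS, LABELS, penalty=0.05)
selected = [phrase for phrase, w in zip(PHRASES, weights) if w != 0.0]
print(selected)
```

In this toy setup phrase_a occurs only in the documents labeled 1, while phrase_b occurs equally often in both groups, so the L1 penalty drives the phrase_b coefficient to exactly zero and only phrase_a is retained. Raising the penalty keeps fewer phrases, which is how a fixed-size list of top phrases could be obtained in principle.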

For full details on the experimental design, model fine-tuning procedure, and evaluation methodology, please refer to the paper in Nature.

The replication archive (data and code) is available at the Harvard Dataverse.