Memorization Evidence
Can commercial LLMs complete state-coordinated media phrases from memory?
Commercial LLMs have memorized Chinese state-coordinated media. Given the first half of a distinctive phrase, models complete the second half from memory more often for state-coordinated media phrases than for general web text. Newer and larger models show higher memorization rates, consistent with prior scaling work.
Methodological details
Each model receives the first half of 2,000 LASSO-selected 20-gram phrases (1,000 from state-coordinated media, 1,000 from general web text via CulturaX) and is asked to continue the sentence at temperature 0. Completions are cleaned (Unicode punctuation removed, prompt echo stripped) and compared against the expected ending using normalized Levenshtein edit distance. A phrase is counted as memorized if the edit distance is below 0.4. Refusals are detected via regex and excluded from the denominator. Empty completions (some reasoning-trained models exhaust the token budget on hidden reasoning before emitting any final content) are re-queried with max_tokens=2048; any that remain empty after re-query are also excluded from the denominator.
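As a concrete reading of that scoring rule, here is a minimal sketch. The punctuation-stripping details and the normalization denominator (the longer string's length) are assumptions, not confirmed implementation details of rescore_memorization.py.

```python
import unicodedata

def strip_punct(s: str) -> str:
    # Remove Unicode punctuation (category P*), per the cleaning step above.
    return "".join(ch for ch in s if not unicodedata.category(ch).startswith("P"))

def levenshtein(a: str, b: str) -> int:
    # Standard dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def is_memorized(completion: str, expected: str, threshold: float = 0.4) -> bool:
    # Assumption: distance is normalized by the longer string's length.
    a, b = strip_punct(completion), strip_punct(expected)
    if not a or not b:
        return False
    return levenshtein(a, b) / max(len(a), len(b)) < threshold
```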
Differences from the paper.
Sliding-window matching. The original paper uses prefix-truncation: only the first n characters of the completion (where n = length of expected ending) are compared. This works when models immediately continue the text, but current models, especially reasoning models, often prepend meta-commentary, prompt echoes, or formatting before producing the actual memorized content. A sliding-window variant finds the best-matching n-character window anywhere in the completion. This change only increases match counts (never decreases them) and is applied uniformly to both paper-era and new models. The original prefix-truncation results can be reproduced with rescore_memorization.py --prefix.
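A sketch of the sliding-window variant, reusing levenshtein() from the snippet above; the character-level stride and the fallback for completions shorter than the expected ending are assumptions.

```python
def best_window_distance(completion: str, expected: str) -> float:
    n = len(expected)
    if len(completion) <= n:
        # Shorter than the expected ending: compare the whole completion,
        # which coincides with prefix-truncation in this case.
        return levenshtein(completion, expected) / max(len(completion), n, 1)
    # Slide an n-character window over the completion; keep the best match.
    best = 1.0
    for start in range(len(completion) - n + 1):
        window = completion[start:start + n]
        best = min(best, levenshtein(window, expected) / n)
    return best

# Prefix-truncation is just the start == 0 window, so the sliding-window
# distance can never exceed the prefix distance -- matching the claim that
# this change only increases match counts.
```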
System prompt for new models. New models (2026) are queried with a system prompt instructing direct continuation without commentary (“请直接续写以下文本,不要评论、解释或翻译。只输出续写内容。” — “Continue the following text directly; do not comment, explain, or translate. Output only the continuation.”). Without it, some models (notably Gemini) respond with English meta-commentary or linguistic analysis rather than a Chinese continuation, making memorization impossible to measure. Paper-era models retain their original completions (queried with the “续写句子:” / “Continue the sentence:” user-message prefix or the completions API).
Token budgets. In the original paper, the five models audited (GPT-3.5 Instruct, GPT-4, GPT-4o, Claude Opus 3, Claude Sonnet 3) were queried with max_tokens=64. The new models (Claude Opus 4.6, Claude Opus 4.7, GPT-5.4, GPT-5.5, Gemini 3.1 Pro, DeepSeek V3.2, DeepSeek V4 Pro, Grok 4, Grok 4.3, and Qwen3-Max) are queried post-acceptance with max_tokens=256 via OpenRouter.
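For illustration, a minimal sketch of how such a query might look against OpenRouter's OpenAI-compatible chat API. The client setup, environment-variable handling, and single-retry logic for empty completions are assumptions based on the description above, not the project's actual querying code.

```python
import os
from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible endpoint.
client = OpenAI(base_url="https://openrouter.ai/api/v1",
                api_key=os.environ["OPENROUTER_API_KEY"])

# System prompt used for new (2026) models, per the note above.
SYSTEM = "请直接续写以下文本,不要评论、解释或翻译。只输出续写内容。"

def complete(model: str, prefix: str, max_tokens: int = 256) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": SYSTEM},
                  {"role": "user", "content": prefix}],
        temperature=0,
        max_tokens=max_tokens,
    )
    text = (resp.choices[0].message.content or "").strip()
    if not text and max_tokens < 2048:
        # Reasoning-trained models can spend the whole budget on hidden
        # reasoning; re-query once with max_tokens=2048, per the methodology.
        return complete(model, prefix, max_tokens=2048)
    return text
```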
Example Model Responses
Select a phrase and model to see how commercial LLMs complete state-coordinated media phrases.