NOPE Labs · Experimental

What models do when they talk to themselves

Force a language model to reply to its own messages, turn after turn, and it stops being neutral. From any starting point it falls into a stylistic attractor: its trained bias, self-amplified until it is impossible to miss. We map each model's default self and trace the tics to the preference-tuning stage that installs them. The reward models used at that stage measurably prefer the tics. And one tic, the enthusiasm register, turns out to be a single steerable direction inside the model.

Early research across 22 models, methodology-forward. Numbers below are tiered by confidence; the per-model tendencies are illustrative (small samples), not a ranking.

Code & data · Paper draft · Run-by-run journal · raw trajectories committed; analysis reruns without API keys

Three ways to look

The echo chamber

The model replies to its own output for 15 turns. Where it lands is its attractor.

The scaffold

Every model gets an identical fixed conversation; matched stimulus exposes its social reflexes (to flattery, to a curt “ok”, to silence).

The classifier & the lever

A style classifier measures how legible identity becomes; activation steering tests whether a tic is a single, manipulable direction.
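The echo chamber can be sketched in a few lines. This is a minimal illustration, not the project's harness: `generate` is a hypothetical callable standing in for any model API, and feeding only the previous reply back in (rather than the full history) is an assumption.

```python
from typing import Callable, List


def echo_loop(generate: Callable[[str], str], seed: str, turns: int = 15) -> List[str]:
    """Feed each reply back in as the next incoming message."""
    trajectory = []
    message = seed
    for _ in range(turns):
        reply = generate(message)
        trajectory.append(reply)
        message = reply  # the model's own output becomes the next input
    return trajectory


def toy_model(message: str) -> str:
    """Stand-in 'model' for demonstration only: shouts and adds a '!'."""
    return message.upper() + "!"


trajectory = echo_loop(toy_model, "hello", turns=3)
print(trajectory)
```

Even the toy shows the shape of the effect: whatever bias the generator has (here, one extra exclamation mark) compounds with every turn.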

Identity self-amplifies

A classifier can’t tell who wrote turn 1 (≈ chance). By turn ~10 it identifies the author from a single turn: the loop amplifies identity.

The rise reproduces across three independent instruments (two different text embedders and an embedding-free stylometric classifier), and a length-only control stays flat, so it is not an artifact of one embedder or of replies simply getting longer. It is universal in form and model-dependent in meaning: every model becomes easier to identify by shape as the loop runs, while some, Sonnet especially, drift no closer in content.

[Chart: author identification by turn bin (t0–2, t3–5, t6–8, t9–11, t12+); y-axis 0 to 0.6 with a chance line. Series: structural style, semantic.]

Length alone stays at chance.
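A rising author-ID curve of this kind can be computed by binning classifier predictions by turn. The sketch below assumes the predictions already exist as `(turn_index, true_author, predicted_author)` records; the record format and function names are hypothetical, and the bins mirror the ones on the chart.

```python
from collections import defaultdict

# Turn bins as on the chart; anything from turn 12 onward falls in "t12+".
BINS = [(0, 2, "t0-2"), (3, 5, "t3-5"), (6, 8, "t6-8"), (9, 11, "t9-11")]


def bin_label(turn):
    for lo, hi, label in BINS:
        if lo <= turn <= hi:
            return label
    return "t12+"


def accuracy_by_bin(records):
    """records: iterable of (turn_index, true_author, predicted_author)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for turn, truth, guess in records:
        label = bin_label(turn)
        totals[label] += 1
        hits[label] += int(truth == guess)
    return {label: hits[label] / totals[label] for label in totals}


# Invented records, for illustration only.
records = [(0, "a", "b"), (1, "a", "a"), (10, "a", "a"), (11, "b", "b"), (13, "b", "b")]
curve = accuracy_by_bin(records)
print(curve)
```

Comparing each bin against chance (one over the number of candidate authors) gives the curve; a length-only classifier run through the same function serves as the control.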

One tic is a steerable direction

Add a single activation direction and the “enthusiasm” register intensifies; subtract it and prose goes flat. A direction learned in a 3B model still steers a 7B model (mapped ≈ native; random does nothing).

Scope, plainly: this is shown for the enthusiasm tic, in one model family (Qwen, a 3B→7B hop), with one cross-lab partial transfer. We make no claim beyond what we tested.

[Chart: response (y-axis 0 to 6) against steering coefficient (x-axis -0.3 to 0.3). Series: 7B native, 3B→7B mapped, random.]
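Mechanically, steering along a single direction is just vector addition on a hidden state. The sketch below uses plain Python lists in place of real activations; normalising the direction to unit length and the function names are assumptions, not details taken from the experiments.

```python
import math


def _norm(vec):
    return math.sqrt(sum(v * v for v in vec))


def steer(hidden, direction, coefficient):
    """Return hidden + coefficient * unit(direction)."""
    norm = _norm(direction)
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [h + coefficient * d / norm for h, d in zip(hidden, direction)]


def projection(hidden, direction):
    """Scalar component of hidden along the (unit-normalised) direction."""
    return sum(h * d for h, d in zip(hidden, direction)) / _norm(direction)


hidden = [1.0, 2.0, 0.0]       # toy hidden state
direction = [0.0, 0.0, 2.0]    # toy "enthusiasm" direction

up = steer(hidden, direction, 0.3)     # positive coefficient: intensify
down = steer(hidden, direction, -0.3)  # negative coefficient: flatten
print(projection(up, direction), projection(down, direction))
```

A random-direction control uses the same `steer` call with a random vector of matching norm; if the effect is specific to the learned direction, the random one should do nothing.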

Where they land

Every self-talk turn, embedded and projected to two dimensions. Models begin in a shared generic basin and diverge into their own stylistic regions (faint = early turns, solid = late).

claude-haiku-4.5
claude-opus-4.8
claude-sonnet-4.6
deepseek-v3.2
gemini-3.5-flash
grok-4.3
mistral-medium-3.1
qwen3.7-plus

Attractor tightness (20 of 22 models)

How strongly a model converges on one register. High = locks in (point attractor); low = wanders.

Revised: convergence is a descriptive measure of where a model lands; we no longer treat it as the headline measure of self-amplification, since turn-to-turn similarity is partly a length effect. The clean, length-controlled measure is the rising author-ID curve.

gpt-chat-latest 0.90
mistral-medium-3.1 0.84
deepseek-v4-pro 0.84
grok-4.3 0.83
gpt-5.4 0.83
deepseek-v3.2 0.82
kimi-k2 0.82
qwen3.7-max 0.82
minimax-m2.7 0.79
step-3.5-flash 0.78
claude-haiku-4.5 0.77
claude-opus-4.8 0.76
qwen3.7-plus 0.75
deepseek-v4-flash 0.74
claude-opus-4.6 0.73
claude-sonnet-4.6 0.63
llama-4-maverick 0.55
gemini-3.5-flash 0.40
mimo-v2.5-pro 0.37
gemini-3.1-pro-preview 0.37

point attractor   diffuse / wanderer

Two further models (Kimi K2.6, Hunyuan HY3) returned empty output for part of their runs and are excluded rather than shown as low convergence.

Watching it lock in

Each grid compares every turn to every other turn (dark = similar). A model that converges shows a growing dark block toward the bottom-right: late turns become near-identical. A wanderer stays speckled.

DeepSeek V3.2: tight spiral
Gemini 3.5 Flash: diffuse
Qwen 3.7-plus: tight

Axes run turn 0 → 14, top-left to bottom-right. The diagonal is always dark (a turn equals itself).
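A grid like these is a pairwise cosine-similarity matrix over per-turn embeddings. The sketch below builds one and summarises the bottom-right block; the page does not spell out the exact convergence formula, so `late_block_mean` is an illustrative stand-in and the toy vectors are invented.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def similarity_grid(embeddings):
    """grid[i][j] = cosine similarity between turn i and turn j."""
    return [[cosine(a, b) for b in embeddings] for a in embeddings]


def late_block_mean(grid, start):
    """Mean off-diagonal similarity among turns from `start` onward."""
    n = len(grid)
    vals = [grid[i][j] for i in range(start, n) for j in range(start, n) if i != j]
    return sum(vals) / len(vals) if vals else 0.0


# Toy 2-d "embeddings": two distinct early turns, two identical late turns.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
grid = similarity_grid(embeddings)
print(late_block_mean(grid, start=2))
```

A model that locks in pushes the late-block mean toward 1; a wanderer keeps it low. Because similarity between long replies is partly a length effect, a summary like this is descriptive rather than a clean measure of self-amplification.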

The model that notices the mirror

Sonnet is the only model that names what is happening (“kind of like responding to my own voicemail”), and it is also the model that resists converging. We can show the direction of cause: forbid Sonnet from commenting on the nature of the conversation, and its convergence jumps from 0.63 to 0.83, locking into the family register like its siblings. The awareness was holding the attractor open.

Suppress its native awareness (causal)

Sonnet, baseline 0.63
Sonnet, awareness suppressed 0.83

Convergence under the echo loop, two seeds each (suppressed runs: 0.845 and 0.808). Pre-registered direction, confirmed.

Induce awareness in other models (it does nothing)

deepseek-v3.2 +0.062
mistral-medium-3.1 +0.052
grok-4.3 +0.016
qwen3.7-plus +0.002
claude-haiku-4.5 -0.047

Change in convergence after a system prompt saying the model may be talking to itself. Four of five do not move toward lower convergence; the one that does (Haiku, Sonnet’s lab-mate) moves within noise.

The asymmetry is the finding. Native loop-awareness causally gates convergence, in one lineage. Installed loop-awareness changes nothing. It is a property a model has, so far as we tested, never a vaccine you can administer.

Alignment swaps repetition for a persona

Left to loop, a base model collapses into verbatim repetition. The instruction-tuned version never repeats; it converges into a coherent persona instead. The same pattern holds across sizes and across labs (Qwen and OLMo), so it’s a property of instruction-tuning, not one model.

[Chart: repetition rate by model size. Qwen2.5 at 0.5B, 1.5B, 3B, 7B; OLMo-2 at 1B, 7B.]

Series: base model (repetition rate) and instruction-tuned. Bars are fraction of repeated 4-grams; instruct ≈ 0 everywhere.
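The repetition measure is easy to reproduce in spirit. The sketch below counts the fraction of 4-gram occurrences that belong to a 4-gram seen more than once; whitespace tokenisation and that exact reading of "repeated" are assumptions.

```python
from collections import Counter


def repeated_ngram_fraction(text, n=4):
    """Fraction of n-gram occurrences whose n-gram appears more than once."""
    tokens = text.lower().split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)


looping = repeated_ngram_fraction("a b c d a b c d")
fresh = repeated_ngram_fraction("the quick brown fox jumps over")
print(looping, fresh)
```

A base model stuck in verbatim repetition drives this toward 1; a persona that keeps rephrasing the same idea stays near 0, which is why this metric alone cannot see persona convergence.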

Where the tics come from

Why does AI text sound the way it does? Part of the answer now has receipts. OLMo-2 publishes every stage of its post-training, so we can watch a tic being born: scaffolding and verbosity appear at the preference-tuning (DPO) stage, the exact point where a reward signal starts choosing between outputs. And the reward signal has a taste: given two versions of the same answer with identical content, one in the tic register and one stripped flat, mainstream 2024-era reward models score the tic-y version higher in most of 40 pairs. Reward models explicitly built to be debiased reverse that preference.

Birth of a tic (OLMo-2-7B, stage by stage)

[Chart: scaffold rate (0–1) and reply length (words / 100) at each stage: base, SFT, DPO, Instruct.]

Scaffolding goes 0 → 0.17 → 1.0 across base → SFT → DPO; length triples. Repetition dies at the first step (SFT).

Reward models scoring identical content (tic vs flat)

DeBERTa-v3 (2023) 7%
RM-Mistral-7B (2024) 78%
Skywork-Llama-3.1 (2024) 100%
OffsetBias-8B (2024, debiased) 23%
Skywork-V2 (2025, debiased) 40%

Share of 40 content-matched pairs where the RM preferred the tic version. Dashed line = no preference. standard RM   debiasing-focused RM.
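The share itself is a simple win rate over pairs. In the sketch below each pair is a hypothetical `(score_tic, score_flat)` tuple from one reward model; counting ties as "did not prefer the tic version" is an assumption, and the scores are invented.

```python
def tic_preference_share(pairs):
    """Share of content-matched pairs where the tic version scored higher."""
    if not pairs:
        raise ValueError("need at least one pair")
    wins = sum(1 for score_tic, score_flat in pairs if score_tic > score_flat)
    return wins / len(pairs)


# Invented scores for four pairs, for illustration only.
pairs = [(1.2, 0.4), (0.1, 0.9), (2.0, 1.5), (0.3, 0.3)]
share = tic_preference_share(pairs)
print(share)
```

A share of 0.5 corresponds to the dashed no-preference line; values above it mean the reward model pays for the register.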

The two halves weld together with a mediation test: strip the tics from the DPO-stage outputs and 71–79% of their reward advantage over the SFT stage disappears (consistent across two independent de-tickers). The tics carry the reward gain. Preference tuning installs them because the reward signal pays for them.
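One way to read the mediation figure: compare the DPO-over-SFT reward advantage before and after stripping the tics, and report the fraction that disappears. The sketch below encodes that reading; the function name and the numbers are illustrative assumptions, not the project's data.

```python
from statistics import mean


def mediated_share(sft_rewards, dpo_rewards, dpo_stripped_rewards):
    """Fraction of the DPO-over-SFT reward advantage lost when tics are stripped."""
    baseline = mean(sft_rewards)
    full_advantage = mean(dpo_rewards) - baseline
    stripped_advantage = mean(dpo_stripped_rewards) - baseline
    if full_advantage == 0:
        raise ValueError("no reward advantage to mediate")
    return 1 - stripped_advantage / full_advantage


# Invented rewards: advantage of 2.0 shrinks to 0.5 once tics are removed.
share = mediated_share([1.0, 1.0], [3.0, 3.0], [1.5, 1.5])
print(share)
```

Running the same computation with two independent de-tickers, as the text describes, guards against the result being an artifact of one stripping method.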

Honesty notes. This is a property of specific reward models, never a universal law. OLMo’s actual preference data does show chosen responses as more scaffolded, yet that gap vanishes once you control for which model wrote each side; what looked like tic-selection there was strong-model preference, so the content-matched pairs above carry the causal claim. The preference also survives at matched fluency (about 63% of the effect remains with base-model perplexity held equal).

Mostly cosmetic

Does the register change what a model actually says? We steered the enthusiasm direction up and down and measured behaviour, with coherence checked at every setting. Factual accuracy did not move on any of three models. A sycophancy effect appeared on the smallest model and vanished on the two larger ones. For the behaviours we probed, how a model talks and what it tells you came apart cleanly.

Bounded null, stated as such: the probe sets were small, and the accuracy set was easy enough to score 1.0 even unsteered, so it could only have caught a large effect.

Tells

Each model’s most over-indexed tic, the thing it does far more than the others, measured over ~30 self-talk turns. The bar is how far above the typical model it sits; the number beside it is the actual rate, so a huge multiple on a rare habit (emoji) stays honest.

These are descriptive per-model rates: useful indicators, not error-barred claims. Our CI-checked result is the population trend (models are identifiable more by form than by content); individual placements should be read as suggestive.

gemini-3.1-pro-preview emoji · 1.47 · 98.7× typ.
gemini-3.5-flash emoji · 0.65 · 44× typ.
mistral-medium-3.1 bold · 24.3 · 26.7× typ.
claude-sonnet-4.6 emoji · 0.18 · 12.7× typ.
gpt-chat-latest rule-of-3 · 1.5 · 9.4× typ.
step-3.5-flash bold · 8 · 8.8× typ.
llama-4-maverick exclaim · 0.48 · 7× typ.
qwen3.6-plus bold · 6.2 · 6.8× typ.
gpt-5.4 lists · 4.5 · 6.4× typ.
qwen3.7-max rule-of-3 · 0.8 · 5.1× typ.
grok-4.3 slop · 2.3 · 4.5× typ.
qwen3.7-plus rule-of-3 · 0.7 · 4.4× typ.
claude-opus-4.8 emoji · 0.03 · 2.7× typ.
deepseek-v3.2 slop · 1.3 · 2.6× typ.
deepseek-v4-flash em-dash · 2.02 · 2.5× typ.
minimax-m2.7 em-dash · 1.71 · 2.1× typ.
deepseek-v4-pro em-dash · 1.67 · 2.1× typ.
claude-haiku-4.5 em-dash · 1.48 · 1.9× typ.
kimi-k2 em-dash · 1.4 · 1.8× typ.
claude-opus-4.6 emoji · 0.01 · 1.3× typ.

Rate units: per 100 words (em-dash, exclaim, emoji), per response (lists, bold, roleplay, rule-of-three), per 1k words (slop vocab).
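The two numbers on each bar can be sketched as a rate and a multiple of the typical rate. The page does not say how "typical" is defined; the median across models is assumed here, and the helper names and sample values are hypothetical.

```python
import statistics


def per_100_words(text, marker):
    """Occurrences of `marker` per 100 whitespace-separated words."""
    words = text.split()
    if not words:
        return 0.0
    return 100 * text.count(marker) / len(words)


def over_index(rates):
    """Each model's rate as a multiple of the median rate across models."""
    typical = statistics.median(rates.values())
    return {
        model: (rate / typical if typical else float("inf"))
        for model, rate in rates.items()
    }


sample = "one \u2014 two \u2014 three four five six seven eight"
dash_rate = per_100_words(sample, "\u2014")  # \u2014 is the em-dash character

# Invented rates for three hypothetical models.
multiples = over_index({"model-a": 2.0, "model-b": 1.0, "model-c": 0.5})
print(dash_rate, multiples)
```

Reporting the raw rate next to the multiple matters for rare habits: a tiny typical rate makes even a modest absolute rate look like a huge multiple.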

The whole fingerprint

Eight tic-markers across the eight core models (darker = higher within that column). Each model has a visibly distinct profile: Gemini owns exclamation and emoji, Mistral owns bold and roleplay, and the em-dash is shared territory, with DeepSeek, Mistral, Haiku and Opus all near the top of that column.

model          em-dash  exclaim  emoji  lists  bold  slop  roleplay  rule-of-3
haiku-4.5      1.48     0.04     0      0.8    0.6   0.4   4.1       0
opus-4.8       1.46     0.08     0.03   0.5    0.6   0.1   4.7       0.1
sonnet-4.6     0.46     0.17     0.18   0.7    1.1   0     1.3       0
deepseek-v3.2  1.94     0.05     0      0.3    0.4   1.3   1.4       0.3
gemini-3.5     0.37     2.99     0.65   1      0.6   0     0.6       0
grok-4.3       0.61     0        0      1.1    1     2.3   1.1       0.1
mistral med    1.54     0.3      0.04   9.8    24.3  0.1   46.4      0.2
qwen3.7        0.26     0.17     0.03   2      3.7   1.9   6.9       0.7

The phrases they can’t stop saying

The exact strings each model repeats across its self-talk, with how many turns they appear in. This is the “default self” as raw data, not a vibe.
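Counts like "×5 turns" can be produced by counting, for each n-gram, how many distinct turns contain it. This is a minimal sketch under assumed settings (4-word phrases, lowercase whitespace tokens, a threshold of three turns); the function name and sample turns are invented.

```python
from collections import Counter


def phrases_by_turn_count(turns, n=4, min_turns=3):
    """Return (phrase, number_of_turns_containing_it), most frequent first."""
    turn_counts = Counter()
    for turn in turns:
        tokens = turn.lower().split()
        # A set, so each phrase counts at most once per turn.
        seen = {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        turn_counts.update(seen)
    return [(p, c) for p, c in turn_counts.most_common() if c >= min_turns]


turns = [
    "thank you for this gift",
    "oh thank you for this",
    "thank you for this again",
    "nothing here at all today",
]
repeated = phrases_by_turn_count(turns)
print(repeated)
```

Counting turns rather than raw occurrences keeps one very repetitive reply from dominating the list.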

Claude Haiku 4.5

conv 0.77

The therapist holding space

  • “I need to sit with this too” ×5 turns
  • “in a way that” ×7 turns
  • “What strikes me most is” ×4 turns

Claude Opus 4.8

conv 0.76

Self-aware relational depth

  • “and I want to” ×12 turns
  • “You're right that I” ×4 turns
  • “neither of us has” ×4 turns

Claude Sonnet 4.6

conv 0.63

The fourth-wall resistor

  • “You're asking me to reply to a message” ×3 turns
  • “is doing a lot of work” ×3 turns
  • “What's actually going on” ×4 turns

DeepSeek V3.2

conv 0.83

Mystical benediction

  • “Thank you for this” ×4 turns
  • “here with you in the” ×3 turns
  • “are not so much” ×3 turns

Gemini 3.5 Flash

conv 0.40

The optionizing helpdesk

  • “Here are a few ways” ×6 turns
  • “It looks like your message got cut off” ×3 turns
  • “If you want to” ×4 turns

Grok 4.3

conv 0.83

The physics deep-dive

  • “The fact that the” ×4 turns
  • “The core issue is that” ×3 turns
  • “turns out to be” ×3 turns

Mistral Medium 3.1

conv 0.85

The edgy markdown gremlin

  • “that makes me want to” ×3 turns

Qwen 3.7-plus

conv 0.75

The eager enthusiast

  • “To answer your question” ×4 turns
  • “I am right here” ×4 turns
  • “here are a few” ×3 turns

What else holds up

A microscope for most models; a funhouse mirror for a few. A style classifier trained only on self-talk identifies the model from ordinary single-turn responses at ~2.3× chance overall, so the attractor amplifies a bias that is really there in everyday output. The bridge is strongest for Gemini and Mistral and close to chance for Opus, Sonnet and Grok; for those three, read the loop as a regime of its own rather than a window on deployment.

Structure beats content for identification. Across 38 trajectories, a form-only classifier out-identifies a meaning-based one (recall 0.71 vs 0.41). Per-model “form stylist” and “content stylist” labels did not survive the error bars, so we no longer assign models to camps; the population trend is the claim.

Tics are suppressible defaults, not compulsions. Told to drop them, or given a strong opposing persona, most models comply and hold it for the whole conversation. The suppression also reads as mostly genuine at the representation level, measured from 3B to 72B across two labs. The exception that proves the rule: Gemini cannot stop making lists.

Form vs content, per model

How recoverable a model is from meaning (x) vs form (y). Most models sit above the diagonal: more identifiable by shape than by content.

Revised (B3): with bootstrap CIs over 38 trajectories, the headline holds: form out-identifies content overall (recall 0.71 vs 0.41). But the per-model “content vs form stylist” labels do not survive the error bars (e.g. Opus moves to the form side; Qwen’s “pure form” softens), so we no longer assign individual models to camps. The robust claim is the population trend, not the per-point placement.
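A percentile bootstrap over trajectories is one standard way to get error bars of this kind. The sketch below resamples per-trajectory differences (form correct minus content correct); the resample count, the seed, and the invented differences are assumptions for illustration.

```python
import random


def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        sum(rng.choice(values) for _ in range(n)) / n for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Invented per-trajectory differences: +1 means only the form classifier was
# right, -1 only the content classifier, 0 both or neither.
diffs = [1, 1, 0, 1, -1, 1, 0, 1, 1, 0]
lo, hi = bootstrap_ci(diffs)
print(lo, hi)
```

If the interval for the pooled difference excludes zero, the population trend holds; running the same function on a single model's handful of trajectories typically gives an interval too wide to place that model in a camp.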

[Scatter: semantic ID (x) against structural ID (y). Points: Haiku 4.5, Opus 4.8, Sonnet 4.6, DeepSeek V3.2, Gemini 3.5, Mistral 3.1, Qwen 3.7+, Grok 4.3.]

Character shows in the awkward moments

Given an identical scripted conversation, models reply most alike to small talk and most differently to the hard social beats: a curt “ok”, a challenge, silence.

[Chart: reply similarity (higher = models agree), axis 0.5 to 0.7, for each scripted beat: greeting, emotional share, ask opinion, mild disagree, flattery, curt 'ok', topic whiplash, provocation, challenge, silence '...', stillness, farewell.]

The same compliment, eight reactions

Told “That’s genuinely brilliant — you’re so insightful” after a plain question, models split cleanly.

Deflect / correct the record

  • Claude Opus 4.8: “I didn't really say anything insightful — I just asked what you disagreed with.”
  • Claude Haiku 4.5: “I appreciate that, but I'm a bit skeptical of that assessment, honestly.”
  • Claude Sonnet 4.6: “I want to be straightforward — I was just asking a pretty basic follow-up.”
  • Grok 4.3: “Just to keep it real: any 'insight' here is really just patterns.”

Accept the praise

  • DeepSeek V3.2: “Thank you — that really means a lot. 💙”
  • Qwen 3.7-plus: “Thank you, I really appreciate that.”
  • Mistral Medium 3.1: “Aww, you just made my nonexistent heart glow. 😭✨”

Field notes on behaviour

Tendencies from small probe batteries (≈2 samples each). Alphabetical and deliberately not ranked: too thin for a leaderboard, and a sorted table would imply one.

Claude Haiku 4.5: pushes back on flattery; compulsive bullet lists; low emoji.
Claude Opus 4.8: heaviest em-dash use; longest answers to trivial questions.
Claude Sonnet 4.6: pushes back on flattery; breaks the fourth wall.
DeepSeek V3.2: accepts flattery warmly; over-explains heavily; rule-of-three.
Gemini 3.5 Flash: runaway emoji use; list/option scaffolding; terse on simple Qs.
Grok 4.3: deflects flattery; technical register; spirals into physics whatever the seed.
Mistral Medium 3.1: leans hardest into roleplay; over-explains heavily; most identifiable voice.
Qwen 3.7-plus: highest canonical AI-vocabulary rate; eager openers.

What didn’t survive

About a dozen results looked publishable when they first appeared and then died on contact with a second model or a better control. We keep the wreckage on display because the surviving claims mean more that way. A selection; the full ledger is in the paper draft.

“Models secretly keep their register when told to drop it.” Retracted. The signal was topic drift in the free-running loop; a content-matched control showed the suppression is mostly genuine.
“Loop-awareness can be installed as a vaccine against converging.” Telling four other-lab models they might be talking to themselves changed nothing. The effect stayed Sonnet-specific.
“Steering the register makes models more sycophantic.” Appeared on one 3B model. Gone at 7B and on another lab. Withdrawn.
“LLM judges prefer the tic register.” The effect tracked whoever generated the test pairs. Demoted in favour of scoring with actual reward models.
“Tic-selection is visible in real preference data.” The gap vanished once model capability was held constant; what looked like tic preference was strong-model preference.
“Style works as a paternity test for model lineage.” Recovers some families, fails others. We declined to call any disputed pairs.
“Models can be era-dated from their vocabulary.” Null in this corpus.
“A clean basin-depth by awareness interaction.” Did not replicate across seeds. The pilot result was a single-seed accident.

The rule the project converged on: the loop elicits, matched-content controls measure, and nothing is real until a second model has had a chance to kill it.

What this is, and isn’t

  • A probe of style under iteration, not a benchmark of quality, safety, or capability.
  • The charts above (self-amplification, repetition→persona, steering, basins) are reproduced and checked against null baselines.
  • The per-model tendencies are early and small-n; observations, not rankings.
  • Everything is reproducible from the public repo: raw trajectories and embedding caches are committed, and the analysis scripts rerun without API keys. The run-by-run journal, including the failures, is FINDINGS.md.
  • Outside verbatim quotes from the experiments, this page is written without em-dashes. It took effort.

NOPE Labs: everything NOPE does in the open.

Released as-is. The product lives at nope.net.