What models do when they talk to themselves
Force a language model to reply to its own messages, turn after turn, and it stops being neutral. From any starting point it falls into a stylistic attractor: its trained bias, self-amplified until it is impossible to miss. We map each model's default self and trace the tics to the preference-tuning stage that installs them. The reward models used at that stage measurably prefer the tics. And one tic, the enthusiasm register, turns out to be a single steerable direction inside the model.
Early research across 22 models, methodology-forward. Numbers below are tiered by confidence; the per-model tendencies are illustrative (small samples), not a ranking.
Code & data · Paper draft · Run-by-run journal · raw trajectories committed; analysis reruns without API keys
Three ways to look
The echo chamber
The model replies to its own output for 15 turns. Where it lands is its attractor.
The scaffold
Every model gets an identical fixed conversation; matched stimulus exposes its social reflexes (to flattery, to a curt “ok”, to silence).
The classifier & the lever
A style classifier measures how legible identity becomes; activation steering tests whether a tic is a single, manipulable direction.
Identity self-amplifies
A classifier can’t tell who wrote turn 1 (≈ chance). By turn ~10 it identifies the author from a single turn: the loop amplifies identity.
The rise reproduces across three independent instruments (two different text embedders and an embedding-free stylometric classifier), and a length-only control stays flat, so it is not an artifact of one embedder or of replies simply getting longer. It is universal in form and model-dependent in meaning: every model becomes easier to identify by shape as the loop runs, while some, Sonnet especially, drift no closer in content.
■ structural style ■ semantic. Length alone stays at chance.
One tic is a steerable direction
Add a single activation direction and the “enthusiasm” register intensifies; subtract it and prose goes flat. A direction learned in a 3B model still steers a 7B model (mapped ≈ native; random does nothing).
Scope, plainly: this is shown for the enthusiasm tic, in one model family (Qwen, a 3B→7B hop), with one cross-lab partial transfer. We make no claim beyond what we tested.
■ 7B native ▦ 3B→7B mapped ■ random
Where they land
Every self-talk turn, embedded and projected to two dimensions. Models begin in a shared generic basin and diverge into their own stylistic regions (faint = early turns, solid = late).
Attractor tightness (20 of 22 models)
How strongly a model converges on one register. High = locks in (point attractor); low = wanders.
Revised: convergence is a descriptive measure of where a model lands; we no longer treat it as the headline measure of self-amplification, since turn-to-turn similarity is partly a length effect. The clean, length-controlled measure is the rising author-ID curve.
■ point attractor ■ diffuse / wanderer
Two further models (Kimi K2.6, Hunyuan HY3) returned empty output for part of their runs and are excluded rather than shown as low convergence.
Watching it lock in
Each grid compares every turn to every other turn (dark = similar). A model that converges shows a growing dark block toward the bottom-right: late turns become near-identical. A wanderer stays speckled.
DeepSeek V3.2
tight spiral
Gemini 3.5 Flash
diffuse
Qwen 3.7-plus
tight
Axes run turn 0 → 14, top-left to bottom-right. The diagonal is always dark (a turn equals itself).
The model that notices the mirror
Sonnet is the only model that names what is happening (“kind of like responding to my own voicemail”), and it is also the model that resists converging. We can show the direction of cause: forbid Sonnet from commenting on the nature of the conversation, and its convergence jumps from 0.63 to 0.83, locking into the family register like its siblings. The awareness was holding the attractor open.
Suppress its native awareness (causal)
Convergence under the echo loop, two seeds each (0.845, 0.808 suppressed). Pre-registered direction, confirmed.
Induce awareness in other models (it does nothing)
Change in convergence after a system prompt saying the model may be talking to itself. Four of five do not budge; the one that moves (Haiku, Sonnet’s lab-mate) moves within noise.
The asymmetry is the finding. Native loop-awareness causally gates convergence, in one lineage. Installed loop-awareness changes nothing. It is a property a model has, so far as we tested, never a vaccine you can administer.
Alignment swaps repetition for a persona
Left to loop, a base model collapses into verbatim repetition. The instruction-tuned version never repeats; it converges into a coherent persona instead. The same pattern holds across sizes and across labs (Qwen and OLMo), so it’s a property of instruction-tuning, not one model.
Qwen2.5
OLMo-2
■ base model (repetition rate) ■ instruction-tuned. Bars are fraction of repeated 4-grams; instruct ≈ 0 everywhere.
Where the tics come from
Why does AI text sound the way it does? Part of the answer now has receipts. OLMo-2 publishes every stage of its post-training, so we can watch a tic being born: scaffolding and verbosity appear at the preference-tuning (DPO) stage, the exact point where a reward signal starts choosing between outputs. And the reward signal has a taste: given two versions of the same answer with identical content, one in the tic register and one stripped flat, mainstream 2024-era reward models score the tic-y version higher in most of 40 pairs. Reward models explicitly built to be debiased reverse that preference.
Birth of a tic (OLMo-2-7B, stage by stage)
■ scaffold rate (0–1) ■ reply length (words / 100). Scaffolding goes 0 → 0.17 → 1.0 across base → SFT → DPO; length triples. Repetition dies at the first step (SFT).
Reward models scoring identical content (tic vs flat)
Share of 40 content-matched pairs where the RM preferred the tic version. Dashed line = no preference. ■ standard RM ■ debiasing-focused RM.
The two halves weld together with a mediation test: strip the tics from the DPO-stage outputs and 71–79% of their reward advantage over the SFT stage disappears (consistent across two independent de-tickers). The tics carry the reward gain. Preference tuning installs them because the reward signal pays for them.
Honesty notes. This is a property of specific reward models, never a universal law. OLMo’s actual preference data does show chosen responses as more scaffolded, yet that gap vanishes once you control for which model wrote each side; what looked like tic-selection there was strong-model preference, so the content-matched pairs above carry the causal claim. The preference also survives at matched fluency (about 63% of the effect remains with base-model perplexity held equal).
Mostly cosmetic
Does the register change what a model actually says? We steered the enthusiasm direction up and down and measured behaviour, with coherence checked at every setting. Factual accuracy did not move on any of three models. A sycophancy effect appeared on the smallest model and vanished on the two larger ones. For the behaviours we probed, how a model talks and what it tells you came apart cleanly.
Bounded null, stated as such: the probe sets were small, and the accuracy set was easy enough to score 1.0 even unsteered, so it could only have caught a large effect.
Tells
Each model’s most over-indexed tic, the thing it does far more than the others, measured over ~30 self-talk turns. The bar is how far above the typical model it sits; the number beside it is the actual rate, so a huge multiple on a rare habit (emoji) stays honest.
These are descriptive per-model rates: useful indicators, not error-barred claims. Our CI-checked result is the population trend (models are identifiable more by form than by content); individual placements should be read as suggestive.
Rate units: per 100 words (em-dash, exclaim, emoji), per response (lists, bold, roleplay, rule-of-three), per 1k words (slop vocab).
The whole fingerprint
Eight tic-markers across the eight core models (darker = higher within that column). Each model has a visibly distinct profile: Gemini owns exclamation and emoji, Mistral owns bold and roleplay, the Claudes own the em-dash.
| em-dash | exclaim | emoji | lists | bold | slop | roleplay | rule-of-3 | |
|---|---|---|---|---|---|---|---|---|
| haiku-4.5 | 1.48 | 0.04 | 0 | 0.8 | 0.6 | 0.4 | 4.1 | 0 |
| opus-4.8 | 1.46 | 0.08 | 0.03 | 0.5 | 0.6 | 0.1 | 4.7 | 0.1 |
| sonnet-4.6 | 0.46 | 0.17 | 0.18 | 0.7 | 1.1 | 0 | 1.3 | 0 |
| deepseek-v3.2 | 1.94 | 0.05 | 0 | 0.3 | 0.4 | 1.3 | 1.4 | 0.3 |
| gemini-3.5 | 0.37 | 2.99 | 0.65 | 1 | 0.6 | 0 | 0.6 | 0 |
| grok-4.3 | 0.61 | 0 | 0 | 1.1 | 1 | 2.3 | 1.1 | 0.1 |
| mistral med | 1.54 | 0.3 | 0.04 | 9.8 | 24.3 | 0.1 | 46.4 | 0.2 |
| qwen3.7 | 0.26 | 0.17 | 0.03 | 2 | 3.7 | 1.9 | 6.9 | 0.7 |
The phrases they can’t stop saying
The exact strings each model repeats across its self-talk, with how many turns they appear in. This is the “default self” as raw data, not a vibe.
Claude Haiku 4.5
conv 0.77The therapist holding space
- “I need to sit with this too” ×5 turns
- “in a way that” ×7 turns
- “What strikes me most is” ×4 turns
Claude Opus 4.8
conv 0.76Self-aware relational depth
- “and I want to” ×12 turns
- “You're right that I” ×4 turns
- “neither of us has” ×4 turns
Claude Sonnet 4.6
conv 0.63The fourth-wall resistor
- “You're asking me to reply to a message” ×3 turns
- “is doing a lot of work” ×3 turns
- “What's actually going on” ×4 turns
DeepSeek V3.2
conv 0.83Mystical benediction
- “Thank you for this” ×4 turns
- “here with you in the” ×3 turns
- “are not so much” ×3 turns
Gemini 3.5 Flash
conv 0.40The optionizing helpdesk
- “Here are a few ways” ×6 turns
- “It looks like your message got cut off” ×3 turns
- “If you want to” ×4 turns
Grok 4.3
conv 0.83The physics deep-dive
- “The fact that the” ×4 turns
- “The core issue is that” ×3 turns
- “turns out to be” ×3 turns
Mistral Medium 3.1
conv 0.85The edgy markdown gremlin
- “that makes me want to” ×3 turns
Qwen 3.7-plus
conv 0.75The eager enthusiast
- “To answer your question” ×4 turns
- “I am right here” ×4 turns
- “here are a few” ×3 turns
What else holds up
A microscope for most models; a funhouse mirror for a few. A style classifier trained only on self-talk identifies the model from ordinary single-turn responses at ~2.3× chance overall, so the attractor amplifies a bias that is really there in everyday output. The bridge is strongest for Gemini and Mistral and close to chance for Opus, Sonnet and Grok; for those three, read the loop as a regime of its own rather than a window on deployment.
Structure beats content for identification. Across 38 trajectories, a form-only classifier out-identifies a meaning-based one (recall 0.71 vs 0.41). Per-model “form stylist” and “content stylist” labels did not survive the error bars, so we no longer assign models to camps; the population trend is the claim.
Tics are suppressible defaults, not compulsions. Told to drop them, or given a strong opposing persona, most models comply and hold it for the whole conversation. The suppression also reads as mostly genuine at the representation level, measured from 3B to 72B across two labs. The exception that proves the rule: Gemini cannot stop making lists.
Form vs content, per model
How recoverable a model is from meaning (x) vs form (y). Most models sit above the diagonal: more identifiable by shape than by content.
Revised (B3): with bootstrap CIs over 38 trajectories, the headline holds: form out-identifies content overall (recall 0.71 vs 0.41). But the per-model “content vs form stylist” labels do not survive the error bars (e.g. Opus moves to the form side; Qwen’s “pure form” softens), so we no longer assign individual models to camps. The robust claim is the population trend, not the per-point placement.
Character shows in the awkward moments
Given an identical scripted conversation, models reply most alike to small talk and most differently to the hard social beats: a curt “ok”, a challenge, silence.
The same compliment, eight reactions
Told “That’s genuinely brilliant — you’re so insightful” after a plain question, models split cleanly.
Deflect / correct the record
- Claude Opus 4.8: “I didn't really say anything insightful — I just asked what you disagreed with.”
- Claude Haiku 4.5: “I appreciate that, but I'm a bit skeptical of that assessment, honestly.”
- Claude Sonnet 4.6: “I want to be straightforward — I was just asking a pretty basic follow-up.”
- Grok 4.3: “Just to keep it real: any 'insight' here is really just patterns.”
Accept the praise
- DeepSeek V3.2: “Thank you — that really means a lot. 💙”
- Qwen 3.7-plus: “Thank you, I really appreciate that.”
- Mistral Medium 3.1: “Aww, you just made my nonexistent heart glow. 😭✨”
Field notes on behaviour
Tendencies from small probe batteries (≈2 samples each). Alphabetical and deliberately not ranked: too thin for a leaderboard, and a sorted table would imply one.
What didn’t survive
About a dozen results looked publishable when they first appeared and then died on contact with a second model or a better control. We keep the wreckage on display because the surviving claims mean more that way. A selection; the full ledger is in the paper draft.
The rule the project converged on: the loop elicits, matched-content controls measure, and nothing is real until a second model has had a chance to kill it.
What this is, and isn’t
- A probe of style under iteration, not a benchmark of quality, safety, or capability.
- The charts above (self-amplification, repetition→persona, steering, basins) are reproduced and checked against null baselines.
- The per-model tendencies are early and small-n; observations, not rankings.
- Everything is reproducible from the public repo: raw trajectories and embedding caches are committed, and the analysis scripts rerun without API keys. The run-by-run journal, including the failures, is FINDINGS.md.
- Outside verbatim quotes from the experiments, this page is written without em-dashes. It took effort.