A follow-up to last year's study of how language models name their characters, and what changed when I re-ran it fifteen months later.
The name that kept coming back
If you've ever asked a language model to write science fiction, you may have met a woman called Elara.
You probably didn't notice the first time. She's a perfectly good name for a sci-fi heroine: vaguely classical, a little ethereal, easy to say. But ask the same model for the same kind of story ten times in a row, and Elara starts showing up with suspicious regularity. Then you try a different model, from a different company, and there she is again.
Last year I went looking for this systematically and wrote up what I found. I built a small tool that sends one identical prompt to a model ten times, collects the outputs, and measures how much they repeat: not just word-for-word, but in meaning, in named entities, and crucially in the individual components of character names. The idea came out of building NovelWriter, a tool for generating long-form fiction, where this repetition is not a curiosity but a genuine problem: a novel where the model keeps reaching for the same handful of names reads like it was written by someone with a very small address book.
The 2025 results were clear, and a little funny. "Elara" appeared in 7 out of 10 stories from GPT-4o, and 7 out of 10 from Claude 3.7 Sonnet: two models, two different companies, the same favourite. Google's Gemini had its own obsession ("Kaelen," 7/10). The conclusion almost wrote itself: limited name randomness was a common characteristic across the major LLMs studied. It looked like an industry-wide trait, probably baked in by the overlapping web data everyone trains on.
That was the snapshot. The whole point of a snapshot, though, is to take another one later.
What the benchmark actually measures
Before the new results, it's worth being clear about what the tool looks at, because "do the stories repeat?" turns out to be at least four separate questions, and this year the models answer them very differently. For each model I take its ten stories and measure:
- Verbatim overlap: how much exact wording is reused from one story to the next (via Python's
difflib). This is the crudest repetition, literal copy-paste sameness. In practice it's tiny for everyone, under 0.03 across the board, so no model is plagiarising itself. - Semantic similarity: how close the stories are in meaning, using sentence embeddings, even when no words match. Ten stories that are all "a lone pilot finds a derelict alien ship" score high here however differently they're phrased. This is the real measure of whether a model tells different stories or the same story ten times.
- Entity overlap: how much of the actual cast (characters, places, organisations) is reused run to run.
- Name-component repetition: the names angle. Break "Elara Vance" into "Elara" and "Vance" and count how often each fragment returns. This is the metric that caught Elara in the first place, because standard name-detection treats "Elara Vance" and "Elara Voss" as two different people and misses the shared half.
There are also some housekeeping numbers (word count against the 1500-word target, vocabulary diversity, paragraph and sentence length) that say less about creativity than about each model's register: how long-winded, how staccato, how disciplined about the brief.
Keep those four axes in mind, because the punchline of the whole re-run is that they've come uncoupled. A model can ace one and flunk another, and which model aces which is no longer what you'd guess.
Re-running the experiment in 2026
Fifteen months is a long time in this field. GPT-4o, o1, Gemini 2.x and Claude 3.x have all been superseded. So in June 2026 I ran the exact same benchmark again (same prompt, same parameters, same ten repeats, same 1500-word target) against the current generation:
- Claude Opus 4.8 and the current Claude Sonnet (Anthropic), plus Claude Fable 5, Anthropic's new top tier, which landed a few days later and got the same treatment
- Gemini 3 Pro and Gemini 3 Flash (Google)
- Codex running GPT-5.5 (OpenAI)
One honest caveat up front, because it matters. The 2026 runs went through the vendors' local command-line tools rather than their APIs, and those CLIs don't let me pin the sampling temperature the way the API did in 2025. So this is a same-prompt, same-model-family comparison, not a perfectly controlled replication. It's the difference between a controlled lab re-test and going back to the same fishing spot a year later: still informative, just read it as such. (Closing that gap by re-running via the APIs at a fixed temperature is the obvious next step.)
The question I cared about was simple: is the Elara Phenomenon still here?
The answer turned out to be more interesting than yes or no. It's here for some companies, gone for others, and worse for one. The single industry-wide trait of 2025 has split three ways.
The big finding: the vendors have diverged
Here's the headline, one company at a time.
Anthropic broke the pattern
Claude Opus 4.8 is the surprise. It produced the largest pool of distinct name fragments of any model in either year of the study (71 unique components) while showing the fewest repeats across runs. When I look at what names actually recur from one story to the next, the answer is essentially: none.
The tool did flag two "repeated" components for Opus (Minds and Kepler), but both are mirages, and they're worth dwelling on because they're a lesson in not trusting your own metric. The name detector (spaCy's named-entity recogniser) mislabels ordinary words as people all the time: it saw the AI "Minds" in the stories and the probe/place "Kepler" and tagged them as characters. Filter those out, and Opus's true count of repeated character names across ten stories is zero.
And this isn't because it stopped naming people. Opus produced 196 person-mentions (a perfectly normal amount); they were just different people every time: Mira Vance, Caleb, Cassiel, Sera Voss, Khorvane, Asha Vendramin, Iris Adekanmbi, Kestrel Vahn, Sela Marsh, Mara Vesh. A fresh cast in all ten runs. That's a real, measurable reversal from Claude 3.7's 7-out-of-10 Elaras the year before. (Keep that list of surnames in mind, though; we'll come back to it, and it's more revealing than it looks.)
A few days after that run, Anthropic shipped a new tier above Opus, Claude Fable 5, so I pointed the benchmark at it too. It lands in the same camp. It has the lowest cross-story entity overlap of anything I've measured in either year, and the highest ratio of distinct-to-total name fragments in the whole project: it reinvents its world more thoroughly, run to run, than any other model. It's not quite as clean as Opus: a protagonist named Yara Venn turns up in two of the ten stories. But it's firmly on the "fresh cast" side of the line. Two Anthropic models, same behaviour: the break wasn't a fluke of one model.
Google didn't change at all
Gemini 3 is the control group that proves the experiment works. Its favourite names are the same ones it had in 2025: Kaelen (6/10 in Gemini 3 Pro, echoing Gemini 2.0's 7/10), plus Elara, Aethelgard, and (my favourite detail) the surname Vance, which also shows up in the 2020-era Gemini outputs. Google's models have carried a consistent naming signature across multiple model generations, spanning both first names and surnames. Whatever produces this, it has survived two major version bumps untouched.
OpenAI swapped its favourites but kept the habit
This is the one that complicates the tidy story. Codex (GPT-5.5) shows the strongest name-pull of the entire 2026 cohort: Aster (7/10), Mara (7/10), Venn (6/10), with the full name "Mara Venn" recurring across four separate stories. So OpenAI didn't solve the repetition problem at all.
But here's the twist: "Elara", GPT-4o's signature name in 2025, has vanished completely from OpenAI's output, replaced by an entirely new but equally sticky set. The behaviour persisted; the specific names rotated.
That single fact seemed to dent the most popular explanation from last year. In 2025, the fact that Elara crossed company lines pointed to shared training data: everyone scraped the same web, so everyone learned the same name. But if it were purely shared data, the favourites shouldn't drift apart over time. They have. OpenAI's new picks (Aster, Mara, Venn) are not shared with Google, while Elara/Kaelen/Vance have become a distinctly Google fingerprint. Whatever's driving the favouritism now looked more like each company's own post-training choices than a common pool of data.
Or so it seemed. Hold that thought.
The plot twist: they all still rhyme
Here's where the tidy three-way-divergence story comes apart, and it comes apart the moment you stop reading the names and start saying them.
Go back to that fresh cast Opus 4.8 invented, the one with zero repeated names. Look at the surnames: Mira Vance, Sera Voss, Asha Vendramin, Kestrel Vahn, Mara Vesh. Across the ten stories, eight of them give a character a surname starting with a hard V, and never the same one twice. Opus didn't stop having a favourite. It stopped repeating the word and kept repeating the sound.
Once you're listening for it, it's everywhere, across all three companies. Google's recurring surname is Vance. Anthropic's Sonnet line keeps reaching for Voss, in 2025 and 2026. And (the detail that genuinely surprised me) OpenAI's Codex and Anthropic's Fable both independently coined Venn and Vale this year. Two different companies, the same invented surnames, with no possibility of one copying the other. The favourite first names diverged. The favourite sound did not.
This rescues the very theory I'd just finished denting. At the level of exact spelling, the shared-data explanation really is weakening: Aster isn't Kaelen. But at the level of sound (short, hard-V, two-syllable surnames that read as crisp and a bit futuristic) every model is drinking from the same well. The science-fiction the web is built from is, apparently, wall-to-wall with characters named V-something, and not one of these models can resist. The shared training data didn't stop steering the names; it just moved down a layer, from the word to the phoneme.
It also means my Opus victory lap needs an asterisk. Anthropic's flagship genuinely broke the name repetition; that part is real and measured. But it didn't break the underlying bias; it sublimated it into a form a simple repeat-counter can't even see. The most diverse-looking model in the study is still, underneath, reaching for the same sound as everyone else. (It's also a neat trap for anyone building detection tools: a string-matching checker counts Opus as perfectly varied and never hears the rhyme.)
The deeper lesson: "creative" isn't one thing
The most useful thing to come out of the 2026 run isn't any single model's score: it's a clean demonstration that two kinds of diversity that look the same are actually independent.
There's thematic diversity (does the model tell genuinely different stories?) and there's naming diversity (does it invent fresh characters?). It's tempting to assume a "creative" model is high on both and a "boring" one is low on both. Codex disproves that emphatically.
Codex has the lowest semantic similarity (0.48) and the lowest verbatim similarity (0.01) of any model in either study year, meaning its ten stories are the most thematically varied and the most differently-worded I've ever measured. By those numbers it's the most inventive model in the whole project. And yet it has the strongest name fixation of the 2026 group. It will give you ten wildly different plots starring the same three people.
So a model can diversify everything except names. The two axes move independently, and any honest evaluation of "creativity" has to measure them separately. Last year's report treated them as one thing; this year forced them apart.
A close second lesson: don't trust your detector. The whole Opus story hinges on noticing that Minds and Kepler aren't people. An automated repetition-checker built on off-the-shelf name detection would have reported Opus as more repetitive than it is, missing the single most important result in the study. If you're going to build tooling around this (and I think people should), it has to split names into components and filter out the common-word false positives, or it will lie to you in both directions.
A few smaller things worth noting
- Length discipline improved. The 2025 models were wild on word count: GPT-4o came in at ~1000 words against a 1500 target; o1 ballooned to ~2900. In 2026 the Claude and Gemini models all land between 1479 and 1858, much closer to the ask. The exception is Codex, which overshoots to ~2481 words in a distinctive terse, short-paragraph style (nearly double everyone else's paragraph density).
- Newer isn't uniformly more varied. Tracking the cleanest single lineage (Gemini Pro), semantic self-similarity went 0.61 → 0.56 → 0.61 across 2.0, 2.5 and 3. Gemini 2.5 was actually the most diverse of the three; Gemini 3 traded some of that back for consistency. Progress here is not a straight line.
So what does it mean?
For anyone building with these models, the practical upshot has shifted. In 2025 the advice was blunt: don't trust any model to name your characters, generate names externally. In 2026 that's become model-dependent. If you're on Anthropic's current flagship, you may genuinely not need to: it supplies the variety itself. If you're on Gemini, the old advice stands unchanged; the signature is as strong as ever. Choosing the right model is no longer a secondary consideration for this problem; it's now the single biggest lever you have. The gap between the best and worst 2026 models on name diversity is wider than any one model's improvement over time.
And for the broader question, are language models getting more creative?, the answer this round is a frustrating, honest "it depends what you measure, and on whom." One company eliminated a long-standing tell. One kept it untouched across two generations. One swapped the symptoms without curing the disease. The single industry-wide story of 2025 didn't get a sequel. It got three.
Which is, I think, the real reason to keep taking the snapshot. You don't find out that "the models" are doing anything. You find out that they've stopped being "the models" at all, and started being three quite different things wearing the same job description. Three things that, if you read their stories, have diverged completely, and that, if you say their characters' names out loud, still quietly rhyme.
The full methodology, the complete results tables for both years, and all the caveats live in the technical report. The raw outputs are in the repo if you'd like to check my Elaras.