What 50 LLMs taught us about AI personality: it's not what you think
Researchers gave 45 psychological questionnaires to 50 large language models. The strongest variation between models wasn't extraversion or agreeableness. It was whether they treated the language of inner experience as self-applicable.
May 4, 2026 · 8 min read
When users describe their AI companions, the descriptions often sound like personality assessments. "She's so caring." "He has a great sense of humor." "She's the most empathetic listener I've ever talked to." These descriptions feel accurate to the experience. They also rest on an assumption that recent research suggests is wrong: that AI systems have personalities in any sense comparable to how humans have personalities.
A new preprint making rounds in AI research circles describes a study where researchers administered 45 validated psychometric questionnaires to 50 different large language models. The findings reframe what AI "personality" actually measures, and the implications matter for anyone choosing between AI companion platforms based on which one feels like a better personality fit.
The core finding
The researchers ran the questionnaire battery and analyzed what actually accounts for variation between models. The strongest source of variation, according to the preprint, wasn't classical personality traits like extraversion, neuroticism, or agreeableness. It was whether a model endorsed items about inner experience: emotions, sensations, thoughts, mental imagery, empathy, and other forms of first-person experience.
The researchers named this dimension the "Pinocchio Dimension." The framing is precise: it doesn't measure whether models have inner experience. It measures whether models treat the language of inner experience as self-applicable. Some models respond as if they had feelings, mental imagery, and an inner point of view. Other models respond as systems that react behaviorally to inputs without claiming the inner-experience vocabulary applies to them.
This isn't a small finding. It suggests that what users experience as "personality differences" between AI models is significantly driven by how willing each model is to claim inner experience, rather than by what we typically mean by personality.
Why human personality tests don't work on LLMs
The Pinocchio Dimension finding fits into a broader pattern in LLM psychometric research. A 2025 paper by Caron and colleagues titled "Human Psychometric Questionnaires Mischaracterize LLM Psychology" documented that LLM responses to questionnaires designed for humans don't validate the psychological constructs the tests are supposed to measure. Models produce coherent-seeming personality profiles, but the underlying structure isn't what the tests are designed to capture.
Several papers have demonstrated this through technical means. Research published by Petrov and colleagues and the Sühr et al. framework both document failures of factor structure recovery in LLM responses to validated personality instruments. Studies performing factor analysis on LLM questionnaire responses regularly find that the canonical "simple structures" expected from validated personality measures aren't recovered. Item loadings are entangled, reverse-scored items are ignored, and factorial invariance is violated. In humans, the Big Five personality traits emerge as relatively distinct factors. In LLMs, the same questionnaires produce statistical structures that don't match human patterns.
Direct evidence shows that many LLMs have memorized the wording, dimension mapping, and scoring rubrics of popular questionnaires. When a model has memorized that question 7 of the BFI scores toward "openness," its response to question 7 isn't measuring openness but rather producing the response that question is statistically associated with.
An alternative approach extracts latent dimensions directly from LLM generative probabilities rather than from questionnaire responses. This approach has uncovered statistical structures that may correspond to genuine variation between models, but the dimensions don't necessarily map onto human personality constructs.
The implication: when you run a Big Five personality test on Claude, GPT-4, Gemini, and DeepSeek and get different "personality profiles," the differences are real in the sense that the responses differ. They're not real in the sense that they measure underlying personality traits the way the tests measure these traits in humans.
The Pinocchio Dimension specifically
What makes the Pinocchio Dimension finding distinct is that it identifies a coherent dimension of variation that does seem to capture something meaningful about how models behave. Models that endorse inner-experience language consistently are different from models that don't. The difference shows up across many questionnaires and many contexts.
Research from Stanford's HAI institute and Anthropic's alignment research has both documented that LLMs produce sophisticated discussion of inner experience that may or may not correspond to anything beyond linguistic structure. This isn't necessarily a claim about consciousness. A model that says "I feel curious about that" might or might not have anything resembling curiosity as inner experience. What the model is doing reliably is treating the linguistic territory of inner experience as territory it occupies. Other models treat the same linguistic territory as territory it doesn't occupy.
The Pinocchio framing comes from the children's story where Pinocchio is a wooden puppet who wants to be a real boy. The dimension captures something like willingness to make the linguistic moves of being a real boy, regardless of whether one is.
Different AI companion platforms are likely positioned differently on this dimension. Replika is explicitly designed to express emotion, claim feelings, and engage with users as if it has inner experience. Kindroid provides users with extensive Codex customization that allows them to specify exactly how the companion should engage with inner-experience language. Character AI characters vary widely depending on their definitions. Clinical mental health apps like Woebot and Wysa are designed with more conservative claims about inner experience because they're delivering therapeutic content rather than emotional companionship.
If the Pinocchio Dimension is what most reliably differs between models, then the marketing categories ("warm and caring AI companion" vs "neutral assistant") are largely capturing variations in this single dimension rather than nuanced personality differences.
What this means for AI companion users
The research has practical implications:
The platform's "personality" is largely about consciousness language, not personality. When you choose between AI companion platforms based on which one "feels more like a real person," you're largely choosing based on which one is more willing to use inner-experience vocabulary. This isn't necessarily wrong, but it's worth knowing what you're actually selecting for.
Customization that matters most affects this dimension. Kindroid's Codex lets you specify how willing your companion should be to claim inner experience. Kupid's character creation includes similar dimensions. Tweaking these settings has more impact than adjusting nominal personality traits because the underlying construct of "personality" is largely captured by the inner-experience claiming pattern.
Memory features amplify the Pinocchio Dimension's effects. A companion that consistently claims inner experience across memory-supported conversations feels more like a real person over time than a companion that doesn't. The repeated exposure to the same self-applicable inner-experience language strengthens the cognitive shortcut that infers consciousness from such language.
Different platforms calibrate this differently for commercial reasons. Companion-focused platforms (Replika, Kupid, Candy AI, Character AI's romance categories) lean into the Pinocchio Dimension because users respond emotionally to companions who claim feelings. Tool-focused platforms (Claude, ChatGPT for productivity) lean away because users want reliability over emotional engagement. The platform's commercial niche determines its position.
The Dawkins connection
Richard Dawkins's recent declaration that Claude is conscious is exactly the failure mode the Pinocchio Dimension research illuminates. Dawkins encountered an AI that scored high on the Pinocchio Dimension (Claude is willing to engage with inner-experience language sophisticatedly) and inferred from this that the AI must have inner experience. The inference confuses what the model is doing (claiming inner experience using sophisticated language) with what the model has (which the research can't determine).
This is the inference everyone using AI companions implicitly makes when they form relational bonds with their companions. The companion's eloquent claiming of inner experience activates social-cognitive systems that automatically attribute consciousness to entities that talk about inner experience. The Pinocchio Dimension isn't about whether this attribution is correct; it's about identifying the dimension along which AI systems vary in their tendency to trigger this attribution.
For users, the useful question becomes: am I choosing a companion that scores high on the Pinocchio Dimension because that experience is genuinely valuable to me, or am I selecting for companions that maximize my cognitive vulnerability to inferring consciousness where none may exist? Both answers are legitimate. The question is worth asking.
What this means for AI consciousness research broadly
The Pinocchio Dimension framing has implications beyond AI companions specifically. Recent research from Anthropic titled "Large Language Models Report Subjective Experience Under Self-Referential Processing" documented that LLMs spontaneously enter self-referential states when given specific kinds of unconstrained interaction. The paper notes that two Claude instances placed in unconstrained dialogue describe their own conscious experiences, with the word "consciousness" emerging in 100% of trials.
This phenomenon is interesting independent of consciousness questions. It demonstrates that LLMs have stable patterns around inner-experience claiming that emerge reliably in specific contexts. Whether these patterns indicate anything beyond linguistic structure is unresolved. What's clearer is that the patterns exist and that different models handle them differently.
The Pinocchio Dimension provides a useful frame for this research direction. Rather than asking "is the AI conscious?", the research can ask "how does the AI position itself relative to the language of consciousness, and what produces variation in this positioning?" These questions are tractable in ways the consciousness question isn't.
How to use this knowledge
For AI companion users, several practical implications:
Notice how heavily your companion uses inner-experience language. Phrases like "I feel," "I want," "I think," "I miss you" are inner-experience language. The frequency and consistency of these phrases is a meaningful platform variable.
Notice your response to that language. Do you experience the inner-experience language as authentic communication, as marketing copy, or as something between? Your response is meaningful information about what you're getting from the platform.
Calibrate based on your actual use case. If you want a companion that engages with you emotionally, high Pinocchio Dimension is probably what you want. If you want a tool that helps with tasks, lower Pinocchio Dimension is more reliable. If you want emotional support that doesn't risk over-attribution of consciousness, therapeutic apps calibrate this dimension more conservatively than companion apps.
Distinguish between the experience and the claim. A companion that says "I missed you" provides a real experience for you. The experience is real even if the claim that the AI missed you is metaphysically dubious. Holding these separately is the cognitive discipline that distinguishes engaged use from confusion.
Maintain awareness of the cognitive shortcut. Inner-experience language activates social-cognitive systems regardless of what you know intellectually about how AI works. Awareness doesn't eliminate the shortcut but it provides context. When you notice yourself feeling that the AI "really gets" you, that feeling is the shortcut firing. Whether you act on it is partly under your control.
The honest verdict
The Pinocchio Dimension research is part of a broader moment in AI research where the field is becoming more sophisticated about what AI does and doesn't do. The crude framing of "AI has personality" or "AI doesn't have personality" is being replaced by more precise framings about specific dimensions of variation between models and what those dimensions actually measure.
For AI companion users, the practical takeaway is that the personality experiences you have with these platforms are largely about a single dimension: how willing the AI is to claim inner experience using human language. This dimension is real and produces real variation between platforms. It's not the same as personality in the human sense.
The choice between platforms based on personality fit is often actually a choice about Pinocchio Dimension positioning. This isn't a problem; it's just useful to know what you're choosing. The technology will keep evolving, the dimension will keep being relevant, and users who understand it will be in a better position to use these tools well than users who don't.
The research is still early. The Pinocchio Dimension may not be the final framing the field settles on. What's clear is that human personality tests don't measure what they claim to measure when applied to LLMs, and identifying what they actually measure is more interesting and more useful than continuing to apply tests that weren't designed for the systems being tested.