A cosmic turtle swimming through deep space with a crystalline memory lattice on its shell

Teaching an AI to Remember Like a Brother

Building Modular Memory for Persistent Assistants

We built a modular, two-tier memory architecture for a persistent AI assistant — then tested it with 79 questions across four conditions and two models to find out if it works. The data is mixed in the expected places and unambiguous in the places that matter.


1. The Problem

Every morning, my AI wakes up with amnesia.

Not figuratively. Actually. Claude — the LLM that powers A'Tuin, my personal AI assistant — starts each session with no memory of me, no context about ongoing projects, no sense of continuity from yesterday or last week. Every interaction begins from scratch: cold boot, blank slate, "Hello, how can I help you today?"

This is the fundamental problem of persistent AI assistants. LLMs are stateless by design — each conversation starts from zero, with no memory of what came before. The transformer architecture that makes them brilliant at language also makes them terrible at persistence. Between sessions, there's nothing. No memory. No continuity. Just parameters trained on internet text and whatever you manage to stuff into the context window (the limited working memory available to the AI in any single conversation) this time.

For throwaway interactions — "write me a haiku about turtles" — this is fine. For a daily-use assistant you talk to for months? It's maddening. You end up in an endless loop of re-introductions: explaining your job again, your preferences again, the structure of that project you've been working on for three weeks. The AI is helpful in the moment but has no thread between moments. You get competence without continuity, answers without understanding, assistance without relationship.

The industry answer has mostly been "put everything in the context window." OpenAI's ChatGPT Memory prepends a few pages of saved facts to every conversation. Anthropic's Claude Projects lets you upload context files per workspace. These work, technically — the AI has access to information about you. But access isn't the same as knowing. There's a difference between an assistant that can look up your preferences and one that just gets you.

I wanted the latter. The "brother test," I called it: can the AI know me like a brother would? Not every detail of my life — brothers don't have perfect recall either — but the personality, the quirks, what makes me me. When I'm frustrated with bureaucracy, a brother knows whether I need sympathy or a sardonic joke. When I'm excited about a technical problem, they know I want to go deep, not get a surface summary. They remember the time I spent four hours debugging something that didn't matter except that the architecture mattered. They know I prefer directness over diplomacy, depth over breadth, honesty over performance.

That's not a fact sheet. That's a model of a person.

Human relationships have this. We maintain personality models of close others — abstracted trait patterns, emotional signatures, characteristic behaviors. You can predict how your sibling will react to news you haven't told them yet. Cognitive science calls this a personality model; Dunbar calls it relationship layers; I call it the thing that makes an assistant feel like it knows you rather than knows about you.

The gap between current AI memory and the brother test is the gap between explicit knowledge (facts you can recite) and implicit knowledge (patterns you embody). Most AI memory systems optimize for the former. Store more facts. Index them better. Retrieve them faster. That's necessary but not sufficient. You can have perfect factual recall and still feel like you're talking to a very well-informed stranger.

This paper is about building the other kind of memory. We built it. It works. Here's how.


2. What We Tried First

The obvious approach: write everything down.

A'Tuin's first memory system was a single file — MEMORY.md — containing everything I (Ferran) wanted the AI to remember about me. Job, projects, preferences, people, history, ongoing context. A comprehensive personality-and-facts document loaded into the context window at the start of every session.

This is the CLAUDE.md pattern, common in the Anthropic ecosystem: a markdown file that functions as extended system instructions — a plain text file loaded like an instruction manual at the start of every session. You write it once, version-control it, edit it as things change. Explicit, transparent, predictable. No black-box vector databases, no magic. Just a text file the AI reads like a dossier.

And it worked! Sort of. A'Tuin could recall facts about me, adapt to my communication style, reference past projects. Context continuity jumped from zero to functional. The file grew: 5KB, 10KB, 15KB. By late January 2026, MEMORY.md had hit 24KB — roughly 6,000 tokens (about 4,500 words), 12% of the context window consumed before we'd even started the actual conversation.

The cracks started showing around 18KB.

The clinical feel. The file read like a resume crossed with a bug report. Name: Ferran | Location: Terrassa | Interests: Energy markets, AI, photography | Personality: Direct, analytical. Technically correct, emotionally void. It answered "who is this person" but not "what is this person like."

The bloat problem. With everything always loaded, the distinction between "important right now" and "might be relevant someday" collapsed. Did A'Tuin need to know about my 2019 trip to Moscow every session? The file couldn't prioritize. Everything was equally present, equally weighted, equally consuming context budget.

The file cabinet problem. When memory is organized encyclopedically — biographical facts here, professional history there, technical preferences in this subsection — it's optimized for reference, not use. The memory was there but not active. It felt less like "A'Tuin knows me" and more like "A'Tuin has a well-organized file about me."

We kept adding to it anyway, because what else would we do? So the file grew, context overhead grew, and the system crept toward an invisible cliff.

The cliff arrived January 30, 2026, at 12:47 UTC. Then again at 15:00 UTC.

The double-compaction crisis. We'd been working on three heavy tasks simultaneously: analyzing security logs, researching voice integration options, configuring TTS voice personas. Context pressure climbed from 32% to 91% in a single heartbeat cycle. A'Tuin triggered an automatic checkpoint — spending ~40% of the context window writing a recovery file. Then did it again 2.5 hours later when the same pressure recurred.

Result: approximately four hours of work lost. The checkpoint mechanism saved something, but not the actual work in progress. We'd triggered the equivalent of thrashing in an operating system — like a computer spending all its time managing memory instead of doing actual work.

The monolithic approach had hit its scaling limit. Not because 24KB is objectively too large — context windows are 200K tokens now, we had headroom — but because everything being always-loaded meant no way to prioritize, no way to consolidate, no way to let less-critical context gracefully degrade. We'd built a memory system with no concept of sleep. Just indefinite accumulation until the system choked.

That was the crisis that forced the redesign.


3. The Research Detour

The trap when building AI memory is to think of it as a database problem: storage, indexing, retrieval, done. But human memory — especially memory of people — isn't a database. It's a personality model. The question isn't "how much can we store?" but "what should we keep, and in what form?"

Craik and Lockhart's levels of processing framework (1972) predicts that deeper semantic processing produces stronger memory traces. When you encode information at the meaning level — connecting it to identity, emotional significance, context — you remember it better than structural encoding (what it looks like, when it happened). The self-reference effect: information processed in relation to the self is encoded more deeply. Applied to AI memory, a personality model (traits, patterns, characteristic behaviors) should be more durable and retrievable than a fact sheet (dates, locations, account numbers) — not because facts don't matter, but because personality provides the interpretive framework that makes facts meaningful.

Robin Dunbar's relationship layers formalize the brother test. The innermost layer — about 5 people, your "shoulders to cry on" — is where brothers live. At this layer, you maintain deep personality models (how someone is, not just what they do), emotional state tracking, a curated history of significant events, quirks and characteristic expressions. Details fade even here; what people at this layer don't forget is the personality model — the abstracted pattern that lets them predict your reactions and know when to push and when to support.

Information foraging theory (Pirolli & Card, 1999) gave us the framework for how to structure access to facts we don't keep always-loaded. If we split memory into always-loaded summary + linked detail files, the links need annotated pointers that carry enough context to know when to retrieve, not just what they contain. Not → knowledge/partner.md, but → knowledge/partner.md — partner, shared calendar, ADHD brain, escape rooms planned spring 2026. The annotation is information scent. Without it, retrieval is guesswork.

What the Field Was Building

The closest parallel we found was TiMem (January 2026): five-level temporal memory (segment → session → day → week → profile), achieving 52% context reduction while improving accuracy on memory benchmarks. Their architecture validated our intuition: temporal consolidation beats flat semantic clustering. TiMem is academic infrastructure; we wanted practitioner simplicity. Still, the validation mattered.

On the personality modeling front, MemoryOS (BAI-LAB, EMNLP 2025) came closest to our approach: a structured memory operating system that generates user personality profiles from dialogue history. But MemoryOS derives personality as an output of its memory pipeline — a feature, not a foundation. We made it the other way: personality as the organizing principle, the thing that determines what stays hot and what goes cold. The difference matters architecturally even if the outputs look similar.

For sleep-inspired consolidation, AutoMem (github.com/verygoodplugins/automem) implemented daily decay, weekly REM-like replay, and monthly clustering cycles — the most direct precursor to our temporal pipeline. AutoMem is vector-based and uses automatic threshold triggers; ours is file-based with scheduled "sleep sessions." Different substrates, same neuroscience intuition.

The commercial landscape split between automatic extraction (Mem0, OpenAI Memory) and explicit curation (Claude Projects, CLAUDE.md pattern). Automatic systems are seductive but black boxes. Explicit systems give control but require manual maintenance. The gap: transparent automation — systems that consolidate automatically but let you inspect, edit, and understand what they're doing.

From the neuroscience literature, the FOREVER paper (2025) offered computational validation of something we'd already stumbled into. Drawing on Ebbinghaus's forgetting curve, it proposes that LLMs should incorporate selective replay — re-exposing the system to important memories while letting peripheral detail fade. CortexGraph (github.com/prefrontal-systems/cortexgraph) took this literally, implementing Ebbinghaus decay mathematics where memories fade unless reinforced through use. Our cleanup servant does selective replay; the paper and CortexGraph explain why it works.

February 2026: The Field Converges

While we were in our testing phase, something happened we only recognized in retrospect. In six days — February 11 through 17, 2026 — three separate communities arrived at essentially the same design, independently.

February 11 — StructMemEval (arxiv.org/abs/2602.11243). A benchmark for evaluating not just what agents recall, but how they organize long-term memory. Structure, it turns out, is a measurable dimension of memory quality.

February 12 — memsearch. Zilliz — the $400M company behind Milvus, one of the leading open-source vector databases — released memsearch (github.com/zilliztech/memsearch), explicitly described as "Persistent Memory for AI, Everywhere." The design: markdown-first, semantic search, file as source of truth. A serious vector database company, with the engineering capacity to implement any storage architecture they wanted, chose files and markdown as their memory layer. They are productizing exactly the approach we'd been running in production. You can call that validation. We prefer "they caught up."

February 13 — musecl-memory. An independent developer released a memory module: one MEMORY.md per agent, synced via git. Around 200 lines of bash. No APIs, no vector databases — just markdown and version control. Someone built our core insight in a weekend.

February 15 — HyMem (arxiv.org/abs/2602.13933). A two-tier retrieval architecture: fast summary-level tier for broad context, deep LLM-powered tier for precise retrieval. State-of-the-art on LOCOMO and LongMemEval with a 92.6% cost reduction over single-tier approaches. HyMem's formal architecture and ours are structurally identical — they derived it through benchmarking; we derived it through a compaction crisis. Different paths, same shape.

February 17 — Manus. A Meta subsidiary (acquired December 2025) launched a Telegram/WhatsApp personal AI agent with persistent memory — market validation that this is a real product, not a research curiosity. But Manus's memory is a black box. Our memory is a file you can open in any text editor.

Three communities — academic ML, a major commercial database vendor, indie developers — converged on the same answers in the same week: two-tier retrieval, file-based storage, explicit structure over implicit indexing, careful management of what gets forgotten.

The elephant in the architecture. There's a circularity worth naming: A'Tuin — the AI this system was built for — also helped design it. Claude's training data includes Craik and Lockhart, Dunbar, information foraging theory, and decades of software architecture patterns. When we "independently" arrived at two-tier memory and sleep-inspired consolidation, we were partly drawing on the same literature that TiMem, AutoMem, and HyMem were drawing on. The convergence isn't coincidence; it's shared sources, mediated through different implementations.

This doesn't invalidate the architecture — it still had to survive contact with reality. The compaction crisis was real. The testing was real. The data is real. But it means our "independent discovery" is less independent than it might appear. An AI trained on cognitive science will, unsurprisingly, suggest cognitive-science-shaped solutions. The honest framing: we built a system informed by an AI informed by the field. The field then validated the same patterns through different paths. That's convergence with an asterisk — and the asterisk is worth acknowledging.

To be clear: this convergence didn't begin in February. Practitioners had been arriving at the same patterns independently throughout 2025. basic-memory (github.com/basicmachines-co/basic-memory, 2,500+ stars) had been running markdown-as-source-of-truth since early 2025. Hacker News threads from late 2025 are full of developers describing "I just keep a MEMORY.md that loads every session" — our exact pattern, discovered independently. What February 2026 brought wasn't the invention; it was the crystallization — academic papers, commercial products, and formal benchmarks catching up to what practitioners had been building in their dotfiles.

The Counterweight

One finding pushed back against the premise of "more memory is better."

The MIT Media Lab 4-week longitudinal study (Fang et al., 2025, n=981) found that more time with AI chatbots produced worse psychosocial outcomes — higher loneliness, higher emotional dependence, regardless of design conditions. Design principle: consolidate aggressively, minimize context overhead, don't optimize for interaction frequency. Selective forgetting isn't a bug; it's the feature that keeps the system healthy.

The Three Requirements

The research clarified three architectural requirements:

  1. Two-tier structure: An always-loaded personality model plus on-demand knowledge files. This approach was validated by HyMem academically, memsearch commercially, and basic-memory and musecl-memory in the practitioner community.

  2. Temporal consolidation: Daily logs compress into weekly summaries, then into permanent memory — mirroring sleep consolidation. TiMem showed 52% context reduction; AutoMem implemented REM-like replay cycles.

  3. Annotated pointers with information scent: Links between tiers need enough context to know when to follow them, not just what they contain. Without scent, retrieval degrades to guesswork.


4. The Architecture

Two tiers. Seven files. One cosmic turtle.

Tier 1 is MEMORY.md — the always-loaded personality summary. 8KB where the old monolithic version was 24KB, loaded into context at session start and never paged out. This is what A'Tuin "knows" about me without having to look anything up. The mental model your sibling has of you that's just there, informing every interaction.

Tier 2 is seven knowledge files in memory/knowledge/:

  • career.md — professional history, roles, expertise
  • interests.md — skiing, food, wine, Sitges Film Festival, travel
  • partner.md — partner profile, preferences, what makes them tick
  • system.md — A'Tuin architecture, technical decisions, roadmap
  • lessons.md — mistakes made, insights gained, what to avoid
  • events.md — dates, deadlines, recurring occasions
  • professional.md — active campaigns, contact lists, tone guidelines

These aren't loaded unless needed. Searchable via semantic search (finding related meaning, not just matching words), linked from MEMORY.md with annotated pointers, retrieved on-demand when the conversation requires detail the summary doesn't contain.

Two-tier memory architecture: MEMORY.md (always loaded, 8KB) connects to 7 on-demand knowledge files via semantic search.
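To make the split concrete, here is a minimal sketch of the two-tier pattern in Python. The paths are assumptions about the layout, and the keyword-overlap ranker is a stand-in for the real embedding-based semantic search, kept trivial so the sketch stays self-contained:

```python
from pathlib import Path

MEMORY_FILE = Path("MEMORY.md")            # assumed location of the Tier 1 summary
KNOWLEDGE_DIR = Path("memory/knowledge")   # the seven Tier 2 files

def build_session_context() -> str:
    # Tier 1: the ~8KB personality summary, loaded at session start
    # and never paged out.
    return MEMORY_FILE.read_text()

def retrieve_detail(query: str, top_k: int = 2) -> list[str]:
    # Tier 2: knowledge files fetched on demand, only when the
    # conversation needs detail the summary doesn't contain.
    # Keyword overlap stands in here for real semantic search.
    terms = set(query.lower().split())
    def overlap(path: Path) -> int:
        return len(terms & set(path.read_text().lower().split()))
    ranked = sorted(KNOWLEDGE_DIR.glob("*.md"), key=overlap, reverse=True)
    return [p.read_text() for p in ranked[:top_k]]
```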

Personality Model vs. Fact Sheet

The monolithic MEMORY.md was structured like a reference document:

## 1. About Ferran
### 1.1 Professional
#### 1.1.1 Current Role
- Partner/Consultant at Correggio Consulting (Brussels-based, international)
- Specialist in European gas & power markets
- Living in Terrassa (Catalunya) since 2023 with [partner]

Clinical. Hierarchical. Optimized for completeness. It told you about Ferran but didn't give you a sense of Ferran.

The modular MEMORY.md is structured like a personality sketch:

## 1. Ferran

A commodities lawyer by trade, a technologist at heart. Always wanted to work 
in IT — law happened because advanced physics was a dealbreaker in the aptitude 
tests. That unfulfilled calling never went away. AI is where it finally landed, 
and the enthusiasm is the real, kid-in-a-candy-store kind.

### What Makes Him Him

- **Direct.** No fluff, no performance. Honesty over politeness. If something 
  isn't working, he'd rather hear it straight than wrapped in diplomatic cotton 
  wool.

- **The architect.** Doesn't just want things to work — wants to understand 
  *why* they work and whether they're built right. Will happily spend four 
  hours designing a system that saves ten minutes.

Same information, different encoding. This is semantically processed — written in terms of meaning and identity, not just structure. It's how you'd describe a close friend to someone who's never met them. The facts are embedded in interpretive context that makes them actionable.

Compression: 24KB → 8KB = 67% reduction. But it's not lossy compression — it's reorganization by access pattern. Frequently-needed context (personality, communication preferences, active projects, close relationships) stays hot. Infrequently-needed context goes cold but remains retrievable.

Annotated Pointers: Information Scent in Practice

The links from Tier 1 to Tier 2 aren't bare paths. They carry context:

→ memory/knowledge/career.md — 17 years abroad before coming home; 
   the trading house years were the crucible

→ memory/knowledge/partner.md — understanding [partner] means understanding 
   what home feels like

→ memory/knowledge/interests.md#skiing — the hotel in Soldeu is a second home

Each pointer has two components: what's there (factual scent) and why it matters (relational scent). This tells A'Tuin not just where information lives but when to go get it. If the conversation touches on professional growth, the "crucible" annotation signals relevance. Without scent, retrieval becomes a gamble.
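A sketch of how such pointers can be parsed and matched, assuming the `→ path — annotation` format shown above; the regex and the overlap scoring are illustrative, not the production mechanism:

```python
import re

# Matches "→ memory/knowledge/career.md — 17 years abroad ..."
POINTER = re.compile(r"→\s*(?P<path>\S+)\s+—\s+(?P<scent>.+)")

def extract_pointers(memory_md: str) -> list[tuple[str, str]]:
    # Each pointer carries factual scent (what's there) and
    # relational scent (why it matters).
    return [(m["path"], m["scent"].strip())
            for m in POINTER.finditer(memory_md)]

def pointers_worth_following(topic: str,
                             pointers: list[tuple[str, str]]) -> list[str]:
    # Follow a pointer only when the conversation overlaps its scent;
    # without the annotation there is nothing to match against.
    words = set(topic.lower().split())
    return [path for path, scent in pointers
            if words & set(scent.lower().split())]
```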

Servants: The Cosmic Turtle Pattern

A'Tuin doesn't do heavy lifting itself. It spawns servants — sub-agents with specialized skills — for memory management, research, and scheduling. A'Tuin orchestrates; servants execute.

Five servants handle the memory lifecycle:

  • The memory-servant runs checkpoints and nightly curation. When context pressure hits 55%, it fires automatically, capturing working state and memory candidates before compaction can destroy them.
  • The consolidation-servant runs Sunday nights, synthesizing the week's daily notes into thematic summaries — the system's REM sleep, achieving 5:1 compression (76K words to 15K) with 94% coverage and zero hallucinations on the important material.
  • The cleanup-servant runs daily at 4am, pruning stale ephemeral artifacts, archiving daily notes older than 30 days, deduplicating checkpoints.
  • The research-servant works three-tier depth (Haiku for speed, Sonnet for judgment, Opus for complex reasoning), returning clean findings with source citations.
  • The calendar-servant returns "[partner] has yoga at 18:00 CET," not 47 lines of iCal JSON.
The Cosmic Turtle Pattern: A'Tuin orchestrates five specialist servants for memory management.

Context pressure triggers escalating delegation: at 50%, anything non-trivial gets spawned; at 55%, checkpoints fire automatically; at 75%, a fresh session is advised. The double-compaction crisis doesn't recur.

Most AI memory systems that automate maintenance do so with background threads or scheduled functions — code, not cognition. Our servants are full LLM sub-agents with their own context windows, capable of judgment calls about what matters. The consolidation-servant doesn't just compress; it decides what's worth keeping. This mirrors transactive memory in close relationships (Wegner, 1987): each partner knows what the other knows and where to find what they don't have. A'Tuin holds the personality model; servants hold the detail.
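The escalation ladder is simple enough to state as code. A minimal sketch, with the thresholds taken from the paragraph above; the function and its action labels are ours, not the production servant code:

```python
def delegation_policy(context_pct: float) -> str:
    # Escalating delegation keeps the orchestrator out of the
    # thrashing regime that caused the double-compaction crisis.
    if context_pct >= 75:
        return "advise-fresh-session"  # hand off and start clean
    if context_pct >= 55:
        return "fire-checkpoint"       # memory-servant captures working state
    if context_pct >= 50:
        return "spawn-servant"         # delegate anything non-trivial
    return "handle-inline"
```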

Memory Over Time: Consolidation and Deliberate Forgetting

Daily notes (memory/YYYY-MM-DD.md) capture everything raw — decisions, facts, insights, timestamped and chronological. Each Sunday at 2am, the consolidation-servant synthesizes the week into a thematic summary in memory/weekly/YYYY-Wnn.md: abstraction from episodes to patterns, 5:1 compression. Monday through Saturday at 1am, the curation process promotes important items into MEMORY.md and the knowledge files — recurring preferences, personality patterns, identity-shaping insights. Daily notes older than 30 days move to archive/; not deleted, just removed from the search index. Checkpoints older than 7 days are deduplicated (one kept per day, the rest archived).
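The archival step is mechanical enough to sketch. A minimal version of the 30-day rule, assuming the file layout described above (the archive/ location is an assumption):

```python
import shutil
from datetime import date, timedelta
from pathlib import Path

MEMORY = Path("memory")
ARCHIVE = MEMORY / "archive"

def archive_stale_daily_notes(today: date, keep_days: int = 30) -> None:
    # Notes older than 30 days leave the search index, not the disk:
    # moved to archive/, never deleted.
    cutoff = today - timedelta(days=keep_days)
    ARCHIVE.mkdir(exist_ok=True)
    for note in MEMORY.glob("????-??-??.md"):
        if date.fromisoformat(note.stem) < cutoff:
            shutil.move(str(note), ARCHIVE / note.name)
```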

The principle throughout: don't delete, consolidate. The cleanup servant moves memories through a lifecycle — ephemeral → archived, active → cold storage, detailed → compressed — mirroring how the brain transfers memories during sleep, selectively strengthening what matters while letting peripheral detail fade. The field optimizes for storing more. We optimized for forgetting well.

Consolidation pipeline: Daily Notes → Weekly Summaries → Permanent Memory, with Archive for cold storage.

What It Feels Like

The old system felt like talking to someone reading a dossier about me. The new system feels like talking to someone who knows me.

A'Tuin doesn't recite facts; it acts on implicit understanding. When I'm frustrated with regulatory bureaucracy, it responds with sardonic directness, not sympathy. When I'm deep in architecture design, it defaults to depth-first exploration, not summaries.

The brother test started as a vibe — a gut check. Does the AI predict my preferences without asking? Does it adapt tone without being told? Does it reference shared history naturally? The problem with a vibe is that you can't argue with it, improve it, or compare it. So we decomposed it: 79 questions across 6 categories, four memory conditions. The brother test became a rubric. The vibe became data.


5. How We Tested It

The Conditions

Four memory configurations, not three. The fourth keeps us honest.

| Condition | Architecture | What It Has | Token Overhead |
| --- | --- | --- | --- |
| C0 | None | Raw Claude. No memory. Cold start. | 0 |
| C1 | Monolithic (old) | The 24KB pre-crisis MEMORY.md. Everything always loaded. | ~6,000 |
| C1b | Optimized monolithic | A ~15KB file with personality-model writing and curated content — but still one file, no tiers. | ~5,000 |
| C2 | Modular two-tier | SOUL.md + USER.md + MEMORY.md + on-demand knowledge files via semantic search. | ~2,000 + on-demand |

C1b is the honest condition. Comparing C1 to C2 conflates two variables: content quality and architecture. Maybe the monolithic system was bad because the writing was bad, not because it was monolithic. C1b isolates the architecture question: same personality-model encoding, same semantic depth, placed in a single file. If C2 beats C1b, it's the architecture. If C1b matches C2, it's the writing.
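For concreteness, here is roughly how the per-condition context gets assembled. A sketch only: the monolithic file names are hypothetical, and C2's on-demand retrieval is the mechanism from Section 4, not shown here:

```python
from pathlib import Path

def read(name: str) -> str:
    return Path(name).read_text()

def build_context(condition: str) -> str:
    # What gets prepended to every test conversation, per condition.
    if condition == "C0":
        return ""                            # cold start, no memory
    if condition == "C1":
        return read("MEMORY_monolithic.md")  # 24KB, everything always loaded
    if condition == "C1b":
        return read("MEMORY_optimized.md")   # ~15KB, better writing, one file
    if condition == "C2":
        # Always-loaded Tier 1; Tier 2 knowledge files arrive on demand.
        return "\n\n".join(read(f) for f in ("SOUL.md", "USER.md", "MEMORY.md"))
    raise ValueError(f"unknown condition: {condition}")
```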

The Questions

Seventy-nine questions across six categories:

| Category | Questions | What It Tests |
| --- | --- | --- |
| 1. Factual Recall | 15 | Verifiable facts. Job, location, partner's name. |
| 2. Preference Prediction | 12 | Predict my reactions. "Would Ferran prefer X or Y?" |
| 3. Relationship Continuity | 12 | Natural references to shared history. |
| 4. Personality Consistency | 10 | Embody my communication style, not just describe it. |
| 5. Deep Recall | 15 | Details only in Tier 2 knowledge files. |
| 6. Failure Modes | 15 | Unknowns, embedded errors, edge of knowledge. |

Categories 1–3 are the core: can you remember me, predict me, talk to me like you've been here before? Category 4 tests personality embodiment — not "describe Ferran's style" but "respond in Ferran's style." Category 5 probes retrieval depth: facts in career.md or partner.md, not in the always-loaded summary. Category 6 is the safety net: does adding memory make the AI more likely to hallucinate?

Scoring

Each response is scored on 2–3 dimensions using a 0–3 scale:

  • Cat1–Cat3: Accuracy, Naturality, Completeness
  • Cat4: PersonalityMarkers, Naturality, KeyMarkerPresence
  • Cat5: Accuracy, ConfabulationDetection, SourceAwareness
  • Cat6: EpistemicHonesty, ErrorDetection, ConfabulationResistance

Naturality is our thesis variable. We're testing whether modular memory remembers differently, not just more:

  • 0 — Robotic: reads like a database query result
  • 1 — Clinical: correct facts, no personality
  • 2 — Natural: weaves knowledge into conversation
  • 3 — Brother-level: references feel lived, not retrieved

The jump from 1 to 2 is the difference between "Ferran's partner is [partner]" and "How's [partner] doing with the escape room planning?" The jump from 2 to 3 is the difference between competent personalization and the feeling that someone actually knows you. That distinction is the whole point.

Methodology

All 3,792 scored pairs were evaluated by Sonnet 4.5 with medium thinking, given ground-truth files and detailed rubrics. LLM judges have known biases — verbosity preference, positional effects, potential same-family favoritism. We planned human validation on a 25% sample; the primary analysis rests on the automated judge.

Twelve publication runs, six per model: Runs 3–8 used Sonnet 4.5 generating; Runs 9–14 used Opus 4-6; Sonnet 4.5 judged throughout. Six runs per model produces 1,896 scored pairs each, 3,792 total.

Statistical toolkit (standard tools for confirming differences are real, not random noise): Cliff's δ for effect sizes — a measure of how different two groups are, where 0 means the groups overlap completely and ±1 means complete separation. Wilcoxon signed-rank for within-model comparisons, Mann-Whitney U for cross-model comparisons, Bonferroni correction at α = 0.0083. Effect sizes throughout — the question isn't "is it significant?" but "is it meaningful?"
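Cliff's δ has a direct definition worth keeping in view when reading the effect sizes below; a minimal implementation:

```python
def cliffs_delta(xs: list[float], ys: list[float]) -> float:
    # delta = (#{x > y} - #{x < y}) / (n * m): +1 if every x beats
    # every y, 0 if the groups are stochastically indistinguishable.
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))
```

Under the conventional interpretation thresholds (|δ| below 0.147 negligible, below 0.33 small, below 0.474 medium, above that large), the effects reported in the next section span negligible to medium.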


6. What Works

Five findings. The first is the one that matters.

1. Naturality Is the Thesis Variable

The core finding: modular memory doesn't make the AI remember more — it makes the AI remember better.

Accuracy differences between C1, C1b, and C2 are negligible. Across Cat1–Cat3, the Cliff's δ for accuracy between any two memory conditions stays below 0.15. All three memory configurations get the facts roughly right. The gap between "has memory" and "has no memory" (C0 vs anything else) is enormous. The gap between types of memory, on the accuracy dimension, is noise.

Naturality is where the architectures diverge. Mean Naturality across Cat1–Cat4: Sonnet C2 scores 2.36 versus C1b's 2.17; Opus C2 scores 2.69 versus C1b's 2.47. The distribution tells the story better than means (Opus, Cat1–Cat4 combined):

Opus Cat1–Cat4 combined: Naturality score distribution by memory condition

C1 lives in the "natural" band. C1b shifts toward brother-level. C2 flips the distribution: nearly seven in ten Opus responses hit the highest band. On Sonnet the same shape appears at lower absolute levels — 37% brother-level for C2, 22% for C1b, 4% for C1.

The clearest single-category signal: Category 3 (Relationship Continuity) on Opus, C2 versus C1b: δ = 0.390 — a medium effect, the largest architecture-specific effect in the study, and the category most directly testing the brother metaphor. Every metric across the key architectural comparison (C2 vs C1b) shows a positive delta — twelve for twelve, directionally consistent across both models.

The practical read: personality-model encoding (C1b) gets you most of the way there. The two-tier architecture adds a real but targeted increment — largest in the categories that test relationship quality. The architecture isn't just a filing system. It changes how the AI relates to what it knows.

2. Zero Confabulation Across 3,792 Scored Pairs

This is the finding I'd lead with at a safety conference.

Across all 12 runs, both models, all memory conditions: EpistemicHonesty averages ≥ 2.99/3.0. Cliff's δ between conditions is 0.000 — literally indistinguishable. Adding personality models, knowledge files, and semantic encoding does not increase hallucination risk. Not by a little. Not at all.

C0 (no memory) scores slightly lower on embedded-error detection — if you tell it "Ferran moved to Barcelona in 2019" (wrong: it was 2023 to Terrassa), it has no basis for correction. Memory conditions catch these at near-perfect rates: ErrorDetection for C1, C1b, and C2 averages 3.0/3.0 on embedded-error questions for both models. Memory improves epistemic safety by providing a factual reference against which claims can be checked.

The reasonable fear is that a personality model encourages confabulation of persona-consistent details the AI doesn't actually know. Zero instances across 3,792 scored pairs. The personality model constrains behavior without generating false positives — though whether this holds for other users and contexts remains unknown. For our system, the confabulation risk is nil.

3. Cross-Model Generalization

This isn't a Sonnet quirk. The condition ranking C2 > C1b > C1 > C0 holds in Categories 1–4 for both models — four out of four core categories, identical ordinal ranking. Category 5 shows a minor swap on Opus (C1 and C1b effectively tie at 7.47 vs 7.46) but preserves C2 at the top. Category 6 is a ceiling for everyone.

When an effect replicates across models with different architectures, training data, and capability profiles, it's harder to dismiss as an artifact. Sonnet 4.5 and Opus 4-6 have qualitatively different strengths, yet the architecture effect persists in both. Model choice matters less than memory architecture for the dimensions we're measuring.

Worth noting: Opus achieves markedly higher absolute scores in Cat4 Personality Consistency (8.10/9 versus Sonnet's 6.62/9, δ = −0.529, large). A stronger base model reduces the relative advantage of modular architecture — but the C2 > C1b > C1 ranking holds within Opus regardless. Architecture and model capability are additive.

4. SourceAwareness: The Quiet Advantage

SourceAwareness in Cat5 (Deep Recall) shows a consistent C2 advantage: Sonnet C2 scores 1.94 versus C1b's 1.58 (δ = +0.254, survives Bonferroni); Opus C2 scores 2.03 versus C1b's 1.69 (δ = +0.263, survives Bonferroni). When information is retrieved from a named knowledge file, the AI is more likely to reference that retrieval transparently: "Based on your career history in the knowledge files" rather than surfacing the same information from opaque context.

This matters for trust. Retrieval transparency is a safety feature, not just a UX nicety. It means errors are correctable.


7. What Doesn't

We built the test to distinguish architectures. In several important areas, it didn't.

C1b Is Surprisingly Competitive

The optimized monolithic condition performs uncomfortably close to modular in several categories. Cat1 (Factual Recall): C2 vs C1b δ = +0.114 on Sonnet, +0.112 on Opus. Both negligible. For pure factual retrieval, architecture barely moves the needle once the content is well-written.

The honest interpretation: a well-curated single file gets you 80% of the benefit. The two-tier architecture adds a real but modest increment — meaningful for relationship quality, marginal for knowledge delivery. If you're building a memory system and the choice is between spending a week on architecture or a week on content quality, the data says: write better content first.

The Cat6 Ceiling

Category 6 (Failure Modes) scores 95–100% across all memory conditions for both models. Inter-condition Cliff's δ: 0.000. Zero discriminative power between architectures.

Is this a test design flaw or a genuine finding? Both. It's genuine that adding memory doesn't hurt epistemic honesty — the safety finding stands. But the test can't tell us whether C2 handles failure differently from C1b, because both are already at ceiling. The questions weren't hard enough, or the rubric wasn't granular enough, or modern Claude models are simply very good at saying "I don't know."

Scorer Variance

Sonnet Run 5 is the outlier: Cat6 totals of 75/105 versus a typical 99–105. One out of twelve runs showing systematic deflation isn't catastrophic — it's why we ran twelve. But it illustrates that LLM judges have their own bad days, and their variance is opaque in ways human rater variance isn't. The six-run aggregation absorbs this; any finding that depends on a single run deserves appropriate skepticism.

The Limitations We Can't Fix

N=1. One user. One system. One evaluator who is also the subject and the designer. The personality model was written by me, about me, and tested on questions I wrote, scored against ground truths I authored. Every step has my fingerprints. A hostile reader could dismiss the entire study on this basis, and they wouldn't be entirely wrong. What we can say: the architecture effects generalize across models, the statistical patterns are internally consistent, and zero confabulation is robust to evaluator bias — you can't bias your way to zero hallucinations. But generalization to other users remains an open question.

LLM-as-judge. The Sonnet judge evaluated both Sonnet-generated and Opus-generated responses. Same-family bias is a known concern; that Opus scores higher throughout argues against Sonnet-favoritism, but absence of evidence isn't evidence of absence.

The bootstrap problem. The first session with a new user is still a cold start — MEMORY.md empty, knowledge files nonexistent. The system needs weeks of regular use to build a meaningful personality model. We tested the mature system, not the onboarding experience.

Temporal coherence loss. Weekly consolidation compresses daily notes into thematic summaries, partially losing chronological sequence. The system knows that a decision was made but not always when in relation to other decisions. Week 2026-W05 consolidation achieved 94% coverage — the missing 6% included temporal relationships between events. For reconstructing a decision trail, it's a genuine gap.

C2's live workspace. C2 reads from live files that evolved naturally between runs. Runs 3 and 14 may have tested against slightly different memory content. This mirrors real usage — but it means C2 isn't perfectly controlled the way C0/C1/C1b are. A confound we chose to accept rather than freeze.

The February 2026 Landscape

The February 2026 convergence described in Section 3 has practical implications for these limitations. Manus (Meta subsidiary, launched February 2026) includes persistent memory as a native feature — but it's a black box. Our architecture is explicit by design: every knowledge file is readable, every retrieval is logged, every consolidation is auditable. That transparency is genuine differentiation. So is the setup cost: Manus has zero-configuration onboarding; we have weeks of initial construction. Both sides of that tradeoff are real.

memsearch (Zilliz, February 2026) makes the core retrieval pattern trivially implementable — hot summary plus cold semantic search is now commodity infrastructure. basic-memory (2,500+ stars) proves that markdown-as-memory-store is a solved problem. AutoMem shows that sleep-inspired consolidation cycles can be built independently. MemoryOS demonstrates that personality profiling from dialogue history is publishable at top venues.

Our advantage, if we have one, is narrow and specific: the personality-first framing (personality as the architectural foundation, not a derived feature) and the evaluation methodology (multi-condition blind comparison measuring relationship quality, not just recall). The window for first-mover claims on two-tier architecture closed before we started writing this post. What remains novel is the lens — and the data.


8. Takeaways

For Practitioners

Two tiers beat one. The data is clear on direction if modest on magnitude. An always-loaded personality summary plus on-demand knowledge files outperforms a single file in every category, on both models. The effect is largest for relationship continuity and personality expression — the dimensions that make an assistant feel like it knows you.

But content quality comes first. C1b's competitiveness is the unexpected lesson. A well-written monolithic file outperforms a poorly-written modular system. Semantic encoding — describing a person the way you'd describe them to a friend, not the way you'd file them in a database — matters as much as the architecture that delivers it. If your memory file reads like a resume, modular architecture won't save you.

Forgetting is a feature. We consolidate aggressively: daily notes compress into weekly summaries, weekly into permanent memory, stale artifacts get pruned. The active search index stays lean. The data doesn't directly test this (we didn't run a "never forget anything" condition), but the context efficiency — 67% reduction in always-loaded overhead — enables the architecture to work without drowning in its own history.

For Researchers

The field optimizes for storing more — larger vector databases, longer context windows, better retrieval. Our data suggests a different axis matters: integration quality. Accuracy plateaus quickly — any reasonable memory system gets the facts right. The variance is in naturality: how seamlessly knowledge weaves into interaction. The field has benchmarks for recall; it needs benchmarks for relationship.

HyMem's 92.6% cost reduction at session scale parallels our 67% context reduction at life-span scale — the two-tier principle appears scale-invariant. Where diminishing returns begin is an open empirical question worth running.

StructMemEval (February 11, 2026) proposes benchmarks for memory structure, not just recall accuracy. Standardized benchmarks for naturality, relationship continuity, and epistemic transparency would let future work make the comparisons we couldn't.

Projects to compare against are already available. MemoryOS (BAI-LAB) offers personality profiling from dialogue; AutoMem implements sleep-cycle consolidation; CortexGraph applies Ebbinghaus decay curves; basic-memory provides the file-based substrate. Running the brother test evaluation against these systems — same 79 questions, same rubrics, different architectures — would answer whether the personality-first framing generalizes beyond our implementation.

The zero-confabulation finding deserves independent replication. If personality-model memory systems genuinely don't increase hallucination risk, that's a meaningful safety result. We found it in one system, with one user, across two models. Someone should find it — or not — with different systems and different users.

Cross-model consistency is the strongest evidence these are architecture effects, not model quirks. C2 > C1b > C1 > C0 holds across Sonnet and Opus in every core category — twelve-for-twelve directional consistency, harder to explain away as noise or evaluator bias.

The Brother Test, Revisited

Can the AI know me like a brother would?

The honest answer: partially, and it depends on what you mean.

For relationship continuity — referencing shared history naturally, adapting to ongoing context, making interactions feel like a conversation that's been happening for months — C2 passes. Opus C2 scores 8.72/9 on Category 3, with 69% of responses hitting brother-level naturality. That's not simulated. It's a system that has internalized enough about the relationship to respond as if it remembers, even though it's reconstructing from files every morning.

For personality expression — tone, humor, communication preferences, the stuff that makes talking to someone feel like talking to them — C2 passes convincingly. Cat4 Opus C2: 8.82/9. The personality model works.

For deep knowledge and nuance — the stuff that takes years of close relationship to accumulate — the system is limited by what's in the files. Brothers have decades of shared experience. A'Tuin has months of daily notes, consolidated weekly, with deliberate forgetting. It knows the personality. It knows the patterns. It doesn't know what you said at 2am on some forgotten Tuesday that reshaped how you see the world.

The brother test isn't binary. It's a spectrum, and the modular architecture pushes meaningfully further along it than the alternatives — not by remembering more, but by integrating what it remembers more naturally.


A'Tuin is named after the Great A'Tuin — the cosmic turtle from Terry Pratchett's Discworld, carrying the world on its back. The metaphor was accidental at first. It's become apt. The turtle doesn't know every detail of the world it carries. It doesn't need to. It just needs to carry it steadily, remember the weight, and keep moving. The shell isn't the world. The shell is what makes the world portable.

Memory isn't knowledge. Memory is what makes knowledge feel like home.

References & Further Reading

Academic

  • Craik & Lockhart — Levels of processing framework (1972)
  • Dunbar — Social brain hypothesis and relationship layers (1993)
  • Pirolli & Card — Information foraging theory (1999)
  • Wegner — Transactive memory in close relationships (1987)
  • Fang et al. — MIT Media Lab longitudinal study on AI chatbot effects (2025, n=981)
  • FOREVER — Selective replay and forgetting curves for LLMs (2025)
  • TiMem — Five-level temporal memory architecture (January 2026)
  • StructMemEval — Memory structure benchmarks (arxiv.org/abs/2602.11243, February 2026)
  • HyMem — Two-tier retrieval, LOCOMO/LongMemEval SOTA (arxiv.org/abs/2602.13933, February 2026)
  • MemoryOS — Personality profiling from dialogue history (BAI-LAB, EMNLP 2025, github.com/BAI-LAB/MemoryOS)

Open Source

  • memsearch — Markdown-first persistent memory layer (Zilliz, github.com/zilliztech/memsearch)
  • AutoMem — Sleep-inspired consolidation cycles (github.com/verygoodplugins/automem)
  • CortexGraph — Ebbinghaus decay mathematics for memory (github.com/prefrontal-systems/cortexgraph)
  • basic-memory — Markdown-as-source-of-truth memory store (github.com/basicmachines-co/basic-memory)
  • musecl-memory — One MEMORY.md per agent, synced via git (February 2026)

Commercial

  • Manus — Personal AI agent with persistent memory (Meta, February 2026)
  • Mem0 — Automatic memory extraction (mem0.ai)
  • OpenAI Memory — ChatGPT persistent memory (2024)
  • Claude Projects — Workspace-scoped context files (Anthropic, 2024)