AI Development

Why Your AI App Forgets Everything

LLM APIs are stateless by design — every session starts from zero. Here's why that kills retention, and the four architectural approaches for fixing it.

Why Your AI App Forgets Everything

LLM APIs are stateless. Every call is independent. When your user closes the tab and comes back tomorrow, your app has no idea who they are.

That wasn't a problem in 2022. It is now. ChatGPT launched persistent memory in February 2024, and the moment it did, every other AI app inherited a user expectation it didn't ask for. Users have felt what it's like when an AI actually remembers them. They want that from you too.

This article explains why AI apps forget, what the costs are, and the four approaches for giving your app a working memory — from quick-and-dirty context replay to production-grade knowledge graphs.

The Problem Is Architectural, Not Incidental

When you call the Anthropic or OpenAI API, you pass in a messages array. The API processes it and returns a response. Nothing is stored. The next call starts clean.

This is intentional design, not an oversight. Stateless APIs are simpler, cheaper to run, and easier to reason about. But it means every scrap of context your users share — their name, their preferences, their project details, the thing they told you about last week — evaporates the moment the session ends.

The gap between what users experience with ChatGPT and what they get from most AI apps is real. Users who've used memory-enabled assistants daily no longer question it — they expect it. When your app greets them like a stranger on their tenth visit, that's a retention event.

The data backs this up. AI-driven personalization increases user retention by up to 400%, and businesses that lean into it see 28% lower churn (Envive.ai, 2026). The economics are lopsided: the cost of implementing memory is one-time engineering; the cost of not implementing it compounds every session.

Three Types of Memory Your App Is Missing

Not all "memory" is the same. Before choosing an architecture, get clear on what your app actually needs to remember.

1. Preferences and facts

The most obvious category: things the user tells you about themselves. Their name, role, how they like responses formatted, what stack they use, what they've said they don't want.

These are structured and stable. "User prefers TypeScript over Python" doesn't change often. The retrieval pattern is simple: look up facts for this user ID before composing the prompt.

2. Conversation history

What was discussed, when, and in what order. Not just the raw log — but enough context to answer "what were we working on last Tuesday?"

History is high-volume and partially stale. Replaying 50 past conversations in every prompt is both expensive and counterproductive. The retrieval pattern here is selective: embed past turns, then retrieve the k most semantically relevant chunks for the current query.

3. Knowledge and context

The richest category: entities your app has built up understanding of over time — documents, projects, code, people, evolving facts. A user's codebase. Their customer list. Notes they've taken across dozens of sessions.

This requires structured retrieval, not just text search. The retrieval pattern is graph traversal or hybrid search (semantic + keyword + graph edges), because you need to understand relationships between entities, not just find the closest embedding match.

Four Approaches to Fix It

1. Conversation replay / context buffers

The simplest approach: store the raw message history and prepend some of it to every request.

You already have a database. You add a messages table, log every turn, and inject the recent history into the messages array before calling the LLM. LangChain's ConversationBufferMemory is the canonical implementation of this pattern (LangChain docs).

What works: Zero infrastructure lift. You're writing to the same Postgres instance you already have. The model has full conversational continuity within the replayed window.

What breaks: History grows unboundedly. A user with 10 months of conversations can't have all of it replayed — you'll blow the context window and your token budget. Conversation buffer memory is great for single-session continuity but collapses under long-term use. Token costs scale linearly with context length, and even with prompt caching (Anthropic and OpenAI both offer ~90% discounts on cached input tokens), each turn still costs more as history accumulates.

The practical fix is hierarchical summarization: keep the last N turns verbatim, summarize older ones when you approach ~70% of the context window (Mem0 blog, October 2025). It works, but adds latency and a summarization call to every session.

Best for: Single-session assistants, short-lived tools, apps where users don't return repeatedly.

Complexity: Low. Cost: Low initially, scales badly.


2. Managed memory services

A growing category of purpose-built infrastructure that handles memory extraction, storage, and retrieval for you.

The leading options in 2026:

  • Mem0 (~48,000 GitHub stars, $24M funded) — Framework-agnostic SDK that gives agents three-tier memory: user, session, and agent scopes, backed by a hybrid store combining vectors, graph relationships, and key-value lookups. Claims 80-90% reduction in token costs versus full context replay, with a 26% improvement in response quality.
  • Zep — Structures memory as a Temporal Knowledge Graph, tracking how facts change over time. Best for agents where relationships and evolving facts matter. Achieves P95 retrieval latency of 300ms (Zep arXiv, January 2025).
  • Letta (formerly MemGPT, UC Berkeley) — Implements an OS-inspired model where the agent itself manages what to page in and out of its context window. Gives the model fine-grained memory control, at the cost of more complex setup.

What works: You get production-grade memory without building the extraction, deduplication, and retrieval pipeline yourself. These services handle the hard parts — identifying what's worth remembering, resolving conflicts, evicting stale facts.

What breaks: You're adding a third-party dependency into your critical path. Retrieval latency adds ~100-300ms to every response. And you're giving a vendor access to your users' conversational data, which has privacy implications.

Best for: Apps where memory is core to the value proposition — AI companions, coding assistants, productivity tools, anything with repeat daily users.

Complexity: Medium. Cost: $20-200/month for typical indie workloads; scales with users.


3. Knowledge-graph memory

Instead of storing raw text, you extract entities and relationships and build a graph. "User works at Acme Corp" becomes a node (User) connected by an edge (works_at) to another node (Acme Corp). When you need context, you traverse the graph rather than doing vector search.

Graphiti is the open-source implementation at Zep's core. It builds temporally-aware context graphs — tracking not just what's true, but when it became true and when it changed. Neo4j published a deep-dive on context graphs for AI agents in early 2025 (Neo4j blog).

What works: Graph traversal handles complex multi-hop relationships that vector search misses. "What customers has this user sold to who also use competitor X?" is answerable with graph traversal; pure vector search can't express that query. Retrieval is also highly precise — you're querying structured data, not guessing from semantic similarity.

What breaks: The setup complexity is real. You need to run a graph database (Neo4j, or Zep's managed service), define your schema, and build the extraction pipeline that turns conversation text into entities and edges. LLM-based extraction is where hallucination risk lives — an extraction model that invents relationships corrupts your graph.

Best for: Long-running agents with complex domains — CRM-adjacent tools, project management, research assistants, anything with rich entity relationships.

Complexity: High. Cost: Self-hosted graph DB is cheap; the LLM extraction calls add up.


4. DIY with your existing database

If you're already on Postgres, you can add pgvector and implement your own memory layer without any new infrastructure.

Supabase ships pgvector by default and has solid docs for embedding and retrieving conversation history (Supabase AI docs). The pattern: embed each conversation turn, store it in a memories table with user_id, then retrieve the k nearest neighbors by embedding similarity before composing the prompt.

// Store a memory
const embedding = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: messageText,
})

await supabase.from('memories').insert({
  user_id: userId,
  content: messageText,
  embedding: embedding.data[0].embedding,
})

// Retrieve relevant memories
const { data: memories } = await supabase.rpc('match_memories', {
  query_embedding: queryEmbedding,
  match_user_id: userId,
  match_count: 5,
})

What works: No new vendors. Full data control. The cost profile is predictable — you pay for embedding calls and Postgres storage, both of which are cheap.

What breaks: You're building the extraction and deduplication logic yourself. "What should be stored as a memory versus what's noise?" is a harder problem than it looks. Naively storing every turn produces a lot of low-signal memories that degrade retrieval quality. You'll also need to think about eviction — what happens when a user has 10,000 stored memories.

Best for: Apps where you want full data ownership, already use Supabase/Postgres, and have low-to-medium memory complexity needs.

Complexity: Medium-high (the implementation is simple; the tuning is not). Cost: Low.

The Comparison Table

Approach Complexity Cost / 1K MAU Data Control Retrieval Latency Best For
Context replay Low $ (spikes with usage) Full 0ms Short sessions, MVPs
Managed service (Mem0, Zep) Medium $$–$$$ Shared 100–300ms Repeat-user apps
Knowledge graph High $$$ Full or shared 100–500ms Complex entity relationships
DIY pgvector Medium-High $ Full 50–200ms Data-sensitive apps, Supabase users

When NOT to Add Memory

Memory adds complexity. Don't add it when:

  • The tool is single-shot. A one-time code converter or document formatter has no repeat user context to benefit from. Memory adds latency and cost for zero value.
  • Sessions don't compound. If each session is independent by design — think a quiz app or a daily-word generator — there's nothing to carry forward.
  • You have compliance constraints. HIPAA, GDPR, and similar frameworks impose strict rules on how personal data can be stored and for how long. "Store everything the user says" may require legal review before you ship it.
  • Your users aren't returning yet. Memory optimizes for repeat engagement. If you haven't validated that users come back, don't invest in infrastructure for a behavior you haven't observed.

How to Decide

Before picking an architecture, answer three questions:

1. How often do users return? Daily users need memory. Weekly users probably do. One-time users don't. If your analytics show D7 retention below 20%, fix the core product before adding memory.

2. What kind of context decays fastest? Conversation history goes stale quickly — what you discussed last week may be irrelevant today. User preferences are durable. Structured entity relationships (projects, contacts, goals) are somewhere in between. Pick an approach that matches the decay profile of your dominant memory type.

3. Do you own the data or can you share it? Managed services are faster to ship but mean your users' conversational data lives on a vendor's infrastructure. For consumer apps with sensitive use cases, DIY pgvector or a self-hosted Graphiti instance keeps you in control.

If you answered "daily users, mostly preferences and history, data control matters" — start with DIY pgvector and the extraction logic from Mem0's open-source library. If you answered "daily users, complex entity relationships, moving fast" — Zep or Mem0 managed will get you there in a day.


This is Article 1 of 5 in the "Make Your AI App Remember" series.

Related: AI Agent Memory Systems · Vector Databases · Building AI-Powered Features

Ready to build?

Go from idea to launched product in a week with AI-assisted development.