In the few hours a day I’m not spending on building Collabchek, I’ve been hacking on a personal AI chat client. I’ve used several, and the one that’s come closest to what I want has been sigoden’s very cool aichat project, which I encourage you to check out. I’ve liberally borrowed several good ideas from their code.

As I’ve explored, I’ve come to realize my requirements are somewhat unusual.

  1. I want to interact with a model in a single session which lasts weeks, if not months.
  2. I don’t want to spend any time helping the model discover necessary context. This is especially true when I reference a topic discussed days before.
  3. The environment must be heavily scriptable. I don’t mean an init script that runs at startup. I mean Emacs/Vim levels of scripting and customization.
  4. The environment should use resources lightly, especially when idle. I want an environment I can leave running in a tmux session and jump to it whenever I like.

tl;dr I want an LLM chat client that sips resources, tightly integrates a scripting language, and provides sophisticated context management. Turns out these are hard to find, if they exist at all.

So I built my own but this post isn’t about it. I’m sorry :(

This post is about the memory/context management system I built for it which I’m releasing under the Apache 2.0 license.

Managing long-term context

The core problem is simple: model context windows are finite but conversations aren’t. Even with today’s larger windows you eventually hit the limit. And even if you don’t, stuffing every message into the context is wasteful and slow. You need to be selective about what goes in.

My solution is memorit, a semantic memory system written in Go. It stores chat records and retrieves them based on meaning rather than just keywords. The idea is to find contextually relevant messages from days or weeks ago without the user having to remember exact phrases.


"We began to recognize in them a strange obsession. After all, they are emotionally inexperienced, with only a few years in which to store up the experiences which you and I take for granted. If we gift them with a past, we create a cushion or a pillow for their emotions, and consequently, we can control them better."
— Dr. Eldon Tyrell, Blade Runner

How it works

When a message comes in, memorit does two things asynchronously: it generates a vector embedding of the text and it extracts semantic concepts. Concepts are typed entities like (person, Alice) or (place, Seattle) with an importance score. Both operations run in worker pools so ingestion stays fast.
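
For the curious, the fan-out is nothing exotic. Here’s a generic sketch of the pattern (not memorit’s actual pipeline code); the embed and extract parameters stand in for the real model calls:

// Each worker pulls a message off the channel and runs the embedding call
// and the concept-extraction call concurrently. Requires "sync".
func ingestWorkers(in <-chan string, workers int, embed, extract func(text string)) {
    var wg sync.WaitGroup
    for i := 0; i < workers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for text := range in {
                var step sync.WaitGroup
                step.Add(2)
                go func() { defer step.Done(); embed(text) }()   // vector embedding
                go func() { defer step.Done(); extract(text) }() // typed concepts
                step.Wait() // both finish before the next message
            }
        }()
    }
    wg.Wait()
}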

The unique twist I added: concepts get their own embeddings too. When memorit extracts a concept it checks if that concept already exists in the database. If not, it creates a new one and generates an embedding for it. This means you can search for concepts semantically, not just by exact name match.
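
In rough Go, the get-or-create step looks something like this. The names are made up for illustration and aren’t memorit’s actual internals:

// A concept is a typed entity; the first time one shows up it gets its own
// embedding so later searches can match it by meaning, not just exact name.
type Concept struct {
    Kind       string // e.g. "person", "place"
    Name       string // e.g. "Alice", "Seattle"
    Importance float64
    Embedding  []float32
}

func getOrCreateConcept(ctx context.Context, known map[string]*Concept, kind, name string,
    embed func(context.Context, string) ([]float32, error)) (*Concept, error) {

    key := kind + ":" + name
    if c, ok := known[key]; ok {
        return c, nil // seen before, reuse it and its embedding
    }
    vec, err := embed(ctx, kind+": "+name) // embed the concept itself
    if err != nil {
        return nil, err
    }
    c := &Concept{Kind: kind, Name: name, Embedding: vec}
    known[key] = c
    return c, nil
}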

Say you have messages tagged with concepts like (place, Seattle), (place, Portland), and (place, San Francisco). A search for “Pacific Northwest cities” won’t match any of those strings literally but the query’s embedding will be similar to Seattle’s and Portland’s embeddings. The search finds them anyway.

Search combines three signals: vector similarity on messages, vector similarity on concepts, and keyword matching. A query like “that conversation about Alice’s trip” finds messages semantically similar to the query, messages tagged with semantically similar concepts, and messages containing matching words. Results are scored and ranked with boosts for multiple signal matches.

db, _ := memorit.NewDatabase("./memory.db", memorit.WithAIProvider(provider))
pipeline, _ := db.NewIngestionPipeline()

// Ingest a message
pipeline.Ingest(ctx, core.SpeakerTypeHuman, 
    []string{"Alice mentioned she's moving to Seattle next month"})

// Later, search for it
searcher, _ := db.NewSearcher()
results, _ := searcher.FindSimilar(ctx, "Alice relocating", 5)
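
For a sense of how those three signals might be folded into one ranking, here’s an illustrative sketch. The weights and field names are mine, not what memorit actually ships:

// Each candidate carries how strongly each signal matched; results that hit
// on more than one signal get a boost before the final sort. Requires "sort".
type hit struct {
    MessageID  string
    MsgSim     float64 // vector similarity of the message text
    ConceptSim float64 // best similarity among its attached concepts
    Keyword    bool    // did a keyword match fire?
}

func score(h hit) float64 {
    s := 0.6*h.MsgSim + 0.4*h.ConceptSim
    signals := 0
    if h.MsgSim > 0 {
        signals++
    }
    if h.ConceptSim > 0 {
        signals++
    }
    if h.Keyword {
        s += 0.1
        signals++
    }
    if signals > 1 {
        s *= 1 + 0.15*float64(signals-1) // boost when multiple signals agree
    }
    return s
}

func rank(hits []hit) {
    sort.Slice(hits, func(i, j int) bool { return score(hits[i]) > score(hits[j]) })
}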

The dual-embedding approach costs a bit more compute upfront but it makes retrieval much more flexible. You’re not locked into whatever exact phrasing happened to be used when the concept was first mentioned.

Memorit uses an OpenAI-compatible API abstraction so it works with Ollama, LocalAI, vLLM, or any other local inference server. I run it against models on my own hardware. As I wrote in a previous post, owning your infrastructure opens up flexibility cloud providers can’t match. No rate limits, no API costs, and you can swap models whenever you want.
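
To make “OpenAI-compatible abstraction” concrete, the idea boils down to something like the following. The interface, constructor, and model names here are placeholders rather than memorit’s real types, but Ollama really does expose an OpenAI-compatible endpoint under /v1:

// Anything that can embed text and complete a prompt over an OpenAI-style
// HTTP API can sit behind the provider, so switching servers is configuration.
type AIProvider interface {
    Embed(ctx context.Context, text string) ([]float32, error)
    Complete(ctx context.Context, prompt string) (string, error)
}

// Hypothetical wiring against a local Ollama instance; the constructor and
// model names are placeholders for whatever your setup uses.
func localProvider() AIProvider {
    return newOpenAICompatProvider(
        "http://localhost:11434/v1", // Ollama's OpenAI-compatible base URL
        "llama3.2",                  // chat / extraction model
        "nomic-embed-text",          // embedding model, swap for your own
    )
}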

Storage and recovery

Everything persists to BadgerDB, an embedded key-value store. No external database to manage. The ingestion pipeline checkpoints its progress so if your process crashes mid-batch it picks up where it left off on restart.
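
The checkpointing is nothing fancy. Here’s a generic version of the pattern using Badger; the key name is made up, but the mechanics are the same idea:

// Progress lives in the same store as the data: write the last processed
// position after each batch, read it back on startup, resume from there.
// Uses github.com/dgraph-io/badger/v4 and encoding/binary.
var checkpointKey = []byte("ingest/checkpoint")

func saveCheckpoint(db *badger.DB, lastProcessed uint64) error {
    return db.Update(func(txn *badger.Txn) error {
        buf := make([]byte, 8)
        binary.BigEndian.PutUint64(buf, lastProcessed)
        return txn.Set(checkpointKey, buf)
    })
}

func loadCheckpoint(db *badger.DB) (uint64, error) {
    var n uint64
    err := db.View(func(txn *badger.Txn) error {
        item, err := txn.Get(checkpointKey)
        if errors.Is(err, badger.ErrKeyNotFound) {
            return nil // nothing ingested yet, start from zero
        }
        if err != nil {
            return err
        }
        return item.Value(func(v []byte) error {
            n = binary.BigEndian.Uint64(v)
            return nil
        })
    })
    return n, err
}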

Limitations

Vector search does a full scan. This is fine for personal use: my scaling target is tens of thousands of chat messages, so naive scanning is still fast. The design won’t scale to millions of records without an actual vector index; I’m looking at hnsw-go as a future optimization. (A sketch of what that scan amounts to is below.)

Concept extraction also depends on your model’s quality. Smaller models sometimes miss nuances or miscategorize things. For my use case Llama 3.2 works well enough.
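
Here’s the scan in sketch form, a generic version rather than memorit’s code:

// Cosine similarity against every stored vector, keep the best match.
// Assumes equal-length vectors; requires "math".
func cosine(a, b []float32) float64 {
    var dot, na, nb float64
    for i := range a {
        dot += float64(a[i]) * float64(b[i])
        na += float64(a[i]) * float64(a[i])
        nb += float64(b[i]) * float64(b[i])
    }
    if na == 0 || nb == 0 {
        return 0
    }
    return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func bestMatch(query []float32, stored map[string][]float32) (id string, best float64) {
    for k, v := range stored {
        if s := cosine(query, v); s > best {
            id, best = k, s
        }
    }
    return
}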

The code is available at github.com/poiesic/memorit under the Apache 2.0 license.