How a Self-Improving Retrieval System Enhances Conversational Memory

May 20, 2026

Earlier, I published a post about porting the immune system’s germinal center mechanism to LLM memory. Three arms, two datasets, three mutation strategies, one consistent finding: the biological control loop (adaptive rate, tier lifecycle, decay) is sound engineering, but Gaussian perturbation of pretrained embeddings cannot improve retrieval. The adaptive rate correctly identifies which entries to mutate and how much. It just can’t produce a perturbation that helps.

The one thing that worked in that phase was not biological. Cross-encoder reranking on top of bi-encoder retrieval gave a 63% NDCG lift on LongMemEval.

I said two things would need to change: the mutation needed to be semantically informed, and the fitness signal needed to come from outside the embedding space. So I built both. Then I ran another fourteen experiments. The learned MLP adapter produces the right direction. The segmentation mutation finds the wrong granularity. A simple recall diagnostic reframed the whole project. What shipped wasn’t biological at all, but it did learn from use.

Part 1: Fixing the mutation

Two new mutation types, both designed to address the specific failures from the original experiment.

Learned MLP adapter. Instead of blind Gaussian noise, a tiny neural network predicts the perturbation: delta = f(query, embedding, xenc_score). The architecture is 769 inputs (384 + 384 + 1 cross-encoder score), 128 hidden with ReLU, 384 output. About 148,000 parameters.

It trains online during the experiment loop using a differentiable loss:

loss = -(sigmoid(xenc_score) - 0.5) * cos(normalize(embedding + delta), query)

When the cross-encoder says the entry is relevant (high score), the loss drives the delta toward the query. When it says irrelevant, away. The cross-encoder breaks the circularity because it scores raw text pairs, not embeddings. The MLP is shared across all entries and learns the general rule for how embeddings should move.
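To make the loss concrete, here is a minimal pure-Python sketch of the formula above plus the parameter count for the stated layer sizes. It is an illustration, not the training code: the function names and the toy two-dimensional vectors are assumptions, and the real adapter works on 384-dimensional embeddings with gradients flowing through the MLP.

```python
import math

def mlp_param_count(n_in=769, n_hidden=128, n_out=384):
    """Weights plus biases for a two-layer MLP (input -> hidden -> output)."""
    return (n_in * n_hidden + n_hidden) + (n_hidden * n_out + n_out)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def adapter_loss(embedding, delta, query, xenc_score):
    """loss = -(sigmoid(xenc_score) - 0.5) * cos(normalize(embedding + delta), query)."""
    weight = 1.0 / (1.0 + math.exp(-xenc_score)) - 0.5
    moved = [e + d for e, d in zip(embedding, delta)]
    # Cosine is scale-invariant, so normalizing `moved` first would not change the value.
    return -weight * cosine(moved, query)

# Toy vectors (hypothetical): one delta nudges the entry toward the query, one away.
query = [1.0, 0.0]
embedding = [0.6, 0.8]
toward = [0.1, 0.0]
away = [-0.1, 0.0]

print(mlp_param_count())
print(adapter_loss(embedding, toward, query, xenc_score=4.0))
print(adapter_loss(embedding, away, query, xenc_score=4.0))
```

With a high cross-encoder score the "toward" delta gets the lower loss; with a negative score the sign of the weight flips and the "away" delta wins, which is the behavior described above.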

Segmentation mutation. Instead of changing embedding vectors, change the text itself. Split long entries with low retrieval affinity into individual sentences. Merge adjacent short entries that are always co-retrieved. This is the closest analog to how B cells find optimal epitope binding: an antibody that binds a smaller, more specific surface often has higher affinity than one that broadly but weakly binds a large area. In embedding terms, a sentence that directly answers a question embeds more precisely than a paragraph that contains the answer somewhere in the middle.

Both mutation types are governed by the same GC control loop: adaptive rate, tier lifecycle, time decay, circuit breakers.

What the MLP adapter did

The MLP trained online for 14,846 steps on NFCorpus and produced deltas with average loss of -0.081 (negative loss means coherent training signal). On LongMemEval it ran 1,447 GC-tier entries through mutation. NDCG moved from 0.2217 to 0.2218.

The problem is mechanical, not conceptual. The MLP produces deltas of norm ~0.025. After adding to the adapter and clipping to max_adapter_norm=0.5, the effective embedding shifts by about 0.0005 in cosine distance. FAISS retrieves the top-30 candidates using inner product search over 200,000 entries. A cosine shift of 0.0005 doesn’t change which entries land in the top-30. The cross-encoder reranks those 30 candidates, and the mutation is invisible.

For the MLP to matter, it would need to produce deltas large enough to change FAISS rankings. On a 200k-entry index, that means moving an entry from position 50 to position 25 or better. The cosine distances between adjacent entries at that scale are 0.001 to 0.01. The MLP’s deltas are in the right ballpark, but the selection threshold and adapter norm clip eat most of the magnitude.

The pipeline is wired correctly. The direction is right. The magnitude is wrong.

What segmentation did

Segmentation was the more ambitious idea and the clearer failure. On NFCorpus, splitting medical abstracts into individual sentences destroyed retrieval quality: NDCG dropped from 0.334 to 0.179, a 46% collapse. The corpus grew from 3,633 entries to 8,674 as sentences proliferated. On LongMemEval the damage was smaller (0.222 to 0.194) because conversation turns are already more self-contained than medical abstracts.

The failure is straightforward. “Results showed statistical significance at p < 0.01” means nothing without the preceding methods section. A conversation turn “I prefer the blue one” means nothing without the turn that established what “one” refers to. Splitting destroys coreference and coherence. The embedding model encodes these fragments faithfully, but what it encodes is a decontextualized string that doesn’t match any real query.

The biological analogy was appealing: B cells find optimal binding by narrowing their target. But the analogy breaks down because text isn’t a protein surface. Proteins have local binding sites that function independently. Text has long-range dependencies that don’t survive fragmentation.

Merging, the other half of segmentation, barely triggered. It requires two adjacent entries to both have high affinity and be co-retrieved in the same query. On a 200k-entry corpus with k=10 retrieval, the probability of two adjacent turns both appearing in the same top-10 is low.

Part 2: The diagnostic that changed the project

After the MLP adapter landed at roughly +0.05% (NDCG 0.2217 to 0.2218), I stopped tuning and ran a recall diagnostic:

Bi-encoder recall@30:   29.0%
Bi-encoder recall@100:  43.8%
Bi-encoder recall@500:  63.6%
Oracle recall:         100%

k_fetch=30  + xenc:  NDCG=0.2217
k_fetch=100 + xenc:  NDCG=0.2697  (+21.6%)
Oracle      + xenc:  NDCG=0.4709  (+112%)

71% of relevant entries were never shown to the cross-encoder (bi-encoder recall@30 is 29.0%). The mutation work had been optimizing the wrong layer. The cross-encoder could reach 0.47 NDCG, double the measured result, if it saw the right candidates. The bi-encoder was the ceiling.

This was the most useful hour of the project. Nothing I tried afterward beat the 0.47 oracle, but anything that moved bi-encoder recall toward oracle was unblocked gain.
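The diagnostic itself is small. A sketch of recall@k over a set of queries, assuming rankings and qrels are plain dicts keyed by query id (the names and toy data are illustrative, not the project's harness):

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant entries that appear in the top-k of one ranking."""
    if not relevant_ids:
        return 0.0
    top = set(ranked_ids[:k])
    return len(top & set(relevant_ids)) / len(relevant_ids)

def mean_recall_at_k(rankings, qrels, k):
    """Average recall@k over queries that have at least one relevant entry."""
    scores = [recall_at_k(rankings[q], rel, k) for q, rel in qrels.items() if rel]
    return sum(scores) / len(scores)

# Toy data (hypothetical): two queries, bi-encoder rankings and relevance judgments.
rankings = {"q1": ["a", "b", "c", "d"], "q2": ["x", "y", "z", "w"]}
qrels = {"q1": {"a", "d"}, "q2": {"w"}}

for k in (2, 4):
    print(k, mean_recall_at_k(rankings, qrels, k))
```

Whatever recall@k_fetch comes out to, one minus that value is the share of relevant entries the reranker never gets to score, which is the number worth checking before tuning anything downstream.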

Part 3: standard IR wins compound

Once the bottleneck was clear, the next three checkpoints applied well-known fixes.

Adaptive depth and a rescue cache. Shallow k=30 for easy queries, deep k=200 when cross-encoder confidence is low. Save deep-search wins for similar future queries. +28.7% on hot queries.
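A rough sketch of the adaptive-depth half, assuming "confidence" means the top cross-encoder score and that `search` and `rerank` are callables supplied by the caller. The threshold value and the function names are hypothetical, and the rescue cache is left out:

```python
def adaptive_retrieve(query, search, rerank, k_shallow=30, k_deep=200, min_confidence=0.0):
    """Rerank a shallow pool first; fall back to a deep pool when the best
    cross-encoder score is below `min_confidence` (an assumed threshold)."""
    scored = rerank(query, search(query, k_shallow))
    if scored and scored[0][1] >= min_confidence:
        return scored, "shallow"
    return rerank(query, search(query, k_deep)), "deep"

# Stand-in components (hypothetical): only doc150 is relevant, and it sits at rank 151.
def toy_search(query, k):
    return [f"doc{i}" for i in range(k)]

def toy_rerank(query, candidates):
    scored = [(c, 5.0 if c == "doc150" else -3.0) for c in candidates]
    return sorted(scored, key=lambda pair: -pair[1])

results, mode = adaptive_retrieve("where did we deploy?", toy_search, toy_rerank)
print(mode, results[0])
```

The shallow pass finds nothing the reranker likes, so the call falls through to the deep pass, where the relevant entry becomes reachable.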

Deduplication. Cosine > 0.95 between entries is a signal that the corpus has near-duplicates wasting retrieval slots. 4.6% of LongMemEval was near-dupes. Dedup alone: +16.5%. Dedup plus adaptive depth: +54.1%.
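One simple way to apply that threshold is a greedy pass that keeps an entry only if nothing already kept is too close to it. This is a sketch under that assumption; the post does not say which duplicate survives or how the comparison is scheduled, and the quadratic loop below would need an ANN index at 200k entries.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def dedup(entries, threshold=0.95):
    """Greedy near-duplicate filter over (entry_id, vector) pairs.
    The first entry in a group of near-duplicates is the one that is kept."""
    kept = []
    for entry_id, vec in entries:
        if all(cosine(vec, kept_vec) <= threshold for _, kept_vec in kept):
            kept.append((entry_id, vec))
    return [entry_id for entry_id, _ in kept]

# Toy vectors (hypothetical): t2 is a near-copy of t1, t3 is unrelated.
entries = [("t1", [1.0, 0.0]), ("t2", [0.99, 0.05]), ("t3", [0.0, 1.0])]
print(dedup(entries))
```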

BM25 alongside FAISS. This one surprised me.

System	NDCG@10
Vector only	0.1376
BM25 only	0.2420
Hybrid RRF (BM25 + vector)	0.2171
Vector + cross-encoder	0.2217
Full stack (dedup + BM25 + xenc)	0.3395

BM25 alone beats pretrained embeddings on this corpus by 76%. For long-term conversation memory, questions mostly ask about specific entities, names, and timestamps. Those tokens are exact matches. Pretrained embeddings smooth over exactly the detail the question is asking about.
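For reference, reciprocal rank fusion (the "Hybrid RRF" row) scores each document by summing 1 / (k + rank) over the input rankings. A minimal sketch, assuming the commonly used constant k=60; the post does not state which constant was used:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: score(d) = sum over rankings of 1 / (k + rank_of_d)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Toy rankings (hypothetical ids).
bm25_ranking = ["t7", "t2", "t9"]
dense_ranking = ["t2", "t4", "t7"]
print(rrf_fuse([bm25_ranking, dense_ranking]))
```

Because RRF only looks at ranks, a weak ranker gets the same vote as a strong one, which is one plausible reading of why the hybrid row lands below BM25 alone here.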

This is where the “self-improving memory” thesis started to look wrong. The biggest wins to this point were not learned. They were the first two pages of a 1990s IR textbook.

Part 4: the last biology attempt, and an audit

One more attempt to rescue the GC framing. Make the routing index itself the learned part: cluster query embeddings, maintain per-cluster entry associations, run a tier lifecycle over cluster memberships. The idea was that the cluster index would learn which entries belong to which query topics over time.

Result: 0.3273 NDCG against a 0.3680 static baseline. 11% regression. The routing index added noise candidates that diluted the cross-encoder pool.

Then an integrity audit across all systems, identical eval, identical qrels. Verified the 0.3680 number comes from BM25 + vector + cross-encoder alone. No GC mechanism contributes. The biology phase was over.

Part 5: pivot to retrieval-induced forgetting

At this point a static pipeline of dedup + BM25 + vector + cross-encoder was the best I had. Anything “self-improving” had to add on top without regressing it.

I went looking for cognitive science mechanisms that had not been ported. Retrieval-induced forgetting (Anderson, 1994; Raaijmakers and Shiffrin’s SAM model, 1981) is the phenomenon where retrieving one memory actively suppresses competing memories that were activated but not selected. I found no existing AI implementations. And the shape fit: after every retrieval, the candidate pool contains winners and losers, and losers are exactly the “activated but not selected” entries.

Global RIF. After each retrieval, increment a scalar suppression score on every loser. Apply the score as a penalty to future RRF ranks. +1.1%. Marginal.

Clustered RIF. Suppression indexed per (entry, query-cluster) instead of per entry. An entry suppressed for travel queries stays available for food queries. 30 clusters of query embeddings gave +5.8%. 10 was too coarse (query topics overlap). 100 was too fine (not enough queries per cluster for stable signal).

Rank-gap formula. Only penalize a losing entry when it actually dropped in rank (its post-rerank rank is worse than its initial retrieval rank) and the cross-encoder scored it low. Not every pool loser is a “competitor” worth suppressing. Rank-gap suppresses 30% fewer entries than uniform suppression and gives +6.5% NDCG, +9.5% recall@30.
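The three variants differ only in how suppression is keyed and when it fires. A sketch of the clustered, rank-gap version: the class name, the fixed step size, the score floor, and adding the penalty directly to a rank are all assumptions made for illustration, not the shipped formula.

```python
from collections import defaultdict

class ClusteredRIF:
    """Suppression keyed by (entry_id, cluster_id), updated with a rank-gap rule."""

    def __init__(self, step=1.0, xenc_floor=0.0):
        self.suppression = defaultdict(float)
        self.step = step              # assumed constant increment per suppression event
        self.xenc_floor = xenc_floor  # assumed cutoff for "the cross-encoder scored it low"

    def update(self, cluster_id, initial_ranks, rerank_ranks, xenc_scores, winners):
        for entry_id, initial in initial_ranks.items():
            if entry_id in winners:
                continue
            dropped = rerank_ranks[entry_id] > initial        # rank got worse after reranking
            scored_low = xenc_scores[entry_id] < self.xenc_floor
            if dropped and scored_low:
                self.suppression[(entry_id, cluster_id)] += self.step

    def penalized_rank(self, entry_id, cluster_id, rank):
        # .get() avoids creating empty entries for unseen (entry, cluster) pairs.
        return rank + self.suppression.get((entry_id, cluster_id), 0.0)

rif = ClusteredRIF()
rif.update(
    cluster_id="travel",
    initial_ranks={"a": 1, "b": 2, "c": 3},
    rerank_ranks={"a": 3, "b": 1, "c": 2},
    xenc_scores={"a": -2.0, "b": 3.0, "c": -1.0},
    winners={"b"},
)
print(rif.penalized_rank("a", "travel", 1), rif.penalized_rank("a", "food", 1))
```

Entry "a" lost and fell from rank 1 to rank 3, so it is penalized for travel queries only; entry "c" also lost but moved up after reranking, so the rank-gap rule leaves it alone.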

I went back later and ran proper bootstrap CIs and paired permutation tests over the per-query NDCG arrays (10k resamples, 10k permutations, seed 42). The picture got sharper.

Config	NDCG@10 [95% CI]	Δ vs baseline	p (perm)
static baseline	0.2960 [0.265, 0.328]	–	–
global RIF (uniform)	0.2992	+0.003 [−0.010, +0.016]	0.62
global RIF (rank-gap)	0.3038	+0.008 [−0.003, +0.019]	0.18
clustered + uniform	0.3132	+0.017 [+0.006, +0.029]	0.002
clustered + rank-gap	0.3152	+0.019 [+0.010, +0.029]	0.0001

Two things that change the story vs the original markdown table:

Global RIF is not significant. The “+1.1% / +2.6%” numbers sit inside the noise floor. Both CIs cross zero. Cue-dependent clustering is the load-bearing component; global suppression on its own is not distinguishable from chance.
The rank-gap refinement on top of clustered is not individually significant at n=500. A pairwise permutation test between clustered-uniform and clustered-rank-gap gives Δ NDCG +0.002 (p=0.55) and Δ Recall +0.011 (p=0.055). Direction consistent with the mechanism, significance not established on this amount of data.

So the clean claim is: clustering (cue-dependence) is the real lever. Rank-gap suppresses 30% fewer entries than uniform and shows a small recall trend, and the efficiency is a legitimate win, but calling it a quality improvement over uniform is not defensible at this sample size.
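For anyone repeating the analysis, both procedures fit in a few lines of standard-library Python. This is a sketch of a sign-flip paired permutation test and a percentile bootstrap over per-query differences; the exact resampling scheme behind the table is not spelled out above, so treat the details (sign flipping, percentile interval, add-one smoothing of the p-value) as assumptions.

```python
import random

def paired_permutation_p(a, b, n_perm=10_000, seed=42):
    """Two-sided paired permutation test: randomly flip the sign of each per-query diff."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(n_perm):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / len(diffs)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

def bootstrap_ci(diffs, n_boot=10_000, seed=42, alpha=0.05):
    """Percentile bootstrap interval for the mean per-query difference."""
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_boot))
    low = means[int((alpha / 2) * n_boot)]
    high = means[int((1 - alpha / 2) * n_boot) - 1]
    return low, high

# Toy per-query NDCG arrays (hypothetical values, not the real eval).
baseline = [0.20, 0.35, 0.10, 0.50, 0.25, 0.40, 0.30, 0.15]
variant = [0.22, 0.36, 0.15, 0.50, 0.28, 0.45, 0.31, 0.18]
diffs = [v - b for v, b in zip(variant, baseline)]
print(paired_permutation_p(variant, baseline), bootstrap_ci(diffs))
```

Pairing matters: both tests operate on the per-query difference, so query difficulty cancels out and only the effect of the configuration change is tested.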

Part 6: Keeping Honest

Exploration and rescue. Symmetric to RIF. Pull up false negatives: periodically sample positions 31-80, cross-encoder score them, save the wins into a per-cluster rescue list, inject top-K rescues into future pools. On a 100-query fast benchmark: +2.5%. On the full 500-query eval: -2.6pp. The fast benchmark was noise. Rescue injection adds candidates that mislead RIF’s winner/loser identification at scale. Dropped.

Sparse Distributed Memory. Built Kanerva’s 1988 SDM from scratch. Random binary hard locations in a 512-bit address space, bipolar counters, top-N activation. Tested on a synthetic episodic dataset with four noise modes (partial, paraphrase, fragment, noisy). FAISS dense cosine won outright: 0.621 precision@1 against SDM’s 0.225. Binary quantization throws away the signal that dense cosine exploits. The paradigm shift didn’t help here. Closed.

Two negative results worth the time. Benchmark-tuning on a small sample is not evidence. Biology-inspired architectures are not automatically better than what was already figured out in the 1980s.

Part 6b: a third negative result, or a scope claim

After I wrote this post the first time, a reviewer-shaped voice in my head pointed out that a single benchmark is one data point. So I ran the same five configurations on NFCorpus, a BEIR medical IR benchmark (3,633 docs, 323 queries with graded relevance). Same pipeline, same hyperparameters, burn-in reduced from 5,000 to 3,000 steps because the query pool is smaller.

The mechanism didn’t transfer. Three of four RIF variants significantly regress against the baseline.

Config	NDCG@10	Δ vs baseline	p (perm)
baseline	0.3462	–	–
global RIF (uniform)	0.3198	−0.026	0.0001
global RIF (rank-gap)	0.3341	−0.012	0.007
clustered + uniform	0.3247	−0.022	0.0002
clustered + rank-gap	0.3423	−0.004	0.20

Only clustered-with-rank-gap stays within noise of baseline. The other three actively hurt retrieval with p < 0.02.

Two things are going on:

Corpus saturation. On LongMemEval, 3,000 burn-in steps touch about 4% of the 199k-turn corpus. On NFCorpus, the same 3,000 steps over a 323-query pool with replacement means each query gets ~9 exposures, and suppression accumulates on the same 30 candidates each time. By the end, 68% of the 3,633-doc corpus has non-zero suppression. We are suppressing most of the corpus. Legitimate relevant docs are inside the suppressed pool.

Workload mismatch. The mechanism was designed for a single user’s accumulating conversation where the same information needs recur. NFCorpus queries are independent medical questions. There is no user-specific topic repetition. Cue-dependent suppression requires “cues” that correspond to recurring information needs; on NFCorpus they’re closer to static topic labels, and suppression within a cluster does not generalize across the independent queries that happen to land in it.

This result changes how I frame what shipped. It is not a universal retrieval improvement. It is a mechanism for the specific failure mode of chronic false positives in a single user’s long-term conversation memory. On workloads that match that profile, the LongMemEval result holds with p<0.001. On workloads that don’t, it can hurt. The arXiv paper now scopes the claim this way explicitly, and the NFCorpus negative result sits in its own section rather than getting buried.

This is the part of the project I’m most glad I did honestly. The LongMemEval gain is real but narrow. Saying so gives the next person reaching for this tool a better chance of using it where it works.

Part 7: What RIF actually does

Decomposed RIF’s +6.8% NDCG into behavior-level metrics instead of a single number.

Metric	baseline	RIF	Δ
exact_episode (top-1 ∈ qrels)	0.208	0.222	+0.014
ndcg@10	0.293	0.316	+0.023
wrong_family (top-1 from unrelated session)	0.688	0.672	-0.016
sibling_confusion (top-1 right session, wrong turn)	0.104	0.106	+0.002
stale_fact	0.205	0.205	0.000
abstain@2	0.324	0.310	-0.014

The entire NDCG gain is wrong_family reduction. RIF suppresses cross-topic noise. It does nothing for within-session discrimination (that’s an embedding granularity problem) and nothing for temporal awareness (RIF has no time axis).
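The three top-1 rows (exact_episode, wrong_family, sibling_confusion) sum to 1.000 in both columns, so they read as a partition of top-1 outcomes. A sketch of that classification, assuming each turn id maps to a session id; the helper name and the mapping are hypothetical:

```python
def classify_top1(top1_id, relevant_ids, session_of):
    """Bucket a query's top-1 result into one of three mutually exclusive outcomes."""
    if top1_id in relevant_ids:
        return "exact_episode"
    relevant_sessions = {session_of[r] for r in relevant_ids}
    if session_of[top1_id] in relevant_sessions:
        return "sibling_confusion"   # right session, wrong turn
    return "wrong_family"            # unrelated session

# Toy mapping (hypothetical): t1 and t2 share a session, t3 is elsewhere.
session_of = {"t1": "s1", "t2": "s1", "t3": "s2"}
for candidate in ("t1", "t2", "t3"):
    print(candidate, classify_top1(candidate, {"t1"}, session_of))
```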

Useful to know before trying to extend RIF into those failure modes. The mechanism has a specific function, not a general one.

Part 8: the biggest single lever

Write-time LLM enrichment. For each memory chunk, run it through Claude Haiku once and generate a gist, three anticipated queries, entities, and temporal markers. Index all of that alongside the raw text in BM25 and vector. The cross-encoder still scores against the raw text.

Enriched 975 of the roughly 10k answer-relevant entries (15% of queries had at least one enriched target in qrels).

Covered bucket:

Metric	baseline	RIF	RIF + enriched	Δ (enr – RIF)
exact_episode	0.267	0.280	0.333	+0.053 (+19% rel)
ndcg@10	0.350	0.390	0.473	+0.083 (+21% rel)
wrong_family	0.720	0.707	0.653	-0.053
abstain@4	0.280	0.253	0.227	-0.027 (more confident)

+8.3pp NDCG over RIF on covered queries. 3.6× larger than RIF’s own contribution. Biggest single-lever gain in the project.

The mechanism is “anticipated queries” shrinking vocabulary mismatch. When the raw memory says “switched from Postgres to SQLite for local dev” and the user later asks “what local database does the project use”, the anticipated-queries field generated at write time contains phrasings close to the query.
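A sketch of what "index alongside the raw text" could look like, assuming the enrichment fields are simply appended to the text handed to BM25 and the bi-encoder while the reranker keeps scoring the raw text. The dataclass, the concatenation strategy, and the example enrichment strings are all hypothetical; the real prompt output will differ.

```python
from dataclasses import dataclass, field

@dataclass
class Enrichment:
    gist: str
    anticipated_queries: list
    entities: list = field(default_factory=list)
    temporal_markers: list = field(default_factory=list)

def index_text(raw_text, enrichment=None):
    """Text for the BM25 and vector indexes; reranking still uses raw_text only."""
    if enrichment is None:
        return raw_text
    parts = [raw_text, enrichment.gist, *enrichment.anticipated_queries,
             *enrichment.entities, *enrichment.temporal_markers]
    return "\n".join(p for p in parts if p)

raw = "switched from Postgres to SQLite for local dev"
enriched = Enrichment(
    gist="Local development now uses SQLite instead of Postgres.",
    anticipated_queries=[
        "what local database does the project use",
        "which database is used for local development",
        "did we move away from Postgres",
    ],
    entities=["Postgres", "SQLite"],
)
print(index_text(raw, enriched))
```

The raw memory never contains the word "database"; the indexed text does, which is the vocabulary gap the anticipated queries close.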

Cost: $1.6 for 1000 entries on Haiku. $16 projected for full coverage. Prompt caching didn’t help (system prompt too short to cache on this model).

Part 9: porting the whole thing to Rust

Once the research stack stabilized, the Python implementation became the bottleneck. Cold-start was hundreds of milliseconds before the first byte of a query, every Claude Code hook re-imported torch + sentence-transformers, and DuckDB’s Python bindings were the slowest part of --all cross-project search. So I ported it to Rust.

The result is a Cargo workspace of eight crates. lethe-core holds the retrieval and storage primitives: BM25, RRF, k-means, flat ANN, RIF, tokeniser, all hand-written. ONNX inference is ort. Persistence is duckdb. No web framework, no async runtime in core, no service. The CLI (lethe-cli) embeds a ratatui TUI (lethe-tui). Two binding crates ship the same library to PyPI (lethe-py via PyO3) and npm (lethe-node via napi-rs). One cargo build produces every artifact; one release workflow fans out to crates.io, Homebrew, PyPI, and npm in parallel.

A few engineering things were worth the effort:

Cold ONNX load is now ~600ms and dominates warm queries. The Python equivalent paid that on every subprocess invocation; the Rust binary pays it once per lethe shell process, and zero times per query inside the TUI.
Cross-project search dropped from 6.3–7.3s to 1.7–1.8s on 9 local projects after batching the cross-encoder over the union of candidates instead of reranking per project. The Python version reranked per project and then merged.
A shared ort::Session behind a Mutex will silently bottleneck rayon pipelines. I made that mistake in the first port. Batching across the union of candidates is much better than parallelism over per-project models.
duckdb + ATTACH ... READ_ONLY is what makes lethe search --all work without any central service. Each project keeps its own DuckDB file; the CLI ATTACHes them at query time and joins. No sync, no shared infrastructure, every project still runs fully offline.

The interactive TUI is the part I use myself. lethe with no subcommand opens it when stdout is a terminal: type to search, ↑/↓ to nav, ⏎ to open the matched chunk, --all toggles cross-project. It’s the fastest way to confirm “did we ever talk about this?”.

The Rust port also kept the research and production code in sync. A parity bench (research_playground/rust_migration/) has three suites: end-to-end LongMemEval NDCG, per-component numerical diff (BM25 / FlatIP / cross-encoder logits), and cold-start + warm-retrieve latency. Each accepts --impl python or --impl rust and emits the same JSON shape; --compare runs both and writes a markdown report. I ran it on every Rust commit that touched retrieval until deltas stayed inside the tolerance band (|ΔNDCG| ≤ 0.005, |ΔRecall| ≤ 0.005, FlatIP top-30 Jaccard ≥ 0.99, |Δxenc| ≤ 1e-3). Python is now the reference, not the production path.

Plugins

The CLI is the foundation; the daily-use shape is the plugin layer.

Claude Code plugin. A Stop hook writes a one-paragraph summary of the session into .lethe/memory/YYYY-MM-DD.md. A recall skill surfaces prior memory when Claude decides the question would benefit from history; a recall-global skill searches across every registered project at once. Install: /plugin marketplace add teimurjan/lethe && /plugin install lethe. Both skills are gated by Claude’s own decision to call them, so there’s no token cost on queries that don’t need history.

Codex CLI plugin. Same shape, different host: ~/.codex/lethe/ with hooks + skills, installer patches ~/.codex/config.toml between sentinel markers (re-runs replace the previous block, so updates are clean). Codex doesn’t expose transcript summarization yet, so the recall skills work but the auto-summarize half is pending the format settling.

Polyglot release. pip install lethe-memory ships the PyPI binding. npm install @lethe-memory/lethe ships the napi-rs one. brew install lethe and cargo install lethe-cli ship the binary. All four track the same workspace version, bumped by release-please on conventional commit titles, published in parallel by one workflow.

Part 10: The BM25 tokenizer was a punctuation tax

A code-review pass during the Rust port (Codex flagged it) caught BM25 still tokenizing with text.lower().split() from checkpoint 8. Under split(), "MongoDB?" and "MongoDB" are different tokens, so every query ending in punctuation silently missed the corresponding corpus turn.
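The failure is easy to reproduce. A small sketch comparing whitespace splitting with a regex tokenizer using the `[A-Za-z0-9_]+` pattern; the example query and turn are made up:

```python
import re

TOKEN_RE = re.compile(r"[A-Za-z0-9_]+")

def tokenize_split(text):
    """Whitespace tokenizer: punctuation stays glued to the word."""
    return text.lower().split()

def tokenize_regex(text):
    """Regex tokenizer: keeps runs of letters, digits, and underscores only."""
    return TOKEN_RE.findall(text.lower())

query = "Did we pick MongoDB?"
turn = "We chose MongoDB for the prototype."

for tokenize in (tokenize_split, tokenize_regex):
    overlap = set(tokenize(query)) & set(tokenize(turn))
    print(tokenize.__name__, sorted(overlap))
```

Under `split()` the only shared token is "we"; the one term that identifies the turn never matches because the query side carries a trailing question mark.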

Swept four tokenizers on a 100-query random sample:

Tokenizer	NDCG@10	Recall@10	BM25 build
baseline (`lower().split()`)	0.302	0.373	11.2s
regex `[A-Za-z0-9_]+`	0.339 (+3.7pp)	0.441 (+6.8pp)	10.4s
regex + stopword removal	0.308 (+0.6pp)	0.403	10.6s
regex + Porter stemming	0.315 (+1.3pp)	0.414	187.8s

Two non-obvious things came out of the sweep.

Stopword removal regresses. Dropping function words (“the”, “of”, “is”) erases most of the gain. Conversational queries are short and specific; function words act as syntactic anchors, and removing them is net-harmful on this corpus. Standard IR advice from Wikipedia/news corpora doesn’t translate.

Porter stemming is a trap. Build cost jumps 17× (10s → 188s), NDCG tops out at +1.3pp, well below plain regex. Over-conflation (generate/general → gener) costs more than the vocabulary-compression gain on conversational text.

On the full 200-query headline benchmark the swap lifted the production pipeline from 0.368 → 0.382 NDCG@10 (BM25-only jumped from 0.242 to 0.317). Bigger single-lever quality win than clustered-RIF on the same corpus, measured on the same eval. The lift is punctuation-only: the 200k conversational turns are heavy on trailing ?, ., contractions, and hook-written session anchors that previously didn’t tokenize cleanly.

Side effect: re-running the RIF benchmark on this stronger baseline halved RIF’s relative gain (clustered+gap went from +6.5% to +3.4%) but improved the absolute NDCG (0.315 → 0.342) and Recall@30 (+4.8pp). A better base leaves less to recover, but the mechanism is still net-positive. The cue-dependence of clustered RIF is what keeps it in the money when the base improves; global RIF is now indistinguishable from baseline.

Part 11: BM25 is the load-bearing component (six independent confirmations)

After enrichment + tokenizer + Rust port, I had a stable production stack at 0.382 NDCG@10 and a free hand to ablate the model layer. Two months and six experiments later, the conclusion is the same in every direction: BM25 is the load-bearing component on conversational long memory; the cross-encoder and bi-encoder are downstream fitters, not the limiter.

11a. Cross-encoder shootout

Held the bi-encoder at MiniLM-L6 and swapped the reranker. 50-query subsample.

Reranker	Params	`lethe_full` NDCG@10	Δ NDCG	Slowdown
`Xenova/ms-marco-MiniLM-L-6-v2` (current)	22M	0.376	–	1.0×
`Xenova/ms-marco-MiniLM-L-12-v2`	33M	0.378	+0.14pp	2.0×
`jinaai/jina-reranker-v1-turbo-en`	38M	0.376	−0.07pp	1.25×
`jinaai/jina-reranker-v2-base-multilingual`	278M	0.301	−7.5pp	6.8×

Bigger models cost more and don’t deliver. Jina-v2-base-multilingual actively regresses by 7.5pp at 6.8× the cost. The “+10pt over MiniLM” claims on 2025 reranker leaderboards (mxbai-rerank-v2, jina-v2) are measured on BEIR / general IR; conversational chat memory is a different workload, and the calibration to MS-MARCO logits + BM25-shaped pools pins MiniLM-L6 on the local Pareto frontier. Bge-reranker-v2-m3 (568M) extrapolates to ~50s/query on this CPU; mxbai-rerank-base-v2 (500M Qwen2.5) ~40-60s/query. Both disqualified for interactive use without GPU.

11b. Bi-encoder swap

Held the cross-encoder at MiniLM-L6 and swapped to Xenova/bge-small-en-v1.5 (33M, 384D, CLS pooling, int8). Required new harness plumbing: BiEncoder::from_repo_full(repo, onnx_variant, pooling) + a cmd_prepare-embeddings subcommand to re-encode the 199k corpus into a parallel cache.

Config	MiniLM	BGE-small	Δ
`vector_only`	0.158	0.191	+3.3pp
`bm25_only`	0.358	0.358	0.0pp
`vector_xenc`	0.227	0.263	+3.6pp
`lethe_full`	0.376	0.368	−0.8pp

BGE is genuinely a better embedder: vector_only +3.3pp, vector_xenc +3.6pp, exactly the magnitude MTEB / BEIR predict. The lift mostly washes out in lethe_full. Mechanism: lethe_full unions BM25(top-30) ∪ dense(top-30) → rerank top-60. On this corpus bm25_only (0.358) is 2.3× stronger than vector_only (0.158). BM25 dominates the rerank pool; better dense delivers more relevant docs (R@10 +1.7pp), but the cross-encoder, already saturating on what it can rank, can’t reorder them above the BM25-supplied entries it was already picking well.

11c. Multi-field BM25 + field-boosted BM25

The reranker + bi-encoder ablation pointed at “enrich BM25’s signal” as the remaining lever. The cheapest version: index body, entities, title separately and fuse via RRF, using regex extractors instead of LLM-generated text. URL / CamelCase / acronym / snake_case / file-path / version / hex hash / backtick spans for entities; first-non-empty-line truncated at 120 chars for title. Two formulations, both regressed.

Config	NDCG@10	R@10	vs `lethe_full`
`bm25_only` (body)	0.358	0.431	–
`bm25_boost_only` (body + 2×entities + title)	0.343	0.417	(BM25-alone)
`lethe_full` (control)	0.376	0.503	baseline
`lethe_multifield` (body / entities / title / dense → equal-weight RRF → rerank)	0.340	0.453	−3.6pp / −5.0pp
`lethe_field_boost` (single BM25 over body + 2×entities + title, then dense union, rerank)	0.343	0.463	−3.3pp / −4.0pp

The decisive number is bm25_boost_only 0.343 < bm25_only 0.358: the boost degrades the BM25 index itself, before the reranker sees anything. The failure is upstream of fusion. Three reasons: (1) naive concatenation breaks BM25 length normalization (b=0.75 penalizes body-term TF when entity tokens lengthen the doc; proper multi-field needs BM25F, not concatenation); (2) equal-weight RRF over four sources dilutes the dominant body signal (entities populate only 37% of chunks since chat memory is mostly prose, so RRF promotes entity-only matches body BM25 had correctly deprioritized); (3) regex entities are the wrong vocabulary bridge: users query in natural language (“how do I configure X”), not with the syntactic tokens regex extracts (BiEncoder, 4aae737, 0.10.0).
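Reason (1) can be checked with the standard BM25 term-weight formula. The sketch below holds the average document length fixed and lengthens one document by appended entity tokens, which is a simplification; k1=1.2 and the lengths are assumed values, only b=0.75 comes from the text above.

```python
def bm25_term_score(tf, doc_len, avg_doc_len, idf=1.0, k1=1.2, b=0.75):
    """Per-term BM25 contribution: idf * tf*(k1+1) / (tf + k1*(1 - b + b*dl/avgdl))."""
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * length_norm)

# Same body term (tf=2) before and after 60 entity/title tokens are concatenated.
body_only = bm25_term_score(tf=2, doc_len=100, avg_doc_len=100)
with_appended_fields = bm25_term_score(tf=2, doc_len=160, avg_doc_len=100)
print(body_only, with_appended_fields)
```

The body term's frequency did not change, but its score falls because the document got longer. BM25F avoids this by normalizing length per field before combining term frequencies.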

The structural insight is what makes this informative as a negative: the kind of enrichment matters more than the field architecture. What checkpoint 17 produces (gist, anticipated_queries) succeeds for the same reason regex entities fail: anticipated queries are written in the user’s vocabulary, so they bridge the BM25 query–document gap directly. That’s the +8.3pp covered-bucket result; it can’t be cheaply substituted by regex.

11d. Late chunking

Hypothesis from Jina (2024): per-turn embeddings lose conversational context; encoding the whole session with a long-context model and mean-pooling per turn gives each turn a context-aware embedding. Cited gain on long-document benchmarks: +1.5–6.5 NDCG@10 absolute.

Local CPU encoding capped at 0.4–1.5 sessions/sec on a 137M-param nomic-embed-text-v1.5 int8, full prep ~14 hours. Built a Modal app instead: prep_late.py runs on a single L40S, 70 sessions/sec, ~3.5 min for the full 199,509-turn corpus, ~$0.30 spend. 99.98% of sessions fit in one 8192-token forward pass.

Config	MiniLM (control)	BGE-small	nomic-v1.5 LATE	Δ vs control
`vector_only`	0.158	0.191	0.110	−4.8pp
`bm25_only`	0.358	0.358	0.358	0
`vector_xenc`	0.227	0.263	0.197	−3.0pp
`lethe_full` NDCG	0.376	0.368	0.365	−1.2pp
`lethe_full` R@10	0.503	0.520	0.483	−2.0pp

Two things to call out. The dense leg regressed 4.8pp on vector_only because the encoding script prepended search_document: to each turn before packing into one session, so the model saw [CLS] search_document: turn1 [SEP] search_document: turn2 [SEP] .... Nomic-v1.5 was trained with the prefix appearing once per document; multiple in-document prefixes are out-of-distribution and corrupt the per-turn pooled embeddings. Easy fix (single prefix at session start, per-turn spans excluding prefix tokens). I didn’t run it because the second observation makes the fix moot: lethe_full is flat within 50q bootstrap noise (±2–3pp). Even a totally different long-context dense embedder + session-level context awareness does not move the production metric.
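The unrun fix amounts to pooling each turn over its own token span and leaving the prefix tokens out of every span. A sketch of just the pooling step, assuming the model's per-token embeddings and the token offsets of each turn are already available; the function names and toy numbers are hypothetical:

```python
def mean_pool(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def pool_turns(token_embeddings, turn_spans):
    """One embedding per turn; spans are (start, end) token indices, end exclusive."""
    return [mean_pool(token_embeddings[start:end]) for start, end in turn_spans]

# Toy session: tokens 0-1 are the single leading prefix, then two turns of two tokens each.
token_embeddings = [
    [9.0, 9.0], [9.0, 9.0],   # prefix tokens, excluded from every span
    [1.0, 0.0], [3.0, 0.0],   # turn 1
    [0.0, 2.0], [0.0, 4.0],   # turn 2
]
turn_spans = [(2, 4), (4, 6)]
print(pool_turns(token_embeddings, turn_spans))
```

The prefix still influences every token through attention, but it no longer contributes its own vectors to any turn's pooled embedding.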

Sixth independent confirmation of the BM25-dominance pattern. Cross-encoder swap, multi-reranker shootout, BGE bi-encoder swap, multi-field BM25, field-boosted BM25, and now late chunking: all six converge on the same ceiling for lethe_full on this corpus. The Modal harness stays in the repo (research_playground/late_chunking_modal/) as a reusable “cheap GPU embed” tool; the late-chunking question itself is closed for this workload.

11e. The latency knob the TUI exposed

One non-quality lever I did pull. Checkpoint 6 set adaptive deep-pass k_deep=200 and never measured smaller alternatives. Once the TUI made cross-project search user-visible (N × k_deep rerank cost in the worst case), I swept it on the production pipeline:

config	NDCG@10	Recall@10	p50	p95
shallow-only	0.287	0.349	1689 ms	2371 ms
`k_deep=60`	0.291	0.353	4159 ms	5936 ms
`k_deep=100`	0.302	0.373	5651 ms	7440 ms
`k_deep=200` (old)	0.302	0.373	9800 ms	12650 ms

k_deep=100 matches the old 200 on quality and cuts p50 by 42%, p95 by 41%. The cross-encoder’s top-10 picks stabilize by merged rank ~100 on this workload; ranks 101-200 never won a top-10 slot, so the old 200 was pure latency tax. Shipped as the new default; the knob is exposed on the MemoryStore / UnionStore constructors.

I also tried two hardware levers that didn’t pan out: int8-quantized BGE-small (4.9× synthetic throughput, 0.23× on real conversational turns because BGE’s 512-token cap doubles per-item compute on long turns); CoreML execution provider on Apple Silicon (10.5× slower on the bi-encoder because onnxruntime’s CoreML partitioner only covers ~72% of the graph and every forward pass pays a Metal/ANE round-trip that won’t amortize on a 22M-param model). Logged in the journal so future explorers don’t rediscover.

Part 12: what shipped (current)

The stack that survived all of this:

Storage. Markdown files under .lethe/memory/*.md, one day per file, ##/### sections as chunks. Readable with cat, editable with any editor. Content-hashed for incremental reindex.
Index. DuckDB, one file per project at .lethe/index/lethe.duckdb. BM25 tokens (regex-tokenized), FAISS-equivalent dense vectors, k-means cluster centroids, per-cluster RIF suppression state, all in one file. No server, no external vector DB.
Cross-project search. DuckDB ATTACH ... READ_ONLY. lethe search --all opens every registered project’s DuckDB simultaneously, fans the cross-encoder over the union of candidates (one batched rerank pass, not N), merges via RRF.
Retrieval. BM25 (regex tokenizer) + dense (MiniLM-L6 ONNX) + cross-encoder rerank (ms-marco MiniLM-L6) with k_deep=100, clustered RIF (rank-gap, 30 clusters, query-based centroids), 0.95 cosine dedup. Optional Haiku enrichment at write time.
Distribution. Single Rust binary (lethe, ratatui TUI when invoked with no args). PyPI binding (lethe-memory, PyO3). npm binding (lethe, napi-rs). Claude Code plugin (Stop-hook auto-summarization + recall / recall-global skills). Codex CLI plugin (same skills, transcript summarization pending).

The name is from the Greek river of forgetting. The store is as much defined by what it suppresses as by what it retrieves.

Benchmark methodology note

The numbers above are NDCG@10 on turn-level retrieval over the full 199,509-turn LongMemEval S corpus. That is needle-in-haystack search against 200k candidates.

Other memory-tool benchmarks commonly report recall@5 on per-query fresh databases of roughly 50 sessions at session granularity. That is a roughly 2000× easier task (random baseline 10% vs 0.005%). Some implementations additionally leak ground truth via annotation fields at indexing time. Published numbers in the 95-99% range on that methodology are state-of-the-art for that methodology. They are not directly comparable to anything here.
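
The random-baseline figures can be reproduced with back-of-envelope arithmetic. The sketch below assumes exactly one relevant item per query and a uniformly random ranking, so the chance that the top k contains the answer is k divided by the pool size. That is a rough proxy only (a random NDCG@10 would be somewhat lower than the hit rate), and the helper name random_hit_rate is made up for the example.

```python
def random_hit_rate(k, pool_size):
    """Chance that a uniformly random top-k contains the single relevant item."""
    return k / pool_size


# Session-level setup: top 5 out of roughly 50 sessions.
session_baseline = random_hit_rate(5, 50)

# Turn-level setup in this post: top 10 out of 199,509 turns.
turn_baseline = random_hit_rate(10, 199_509)

ratio = session_baseline / turn_baseline

print(f"session-level baseline: {session_baseline:.1%}")
print(f"turn-level baseline:    {turn_baseline:.4%}")
print(f"difficulty ratio:       {ratio:.0f}x")
```

Under these assumptions the ratio comes out just under 2000, which is where the "roughly 2000×" figure comes from.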

A head-to-head on a shared methodology in either direction is a separate experiment that has not been run.

What’s next

Reranker swap, bi-encoder swap, rerank-pool widening, multi-field BM25, late chunking: all dropped off the candidate list. Six independent ablations confirm none is the limiter on this corpus. The remaining open directions are either expensive (LLM enrichment) or scope-defining (cross-dataset replication, head-to-head benchmarks).

Scale enrichment to full answer-relevant coverage. ~$16 + a few hours on Haiku. Confirms the covered-bucket numbers at scale; should produce a clean +8pp NDCG story instead of the diluted partial-coverage one. Reinforced by the multi-field BM25 negative: cheap regex substitutes for anticipated-queries don’t carry the lift, so paying for actual LLM-generated enrichment is the only path that gives BM25 the right signal.
Replicate clustered RIF on a second long-term conversation memory benchmark (LoCoMo, MSC, LongMemEval M). The NFCorpus negative shows the mechanism doesn’t transfer to ad-hoc IR; a second in-scope dataset would strengthen the conversational-memory claim independently.
Head-to-head against other memory tools on shared methodology. Run lethe under their setup (per-query ~50 sessions, recall@5), and theirs under LongMemEval S. Honest comparison, honest methodology.
Move failure modes that haven’t budged. Sibling confusion (within-session embedding granularity) and stale fact (temporal awareness). Both need different mechanisms from RIF or enrichment. Candidates: session-structured reranking, temporal-aware tie-breaking, explicit fact extraction with validity windows.

The lesson

There are two now.

The first (still): check the bottleneck before extending the mechanism. For three months I tuned a mutation layer that couldn’t matter regardless of tuning, because the component setting the ceiling (the cross-encoder reranker on 30 candidates) was not being fed the right inputs. An hour of recall measurement reframed three months of work.

The second: when one component dominates, model-layer swaps are theatre. Two months of post-enrichment ablations (better reranker, better bi-encoder, multi-field BM25, late chunking on a GPU) all bounced off the same ceiling because BM25 was supplying the candidates the reranker was already ranking well. 2025 RAG-survey advice (“just swap to mxbai-rerank-v2”, “just swap to BGE”, “just use late chunking”) implicitly assumes a balanced pipeline. Conversational long memory isn’t balanced: bm25_only is 2.3× stronger than vector_only on this corpus, so the dense leg can lift on isolated benchmarks (vector_only, vector_xenc) without ever displacing a BM25-supplied entry from the rerank pool. The lever is in what feeds BM25, not in the model on top.

Both lessons rhyme. The interesting question is rarely “which model is better”; it’s “which component is currently the rate limiter, and why.” Until you know that, every swap is noise.

This article was originally published by Teimur Gasanov on HackerNoon.
