Earlier, I published a post about porting the immune system’s germinal center mechanism to LLM memory. Three arms, two datasets, three mutation strategies, one consistent finding: the biological control loop (adaptive rate, tier lifecycle, decay) is sound engineering, but Gaussian perturbation of pretrained embeddings cannot improve retrieval. The adaptive rate correctly identifies which entries to mutate and how much. It just can’t produce a perturbation that helps.
The one thing that worked in that phase was not biological. Cross-encoder reranking on top of bi-encoder retrieval gave a 63% NDCG lift on LongMemEval.
I said two things would need to change: the mutation needed to be semantically informed, and the fitness signal needed to come from outside the embedding space. So I built both. Then I ran another fourteen experiments. The learned MLP adapter produces the right direction. The segmentation mutation finds the wrong granularity. A simple recall diagnostic reframed the whole project. What shipped wasn’t biological at all, but it did learn from use.
Part 1: Fixing the mutation
Two new mutation types, both designed to address the specific failures from the original experiment.
Learned MLP adapter. Instead of blind Gaussian noise, a tiny neural network predicts the perturbation: delta = f(query, embedding, xenc_score). The architecture is 769 inputs (384 + 384 + 1 cross-encoder score), 128 hidden with ReLU, 384 output. About 148,000 parameters.
It trains online during the experiment loop using a differentiable loss:
loss = -(sigmoid(xenc_score) - 0.5) * cos(normalize(embedding + delta), query)
When the cross-encoder says the entry is relevant (high score), the loss drives the delta toward the query. When it says irrelevant, away. The cross-encoder breaks the circularity because it scores raw text pairs, not embeddings. The MLP is shared across all entries and learns the general rule for how embeddings should move.
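A minimal sketch of the parameter count and the loss above, in plain Python rather than whatever tensor library the experiments used. The function names and the toy 2-D vectors are illustrative assumptions, and the query is normalized here so the dot product is a true cosine.

```python
import math

def mlp_param_count(n_in, n_hidden, n_out):
    """Weights plus biases for a two-layer MLP (input -> hidden -> output)."""
    return (n_in * n_hidden + n_hidden) + (n_hidden * n_out + n_out)

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def adapter_loss(xenc_score, embedding, delta, query):
    """loss = -(sigmoid(xenc_score) - 0.5) * cos(normalize(embedding + delta), query)."""
    weight = 1.0 / (1.0 + math.exp(-xenc_score)) - 0.5
    moved = normalize([e + d for e, d in zip(embedding, delta)])
    q = normalize(query)  # assumption: query is unit-normalized before the dot product
    cos = sum(a * b for a, b in zip(moved, q))
    return -weight * cos

# 769 inputs (384 + 384 + 1), 128 hidden, 384 outputs
PARAMS = mlp_param_count(384 + 384 + 1, 128, 384)  # 148096

# Toy 2-D example (hypothetical values): the query points along x, the entry along y.
query = [1.0, 0.0]
embedding = [0.0, 1.0]
toward = [0.1, 0.0]   # delta that moves the entry toward the query
away = [-0.1, 0.0]    # delta that moves it away

relevant_toward = adapter_loss(3.0, embedding, toward, query)
relevant_away = adapter_loss(3.0, embedding, away, query)
irrelevant_toward = adapter_loss(-3.0, embedding, toward, query)
irrelevant_away = adapter_loss(-3.0, embedding, away, query)
```

With a high cross-encoder score the loss is lower for the delta that points toward the query. With a low score the sign flips and the delta pointing away is preferred.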
Segmentation mutation. Instead of changing embedding vectors, change the text itself. Split long entries with low retrieval affinity into individual sentences. Merge adjacent short entries that are always co-retrieved. This is the closest analog to how B cells find optimal epitope binding: an antibody that binds a smaller, more specific surface often has higher affinity than one that broadly but weakly binds a large area. In embedding terms, a sentence that directly answers a question embeds more precisely than a paragraph that contains the answer somewhere in the middle.
Both mutation types are governed by the same GC control loop: adaptive rate, tier lifecycle, time decay, circuit breakers.
What the MLP adapter did
The MLP trained online for 14,846 steps on NFCorpus and produced deltas with average loss of -0.081 (negative loss means coherent training signal). On LongMemEval it ran 1,447 GC-tier entries through mutation. NDCG moved from 0.2217 to 0.2218.
The problem is mechanical, not conceptual. The MLP produces deltas of norm ~0.025. After adding to the adapter and clipping to max_adapter_norm=0.5, the effective embedding shifts by about 0.0005 in cosine distance. FAISS retrieves the top-30 candidates using inner product search over 200,000 entries. A cosine shift of 0.0005 doesn’t change which entries land in the top-30. The cross-encoder reranks those 30 candidates, and the mutation is invisible.
For the MLP to matter, it would need to produce deltas large enough to change FAISS rankings. On a 200k-entry index, that means moving an entry from position 50 to position 25 or better. The cosine distances between adjacent entries at that scale are 0.001 to 0.01. The MLP’s deltas are in the right ballpark, but the selection threshold and adapter norm clip eat most of the magnitude.
The pipeline is wired correctly. The direction is right. The magnitude is wrong.
What segmentation did
Segmentation was the more ambitious idea and the clearer failure. On NFCorpus, splitting medical abstracts into individual sentences destroyed retrieval quality: NDCG dropped from 0.334 to 0.179, a 46% collapse. The corpus grew from 3,633 entries to 8,674 as sentences proliferated. On LongMemEval the damage was smaller (0.222 to 0.194) because conversation turns are already more self-contained than medical abstracts.
The failure is straightforward. “Results showed statistical significance at p < 0.01” means nothing without the preceding methods section. A conversation turn “I prefer the blue one” means nothing without the turn that established what “one” refers to. Splitting destroys coreference and coherence. The embedding model encodes these fragments faithfully, but what it encodes is a decontextualized string that doesn’t match any real query.
The biological analogy was appealing: B cells find optimal binding by narrowing their target. But the analogy breaks down because text isn’t a protein surface. Proteins have local binding sites that function independently. Text has long-range dependencies that don’t survive fragmentation.
Merging, the other half of segmentation, barely triggered. It requires two adjacent entries to both have high affinity and be co-retrieved in the same query. On a 200k-entry corpus with k=10 retrieval, the probability of two adjacent turns both appearing in the same top-10 is low.
Part 2: The diagnostic that changed the project
After the MLP adapter landed at +0.4% I stopped tuning and ran a recall diagnostic:
Bi-encoder recall@30: 29.0%
Bi-encoder recall@100: 43.8%
Bi-encoder recall@500: 63.6%
Oracle recall: 100%
k_fetch=30 + xenc: NDCG=0.2217
k_fetch=100 + xenc: NDCG=0.2697 (+21.6%)
Oracle + xenc: NDCG=0.4709 (+112%)
71% of relevant entries were never shown to the cross-encoder (bi-encoder recall@30 is 29.0%). The mutation work had been optimizing the wrong layer. The cross-encoder could reach 0.47 NDCG, double the measured result, if it saw the right candidates. The bi-encoder was the ceiling.
This was the most useful hour of the project. Nothing I tried afterward beat the 0.47 oracle, but anything that moved bi-encoder recall toward oracle was unblocked gain.
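The diagnostic itself is small. A sketch of recall@k over first-stage candidates, assuming `runs` maps each query id to its ranked candidate ids and `qrels` maps each query id to its relevant ids (both names and the toy data are hypothetical):

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant entries that appear in the top-k candidates."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def mean_recall_at_k(runs, qrels, k):
    """Average recall@k over queries that have at least one relevant entry."""
    scored = [recall_at_k(runs[q], qrels[q], k) for q in qrels if qrels[q]]
    return sum(scored) / len(scored)

# Toy data: two queries, four candidates each.
runs = {"q1": ["a", "b", "c", "d"], "q2": ["x", "y", "z", "w"]}
qrels = {"q1": {"a", "d"}, "q2": {"w"}}

shallow = mean_recall_at_k(runs, qrels, k=2)  # 0.25
deep = mean_recall_at_k(runs, qrels, k=4)     # 1.0
```

Running this at several values of k before touching the reranker shows how much of the gap sits in candidate generation.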
Part 3: standard IR wins compound
Once the bottleneck was clear, the next three checkpoints applied well-known fixes.
Adaptive depth and a rescue cache. Shallow k=30 for easy queries, deep k=200 when cross-encoder confidence is low. Save deep-search wins for similar future queries. +28.7% on hot queries.
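A sketch of the adaptive-depth half, under stated assumptions: `search(query, k)` returns ranked ids, `rerank(query, ids)` returns `(id, score)` pairs sorted best first, and `min_confidence` is a hypothetical threshold on the top cross-encoder score. The rescue cache is left out.

```python
def adaptive_retrieve(query, search, rerank, k_shallow=30, k_deep=200, min_confidence=0.0):
    """Shallow pass first; fall back to a deep pass when the reranker is unsure."""
    scored = rerank(query, search(query, k_shallow))
    if scored and scored[0][1] >= min_confidence:
        return scored, "shallow"
    return rerank(query, search(query, k_deep)), "deep"

# Toy first-stage ranking and cross-encoder scores (hypothetical).
ORDER = ["d%d" % i for i in range(300)]
XENC = {doc: -1.0 for doc in ORDER}
XENC["d150"] = 4.0  # the one relevant entry sits at first-stage rank 151

def toy_search(query, k):
    return ORDER[:k]

def toy_rerank(query, candidates):
    pairs = [(c, XENC[c]) for c in candidates]
    return sorted(pairs, key=lambda p: p[1], reverse=True)

results, depth = adaptive_retrieve("where did we leave off", toy_search, toy_rerank)
```

In the toy run the shallow pool contains nothing the reranker likes, so the deep pass fires and surfaces the entry at rank 151.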
Deduplication. Cosine > 0.95 between entries is a signal that the corpus has near-duplicates wasting retrieval slots. 4.6% of LongMemEval was near-dupes. Dedup alone: +16.5%. Dedup plus adaptive depth: +54.1%.
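One way to apply the 0.95 threshold is a greedy pass that keeps the first entry of each near-duplicate group. This is a quadratic sketch for illustration, not necessarily how the pipeline does it at 200k entries, where an ANN lookup per entry would be the practical choice.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def dedup(embeddings, threshold=0.95):
    """Keep an entry only if it is not a near-duplicate of an already kept one."""
    kept = []
    for idx, vec in enumerate(embeddings):
        if all(cosine(vec, embeddings[j]) <= threshold for j in kept):
            kept.append(idx)
    return kept

# Toy vectors: the second is almost identical to the first.
vectors = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]]
kept_indices = dedup(vectors)  # [0, 2]
```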
BM25 alongside FAISS. This one surprised me.
| System | NDCG@10 |
|---|---|
| Vector only | 0.1376 |
| BM25 only | 0.2420 |
| Hybrid RRF (BM25 + vector) | 0.2171 |
| Vector + cross-encoder | 0.2217 |
| Full stack (dedup + BM25 + xenc) | 0.3395 |
BM25 alone beats pretrained embeddings on this corpus by 76%. For long-term conversation memory, questions mostly ask about specific entities, names, and timestamps. Those tokens are exact matches. Pretrained embeddings smooth over exactly the detail the question is asking about.
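For reference, the hybrid row uses reciprocal rank fusion. A minimal sketch, assuming the commonly used constant k=60 (the value used in these experiments is not stated):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

bm25_ranking = ["a", "b", "c"]
dense_ranking = ["c", "a", "d"]
fused = rrf_fuse([bm25_ranking, dense_ranking])  # ['a', 'c', 'b', 'd']
```

RRF only looks at ranks, so a weak list gets the same vote as a strong one. That is consistent with the hybrid row landing below BM25 alone when the dense leg is much weaker.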
This is where the “self-improving memory” thesis started to look wrong. The biggest wins to this point were not learned. They were the first two pages of a 1990s IR textbook.
Part 4: the last biology attempt, and an audit
One more attempt to rescue the GC framing. Make the routing index itself the learned part: cluster query embeddings, maintain per-cluster entry associations, run a tier lifecycle over cluster memberships. The idea was that the cluster index would learn which entries belong to which query topics over time.
Result: 0.3273 NDCG against a 0.3680 static baseline. 11% regression. The routing index added noise candidates that diluted the cross-encoder pool.
Then an integrity audit across all systems, identical eval, identical qrels. Verified the 0.3680 number comes from BM25 + vector + cross-encoder alone. No GC mechanism contributes. The biology phase was over.
Part 5: pivot to retrieval-induced forgetting
At this point a static pipeline of dedup + BM25 + vector + cross-encoder was the best I had. Anything “self-improving” had to add on top without regressing it.
I went looking for cognitive science mechanisms that had not been ported. Retrieval-induced forgetting (Anderson, 1994; Raaijmakers and Shiffrin’s SAM model, 1981) is the phenomenon where retrieving one memory actively suppresses competing memories that were activated but not selected. Zero existing AI implementations. And the shape fit: after every retrieval, the candidate pool contains winners and losers, and losers are exactly the “activated but not selected” entries.
Global RIF. After each retrieval, increment a scalar suppression score on every loser. Apply the score as a penalty to future RRF ranks. +1.1%. Marginal.
Clustered RIF. Suppression indexed per (entry, query-cluster) instead of per entry. An entry suppressed for travel queries stays available for food queries. 30 clusters of query embeddings gave +5.8%. 10 was too coarse (query topics overlap). 100 was too fine (not enough queries per cluster for stable signal).
Rank-gap formula. Only penalize a losing entry when it actually dropped in rank (initial retrieval rank above its post-rerank rank) and the cross-encoder scored it low. Not every pool loser is a “competitor” worth suppressing. Gap suppresses 30% fewer entries and gives +6.5% NDCG, +9.5% recall@30.
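The three variants above share one piece of state. A sketch of the clustered rank-gap update, where the class name, `step`, `low_score`, and `top_k` are hypothetical parameters. How the penalty folds into RRF is not spelled out beyond "a penalty to future RRF ranks", so the sketch stops at the lookup.

```python
from collections import defaultdict

class ClusteredRIF:
    def __init__(self, step=0.1, low_score=0.0):
        self.suppression = defaultdict(float)  # (entry_id, cluster_id) -> score
        self.step = step
        self.low_score = low_score

    def update(self, cluster_id, initial_ranks, rerank_ranks, xenc_scores, top_k=10):
        """Penalize pool losers that fell in rank and got a low reranker score."""
        for entry, post_rank in rerank_ranks.items():
            if post_rank <= top_k:
                continue  # winners are never suppressed
            dropped = post_rank > initial_ranks[entry]
            if dropped and xenc_scores[entry] < self.low_score:
                self.suppression[(entry, cluster_id)] += self.step

    def penalty(self, entry, cluster_id):
        return self.suppression.get((entry, cluster_id), 0.0)

# Toy pool of three entries; only the top-1 counts as a winner here.
rif = ClusteredRIF()
rif.update(
    cluster_id=0,
    initial_ranks={"a": 1, "b": 2, "c": 3},
    rerank_ranks={"b": 1, "c": 2, "a": 3},
    xenc_scores={"a": -2.0, "b": 3.0, "c": 0.5},
    top_k=1,
)
```

Entry `a` fell from rank 1 to 3 with a low score, so it is suppressed for cluster 0 only. Entry `c` lost but moved up, so it is left alone. Dropping the cluster id from the key gives global RIF, and dropping the `dropped` check gives the uniform variant.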
I went back later and ran proper bootstrap CIs and paired permutation tests over the per-query NDCG arrays (10k resamples, 10k permutations, seed 42). The picture got sharper.
| Config | NDCG@10 [95% CI] | Δ vs baseline | p (perm) |
|---|---|---|---|
| static baseline | 0.2960 [0.265, 0.328] | – | – |
| global RIF (uniform) | 0.2992 | +0.003 [−0.010, +0.016] | 0.62 |
| global RIF (rank-gap) | 0.3038 | +0.008 [−0.003, +0.019] | 0.18 |
| clustered + uniform | 0.3132 | +0.017 [+0.006, +0.029] | 0.002 |
| clustered + rank-gap | 0.3152 | +0.019 [+0.010, +0.029] | 0.0001 |
Two things change the story relative to the point estimates quoted above:
- Global RIF is not significant. The “+1.1% / +2.6%” numbers sit inside the noise floor. Both CIs cross zero. Cue-dependent clustering is the load-bearing component; global suppression on its own is not distinguishable from chance.
- The rank-gap refinement on top of clustered is not individually significant at n=500. A pairwise permutation test between clustered-uniform and clustered-rank-gap gives Δ NDCG +0.002 (p=0.55) and Δ Recall +0.011 (p=0.055). Direction consistent with the mechanism, significance not established on this amount of data.
So the clean claim is: clustering (cue-dependence) is the real lever. Rank-gap suppresses 30% fewer entries than uniform and shows a small recall trend, and the efficiency is a legitimate win, but calling it a quality improvement over uniform is not defensible at this sample size.
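The paired tests behind that table can be sketched with the standard library. This assumes a percentile bootstrap and a sign-flip permutation test over per-query differences; the exact variants used are not stated, and the toy arrays are made up.

```python
import random

def paired_bootstrap_ci(baseline, treatment, n_resamples=10_000, seed=42, alpha=0.05):
    """Mean per-query difference with a percentile bootstrap confidence interval."""
    rng = random.Random(seed)
    diffs = [t - b for b, t in zip(baseline, treatment)]
    n = len(diffs)
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, (lo, hi)

def paired_permutation_p(baseline, treatment, n_permutations=10_000, seed=42):
    """Two-sided sign-flip test on per-query differences."""
    rng = random.Random(seed)
    diffs = [t - b for b, t in zip(baseline, treatment)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(n_permutations):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)

# Toy per-query NDCG arrays: the treatment is uniformly 0.02 better.
baseline = [0.30 + 0.01 * (i % 5) for i in range(20)]
treatment = [b + 0.02 for b in baseline]
delta, (ci_lo, ci_hi) = paired_bootstrap_ci(baseline, treatment)
p_value = paired_permutation_p(baseline, treatment)
```

Working on per-query differences is what makes the test paired: query difficulty cancels out, which is why a 0.017 delta can be significant even though the baseline's own CI is about ±0.03 wide.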
Part 6: Keeping Honest
Exploration and rescue. Symmetric to RIF. Pull up false negatives: periodically sample positions 31-80, cross-encoder score them, save the wins into a per-cluster rescue list, inject top-K rescues into future pools. On a 100-query fast benchmark: +2.5%. On the full 500-query eval: -2.6pp. The fast benchmark was noise. Rescue injection adds candidates that mislead RIF’s winner/loser identification at scale. Dropped.
Sparse Distributed Memory. Built Kanerva’s 1988 SDM from scratch. Random binary hard locations in a 512-bit address space, bipolar counters, top-N activation. Tested on a synthetic episodic dataset with four noise modes (partial, paraphrase, fragment, noisy). FAISS dense cosine won outright: 0.621 precision@1 against SDM’s 0.225. Binary quantization throws away the signal that dense cosine exploits. The paradigm shift didn’t help here. Closed.
Two negative results worth the time. Benchmark-tuning on a small sample is not evidence. Biology-inspired architectures are not automatically better than what was already figured out in the 1980s.
Part 6b: a third negative result, or a scope claim
After I wrote this post the first time, a reviewer-shaped voice in my head pointed out that a single benchmark is one data point. So I ran the same five configurations on NFCorpus, a BEIR medical IR benchmark (3,633 docs, 323 queries with graded relevance). Same pipeline, same hyperparameters, burn-in reduced from 5,000 to 3,000 steps because the query pool is smaller.
The mechanism didn’t transfer. Three of four RIF variants significantly regress against the baseline.
| Config | NDCG@10 | Δ vs baseline | p (perm) |
|---|---|---|---|
| baseline | 0.3462 | – | – |
| global RIF (uniform) | 0.3198 | −0.026 | 0.0001 |
| global RIF (rank-gap) | 0.3341 | −0.012 | 0.007 |
| clustered + uniform | 0.3247 | −0.022 | 0.0002 |
| clustered + rank-gap | 0.3423 | −0.004 | 0.20 |
Only clustered-with-rank-gap stays within noise of baseline. The other three actively hurt retrieval, each with p < 0.01.
Two things are going on:
Corpus saturation. On LongMemEval, 3,000 burn-in steps touch about 4% of the 199k-turn corpus. On NFCorpus, the same 3,000 steps over a 323-query pool with replacement means each query gets ~9 exposures, and suppression accumulates on the same 30 candidates each time. By the end, 68% of the 3,633-doc corpus has non-zero suppression. We are suppressing most of the corpus. Legitimate relevant docs are inside the suppressed pool.
Workload mismatch. The mechanism was designed for a single user’s accumulating conversation where the same information needs recur. NFCorpus queries are independent medical questions. There is no user-specific topic repetition. Cue-dependent suppression requires “cues” that correspond to recurring information needs; on NFCorpus they’re closer to static topic labels, and suppression within a cluster does not generalize across the independent queries that happen to land in it.
This result changes how I frame what shipped. It is not a universal retrieval improvement. It is a mechanism for the specific failure mode of chronic false positives in a single user’s long-term conversation memory. On workloads that match that profile, the LongMemEval result holds with p<0.001. On workloads that don’t, it can hurt. The arXiv paper now scopes the claim this way explicitly, and the NFCorpus negative result sits in its own section rather than getting buried.
This is the part of the project I’m most glad I did honestly. The LongMemEval gain is real but narrow. Saying so gives the next person reaching for this tool a better chance of using it where it works.
Part 7: What RIF actually does
Decomposed RIF’s +6.8% NDCG into behavior-level metrics instead of a single number.
| Metric | baseline | RIF | Δ |
|---|---|---|---|
| exact_episode (top-1 ∈ qrels) | 0.208 | 0.222 | +0.014 |
| ndcg@10 | 0.293 | 0.316 | +0.023 |
| wrong_family (top-1 from unrelated session) | 0.688 | 0.672 | -0.016 |
| sibling_confusion (top-1 right session, wrong turn) | 0.104 | 0.106 | +0.002 |
| stale_fact | 0.205 | 0.205 | 0.000 |
| abstain@2 | 0.324 | 0.310 | -0.014 |
The entire NDCG gain is wrong_family reduction. RIF suppresses cross-topic noise. It does nothing for within-session discrimination (that’s an embedding granularity problem) and nothing for temporal awareness (RIF has no time axis).
Useful to know before trying to extend RIF into those failure modes. The mechanism has a specific function, not a general one.
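The three top-1 buckets in the table partition the queries (0.208 + 0.688 + 0.104 = 1.000 for the baseline). A sketch of that classification, assuming a hypothetical `session_of` mapping from turn id to session id:

```python
def classify_top1(top1_id, relevant_ids, session_of):
    """Bucket a query by what its top-1 result is."""
    if top1_id in relevant_ids:
        return "exact_episode"
    relevant_sessions = {session_of[r] for r in relevant_ids}
    if session_of[top1_id] in relevant_sessions:
        return "sibling_confusion"  # right session, wrong turn
    return "wrong_family"           # unrelated session

# Toy corpus: two turns in session s1, one in s2.
session_of = {"t1": "s1", "t2": "s1", "t9": "s2"}
relevant = {"t1"}

hit = classify_top1("t1", relevant, session_of)      # 'exact_episode'
sibling = classify_top1("t2", relevant, session_of)  # 'sibling_confusion'
miss = classify_top1("t9", relevant, session_of)     # 'wrong_family'
```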
Part 8: the biggest single lever
Write-time LLM enrichment. For each memory chunk, run it through Claude Haiku once and generate a gist, three anticipated queries, entities, and temporal markers. Index all of that alongside the raw text in BM25 and vector. The cross-encoder still scores against the raw text.
Enriched 975 of the roughly 10k answer-relevant entries (15% of queries had at least one enriched target in qrels).
Covered bucket:
| Metric | baseline | RIF | RIF + enriched | Δ (enr – RIF) |
|---|---|---|---|---|
| exact_episode | 0.267 | 0.280 | 0.333 | +0.053 (+19% rel) |
| ndcg@10 | 0.350 | 0.390 | 0.473 | +0.083 (+21% rel) |
| wrong_family | 0.720 | 0.707 | 0.653 | -0.053 |
| abstain@4 | 0.280 | 0.253 | 0.227 | -0.027 (more confident) |
+8.3pp NDCG over RIF on covered queries. 3.6× larger than RIF’s own contribution (+2.3pp NDCG on the full eval in Part 7). Biggest single-lever gain in the project.
The mechanism is “anticipated queries” shrinking vocabulary mismatch. When the raw memory says “switched from Postgres to SQLite for local dev” and the user later asks “what local database does the project use”, the anticipated-queries field generated at write time contains phrasings close to the query.
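A sketch of what an enriched record might look like. The class and field layout are assumptions, as is concatenating everything into one indexed string; the source only says the fields are indexed alongside the raw text. The enrichment values below are hand-written for illustration, not model output.

```python
from dataclasses import dataclass, field

@dataclass
class EnrichedChunk:
    raw_text: str
    gist: str = ""
    anticipated_queries: list = field(default_factory=list)
    entities: list = field(default_factory=list)
    temporal_markers: list = field(default_factory=list)

    def index_text(self) -> str:
        """Text fed to BM25 and the bi-encoder: raw text plus enrichment."""
        parts = [self.raw_text, self.gist, *self.anticipated_queries,
                 *self.entities, *self.temporal_markers]
        return "\n".join(p for p in parts if p)

    def rerank_text(self) -> str:
        """The cross-encoder still scores against the raw text only."""
        return self.raw_text

chunk = EnrichedChunk(
    raw_text="switched from Postgres to SQLite for local dev",
    gist="Local development database changed to SQLite.",
    anticipated_queries=["what local database does the project use"],
    entities=["Postgres", "SQLite"],
)
```

Keeping `rerank_text` separate means the enrichment can widen recall without letting generated text influence the final relevance judgment.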
Cost: $1.6 for 1000 entries on Haiku. $16 projected for full coverage. Prompt caching didn’t help (system prompt too short to cache on this model).
Part 9: porting the whole thing to Rust
Once the research stack stabilized, the Python implementation became the bottleneck. Cold-start was hundreds of milliseconds before the first byte of a query, every Claude Code hook re-imported torch + sentence-transformers, and DuckDB’s Python bindings were the slowest part of --all cross-project search. So I ported it to Rust.
The result is a Cargo workspace of eight crates. lethe-core holds the retrieval and storage primitives: BM25, RRF, k-means, flat ANN, RIF, tokeniser, all hand-written. ONNX inference is ort. Persistence is duckdb. No web framework, no async runtime in core, no service. The CLI (lethe-cli) embeds a ratatui TUI (lethe-tui). Two binding crates ship the same library to PyPI (lethe-py via PyO3) and npm (lethe-node via napi-rs). One cargo build produces every artifact; one release workflow fans out to crates.io, Homebrew, PyPI, and npm in parallel.
A few engineering things were worth the effort:
- Cold ONNX load is now ~600ms and dominates warm queries. The Python equivalent paid that on every subprocess invocation; the Rust binary pays it once per `lethe` shell process, and zero times per query inside the TUI.
- Cross-project search dropped from 6.3–7.3s to 1.7–1.8s on 9 local projects after batching the cross-encoder over the union of candidates instead of reranking per project. The Python version reranked per project and then merged.
- A shared `ort::Session` behind a `Mutex` will silently bottleneck rayon pipelines. I made that mistake in the first port. Batching across the union of candidates is much better than parallelism over per-project models.
- `duckdb` + `ATTACH ... READ_ONLY` is what makes `lethe search --all` work without any central service. Each project keeps its own DuckDB file; the CLI ATTACHes them at query time and joins. No sync, no shared infrastructure, every project still runs fully offline.
The interactive TUI is the part I use myself. lethe with no subcommand opens it when stdout is a terminal: type to search, ↑/↓ to nav, ⏎ to open the matched chunk, --all toggles cross-project. It’s the fastest way to confirm “did we ever talk about this?”.

The Rust port also kept the research and production code in sync. A parity bench (research_playground/rust_migration/) has three suites: end-to-end LongMemEval NDCG, per-component numerical diff (BM25 / FlatIP / cross-encoder logits), and cold-start + warm-retrieve latency. Each accepts --impl python or --impl rust and emits the same JSON shape; --compare runs both and writes a markdown report. I ran it on every Rust commit that touched retrieval until deltas stayed inside the tolerance band (|ΔNDCG| ≤ 0.005, |ΔRecall| ≤ 0.005, FlatIP top-30 Jaccard ≥ 0.99, |Δxenc| ≤ 1e-3). Python is now the reference, not the production path.
Plugins
The CLI is the foundation; the daily-use shape is the plugin layer.
Claude Code plugin. A Stop hook writes a one-paragraph summary of the session into .lethe/memory/YYYY-MM-DD.md. A recall skill surfaces prior memory when Claude decides the question would benefit from history; a recall-global skill searches across every registered project at once. Install: /plugin marketplace add teimurjan/lethe && /plugin install lethe. Both skills are gated by Claude’s own decision to call them, so there’s no token cost on queries that don’t need history.
Codex CLI plugin. Same shape, different host: ~/.codex/lethe/ with hooks + skills, installer patches ~/.codex/config.toml between sentinel markers (re-runs replace the previous block, so updates are clean). Codex doesn’t expose transcript summarization yet, so the recall skills work but the auto-summarize half is pending the format settling.
Polyglot release. pip install lethe-memory ships the PyPI binding. npm install @lethe-memory/lethe ships the napi-rs one. brew install lethe and cargo install lethe-cli ship the binary. All four track the same workspace version, bumped by release-please on conventional commit titles, published in parallel by one workflow.
Part 10: The BM25 tokenizer was a punctuation tax
A code-review pass during the Rust port (Codex flagged it) caught BM25 still tokenizing with text.lower().split() from checkpoint 8. Under split(), "MongoDB?" and "MongoDB" are different tokens, so every query ending in punctuation silently missed the corresponding corpus turn.
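The failure is easy to reproduce. A sketch comparing whitespace splitting with a regex tokenizer over `[A-Za-z0-9_]+`; the example sentences are made up.

```python
import re

TOKEN_RE = re.compile(r"[A-Za-z0-9_]+")

def split_tokenize(text):
    """Whitespace splitting: punctuation stays glued to the word."""
    return text.lower().split()

def regex_tokenize(text):
    """Keep only runs of letters, digits, and underscores."""
    return TOKEN_RE.findall(text.lower())

query = "Did we switch to MongoDB?"
turn = "We switched to MongoDB last week."

split_query = split_tokenize(query)  # ['did', 'we', 'switch', 'to', 'mongodb?']
regex_query = regex_tokenize(query)  # ['did', 'we', 'switch', 'to', 'mongodb']
shared_split = set(split_query) & set(split_tokenize(turn))
shared_regex = set(regex_query) & set(regex_tokenize(turn))
```

Under `split()` the query and the turn share only `we` and `to`. Under the regex they also share `mongodb`, the one token that carries the question.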
Swept four tokenizers on a 100-query random sample:
| Tokenizer | NDCG@10 | Recall@10 | BM25 build |
|---|---|---|---|
| baseline (`lower().split()`) | 0.302 | 0.373 | 11.2s |
| regex `[A-Za-z0-9_]+` | 0.339 (+3.7pp) | 0.441 (+6.8pp) | 10.4s |
| regex + stopword removal | 0.308 (+0.6pp) | 0.403 | 10.6s |
| regex + Porter stemming | 0.315 (+1.3pp) | 0.414 | 187.8s |
Two non-obvious things came out of the sweep.
Stopword removal regresses. Dropping function words (“the”, “of”, “is”) erases most of the gain. Conversational queries are short and specific; function words act as syntactic anchors, and removing them is net-harmful on this corpus. Standard IR advice from Wikipedia/news corpora doesn’t translate.
Porter stemming is a trap. Build cost jumps 17× (10s → 188s), NDCG tops out at +1.3pp, well below plain regex. Over-conflation (generate/general → gener) costs more than the vocabulary-compression gain on conversational text.
On the full 200-query headline benchmark the swap lifted the production pipeline from 0.368 → 0.382 NDCG@10 (BM25-only jumped from 0.242 to 0.317). Bigger single-lever quality win than clustered-RIF on the same corpus, measured on the same eval. The lift is punctuation-only: the 200k conversational turns are heavy on trailing ?, ., contractions, and hook-written session anchors that previously didn’t tokenize cleanly.
Side effect: re-running the RIF benchmark on this stronger baseline halved RIF’s relative gain (clustered+gap went from +6.5% to +3.4%) but improved the absolute NDCG (0.315 → 0.342) and Recall@30 (+4.8pp). Better-base-leaves-less-to-recover, the mechanism still net-positive. The cue-dependence of clustered RIF is what keeps it in the money when the base improves; global RIF is now indistinguishable from baseline.
Part 11: BM25 is the load-bearing component (six independent confirmations)
After enrichment + tokenizer + Rust port, I had a stable production stack at 0.382 NDCG@10 and a free hand to ablate the model layer. Two months and six experiments later, the conclusion is the same in every direction: BM25 is the load-bearing component on conversational long memory; the cross-encoder and bi-encoder are downstream fitters, not the limiter.
11a. Cross-encoder shootout
Held the bi-encoder at MiniLM-L6 and swapped the reranker. 50-query subsample.
| Reranker | Params | lethe_full NDCG@10 | Δ NDCG | Slowdown |
|---|---|---|---|---|
| `Xenova/ms-marco-MiniLM-L-6-v2` (current) | 22M | 0.376 | – | 1.0× |
| `Xenova/ms-marco-MiniLM-L-12-v2` | 33M | 0.378 | +0.14pp | 2.0× |
| `jinaai/jina-reranker-v1-turbo-en` | 38M | 0.376 | −0.07pp | 1.25× |
| `jinaai/jina-reranker-v2-base-multilingual` | 278M | 0.301 | −7.5pp | 6.8× |
Bigger models cost more and don’t deliver. Jina-v2-base-multilingual actively regresses by 7.5pp at 6.8× the cost. The “+10pt over MiniLM” claims on 2025 reranker leaderboards (mxbai-rerank-v2, jina-v2) are measured on BEIR / general IR; conversational chat memory is a different workload, and the calibration to MS-MARCO logits + BM25-shaped pools pins MiniLM-L6 on the local Pareto frontier. Bge-reranker-v2-m3 (568M) extrapolates to ~50s/query on this CPU; mxbai-rerank-base-v2 (500M Qwen2.5) ~40-60s/query. Both disqualified for interactive use without GPU.
11b. Bi-encoder swap
Held the cross-encoder at MiniLM-L6 and swapped to Xenova/bge-small-en-v1.5 (33M, 384D, CLS pooling, int8). Required new harness plumbing: BiEncoder::from_repo_full(repo, onnx_variant, pooling) + a cmd_prepare-embeddings subcommand to re-encode the 199k corpus into a parallel cache.
| Config | MiniLM | BGE-small | Δ |
|---|---|---|---|
| `vector_only` | 0.158 | 0.191 | +3.3pp |
| `bm25_only` | 0.358 | 0.358 | 0.0pp |
| `vector_xenc` | 0.227 | 0.263 | +3.6pp |
| `lethe_full` | 0.376 | 0.368 | −0.8pp |
BGE is genuinely a better embedder: vector_only +3.3pp, vector_xenc +3.6pp, exactly the magnitude MTEB / BEIR predict. The lift mostly washes out in lethe_full. Mechanism: lethe_full unions BM25(top-30) ∪ dense(top-30) → rerank top-60. On this corpus bm25_only (0.358) is 2.3× stronger than vector_only (0.158). BM25 dominates the rerank pool; better dense delivers more relevant docs (R@10 +1.7pp), but the cross-encoder, already saturating on what it can rank, can’t reorder them above the BM25-supplied entries it was already picking well.
11c. Multi-field BM25 + field-boosted BM25
The reranker + bi-encoder ablation pointed at “enrich BM25’s signal” as the remaining lever. The cheapest version: index body, entities, title separately and fuse via RRF, using regex extractors instead of LLM-generated text. URL / CamelCase / acronym / snake_case / file-path / version / hex hash / backtick spans for entities; first-non-empty-line truncated at 120 chars for title. Two formulations, both regressed.
| Config | NDCG@10 | R@10 | vs lethe_full |
|---|---|---|---|
| `bm25_only` (body) | 0.358 | 0.431 | – |
| `bm25_boost_only` (body + 2×entities + title) | 0.343 | 0.417 | (BM25-alone) |
| `lethe_full` (control) | 0.376 | 0.503 | baseline |
| `lethe_multifield` (body / entities / title / dense → equal-weight RRF → rerank) | 0.340 | 0.453 | −3.6pp / −5.0pp |
| `lethe_field_boost` (single BM25 over body + 2×entities + title, then dense union, rerank) | 0.343 | 0.463 | −3.3pp / −4.0pp |
The decisive number is bm25_boost_only 0.343 < bm25_only 0.358: the boost degrades the BM25 index itself, before the reranker sees anything. The failure is upstream of fusion. Three reasons: (1) naive concatenation breaks BM25 length normalization (b=0.75 penalizes body-term TF when entity tokens lengthen the doc; proper multi-field needs BM25F, not concatenation); (2) equal-weight RRF over four sources dilutes the dominant body signal (entities populate only 37% of chunks since chat memory is mostly prose, so RRF promotes entity-only matches body BM25 had correctly deprioritized); (3) regex entities are the wrong vocabulary bridge: users query in natural language (“how do I configure X”), not with the syntactic tokens regex extracts (BiEncoder, 4aae737, 0.10.0).
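Reason (1) can be seen in the term-frequency part of the standard Okapi BM25 formula. A sketch with b=0.75 as above and k1=1.2 as an assumed default; the IDF factor is omitted, and the average document length is held fixed for simplicity, which is roughly what happens when only a minority of chunks gain entity tokens.

```python
def bm25_tf_component(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Term-frequency part of BM25 (IDF omitted)."""
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + k1 * norm)

# A body term that occurs twice in a 100-token chunk of average length.
before = bm25_tf_component(tf=2, doc_len=100, avg_doc_len=100)  # 1.375
# Same chunk after concatenating 40 entity and title tokens (hypothetical count).
after = bm25_tf_component(tf=2, doc_len=140, avg_doc_len=100)   # about 1.236
```

The body term did not change, yet its score dropped by about 10% because the document got longer. BM25F avoids this by normalizing length per field before combining.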
The structural insight is what makes this informative as a negative: the kind of enrichment matters more than the field architecture. What checkpoint 17 produces (gist, anticipated_queries) succeeds for the same reason regex entities fail: anticipated queries are written in the user’s vocabulary, so they bridge the BM25 query–document gap directly. That’s the +8.3pp covered-bucket result; it can’t be cheaply substituted by regex.
11d. Late chunking with Nomic on Modal GPU
Hypothesis from Jina (2024): per-turn embeddings lose conversational context; encoding the whole session with a long-context model and mean-pooling per turn gives each turn a context-aware embedding. Cited gain on long-document benchmarks: +1.5–6.5 NDCG@10 absolute.
Local CPU encoding capped at 0.4–1.5 sessions/sec on a 139M-param nomic-embed-text-v1.5 int8, full prep ~14 hours. Built a Modal app instead: prep_late.py runs on a single L40S, 70 sessions/sec, ~3.5 min for the full 199,509-turn corpus, ~$0.30 spend. 99.98% of sessions fit in one 8192-token forward pass.
| Config | MiniLM (control) | BGE-small | nomic-v1.5 LATE | Δ vs control |
|---|---|---|---|---|
| `vector_only` | 0.158 | 0.191 | 0.110 | −4.8pp |
| `bm25_only` | 0.358 | 0.358 | 0.358 | 0 |
| `vector_xenc` | 0.227 | 0.263 | 0.197 | −3.0pp |
| `lethe_full` NDCG | 0.376 | 0.368 | 0.365 | −1.2pp |
| `lethe_full` R@10 | 0.503 | 0.520 | 0.483 | −2.0pp |
Two things to call out. The dense leg regressed 4.8pp on vector_only because the encoding script prepended search_document: to each turn before packing into one session, so the model saw [CLS] search_document: turn1 [SEP] search_document: turn2 [SEP] .... Nomic-v1.5 was trained with the prefix appearing once per document; multiple in-document prefixes are out-of-distribution and corrupt the per-turn pooled embeddings. Easy fix (single prefix at session start, per-turn spans excluding prefix tokens). I didn’t run it because the second observation makes the fix moot: lethe_full is flat within 50q bootstrap noise (±2–3pp). Even a totally different long-context dense embedder + session-level context awareness does not move the production metric.
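The pooling step of late chunking, including the fix described above (spans that skip the prefix tokens), can be sketched as follows. Token embeddings would come from one forward pass over the whole session; here they are hand-written 2-D toy vectors, and the span boundaries are hypothetical.

```python
def mean_pool_spans(token_embeddings, turn_spans):
    """One embedding per turn: average the token vectors inside each span."""
    pooled = []
    for start, end in turn_spans:  # end is exclusive
        span = token_embeddings[start:end]
        dim = len(span[0])
        pooled.append([sum(tok[d] for tok in span) / len(span) for d in range(dim)])
    return pooled

# Token 0 stands in for the single session-level prefix and is excluded from both spans.
token_embeddings = [
    [9.0, 9.0],                          # prefix token
    [1.0, 0.0], [3.0, 0.0],              # turn 1
    [0.0, 2.0], [0.0, 4.0], [0.0, 6.0],  # turn 2
]
turn_embeddings = mean_pool_spans(token_embeddings, [(1, 3), (3, 6)])
# [[2.0, 0.0], [0.0, 4.0]]
```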
Sixth independent confirmation of the BM25-dominance pattern. Cross-encoder swap, multi-reranker shootout, BGE bi-encoder swap, multi-field BM25, field-boosted BM25, and now late chunking: all six converge on the same ceiling for lethe_full on this corpus. The Modal harness stays in the repo (research_playground/late_chunking_modal/) as a reusable “cheap GPU embed” tool; the late-chunking question itself is closed for this workload.
11e. The latency knob the TUI exposed
One non-quality lever I did pull. Checkpoint 6 set adaptive deep-pass k_deep=200 and never measured smaller alternatives. Once the TUI made cross-project search user-visible (N × k_deep rerank cost in the worst case), I swept it on the production pipeline:
| config | NDCG@10 | Recall@10 | p50 | p95 |
|---|---|---|---|---|
| shallow-only | 0.287 | 0.349 | 1689 ms | 2371 ms |
| `k_deep=60` | 0.291 | 0.353 | 4159 ms | 5936 ms |
| `k_deep=100` | 0.302 | 0.373 | 5651 ms | 7440 ms |
| `k_deep=200` (old) | 0.302 | 0.373 | 9800 ms | 12650 ms |
k_deep=100 matches the old 200 on quality and cuts p50 by 42%, p95 by 41%. The cross-encoder’s top-10 picks stabilize by merged rank ~100 on this workload; ranks 101-200 never won a top-10 slot, so the old 200 was pure latency tax. Shipped as the new default; the knob is exposed on the MemoryStore / UnionStore constructors.
I also tried two hardware levers that didn’t pan out: int8-quantized BGE-small (4.9× synthetic throughput, 0.23× on real conversational turns because BGE’s 512-token cap doubles per-item compute on long turns); CoreML execution provider on Apple Silicon (10.5× slower on the bi-encoder because onnxruntime’s CoreML partitioner only covers ~72% of the graph and every forward pass pays a Metal/ANE round-trip that won’t amortize on a 22M-param model). Logged in the journal so future explorers don’t rediscover.
Part 12: what shipped (current)
The stack that survived all of this:
- Storage. Markdown files under .lethe/memory/*.md, one day per file, ##/### sections as chunks. Readable with cat, editable with any editor. Content-hashed for incremental reindex.
- Index. DuckDB, one file per project at .lethe/index/lethe.duckdb. BM25 tokens (regex-tokenized), FAISS-equivalent dense vectors, k-means cluster centroids, per-cluster RIF suppression state, all in one file. No server, no external vector DB.
- Cross-project search. DuckDB ATTACH ... READ_ONLY. lethe search --all opens every registered project’s DuckDB simultaneously, fans the cross-encoder over the union of candidates (one batched rerank pass, not N), merges via RRF.
- Retrieval. BM25 (regex tokenizer) + dense (MiniLM-L6 ONNX) + cross-encoder rerank (ms-marco MiniLM-L6) with k_deep=100, clustered RIF (rank-gap, 30 clusters, query-based centroids), 0.95 cosine dedup. Optional Haiku enrichment at write time.
- Distribution. Single Rust binary (lethe, ratatui TUI when invoked with no args). PyPI binding (lethe-memory, PyO3). npm binding (lethe, napi-rs). Claude Code plugin (Stop-hook auto-summarization + recall/recall-global skills). Codex CLI plugin (same skills, transcript summarization pending).
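For readers unfamiliar with the RRF merge mentioned in the list above, here is a minimal sketch of Reciprocal Rank Fusion in Python. It is illustrative only: the function name, the toy ids, and the constant k=60 (the conventional default) are assumptions, and how lethe actually wires RRF into its pipeline is not shown in this post.

```python
from collections import defaultdict


def rrf_merge(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank(d)).

    rankings: list of ranked id lists, best first.
    k: smoothing constant; 60 is the conventional default (assumed here,
       not taken from lethe).
    """
    scores = defaultdict(float)
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; break ties by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Toy rankings (hypothetical ids) from a lexical leg and a dense leg.
bm25_hits = ["a", "b", "c", "d"]
dense_hits = ["c", "a", "e"]
merged = rrf_merge([bm25_hits, dense_hits])
print(merged)
```

Because RRF only looks at ranks, it needs no score calibration between BM25 and cosine similarity, which is why it is a common choice for merging heterogeneous result lists.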
The name is from the Greek river of forgetting. The store is as much defined by what it suppresses as by what it retrieves.
Benchmark methodology note
The numbers above are NDCG@10 on turn-level retrieval over the full 199,509-turn LongMemEval S corpus. That is needle-in-haystack search against 200k candidates.
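As a reminder of what the metric measures, here is a minimal sketch of NDCG@10 with binary relevance. The binary-gain choice and the function name are assumptions for illustration; the exact gain formula behind the reported numbers is not stated in this post.

```python
import math


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """NDCG@k with binary relevance (an assumption; graded gains also exist)."""
    relevant = set(relevant_ids)
    # Discounted gain of each relevant hit, by its 1-based position.
    dcg = sum(
        1.0 / math.log2(pos + 1)
        for pos, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant
    )
    # Ideal ordering puts every relevant item at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 1) for pos in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# One relevant turn, retrieved at position 2 (hypothetical ids).
score = ndcg_at_k(["t7", "t2", "t9"], ["t2"])
print(round(score, 3))
```

Unlike recall@k, the score drops as the relevant turn slides down the list, so it penalizes a correct answer at rank 9 relative to rank 1.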
Other memory-tool benchmarks commonly report recall@5 on per-query fresh databases of roughly 50 sessions at session granularity. That is a roughly 2000x easier task (random baseline 10% vs 0.005%). Some implementations additionally leak ground truth via annotation fields at indexing time. Published numbers in the 95-99% range on that methodology are state-of-the-art for that methodology. They are not directly comparable to anything here.
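The "roughly 2000x" figure follows from simple arithmetic on the two setups. A short sketch, assuming a single relevant item per query and uniformly random retrieval (both simplifications, and presumably how the baselines above were derived):

```python
def random_hit_rate(k, pool_size):
    """Chance that one specific relevant item lands in k uniformly random picks."""
    return k / pool_size


session_level = random_hit_rate(5, 50)       # recall@5 over ~50 sessions
turn_level = random_hit_rate(10, 199_509)    # top 10 over the full turn corpus
ratio = session_level / turn_level
print(f"{session_level:.1%} vs {turn_level:.4%}, ratio ~{ratio:.0f}x")
```

The ratio is a statement about chance-level difficulty only; it says nothing about how any particular system would score under the other methodology.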
A head-to-head on a shared methodology in either direction is a separate experiment that has not been run.
What’s next
Reranker swap, bi-encoder swap, rerank-pool widening, multi-field BM25, late chunking: all dropped off the candidate list. Six independent ablations confirm none is the limiter on this corpus. The remaining open directions are either expensive (LLM enrichment) or scope-defining (cross-dataset replication, head-to-head benchmarks).
- Scale enrichment to full answer-relevant coverage. ~$16 + a few hours on Haiku. Confirms the covered-bucket numbers at scale; should produce a clean +8pp NDCG story instead of the diluted partial-coverage one. Reinforced by the multi-field BM25 negative: cheap regex substitutes for anticipated-queries don’t carry the lift, so paying for actual LLM-generated enrichment is the only path that gives BM25 the right signal.
- Replicate clustered RIF on a second long-term conversation memory benchmark (LoCoMo, MSC, LongMemEval M). The NFCorpus negative shows the mechanism doesn’t transfer to ad-hoc IR; a second in-scope dataset would strengthen the conversational-memory claim independently.
- Head-to-head against other memory tools on shared methodology. Run lethe under their setup (per-query ~50 sessions, recall@5), and theirs under LongMemEval S. Honest comparison, honest methodology.
- Move failure modes that haven’t budged. Sibling confusion (within-session embedding granularity) and stale fact (temporal awareness). Both need different mechanisms from RIF or enrichment. Candidates: session-structured reranking, temporal-aware tie-breaking, explicit fact extraction with validity windows.
The lesson
There are two now.
The first (still): check the bottleneck before extending the mechanism. For three months I tuned a mutation layer that couldn’t matter regardless of tuning, because the component that set the ceiling (the cross-encoder reranker on 30 candidates) was not being fed the right inputs. An hour of recall measurement reframed three months of work.
The second: when one component dominates, model-layer swaps are theatre. Two months of post-enrichment ablations (better reranker, better bi-encoder, multi-field BM25, late chunking on a GPU) all bounced off the same ceiling because BM25 was supplying the candidates the reranker was already ranking well. 2025 RAG-survey advice (“just swap to mxbai-rerank-v2”, “just swap to BGE”, “just use late chunking”) implicitly assumes a balanced pipeline. Conversational long memory isn’t balanced: bm25_only is 2.3× stronger than vector_only on this corpus, so the dense leg can lift on isolated benchmarks (vector_only, vector_xenc) without ever displacing a BM25-supplied entry from the rerank pool. The lever is in what feeds BM25, not in the model on top.
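Both diagnostics are cheap to write. The sketch below is a minimal, hypothetical version: candidate_recall asks whether the right entry ever reaches the reranker, and leg_share counts which retrieval leg supplied the final results. The function names, data shapes, and toy ids are assumptions for illustration, not lethe's code.

```python
def candidate_recall(pools, gold):
    """Fraction of queries whose candidate pool contains at least one gold id.

    pools: {query_id: list of candidate ids handed to the reranker}
    gold:  {query_id: set of ids that answer the query}
    """
    hits = sum(1 for q, pool in pools.items() if gold[q] & set(pool))
    return hits / len(pools)


def leg_share(final_top, bm25_pool, dense_pool):
    """Count how many final results each retrieval leg supplied."""
    bm25, dense = set(bm25_pool), set(dense_pool)
    counts = {"bm25_only": 0, "dense_only": 0, "both": 0}
    for doc_id in final_top:
        if doc_id in bm25 and doc_id in dense:
            counts["both"] += 1
        elif doc_id in bm25:
            counts["bm25_only"] += 1
        elif doc_id in dense:
            counts["dense_only"] += 1
    return counts


# Toy data (hypothetical): q2's answer never reaches the reranker.
pools = {"q1": ["a", "b"], "q2": ["c"]}
gold = {"q1": {"b"}, "q2": {"z"}}
recall = candidate_recall(pools, gold)
share = leg_share(["a", "b", "c"], bm25_pool=["a", "b"], dense_pool=["b", "c"])
print(recall, share)
```

If candidate recall is low, no reranker swap can help; if dense_only stays near zero in the final results, a better bi-encoder cannot move the end-to-end number either.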
Both lessons rhyme. The interesting question is rarely “which model is better”; it’s “which component is currently the rate limiter, and why.” Until you know that, every swap is noise.
This article was originally published by Teimur Gasanov on HackerNoon.
