# Recursive Language Models: How RLMs Work

**Plutonous** | May 25, 2026 | 24 min read



Tags: Recursive Language Models, RLM, Long Context, AI Research, REPL, Inference Scaling, Context Window, AI Agents, MIT OASYS, arXiv, LLM Architecture, Enterprise AI

---


**TL;DR:** This article covers a single research paper: **Recursive Language Models** by Alex L. Zhang, Tim Kraska, and Omar Khattab. The idea is not a new transformer architecture. It is an inference-time scaffold around an existing model. Instead of forcing a huge prompt into the model's context window, an RLM stores the prompt in an external environment, usually a Python REPL, then lets the model inspect, slice, transform, and recursively call language models on the parts that matter.<sup><a href="#source-1">[1]</a></sup>

That sounds like a small systems trick. It is not. It changes the long-context question from "how many tokens fit?" to "what program should the model run over this information?"

---

Every AI company now sells a bigger context window. 200K tokens. 1M tokens. 2M tokens. The pitch is simple: give the model more text and it will remember more.

Anyone who has run long coding sessions, research agents, legal-document chats, or giant PDF workflows knows the uncomfortable truth. A model can technically accept a long prompt while still failing to use it well. Facts get buried. Earlier constraints become soft. Summaries erase details. The model starts treating the context like background texture instead of working memory.

Recursive Language Models are a direct attack on that failure mode. The paper's core move is blunt: stop pretending the neural network should directly ingest everything. Put the prompt somewhere the model can operate on it programmatically.

> **What The Paper Introduces**
>
> Recursive Language Models propose a general inference paradigm where long prompts live in an external environment. The model gets a handle to that environment, writes code against it, recursively queries language models over slices of the prompt, stores intermediate state, and returns a final answer.<sup><a href="#source-1">[1]</a></sup>


## The Problem: Long Context Is Not Working Context

There are two separate problems hiding under "long context."

The first is capacity. Can the model fit the input at all? If the prompt is 600K tokens and the model supports 272K tokens, the answer is no.

The second is use. Even if the model can fit the input, can it reason over the relevant parts, preserve fine details, and perform the right amount of semantic work? The RLM paper argues that this second problem is the more interesting one. On simple needle-in-a-haystack tasks, a model can often retrieve one fact from a huge prompt. On dense tasks where the answer depends on many lines, many documents, or many pairs of objects, performance degrades much faster.<sup><a href="#source-1">[1]</a></sup>

That distinction explains why bigger windows have not solved long-context work. A contract review, codebase migration, literature synthesis, incident postmortem, or support-ticket analysis is rarely a single needle. It is a task that requires many partial readings, cross-checks, labels, comparisons, and local decisions.

**10M+ tokens:** the input scale at which RLMs were evaluated.


## What An RLM Actually Is

An RLM is a language model wrapped in a runtime.

The outside interface stays familiar: send a prompt, get a response. Under the hood, the prompt is loaded into an environment as a variable. The model does not initially see the whole thing. It sees metadata: the prompt length, maybe a prefix, available functions, and instructions for how to inspect the environment. Then it writes code.

That code can read slices of the prompt, search it, chunk it, call a sub-model on selected spans, save intermediate results, and eventually return a final answer from a variable. In the authors' implementation, the environment is a Python REPL, but the paper frames the environment more generally.<sup><a href="#source-1">[1]</a></sup>
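As a rough illustration of "the prompt is a variable and the model sees only metadata", here is a minimal sketch. The class name `PromptEnvironment` and its `metadata`, `peek`, and `grep` helpers are hypothetical names chosen for this example; they are not the paper's API.

```python
from dataclasses import dataclass


@dataclass
class PromptEnvironment:
    """Holds the full prompt outside the model's context (hypothetical helper)."""
    prompt: str

    def metadata(self, prefix_chars: int = 80) -> dict:
        # Only this small summary is shown to the root model at the start.
        return {
            "length_chars": len(self.prompt),
            "prefix": self.prompt[:prefix_chars],
            "available_functions": ["peek", "grep"],
        }

    def peek(self, start: int, end: int) -> str:
        # Read a bounded slice of the prompt by character offsets.
        return self.prompt[start:end]

    def grep(self, needle: str) -> list:
        # Return the index of every line that contains the needle.
        return [i for i, line in enumerate(self.prompt.splitlines()) if needle in line]


# Made-up prompt: 10,000 short lines that would be wasteful to read in full.
env = PromptEnvironment(prompt="\n".join(f"line {i}: value={i * i}" for i in range(10_000)))
meta = env.metadata()
```

The root model would receive something like `meta` plus instructions, and decide which slices are worth reading before spending any tokens on the raw text.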

The mental model is closer to a researcher with a notebook than a model with a giant memory buffer.

The important move is not "use tools." Tool use already exists. The important move is that recursive model calls become programmable inside the environment. The model can write a loop that calls another model over every section, every document, every pair of chunks, or every file that passes a filter.

That is what gives the scaffold a different shape from a chat model with a search button.

## The Basic RLM Loop

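The RLM paper describes a general algorithm, then instantiates it with a Python REPL. In practice, the loop runs as follows: the root model receives metadata about the prompt, writes code against the environment, observes the output, and repeats until it stores a final answer in a variable.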
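A minimal sketch of such a loop, under stated assumptions: `run_rlm`, the `FINAL` variable convention, and the `llm_query` name are illustrative choices for this example, and both models are scripted stand-ins so the code runs without any API. It uses bare `exec`, which a real system would replace with a sandbox.

```python
import contextlib
import io


def run_rlm(prompt, root_model, sub_model, max_steps=8):
    """Drive a root model that writes code against an external prompt."""
    # The prompt lives in the environment, not in the root model's context.
    env = {"prompt": prompt, "llm_query": sub_model, "FINAL": None}
    history = [f"prompt is a str of {len(prompt)} chars; set FINAL to finish"]
    for _ in range(max_steps):
        code = root_model(history)            # root model proposes the next code cell
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, env)               # NOTE: real systems need a sandbox here
            observation = buffer.getvalue()
        except Exception as err:              # errors are fed back instead of crashing
            observation = f"error: {err!r}"
        history.append(observation[:500])     # the root sees only truncated output
        if env["FINAL"] is not None:
            return env["FINAL"]
    raise RuntimeError("step budget exhausted without a final answer")


def scripted_root(history):
    """Stand-in root model: replays three fixed code cells."""
    steps = [
        "print(len(prompt.splitlines()))",
        "hits = [l for l in prompt.splitlines() if 'ERROR' in l]\nprint(len(hits))",
        "FINAL = llm_query('Summarize: ' + ' | '.join(hits))",
    ]
    return steps[len(history) - 1]


def fake_sub_model(text):
    """Stand-in sub-model: reports how much text it was given."""
    return f"sub-model saw {len(text)} chars"


log = "\n".join(["INFO ok"] * 50 + ["ERROR disk full", "ERROR timeout"])
answer = run_rlm(log, scripted_root, fake_sub_model)
```

Note that the sub-model receives only the two filtered lines, while the root model never sees the log itself, only printed counts.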


The root model is not being asked to hold the entire prompt in its context. It is being asked to operate a workspace.

## A Concrete Example

Imagine asking a model:

> Across these 2,000 customer support tickets, which product areas are most associated with refund threats, and what examples support each category?

A direct long-context prompt asks the model to read everything at once and answer. A compaction approach summarizes chunks, then summarizes the summaries. A retrieval approach searches for "refund" and related terms.

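An RLM can instead write a workflow: filter the tickets with code, label the candidates with sub-model calls, then aggregate the labels in code while keeping supporting examples.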
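A minimal sketch of that kind of workflow, assuming four made-up tickets and a keyword-based `classify_area` function standing in for a real sub-model call:

```python
import re
from collections import defaultdict

# Made-up sample data for illustration only.
tickets = [
    {"id": 1, "text": "Billing charged me twice, I want a refund or I cancel."},
    {"id": 2, "text": "Love the new dashboard!"},
    {"id": 3, "text": "Sync keeps failing. Give me my money back."},
    {"id": 4, "text": "Invoice PDF is wrong again, refund this month."},
]


def classify_area(text: str) -> str:
    """Stand-in for a sub-model call that labels the product area."""
    lowered = text.lower()
    if "billing" in lowered or "invoice" in lowered:
        return "billing"
    if "sync" in lowered:
        return "sync"
    return "other"


# Step 1: cheap symbolic filter for likely refund threats.
refund_pattern = re.compile(r"refund|money back|chargeback", re.IGNORECASE)
candidates = [t for t in tickets if refund_pattern.search(t["text"])]

# Step 2: semantic labelling, one bounded call per candidate.
by_area = defaultdict(list)
for ticket in candidates:
    by_area[classify_area(ticket["text"])].append(ticket["id"])

# Step 3: exact aggregation in code, keeping ticket ids as evidence.
report = sorted(by_area.items(), key=lambda item: len(item[1]), reverse=True)
```

A keyword filter alone would miss paraphrased threats, so a real run would likely also sample the unmatched tickets with sub-model calls.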


That workflow is simplified, but it captures the important shape. The model gets to combine symbolic filtering with semantic classification. It can call the LLM when meaning matters and use code when exact operations matter.

The strongest RLM workflows usually alternate between both.

## Why Recursion Changes The Scaling Story

Chain-of-thought reasoning made inference-time compute feel normal. Instead of asking a model to answer immediately, you let it spend more tokens thinking. RLMs push that idea into a different dimension: instead of only scaling the length of the internal reasoning trace, they scale the amount of external semantic work the system can perform.

A standard model call has one context window and one output stream. An RLM can use the root model to orchestrate many bounded model calls, keep the partial results in program variables, and perform symbolic operations around them. That makes it possible to do work proportional to the input size, or even proportional to pairs of input chunks, without placing the whole intermediate process inside one neural context window.<sup><a href="#source-1">[1]</a></sup>

This is why the paper's OOLONG-Pairs result is so revealing. The base GPT-5 and Qwen3-Coder models made almost no progress on the pairwise aggregation task, while RLM versions produced substantial F1 scores by recursively decomposing the work.<sup><a href="#source-1">[1]</a></sup>
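To see why pairwise tasks are so much heavier than per-chunk tasks, it helps to count the sub-calls. This sketch uses an arbitrary 1,000-record input and a chunk size of 50; neither number comes from the paper.

```python
from itertools import combinations


def chunk(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


records = [f"record {i}" for i in range(1_000)]
chunks = chunk(records, size=50)

# One sub-call per chunk: work grows linearly with the input.
linear_calls = len(chunks)

# One sub-call per unordered pair of chunks: work grows quadratically.
pair_calls = sum(1 for _ in combinations(range(len(chunks)), 2))
```

With 20 chunks this gives 20 per-chunk calls but 190 pairwise calls, and the gap widens as the input grows.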


## The Key Difference: Prompt As Environment

The cleanest idea in the paper is that arbitrarily long prompts should not be fed into the transformer directly. They should be treated as part of the environment.

That reframing is the whole paper.

In a normal LLM call, the prompt is text inside the model's temporary memory. In an RLM, the prompt is an object. It can be indexed. It can be split. It can be transformed. It can be passed by reference through a program. The root model is no longer asked to memorize everything; it is asked to decide how to inspect and delegate.

This has a practical consequence: context management becomes adaptive. A human engineer does not read a repository by pasting all files into one window. They grep, open files, inspect call sites, run tests, write notes, and zoom in when a local region matters. RLMs give a model a version of that workflow inside a single model-like API call.

> **The Actual Breakthrough**
>
> RLMs turn long-context processing from a storage problem into an execution problem. The question stops being "how many tokens fit?" and becomes "what program should the model run over this prompt?"


## Why This Is Not Just Retrieval

Retrieval is powerful when the relevant evidence can be found with a query. It is weaker when relevance is created by the reasoning process itself.

Suppose the task is "find the contract clauses that contradict each other." A search system can retrieve clauses containing the same terms. It cannot reliably know which clauses conflict until it has compared meanings across sections. Suppose the task is "count how many support tickets imply billing confusion." A keyword index will miss paraphrases. A model needs to semantically label each row, then aggregate.

RLMs can still use retrieval-like strategies. They can grep, rank, and filter. The difference is that retrieval becomes one tool inside a larger program. The root model can choose to search first, then call sub-models to classify candidates, then compare summaries, then go back to the raw prompt when a conflict appears.
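The contract example can be sketched as retrieval followed by pairwise semantic comparison. The clauses are invented, and `clauses_conflict` is a regex-based stand-in for what would be a sub-model judgement in practice.

```python
import re
from itertools import combinations

# Made-up clauses for illustration.
clauses = {
    "2.1": "Payment is due within 30 days of invoice.",
    "7.4": "Payment is due within 60 days of invoice.",
    "9.2": "Either party may terminate with 90 days notice.",
}


def clauses_conflict(a: str, b: str) -> bool:
    """Stand-in for a sub-model call: flags differing day counts."""
    return re.findall(r"(\d+) days", a) != re.findall(r"(\d+) days", b)


# Retrieval step: narrow to clauses that mention payment.
candidates = {cid: text for cid, text in clauses.items() if "payment" in text.lower()}

# Semantic step: compare every pair of retrieved clauses.
found = [
    (id_a, id_b)
    for (id_a, text_a), (id_b, text_b) in combinations(candidates.items(), 2)
    if clauses_conflict(text_a, text_b)
]
```

Retrieval shrinks the candidate set; the comparison that actually establishes the conflict happens afterwards, inside the same program.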

## Why This Is Not Just Agents Either

Modern coding agents already place repositories on disk and let models inspect files. That looks close to RLMs. The difference is where recursion happens.

Many agents let a model choose a tool call, observe output, and continue. Some let it delegate to sub-agents. But those delegations are often explicit, sequential actions in the conversation. RLMs make model calls callable from code. That means a generated Python loop can produce hundreds or thousands of semantic sub-queries from slices of the prompt, store their results, and combine them.

This is a sharper abstraction. It treats the model as a function available to the program, not merely as the actor writing the next chat turn.
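A small sketch of "the model as a function available to the program": one generated loop issues a sub-query per row. `llm_query` is a hypothetical name with a deterministic stand-in body, and the rows are synthetic.

```python
from concurrent.futures import ThreadPoolExecutor


def llm_query(prompt: str) -> str:
    """Hypothetical sub-model call; here a deterministic stand-in."""
    return "billing" if "invoice" in prompt.lower() else "other"


# Synthetic rows: every third ticket mentions an invoice problem.
rows = [
    f"ticket {i}: invoice mismatch" if i % 3 == 0 else f"ticket {i}: slow page"
    for i in range(300)
]

# One semantic sub-query per row, issued from code rather than chat turns.
with ThreadPoolExecutor(max_workers=8) as pool:
    labels = list(pool.map(llm_query, rows))

billing_count = labels.count("billing")
```

The 300 results land in an ordinary list, so counting and filtering them is exact code rather than another model turn.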


## Recursion Depth: 0, 1, 2+

The paper uses recursion depth to describe what kinds of calls are allowed.

Depth 0 means the model gets the REPL and external prompt variable, but cannot make recursive model calls. It can still search, parse, transform, and compute. This alone helps on some tasks because the context is no longer stuffed into the model window.

Depth 1 means the root model can call regular LLMs inside the REPL. This is the central setting in many of the paper's results.

Depth greater than 1 means the root can call sub-RLMs, which can themselves inspect their own environments and recurse. This is more powerful, but also easier to mismanage. More depth can amplify syntax errors, bad decompositions, and cost spikes.


The practical lesson is that more recursion is not automatically better. RLM engineering needs cost budgets, call limits, and robust execution handling.
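One way to picture a depth cap combined with a call limit is the sketch below. It is a loose illustration, not the paper's definition of depth: `solve`, `CallBudget`, and `leaf_model` are hypothetical names, and the leaf "model" just counts a substring.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a run asks for more sub-model calls than allowed."""


class CallBudget:
    """Counts sub-model calls across the whole recursive run."""

    def __init__(self, max_calls: int):
        self.max_calls = max_calls
        self.used = 0

    def spend(self) -> None:
        if self.used >= self.max_calls:
            raise BudgetExceeded(f"more than {self.max_calls} calls requested")
        self.used += 1


def leaf_model(text: str) -> int:
    """Stand-in for a regular LLM call: counts 'ERROR' mentions."""
    return text.count("ERROR")


def solve(lines, depth, max_depth, budget, leaf_size=4):
    # At the depth limit, or when the slice is small, make one plain call.
    if depth >= max_depth or len(lines) <= leaf_size:
        budget.spend()
        return leaf_model("\n".join(lines))
    # Otherwise split the slice and recurse one level deeper.
    mid = len(lines) // 2
    return (solve(lines[:mid], depth + 1, max_depth, budget, leaf_size)
            + solve(lines[mid:], depth + 1, max_depth, budget, leaf_size))


lines = ["ERROR a" if i % 5 == 0 else "ok" for i in range(32)]
budget = CallBudget(max_calls=16)
total = solve(lines, depth=0, max_depth=3, budget=budget)
```

Sharing a single budget object across all levels means a bad decomposition fails loudly with `BudgetExceeded` instead of silently fanning out.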

## What The Root Model Learns To Do

The root model is the controller. Its job is not to answer immediately. Its job is to choose a decomposition strategy.

In the paper's trajectory analysis, RLMs often probe the context first, then choose a decomposition pattern. On BrowseComp-Plus, the model uses priors and programmatic narrowing to reduce the search space before launching sub-calls. On dense tasks, it may perform semantic transformations line by line or chunk by chunk.<sup><a href="#source-1">[1]</a></sup>

That behavior creates a new training target. You can train a model not only to answer, but to be a better RLM controller: inspect structure, write valid code, choose chunk sizes, limit cost, avoid redundant calls, recover from errors, and build good buffers.

The paper's RLM-Qwen3-8B experiment is important for this reason. The authors fine-tuned Qwen3-8B on 1,000 filtered RLM trajectories distilled from a larger model. The resulting RLM-Qwen3-8B substantially improved over the base Qwen3-8B when used as an RLM, even on downstream tasks outside the training domain.<sup><a href="#source-1">[1]</a></sup>


> **The Training Implication**
>
> The scarce skill may not be "knowing more facts." It may be knowing how to operate the recursive workspace: when to inspect, when to split, when to call a model, and when to stop.


## How Industry Already Uses RLM-Shaped Ideas

No major AI platform publicly markets "RLMs" as a product category yet, but the industry is converging on the ingredients.

OpenAI's platform has hosted tools for file search, web search, and code interpreter. File Search lets models retrieve from uploaded knowledge bases through semantic and keyword search, while Code Interpreter gives the model a sandboxed Python container with uploaded or generated files.<sup><a href="#source-4">[4]</a></sup><sup><a href="#source-5">[5]</a></sup> That is not the RLM algorithm, but it is the same direction: move state and computation out of the raw prompt and into an environment the model can operate.

Anthropic's Claude Code is another close analogue. It reads codebases, edits files, runs commands, integrates through MCP, stores project instructions and memories, and can spawn subagents for side tasks that would otherwise flood the main context.<sup><a href="#source-6">[6]</a></sup><sup><a href="#source-7">[7]</a></sup> The point is not that Claude Code is secretly an RLM. The point is that serious coding agents already treat the filesystem, shell, repo history, memories, and tool outputs as external context surfaces.

Cursor is even more explicit about the context-engineering side. Its dynamic context discovery work writes long tool outputs to files, lets the agent search chat history files after summarization, loads MCP tool descriptions on demand, and treats terminal history as searchable files.<sup><a href="#source-8">[8]</a></sup> Cursor's codebase indexing research also shows why this matters commercially: semantic search over large repositories is a core driver of agent performance, and fast index reuse can cut time-to-first-query from hours to seconds on very large repos.<sup><a href="#source-9">[9]</a></sup>

Sourcegraph Cody and GitHub Copilot cloud agent show the same industry pressure from different angles. Cody pulls codebase context through Sourcegraph search, code graph signals, keyword search, and explicit context selection.<sup><a href="#source-10">[10]</a></sup> GitHub Copilot cloud agent can be assigned issues, work in the background, push changes to a pull request, and use custom agents with specialized behavior and tools.<sup><a href="#source-11">[11]</a></sup>

Cloud platforms are building the enterprise version of this stack. Microsoft Foundry Agent Service offers tool catalogs that include web search, code interpreter, file search, Azure AI Search, MCP, OpenAPI tools, browser automation, and computer use.<sup><a href="#source-12">[12]</a></sup> Amazon Bedrock Agents combine action groups, knowledge bases, prompt templates, traces, and deployment aliases.<sup><a href="#source-13">[13]</a></sup> Google Gemini Enterprise Agent Platform combines RAG, vector search, managed agents, stateful sessions, persistent memory, code execution, registry, identity, gateway, governance, and observability.<sup><a href="#source-14">[14]</a></sup>


The industry lesson is straightforward: everyone is trying to reduce prompt stuffing. RLMs give that movement a clean research abstraction. They say the prompt should be an external environment, and the model should learn to operate over it.

## The Production Architecture

If you strip away branding, production agent systems are already decomposing RLM-like behavior into five layers.

The first layer is the **execution sandbox**. The moment the model writes code, the runtime has to assume the code is untrusted. Sandboxes, permissions, network controls, and file-system isolation become part of the model architecture.

The second layer is the **external context store**. This may be a vector store, code index, filesystem, document store, chat log, terminal log, or memory directory. The important pattern is that large state lives outside the model context, and the model receives handles rather than giant copies.

The third layer is **semantic delegation**. Subagents, handoffs, function calls, and recursive sub-queries all express the same pressure: one model context should not do every piece of semantic work directly.

The fourth layer is **durable state**. LangGraph's persistence layer saves graph state as checkpoints at each execution step, enabling memory, human-in-the-loop review, time travel debugging, fault tolerance, and resumption after failures.<sup><a href="#source-15">[15]</a></sup> Without durable state, long-running recursive work becomes fragile.

The fifth layer is **observability and governance**. Enterprises do not want opaque recursive fan-out. They want traces, budgets, tool policies, identities, approval gates, and audit logs.
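To make the durable-state layer concrete, here is a minimal checkpoint-and-resume sketch using a JSON file. It is not LangGraph's API; `process_chunks` and the `fail_at` switch are hypothetical, and the write is not atomic, which a production system would need to address.

```python
import json
import os
import tempfile


def process_chunks(chunks, checkpoint_path, worker, fail_at=None):
    """Process chunks in order, saving progress after every step."""
    state = {"next_index": 0, "results": []}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            state = json.load(fh)             # resume from the last saved step
    for i in range(state["next_index"], len(chunks)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated crash")
        state["results"].append(worker(chunks[i]))
        state["next_index"] = i + 1
        with open(checkpoint_path, "w") as fh:
            json.dump(state, fh)              # checkpoint after each chunk
    return state["results"]


chunks = ["a b", "c d e", "f", "g h i j"]
calls = []


def worker(chunk):
    """Stand-in for an expensive sub-model call: counts words."""
    calls.append(chunk)
    return len(chunk.split())


path = os.path.join(tempfile.mkdtemp(), "rlm_checkpoint.json")
try:
    process_chunks(chunks, path, worker, fail_at=2)   # first run crashes midway
except RuntimeError:
    pass
results = process_chunks(chunks, path, worker)         # second run resumes
```

The second run picks up at the third chunk, so no chunk is sent to the worker twice.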


## Where RLMs Will Matter

RLMs are strongest when three things are true:

1. The input is too large or too dense for one reliable direct model call.
2. The relevant evidence is distributed across many regions.
3. The system must perform semantic operations on many pieces, not just retrieve a few passages.

The paper's benchmark choices make this clear. BrowseComp-Plus requires multi-hop reasoning across a large offline corpus. OOLONG requires semantically labeling and aggregating many entries. OOLONG-Pairs requires pairwise aggregation. LongBench-v2 CodeQA requires reasoning over code repositories.<sup><a href="#source-1">[1]</a></sup>

Those are not "find one sentence" problems. They are "process the corpus" problems.


## Use Case 1: Full-Codebase Understanding

Codebases are almost perfect RLM material.

A repository is already an external environment. It has files, symbols, imports, tests, docs, generated artifacts, and history. A strong engineer does not paste the repo into a chat window. They inspect structure, search call sites, read implementations, run tests, and build a working model.

An RLM can imitate that workflow inside one request: list files and dependency boundaries, search for relevant symbols, open local regions around definitions, call sub-models to summarize modules, compare behavior across call sites, and assemble an answer with citations to files and lines.

This is where industry is furthest along. Cursor builds semantic indexes over codebases. Sourcegraph Cody uses search and code graph relationships to make answers codebase-aware. GitHub Copilot cloud agent can be assigned an issue, work in the background, push a branch, and open a pull request. Claude Code reads codebases, edits files, runs commands, handles git operations, and connects external tools through MCP.<sup><a href="#source-6">[6]</a></sup><sup><a href="#source-9">[9]</a></sup><sup><a href="#source-10">[10]</a></sup><sup><a href="#source-11">[11]</a></sup>

The RLM angle is not "coding agents should exist." They already do. The RLM angle is that a coding agent could formalize repository work as recursive semantic passes: one pass to map modules, another to classify risky call paths, another to inspect tests, another to synthesize migration steps.

## Use Case 2: Deep Research Over Offline Corpora

Search engines are excellent when the web can be queried live. Enterprise research often does not look like that. The relevant material may be a private folder of PDFs, internal memos, call transcripts, emails, market reports, and tables.

RLMs can treat that corpus as the prompt environment. The root model can inspect metadata, narrow candidate documents, ask sub-models to extract claims from batches, then reconcile conflicts. This resembles retrieval-augmented generation, but retrieval is not the whole workflow.

The paper's BrowseComp-Plus setup is a good proxy: 1,000 documents are supplied as input, and the answer requires associating evidence across documents. RLM(GPT-5, depth=1) scored 91.3% in the main table, beating the compaction agent at 70.5% and CodeAct with BM25 at 51.0%.<sup><a href="#source-1">[1]</a></sup>

Industry already offers the retrieval half of this. OpenAI File Search, Microsoft Foundry File Search and Azure AI Search, Google RAG Engine and Vector Search, and Amazon Bedrock knowledge bases all help agents ground answers in private corpora.<sup><a href="#source-4">[4]</a></sup><sup><a href="#source-12">[12]</a></sup><sup><a href="#source-13">[13]</a></sup><sup><a href="#source-14">[14]</a></sup>

RLMs would sit above those systems as a planner for dense synthesis. Instead of retrieving top-k chunks once, an RLM can create a table of documents, ask sub-models to extract claims from each group, detect contradictions, go back to raw source text, and assemble a final answer with provenance.

## Use Case 3: Legal And Compliance Review

Legal work is full of long-context traps. A contract clause can matter because of another clause 80 pages later. A compliance issue may depend on definitions, exceptions, schedules, and external policy documents. Summaries are dangerous because the omitted detail is often the legal issue.

An RLM-style system could parse contract structure into sections and definitions, extract obligations and deadlines, compare clauses for inconsistency, map policies against evidence documents, keep a table of unresolved issues, and ask sub-models to review only the raw passages relevant to each issue.

> **Legal Caveat**
>
> This is an assistive workflow, not legal authority. The RLM trace may improve auditability, but any high-stakes legal conclusion still needs professional review and provenance down to exact clauses.


The missing piece in most enterprise stacks is not access to documents. It is reliable structured comparison. A legal RLM would need schemas for obligations, parties, jurisdictions, exceptions, dates, and evidence spans. The value comes from repeatedly returning to exact clauses while keeping a structured issue register outside the model context.
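A structured issue register of this kind might look like the sketch below. The field names, the `check_deadline_conflicts` rule, and the sample obligations are all invented for illustration; a real schema would also need jurisdictions and exceptions.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class EvidenceSpan:
    document: str
    clause: str
    quote: str


@dataclass
class Obligation:
    party: str
    action: str
    due: Optional[date]
    evidence: EvidenceSpan


@dataclass
class IssueRegister:
    issues: list = field(default_factory=list)

    def check_deadline_conflicts(self, obligations):
        # Exact comparison in code: same party and action, different due dates.
        seen = {}
        for ob in obligations:
            key = (ob.party, ob.action)
            if key in seen and seen[key].due != ob.due:
                self.issues.append((seen[key].evidence.clause, ob.evidence.clause))
            seen.setdefault(key, ob)
        return self.issues


# Made-up obligations, as a sub-model might extract them from raw clauses.
obligations = [
    Obligation("Supplier", "deliver audit report", date(2026, 3, 1),
               EvidenceSpan("msa.pdf", "4.2", "within 60 days of year end")),
    Obligation("Supplier", "deliver audit report", date(2026, 4, 1),
               EvidenceSpan("msa.pdf", "11.1", "no later than April 1")),
    Obligation("Customer", "pay invoice", None,
               EvidenceSpan("msa.pdf", "6.3", "upon receipt")),
]
register = IssueRegister()
conflicts = register.check_deadline_conflicts(obligations)
```

Each issue carries clause identifiers, so a reviewer can go straight back to the exact text rather than trusting a summary.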

## Use Case 4: Customer Support And Product Intelligence

Companies sit on enormous unstructured datasets: tickets, chats, sales calls, bug reports, community posts, refund requests, and NPS comments.

The question is rarely "find the one ticket." It is usually: what are users actually complaining about? Which complaints correlate with churn? Which feature requests are phrased differently but mean the same thing? Which account segments are seeing the same failure? What changed after a release?

This is exactly where dense semantic aggregation matters. An RLM can chunk logs, label batches, maintain counters, ask follow-up sub-queries on suspicious clusters, and assemble a final report with examples. Standard summarization tends to flatten minority patterns. Retrieval tends to overrepresent obvious keywords. RLMs can preserve rare but important categories because the intermediate state can be structured and inspected.

This maps cleanly onto current enterprise agent stacks. Microsoft Foundry, Amazon Bedrock Agents, and Google Agent Platform all provide tool access, knowledge grounding, session state, memory, governance, and API integration for long-running enterprise workflows.<sup><a href="#source-12">[12]</a></sup><sup><a href="#source-13">[13]</a></sup><sup><a href="#source-14">[14]</a></sup>

## Use Case 5: Scientific And Technical Literature

Scientific review is not just summarization. It requires comparing methods, assumptions, datasets, ablations, metrics, and negative results across papers.

An RLM can operate like a literature-review assistant: extract claims and evidence tables from each paper, normalize metric names and experimental settings, detect duplicated baselines or incompatible comparisons, ask sub-models to inspect methods sections in detail, and build a final synthesis that separates evidence from speculation.

This is where a plain vector database is not enough. A vector database can find similar passages. It cannot, by itself, decide that two papers use incompatible baselines, or that a claimed improvement disappears when normalized for compute. The recursive model calls perform that semantic normalization.

## Use Case 6: Long-Horizon Agent Memory

Agent histories rot. A coding agent that has been working for hours accumulates decisions, failed attempts, constraints, partial edits, and user corrections. Compaction helps, but it can hide the very mistake that explains the current failure.

RLMs suggest a different pattern: keep the full agent trajectory outside the root context and let the model inspect it programmatically. Instead of asking "what does the summary say?", the agent can ask when a constraint first appeared, which command produced the first failing trace, whether a fix was already tried, what files changed after tests last passed, or which user instruction conflicts with the current plan.

This is already becoming a serious product concern. Claude Code stores project memories as markdown and reads detailed topic files on demand.<sup><a href="#source-16">[16]</a></sup> Cursor writes chat history and long terminal outputs to files so the agent can search them after summarization instead of trusting only a lossy summary.<sup><a href="#source-8">[8]</a></sup> LangGraph saves execution state as checkpoints so agents can resume, support human review, time-travel debug, and recover from failures.<sup><a href="#source-15">[15]</a></sup>

The RLM interpretation is that long agent memory should be queryable raw material. The model should be able to ask the history specific questions, not merely inherit a compressed story.
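Several of the trajectory questions listed above reduce to small queries once the history is stored as structured events. The event format and function names here are assumptions made for this sketch, not any product's log schema.

```python
# Made-up agent trajectory kept outside the model's context.
trajectory = [
    {"step": 1, "kind": "user", "text": "Do not modify the public API."},
    {"step": 2, "kind": "command", "text": "pytest", "exit_code": 0},
    {"step": 3, "kind": "edit", "text": "src/client.py"},
    {"step": 4, "kind": "command", "text": "pytest", "exit_code": 1},
    {"step": 5, "kind": "edit", "text": "src/api.py"},
    {"step": 6, "kind": "command", "text": "pytest", "exit_code": 1},
]


def first_failing_command(events):
    """Return the earliest command event with a non-zero exit code."""
    return next(
        (e for e in events if e["kind"] == "command" and e.get("exit_code", 0) != 0),
        None,
    )


def files_changed_since_last_pass(events):
    """List edited files after the most recent passing command."""
    last_pass = max(
        (e["step"] for e in events if e["kind"] == "command" and e.get("exit_code") == 0),
        default=0,
    )
    return [e["text"] for e in events if e["kind"] == "edit" and e["step"] > last_pass]


def first_mention(events, phrase):
    """Return the step at which a phrase first appears, if at all."""
    return next((e["step"] for e in events if phrase.lower() in e["text"].lower()), None)


failing = first_failing_command(trajectory)
changed = files_changed_since_last_pass(trajectory)
constraint_step = first_mention(trajectory, "public API")
```

Here the answers point at a likely conflict: `src/api.py` was edited after the user asked not to modify the public API, a detail a lossy summary could drop.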

## Where RLMs Are Overkill

RLMs should not become a default hammer.

If the task is short, direct prompting is simpler. If the evidence is easily retrieved, RAG is cheaper and easier to operate. If the workflow is stable and known in advance, a hand-written pipeline may be more reliable than letting a model invent one. If generated code is not acceptable in the threat model, the RLM scaffold may be the wrong tool.

The strongest version of RLMs is not "use them everywhere." It is "use them where the decomposition is too input-dependent to hand-engineer and the input is too large or dense for direct context."


## The Limits Are Real

RLMs are not free. The paper is careful about this.

First, runaway sub-calls are possible. A bad decomposition strategy can call the model too many times, spend too much money, or get stuck. The authors note that RLMs add complexity and that guardrails remain underexplored.<sup><a href="#source-1">[1]</a></sup>

Second, code generation errors matter. If the root model writes broken Python, the trajectory can fail or waste iterations. The paper's trajectory analysis finds syntax errors in some RLM runs, especially with weaker or less suitable models.<sup><a href="#source-1">[1]</a></sup>

Third, sandboxing becomes a first-class safety concern. A REPL that can inspect prompts and run code is useful. A REPL that can touch secrets, networks, or production systems without strict isolation is dangerous.
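The second limit, broken generated code, is cheap to catch before execution. A minimal sketch, assuming a hypothetical `write_with_retries` helper and a stand-in root model that fixes its code once it sees an error message:

```python
import ast


def check_generated_code(source: str):
    """Return None if the code parses, else a short error message."""
    try:
        ast.parse(source)
    except SyntaxError as err:
        return f"SyntaxError on line {err.lineno}: {err.msg}"
    return None


def write_with_retries(root_model, max_attempts=3):
    """Ask for code, feeding parse errors back, within a bounded retry count."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        source = root_model(feedback)
        feedback = check_generated_code(source)
        if feedback is None:
            return source, attempt
    raise RuntimeError(f"no valid code after {max_attempts} attempts")


GOOD = "total = sum(len(c) for c in chunks)"
BAD = "for c in chunks\n    print(c)"      # missing colon


def stub_root(feedback):
    """Stand-in root model: writes broken code first, then corrects it."""
    return BAD if feedback is None else GOOD


source, attempts = write_with_retries(stub_root)
```

Parsing only catches syntax errors; runtime failures and unsafe operations still need the sandbox and budget controls described above.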

The correct takeaway is not that RLMs replace long-context models. It is that context windows alone are the wrong abstraction for a large class of tasks.


The bigger story is that RLMs make "context length" feel less like a hardware spec and more like an operating-system problem. The model becomes the planner. The prompt becomes external memory. The REPL becomes the workspace. The final answer is not simply generated from a giant blob of text; it is assembled from a trace of programmatic reading.

That is why this paper is worth paying attention to. It gives the long-context race a new axis.

---

*Source: [LLM Rumors](https://www.llmrumors.com/news/recursive-language-models-context-window-breakthrough)*
