Recursive Language Models: How RLMs Work

TL;DR: There is one research paper here: Recursive Language Models by Alex L. Zhang, Tim Kraska, and Omar Khattab. The idea is not a new transformer architecture. It is an inference-time scaffold around an existing model. Instead of forcing a huge prompt into the model's context window, an RLM stores the prompt in an external environment, usually a Python REPL, then lets the model inspect, slice, transform, and recursively call language models on the parts that matter.^[1]

That sounds like a small systems trick. It is not. It changes the long-context question from "how many tokens fit?" to "what program should the model run over this information?"

Every AI company now sells a bigger context window. 200K tokens. 1M tokens. 2M tokens. The pitch is simple: give the model more text and it will remember more.

Anyone who has used long coding sessions, research agents, legal-document chats, or giant PDF workflows knows the uncomfortable truth. A model can technically accept a long prompt while still failing to use it well. Facts get buried. Earlier constraints become soft. Summaries erase details. The model starts treating the context like background texture instead of working memory.

Recursive Language Models are a direct attack on that failure mode. The paper's core move is blunt: stop pretending the neural network should directly ingest everything. Put the prompt somewhere the model can operate on it programmatically.

BREAKING

What The Paper Introduces

Recursive Language Models propose a general inference paradigm where long prompts live in an external environment. The model gets a handle to that environment, writes code against it, recursively queries language models over slices of the prompt, stores intermediate state, and returns a final answer.^[1]

Developing story

The Problem: Long Context Is Not Working Context

There are two separate problems hiding under "long context."

The first is capacity. Can the model fit the input at all? If the prompt is 600K tokens and the model supports 272K tokens, the answer is no.

The second is use. Even if the model can fit the input, can it reason over the relevant parts, preserve fine details, and perform the right amount of semantic work? The RLM paper argues that this second problem is the more interesting one. On simple needle-in-a-haystack tasks, a model can often retrieve one fact from a huge prompt. On dense tasks where the answer depends on many lines, many documents, or many pairs of objects, performance degrades much faster.^[1]

That distinction explains why bigger windows have not solved long-context work. A contract review, codebase migration, literature synthesis, incident postmortem, or support-ticket analysis is rarely a single needle. It is a task that requires many partial readings, cross-checks, labels, comparisons, and local decisions.

The Mental Model

RLMs stop treating long context as text the model must swallow. They treat it as a workspace the model can inspect.

Classic long-context call

Stuff the prompt into the model

Model context window

Prompt, documents, instructions, notes, and prior work all compete for one limited space.

Result: more tokens fit, but dense reasoning can still blur.

Recursive Language Model

Put the prompt in a workspace

External prompt

Stored as an addressable variable

Root model

Writes code to inspect and plan

Sub-calls

Read slices, label, compare, summarize

Result: the model chooses how to read the context.

Watch The Prompt Become A Workspace

The prompt stays outside the model. The root model scans it, selects slices, sends bounded semantic calls, and assembles structured state.

External prompt

Huge context stored as data

doc-014renewal terms mention chargeback

doc-215support ticket says cancel

doc-612billing clause conflicts

doc-904error log points to checkout

Recursive work

Small model calls over selected slices

Call A

classify refund risk

Call B

compare clauses

Call C

extract evidence

Call D

summarize cluster

Return value

Tables, labels, counters, and evidence snippets replace one overloaded prompt.

What An RLM Actually Is

An RLM is a language model wrapped in a runtime.

The outside interface stays familiar: send a prompt, get a response. Under the hood, the prompt is loaded into an environment as a variable. The model does not initially see the whole thing. It sees metadata: the prompt length, maybe a prefix, available functions, and instructions for how to inspect the environment. Then it writes code.

That code can read slices of the prompt, search it, chunk it, call a sub-model on selected spans, save intermediate results, and eventually return a final answer from a variable. In the authors' implementation, the environment is a Python REPL, but the paper frames the environment more generally.^[1]

The mental model is closer to a researcher with a notebook than a model with a giant memory buffer.

The important move is not "use tools." Tool use already exists. The important move is that recursive model calls become programmable inside the environment. The model can write a loop that calls another model over every section, every document, every pair of chunks, or every file that passes a filter.

That is what gives the scaffold a different shape from a chat model with a search button.

The Basic RLM Loop

The RLM paper describes a general algorithm, then instantiates it with a Python REPL. In practice, the loop looks like this:

The RLM Loop In One View

The model is still generating language, but it is also operating a workspace.

1Workspace

Prompt becomes data

The full prompt is stored outside the model as a variable.

2Model

Root model plans

The model sees metadata and decides how to inspect the prompt.

3Program

Code inspects slices

Generated code searches, chunks, parses, and prepares smaller jobs.

4Recursion

Sub-calls do semantic work

LLMs classify, compare, extract, or summarize selected slices.

5Output

Final answer is assembled

Intermediate variables become the response, with less prompt stuffing.

The root model is not being asked to hold the entire prompt in its context. It is being asked to operate a workspace.

A Concrete Example

Imagine asking a model:

Across these 2,000 customer support tickets, which product areas are most associated with refund threats, and what examples support each category?

A direct long-context prompt asks the model to read everything at once and answer. A compaction approach summarizes chunks, then summarizes the summaries. A retrieval approach searches for "refund" and related terms.

An RLM can write a workflow:

The Support-Ticket Workflow

The useful pattern is not the Python syntax. It is the alternation between exact operations and semantic model calls.

Input

2,000 messy tickets

Refund threat after renewal email

Billing confusion after plan change

Feature request with no churn risk

Chargeback warning after duplicate payment

1Exact pass

Split and filter

Code separates 2,000 tickets and keeps candidates with refund, billing, chargeback, cancel, or money-back signals.

2Semantic pass

Batch labels

Model calls label each candidate by product area, severity, and supporting quote.

3Structured state

Build rows

The workspace stores JSON-like rows instead of relying on a loose summary.

4Synthesis

Answer with evidence

A final model call ranks product areas and pulls representative examples from the rows.

Area

Severity

Evidence

Billing

High

chargeback after renewal

Mobile app

Medium

cancel if sync fails

Checkout

High

refund request after duplicate payment

That workflow is simplified, but it captures the important shape. The model gets to combine symbolic filtering with semantic classification. It can call the LLM when meaning matters and use code when exact operations matter.

The strongest RLM workflows usually alternate between both.

Why Recursion Changes The Scaling Story

Chain-of-thought reasoning made inference-time compute feel normal. Instead of asking a model to answer immediately, you let it spend more tokens thinking. RLMs push that idea into a different dimension: instead of only scaling the length of the internal reasoning trace, they scale the amount of external semantic work the system can perform.

A standard model call has one context window and one output stream. An RLM can use the root model to orchestrate many bounded model calls, keep the partial results in program variables, and perform symbolic operations around them. That makes it possible to do work proportional to the input size, or even proportional to pairs of input chunks, without placing the whole intermediate process inside one neural context window.^[1]

This is why the paper's OOLONG-Pairs result is so revealing. The base GPT-5 and Qwen3-Coder models made almost no progress on the pairwise aggregation task, while RLM versions produced substantial F1 scores by recursively decomposing the work.^[1]

Why Recursion Scales The Work

The root call does not read everything directly. It fans out bounded jobs, then collapses the results into a smaller working table.

Root model

Choose decomposition

Decide which slices need semantic work.

Slice 1

contract terms

Slice 2

support tickets

Slice 3

logs and metrics

Slice 4

prior decisions

Slice 5

repo files

Slice 6

research notes

Collapsed state

Working table

labels

evidence

counts

conflicts

Final answer reads the table, not the whole corpus.

The Key Difference: Prompt As Environment

The cleanest idea in the paper is that arbitrarily long prompts should not be fed into the transformer directly. They should be treated as part of the environment.

That reframing is the whole paper.

In a normal LLM call, the prompt is text inside the model's temporary memory. In an RLM, the prompt is an object. It can be indexed. It can be split. It can be transformed. It can be passed by reference through a program. The root model is no longer asked to memorize everything; it is asked to decide how to inspect and delegate.

This has a practical consequence: context management becomes adaptive. A human engineer does not read a repository by pasting all files into one window. They grep, open files, inspect call sites, run tests, write notes, and zoom in when a local region matters. RLMs give a model a version of that workflow inside a single model-like API call.

KEY FINDING

The Actual Breakthrough

RLMs turn long-context processing from a storage problem into an execution problem. The question stops being "how many tokens fit?" and becomes "what program should the model run over this prompt?"

Why This Is Not Just Retrieval

Retrieval is powerful when the relevant evidence can be found with a query. It is weaker when relevance is created by the reasoning process itself.

Suppose the task is "find the contract clauses that contradict each other." A search system can retrieve clauses containing the same terms. It cannot reliably know which clauses conflict until it has compared meanings across sections. Suppose the task is "count how many support tickets imply billing confusion." A keyword index will miss paraphrases. A model needs to semantically label each row, then aggregate.

RLMs can still use retrieval-like strategies. They can grep, rank, and filter. The difference is that retrieval becomes one tool inside a larger program. The root model can choose to search first, then call sub-models to classify candidates, then compare summaries, then go back to the raw prompt when a conflict appears.

Why This Is Not Just Agents Either

Modern coding agents already place repositories on disk and let models inspect files. That looks close to RLMs. The difference is where recursion happens.

Many agents let a model choose a tool call, observe output, and continue. Some let it delegate to sub-agents. But those delegations are often explicit, sequential actions in the conversation. RLMs make model calls callable from code. That means a generated Python loop can produce hundreds or thousands of semantic sub-queries from slices of the prompt, store their results, and combine them.

This is a sharper abstraction. It treats the model as a function available to the program, not merely as the actor writing the next chat turn.

Agent Loop vs RLM Loop

The difference is where repeated semantic work happens: one chat turn at a time, or inside a generated program.

Typical agent scaffold

Sequential conversation loop

Model chooses next action

Tool returns observation

Conversation history grows

Next turn repairs or continues

Strong for interactive tool use, but repeated sub-work often expands the chat history.

Recursive Language Model

Programmatic semantic loop

Root writes a small program

Program loops over prompt slices

Model calls act like functions

Variables hold structured state

Strong for dense batch work because the program can launch many bounded model calls.

Recursion Depth: 0, 1, 2+

The paper uses recursion depth to describe what kinds of calls are allowed.

Depth 0 means the model gets the REPL and external prompt variable, but cannot make recursive model calls. It can still search, parse, transform, and compute. This alone helps on some tasks because the context is no longer stuffed into the model window.

Depth 1 means the root model can call regular LLMs inside the REPL. This is the central setting in many of the paper's results.

Depth greater than 1 means the root can call sub-RLMs, which can themselves inspect their own environments and recurse. This is more powerful, but also easier to mismanage. More depth can amplify syntax errors, bad decompositions, and cost spikes.

Recursion Depth Without The Jargon

Depth controls how much delegation the workspace is allowed to do.

Depth 0

Code only

Can do

Search, parse, split, count, and transform the external prompt.

Watch for

Cannot delegate semantic reading to sub-model calls.

Depth 1

Code plus LLM calls

Can do

Call ordinary LLMs on selected prompt slices from inside the program.

Watch for

Needs call budgets, good chunking, and structured prompts.

Depth 2+

RLMs call RLMs

Can do

Subtasks can run their own recursive workspace loops.

Watch for

Powerful, but cost and errors can compound quickly.

The practical lesson is that more recursion is not automatically better. RLM engineering needs cost budgets, call limits, and robust execution handling.

What The Root Model Learns To Do

The root model is the controller. Its job is not to answer immediately. Its job is to choose a decomposition strategy.

In the paper's trajectory analysis, RLMs often probe the context first, then choose a decomposition pattern. On BrowseComp-Plus, the model uses priors and programmatic narrowing to reduce the search space before launching sub-calls. On dense tasks, it may perform semantic transformations line by line or chunk by chunk.^[1]

That behavior creates a new training target. You can train a model not only to answer, but to be a better RLM controller: inspect structure, write valid code, choose chunk sizes, limit cost, avoid redundant calls, recover from errors, and build good buffers.

The paper's RLM-Qwen3-8B experiment is important for this reason. The authors fine-tuned Qwen3-8B on 1,000 filtered RLM trajectories distilled from a larger model. The resulting RLM-Qwen3-8B substantially improved over the base Qwen3-8B when used as an RLM, even on downstream tasks outside the training domain.^[1]

What The Controller Learns

Training an RLM controller means training the model to run the workspace well, not just to know the final answer.

Probe

Inspect metadata, prefixes, indexes, and sample regions before committing to a plan.

Plan

Choose whether to search, chunk, pair, cluster, summarize, or recurse.

Budget

Set limits on calls, tokens, chunk sizes, retries, and recursion depth.

Repair

Read execution errors, fix code, and preserve useful intermediate state.

Stop

Return when the evidence is sufficient instead of launching more work.

Training signal

Useful examples are full trajectories: code, slices inspected, sub-call prompts, errors, intermediate buffers, and final answers.

KEY FINDING

The Training Implication

The scarce skill may not be "knowing more facts." It may be knowing how to operate the recursive workspace: when to inspect, when to split, when to call a model, and when to stop.

How Industry Already Uses RLM-Shaped Ideas

No major AI platform is publicly marketing "RLMs" as a product category yet. The industry is converging on the ingredients.

OpenAI's platform has hosted tools for file search, web search, and code interpreter. File Search lets models retrieve from uploaded knowledge bases through semantic and keyword search, while Code Interpreter gives the model a sandboxed Python container with uploaded or generated files.^[4]^[5] That is not the RLM algorithm, but it is the same direction: move state and computation out of the raw prompt and into an environment the model can operate.

Anthropic's Claude Code is another close analogue. It reads codebases, edits files, runs commands, integrates through MCP, stores project instructions and memories, and can spawn subagents for side tasks that would otherwise flood the main context.^[6]^[7] The point is not that Claude Code is secretly an RLM. The point is that serious coding agents already treat the filesystem, shell, repo history, memories, and tool outputs as external context surfaces.

Cursor is even more explicit about the context-engineering side. Its dynamic context discovery work writes long tool outputs to files, lets the agent search chat history files after summarization, loads MCP tool descriptions on demand, and treats terminal history as searchable files.^[8] Cursor's codebase indexing research also shows why this matters commercially: semantic search over large repositories is a core driver of agent performance, and fast index reuse can cut time-to-first-query from hours to seconds on very large repos.^[9]

Industry today

traces, permissions, budgets, review gates

RLM role

Recursive fan-out needs policy and observability.

The industry lesson is straightforward: everyone is trying to reduce prompt stuffing. RLMs give that movement a clean research abstraction. They say the prompt should be an external environment, and the model should learn to operate over it.

The Production Architecture

If you strip away branding, production agent systems are already decomposing RLM-like behavior into five layers.

The first layer is the execution sandbox. The moment the model writes code, the runtime has to assume the code is untrusted. Sandboxes, permissions, network controls, and file-system isolation become part of the model architecture.

The second layer is the external context store. This may be a vector store, code index, filesystem, document store, chat log, terminal log, or memory directory. The important pattern is that large state lives outside the model context, and the model receives handles rather than giant copies.

The third layer is semantic delegation. Subagents, handoffs, function calls, and recursive sub-queries all express the same pressure: one model context should not do every piece of semantic work directly.

The fourth layer is durable state. LangGraph's persistence layer saves graph state as checkpoints at each execution step, enabling memory, human-in-the-loop review, time travel debugging, fault tolerance, and resumption after failures.^[15] Without durable state, long-running recursive work becomes fragile.

The fifth layer is observability and governance. Enterprises do not want opaque recursive fan-out. They want traces, budgets, tool policies, identities, approval gates, and audit logs.

The Production Trace

In production, the visible answer is only the end of the run. The system needs a trace of each environment read, model call, and control check.

Sandbox

Run generated code with file and network limits.

Context store

Read handles to files, indexes, logs, and memories.

Semantic calls

Delegate selected slices to bounded model calls.

State

Save labels, tables, errors, and partial answers.

Governance

Apply budgets, approvals, traces, and audit logs.

Where RLMs Will Matter

RLMs are strongest when three things are true:

The input is too large or too dense for one reliable direct model call.
The relevant evidence is distributed across many regions.
The system must perform semantic operations on many pieces, not just retrieve a few passages.

Codebases are almost perfect RLM material.

A repository is already an external environment. It has files, symbols, imports, tests, docs, generated artifacts, and history. A strong engineer does not paste the repo into a chat window. They inspect structure, search call sites, read implementations, run tests, and build a working model.

An RLM can imitate that workflow inside one request: list files and dependency boundaries, search for relevant symbols, open local regions around definitions, call sub-models to summarize modules, compare behavior across call sites, and assemble an answer with citations to files and lines.

This is where industry is furthest along. Cursor builds semantic indexes over codebases. Sourcegraph Cody uses search and code graph relationships to make answers codebase-aware. GitHub Copilot cloud agent can be assigned an issue, work in the background, push a branch, and open a pull request. Claude Code reads codebases, edits files, runs commands, handles git operations, and connects external tools through MCP.^[6]^[9]^[10]^[11]

The RLM angle is not "coding agents should exist." They already do. The RLM angle is that a coding agent could formalize repository work as recursive semantic passes: one pass to map modules, another to classify risky call paths, another to inspect tests, another to synthesize migration steps.

Use Case 2: Deep Research Over Offline Corpora

Search engines are excellent when the web can be queried live. Enterprise research often does not look like that. The relevant material may be a private folder of PDFs, internal memos, call transcripts, emails, market reports, and tables.

RLMs can treat that corpus as the prompt environment. The root model can inspect metadata, narrow candidate documents, ask sub-models to extract claims from batches, then reconcile conflicts. This resembles retrieval-augmented generation, but retrieval is not the whole workflow.

The paper's BrowseComp-Plus setup is a good proxy: 1,000 documents are supplied as input, and the answer requires associating evidence across documents. RLM(GPT-5, depth=1) scored 91.3% in the main table, beating the compaction agent at 70.5% and CodeAct with BM25 at 51.0%.^[1]

Industry already offers the retrieval half of this. OpenAI File Search, Microsoft Foundry File Search and Azure AI Search, Google RAG Engine and Vector Search, and Amazon Bedrock knowledge bases all help agents ground answers in private corpora.^[4]^[12]^[13]^[14]

RLMs would sit above those systems as a planner for dense synthesis. Instead of retrieving top-k chunks once, an RLM can create a table of documents, ask sub-models to extract claims from each group, detect contradictions, go back to raw source text, and assemble a final answer with provenance.

Use Case 3: Legal And Compliance Review

Legal work is full of long-context traps. A contract clause can matter because of another clause 80 pages later. A compliance issue may depend on definitions, exceptions, schedules, and external policy documents. Summaries are dangerous because the omitted detail is often the legal issue.

An RLM-style system could parse contract structure into sections and definitions, extract obligations and deadlines, compare clauses for inconsistency, map policies against evidence documents, keep a table of unresolved issues, and ask sub-models to review only the raw passages relevant to each issue.

WARNING

Legal Caveat

This is an assistive workflow, not legal authority. The RLM trace may improve auditability, but any high-stakes legal conclusion still needs professional review and provenance down to exact clauses.

The missing piece in most enterprise stacks is not access to documents. It is reliable structured comparison. A legal RLM would need schemas for obligations, parties, jurisdictions, exceptions, dates, and evidence spans. The value comes from repeatedly returning to exact clauses while keeping a structured issue register outside the model context.

Use Case 4: Customer Support And Product Intelligence

Companies sit on enormous unstructured datasets: tickets, chats, sales calls, bug reports, community posts, refund requests, and NPS comments.

The question is rarely "find the one ticket." It is usually: what are users actually complaining about? Which complaints correlate with churn? Which feature requests are phrased differently but mean the same thing? Which account segments are seeing the same failure? What changed after a release?

This is exactly where dense semantic aggregation matters. An RLM can chunk logs, label batches, maintain counters, ask follow-up sub-queries on suspicious clusters, and assemble a final report with examples. Standard summarization tends to flatten minority patterns. Retrieval tends to overrepresent obvious keywords. RLMs can preserve rare but important categories because the intermediate state can be structured and inspected.

This maps cleanly onto current enterprise agent stacks. Microsoft Foundry, Amazon Bedrock Agents, and Google Agent Platform all provide tool access, knowledge grounding, session state, memory, governance, and API integration for long-running enterprise workflows.^[12]^[13]^[14]

Use Case 5: Scientific And Technical Literature

Scientific review is not just summarization. It requires comparing methods, assumptions, datasets, ablations, metrics, and negative results across papers.

An RLM can operate like a literature-review assistant: extract claims and evidence tables from each paper, normalize metric names and experimental settings, detect duplicated baselines or incompatible comparisons, ask sub-models to inspect methods sections in detail, and build a final synthesis that separates evidence from speculation.

This is where a plain vector database is not enough. A vector database can find similar passages. It cannot, by itself, decide that two papers use incompatible baselines, or that a claimed improvement disappears when normalized for compute. The recursive model calls perform that semantic normalization.

Use Case 6: Long-Horizon Agent Memory

Agent histories rot. A coding agent that has been working for hours accumulates decisions, failed attempts, constraints, partial edits, and user corrections. Compaction helps, but it can hide the very mistake that explains the current failure.

RLMs suggest a different pattern: keep the full agent trajectory outside the root context and let the model inspect it programmatically. Instead of asking "what does the summary say?", the agent can ask when a constraint first appeared, which command produced the first failing trace, whether a fix was already tried, what files changed after tests last passed, or which user instruction conflicts with the current plan.

This is already becoming a serious product concern. Claude Code stores project memories as markdown and reads detailed topic files on demand.^[16] Cursor writes chat history and long terminal outputs to files so the agent can search them after summarization instead of trusting only a lossy summary.^[8] LangGraph saves execution state as checkpoints so agents can resume, support human review, time-travel debug, and recover from failures.^[15]

The RLM interpretation is that long agent memory should be queryable raw material. The model should be able to ask the history specific questions, not merely inherit a compressed story.

Where RLMs Are Overkill

RLMs should not become a default hammer.

RLMs are not free. The paper is careful about this.

First, runaway sub-calls are possible. A bad decomposition strategy can call the model too many times, spend too much money, or get stuck. The authors note that RLMs add complexity and that guardrails remain underexplored.^[1]

Second, code generation errors matter. If the root model writes broken Python, the trajectory can fail or waste iterations. The paper's trajectory analysis finds syntax errors in some RLM runs, especially with weaker or less suitable models.^[1]

Third, sandboxing becomes a first-class safety concern. A REPL that can inspect prompts and run code is useful. A REPL that can touch secrets, networks, or production systems without strict isolation is dangerous.

The correct takeaway is not that RLMs replace long-context models. It is that context windows alone are the wrong abstraction for a large class of tasks.

The bigger story is that RLMs make "context length" feel less like a hardware spec and more like an operating-system problem. The model becomes the planner. The prompt becomes external memory. The REPL becomes the workspace. The final answer is not simply generated from a giant blob of text; it is assembled from a trace of programmatic reading.

That is why this paper is worth paying attention to. It gives the long-context race a new axis.

Sources & References

Key sources and references used in this article

#	Source	Outlet	Date	Key Takeaway
1	Recursive Language Models	arXiv Alex L. Zhang, Tim Kraska, Omar Khattab	May 11, 2026	Primary technical source for RLM definitions, benchmark results, recursion depth, training results, trajectory behavior, and limitations.
2	Recursive Language Models	Alex L. Zhang Blog Alex L. Zhang	Oct 2025	Author explanation of the RLM framing, API replacement idea, REPL setup, and context-rot motivation.
3	alexzhang13/rlm	GitHub MIT OASYS Lab	2026	Public implementation and documentation for using RLMs with several model providers and sandbox environments.
4	File search	OpenAI Docs	2026	Hosted file search lets models retrieve from uploaded knowledge bases through semantic and keyword search.
5	Code Interpreter	OpenAI Docs	2026	Code Interpreter gives models a sandboxed Python container with uploaded and generated files.
6	Claude Code Overview	Anthropic Docs	2026	Claude Code reads codebases, edits files, runs commands, integrates tools, manages memory, and automates coding work.
7	Claude Code Subagents	Anthropic Docs	2026	Subagents preserve context by moving exploration and specialized work into separate context windows.
8	Dynamic context discovery	Cursor Research Jediah Katz	Jan 6, 2026	Cursor describes storing long tool outputs, chat history, MCP tools, and terminal sessions as searchable files.
9	Securely indexing large codebases	Cursor Research Jeremy Stribling	Jan 27, 2026	Cursor's codebase indexing research shows semantic search is central to agent performance on large repositories.
10	Cody Context	Sourcegraph Docs	2026	Cody retrieves context through keyword search, Sourcegraph Search, code graph relationships, and explicit context selection.
11	Starting GitHub Copilot sessions	GitHub Docs	2026	GitHub Copilot cloud agent can be assigned tasks, work in the background, and produce pull requests for review.
12	Agent tools overview for Microsoft Foundry Agent Service	Microsoft Learn	Mar 13, 2026	Microsoft's agent tool catalog includes web search, code interpreter, file search, custom functions, MCP, browser automation, and computer use.
13	Automate tasks in your application using AI agents	AWS Docs	2026	Amazon Bedrock Agents combine action groups, knowledge bases, orchestration prompts, traces, and deployment aliases.
14	Gemini Enterprise Agent Platform overview	Google Cloud Docs	May 22, 2026	Google's platform combines agent development, RAG, vector search, managed sandboxes, persistent memory, code execution, governance, and observability.
15	LangGraph persistence	LangChain Docs	2026	LangGraph checkpointing supports memory, human-in-the-loop review, time travel debugging, and fault-tolerant resumption.
16	How Claude remembers your project	Anthropic Docs	2026	Claude Code stores project memory as markdown files and reads detailed topic files on demand.

16 sourcesClick any row to visit original

What The Paper Introduces

The Problem: Long Context Is Not Working Context

The Mental Model

Stuff the prompt into the model

Put the prompt in a workspace

Watch The Prompt Become A Workspace

Huge context stored as data

Small model calls over selected slices

What An RLM Actually Is

The Basic RLM Loop

The RLM Loop In One View

Prompt becomes data

Root model plans

Code inspects slices

Sub-calls do semantic work

Final answer is assembled

A Concrete Example

The Support-Ticket Workflow

2,000 messy tickets

Split and filter

Batch labels

Build rows

Answer with evidence

Why Recursion Changes The Scaling Story

Why Recursion Scales The Work

Choose decomposition

Working table

Why The Results Matter

The Key Difference: Prompt As Environment

The Actual Breakthrough

Why This Is Not Just Retrieval

Why This Is Not Just Agents Either

Agent Loop vs RLM Loop

Sequential conversation loop

Programmatic semantic loop

Recursion Depth: 0, 1, 2+

Recursion Depth Without The Jargon

Code only

Code plus LLM calls

RLMs call RLMs

What The Root Model Learns To Do

What The Controller Learns

Probe

Plan

Budget

Repair

Stop

The Training Implication

How Industry Already Uses RLM-Shaped Ideas

How Industry Maps To RLMs

User-facing agent

Orchestration

External context

Execution

Controls

The Production Architecture

The Production Trace

Sandbox

Context store

Semantic calls

State

Governance

Engineering Rules For Practical RLMs

Set hard budgets

Use isolated execution

Prefer structured buffers

Log trajectories

Train the controller

Where RLMs Will Matter

Where RLMs Help Most

Codebases

Research corpora

Legal review

Support data

Scientific literature

Agent memory

Use Case 1: Full-Codebase Understanding

Use Case 2: Deep Research Over Offline Corpora

Use Case 3: Legal And Compliance Review

Legal Caveat