Agentic / LLM FE · Apr 19, 2026 · 8 min read

Tool calls are not function calls

Every agent engineer writes `result = call_tool(…)` as if it were a local function. It isn't. LLM tool calls have no stack, no locals, no implicit state — every turn re-reads the entire transcript. Here's the ABI mismatch, straight from Anthropic's tool-use contract, with a side-by-side visualiser.

Step through the same 3-call task in both models. The stack on the left pushes and pops; the transcript on the right only grows:

[Interactive visualiser, step 0 of 3. Left pane, classical function call as stack frames: orchestrate() holds one local, step = 0, and the stack frame keeps it alive for the entire call. Right pane, LLM tool-call loop as a transcript: one user turn, "Build a report for user 42." Nothing implicit yet. The stack grows and collapses on the left; the transcript only grows on the right.]

There's a line of code in every agentic app, and it looks like this:

const summary = await callTool("summarise", { text });

It reads like a function call. `await` gives it the same shape as a remote RPC. The variable `summary` holds a value. The stack frame around it holds locals. Control resumes where it left off. The mental model that comes with this syntax is a function call, and that mental model is wrong.

An LLM tool call is a transcript entry. No stack, no implicit locals, no place for the model to "resume from". Every turn is a fresh pass over the entire conversation. The model re-reads prior assistant turns, prior tool_result blocks, and prior user messages to figure out where it is. Miss that, and you ship agents that work fine under light load and collapse under long horizons.

This post reads the primary-source contract — Anthropic's tool-use documentation — and shows side-by-side what a classical function call preserves for you that a tool call does not.

tl;dr

A function call has an implicit runtime: a stack frame holds locals, return pops the frame, the caller keeps its state. An LLM tool call has no runtime between turns. The Anthropic API re-sends the full message history on every loop iteration, and the model has to rediscover its plan, its partial results, and "where it was" from the transcript alone. If the transcript is compacted, truncated, or missing a tool_result, the state is gone — there is nothing else holding it.

The contract, restated

Anthropic's description of the agentic loop is unambiguous about where state lives:

The canonical shape is a while loop keyed on stop_reason:

 1. Send a request with your tools array and the user message.
 2. Claude responds with stop_reason: "tool_use" and one or more tool_use blocks.
 3. Execute each tool. Format the outputs as tool_result blocks.
 4. Send a new request containing the original messages, the assistant's response, and a user message with the tool_result blocks.
 5. Repeat from step 2 while stop_reason is "tool_use".
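The five steps above can be sketched as a loop. This is illustrative, not the SDK: `stubModel` stands in for the Messages API (a real implementation would POST `messages` to it), and `runTool` stands in for your tool executor; only the shapes — `tool_use`, `tool_result`, `stop_reason` — follow the documented contract. Note what the loop does on every iteration: the full `messages` array goes in again.

```typescript
// Content-block and message shapes per the tool-use contract.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: unknown }
  | { type: "tool_result"; tool_use_id: string; content: string };

type Message = { role: "user" | "assistant"; content: ContentBlock[] };
type ModelResponse = { stop_reason: "tool_use" | "end_turn"; content: ContentBlock[] };

// Stand-in for the Messages API: asks for one tool call, then finishes.
// It can only know a tool ran by re-reading the transcript it is handed.
function stubModel(messages: Message[]): ModelResponse {
  const sawResult = messages.some(m => m.content.some(b => b.type === "tool_result"));
  if (!sawResult) {
    return {
      stop_reason: "tool_use",
      content: [{ type: "tool_use", id: "tu_1", name: "fetch_user", input: { id: 42 } }],
    };
  }
  return { stop_reason: "end_turn", content: [{ type: "text", text: "Report built." }] };
}

// Stand-in tool executor (a fixed payload, for illustration).
function runTool(name: string, _input: unknown): string {
  return JSON.stringify({ tool: name, user_id: 42, plan: "pro" });
}

function agentLoop(userText: string): { messages: Message[]; requests: number } {
  const messages: Message[] = [{ role: "user", content: [{ type: "text", text: userText }] }];
  let requests = 0;
  for (;;) {
    const resp = stubModel(messages);                  // step 1: FULL transcript in, every time
    requests++;
    messages.push({ role: "assistant", content: resp.content });
    if (resp.stop_reason !== "tool_use") break;        // step 5: loop while tool_use
    const results: ContentBlock[] = resp.content       // step 3: execute, format tool_results
      .filter((b): b is Extract<ContentBlock, { type: "tool_use" }> => b.type === "tool_use")
      .map(b => ({ type: "tool_result" as const, tool_use_id: b.id, content: runTool(b.name, b.input) }));
    messages.push({ role: "user", content: results }); // step 4: results go back as a user message
  }
  return { messages, requests };
}
```

There is no hidden cursor anywhere in this loop. `messages` is the entire state; delete an element and the model simply never knew it.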

Read step 4 twice. The new request contains the original messages. Not a continuation. Not a suspended thread. A fresh call, with the conversation so far re-materialised as input. The model on the other side doesn't "wake up where it stopped". It's re-run, from scratch, against a longer transcript.

The same page makes the statelessness explicit in its description of server tools:

Server-executed tools run their own loop inside Anthropic's infrastructure. A single request from your application might trigger several web searches or code executions before a response comes back. The model searches, reads results, decides to search again, and iterates until it has what it needs, all without your application participating. This internal loop has an iteration limit. If the model is still iterating when it hits the cap, the response comes back with stop_reason: "pause_turn" instead of "end_turn". A paused turn means the work isn't finished; re-send the conversation (including the paused response) to let the model continue where it left off.

"Re-send the conversation." That sentence only makes sense if the state needed to continue the loop lives in the conversation, not around it. The API cannot keep an open process for you. The only way to let the model continue is to hand it back everything it has so far.

What a function call actually gives you

Compare that to a classical function. The runtime gives you, for free, a long list of invariants:

  • A stack frame is allocated on call and deallocated on return. Its locals are addressable by name while the call is live.
  • The return address is saved; the callee cannot forget to come back.
  • Nested calls push new frames; the caller's locals are protected while the callee runs.
  • Types are checked at the boundary; a number going in is a number coming out.
  • Concurrency inside a function does not corrupt the caller's locals, because the caller's frame is untouched on the stack.

None of those invariants exist in a tool-call loop. You, the application developer, are the runtime. For the model to remember a value across turns, that value has to appear in the transcript — typically inside an assistant turn's text, a tool_use.input, or a tool_result.content. There's no other channel.
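That "no other channel" claim is worth making concrete. In the sketch below (names illustrative), a local variable is invisible to the model by construction; the only "memory write" available is appending to the transcript, and what a later turn can know is exactly what a scan of the transcript yields:

```typescript
type Turn = { role: "user" | "assistant"; text: string };

const transcript: Turn[] = [];

// The one and only persistence primitive: put the fact in a message
// the model will re-read.
function persist(fact: string): void {
  transcript.push({ role: "assistant", text: `Noted: ${fact}` });
}

// A later turn "knows" a fact iff it appears somewhere in the transcript.
function modelCanSee(needle: string): boolean {
  return transcript.some(t => t.text.includes(needle));
}

const userId = 42;            // a local in your process: invisible to the model
persist(`user_id=${userId}`); // now it is in the transcript, so it survives
```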

Back to the side-by-side

Advance the steps in the visualiser at the top. On the left: classical orchestrate() → fetchUser() → renderReport(). Frames push and pop; locals survive because the stack holds them. On the right: the equivalent LLM loop. Transcript grows monotonically; every new turn has to re-read the whole thing to know what came before.

Two things become visible after a few steps:

  • The classical stack shrinks: fetchUser returns, its frame is popped, its locals destroyed. Anything orchestrate needed from it had to be returned explicitly.
  • The LLM transcript only grows — there is no "pop". Every prior turn is potentially useful to the next assistant turn, so it all stays, until something outside the model (compaction, truncation, context-window overflow) evicts it.

Those are opposite state models. Tool-calling apps that treat the API like the left pane will debug symptoms like "the agent forgot which file it was editing" forever, because the bug isn't in the agent's reasoning — it's in the assumption that result = call_tool(...) keeps a latent variable somewhere the model can see.

The three failure modes this explains

Once you internalise the transcript model, several whole classes of agent bug compress into one-liners:

1. Compaction amnesia. Many agent frameworks compact or summarise old turns when the context window fills up. If a critical tool_result from turn 4 is summarised down to a single sentence by turn 40, the agent can no longer re-derive the specific fact — a file path, a foreign key, a transaction id — that lived in the raw result. The "agent forgot" isn't a reasoning failure; it's a transcript failure.

Toggle between the full transcript and the compacted one to see which details the model loses:

what the model sees on turn 11
  1. user: Please refactor the auth middleware in src/auth/session.ts.
  2. assistant: I'll read the file first.
  3. tool: read_file("src/auth/session.ts") → 412 lines [contents inlined] (load-bearing)
  4. assistant: Here's my plan — split into `validate()` and `refresh()`. Proposing these patches…
  5. user: Go ahead.
  6. tool: edit_file({ path: "src/auth/session.ts", … }) → applied 3 hunks
  7. assistant: Now running tests.
  8. tool: run_tests() → 2 failures, 18 passed. [failure details inlined] (load-bearing)
  9. assistant: Test failure mentions TokenError — need to import from /tokens. Patching…
  10. user: Can you also fix the parser issue you mentioned earlier?
Turn 11 references 'the parser issue.' The model re-reads turn 8's full tool_result — the specific TokenError, the file paths, the failed assertions. It can answer precisely because the raw details are still in the transcript.
Compaction is the harness's tool for managing context window size. It's also the most common cause of "the agent forgot what it was doing" bugs. The fix is to design prompts whose load-bearing state survives compaction — surfaced in assistant text, not buried in raw tool_results.
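A toy compactor makes the failure mode mechanical. Assume (illustratively) a harness that truncates raw tool outputs to their first line but keeps assistant text verbatim: any fact that only ever lived inside a tool_result is gone after compaction, while the same fact restated in assistant text survives.

```typescript
type Entry = { role: "user" | "assistant" | "tool"; text: string };

// Naive compaction: tool_result bodies collapse to one line,
// assistant and user turns are kept as-is.
function compact(transcript: Entry[]): Entry[] {
  return transcript.map(e =>
    e.role === "tool"
      ? { ...e, text: e.text.split("\n")[0] + " [truncated]" }
      : e);
}

const turns: Entry[] = [
  // The load-bearing detail (TokenError) lives below the first line of the raw result…
  { role: "tool", text: "run_tests() → 2 failures, 18 passed.\nTokenError: must import from /tokens\nassert session.refresh() failed" },
  // …and is also restated in assistant text, which compaction preserves.
  { role: "assistant", text: "Tests failed with TokenError: I need to import it from /tokens." },
];
```

Surfacing the fact in assistant text is exactly the "design prompts whose load-bearing state survives compaction" move.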

2. Tool result shape drift. A function's return type is stable across calls. A tool's result text is not. If your fetch_user tool returns JSON one turn and a pretty-printed summary the next, the model has to re-parse both — and may succeed at one and fail at the other, silently. A tool schema is not a function signature. Tool outputs are prose the next turn must reinterpret.
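One defence against drift is to normalise at the boundary: whatever shape the tool emits, convert it to a single stable JSON shape before it enters the transcript, so the model never has to re-parse two formats. A sketch, with both of `fetch_user`'s hypothetical output shapes invented for illustration:

```typescript
type UserRecord = { user_id: number; plan: string };

// Accepts either raw JSON or a pretty-printed summary; emits one shape.
function normaliseFetchUser(raw: string): UserRecord {
  try {
    const parsed = JSON.parse(raw); // shape 1: JSON
    return { user_id: Number(parsed.user_id), plan: String(parsed.plan) };
  } catch (e) {
    // shape 2: prose, e.g. "User 42 is on the pro plan"
    const m = raw.match(/User (\d+) is on the (\w+) plan/);
    if (!m) throw new Error(`unrecognised fetch_user output: ${raw}`);
    return { user_id: Number(m[1]), plan: m[2] };
  }
}
```

Failing loudly on an unrecognised shape is deliberate: a thrown error in your harness is debuggable; a silently misread tool result is not.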

3. "Nested" tool calls aren't nested. If your render_report tool internally calls fetch_user, the parent model does not see that inner call. Its transcript only contains the outer tool_use/tool_result pair. Any state the inner call discovered that the outer tool discarded is lost to the parent. There is no call stack to inspect.

The cure for all three: treat the transcript as the only persistence layer. If a fact must survive to a later turn, it must land in a message the model will read on that turn. No framework magic replaces this.

Writing for the ABI

Once you accept the mismatch, the design guidance writes itself:

  • Keep tool outputs small and explicitly named. {"user_id": 42, "plan": "pro"} survives re-reads; a 2,000-token JSON blob gets truncated or summarised, and anything relied upon downstream vanishes.
  • Include the plan in the assistant's text, not just in the tool sequence. The model re-reads its own prior turns. An assistant turn that says "I'll do A, then B, then C" is as load-bearing as the tools themselves.
  • Don't treat pause_turn as a signal to "resume". Re-send the whole conversation — the response you got is part of it. That's all the resumption primitive you have.
  • Don't assume parallel tool calls are independent. They're sent to a single model that must synthesise all their results on the next turn. If two tools return contradicting snapshots of external state, the model doesn't hold a mutex for you.
  • Prefer prompts that reference prior turns by content, not by position. "The file you read two turns ago" works if the file is distinctive; "the previous tool's output" degrades to vibes.
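The first rule above can be enforced mechanically rather than remembered. A sketch of a boundary helper that rejects oversized tool outputs instead of letting them silently bloat the transcript (the 400-character budget is an arbitrary illustration):

```typescript
// Serialise a tool's payload, failing loudly if it exceeds the budget.
function toolResult(payload: Record<string, unknown>, budget = 400): string {
  const text = JSON.stringify(payload);
  if (text.length > budget) {
    throw new Error(`tool output of ${text.length} chars exceeds budget of ${budget}`);
  }
  return text;
}
```

A hard failure at authoring time beats a soft failure forty turns later, when the oversized blob has already been summarised away.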

The prize for getting this right is an agent that survives long horizons — 30+ tool calls, 50+ turns — without collapsing into "what was I doing?" The prize for missing it is a demo that ships, wows an exec, and falls over in the first real workload.

Primary sources