Tool Use and Function Calling: Language Models Invoking External Functions
Toolformer (Schick et al., 2023) teaches a model to self-supervise the insertion of API calls, reducing perplexity across 5 tool types; ReAct (Yao et al., 2022) interleaves reasoning and actions, raising HotpotQA exact match from 29.0% to 35.1% and ALFWorld success from 25% to 71%.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| ReAct HotpotQA Exact Match | 35.1% | Exact Match | Yao et al. (2022): ReAct (reason+act) vs 29.0% standard prompting; +6.1% absolute on multi-hop QA |
| ReAct ALFWorld success rate | 71% | % success | Yao et al. (2022): ReAct 71% vs 25% standard prompting; +46 points on embodied task completion |
| Toolformer tools | 5 tools | tool types | Schick et al. (2023): calculator, calendar, Wikipedia search, machine translation, QA system |
| Function call JSON format | {"name": "tool_name", "arguments": {"arg": "val"}} | JSON | Standard structured output; parsed by executor; result appended as observation in context |
Tool use (function calling) enables language models to invoke external functions — calculators, search engines, code interpreters, databases, and APIs — extending beyond the limitations of parametric knowledge stored in weights. The model generates a structured description of the function call; an external executor runs the function and returns the result, which is appended to the context for the next generation step.
The Tool Use Loop
User query
↓
LM generates: {"name": "calculator", "arguments": {"expr": "24 * 365"}}
↓
Executor: runs calculator("24 * 365") → 8760
↓
Context append: [TOOL_RESULT]: 8760
↓
LM continues: "There are 8760 hours in a year."
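The loop above can be sketched in a few lines. The tool registry, the `eval`-based calculator, and the `[TOOL_RESULT]` tag are illustrative assumptions, not a specific vendor's format; a real calculator tool should use a proper expression parser rather than `eval`.

```python
import json

# Hypothetical registry mapping tool names to Python callables.
TOOLS = {
    # eval with empty builtins, for illustration only -- not production-safe
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
}

def run_tool_call(model_output: str) -> str:
    """Parse a model-emitted function call, execute it, and format the observation."""
    call = json.loads(model_output)          # {"name": ..., "arguments": {...}}
    fn = TOOLS[call["name"]]                 # KeyError -> unknown tool
    result = fn(**call["arguments"])
    return f"[TOOL_RESULT]: {result}"        # appended to context for the next step

observation = run_tool_call('{"name": "calculator", "arguments": {"expr": "24 * 365"}}')
# observation == "[TOOL_RESULT]: 8760"
```

The executor, not the model, performs the computation; the model only sees the serialized observation in its context.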
Toolformer: Self-Supervised Tool Learning (Schick et al., 2023)
Toolformer uses self-supervision to teach a model when and how to insert tool calls:
- Sample positions in training text where a tool call might reduce prediction loss
- Generate candidate API calls via few-shot prompting
- Filter: keep only calls where executing the tool and inserting the result reduces loss on the following text
- Fine-tune on the filtered dataset with API calls embedded inline
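The filtering step can be sketched as follows, assuming a hypothetical scoring function `loss(prefix, continuation)` that returns the model's loss on `continuation` given `prefix` (standing in for a real forward pass); `tau` is the keep threshold from the paper's filtering criterion.

```python
def keep_call(prefix, api_call, api_result, continuation, loss, tau=0.1):
    """Keep an API call only if inserting the call *with* its result lowers
    loss on the following text by at least tau, compared to the better of:
    no call at all, or the call without its result."""
    with_result = loss(prefix + api_call + api_result, continuation)
    without_result = loss(prefix + api_call, continuation)
    plain = loss(prefix, continuation)
    return min(plain, without_result) - with_result >= tau
```

Comparing against both baselines ensures the *result* of the call is what helps prediction, not merely the extra tokens of the call syntax.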
| Tool | Example Use Case | Perplexity Benefit |
|---|---|---|
| Calculator | Arithmetic problems in training text | Reduces loss on following numbers |
| Calendar | Date arithmetic and temporal reasoning | Correct date computations |
| Wikipedia search | Factual entity lookups | Grounded factual claims |
| Machine translation | Non-English text processing | Correct multilingual handling |
| QA system | Knowledge retrieval | Factual question answering |
ReAct vs Direct Tool Calling (Yao et al., 2022)
| Approach | HotpotQA EM | ALFWorld Success |
|---|---|---|
| Standard prompting (no tools) | 29.0% | 25% |
| Act-only (tool calls, no reasoning) | 28.7% | 45% |
| CoT-only (reasoning, no tools) | 28.7% | — |
| ReAct (reasoning + tool calls) | 35.1% | 71% |
The act-only baseline (tool calls without reasoning) performs similarly to no-tools prompting on multi-hop QA, confirming that reasoning traces are essential for effective tool selection and sequencing.
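The interleaved Thought → Action → Observation loop can be sketched as below. The `generate` callback, the `name[argument]` action syntax, and the `finish[...]` terminal action follow the ReAct paper's prompting convention, but the function names here are assumptions for illustration.

```python
def react_loop(question, generate, tools, max_steps=5):
    """Minimal ReAct-style loop: reason, act, observe, repeat.
    `generate(prompt)` is a hypothetical LM call returning the next segment;
    `tools` maps tool names to callables."""
    prompt = f"Question: {question}\n"
    for _ in range(max_steps):
        thought = generate(prompt + "Thought:")       # free-form reasoning step
        prompt += f"Thought:{thought}\n"
        action = generate(prompt + "Action:")         # e.g. "search[Apple Remote]"
        if action.strip().startswith("finish"):
            return action.split("[", 1)[1].rstrip("]")  # final answer
        name, arg = action.strip().split("[", 1)
        obs = tools[name](arg.rstrip("]"))            # executor runs the tool
        prompt += f"Action:{action}\nObservation: {obs}\n"
    return None  # step budget exhausted without a finish action
```

The reasoning segment is generated *before* each action, which is what lets the model plan tool choice rather than committing to a call directly.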
Structured Output Formats
Function calls require well-formed JSON within the text generation stream:
| Format | Mechanism | Parsing |
|---|---|---|
| Inline (Toolformer) | Special tokens wrap API call syntax | Token-level detection |
| Dedicated turn | Entire output is a JSON object | Message-level parsing |
| JSON schema constrained | Constrained decoding to valid JSON | Grammar-based sampling |
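For the dedicated-turn format, message-level parsing amounts to validating the whole output as one JSON object against a tool schema before execution. The schema shape below is a minimal illustration, not any particular vendor's format.

```python
import json

# Hypothetical per-tool argument schema: argument name -> expected type.
SCHEMA = {"calculator": {"expr": str}}

def parse_call(raw: str):
    """Validate a dedicated-turn function call before handing it to the executor."""
    call = json.loads(raw)                  # raises on malformed JSON
    name, args = call["name"], call["arguments"]
    expected = SCHEMA[name]                 # KeyError -> unknown tool
    for key, typ in expected.items():
        if not isinstance(args.get(key), typ):
            raise ValueError(f"argument {key!r} must be {typ.__name__}")
    return name, args

name, args = parse_call('{"name": "calculator", "arguments": {"expr": "24 * 365"}}')
# name == "calculator", args == {"expr": "24 * 365"}
```

Grammar-constrained decoding makes the malformed-JSON branch unreachable by construction; validation-after-parsing is the fallback when the model's output is unconstrained.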
Trust and Safety Boundaries
Tool use introduces a trust boundary: the model-generated function call is untrusted input to the executor. Sandbox requirements depend on tool capabilities:
- Read-only tools (search, calculator): low risk; broad access acceptable
- Write tools (database, email): require explicit user authorization per call
- Code execution: requires full sandboxing; never run directly on host
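The three tiers above suggest a simple policy gate in the executor. The tool names and the `confirm` callback are assumptions for illustration; a real deployment would tie `confirm` to an explicit user-facing approval prompt.

```python
# Per-tool trust tiers, mirroring the risk levels described above.
READ_ONLY = {"search", "calculator"}          # low risk: execute directly
WRITE = {"send_email", "update_db"}           # require per-call user approval

def authorize(tool_name, confirm):
    """Gate a model-generated call before execution.
    `confirm(tool_name)` is a hypothetical callback asking the user."""
    if tool_name in READ_ONLY:
        return True
    if tool_name in WRITE:
        return confirm(tool_name)
    return False  # unknown tools and code execution: deny by default
```

Denying by default means a newly added or hallucinated tool name fails closed rather than open.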
Related Pages
See chain-of-thought for reasoning traces that improve tool selection quality, rag for read-only retrieval-based augmentation, and context-window for how tool results consume the available token budget.
Sources
- Schick et al. (2023) — Toolformer: Language Models Can Teach Themselves to Use Tools. NeurIPS 2023
- Yao et al. (2022) — ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023
- Nakano et al. (2021) — WebGPT: Browser-assisted Question-Answering with Human Feedback. arXiv
Frequently Asked Questions
How does function calling differ from RAG?
RAG retrieves documents from a static vector index and prepends them to context before generation — it is read-only retrieval over a pre-indexed corpus. Function calling executes arbitrary code or API endpoints and returns structured results: computations (calculator, code interpreter), real-time lookups (live data, current prices), stateful writes (database updates, email sending), or multi-step workflows. Function calling is more general but requires a trust boundary — the executor must validate and sandbox what the model is permitted to invoke.
What is the ReAct prompting framework?
ReAct (Yao et al., 2022) interleaves reasoning traces with action steps: Thought → Action → Observation → Thought → Action → .... The model generates a natural language reasoning step explaining what information it needs, then a structured tool call, then incorporates the result as an observation before reasoning again. This explicit reasoning-before-action reduces errors compared to direct tool-call generation, as the model plans which tool to use and why before committing to a call.