Tool Use and Function Calling: Language Models Invoking External Functions
Toolformer (Schick et al., 2023) teaches a model to self-supervise the insertion of API calls, reducing perplexity across 5 tool types; ReAct (Yao et al., 2022) interleaves reasoning and actions, raising HotpotQA exact match from 29.0% to 35.1% and ALFWorld success from 25% to 71%.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| ReAct HotpotQA Exact Match | 35.1% | Exact Match | Yao et al. (2022): ReAct (reason+act) vs 29.0% standard prompting; +6.1% absolute on multi-hop QA |
| ReAct ALFWorld success rate | 71% | % success | Yao et al. (2022): ReAct 71% vs 25% standard prompting; +46 points on embodied task completion |
| Toolformer tools | 5 tools | tool types | Schick et al. (2023): calculator, calendar, Wikipedia search, machine translation, QA system |
| Function call JSON format | {"name": "tool_name", "arguments": {"arg": "val"}} | JSON | Standard structured output; parsed by executor; result appended as observation in context |
Tool use (function calling) enables language models to invoke external functions — calculators, search engines, code interpreters, databases, and APIs — extending beyond the limitations of parametric knowledge stored in weights. The model generates a structured description of the function call; an external executor runs the function and returns the result, which is appended to the context for the next generation step.
The Tool Use Loop
User query
↓
LM generates: {"name": "calculator", "arguments": {"expr": "24 * 365"}}
↓
Executor: runs calculator("24 * 365") → 8760
↓
Context append: [TOOL_RESULT]: 8760
↓
LM continues: "There are 8760 hours in a year."
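The loop above can be sketched in a few lines. The tool registry, the `eval`-based calculator, and the `[TOOL_RESULT]` tag are illustrative assumptions, not a specific vendor's format; a real calculator tool should use a proper expression parser rather than `eval`.

```python
import json

# Hypothetical registry mapping tool names to Python callables.
TOOLS = {
    # eval with empty builtins, for illustration only -- not production-safe
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
}

def run_tool_call(model_output: str) -> str:
    """Parse a model-emitted function call, execute it, and format the observation."""
    call = json.loads(model_output)          # {"name": ..., "arguments": {...}}
    fn = TOOLS[call["name"]]                 # KeyError -> unknown tool
    result = fn(**call["arguments"])
    return f"[TOOL_RESULT]: {result}"        # appended to context for the next step

observation = run_tool_call('{"name": "calculator", "arguments": {"expr": "24 * 365"}}')
# observation == "[TOOL_RESULT]: 8760"
```

The executor, not the model, performs the computation; the model only sees the serialized observation in its context.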
Toolformer: Self-Supervised Tool Learning (Schick et al., 2023)
Toolformer uses self-supervision to teach a model when and how to insert tool calls:
- Sample positions in training text where a tool call might reduce prediction loss
- Generate candidate API calls via few-shot prompting
- Filter: keep only calls where executing the tool and inserting the result reduces loss on the following text
- Fine-tune on the filtered dataset with API calls embedded inline
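The filtering step can be sketched as follows, assuming a hypothetical scoring function `loss(prefix, continuation)` that returns the model's loss on `continuation` given `prefix` (standing in for a real forward pass); `tau` is the keep threshold from the paper's filtering criterion.

```python
def keep_call(prefix, api_call, api_result, continuation, loss, tau=0.1):
    """Keep an API call only if inserting the call *with* its result lowers
    loss on the following text by at least tau, compared to the better of:
    no call at all, or the call without its result."""
    with_result = loss(prefix + api_call + api_result, continuation)
    without_result = loss(prefix + api_call, continuation)
    plain = loss(prefix, continuation)
    return min(plain, without_result) - with_result >= tau
```

Comparing against both baselines ensures the *result* of the call is what helps prediction, not merely the extra tokens of the call syntax.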
| Tool | Example Use Case | Perplexity Benefit |
|---|---|---|
| Calculator | Arithmetic problems in training text | Reduces loss on following numbers |
| Calendar | Date arithmetic and temporal reasoning | Correct date computations |
| Wikipedia search | Factual entity lookups | Grounded factual claims |
| Machine translation | Non-English text processing | Correct multilingual handling |
| QA system | Knowledge retrieval | Factual question answering |
ReAct vs Direct Tool Calling (Yao et al., 2022)
| Approach | HotpotQA EM | ALFWorld Success |
|---|---|---|
| Standard prompting (no tools) | 29.0% | 25% |
| Act-only (tool calls, no reasoning) | 28.7% | 45% |
| CoT-only (reasoning, no tools) | 28.7% | — |
| ReAct (reasoning + tool calls) | 35.1% | 71% |
The act-only baseline (tool calls without reasoning) performs similarly to no-tools prompting on multi-hop QA, confirming that reasoning traces are essential for effective tool selection and sequencing.
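The interleaved Thought → Action → Observation loop can be sketched as below. The `generate` callback, the `name[argument]` action syntax, and the `finish[...]` terminal action follow the ReAct paper's prompting convention, but the function names here are assumptions for illustration.

```python
def react_loop(question, generate, tools, max_steps=5):
    """Minimal ReAct-style loop: reason, act, observe, repeat.
    `generate(prompt)` is a hypothetical LM call returning the next segment;
    `tools` maps tool names to callables."""
    prompt = f"Question: {question}\n"
    for _ in range(max_steps):
        thought = generate(prompt + "Thought:")       # free-form reasoning step
        prompt += f"Thought:{thought}\n"
        action = generate(prompt + "Action:")         # e.g. "search[Apple Remote]"
        if action.strip().startswith("finish"):
            return action.split("[", 1)[1].rstrip("]")  # final answer
        name, arg = action.strip().split("[", 1)
        obs = tools[name](arg.rstrip("]"))            # executor runs the tool
        prompt += f"Action:{action}\nObservation: {obs}\n"
    return None  # step budget exhausted without a finish action
```

The reasoning segment is generated *before* each action, which is what lets the model plan tool choice rather than committing to a call directly.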
Structured Output Formats
Function calls require well-formed JSON within the text generation stream:
| Format | Mechanism | Parsing |
|---|---|---|
| Inline (Toolformer) | Special tokens wrap API call syntax | Token-level detection |
| Dedicated turn | Entire output is a JSON object | Message-level parsing |
| JSON schema constrained | Constrained decoding to valid JSON | Grammar-based sampling |
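For the dedicated-turn format, message-level parsing amounts to validating the whole output as one JSON object against a tool schema before execution. The schema shape below is a minimal illustration, not any particular vendor's format.

```python
import json

# Hypothetical per-tool argument schema: argument name -> expected type.
SCHEMA = {"calculator": {"expr": str}}

def parse_call(raw: str):
    """Validate a dedicated-turn function call before handing it to the executor."""
    call = json.loads(raw)                  # raises on malformed JSON
    name, args = call["name"], call["arguments"]
    expected = SCHEMA[name]                 # KeyError -> unknown tool
    for key, typ in expected.items():
        if not isinstance(args.get(key), typ):
            raise ValueError(f"argument {key!r} must be {typ.__name__}")
    return name, args

name, args = parse_call('{"name": "calculator", "arguments": {"expr": "24 * 365"}}')
# name == "calculator", args == {"expr": "24 * 365"}
```

Grammar-constrained decoding makes the malformed-JSON branch unreachable by construction; validation-after-parsing is the fallback when the model's output is unconstrained.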
Trust and Safety Boundaries
Tool use introduces a trust boundary: the model-generated function call is untrusted input to the executor. Sandbox requirements depend on tool capabilities:
- Read-only tools (search, calculator): low risk; broad access acceptable
- Write tools (database, email): require explicit user authorization per call
- Code execution: requires full sandboxing; never run directly on host
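The three tiers above suggest a simple policy gate in the executor. The tool names and the `confirm` callback are assumptions for illustration; a real deployment would tie `confirm` to an explicit user-facing approval prompt.

```python
# Per-tool trust tiers, mirroring the risk levels described above.
READ_ONLY = {"search", "calculator"}          # low risk: execute directly
WRITE = {"send_email", "update_db"}           # require per-call user approval

def authorize(tool_name, confirm):
    """Gate a model-generated call before execution.
    `confirm(tool_name)` is a hypothetical callback asking the user."""
    if tool_name in READ_ONLY:
        return True
    if tool_name in WRITE:
        return confirm(tool_name)
    return False  # unknown tools and code execution: deny by default
```

Denying by default means a newly added or hallucinated tool name fails closed rather than open.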
Related Pages
See chain-of-thought for reasoning traces that improve tool selection quality, rag for read-only retrieval-based augmentation, and context-window for how tool results consume the available token budget.
Sources
- Schick et al. (2023) — Toolformer: Language Models Can Teach Themselves to Use Tools. NeurIPS 2023
- Yao et al. (2022) — ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023
- Nakano et al. (2021) — WebGPT: Browser-assisted Question-Answering with Human Feedback. arXiv
Frequently Asked Questions
How does function calling differ from RAG?
RAG retrieves documents from a static vector index and prepends them to context before generation — it is read-only retrieval over a pre-indexed corpus. Function calling executes arbitrary code or API endpoints and returns structured results: computations (calculator, code interpreter), real-time lookups (live data, current prices), stateful writes (database updates, email sending), or multi-step workflows. Function calling is more general but requires a trust boundary — the executor must validate and sandbox what the model is permitted to invoke.
What is the ReAct prompting framework?
ReAct (Yao et al., 2022) interleaves reasoning traces with action steps: Thought → Action → Observation → Thought → Action → .... The model generates a natural language reasoning step explaining what information it needs, then a structured tool call, then incorporates the result as an observation before reasoning again. This explicit reasoning-before-action reduces errors compared to direct tool-call generation, as the model plans which tool to use and why before committing to a call.