Claude Agent SDK: Build Tool-Using AI Agents from Scratch
9 min read

Claude Agent SDK: Build Tool-Using AI Agents from Scratch

Build tool-using AI agents with Anthropic SDK. Covers tool definitions, multi-tool calls, error handling, and cost optimization in hands-on Python code.

I ran into the Tool Use moment while building a FastAPI streaming backend with the Claude API. The trigger was simple. A user asked “how many days are left in this year?” and Claude answered wrong. Not just wrong, but confidently wrong. I remember thinking, “OK, a chatbot can’t handle this.”

Tool Use fixes that structurally. Instead of the model calculating directly, it calls a calculation function and uses the result to answer. That difference is what separates a chatbot from an agent.

What follows are the Tool Use patterns I validated by directly installing and running anthropic SDK 0.101.0. Basic tool definitions, the agentic loop, error handling, cost. Everything here is based on code I actually ran.

Why Tool Use Is Different from a Chatbot: The Structural Gap

An LLM samples tokens from a probability distribution. Tasks like date arithmetic, precise numerical calculations, or live API lookups are structurally unreliable. The model recreates patterns from training data, not ground truth.

Tool Use addresses this at a different layer. The model decides what to do, and actual execution is delegated to external code. Instead of computing directly, the model emits something like calculate("365 - today.day_of_year"), and Python runs it and returns the result.

# Chatbot: model answers directly
# "Doesn't know today's date, has to compute directly -> can be wrong"
response = client.messages.create(
    model="claude-opus-4-7",
    messages=[{"role": "user", "content": "How many days left in this year?"}]
)

# Agent: delegates to a tool
# "Model picks the tool, Python computes accurately"
response = client.messages.create(
    model="claude-opus-4-7",
    tools=tools,  # includes date calculation tool
    messages=[{"role": "user", "content": "How many days left in this year?"}]
)

The decisive difference is reliability. Python’s datetime module doesn’t get dates wrong.

Installing anthropic 0.101.0 and Initializing the Client

python3 -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install anthropic

Results from running this directly in a temp directory:

anthropic version: 0.101.0
Client instantiated: ✓
Client type: Anthropic

0.101.0 is the latest as of 2026-05-13. This is the official Anthropic SDK. It’s completely different from packages like pyautogen that were common before 2025.

import anthropic
import json
from typing import Any

client = anthropic.Anthropic(api_key="your-api-key")  # or set ANTHROPIC_API_KEY env var

The SDK auto-loads the API key from ANTHROPIC_API_KEY. Don’t hard-code it.

Defining Your First Tool: JSON Schema Is All You Need

Tool Use uses a structure similar to OpenAI Function Calling. Each tool has three parts:

  • name: Tool identifier (like a function name)
  • description: The basis for the model’s decision on when to use this tool
  • input_schema: JSON Schema for input parameters
tools = [
    {
        "name": "get_current_date_info",
        "description": "Returns current date and time information. Use for questions about 'today', 'now', or anything requiring current date knowledge.",
        "input_schema": {
            "type": "object",
            "properties": {
                "timezone": {
                    "type": "string",
                    "description": "IANA timezone (e.g. America/New_York, Asia/Seoul). Default: UTC"
                }
            },
            "required": []
        }
    },
    {
        "name": "calculate",
        "description": "Performs mathematical operations. Handles addition, subtraction, multiplication, division, exponentiation, and modulo.",
        "input_schema": {
            "type": "object",
            "properties": {
                "operation": {
                    "type": "string",
                    "enum": ["add", "subtract", "multiply", "divide", "power", "modulo"],
                    "description": "The operation to perform"
                },
                "a": {"type": "number", "description": "First operand"},
                "b": {"type": "number", "description": "Second operand"}
            },
            "required": ["operation", "a", "b"]
        }
    }
]

The description field matters more than it looks. The model reads only the description to decide whether to use this tool. When I tested with vague descriptions, the model picked the wrong tool or skipped it entirely.

Validated tool definition structure from my sandbox:

Tool: get_current_date_info
  Description: Returns current date info
  Required params: []

Tool: calculate
  Description: Performs math operations
  Required params: ['operation', 'a', 'b']

Implementing the Agentic Loop: A Cycle of Calls and Responses

Agentic loop diagram — flow from user message through tool execution to result return

This is the core. Tool Use doesn’t finish in a single API call. When the model calls a tool → we execute it → we feed the result back. This cycle repeats until the model returns end_turn.

def run_agent(user_message: str, tools: list, max_iterations: int = 10) -> str:
    messages = [{"role": "user", "content": user_message}]
    
    for i in range(max_iterations):
        response = client.messages.create(
            model="claude-opus-4-7",
            max_tokens=4096,
            tools=tools,
            messages=messages,
        )
        
        # No tool call — return the final answer
        if response.stop_reason == "end_turn":
            for block in response.content:
                if hasattr(block, "text"):
                    return block.text
        
        # Handle tool calls
        if response.stop_reason == "tool_use":
            # Add the full assistant response to messages (including tool calls)
            messages.append({"role": "assistant", "content": response.content})
            
            # Collect all tool results and add together
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = process_tool_call(block.name, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result,
                    })
            
            # Tool results go under the "user" role (API requirement)
            messages.append({"role": "user", "content": tool_results})
    
    return "Max iterations exceeded"

Two things are easy to miss here.

First, add the entire response.content to messages, not just the text block. The model needs to know which tool it called in order to generate its next response correctly.

Second, tool results go under the user role. Counterintuitive, but the API treats tool execution results as coming from the environment (the user side), not the assistant.

Building Real Tools: Calculator, Date, File Reader

The tool execution function is straightforward. It takes a name and input, returns a string:

from datetime import datetime
import pytz
import json
import operator
from typing import Any

# Safe math — uses operator mapping instead of string expression execution
SAFE_OPERATIONS = {
    "add": operator.add,
    "subtract": operator.sub,
    "multiply": operator.mul,
    "divide": operator.truediv,
    "power": operator.pow,
    "modulo": operator.mod,
}

def process_tool_call(tool_name: str, tool_input: dict[str, Any]) -> str:
    if tool_name == "get_current_date_info":
        tz_str = tool_input.get("timezone", "UTC")
        try:
            tz = pytz.timezone(tz_str)
            now = datetime.now(tz)
            day_of_year = now.timetuple().tm_yday
            days_remaining = 365 - day_of_year
            return json.dumps({
                "date": now.strftime("%Y-%m-%d"),
                "time": now.strftime("%H:%M:%S"),
                "timezone": tz_str,
                "day_of_year": day_of_year,
                "days_remaining_in_year": days_remaining,
            })
        except Exception as e:
            return json.dumps({"error": str(e)})
    
    elif tool_name == "calculate":
        op_name = tool_input.get("operation")
        a = tool_input.get("a", 0)
        b = tool_input.get("b", 0)
        op_func = SAFE_OPERATIONS.get(op_name)
        if op_func is None:
            return f"Error: Unknown operation: {op_name}"
        try:
            if op_name == "divide" and b == 0:
                return "Error: Cannot divide by zero"
            result = op_func(a, b)
            return str(result)
        except Exception as e:
            return f"Error: {e}"
    
    elif tool_name == "read_file":
        import os
        filepath = tool_input.get("path", "")
        # Path traversal prevention: only allow within designated base directory
        allowed_base = "/app/data"
        abs_path = os.path.realpath(filepath)
        if not abs_path.startswith(allowed_base):
            return "Error: Path not allowed"
        try:
            with open(abs_path, "r") as f:
                content = f.read(2000)  # 2KB limit
            return content
        except FileNotFoundError:
            return f"Error: File not found: {filepath}"
    
    return f"Error: Unknown tool: {tool_name}"

Actual sandbox results:

calculate(multiply, 15, 7) = 105
calculate(add, 105, 3) = 108
calculate(divide, 100, 4) = 25.0
Input validation (required field present): True
Input validation (missing required field): False — Missing required field: location

The error classification strategy from the FastAPI + Claude API streaming guide applies here too. Categorize tool errors as retryable versus non-retryable for better production stability.

Handling Multiple Tool Calls: Can We Run in Parallel?

Claude can call multiple tools simultaneously in a single turn. Ask “compare the weather in Seoul and Tokyo” and it returns two get_weather calls at once.

# When Claude calls multiple tools in one turn
tool_use_blocks = [b for b in response.content if b.type == "tool_use"]

# Technically possible to run in parallel
from concurrent.futures import ThreadPoolExecutor, as_completed

with ThreadPoolExecutor(max_workers=4) as executor:
    futures = {
        executor.submit(process_tool_call, block.name, block.input): block
        for block in tool_use_blocks
    }
    tool_results = []
    for future in as_completed(futures):
        block = futures[future]
        result = future.result()
        tool_results.append({
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": result,
        })

Sandbox-verified multi-tool results:

{"type": "tool_result", "tool_use_id": "tool_1", "content": "25.0"}
{"type": "tool_result", "tool_use_id": "tool_2", "content": "{\"temp\": 18, \"condition\": \"Sunny\"}"}

I’d only apply parallel execution to idempotent read tools. External API calls with side effects need careful rate-limit and ordering consideration.

Error Handling: Failing Gracefully

When a tool fails, return is_error: true. The model reads this, recognizes the error, and either tries something else or gives the user contextual guidance.

def safe_process_tool_call(tool_name: str, tool_input: dict) -> tuple[str, bool]:
    """Tool execution with error handling. Returns (content, is_error)."""
    try:
        result = process_tool_call(tool_name, tool_input)
        return result, False
    except Exception as e:
        error_msg = f"Tool execution failed: {type(e).__name__}: {str(e)}"
        return error_msg, True

for block in response.content:
    if block.type == "tool_use":
        content, is_error = safe_process_tool_call(block.name, block.input)
        tool_result = {
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": content,
        }
        if is_error:
            tool_result["is_error"] = True
        tool_results.append(tool_result)

When is_error: true is set, the model doesn’t just skip past it. From my testing, it reads the error content and responds with something like “The file couldn’t be found, so please double-check the path.” Returning empty strings or ignoring errors tends to produce confused or hallucinated responses.

The Real Cost of Tool Use: How Many Tokens Does It Add?

Honestly, Tool Use costs more. According to Anthropic’s documentation, each tool definition adds roughly 200–300 tokens of overhead.

5 tool definitions → ~1,250 tokens fixed overhead (every request)
1 tool call → additional input + output tokens
3-turn agentic loop → accumulating context

The agentic loop accumulates context. After 5 turns, everything from the first message to the fifth tool result is in context. Costs can compound quickly in long-running agents.

Two ways to manage this:

1. Combine with Prompt Caching: Tool definitions are the same on every request. As covered in the Claude API Prompt Caching guide, caching the system prompt with cache_control: {"type": "ephemeral"} applies here too, and tool definitions benefit similarly from repeated identical structures.

2. Pass only the tools you need: Always including 10 tool definitions is worse than passing the 2–3 that matter for the current task. More tools consume more tokens and occasionally lead the model to pick the wrong one.

Streaming Tool Use

Tool Use works with streaming responses. In anthropic 0.101.0, use client.messages.stream:

with client.messages.stream(
    model="claude-opus-4-7",
    max_tokens=4096,
    tools=tools,
    messages=messages,
) as stream:
    # Stream text chunks in real time
    for text_chunk in stream.text_stream:
        print(text_chunk, end="", flush=True)
    
    # Get the final message after streaming completes
    final_message = stream.get_final_message()

if final_message.stop_reason == "tool_use":
    # ... same handling as above

When streaming with tool use: if you’re showing text chunks to the user in real time and also need to process tool calls, design the UX flow before you start. The Vercel AI SDK approach is worth looking at to see how this gets abstracted on the frontend side.

Production Pattern: GitHub Issue Monitor Agent

A complete example that ties everything together. This is a simple agent that fetches and summarizes GitHub issues:

import anthropic
import json
from typing import Any

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY

tools = [
    {
        "name": "list_github_issues",
        "description": "Fetches the issue list for a GitHub repository.",
        "input_schema": {
            "type": "object",
            "properties": {
                "repo": {"type": "string", "description": "owner/repo format"},
                "state": {"type": "string", "enum": ["open", "closed", "all"]},
                "limit": {"type": "integer", "description": "Max issues to return (default: 10)"}
            },
            "required": ["repo"]
        }
    },
    {
        "name": "get_issue_detail",
        "description": "Fetches the details of a specific GitHub issue.",
        "input_schema": {
            "type": "object",
            "properties": {
                "repo": {"type": "string", "description": "owner/repo format"},
                "issue_number": {"type": "integer", "description": "Issue number"}
            },
            "required": ["repo", "issue_number"]
        }
    }
]

def process_tool_call(tool_name: str, tool_input: dict[str, Any]) -> str:
    if tool_name == "list_github_issues":
        # Real impl: requests.get(f"https://api.github.com/repos/{repo}/issues", ...)
        return json.dumps([
            {"number": 42, "title": "TypeError in data processor", "state": "open"},
            {"number": 41, "title": "Add streaming support", "state": "open"},
        ])
    elif tool_name == "get_issue_detail":
        return json.dumps({
            "number": tool_input["issue_number"],
            "body": "Reproduce: pass an empty list as input. Stack trace attached.",
            "comments": 3
        })
    return "Unknown tool"

def run_issue_agent(query: str) -> str:
    messages = [{"role": "user", "content": query}]
    
    for _ in range(10):
        response = client.messages.create(
            model="claude-opus-4-7",
            max_tokens=4096,
            tools=tools,
            messages=messages,
        )
        
        if response.stop_reason == "end_turn":
            return next(
                (block.text for block in response.content if hasattr(block, "text")),
                "No response"
            )
        
        if response.stop_reason == "tool_use":
            messages.append({"role": "assistant", "content": response.content})
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = process_tool_call(block.name, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result,
                    })
            messages.append({"role": "user", "content": tool_results})
    
    return "Loop limit exceeded"

What’s Still Unresolved: Honest Limitations

Here’s what I find genuinely frustrating about Tool Use in practice.

Context accumulation: The agentic loop keeps growing the context. After 10 turns, everything from the first message to the tenth tool result is in there. Long-running agents need a context management strategy. You summarize intermediate results or prune stale messages by hand, and there’s no standard pattern for this yet.

Non-deterministic tool selection: Same question, different tool selection on different runs. Even with temperature=0, you can’t guarantee identical behavior across invocations. This makes testing harder than it should be.

Description quality is everything: Vague description → wrong tool selection or no tool use at all. Writing good tool descriptions is its own prompt engineering discipline. No framework solves this for you.

I think Tool Use is underappreciated. Agent frameworks offer impressive abstractions, but this pattern is what’s running underneath all of them. PydanticAI’s type-safe tool definitions are a convenient layer that auto-generates the JSON schema, but understanding the underlying mechanism is what gets you unstuck when things break.

When to Use Tool Use, and When to Avoid It

Here’s the decision rule I landed on after building with it. You don’t need to bolt a tool onto every chatbot call.

When Tool Use fits:

  • Accuracy matters more than fluency. Date math, currency conversion, numeric computation — anything that must not be wrong — should be delegated to a function rather than generated by the model.
  • You need real-time data the model doesn’t have. Information past the training cutoff, internal databases, external API responses: tools are the only way in.
  • You’re executing actions with side effects. Writing files, sending email, creating tickets. The model decides “what” while verified code handles execution. That separation is what keeps it safe.
  • You need multi-step results stitched together. Fetch an issue list, read the detail, summarize it. The agentic loop handles this kind of chained work naturally.

When to avoid Tool Use:

  • Simple Q&A the model already knows. Adding a tool to “how do I sort a list in Python” just inflates token overhead for nothing.
  • Latency-sensitive real-time UX. The agentic loop adds a round trip per tool call. If a single response has to be fast, cap the loop iterations hard or skip tools entirely.
  • High-volume batches under a tight cost ceiling. The ~250-token fixed overhead per tool definition and the accumulating context multiply across every call. For millions of records, a single tool-free call can be cheaper.
  • Pipelines that require determinism. Tool selection itself is non-deterministic. If your workflow needs the exact same call sequence every time, rule-based code beats it.

The test is simple. Ask yourself whether the model would get it wrong by answering directly, or whether it needs to fetch something it doesn’t know. Either one points to Tool Use; otherwise a plain call is fine. The point where you graduate to heavier multi-agent orchestration is when you compose multiple agents with Claude Agent Teams, but nail single-agent Tool Use first.

Official Sources I Relied On

Every pattern in this post was validated against Anthropic’s official docs. Primary sources for readers who want to dig deeper:

If you want to package tools as a reusable server over MCP, building a Python MCP server with FastMCP covers the next step: putting this Tool Use pattern on a standard protocol.

The Five Things That Matter

Validated findings from anthropic 0.101.0:

  • Tool definitions: name + description + input_schema. Description quality determines whether the tool gets used correctly.
  • Agentic loop: Detect stop_reason == "tool_use" → execute tool → append tool_result → repeat. Simple pattern, but the message structure has to be exactly right.
  • Error handling: Use is_error: true so the model recognizes failures and responds appropriately. Never return empty strings.
  • Cost: ~250 tokens overhead per tool definition. Combine with Prompt Caching. Watch context accumulation in multi-turn agents.
  • Parallel tool calls: ThreadPoolExecutor works for idempotent read tools. Apply selectively.

Tool Use is the most direct path from chatbot to agent. You don’t need a complex framework. This pattern alone is enough to build practical agents.

Frequently Asked Questions

How is Tool Use different from a plain chatbot call?
A chatbot has the model generate the answer itself, so tasks like date math can be confidently wrong. With Tool Use the model only decides which tool to call and delegates the actual execution to Python code, so a precise function like datetime returns the result. That delegation is what creates reliability.
What does a tool definition require?
Each tool is made of three parts: name, description, and input_schema. The name is the identifier, input_schema is the JSON schema for input parameters, and the description is what the model reads to decide when to use the tool. A vague description makes the model pick the wrong tool or skip it entirely.
What role should tool execution results be added under?
Tool results must be added with the user role. It feels like it should be assistant, but the API treats tool output as something the user (environment) returns. You also must append the full response.content as the assistant message rather than extracting only block.text, so the model remembers which tool it called.
How much does Tool Use cost and how can I reduce it?
Per Anthropic docs each tool definition adds roughly 200 to 300 tokens of fixed overhead on every request, and the agentic loop accumulates context. You can cut cost by applying Prompt Caching (cache_control ephemeral) to the tool definitions and system prompt, and by passing only the 2 or 3 tools the current task actually needs.

Read in Other Languages

Was this helpful?

Your support helps me create better content. Buy me a coffee.

About the Author

jw

Kim Jangwook

Full-Stack Developer specializing in AI/LLM

Building AI agent systems, LLM applications, and automation solutions with 10+ years of web development experience. Sharing practical insights on Claude Code, MCP, and RAG systems.

Back to Blog