Working with the Claude API: Authentication, Rate Limits, and Scaling

The Claude API gives you programmatic access to Claude's capabilities — text generation, vision, tool use, and more — for building production applications. This guide covers the practical details of API integration: authentication, handling rate limits, error recovery, and scaling patterns that keep your application reliable and cost-effective.

Authentication

API Keys

Every request to the Claude API requires an API key passed in the x-api-key header:

curl https://api.anthropic.com/v1/messages \
  -H "content-type: application/json" \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "claude-sonnet-4-6-20260319",
    "max_tokens": 1024,
    "messages": [{"role": "user", "content": "Hello, Claude"}]
  }'

Generate API keys in the Anthropic Console. Best practices:

  • Use separate keys per environment — development, staging, production
  • Rotate keys periodically — especially after team member departures
  • Never commit keys to version control — use environment variables or a secrets manager
  • Set key-level permissions — restrict keys to specific models or features when possible

SDK Authentication

The official SDKs read the ANTHROPIC_API_KEY environment variable by default:

# Python
import anthropic
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY automatically

# Or pass explicitly
client = anthropic.Anthropic(api_key="sk-ant-...")

// TypeScript
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();  // reads ANTHROPIC_API_KEY automatically

Authentication Errors

Error | Meaning | Fix
401 Unauthorized | Invalid or missing API key | Check that the key is correct and not expired
403 Forbidden | Key lacks permission for this operation | Check key permissions in the Console

These indicate configuration issues, not service availability problems.

Rate Limits

The Claude API enforces three types of rate limits:

  • Requests per minute (RPM) — total API calls
  • Input tokens per minute (ITPM) — tokens sent to Claude
  • Output tokens per minute (OTPM) — tokens generated by Claude

Rate Limit Tiers

Limits increase as you spend more:

Tier | Requirement | RPM (Sonnet) | ITPM
Tier 1 | $5 credit purchase | 50 | 30,000
Tier 2 | $40 credit purchase | 1,000 | 80,000
Tier 3 | $200 credit purchase | 2,000 | 400,000
Tier 4 | $400 credit purchase | 4,000 | 2,000,000

Exact limits vary by model — Opus has lower RPM than Sonnet at the same tier.

Handling 429 Errors

When you exceed a limit, the API returns a 429 Too Many Requests response with:

  • error.type: "rate_limit_error"
  • error.message: describes which limit you hit
  • retry-after header: seconds to wait before retrying

import time
import anthropic

client = anthropic.Anthropic()

def call_with_retry(messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.messages.create(
                model="claude-sonnet-4-6-20260319",
                max_tokens=1024,
                messages=messages,
            )
        except anthropic.RateLimitError as e:
            if attempt == max_retries - 1:
                raise
            retry_after = int(e.response.headers.get("retry-after", 60))
            time.sleep(retry_after)

Rate Limit Best Practices

Implement exponential backoff. Don't hammer the API after a 429 — wait the retry-after duration, then gradually increase wait times on subsequent failures.
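
A minimal sketch of such a backoff schedule (full jitter, capped; the helper name and defaults are ours, not part of the SDK):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Wait a random duration up to base * 2^attempt seconds, capped at `cap`.

    Full jitter spreads retries from many clients across time so they
    don't all hit the API again at the same instant.
    """
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

On a 429, sleep for max(retry_after, backoff_delay(attempt)) so you never retry sooner than the server asked.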

Use request queuing. For high-throughput applications, maintain a queue that dispatches requests at a rate below your limit rather than bursting and hitting 429s.
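
One way to pace such a queue is a token bucket that refills below your RPM limit (the class and its parameters are illustrative, not part of the SDK):

```python
import time

class TokenBucket:
    """Client-side throttle: allow at most `rate` requests per second,
    with bursts up to `capacity`. `clock` is injectable for testing."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate                # tokens refilled per second
        self.capacity = capacity        # maximum burst size
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def try_acquire(self) -> bool:
        # Refill based on elapsed time, then spend one token if available.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A dispatcher calls try_acquire() before each request and requeues work when it returns False. For a 1,000 RPM limit, a rate of about 15 per second leaves headroom.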

Monitor usage proactively. The API returns rate limit headers on every response:

  • anthropic-ratelimit-requests-limit
  • anthropic-ratelimit-requests-remaining
  • anthropic-ratelimit-requests-reset

Track these to see how close you are to limits before you hit them.
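
As a sketch, one way to act on those headers before hitting a limit (with the Python SDK, raw response headers are exposed through the with_raw_response variant of a call; the helper below is ours and treats headers as a plain mapping of strings):

```python
def near_rate_limit(headers, threshold: float = 0.1) -> bool:
    """True when fewer than `threshold` of this minute's requests remain.

    Header values arrive as strings, so convert before comparing.
    """
    limit = int(headers["anthropic-ratelimit-requests-limit"])
    remaining = int(headers["anthropic-ratelimit-requests-remaining"])
    return remaining / limit < threshold
```

When it returns True, pause dispatching until the time given in anthropic-ratelimit-requests-reset instead of burning requests on 429s.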

Making API Requests

The Messages API

All Claude interactions use the Messages API:

message = client.messages.create(
    model="claude-sonnet-4-6-20260319",
    max_tokens=1024,
    system="You are a helpful code reviewer.",
    messages=[
        {"role": "user", "content": "Review this function for bugs: ..."},
    ],
)

print(message.content[0].text)

Streaming

For real-time applications, use streaming to get tokens as they're generated:

with client.messages.stream(
    model="claude-sonnet-4-6-20260319",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain microservices architecture"}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

Streaming reduces perceived latency — users see output immediately rather than waiting for the full response.

Multi-Turn Conversations

Maintain conversation context by passing the full message history:

messages = [
    {"role": "user", "content": "What's the time complexity of quicksort?"},
    {"role": "assistant", "content": "Quicksort has an average time complexity of O(n log n)..."},
    {"role": "user", "content": "What about the worst case? When does it happen?"},
]

response = client.messages.create(
    model="claude-sonnet-4-6-20260319",
    max_tokens=1024,
    messages=messages,
)

Tool Use (Function Calling)

Claude can call functions you define, enabling it to interact with external systems:

tools = [
    {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "input_schema": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City and state"},
            },
            "required": ["location"],
        },
    }
]

message = client.messages.create(
    model="claude-sonnet-4-6-20260319",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What's the weather in San Francisco?"}],
)

# Check if Claude wants to use a tool
for block in message.content:
    if block.type == "tool_use":
        # Execute the function and return results
        tool_result = get_weather(block.input["location"])
        # Send the result back to Claude for final response

Error Handling

Build resilient applications by handling the full error spectrum:

import anthropic

try:
    message = client.messages.create(...)
except anthropic.AuthenticationError:
    # 401 — bad API key
    log.error("Invalid API key")
except anthropic.PermissionDeniedError:
    # 403 — insufficient permissions
    log.error("API key lacks required permissions")
except anthropic.RateLimitError:
    # 429 — rate limited, retry with backoff
    log.warning("Rate limited, retrying...")
except anthropic.APIStatusError as e:
    # 500, 529 — server error or overloaded
    log.error(f"API error: {e.status_code}")
except anthropic.APIConnectionError:
    # Network error
    log.error("Could not connect to Anthropic API")

Retryable vs. Non-Retryable Errors

Error Retryable? Action
401, 403 No Fix authentication
400 No Fix request format
429 Yes Wait and retry
500 Yes Retry with backoff
529 (Overloaded) Yes Retry with longer backoff

Scaling for Production

Request Batching

For non-time-sensitive workloads, the Message Batches API processes requests asynchronously at 50% of standard pricing:

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": "request-1",
            "params": {
                "model": "claude-sonnet-4-6-20260319",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": "Summarize this article: ..."}],
            },
        },
        # ... hundreds or thousands more requests
    ]
)

Batch requests don't count against standard rate limits and can contain thousands of items. Results are available within 24 hours.
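
Because results can complete in any order, map them back to your requests by custom_id. A sketch, treating each result as a plain dict with the documented result type field ("succeeded", "errored", and so on; the SDK actually returns typed objects):

```python
def index_results(results):
    """Split batch results into successes and failures, keyed by custom_id."""
    ok, failed = {}, {}
    for r in results:
        bucket = ok if r["result"]["type"] == "succeeded" else failed
        bucket[r["custom_id"]] = r["result"]
    return ok, failed
```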

Prompt Caching

Reduce costs on repeated context by caching static prompt components:

message = client.messages.create(
    model="claude-sonnet-4-6-20260319",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a legal document analyst...[large system prompt]...",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": "Analyze this contract clause: ..."}],
)

Cache hits cost 90% less than standard input tokens. The cache lasts 5 minutes by default (1.25x write cost) or 1 hour (2x write cost) — either pays for itself after 1-2 cache reads.
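
The break-even arithmetic behind that claim, as a quick sanity check (a cache read costs 0.1x a standard input token):

```python
import math

READ_MULT = 0.10  # cache reads cost 10% of standard input tokens

def breakeven_reads(write_mult: float) -> int:
    """Cache hits needed before the one-time write premium is recovered."""
    premium = write_mult - 1.0           # extra cost vs. an uncached request
    saving_per_read = 1.0 - READ_MULT    # saved on each later cache hit
    return math.ceil(premium / saving_per_read)

# 5-minute cache: breakeven_reads(1.25) → 1 hit
# 1-hour cache:   breakeven_reads(2.0)  → 2 hits
```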

Model Routing

Use different models for different tasks to optimize cost and latency:

def route_to_model(task_type: str) -> str:
    if task_type in ("classification", "extraction", "routing"):
        return "claude-haiku-4-5-20251001"    # Fast, cheap
    elif task_type in ("coding", "analysis", "writing"):
        return "claude-sonnet-4-6-20260319"   # Balanced
    elif task_type in ("complex_reasoning", "architecture"):
        return "claude-opus-4-6-20260319"     # Most capable
    return "claude-sonnet-4-6-20260319"       # Sensible default for unknown task types
