Working with the Claude API: Authentication, Rate Limits, and Scaling

The Claude API gives you programmatic access to Claude's capabilities — text generation, vision, tool use, and more — for building production applications. This guide covers the practical details of API integration: authentication, handling rate limits, error recovery, and scaling patterns that keep your application reliable and cost-effective.

Authentication

API Keys

Every request to the Claude API requires an API key passed in the x-api-key header:

curl https://api.anthropic.com/v1/messages \
  -H "content-type: application/json" \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "claude-sonnet-4-6-20260319",
    "max_tokens": 1024,
    "messages": [{"role": "user", "content": "Hello, Claude"}]
  }'

Generate API keys in the Anthropic Console. Best practices:

  • Use separate keys per environment — development, staging, production
  • Rotate keys periodically — especially after team member departures
  • Never commit keys to version control — use environment variables or a secrets manager
  • Set key-level permissions — restrict keys to specific models or features when possible

SDK Authentication

The official SDKs read the ANTHROPIC_API_KEY environment variable by default:

# Python
import anthropic
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY automatically

# Or pass explicitly
client = anthropic.Anthropic(api_key="sk-ant-...")

// TypeScript
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();  // reads ANTHROPIC_API_KEY automatically

Authentication Errors

Error | Meaning | Fix
401 Unauthorized | Invalid or missing API key | Check that the key is correct and not expired
403 Forbidden | Key lacks permission for this operation | Check key permissions in the Console

These indicate configuration issues, not service availability problems.

Rate Limits

The Claude API enforces three types of rate limits:

  • Requests per minute (RPM) — total API calls
  • Input tokens per minute (ITPM) — tokens sent to Claude
  • Output tokens per minute (OTPM) — tokens generated by Claude

Rate Limit Tiers

Limits increase as you spend more:

Tier | Requirement | RPM (Sonnet) | ITPM
Tier 1 | $5 credit purchase | 50 | 30,000
Tier 2 | $40 credit purchase | 1,000 | 80,000
Tier 3 | $200 credit purchase | 2,000 | 400,000
Tier 4 | $400 credit purchase | 4,000 | 2,000,000

Exact limits vary by model — Opus has lower RPM than Sonnet at the same tier.

Handling 429 Errors

When you exceed a limit, the API returns a 429 Too Many Requests response with:

  • error.type: "rate_limit_error"
  • error.message: describes which limit you hit
  • retry-after header: seconds to wait before retrying

import time
import anthropic

client = anthropic.Anthropic()

def call_with_retry(messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.messages.create(
                model="claude-sonnet-4-6-20260319",
                max_tokens=1024,
                messages=messages,
            )
        except anthropic.RateLimitError as e:
            if attempt == max_retries - 1:
                raise
            retry_after = int(e.response.headers.get("retry-after", 60))
            time.sleep(retry_after)

Rate Limit Best Practices

Implement exponential backoff. Don't hammer the API after a 429 — wait the retry-after duration, then gradually increase wait times on subsequent failures.
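
A minimal sketch of such a backoff schedule (full jitter, capped; the helper name and defaults are ours, not part of the SDK):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Wait a random duration up to base * 2^attempt seconds, capped at `cap`.

    Full jitter spreads retries from many clients across time so they
    don't all hit the API again at the same instant.
    """
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

On a 429, sleep for max(retry_after, backoff_delay(attempt)) so you never retry sooner than the server asked.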

Use request queuing. For high-throughput applications, maintain a queue that dispatches requests at a rate below your limit rather than bursting and hitting 429s.
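
One way to pace such a queue is a token bucket that refills below your RPM limit (the class and its parameters are illustrative, not part of the SDK):

```python
import time

class TokenBucket:
    """Client-side throttle: allow at most `rate` requests per second,
    with bursts up to `capacity`. `clock` is injectable for testing."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate                # tokens refilled per second
        self.capacity = capacity        # maximum burst size
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def try_acquire(self) -> bool:
        # Refill based on elapsed time, then spend one token if available.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A dispatcher calls try_acquire() before each request and requeues work when it returns False. For a 1,000 RPM limit, a rate of about 15 per second leaves headroom.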

Monitor usage proactively. The API returns rate limit headers on every response:

  • anthropic-ratelimit-requests-limit
  • anthropic-ratelimit-requests-remaining
  • anthropic-ratelimit-requests-reset

Track these to see how close you are to limits before you hit them.
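
As a sketch, one way to act on those headers before hitting a limit (with the Python SDK, raw response headers are exposed through the with_raw_response variant of a call; the helper below is ours and treats headers as a plain mapping of strings):

```python
def near_rate_limit(headers, threshold: float = 0.1) -> bool:
    """True when fewer than `threshold` of this minute's requests remain.

    Header values arrive as strings, so convert before comparing.
    """
    limit = int(headers["anthropic-ratelimit-requests-limit"])
    remaining = int(headers["anthropic-ratelimit-requests-remaining"])
    return remaining / limit < threshold
```

When it returns True, pause dispatching until the time given in anthropic-ratelimit-requests-reset instead of burning requests on 429s.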

Making API Requests

The Messages API

All Claude interactions use the Messages API:

message = client.messages.create(
    model="claude-sonnet-4-6-20260319",
    max_tokens=1024,
    system="You are a helpful code reviewer.",
    messages=[
        {"role": "user", "content": "Review this function for bugs: ..."},
    ],
)

print(message.content[0].text)

Streaming

For real-time applications, use streaming to get tokens as they're generated:

with client.messages.stream(
    model="claude-sonnet-4-6-20260319",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain microservices architecture"}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

Streaming reduces perceived latency — users see output immediately rather than waiting for the full response.

Multi-Turn Conversations

Maintain conversation context by passing the full message history:

messages = [
    {"role": "user", "content": "What's the time complexity of quicksort?"},
    {"role": "assistant", "content": "Quicksort has an average time complexity of O(n log n)..."},
    {"role": "user", "content": "What about the worst case? When does it happen?"},
]

response = client.messages.create(
    model="claude-sonnet-4-6-20260319",
    max_tokens=1024,
    messages=messages,
)

Tool Use (Function Calling)

Claude can call functions you define, enabling it to interact with external systems:

tools = [
    {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "input_schema": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City and state"},
            },
            "required": ["location"],
        },
    }
]

message = client.messages.create(
    model="claude-sonnet-4-6-20260319",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What's the weather in San Francisco?"}],
)

# Check if Claude wants to use a tool
for block in message.content:
    if block.type == "tool_use":
        # Execute the function and return results
        tool_result = get_weather(block.input["location"])
        # Send the result back to Claude for final response

Error Handling

Build resilient applications by handling the full error spectrum:

import anthropic

try:
    message = client.messages.create(...)
except anthropic.AuthenticationError:
    # 401 — bad API key
    log.error("Invalid API key")
except anthropic.PermissionDeniedError:
    # 403 — insufficient permissions
    log.error("API key lacks required permissions")
except anthropic.RateLimitError:
    # 429 — rate limited, retry with backoff
    log.warning("Rate limited, retrying...")
except anthropic.APIStatusError as e:
    # 500, 529 — server error or overloaded
    log.error(f"API error: {e.status_code}")
except anthropic.APIConnectionError:
    # Network error
    log.error("Could not connect to Anthropic API")

Retryable vs. Non-Retryable Errors

Error Retryable? Action
401, 403 No Fix authentication
400 No Fix request format
429 Yes Wait and retry
500 Yes Retry with backoff
529 (Overloaded) Yes Retry with longer backoff

Scaling for Production

Request Batching

For non-time-sensitive workloads, the Message Batches API processes requests asynchronously at 50% of standard pricing:

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": "request-1",
            "params": {
                "model": "claude-sonnet-4-6-20260319",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": "Summarize this article: ..."}],
            },
        },
        # ... hundreds or thousands more requests
    ]
)

Batch requests don't count against standard rate limits and can contain thousands of items. Results are available within 24 hours.
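
Because results can complete in any order, map them back to your requests by custom_id. A sketch, treating each result as a plain dict with the documented result type field ("succeeded", "errored", and so on; the SDK actually returns typed objects):

```python
def index_results(results):
    """Split batch results into successes and failures, keyed by custom_id."""
    ok, failed = {}, {}
    for r in results:
        bucket = ok if r["result"]["type"] == "succeeded" else failed
        bucket[r["custom_id"]] = r["result"]
    return ok, failed
```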

Prompt Caching

Reduce costs on repeated context by caching static prompt components:

message = client.messages.create(
    model="claude-sonnet-4-6-20260319",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a legal document analyst...[large system prompt]...",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": "Analyze this contract clause: ..."}],
)

Cache hits cost 90% less than standard input tokens. The cache lasts 5 minutes by default (1.25x write cost) or 1 hour (2x write cost) — either pays for itself after 1-2 cache reads.
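
The break-even arithmetic behind that claim, as a quick sanity check (a cache read costs 0.1x a standard input token):

```python
import math

READ_MULT = 0.10  # cache reads cost 10% of standard input tokens

def breakeven_reads(write_mult: float) -> int:
    """Cache hits needed before the one-time write premium is recovered."""
    premium = write_mult - 1.0           # extra cost vs. an uncached request
    saving_per_read = 1.0 - READ_MULT    # saved on each later cache hit
    return math.ceil(premium / saving_per_read)

# 5-minute cache: breakeven_reads(1.25) → 1 hit
# 1-hour cache:   breakeven_reads(2.0)  → 2 hits
```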

Model Routing

Use different models for different tasks to optimize cost and latency:

def route_to_model(task_type: str) -> str:
    if task_type in ("classification", "extraction", "routing"):
        return "claude-haiku-4-5-20251001"    # Fast, cheap
    elif task_type in ("coding", "analysis", "writing"):
        return "claude-sonnet-4-6-20260319"   # Balanced
    elif task_type in ("complex_reasoning", "architecture"):
        return "claude-opus-4-6-20260319"     # Most capable
    return "claude-sonnet-4-6-20260319"       # Sensible default for unknown task types
