Working with the Claude API: Authentication, Rate Limits, and Scaling
The Claude API gives you programmatic access to Claude's capabilities — text generation, vision, tool use, and more — for building production applications. This guide covers the practical details of API integration: authentication, handling rate limits, error recovery, and scaling patterns that keep your application reliable and cost-effective.
Authentication
API Keys
Every request to the Claude API requires an API key passed in the x-api-key header:
```bash
curl https://api.anthropic.com/v1/messages \
  -H "content-type: application/json" \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "claude-sonnet-4-6-20260319",
    "max_tokens": 1024,
    "messages": [{"role": "user", "content": "Hello, Claude"}]
  }'
```
Generate API keys in the Anthropic Console. Best practices:
- Use separate keys per environment — development, staging, production
- Rotate keys periodically — especially after team member departures
- Never commit keys to version control — use environment variables or a secrets manager
- Set key-level permissions — restrict keys to specific models or features when possible
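A minimal sketch of the environment-variable approach: load the key at startup and fail fast if it is missing (the helper name is ours, not part of the SDK):

```python
import os

def load_api_key() -> str:
    """Read the API key from the environment; fail fast if it's missing."""
    key = os.environ.get("ANTHROPIC_API_KEY")
    if not key:
        raise RuntimeError("ANTHROPIC_API_KEY is not set")
    return key
```

Failing at startup beats failing on the first request in production, where a missing key can masquerade as an intermittent 401.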
SDK Authentication
The official SDKs read the ANTHROPIC_API_KEY environment variable by default:
```python
# Python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY automatically

# Or pass explicitly
client = anthropic.Anthropic(api_key="sk-ant-...")
```

```typescript
// TypeScript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY automatically
```
Authentication Errors
| Error | Meaning | Fix |
|---|---|---|
| 401 Unauthorized | Invalid or missing API key | Check key is correct and not expired |
| 403 Forbidden | Key lacks permission for this operation | Check key permissions in Console |
These indicate configuration issues, not service availability problems.
Rate Limits
The Claude API enforces three types of rate limits:
- Requests per minute (RPM) — total API calls
- Input tokens per minute (ITPM) — tokens sent to Claude
- Output tokens per minute (OTPM) — tokens generated by Claude
Rate Limit Tiers
Limits increase as you spend more:
| Tier | Requirement | RPM (Sonnet) | ITPM |
|---|---|---|---|
| Tier 1 | $5 credit purchase | 50 | 30,000 |
| Tier 2 | $40 credit purchase | 1,000 | 80,000 |
| Tier 3 | $200 credit purchase | 2,000 | 400,000 |
| Tier 4 | $400 credit purchase | 4,000 | 2,000,000 |
Exact limits vary by model — Opus has lower RPM than Sonnet at the same tier.
Handling 429 Errors
When you exceed a limit, the API returns a 429 Too Many Requests response with:
- `error.type`: `"rate_limit_error"`
- `error.message`: describes which limit you hit
- `retry-after` header: seconds to wait before retrying
```python
import time
import anthropic

client = anthropic.Anthropic()

def call_with_retry(messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.messages.create(
                model="claude-sonnet-4-6-20260319",
                max_tokens=1024,
                messages=messages,
            )
        except anthropic.RateLimitError as e:
            if attempt == max_retries - 1:
                raise
            retry_after = int(e.response.headers.get("retry-after", 60))
            time.sleep(retry_after)
```
Rate Limit Best Practices
Implement exponential backoff. Don't hammer the API after a 429 — wait the retry-after duration, then gradually increase wait times on subsequent failures.
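One way to sketch exponential backoff with full jitter (the function and its defaults are illustrative, not an SDK API):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Delay before retry `attempt` (0-indexed): random in [0, min(cap, base * 2**attempt)].

    The random jitter spreads retries out so many clients don't retry in lockstep.
    """
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

When the response carries a `retry-after` header, a reasonable policy is to wait the larger of that value and `backoff_delay(attempt)`.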
Use request queuing. For high-throughput applications, maintain a queue that dispatches requests at a rate below your limit rather than bursting and hitting 429s.
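A minimal pacing sketch along those lines, spacing dispatches to stay under an RPM budget (a real deployment would likely use a proper queue worker or token-bucket library):

```python
import time

class RequestPacer:
    """Spaces out dispatches so the call rate stays below `rpm` requests per minute."""

    def __init__(self, rpm: int):
        self.interval = 60.0 / rpm
        self.next_allowed = 0.0  # monotonic timestamp of the next permitted dispatch

    def acquire(self) -> None:
        """Block until the next request is allowed to go out."""
        now = time.monotonic()
        if now < self.next_allowed:
            time.sleep(self.next_allowed - now)
        self.next_allowed = max(now, self.next_allowed) + self.interval
```

Each worker calls `pacer.acquire()` before `client.messages.create(...)`, trading a little latency for never bursting into a 429.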
Monitor usage proactively. The API returns rate limit headers on every response:
- `anthropic-ratelimit-requests-limit`
- `anthropic-ratelimit-requests-remaining`
- `anthropic-ratelimit-requests-reset`
Track these to see how close you are to limits before you hit them.
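For example, a small helper that flags when the request budget is nearly exhausted, given a response's headers (the function and 10% threshold are ours; the header names are from the list above):

```python
def near_request_limit(headers: dict, threshold: float = 0.1) -> bool:
    """True when fewer than `threshold` of this window's requests remain."""
    limit = int(headers["anthropic-ratelimit-requests-limit"])
    remaining = int(headers["anthropic-ratelimit-requests-remaining"])
    return remaining / limit < threshold
```

In the Python SDK, response headers are reachable through the raw-response accessor, e.g. `client.messages.with_raw_response.create(...)`, whose result exposes `.headers` alongside the parsed message.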
Making API Requests
The Messages API
All Claude interactions use the Messages API:
```python
message = client.messages.create(
    model="claude-sonnet-4-6-20260319",
    max_tokens=1024,
    system="You are a helpful code reviewer.",
    messages=[
        {"role": "user", "content": "Review this function for bugs: ..."},
    ],
)
print(message.content[0].text)
```
Streaming
For real-time applications, use streaming to get tokens as they're generated:
```python
with client.messages.stream(
    model="claude-sonnet-4-6-20260319",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain microservices architecture"}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
```
Streaming reduces perceived latency — users see output immediately rather than waiting for the full response.
Multi-Turn Conversations
Maintain conversation context by passing the full message history:
```python
messages = [
    {"role": "user", "content": "What's the time complexity of quicksort?"},
    {"role": "assistant", "content": "Quicksort has an average time complexity of O(n log n)..."},
    {"role": "user", "content": "What about the worst case? When does it happen?"},
]
response = client.messages.create(
    model="claude-sonnet-4-6-20260319",
    max_tokens=1024,
    messages=messages,
)
```
Tool Use (Function Calling)
Claude can call functions you define, enabling it to interact with external systems:
```python
tools = [
    {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "input_schema": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City and state"},
            },
            "required": ["location"],
        },
    }
]

message = client.messages.create(
    model="claude-sonnet-4-6-20260319",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What's the weather in San Francisco?"}],
)

# Check if Claude wants to use a tool
for block in message.content:
    if block.type == "tool_use":
        # Execute the function and return results
        tool_result = get_weather(block.input["location"])
        # Send the result back to Claude for final response
```
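Sending the result back means adding a `tool_result` block in a follow-up user turn, referencing the tool call's `id`. A sketch of constructing that turn (the helper is ours; the block shape follows the tool-use docs):

```python
def tool_result_turn(tool_use_id: str, result: str) -> dict:
    """Build the user turn that feeds a tool's output back to Claude."""
    return {
        "role": "user",
        "content": [
            {
                "type": "tool_result",
                "tool_use_id": tool_use_id,
                "content": result,
            }
        ],
    }
```

The follow-up request then carries the original user turn, the assistant turn containing the `tool_use` block (`{"role": "assistant", "content": message.content}`), and this `tool_result` turn; Claude replies with its final answer.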
Error Handling
Build resilient applications by handling the full error spectrum:
```python
import anthropic

try:
    message = client.messages.create(...)
except anthropic.AuthenticationError:
    # 401 — bad API key
    log.error("Invalid API key")
except anthropic.PermissionDeniedError:
    # 403 — insufficient permissions
    log.error("API key lacks required permissions")
except anthropic.RateLimitError:
    # 429 — rate limited, retry with backoff
    log.warning("Rate limited, retrying...")
except anthropic.APIStatusError as e:
    # 500, 529 — server error or overloaded
    log.error(f"API error: {e.status_code}")
except anthropic.APIConnectionError:
    # Network error
    log.error("Could not connect to Anthropic API")
```
Retryable vs. Non-Retryable Errors
| Error | Retryable? | Action |
|---|---|---|
| 401, 403 | No | Fix authentication |
| 400 | No | Fix request format |
| 429 | Yes | Wait and retry |
| 500 | Yes | Retry with backoff |
| 529 (Overloaded) | Yes | Retry with longer backoff |
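The table encodes directly into a predicate that generic retry logic can gate on (the status sets are taken from the table above):

```python
RETRYABLE_STATUSES = {429, 500, 529}

def should_retry(status_code: int) -> bool:
    """Retry rate limits and server-side failures; client errors need a code fix."""
    return status_code in RETRYABLE_STATUSES
```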
Scaling for Production
Request Batching
For non-time-sensitive workloads, the Message Batches API processes requests asynchronously at 50% of standard pricing:
```python
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": "request-1",
            "params": {
                "model": "claude-sonnet-4-6-20260319",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": "Summarize this article: ..."}],
            },
        },
        # ... hundreds or thousands more requests
    ]
)
```
Batch requests don't count against standard rate limits and can contain thousands of items. Results are available within 24 hours.
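Because results arrive asynchronously, a poll loop is the usual pattern. A generic sketch, assuming (per the batches docs) that `client.messages.batches.retrieve(batch_id)` returns an object whose `processing_status` becomes `"ended"`; the fetch function is injected here so the loop itself is testable:

```python
import time

def wait_until_ended(fetch_status, poll_seconds: float = 60.0, max_polls: int = 1440) -> bool:
    """Poll `fetch_status()` until it reports "ended"; return False on timeout.

    In practice, fetch_status would wrap
    client.messages.batches.retrieve(batch_id).processing_status.
    """
    for _ in range(max_polls):
        if fetch_status() == "ended":
            return True
        time.sleep(poll_seconds)
    return False
```

Once the batch has ended, iterate `client.messages.batches.results(batch_id)` and dispatch on each entry's `custom_id` to match results back to your original requests.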
Prompt Caching
Reduce costs on repeated context by caching static prompt components:
```python
message = client.messages.create(
    model="claude-sonnet-4-6-20260319",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a legal document analyst...[large system prompt]...",
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Analyze this contract clause: ..."}],
)
```
Cache hits cost 90% less than standard input tokens. The cache lasts 5 minutes by default (1.25x write cost) or 1 hour (2x write cost) — either pays for itself after 1-2 cache reads.
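The break-even math is easy to check. Measuring input-token cost in multiples of the prompt's uncached price, with multipliers from the paragraph above (the helper itself is just illustrative arithmetic):

```python
def total_input_cost(calls: int, write_mult: float = 1.0, read_mult: float = 1.0) -> float:
    """Cost of `calls` requests sharing one prompt: one write, then cache reads.

    With no cache both multipliers are 1.0, so the total is simply `calls`.
    """
    return write_mult + (calls - 1) * read_mult
```

With the 5-minute cache (1.25x write, 0.1x reads), two calls cost 1.35 versus 2.0 uncached, so a single cache read already pays for the write; the 1-hour cache (2x write) breaks even on the second read.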
Model Routing
Use different models for different tasks to optimize cost and latency:
```python
def route_to_model(task_type: str) -> str:
    if task_type in ("classification", "extraction", "routing"):
        return "claude-haiku-4-5-20251001"  # fast, cheap
    elif task_type in ("coding", "analysis", "writing"):
        return "claude-sonnet-4-6-20260319"  # balanced
    elif task_type in ("complex_reasoning", "architecture"):
        return "claude-opus-4-6-20260319"  # most capable
    return "claude-sonnet-4-6-20260319"  # sensible default for unrecognized tasks
```
Related Resources
- Prompt Engineering for Claude — writing effective prompts
- Cost Optimization: Batch Processing and Prompt Caching — reducing API costs
- Claude Vision Guide — image analysis via the API
- Data Processing and Analysis with Claude — building data pipelines