Data Processing and Analysis with Claude

Claude handles data transformation, cleaning, analysis, and insight generation through natural language instructions. Instead of writing one-off scripts for each data task, you describe what you need and Claude writes the code, processes the data, or does the analysis directly — depending on whether you're working through the API, Claude Code, or Claude.ai.

Where Claude Fits in a Data Pipeline

Claude isn't a replacement for your database or data warehouse. It excels at the messy, judgment-heavy parts of data work that are tedious to automate with traditional code:

Data cleaning — handling inconsistent formats, fixing encoding issues, normalizing messy text fields
Schema inference — figuring out what columns mean in an undocumented dataset
Natural language to SQL — converting business questions to queries
Data enrichment — categorizing, tagging, or extracting structured data from unstructured text
Analysis and reporting — generating summaries, finding patterns, creating visualizations

Data Processing with Claude Code

Claude Code is the most hands-on way to do data work with Claude. It runs in your terminal with access to your files and can execute Python, SQL, and shell commands.

Exploring Unfamiliar Data

Point Claude at a dataset and let it figure out the structure:

> I have a CSV at data/customer_export.csv that came from our CRM.
  Read the first 100 rows, describe the schema, identify data quality
  issues, and suggest cleanup steps.

Claude reads the file, examines column types, checks for nulls and duplicates, identifies formatting inconsistencies, and reports back with specific findings.

Cleaning and Transformation

Describe the transformation in plain English:

> Clean data/raw_transactions.csv:
  - Normalize all date columns to ISO 8601 format
  - Strip whitespace from string fields
  - Convert currency strings ("$1,234.56") to floats
  - Deduplicate rows based on transaction_id, keeping the most recent
  - Flag rows with missing required fields (amount, date, customer_id)
  Write the cleaned output to data/clean_transactions.csv and a report
  of what was fixed to data/cleaning_report.md

Claude writes a Python script, runs it against your data, and produces both the cleaned output and a report of every change made.

SQL Generation

Ask business questions and get working queries:

> Given the schema in migrations/001_create_tables.sql, write a query
  that shows: monthly revenue by product category for the last 12 months,
  with month-over-month growth rate, excluding refunded orders.
  Optimize for PostgreSQL.

Claude reads your schema file, writes the query with proper joins and window functions, and can run it against your database if you have a connection configured.

Data Processing via the API

For automated data pipelines, use the Claude API to process data programmatically.

Text Classification

Classify documents, tickets, or any text into categories:

import anthropic
import csv

client = anthropic.Anthropic()

def classify_ticket(text: str) -> dict:
    message = client.messages.create(
        model="claude-haiku-4-5-20251001",  # Fast and cheap for classification
        max_tokens=256,
        system="Classify support tickets. Return JSON: {category, priority, sentiment}",
        messages=[{"role": "user", "content": text}],
    )
    return json.loads(message.content[0].text)

# Process a batch of tickets
with open("tickets.csv") as f:
    reader = csv.DictReader(f)
    for row in reader:
        result = classify_ticket(row["description"])
        print(f"Ticket {row['id']}: {result}")

For high volume, use the Batch API for 50% cost savings:

batch_requests = [
    {
        "custom_id": f"ticket-{row['id']}",
        "params": {
            "model": "claude-haiku-4-5-20251001",
            "max_tokens": 256,
            "system": "Classify support tickets. Return JSON: {category, priority, sentiment}",
            "messages": [{"role": "user", "content": row["description"]}],
        },
    }
    for row in tickets
]

batch = client.messages.batches.create(requests=batch_requests)

Data Extraction from Unstructured Text

Pull structured data from documents, emails, or web content:

EXTRACTION_PROMPT = """Extract the following fields from this job posting.
Return as JSON:
{
  "company": string,
  "title": string,
  "location": string | "remote",
  "salary_min": number | null,
  "salary_max": number | null,
  "experience_years": number | null,
  "skills": string[],
  "employment_type": "full-time" | "part-time" | "contract"
}

Job posting:
{text}"""

def extract_job_data(posting_text: str) -> dict:
    message = client.messages.create(
        model="claude-sonnet-4-6-20260319",
        max_tokens=512,
        messages=[{"role": "user", "content": EXTRACTION_PROMPT.format(text=posting_text)}],
    )
    return json.loads(message.content[0].text)

Data Quality Validation

Use Claude to catch errors that rule-based validation misses:

VALIDATION_PROMPT = """Review this customer record for data quality issues.
Check for:
- Name: plausible human name (flag things like "TEST USER" or "asdf")
- Email: valid format AND plausible domain (flag disposable email services)
- Phone: valid format for the stated country
- Address: all required fields present, zip code matches state

Return JSON:
{
  "valid": boolean,
  "issues": [{"field": string, "issue": string, "severity": "error" | "warning"}]
}

Record: {record}"""

Working with pandas

Claude Code integrates well with Python data workflows. Ask it to write pandas code and run it:

> Load data/sales_2025.csv into pandas. Create a summary that shows:
  - Total revenue by region
  - Top 10 products by units sold
  - Month with highest growth rate
  - Any anomalies in the data (unusual spikes, missing periods)

  Save the summary to data/sales_analysis.md and create a matplotlib
  chart of monthly revenue by region saved to charts/revenue_by_region.png

Claude writes the pandas code, executes it, generates the report and chart, and saves both to your filesystem.

Pivot Tables and Aggregation

> I need a pivot table from data/marketing_spend.csv showing:
  - Rows: channel (Google Ads, Facebook, LinkedIn, etc.)
  - Columns: month
  - Values: spend, conversions, CPA (spend/conversions)
  - Include row totals and column totals

  Export as both CSV and a formatted markdown table.

CSV and JSON Processing Patterns

Large File Processing

For files too large to send in a single API call, process in chunks:

import csv
import json

CHUNK_SIZE = 50  # rows per API call

def process_large_csv(filepath: str):
    results = []
    with open(filepath) as f:
        reader = csv.DictReader(f)
        chunk = []
        for row in reader:
            chunk.append(row)
            if len(chunk) >= CHUNK_SIZE:
                batch_result = process_chunk(chunk)
                results.extend(batch_result)
                chunk = []
        if chunk:  # Process remaining rows
            results.extend(process_chunk(chunk))
    return results

def process_chunk(rows: list[dict]) -> list[dict]:
    message = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": f"Process these records and return enriched JSON:\n{json.dumps(rows)}"
        }],
    )
    return json.loads(message.content[0].text)

Schema Mapping

Map data between different schemas — useful for integrations and migrations:

> I have two CSVs:
  - data/old_format.csv (our legacy system export)
  - data/new_schema.json (the target schema for our new system)

  Create a Python script that reads the old format, maps fields to
  the new schema, handles any data transformations needed (date formats,
  enum values, nested objects), and writes the output as newline-delimited
  JSON to data/migrated.jsonl

Cost Considerations for Data Workloads

Data processing can involve thousands of API calls. Use these strategies to manage costs:

Use Haiku for simple tasks — classification, extraction, and validation work well with the cheapest model
Batch non-urgent work — the Batch API gives you 50% off everything
Cache repeated context — if every request includes the same schema or examples, cache them
Process in chunks — send multiple records per request to amortize the per-request overhead
Filter before sending — pre-filter data with traditional code to only send records that need AI processing

A classification pipeline processing 100,000 records with Haiku via the Batch API costs roughly:

100,000 × 500 input tokens × $0.40/MTok + 100,000 × 100 output tokens × $2/MTok = $40

Related Resources

Cost Optimization: Batch Processing and Prompt Caching — detailed cost reduction strategies
Claude API Integration Guide — API setup and authentication
Claude Vision Guide — processing images and documents
Prompt Engineering for Claude — writing effective data processing prompts

Data Processing and Analysis with Claude

Data Processing and Analysis with Claude

Where Claude Fits in a Data Pipeline

Data Processing with Claude Code

Exploring Unfamiliar Data

Cleaning and Transformation

SQL Generation

Data Processing via the API

Text Classification

Data Extraction from Unstructured Text

Data Quality Validation

Working with pandas

Pivot Tables and Aggregation

CSV and JSON Processing Patterns

Large File Processing

Schema Mapping

Cost Considerations for Data Workloads

Related Resources

Related Topics in Backend & Data