Unstructured

Unstructured is a data processing library and API for extracting content from documents like PDFs, Word files, HTML pages, images, and more. It converts unstructured documents into clean, structured data that can be used in AI applications, RAG pipelines, and search systems.

Why Unstructured?

Document parsing: Extract text, tables, and metadata from PDFs, DOCX, PPTX, and more
Multiple formats: Supports 20+ file types including images with OCR
Chunking: Automatically splits documents into semantically meaningful chunks
RAG-ready: Output is optimized for embedding and retrieval in AI applications
API and library: Use it as a Python library or through the hosted API service
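
To make the chunking idea concrete, here is a minimal hand-rolled sketch of title-based chunking. It does not call Unstructured's own chunker. The `DocElement` shape, the `chunkByTitle` helper, and the `maxChars` limit are all hypothetical names introduced for illustration, and the element types used (`Title`, `NarrativeText`) are assumed to match what the service returns.

```typescript
// Hypothetical, simplified element shape; a real response carries more fields.
interface DocElement {
  type: string; // assumed values such as "Title" or "NarrativeText"
  text: string;
}

// Start a new chunk at each Title, or when adding an element
// would push the current chunk past maxChars.
function chunkByTitle(elements: DocElement[], maxChars = 500): string[] {
  const chunks: string[] = [];
  let current: string[] = [];
  let size = 0;

  // Close the chunk being built, if it has any content.
  const flush = () => {
    if (current.length > 0) {
      chunks.push(current.join('\n'));
      current = [];
      size = 0;
    }
  };

  for (const el of elements) {
    if (el.type === 'Title' || size + el.text.length > maxChars) flush();
    current.push(el.text);
    size += el.text.length;
  }
  flush();
  return chunks;
}

// Illustrative sample data, not real API output.
const sampleElements: DocElement[] = [
  { type: 'Title', text: 'Introduction' },
  { type: 'NarrativeText', text: 'Unstructured parses documents.' },
  { type: 'Title', text: 'Usage' },
  { type: 'NarrativeText', text: 'Send a file to the API.' },
];

const sampleChunks = chunkByTitle(sampleElements);
console.log(sampleChunks.length); // 2
```

Each chunk keeps a heading together with the text under it, which tends to give embeddings more coherent context than fixed-size splits.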

Use Cases

RAG pipelines: Process documents for retrieval-augmented generation
Document search: Index documents for semantic search
Data extraction: Pull structured data from invoices, reports, and forms
Knowledge bases: Build searchable knowledge bases from document collections
Compliance: Extract and analyze content from legal and regulatory documents

Getting Started (API)

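// Build the multipart body. `file` is a File or Blob you already have (e.g. from an upload).
// The field name 'files' is an assumption; check the API reference for your version.
const formData = new FormData();
formData.append('files', file);

const response = await fetch('https://api.unstructured.io/general/v0/general', {
  method: 'POST',
  headers: {
    'unstructured-api-key': process.env.UNSTRUCTURED_API_KEY!,
  },
  body: formData,
});

// Fail early on authentication or processing errors instead of parsing an error body.
if (!response.ok) {
  throw new Error(`Unstructured API request failed: ${response.status}`);
}

const elements = await response.json();
// elements is an array of extracted content blocks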
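
As a rough illustration of working with the returned array, the sketch below groups element text by page. The `UnstructuredElement` interface is an assumed subset of the response shape (`type`, `element_id`, `text`, and a `metadata` object with optional `page_number`), and `textByPage` is a hypothetical helper, so verify the field names against an actual response before relying on them.

```typescript
// Assumed subset of an element in the JSON response; not an official type.
interface UnstructuredElement {
  type: string;
  element_id: string;
  text: string;
  metadata: { filename?: string; page_number?: number };
}

// Collect element text per page, preserving document order within each page.
function textByPage(elements: UnstructuredElement[]): Map<number, string> {
  const buckets = new Map<number, string[]>();
  for (const el of elements) {
    const page = el.metadata.page_number ?? 1; // fall back to page 1 when missing
    const parts = buckets.get(page) ?? [];
    parts.push(el.text);
    buckets.set(page, parts);
  }

  const result = new Map<number, string>();
  buckets.forEach((parts, page) => result.set(page, parts.join('\n')));
  return result;
}

// Illustrative sample data, not real API output.
const sampleResponse: UnstructuredElement[] = [
  { type: 'Title', element_id: 'a1', text: 'Invoice', metadata: { page_number: 1 } },
  { type: 'NarrativeText', element_id: 'a2', text: 'Total due: 120 EUR', metadata: { page_number: 1 } },
  { type: 'NarrativeText', element_id: 'a3', text: 'Payment terms', metadata: { page_number: 2 } },
];

const pageText = textByPage(sampleResponse);
console.log(pageText.get(1)); // "Invoice\nTotal due: 120 EUR"
```

Keeping the page number alongside the text makes it possible to cite the source page when the content is later used for search or retrieval.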

Integration with AI

import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';

// After extracting document content with Unstructured:
const documentText = elements
  .map((el: { text: string }) => el.text)
  .join('\n');

const { text } = await generateText({
  model: openai('gpt-4-turbo'),
  prompt: `Summarize this document:\n\n${documentText}`,
});
