Extraction Flow

Complete Extraction Workflow

Understanding the full extraction lifecycle helps you build robust automation pipelines. Here’s how documents flow through the system:

Phase 1: Document Upload

Upload Files

Submit documents via POST /upload. A single request can include up to 100 files.

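import requests

# Open both files in a context manager so the handles are closed after upload
with open("invoice1.pdf", "rb") as f1, open("invoice2.pdf", "rb") as f2:
    response = requests.post(
        "https://api.documind.com/api/v1/upload",
        headers={"X-API-Key": "your_api_key"},
        files=[("files", f1), ("files", f2)],
    )
response.raise_for_status()  # Stop here on 4xx/5xx responses
document_ids = response.json()
# Returns: ["uuid1", "uuid2"]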

Credit Impact: No credits charged for upload
Storage: Documents stored for 30 days
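Because a single request accepts at most 100 files, larger sets have to be split before uploading. The following is a minimal sketch of that splitting step only; `chunk_files` and the sample file names are hypothetical and not part of the Documind API, and each resulting batch would still be sent with its own POST /upload call.

```python
from typing import Iterator, List, Sequence

MAX_FILES_PER_UPLOAD = 100  # batch limit stated in the upload step above


def chunk_files(
    paths: Sequence[str], size: int = MAX_FILES_PER_UPLOAD
) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` file paths."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(paths), size):
        yield list(paths[start:start + size])


# 250 hypothetical file names are split into three batches
paths = [f"invoice{i}.pdf" for i in range(250)]
batch_sizes = [len(batch) for batch in chunk_files(paths)]
```

Keeping the batching separate from the HTTP call makes it easy to retry or log one batch at a time.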

Receive Document IDs

Store the returned UUIDs for extraction requests.

# Map filenames to IDs for tracking
doc_mapping = {
    "invoice1.pdf": document_ids[0],
    "invoice2.pdf": document_ids[1]
}
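For larger batches, the mapping can be built with `zip` instead of by index. This sketch assumes, as the indexed example above does, that IDs are returned in the same order the files were submitted; `map_documents` is a hypothetical helper, not an API function.

```python
from typing import Dict, List


def map_documents(filenames: List[str], document_ids: List[str]) -> Dict[str, str]:
    """Pair each uploaded filename with the ID at the same position."""
    if len(filenames) != len(document_ids):
        # A length mismatch means the positional pairing cannot be trusted
        raise ValueError(
            f"expected {len(filenames)} IDs, got {len(document_ids)}"
        )
    return dict(zip(filenames, document_ids))


doc_mapping = map_documents(["invoice1.pdf", "invoice2.pdf"], ["uuid1", "uuid2"])
```

The length check guards against silently mis-pairing files if an upload is only partially accepted.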

Phase 2: Schema Definition

Choose one of three approaches:

Predefined Schema (UI Only)
Custom Schema
Generated Schema

Predefined Schema (UI Only)

Use the Documind UI to access built-in templates for common document types such as invoices, receipts, and forms.

Navigate to: Dashboard → Schemas → Templates

✓ Fastest setup
✓ Proven accuracy
✗ Not available via API (use UI or define custom schema)

Custom Schema

Define your own schema for unique documents:

schema = {
    "named_entities": {
        "policy_number": {
            "type": "string",
            "description": "Insurance policy number"
        },
        "coverage_amount": {
            "type": "number",
            "description": "Total coverage in USD"
        }
    },
    "required": ["policy_number"]
}

✓ Fully customizable
✓ Matches your exact needs
✗ Requires domain knowledge
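A quick local check can catch schema mistakes before an extraction request is sent. The sketch below is only an illustration: `check_schema` is a hypothetical helper, and its rules are inferred from the shape of the example schema above (a `named_entities` mapping with a `type` per field, plus a `required` list), not from the API's own validation.

```python
from typing import Dict, List


def check_schema(schema: Dict) -> List[str]:
    """Return a list of problems found in a schema dict (empty if none)."""
    problems = []
    entities = schema.get("named_entities", {})
    if not entities:
        problems.append("named_entities is missing or empty")
    for name, spec in entities.items():
        if "type" not in spec:
            problems.append(f"{name}: missing 'type'")
    # Every required field should also be defined under named_entities
    for name in schema.get("required", []):
        if name not in entities:
            problems.append(f"required field '{name}' is not defined")
    return problems


good = {
    "named_entities": {"policy_number": {"type": "string"}},
    "required": ["policy_number"],
}
bad = {
    "named_entities": {"policy_number": {"description": "no type given"}},
    "required": ["coverage_amount"],
}
```

Here `check_schema(good)` returns an empty list, while `check_schema(bad)` reports both the missing type and the undefined required field.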

Generated Schema

Auto-generate a schema from a sample document or a description:

Option 1: Generate from Sample Document (Recommended)

import requests

# Upload sample document
with open("sample_policy.pdf", "rb") as f:
    response = requests.post(
        "https://api.documind.com/api/v1/upload",
        headers={"X-API-Key": "your_api_key"},
        files={"files": f}
    )
sample_id = response.json()[0]

# Generate schema from sample
response = requests.post(
    f"https://api.documind.com/api/v1/schema/{sample_id}",
    headers={"X-API-Key": "your_api_key"}
)
schema = response.json()["schema"]

Option 2: Generate from Description

response = requests.post(
    "https://api.documind.com/api/v1/schema/generate-dynamic-schema/",
    headers={"X-API-Key": "your_api_key", "Content-Type": "application/json"},
    json={
        "schema_name": "insurance_policy",
        "schema_description": "Extract policy number, coverage amount, policyholder name, and effective dates"
    }
)
schema = response.json()

Both methods are available in the UI (Dashboard → Schemas → Generate) and via the API.

✓ Quick start
✓ AI-generated from your documents
✗ May need refinement

Phase 3: Data Extraction

Configure extraction mode based on requirements:

Mode Selection Decision Tree

Start
  ├─ Need confidence scores? ──No──> Basic Extraction
  │                                   (2-6 credits/page)
  │
  └─ Yes
      ├─ Scanned/Image documents? ──Yes──> VLM Extraction
      │                                     (10 credits/page)
      │
      └─ No (Native PDF/Text)
          └─ Critical accuracy needed? ──Yes──> Advanced Extraction
                                                  (15 credits/page)
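The decision tree can be expressed as a small function. This is a sketch, not part of the API: `choose_mode` and its parameter names are hypothetical, and because the tree does not say what to pick for a native document that needs confidence scores but not critical accuracy, that case raises an error rather than guessing.

```python
def choose_mode(need_confidence: bool, scanned: bool, critical_accuracy: bool) -> str:
    """Walk the decision tree above and return 'basic', 'vlm' or 'advanced'."""
    if not need_confidence:
        return "basic"      # 2-6 credits/page
    if scanned:
        return "vlm"        # 10 credits/page
    if critical_accuracy:
        return "advanced"   # 15 credits/page
    # Branch not covered by the tree: native document, confidence scores
    # wanted, accuracy not critical
    raise ValueError("the decision tree does not cover this combination")


mode_for_scans = choose_mode(need_confidence=True, scanned=True, critical_accuracy=False)
```

The returned name matches the `mode` values used in the complete example later on this page.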

Basic Extraction Example

import requests

response = requests.post(
    f"https://api.documind.com/api/v1/extract/{document_id}",
    headers={"X-API-Key": "your_api_key", "Content-Type": "application/json"},
    json={
        "schema": schema,
        "model": "openai-gpt-4.1",  # or "google-gemini-2.0-flash"
        "prompt": "Extract all invoice fields accurately"
    }
)
result = response.json()

# Basic mode returns no confidence scores and never flags a document for
# review, so needs_review is always False and the results are used directly.
process_data(result["results"])  # process_data is your own handler

Advanced Extraction Example

import requests

response = requests.post(
    f"https://api.documind.com/api/v1/extract/{document_id}",
    headers={"X-API-Key": "your_api_key", "Content-Type": "application/json"},
    json={
        "schema": schema,
        # Advanced mode - no model or extraction_mode specified
        "review_threshold": 85,
        "prompt": "Extract invoice with high accuracy"
    }
)
result = response.json()

# Includes confidence scores
if result["needs_review"]:
    # Some required fields below threshold
    print("⚠️  Needs human review")
    # Proceed to Phase 4
else:
    # All required fields above threshold
    process_data(result["results"])

Credit Usage: Credits deducted = pages processed × the per-page cost of the selected mode or model
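As a worked example of the pages × cost rule, the sketch below estimates credits from the per-page figures shown in the decision tree. `estimate_credits` is a hypothetical helper; the rates are copied from this page and actual billing may differ. Basic mode is returned as a range because its per-page cost (2-6) depends on the model.

```python
from typing import Dict, Tuple

# (lowest, highest) credits per page, taken from the decision tree above
CREDITS_PER_PAGE: Dict[str, Tuple[int, int]] = {
    "basic": (2, 6),
    "vlm": (10, 10),
    "advanced": (15, 15),
}


def estimate_credits(pages: int, mode: str) -> Tuple[int, int]:
    """Return the (minimum, maximum) credits for `pages` pages in `mode`."""
    if pages < 0:
        raise ValueError("pages must not be negative")
    low, high = CREDITS_PER_PAGE[mode]
    return pages * low, pages * high


basic_estimate = estimate_credits(12, "basic")        # (24, 72)
advanced_estimate = estimate_credits(12, "advanced")  # (180, 180)
```

For a 12-page document, Basic mode costs between 24 and 72 credits, while Advanced mode costs 180.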

Phase 4: Review Workflow

This phase runs only when the extraction response has needs_review = true:

Identify Flagged Fields

Parse the metadata to find low-confidence fields:

for field, flag in result["needs_review_metadata"]["review_flags"].items():
    if flag:
        confidence = result["needs_review_metadata"]["confidence_scores"][field]
        print(f"⚠️  {field}: {confidence}% confidence")
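The same lookup can be wrapped in a function that returns the flagged fields, lowest confidence first, so reviewers see the weakest values at the top. This is a sketch: `flagged_fields` is a hypothetical helper and `sample_result` is made-up data that only mirrors the `needs_review_metadata` keys used above.

```python
from typing import Dict, List, Optional, Tuple


def flagged_fields(result: Dict) -> List[Tuple[str, Optional[float]]]:
    """Return (field, confidence) pairs for flagged fields, lowest first."""
    meta = result.get("needs_review_metadata") or {}
    flags = meta.get("review_flags", {})
    scores = meta.get("confidence_scores", {})
    flagged = [(field, scores.get(field)) for field, flag in flags.items() if flag]
    # Fields without a score sort last instead of raising a TypeError
    return sorted(flagged, key=lambda item: (item[1] is None, item[1] or 0))


# Hypothetical response fragment for illustration only
sample_result = {
    "needs_review": True,
    "needs_review_metadata": {
        "review_flags": {"invoice_number": False, "total_amount": True, "vendor_name": True},
        "confidence_scores": {"invoice_number": 97, "total_amount": 62, "vendor_name": 48},
    },
}
to_review = flagged_fields(sample_result)
```

With the sample data, `to_review` lists `vendor_name` (48) before `total_amount` (62).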

Notify Review Team

Human review happens in the Documind UI. Direct your review team to:

Dashboard → Review Queue

They can see all pending reviews, view extraction confidence scores, and correct or approve results.

Optionally, send notifications via your own system:

# Your custom notification logic
send_email(
    to="reviewers@example.com",  # placeholder address
    subject=f"Review Needed: {filename}",
    body=f"Document {document_id} needs review at https://app.documind.com/review"
)

Poll for Completion

Implement polling to detect when is_reviewed = true:

import requests
import time

def poll_for_review(document_id, poll_interval=10, timeout=600):
    start_time = time.time()
    
    while (time.time() - start_time) < timeout:
        response = requests.get(
            "https://api.documind.com/api/v1/data/extractions",
            headers={"X-API-Key": "your_api_key"},
            params={"document_id": document_id, "limit": 1}
        )
        response.raise_for_status()  # Surface HTTP errors instead of parsing them
        data = response.json()
        
        if data["items"] and data["items"][0]["is_reviewed"]:
            return data["items"][0]["reviewed_results"]
        
        time.sleep(poll_interval)
    
    return None

reviewed_data = poll_for_review(document_id)
if reviewed_data:
    process_data(reviewed_data)
else:
    handle_timeout(document_id)

See Polling Pattern for details.
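Human review can take a while, so a fixed 10-second interval may generate many unnecessary requests. One common variation, sketched here under assumptions, doubles the wait after each check up to a cap. `poll_until_reviewed` is a hypothetical helper; `fetch` stands for any callable that returns the latest extraction record (for example, a wrapper around the GET /data/extractions call above), and `sleep` and `clock` are injectable so the loop can be tested without waiting.

```python
import time
from typing import Callable, Optional


def poll_until_reviewed(
    fetch: Callable[[], Optional[dict]],
    timeout: float = 600,
    initial_interval: float = 10,
    max_interval: float = 60,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[dict]:
    """Return reviewed_results once is_reviewed is true, or None on timeout."""
    deadline = clock() + timeout
    interval = initial_interval
    while True:
        extraction = fetch()
        if extraction and extraction.get("is_reviewed"):
            return extraction.get("reviewed_results")
        remaining = deadline - clock()
        if remaining <= 0:
            return None
        # Never sleep past the deadline; double the wait up to max_interval
        sleep(min(interval, remaining))
        interval = min(interval * 2, max_interval)
```

Using `time.monotonic` for the deadline keeps the timeout unaffected by system clock adjustments.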

Use Reviewed Results

Once is_reviewed = true, use reviewed_results instead of results:

import requests

response = requests.get(
    "https://api.documind.com/api/v1/data/extractions",
    headers={"X-API-Key": "your_api_key"},
    params={"document_id": document_id, "limit": 1}
)
extraction = response.json()["items"][0]

if extraction["is_reviewed"]:
    # Use human-corrected data
    data = extraction["reviewed_results"]
else:
    # Use original AI extraction
    data = extraction["results"]

Phase 5: Data Processing

Process the final data in your automation:

def process_invoice_data(data):
    """Process extracted/reviewed invoice data."""
    
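    # Validate required fields (explicit check; assert is skipped under python -O)
    missing = [k for k in ("invoice_number", "total_amount") if k not in data]
    if missing:
        raise ValueError(f"Missing required fields: {missing}")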
    
    # Update your system
    create_accounting_record(
        invoice_number=data["invoice_number"],
        amount=data["total_amount"],
        vendor=data.get("vendor_name"),
        line_items=data.get("line_items", [])
    )
    
    # Archive original document
    archive_document(data["document_id"])
    
    return True

Complete Example

Here’s a full workflow implementation with real API calls:

import requests
import time
from typing import Dict

class DocumindWorkflow:
    """Complete extraction workflow manager using Documind API."""
    
    def __init__(self, api_key: str):
        self.api_key = api_key
        self.base_url = "https://api.documind.com/api/v1"
        self.headers = {"X-API-Key": api_key}
    
    def process_document(
        self,
        file_path: str,
        schema: Dict,
        mode: str = "advanced"
    ) -> Dict:
        """
        Complete workflow: upload → extract → review → process
        
        Args:
            file_path: Path to document file
            schema: Extraction schema
            mode: 'basic', 'vlm', or 'advanced'
        
        Returns:
            Final extracted data (original or reviewed)
        """
        # Phase 1: Upload via API
        print(f"📤 Uploading {file_path}...")
        with open(file_path, "rb") as f:
            response = requests.post(
                f"{self.base_url}/upload",
                headers=self.headers,
                files={"files": f}
            )
        response.raise_for_status()
        doc_id = response.json()[0]
        print(f"✓ Uploaded: {doc_id}")
        
        # Phase 2 & 3: Extract via API
        print(f"🔍 Extracting data ({mode} mode)...")
        config = {"schema": schema, "prompt": "Extract all data accurately"}
        
        if mode == "basic":
            config["model"] = "openai-gpt-4.1"
        elif mode == "vlm":
            config["extraction_mode"] = "vlm"
            config["review_threshold"] = 80
        else:  # advanced
            config["review_threshold"] = 85
        
        response = requests.post(
            f"{self.base_url}/extract/{doc_id}",
            headers={**self.headers, "Content-Type": "application/json"},
            json=config
        )
        response.raise_for_status()
        result = response.json()
        print(f"✓ Extraction complete")
        
        # Phase 4: Handle review if needed
        if result["needs_review"]:
            print(f"⚠️  Document needs review - direct team to UI: https://app.documind.com/review")
            
            # Poll for review completion
            print(f"⏳ Waiting for human review...")
            start = time.time()
            
            while (time.time() - start) < 600:  # 10 min timeout
                response = requests.get(
                    f"{self.base_url}/data/extractions",
                    headers=self.headers,
                    params={"document_id": doc_id, "limit": 1}
                )
                response.raise_for_status()
                data = response.json()
                
                if data["items"] and data["items"][0]["is_reviewed"]:
                    print(f"✓ Review completed")
                    return data["items"][0]["reviewed_results"]
                
                time.sleep(10)
            
            raise TimeoutError("Review not completed in time")
        else:
            print(f"✓ No review needed")
            return result["results"]

# Usage
workflow = DocumindWorkflow(api_key="your_api_key")

schema = {
    "named_entities": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"}
    },
    "required": ["invoice_number", "total_amount"]
}

# Process document through complete workflow
try:
    data = workflow.process_document(
        file_path="invoice.pdf",
        schema=schema,
        mode="advanced"
    )
    
    print(f"✓ Final data: {data}")
    # Process your data here
    
except TimeoutError:
    print("Review took too long, escalating...")
except Exception as e:
    print(f"Error: {e}")

Troubleshooting

Upload Fails

Problem: 500 Internal Server Error on upload

Solutions:

Verify file is not corrupted
Check file size < 50MB
Ensure file format is supported
Retry with exponential backoff
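The last suggestion, retrying with exponential backoff, could look like the sketch below. `retry_with_backoff` is a hypothetical helper and not part of any Documind SDK; `operation` is any zero-argument callable that raises on failure, such as a function that posts to /upload and calls `raise_for_status()`. For brevity it retries on every exception, whereas production code would normally retry only on 5xx responses and connection errors.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, waiting 1s, 2s, 4s, ... (capped) between failures."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: let the last error propagate
            sleep(min(base_delay * 2 ** attempt, max_delay))
    raise AssertionError("unreachable")
```

Adding a small random jitter to each delay is a common refinement when many clients retry at once.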

Extraction Timeout

Problem: Extraction takes too long or times out

Solutions:

Switch to Basic mode for faster processing
Reduce document page count
Simplify schema (fewer fields)
Contact support if issue persists

All Extractions Need Review

Problem: Review threshold too strict

Solutions:

Lower review_threshold from 85 to 75
Mark fewer fields as required
Improve schema descriptions
Use Basic mode if reviews aren’t needed
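Before changing the threshold, it can help to replay confidence scores from past extractions and see which required fields would still be flagged. The sketch below assumes that a document is flagged when any required field scores strictly below `review_threshold`; that comparison rule, the helper name `fields_below_threshold`, and the sample scores are all assumptions for illustration.

```python
from typing import Dict, List


def fields_below_threshold(
    confidence_scores: Dict[str, float], required: List[str], threshold: float
) -> List[str]:
    """Return the required fields whose confidence is below `threshold`."""
    # A required field with no score at all is treated as confidence 0
    return [f for f in required if confidence_scores.get(f, 0) < threshold]


# Hypothetical scores from one earlier extraction
scores = {"invoice_number": 91, "total_amount": 78, "vendor_name": 60}
required = ["invoice_number", "total_amount"]

flagged_at_85 = fields_below_threshold(scores, required, 85)  # ["total_amount"]
flagged_at_75 = fields_below_threshold(scores, required, 75)  # []
```

In this sample, lowering the threshold from 85 to 75 clears the document, and `vendor_name` never triggers a review because it is not required.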

Reviews Never Complete

Problem: Polling times out waiting for review

Solutions:

Increase timeout to match your review SLA
Implement email notifications to reviewers
Check review queue isn’t backlogged
Consider async processing instead of blocking
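One way to avoid blocking, sketched here under assumptions, is to record documents awaiting review and check them all in a periodic sweep (for example, from a scheduled job) instead of holding a loop open per document. `sweep_pending` is a hypothetical helper; `fetch` stands for any callable that takes a document ID and returns its latest extraction record, such as a wrapper around GET /data/extractions.

```python
from typing import Callable, Dict, Optional, Set


def sweep_pending(
    pending: Set[str], fetch: Callable[[str], Optional[dict]]
) -> Dict[str, dict]:
    """Check each pending document once; return those whose review finished."""
    finished = {}
    for doc_id in list(pending):  # copy, because the set is modified below
        extraction = fetch(doc_id)
        if extraction and extraction.get("is_reviewed"):
            finished[doc_id] = extraction["reviewed_results"]
            pending.discard(doc_id)
    return finished


# Hypothetical in-memory stand-in for the API, for illustration only
store = {
    "doc-1": {"is_reviewed": True, "reviewed_results": {"total_amount": 5}},
    "doc-2": {"is_reviewed": False},
}
pending = {"doc-1", "doc-2"}
done = sweep_pending(pending, store.get)
```

After the sweep, `done` holds the reviewed results for `doc-1` and `doc-2` remains in `pending` for the next run.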

Next Steps

Upload Documents

Detailed upload endpoint documentation

Extract Data

Complete extraction API reference

Polling Pattern

Robust polling implementation guide

Automation Patterns

Production-ready automation examples
